Transform-Domain Rate-Distortion Optimization Accelerator for H.264/AVC Video Encoding

Size: px

Start display at page:

Download "Transform-Domain Rate-Distortion Optimization Accelerator for H.264/AVC Video Encoding"

Judith Barbra Wilcox
5 years ago
Views:

1 Transorm-Domain Rate-Distortion Optimization Accelerator or H.64/AVC Video Encoding Mohammed Golam Sarwer, Lai Man Po, Kai Guo and Q.M. Jonathan Wu Abstract In H.64/AVC video encoding, rate-distortion optimization or mode selection plays a signiicant role to achieve outstanding perormance in compression eiciency and video quality. However, this mode selection process also makes the encoding process extremely complex, especially in the computation o the ratedistortion cost unction, which includes the computations o the sum o squared dierence (SSD) between the original and reconstructed image blocks and context-based entropy coding o the block. In this paper, a transorm-domain rate-distortion optimization accelerator based on ast SSD (FSSD) and VLC-based rate estimation algorithm is proposed. This algorithm could signiicantly simpliy the hardware architecture or the rate-distortion cost computation with only ignorable perormance degradation. An eicient hardware structure or implementing the proposed transorm-domain rate-distortion optimization accelerator is also proposed. Simulation results demonstrated that the proposed algorithm reduces about 47% o total encoding time with negligible degradation o coding perormance. The proposed method can be easily applied to many mobile video application areas such as a digital camera and a DMB (Digital Multimedia Broadcasting) phone. Keywords Context-adaptive variable length coding (CAVLC), H.64/AVC, rate-distortion optimization (RDO), sum o squared dierence (SSD). H I. INTRODUCTION.64/AVC is the newest hybrid video coding standard developed by ITU-T Video Coding Expert Group (VCEG) and ISO/IEC MPEG Video Group named Joint Video Group (JVT) [1, ]. H.64/AVC has proved its superiority in coding eiciency over its precedents, e.g., it shows a more than 40% rate reduction over H.63 [3]. The rate-distortion optimization (RDO) is one o the essential parts o the H.64/AVC encoder to achieve the much better coding perormance. However, the computational complexity o the RDO technique is extremely high. Hence, the cost unction Mohammed Golam Sarwer was with City University o Hong Kong, Hong Kong. He is now with the University o Windsor, Windsor, ON, Canada ( sarwer@uwindsor.ca). Lai Man Po and Kai guo are with the Department o Electronic Engineering, City University o Hong Kong, Kowloon, Hong Kong, ( eelmpo@cityu.edu.hk). Q. M. Jonathan Wu is with the Department o Electrical and Computer Engineering, University o Windsor, Windsor, ON, Canada( jwu@uwindsor.ca). computation makes H.64/AVC impossible to be realized in real time applications without high computing hardware. In order to accelerate H.64/AVC video coding, a number o researches have been made to explore ast algorithms in motion estimation [4-5] and mode decision [6-7]. To reduce complexity, a new cost unction or intra 4x4 mode decisions was proposed in [9]. In this cost unction, sum o absolute integer transorm dierence (SAITD) is used in distortion part and a rate prediction algorithm is used in rate part. A new transorm domain rate estimation and distortion measure was proposed to reduce the rate-distortion cost unction computation with ignorable coding perormance degradation [10]. But the computation reduction is low. In order to avoid the entropy coding method during mode decision process, several methods [11, 1, 13] are introduced to predict the bit rate o 4x4 block. To reduce the complexity o distortion part o RDO technique, a new ast sum o squared dierence (FSSD) computation algorithm with use o an iterative tablelookup quantization process was proposed in [14]. This FSSD algorithm was based on the theoretical equivalent o the SSDs in spatial and transorm domain. This approach avoided the inverse quantization/transorm and pixel reconstruction processes with nearby no rate-distortion perormance degradation. In this paper, a transorm-domain rate-distortion optimization accelerator is developed based on ast sum o squared dierence (FSSD) and rate estimation algorithm. A method to estimate the rate o context-adaptive variable length coding (CAVLC) is proposed. An eicient hardware structure or implementing the proposed transorm-domain RDO accelerator is also introduced. The remainder o this paper is organized as ollows. Section provides the review o rate distortion optimized mode decision technique. In section 3, transorm domain ast sum o squared dierence (FSSD) calculation is described. Section 4 introduces the ast rate estimation method. In section 5, algorithm o proposed rate distortion optimization accelerator is presented. Hardware architecture o proposed RDO accelerator is proposed in section 6. The simulation results o the proposed method are presented in section 7. Finally, section 8 concludes the paper. 38

2 II. RATE DISTORTION OPTIMIZED MODE DECISION OF H.64/AVC To take the ull advantages o all modes, the H.64/AVC encoder can determine the mode that meets the best RD tradeo using RDO mode decision scheme. The best mode is the one with minimum rate-distortion (RD) cost and this cost is deined as J RD = SSD( S, C) λ R (1) where λ is the Lagrange multiplier, R is the number o coded bits or each macroblock which can be written as ollows: R = Rheader Rmotion Rres () where R header, R motion and Rres means the number o bits need to encode the header inormation, motion vectors and quantized residual block, respectively. The SSD(S,C) in (1) is the sum o the squared dierence (SSD) between the original blocks S and the reconstructed block C, and it can be expressed as N 1 N 1 SSD(, ) = S C ( S C ) (3) i= 0 j= 0 where S and C are the (i, j)th elements o the current original block S and the reconstructed block C, respectively. In order to compute RD cost or each mode, same operation o orward and inverse transorm/quantization and entropy coding is repetitively perormed. All o these processing explains the high complexity o RD cost computation. International Journal o Signal Processing 5:3 009 From (4), it is shown that to calculate the SSD we need 16 multipliers and 16 square operators. Actually the q is the constant value. The value o q at dierent pixel position is given in Table 1. In the irst region o Table 1 multiplication by q can easily by implemented by shiting right by two bits. To avoid the multiplication operation in second region q is 1 1 approximated as. Hence multiplication in this region 10 3 is replaced by shiting right by three bits. Similarly, or region III, q is approximated as ( ) From simulations, it is shown that these approximations do not signiicantly aect the SSD calculation as well as rate distortion perormance. It should be noted that no computation is necessary or shiting operator. In order to avoid the square operation, we have used a square table. It should be noted again that is the coeicients beore quantization and ˆ is the coeicients ater inverse quantization. So the sign o and ˆ is always same. Additionally, the dierences between size and. Mathematically, ˆ must be smaller than quantization step ˆ ˆ < < q III. FAST SUM OF ABSOLUTE DIFFERENCE CALCULATION The cause o dierence between original block and reconstructed block is due to the quantization error. Hence the spatial domain SSD and transorm domain SSD is theoretically equivalent [14]. Mathematically, where, block, N = 1 N 1 SSD ( S, C) ( q [ ˆ ]) (4) i= 0 j= 0 is the (i,j)th element o integer cosine transormed ˆ is the (i,j)th element o inverse quantized transorm block and q is the (i,j)th element o scaled transorm matrix. The relationship between deined as [14] ˆ 1 Q ( Q( )) = round( and = ˆ is = / ) Z (5) where = / q represents the new quantization step size deine in [14] and is the quantization step size o original H.64 encoder. In order to avoid the multiplication and division operation in (5), we proposed an iterative tablelookup quantization process in [14]. ˆ q ( ) < (6) TABLE I q AT DIFFERENT POSITION Region Position (i, j) I (0,0),(0,),(,0),(,) II (1,1),(1,3),(3,1),(3,3) III Others For example, the value o is 10, 16, 6, 40 and 64 or QP is 4, 8, 3, 36, and 40, respectively. I we choose QP=4, the let hand side o (6) is always less that 10. Additionally, the values o let hand sides are always integer. I we store the squares o all integer values which are less than 10 (or QP=4), then 16 square operation can be saved. In this way, we need additional memory requirement. But the size o memory is not large. We have to store only 10 elements in case o QP=4. Thus, the computational intensive multiplication and square operations can be avoided. q 1 a = 4 b = ab 1 = 4 5

3 IV. FAST RATE ESTIMATION Context adaptive variable length coding (CAVLC) uses ive syntax elements to encode the 4x4 block. To estimate the bits or quantized transorm coeicients, we estimate the number o bits or each o ive dierent types o symbols o CAVLC separately. A. Coeicient Token Both the total number o nonzero coeicients (Ncoe) and the number o trailing /- 1s ( T L ) are coded as combined event. For all elements, except chroma DC, a choice between three tables and one ixed length codeword is made [15]. N T is a parameter used or table selection. Assume Nu and N L is the number o non-zero coeicients o upper and let block, respectively. I both upper and let blocks are available then N T = ( Nu N L )/. I only upper block available N T =Nu; i only let block available N T =N L ; i neither is available N T =0. Since to encode coeicient token, three dierent VLC tables are used, it needs more memory to store all the possible symbols. This is not eicient or hardware implementation in terms o memory requirement. It is well known that neighboring blocks are highly spatially correlated. Hence there is high correlation between parameter N T and number o nonzero coeicients o current block (Ncoe). Thereore, we can replace N T by Ncoe or selection o VLC tables. The algorithm or table selection o proposed method given below: i ( 0<= N coe <): Num-VLC0 i ( <= N coe <4): Num-VLC1 i ( 4<= N coe <=7): Num-VLC i (N coe >7): 6 bit ixed length codeword TABLE II PROBABILITY OF WRONG TABLE SELECTION Sequence QP Akiyo Claire Foreman Container Stean Miss_America Silent Salesman In order to justiy the above algorithm, we have done several experiments and calculated the probability o wrong table selection o dierent video sequences with dierent QP values. Table shows the probability o wrong table selection. From Table, we have seen that this probability is very low or low motion sequences such as Akiyo, Claire, and Miss America. Although this probability or medium and high motion sequences is large, it does not greatly inluence the rate-distortion perormance. This is because number o bits or a particular coeicient token is almost similar in all VLC tables. Let us assume a symbol with number o non-zero coeicient is 4 and number o trailing ones is 3. The number o bits using VLC tables 0 is 6, while the number o bits is 4 or both VLC tables 1 and [15]. Similarly when the number o non-zero coeicients is 3 and number o trailing one is 0, International Journal o Signal Processing 5: number o bits to encode this symbol are 9, 7and 6 or VLC tables 0, 1 and, respectively. By combing the mentioned algorithm with VLC tables, we have developed a rate table to encode the coeicient token. Rate table is generated by collecting the length o codeword rom the dierent tables. For example, when Ncoe is either 0 or 1, data is collected rom VLC table0 and while the value o Ncoe is either or 3, data is collected rom VLC table 1. From observation o VLC tables, we have seen that the rate is exactly equal to 6 while number o non-zero coeicient o current block (Ncoe ) is greater than 7. Otherwise, rate o 4x4 block ( Rcoe) is estimated based on a new rate table deined in Table 3. Thereore, algorithm or proposed rate estimation or coeicient token o 4x4 block is given as ollows: i (Number o nonzero coeicients> 7) R coe =6; else search rate rom Table 3; TABLE III RATE TABLE FOR COEFFICIENT TOKEN OF 4X4 BLOCK Trailing ones ( T L) Total Coeicients ( N Coe) B. Sign o Trailing ones One bit is used to signal sign inormation o trailing /- 1s. 0 is used or positive and 1 is used or negative. So bit consumption to encode the sign o trailing ones ( R trail1 ) is exactly equal to the number o trailing ones (N L ). C. Encode the Level CAVLC makes the choice o the VLC look-up table or the level adaptive in a way where the choice depends on the recently coded levels [16]. One out o 7 level VLC tables is used to encode the level inormation [15]. Level VLC0 has its own structure while the other tables, level VLCN, N=1 to 6, share common structure. From the observation o VLC tables, we have seen that bit consumption to encode a coeicient is increasing with increasing the absolute value o that coeicient. Additionally, the bit consumption also depends on the absolute value o previously coded level coeicient. From the observation, rate or level is estimated as ollows R = L level( i) i or highest requency component (7.a) α β R or other component (7.b) level( i) = 1 Li 1 Li 1 α1 β1 where L i-1 is the recently encoded level o i th level. In order to set the weighting actors, we have done several experiments or dierent video sequences (Akiyo, Foreman, Stean, and Mobile) with QCIF ormat at dierent QP values. We have observed RD perormance o these video sequences at

4 dierent combinations o weighting actor. Better RD perormance was ound at α 1 = 3 and β 1 = 1. D. Encode Total Zeros The codeword Total_zeros is the number o zeros between the last non-zero coeicient o the zig-zag scan and its start. In case o mode decision process, only number o bits to encode the symbol is necessary. I only the length o codeword (instead o codeword) is retrieved rom the total_zero table, then we can avoid the extra calculation or codeword generation o CAVLC during mode decision process. E. Encode Runs Since ater transormation and quantization, blocks typically contain mostly zeros, CAVLC algorithm uses run-level coding to represent strings o zeros compactly. Run means the number o preceding zeros beore each non-zero coeicient and Zeros_let called the number o zeros let beore each non-zero coeicient. Based on Run and Zeros_let, a run VLC tables is used to encode this element [15]. From observation o run VLC table, it is shown that no o bits to encode this coeicient is increasing with the value o run. The estimated rate or run is proposed as R run( i) = α Runi β (8) Here α and β are the weighting actors. From observation o run table, we have seen that value o α and β are depends on the Run. Table 4 shows the value o α and β with dierent value o run. TABLE IV VALUE OF α AND β R run α β 0,1,and 1 1 3,4,5,and >6 1-3 International Journal o Signal Processing 5:3 009 Mathematically, P Rres( est) = Rcoe Rtrail1 Rlevel( i) Rzero Rrun( i) i= 1 i= 1 (9) where P and Q are the number o level and run, respectively. Fig. 1 shows the comparison o our proposed method with actual rate or the Foreman sequence. Data is collected rom the irst 100 4x4 block. QP actor is set as 8. It is shown that the proposed method is very closely matched with actual rate. V. RATE-DISTORTION OPTIMIZATION ACCELERATOR The overall rate-distortion cost computation process using the proposed method combined with FSSD [14] can be summarized as: Step 1: Compute the predicted block using inter or intra rame prediction: P Step : Compute the residual (dierence) block: D = S P Step 3: ICT transorm the residual block: F = ICT (D) Step 4: or i=0 to N-1 and j=0 to N-1 determine the quantized coeicients Z by the ollowing iterative tablelookup quantization process: Step I: Set k=0 and i < 0 then set sign=-1, otherwise sign=1. Step II: I k, then k=k1 and goto Step II; Step III: Set Z = sign k and ˆ = sign k. ; Step 5: Calculate SSD; Step 6: Estimate the number o bits R e = Rheader Rmotion Rres(est) Step 7: Calculate the R-D cost : J RD = SSD R e Q Actual Rate Estimated Rate Fig. 1 Estimated and the actual rates o Foreman (X-axis: Block number, Y-axis: number o bits) Thereore, the number o bits to encode the 4x4 block is calculated by summing the estimated bits o all symbols. Fig. Block diagram o proposed RDO The block diagram o proposed RDO accelerator is shown in Fig.. The major dierence is that the conventional quantization process is replaced by the iterative table-lookup quantization and the SSD calculation is preormed in the ICT transorm domain. Additionally, entropy coding technique is replaced by ast rate estimation algorithm. We can obtain the rate-distortion cost without computations o two scale transorms, quantization, inverse quantization, inverse ICT transorm and entropy coding. 41

5 ˆ 00 ˆ 01 ˆ Fig. 3. Hardware architecture or iterative table-lookup quantization process VI. HARDWARE ARCHITECTURE OF PROPOSED RDO ACCELERATOR Fig. 3 describes a possible hardware structure or the iterative table-lookup quantization implementation [14]. As each coeicient does not depend on the others, the quantization process can be perormed in parallel. In the other words, we can quantize all the coeicients o simultaneously. At irst, the quantization points and boundary points are generated by quantization table generator at the beginning o encoding process and then stored in the RAM, in which there is a pointer that points to the current boundary point. Then the current boundary point and quantization point are loaded in a register, which will compare with ater the sign o judges that is abstracted by a sign selector. I the comparator is larger, then Loop signal takes eect which tells counter to add one and the pointer to shit orward and point to the next boundary point. When is smaller than the current boundary point, Out signal takes eect which resets the pointer to the initial position, asks the counter to output the accumulated value z and asks the register to output the current quantization value, ˆ. From the structure in Fig. 3, we can ind that the proposed table-lookup quantization process is very suitable or hardware implementation, since it does not require complicated operation modules and can execute in the parallel mode. Fig. 4 shows the architecture o SSD calculator. It is shown that 16 multiplication operations are replaced by 4 shiting operators. It should be noted that no computation is necessary or shiting operator. In order to avoid the square operator, we have used a square table. In the square table, square o all integers within quantization step size are stored. In this way, we need additional memory requirement. But the size o memory (ROM) is not large because o the limited value o. For example encoder has to store 64 values i the quantization parameter QP is 40. Architecture o shiter or region III is given in Fig. 5. Here multiplier is replaced by two shit right operators. 4

6 00 ˆ 00 0 ˆ 0 0 ˆ 0 ˆ 11 ˆ ˆ 13 ˆ ˆ 01 ˆ ˆ ˆ 10 1 ˆ 1 1 ˆ 1 3 ˆ 3 30 ˆ 30 3 ˆ >> >> >> >> >>3 >>3 >>3 >>3 ROM (Square Table) SSD Fig. 4. Architecture o SSD calculator >> 3 >> 5 Fig. 5 Architecture o shiter or region III The hardware architecture or proposed ast rate estimation algorithm is shown in Fig. 6. The proposed architecture contains a number o counters and register to store the inormation or the block. Non-zero coeicients counter is used to store the number o non-zero coeicients ( N coe ). Trailing one counter is used to store the number o trailing /- 1s. Total zero counter is used to store the total number o zeros beore the last non-zero coeicient. Run beore register is used to store the number o zeros preceding each non-zero coeicient. The level detector checks i the coeicient is either level or not. The proposed architecture begins by reading the coeicients rom the input buer in reverse zig-zag order. In each cycle, it reads one coeicient rom the input buer analyzes the coeicient and updates the inormation stored in the related counter and register. Reverse zig-zag scanning enables us to determine the necessary inormation or encoding 4x4 block. It takes 16 cycles to read the coeicients and stored the corresponding inormation in the counter and register ile or each 4x4 block. Based on the number o non-zero coeicients, comparator 1 select either the rate table or coeicient token (ROM1) should be used or not. Comparator 1 compares the output o non-zero coeicient counter with content o register 4. Register 4 always store a ixed value 7. Comparator1 sent a signal 1 when the number o non-zero coeicients is greater than the ixed value 7. Then the register 1 is enabled. I comparator 1 sent the signal 0, rate table or coeicient token is activated. In this case, rate is ound by searching the rate table based on the number o non-zero coeicients and number o trailing ones. Similarly number o bits to encode the total zero is calculated by rate table or total zeros (ROM). The level detector checks i the coeicient is zero or not. I the level is non-zero, it will be stored into the level-irst-in-irst-out buer (FIFO), and the corresponding run inormation will be also stored to the run-fifo. Based on the content o FIFO, level and run rate estimators predict the rate o level and run, respectively. 43

7 Register 4 Load: 7 Register 1 Load: 6 Zig-zag scan address generator Non-zero coeicient counter Trailing one counter Comparator 1 Rate table or coeicient token (ROM1) M U X Input register Total-zero counter Run beore register FIFO Rate-table or total zero (ROM) Run rate estimator A D D E R Estimated rate Level Detector FIFO Level rate estimator Fig. 6. Hardware structure o proposed rate estimator VII. SIMULATION RESULTS To evaluate the perormance o the proposed method, JM 9.6 [8] reerence sotware is used in simulation. Dierent types o video sequences are used as test materials. A group o experiments were carried out on the test sequences with dierent quantization parameters. All the simulations are conducted under Windows Vista operating system, with Pentium 4. G CPU and 1 G RAM. The conditions o the simulation are listed in Table 5. The comparison results are produced and tabulated based on the average dierence o total encoding time ( T %), the average PSNR dierences ( PSNR ), and the average bit rate dierences ( Rate %). PSNR and bit rate dierences are calculated according to the numerical averages between RD curves derived rom the original and proposed algorithm, respectively. The detail procedure to calculate these dierences can be ound in [17]. In order to evaluate the complexity reduction, T (%) is deined as ollows Toriginal T proposed T % = 100% (10) Toriginal where, T original denotes the total encoding time o the JM 9.6 encoder and T proposed is total encoding time o the encoder with proposed method. TABLE V SIMULATION CONDITIONS GOP Structure IPPP Quantization Parameters 4/8/3/36 Number o Reerence Frame 1 Hadamard Transorm OFF Search range 16 Symbol mode CAVLC/CABAC RDO optimization ON Frame rate 30 ps Number o rames 50 Fast Motion estimation Enabled A. Results with CAVLC entropy coding method In this experiment, ater selecting the best mode by proposed technique, CAVLC is used or inal entropy coding o best mode. The rate-distortion perormance comparison o the conventional RDO and proposed RDO in terms o PSNR, bitrate, and encoding time are shown in Tables 6. When only rate estimation is used, the degradation o RD perormance is very negligible. We have seen that while only rate estimation is used, the average PSNR degradation is db and the average bit rate increment is 0.615%. When both two algorithms are used, the rate-distortion perormance dierences are always less than. % and 0.07 db in terms o bit rate and PSNR, respectively. Fig. 7 shows the comparison o rate-distortion curves o proposed accelerator with original RD optimization method. The perormance o proposed 44

8 TABLE VI EXPERIMENTAL OF PROPOSED METHOD WHILE CAVLC IS USED AS ENTROPY CODING Sequence only FSSD Rate Estimation PSNR Rate% T % PSNR Rate% T % Foreman ( CIF) Flower ( CIF) Bus (CIF) Akiyo (QCIF) Mobile (QCIF) Salesman (QCIF) Stean ( QCIF) Average TABLE VII EXPERIMENTAL OF PROPOSED METHOD WHILE CABAC IS USED AS ENTROPY CODING Sequence only FSSD Rate Estimation PSNR Rate% T % PSNR Rate% T % Foreman ( CIF) Flower ( CIF) Bus (CIF) Akiyo (QCIF) Mobile (QCIF) Salesman (QCIF) Stean ( QCIF) Average Sequence QP PSNR in db TABLE VIII COMPARISON WITH OTHER RATE ESTIMATION METHODS Method in [13] Method in [11] Method in [9] Bit rate (%) Complexity Reducti on (%) PSNR in db Bit rate (%) Complexity Reduction (%) PSNR in db Bit rate (%) Complexity Reduction (%) Flower (CIF) Bus (CIF) Foreman (CIF) Average method is similar with actual RD curves. As shown in Table 6, the proposed accelerator can reduce the encoding time by around 47% (on average), in which about 14% reduction is achieved by the FSSD and about % reduction is achieved by rate estimation. B. Results with CABAC entropy coding method Although the proposed rate-distortion accelerator is developed based on CAVLC, it is also suitable during mode decision when Content Adaptive Binary Arithmetic Coding (CABAC) is used as entropy coding method. This is because the proposed algorithm is used only or RD optimized mode selection process. Ater selection o best mode, true entropy coding is used. As shown in Table 7, PSNR reduction and bit rate increment are about 0.08 db and.5% on average, respectively. Fig. 8 shows the comparison o RD optimized curves. As shown in Fig. 8, the rate distortion perormance o proposed method is much closed with actual RD optimized curve. C. Comparison with other methods In this experiment, the proposed rate estimation method is compared with other rate estimation methods. Only rate estimation method is enabled and CAVLC is used as inal entropy coding method. Complexity reduction is calculated by (11). Tre T proposed Complexity reduction (%) = 100% (11) Tre 45

9 where, T re is the total encoding time o the other conventional method and T proposed is the total encoding time with proposed rate estimation method only. International Journal o Signal Processing 5: FSSDRate Estimation (a) Akiyo ( QCIF) 34 FSSDRate Estimation (b) Foreman (CIF) 9 FSSDRate Estimation FSSDRate Estimation (c) Mobile (QCIF) (d) Flower (CIF) FSSDRate Estimation FSSDRate Estimation (e) Stean (QCIF) () Bus (CIF) Fig.7. RD perormance o proposed accelerator with CAVLC ( X-axis: Bit rate in kbps, Y-axis: PSNR in db) 46

10 FSSDRate Estimation (a) Akiyo ( QCIF) 34 FSSDRate Estimation (b) Foreman (CIF) 9 FSSDRate Estimation (c) Mobile (QCIF) FSSDRate Estimation (d) Flower (CIF) FSSDRate Estimation FSSDRate Estimation (e) Stean (QCIF) () Bus (CIF) Fig. 8. RD perormance o proposed accelerator with CABAC ( X-axis: Bit rate in kbps, Y-axis: PSNR in db) 47

11 Comparison results are presented in Table 8. Positive values or the PSNR and bit rate dierences indicate increments, and negative values indicate decrements. Negative values o complexity reductions indicate the increment o complexity o proposed method. We have seen that the RD perormance o proposed method is always better than the other methods mentioned in Table 8. As compared to [11], the proposed method increases the video quality by 0.09 db and reduces the bit rate o 1.05% on average with more similar complexity reduction. With compared to [13], the proposed algorithm achieves a better complexity saving which is about 5% and improves the RD perormance slightly. The proposed method achieves more PSNR increment (about 0.10dB) and bit rate reduction (about 3.10%) o [9] with the expanse o a little computational load. VIII. CONCLUSION In this paper, ast rate-distortion optimization method or mode decision o H.64/AVC is proposed. FSSD algorithm avoids conventional quantization, inverse quantization and inverse cosine transorm. Additionally, entropy coding is replaced by ast rate estimation method. The proposed rate distortion accelerator can be used or both CAVLC and CABAC in H.64 encoding. O course, the accuracy or CAVLC should be better but our simulation results show that the perormance or CABAC with use o proposed accelerator in mode selection is similar to ull RD cost encoding using CABAC. The proposed technique reduces total encoding time by about 47 % on average with negligible degradation o coding perormance. The proposed method is suitable or many video application areas such as a digital camera and a mobile phone. ACKNOWLEDGMENT This work was substantially supported by a CERG grant rom Hong Kong Government, Hong Kong SAR, China. [Project No ]. REFERENCES [1] ISO/IEC , Inormation Technology-Coding o audio-visual objects-part:10: Advanced Video Coding. ISO/IEC JTC1/SC9/WG11 (004) [] Weigand T., Sullivan, G. Bjontegaard, and G.,Luthra, A., Overview o H.64/AVC Video Coding Standard, IEEE Trans. Circuits and Syst. Video Technol., vol. 13, pp , July 003. [3] T.Wiegand, H.Schwarz, A.Joch, F. Kossentini, and G.J. Sullivan, Rateconstrained coder control and comparison o video coding standards, IEEE Trans. Circuits and Syst. Video Technol., vol 13, no. 7, pp , July 003 [4] Zhibo Chen, Peng Zhou, Yun He, Fast Integer Pel and Fractional Pel Motion Estimation or JVT, Joint Video Team (JVT) Docs, JVT-F017, Dec. 00. [5] Yang L., Yu K., Li J., Li S. An eective variable block size early termination algorithm or H.64 video coding, IEEE Trans. Circuits and Syst. Video Technol., vol. 15, no. 6, pp , June 005. [6] Feng Pan, Xiao Lin, Susanto Rahardja, Keng Pang Lim, Z.G. Li, Dajun Wu, and Si WU, Fast Mode Decision Algorithm or Intra-prediction in H.64/AVC Video Coding, IEEE Trans. Circuits and Syst. Video Technol., vol. 15, no. 7, pp-813-8, July 005. [7] K.P.Lim, S. Wu, D. J. Wu, S. Rahardja, X.Lin, F.Pan, and Z.G.Li, Fast inter mode selection, Joint Video Team(JVT), Doc. JVT-I00, Sep International Journal o Signal Processing 5: [8] Joint Video Team (JVT) reerence sotware version 9.6, [9] C.H. Tseng, H.M. Wang, and J. F. Yang, Enhanced Intra 4x4 Mode Decision or H.64/AVC Coders, IEEE Trans. Circuits and Syst. Video Technol., vol. 16, no. 8, pp , August 006. [10] Yu-Kuang Tu, Jar-Ferr Yang and Ming-Ting Sun, Eicient Rate- Distortion Estimation or H.64/AVC Coders, IEEE Trans. Circuits and Syst. Video Technol., vol, 16, pp , May 006. [11] Mohammed Golam Sarwer, and Lai Man Po, Fast bit rate estimation or mode decision o H.64/AVC, IEEE Trans. Circuits and Syst. Video Technol., vol, 17, no. 10, October 007, pp [1] Q. Chen and Y. He, A ast bit estimation method or rate distortion optimization in H.64/AVC, in Proceeding o Picture Coding Symp., 004. [13] Zhenyu Wei and King Ngi Ngan, A ast rate-distortion optimization algorithm or H.64/AVC, IEEE ICASSP 007, pp. I-1157-I [14] Lai-Man Po and Kai Guo, Transorm-Domain ast sum o the squared dierence computation or H.64/AVC Rate-Distortion Optimization, IEEE Trans. Circuits and Syst. Video Technol., vol, 17, no 66, pp , June 007. [15] ITU-T Rec. H.64/ISO/IEC , Advanced Video Coding, Final Committee Drat, Documents JVT-E0, September 00. [16] Iain E. G. Richardson, H.64 and MPEG-4 Video Compression video coding or next generation multimedia, John Wily & Sons, pp , 003. [17] G. Bjontegaard, Calculation o average PSNR dierences between RDcurves, presented at the 13 th VCEG-M Meeting, Austin, TX, April 001. Mohammed Golam Sarwer was born in Comilla, Bangladesh in He received B.Sc. and M.Phil degree both in electronic engineering rom Khulna University o Engineering and Technology and City University o Hong Kong in 001 and 007, respectively. Mr. Sarwer was working as a ull time aculty member in Khulna University o Engineering and Technology rom 001 to 005. Currently, he is PhD candidate in the department o electrical and computer engineering, University o Windsor, Canada. He is the member o International Association o Engineers (IAENG). In 003, he won the prime minister gold medal due to outstanding academic perormance o undergraduate level Lai-Man Po (M 9) received the B.Sc. and Ph.D. degrees, both in electronic Engineering, rom the City University o Hong Kong in 1988 and 1991, respectively. Since November 1991, he has been with the Department o Electronic Engineering, City University o Hong Kong, and is currently Associate Proessor. His research interests are in vector quantization, motion estimation and image/video compression. Dr. Po was awarded First Prize (1988) in the Paper Contest or Students and Non-corporate Members organized by the Institute o Electronics and Radio Engineers o Hong Kong, and a Postgraduate Fellowship o the Sir Edward Youde Memorial Council (1988 to 1991). Q.M. Jonathan Wu (M 9) received his Ph.D. degree in electrical engineering rom the University o Wales, Swansea, U.K., in From 1995, he worked at the National Research Council o Canada (NRC) or 10 years where he became a senior research oicer and group leader. He is currently a Proessor in the Department o Electrical and Computer Engineering at the University o Windsor, Canada. Dr. Wu holds the Tier 1 Canada Research Chair (CRC) in Automotive Sensors and Sensing Systems. He has published more than 100 peer-reviewed papers in areas o computer vision, image processing, intelligent systems, robotics, micro-sensors and actuators, and integrated micro-systems. His current research interests include 3D image analysis, active video object tracking and extraction, vision-guided robotics, sensor analysis and usion, wireless sensor network, and integrated micro-systems. Dr. Wu is an Associate Editor or IEEE Transaction on Systems, Man, and Cybernetics (part A). Dr. Wu has served on the Technical Program Committees and International Advisory Committees or many prestigious conerences.

THE newest video coding standard is known as H.264/AVC

THE newest video coding standard is known as H.264/AVC IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 17, NO. 6, JUNE 2007 765 Transform-Domain Fast Sum of the Squared Difference Computation for H.264/AVC Rate-Distortion Optimization