VIDEO DISTORTION ANALYSIS AND SYSTEM DESIGN FOR WIRELESS VIDEO COMMUNICATION


VIDEO DISTORTION ANALYSIS AND SYSTEM DESIGN FOR WIRELESS VIDEO COMMUNICATION

By

ZHIFENG CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Zhifeng Chen

I dedicate this dissertation to my father.

ACKNOWLEDGMENTS

First and foremost, I would like to express my deepest gratitude to my advisor Prof. Dapeng Wu for his guidance and help in the development of my research. This work would not have been possible without his enlightening instruction, constructive advice, and willingness to provide funding. His extensive knowledge, strong analytical skills, and commitment to the excellence of research are truly treasures to his students. I would also like to thank Prof. John Harris, Prof. Tao Li, and Prof. Shigang Chen for serving on my dissertation committee and providing valuable suggestions on this dissertation. I have been fortunate to be a student of Prof. John M. Shea, who is one of the best teachers that I have had in my life. His deep knowledge, responsible attitude and impressive kindness have helped me to develop fundamental and essential academic competence. I am indebted to Taoran Lu for her explanation of my questions when I first encountered challenges in studying signal processing. I gratefully acknowledge the help of Xiaochen Li in my understanding of communication theory. I especially thank Jun Xu for his valuable discussions when I began my research on video coding. My work also owes much to Qian Chen for her help with correct grammar, which improves the presentation of this dissertation. I would like to take this opportunity to thank Xihua Dong, Qin Chen, Lei Yang, Bing Han, Wenxing Ye, Zongrui Ding, Yakun Hu, and Jiangping Wang for many fruitful discussions related to this work. I wish to express my special appreciation to Peshala Pahalawatta and Alexis Michael Tourapis for their help in solving my questions about the H.264/AVC JM reference software and assisting me with more rigorous expression of many ideas in this work. Last but not least, I need to express my warmest thanks to my parents and my wife for their continued encouragement and support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Problem Statement
        Theoretical Background
        Challenges in the Practical System
    Contributions of This Dissertation
    Structure of the Dissertation

2 PREDICTION OF TRANSMISSION DISTORTION FOR WIRELESS VIDEO COMMUNICATION: ANALYSIS
    Background on Transmission Distortion Prediction
    System Description
        Structure of a Wireless Video Communication System
        Clipping Noise
        Definition of Transmission Distortion
        Limitations of the Existing Transmission Distortion Models
    Transmission Distortion Formulae
        Overview of the Approach to Analyzing PTD and FTD
        Analysis of Distortion Caused by RCE
            Pixel-level distortion caused by RCE
            Frame-level distortion caused by RCE
        Analysis of Distortion Caused by MVCE
            Pixel-level distortion caused by MVCE
            Frame-level distortion caused by MVCE
        Analysis of Distortion Caused by Propagated Error Plus Clipping Noise
            Pixel-level distortion caused by propagated error plus clipping noise
            Frame-level distortion caused by propagated error plus clipping noise
        Analysis of Correlation Caused Distortion
            Pixel-level correlation caused distortion
            Frame-level correlation caused distortion

        2.3.6 Summary
            Pixel-level transmission distortion
            Frame-level transmission distortion
    Relationship between Theorem 2.2 and Existing Transmission Distortion Models
        Case 1: Only the (k-1)-th Frame Has Error, and the Subsequent Frames Are All Correctly Received
        Case 2: Burst Errors in Consecutive Frames
        Case 3: Modeling Transmission Distortion as an Output of an LTI System with PEP as Input
    PTD and FTD under Multi-Reference Prediction
        Pixel-level Distortion under Multi-Reference Prediction
        Frame-level Distortion under Multi-Reference Prediction

3 PREDICTION OF TRANSMISSION DISTORTION FOR WIRELESS VIDEO COMMUNICATION: ALGORITHM AND APPLICATION
    A Literature Review on Estimation Algorithms of Transmission Distortion
    Algorithms for Estimating FTD
        FTD Estimation without Feedback Acknowledgement
            Estimation of residual caused distortion
            Estimation of MV caused distortion
            Estimation of propagation and clipping caused distortion
            Estimation of correlation-caused distortion
            Summary
        FTD Estimation with Feedback Acknowledgement
    Pixel-level Transmission Distortion Estimation Algorithm
        Estimation of PTD
        Calculation of Ê[ζ̃^k]
        Calculation of Ê[ζ̃^{k-j}_{u+mv} | {r, m}] and D̂_u^k(P)
        Summary
    Pixel-level End-to-end Distortion Estimation Algorithm
    Applying RMPC-PEED Algorithm to H.264 Prediction Mode Decision
        Rate-distortion Optimized Prediction Mode Decision
        Complexity of RMPC-MS, ROPE, and LLN Algorithms
            RMPC-MS algorithm
            ROPE algorithm
            LLN algorithm
    Experimental Results
        Estimation Accuracy and Robustness
            Experiment setup
            Estimation accuracy of different estimation algorithms
            Robustness of different estimation algorithms
        R-D Performance of Mode Decision Algorithms
            Experiment setup

            R-D performance under no interpolation filter and no deblocking filter
            R-D performance with interpolation filter and deblocking filter

4 THE EXTENDED RMPC ALGORITHM FOR ERROR RESILIENT RATE DISTORTION OPTIMIZED MODE DECISION
    An Overview on Subpixel-level End-to-end Distortion Estimation for a Practical Video Codec
    The Extended RMPC Algorithm for Mode Decision
        Subpixel-level Distortion Estimation
        A New Theorem for Calculating the Second Moment of a Weighted Sum of Correlated Random Variables
        The Extended RMPC Algorithm for Mode Decision
    Merits and Limitations of ERMPC Algorithm
        Merits
        Limitations
    Experimental Results
        Experiment Setup
        R-D Performance
        Subjective Performance
        Discussion
            Effect of clipping noise on the mode decision
            Effect of transmission errors on mode decision

5 RATE-DISTORTION OPTIMIZED CROSS-LAYER RATE CONTROL IN WIRELESS VIDEO COMMUNICATION
    A Literature Review on Rate Distortion Models in Wireless Video Communication Systems
    Problem Formulation
    Derivation of Bit Rate Function, Quantization Distortion Function and Transmission Distortion Function
        Derivation of Source Coding Bit Rate Function
            The entropy of quantized transform coefficients for an i.i.d. zero-mean Laplacian source under a uniform quantizer
            Improvement with run length model
            Practical consideration of the Laplacian assumption
            Improvement by considering the model inaccuracy
            Source coding bit rate estimation for the H.264 encoder
        Derivation of Quantization Distortion Function
        Derivation of Transmission Distortion Function
            Transmission distortion as a function of PEP
            PEP as a function of SNR, transmission rate, and channel coding rate in a fading channel

            Transmission distortion as a function of SNR, transmission rate, and channel coding rate in a fading channel
    Rate-Distortion Optimized Cross-layer Rate Control and Algorithm Design
        Optimization of Cross-layer Rate Control Problem
        Algorithm Design
    Experimental Results
        Model Accuracy
            Bit rate model
            Quantization distortion model
            PEP model
        Performance Comparison
            Experiment setup
            PSNR performance
            Subjective performance

6 CONCLUSION
    Summary of the Dissertation
    Future Work

APPENDIX

A PROOFS IN CHAPTER 2
    A.1 Proof of Lemma
    A.2 Proof of Proposition
    A.3 Proof of Lemma
    A.4 Proof of Lemma
    A.5 Proof of Lemma
    A.6 Lemma 5 and Its Proof
    A.7 Lemma 6 and Its Proof
    A.8 Proof of Corollary

B PROOFS IN CHAPTER 3
    B.1 Proof of Proposition
    B.2 Proof of Theorem
    B.3 Proof of Proposition

C PROOFS IN CHAPTER 4

D PROOFS IN CHAPTER 5
    D.1 Proof of Equation (5-5)
    D.2 Calculation of Entropy for Different Quantized Transform Coefficients
    D.3 Proof of Proposition

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1  Notations
2-2  An example that shows the effect of clipping noise on transmission distortion
     Complexity comparison
     Average PSNR gain (in dB) of RMPC-MS over ROPE and LLN
     Average PSNR gain (in dB) of RMPC-MS over ROPE and LLN under interpolation filtering
     Average PSNR gain (in dB) of ERMPC over RMPC, LLN and ROPE
     RCPC encoder parameters

LIST OF FIGURES

1-1  Theoretical system model
1-2  Theoretical system model with separate source coding and channel coding
1-3  Practical system model of a wireless video communication system
2-1  System structure, where T, Q, Q⁻¹, and T⁻¹ denote transform, quantization, inverse quantization, and inverse transform, respectively
     The effect of clipping noise on distortion propagation
     Temporal correlation between the residuals in one trajectory
     Temporal correlation matrix between residual and MVCE in one trajectory
     Temporal correlation matrix between MVCEs in one trajectory
     Comparison between measured and estimated correlation coefficients
     Transmission distortion D^k vs. frame index k for foreman: (a) good channel, (b) poor channel
     Transmission distortion D^k vs. frame index k for stefan: (a) good channel, (b) poor channel
     Transmission distortion D^k vs. PEP for foreman
     Transmission distortion D^k vs. PEP for stefan
     Transmission distortion D^k vs. frame index k for foreman under imperfect knowledge of PEP: (a) good channel, (b) poor channel
     Transmission distortion D^k vs. frame index k for stefan under imperfect knowledge of PEP: (a) good channel, (b) poor channel
     PSNR vs. bit rate for foreman, with no interpolation filter and no deblocking filter: (a) PEP=2%, (b) PEP=5%
     PSNR vs. bit rate for football, with no interpolation filter and no deblocking filter: (a) PEP=2%, (b) PEP=5%
     PSNR vs. bit rate for foreman, with interpolation and no deblocking: (a) PEP=2%, (b) PEP=5%
     PSNR vs. bit rate for football, with interpolation and no deblocking: (a) PEP=2%, (b) PEP=5%

3-11 PSNR vs. bit rate for foreman, with interpolation and deblocking: (a) PEP=2%, (b) PEP=5%
     PSNR vs. bit rate for football, with interpolation and deblocking: (a) PEP=2%, (b) PEP=5%
     PSNR vs. bit rate for foreman: (a) PEP=0.5%, (b) PEP=2%
     PSNR vs. bit rate for mobile: (a) PEP=0.5%, (b) PEP=2%
     (a) ERMPC at the 84-th frame, (b) RMPC at the 84-th frame, (c) LLN at the 84-th frame, (d) ROPE at the 84-th frame, (e) ERMPC at the 99-th frame, (f) RMPC at the 99-th frame, (g) LLN at the 99-th frame, (h) ROPE at the 99-th frame
     PSNR vs. bit rate for foreman: (a) PEP=0.5%, (b) PEP=2%
     PSNR vs. bit rate for mobile: (a) PEP=0.5%, (b) PEP=2%
     Channel model
     Variance model
     bpp vs. frame index: (a) foreman, (b) mobile
     Quantization vs. frame index: (a) foreman, (b) mobile
     PEP under different RCPC coding rates
     PSNR vs. average SNR: (a) foreman, (b) mobile
     PSNR vs. bandwidth: (a) foreman, (b) mobile
     A random channel sample under average SNR=10dB and bit rate=1000kbps: (a) a random SNR sample, (b) distortion vs. frame index for foreman cif under this channel
     For the 10-th frame: (a) original, (b) CLRC, (c) proposed-constant-pep, (d) constant-pep-qp-limit; for the 11-th frame: (e) original, (f) CLRC, (g) proposed-constant-pep, (h) constant-pep-qp-limit
A-1  Comparison of Φ_2(x, y) and x

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

VIDEO DISTORTION ANALYSIS AND SYSTEM DESIGN FOR WIRELESS VIDEO COMMUNICATION

By

Zhifeng Chen

December 2010

Chair: Dapeng Wu
Major: Electrical and Computer Engineering

In this dissertation, we address the problem of minimizing the end-to-end distortion in wireless video communication. We first analytically derive transmission distortion as a function of video statistics, channel conditions and system parameters for wireless video communication systems. Then we design practical algorithms to estimate the system parameters and video statistics. Given the channel condition, we may accurately predict the instantaneous transmission distortion by our formulae and estimation algorithms. We also prove a new theorem to extend our algorithms to support rate-distortion optimized mode decision in practical video codecs. Finally, we derive a more accurate source bit rate model and quantization distortion model than existing parametric models. Our models help us to design a rate-distortion optimized cross-layer rate control algorithm for minimizing the end-to-end distortion under resource constraints in wireless video communication systems. Our results achieve remarkable performance gains over existing solutions.

CHAPTER 1
INTRODUCTION

1.1 Problem Statement

Both multimedia technology and mobile communications have experienced massive growth and commercial success in recent years. As these two technologies converge, wireless video, such as videophone calls and mobile TV in 3G/4G systems, is expected to achieve unprecedented growth and worldwide success. Therefore, how to improve the video quality reproduced at the video decoder in a wireless video communication system becomes a compelling problem.

1.1.1 Theoretical Background

A theoretical system model for video transmission over a wireless channel is shown in Fig. 1-1, where V^n is the input video sequence and Ṽ^n is the output video sequence after V^n passes through the wireless channel. The goal of the transmission is to convey as much of the video information at the input side to the output side as possible. However, 1) the video sequence is usually highly redundant, which wastes resources if it is transmitted without removing any redundancy; and 2) the source bit stream is usually not well distinguishable at the output side after passing through the channel, which causes serious distortion. Therefore, to convey the maximum amount of distinguishable video information while consuming the minimum amount of resources for the information to be transmitted from the transmitter to the receiver, we need 1) to compress the input using as few bits as possible, that is, source coding; and 2) to map the source bit stream into a bit stream that is better, in the bit-error sense, for transmission, that is, channel coding. Now the problems are: 1) what is the minimum amount of resources required for reliably transmitting the given source? 2) for the given channel, how much information at most can be reliably transmitted? and 3) what is the minimum distortion that may result if the information contained in the given source is more than the information the channel may convey?

Figure 1-1. Theoretical system model.

In 1948, Shannon published his seminal work "A Mathematical Theory of Communication" in the Bell System Technical Journal [1]. In this paper, Shannon mathematically defines the measure of information by entropy, which is expressed as the average number of bits needed for storage or communication. In this seminal work the answers are given, for the first time, to the first two aforementioned questions, that is, 1) the minimum number of bits required for reliably transmitting the given source is its entropy; and 2) the maximum number of bits that can be reliably transmitted over the given channel is the channel capacity. Although not rigorously proved, the answer to the third question is also presented in Ref. [1] (his Theorem 21). That is, 3) the minimum distortion for the given source and channel is the minimum distortion achieved by lossy source coding under the condition that the encoded source rate is less than the channel capacity. In 1959, Shannon published another famous work "Coding Theorems for a Discrete Source With a Fidelity Criterion" [2], where the rate-distortion function is first coined and the greatest lower bound of rate for a given distortion is proved. The joint source channel coding theorem proves that the optimal performance can be achieved by the source channel separation theorem as stated in Ref. [3]: the source channel separation theorem shows that we can design the source code and the channel code separately and combine the results to achieve optimal performance.
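The notion of entropy as "the average number of bits needed" can be made concrete with a toy example. The following sketch is purely illustrative and is not taken from the dissertation; the four-symbol source and its probabilities are hypothetical.

```python
import math

def entropy(pmf):
    """Average number of bits per symbol, H(X) = -sum p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# A uniform four-symbol source needs 2 bits/symbol, but this skewed
# (hypothetical) distribution can be losslessly compressed to 1.75 bits/symbol.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # -> 1.75
```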

Figure 1-2. Theoretical system model with separate source coding and channel coding.

Based on the source channel separation theorem, the theoretical system model separates source coding and channel coding and performs them sequentially as in Fig. 1-2. However, this theorem is derived under the condition that all transmission errors can be corrected by the channel coding with an arbitrarily low error probability. That is, it implicitly assumes that there is no distortion caused by transmission errors in the system model. Although decreasing the channel protection, i.e., the redundant bits, will increase the transmission errors, it also reduces the distortion caused by lossy source coding given the same channel capacity. Therefore, it is still not clear what the minimum distortion is for the given source and channel if the restriction of arbitrarily low probability of transmission error is lifted. In addition, the channel capacity is derived based on the assumptions of infinite block length, random coding and a stationary channel. On the other hand, the rate-distortion (R-D) bound is derived based on the assumptions of infinite block length, random coding, and stationary sources. These assumptions in both channel capacity and the R-D bound incur infinite delay, infinitely high complexity, and a mismatch between theoretical and practical source and channel models.

1.1.2 Challenges in the Practical System

In a practical wireless video communication system, the resources are very limited. There are usually four kinds of resources, that is, time, bandwidth, power and space, which can be utilized to improve wireless video performance. However, all of these four resources are usually limited in practice.

Specifically, 1) the end-to-end delay, the sum of the source coding delay and the transmission delay, for the video signal to be reproduced by the video decoder is under a certain delay bound; 2) the achievable data rate, the sum of the information rate and the redundancy rate, is under a certain bandwidth limit; 3) the total power consumed by video encoding and by transmission is under a certain constraint; and 4) the channel gain in a wireless fading channel statistically depends on the geographical position and environment. Therefore, due to the limited resources, the probability of transmission error cannot be made arbitrarily low in a practical system. Instead, a more desirable system design is to minimize the end-to-end distortion under the resource constraints while allowing transmission errors at a certain level.

In a practical wireless communication system, modulation and error control coding are designed to mitigate bit errors during transmission through an error-prone channel. In the application layer, error-resilient coding at the encoder and error concealment at the decoder are designed to reduce the distortion caused by such transmission errors. We call the distortion caused by transmission errors the transmission distortion, denoted by D_t. In a practical video coding system, predictive coding, quantization, transform, and entropy coding are adopted together to compress the bits. Such a source coding scheme produces error during quantization.¹ We call the distortion caused by quantization error the quantization distortion, denoted by D_q. As a result, the distortion between the original video and the reconstructed video at the video decoder is caused by both the quantization error and the transmission error. We call them together the end-to-end distortion, denoted by D_ete. The practical system model is shown in Fig. 1-3. On the one hand, the transmission distortion is a function of the transmission error, which is in turn a function of the signal-to-noise ratio (SNR), bandwidth, delay requirement, and channel protection parameters, e.g., modulation order and channel coding rate.

¹ In modern video codecs, e.g., the H.264 codec, the transform is designed to be reversible.
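To make the three distortion terms concrete, the following minimal sketch (not part of the dissertation) measures them as mean squared errors between an original frame, its encoder reconstruction, and its decoder reconstruction. The frames and the two noise models are purely hypothetical stand-ins for lossy source coding and channel impairments.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(144, 176)).astype(np.float64)        # source frame
enc_recon = np.clip(original + rng.normal(0, 3, original.shape), 0, 255)   # after lossy source coding
dec_recon = np.clip(enc_recon + rng.normal(0, 8, original.shape), 0, 255)  # after channel impairments

D_q   = np.mean((original - enc_recon) ** 2)   # quantization distortion
D_t   = np.mean((enc_recon - dec_recon) ** 2)  # transmission distortion
D_ete = np.mean((original - dec_recon) ** 2)   # end-to-end distortion
# Note: D_ete is not necessarily D_q + D_t, since the two error terms may be correlated.
print(D_q, D_t, D_ete)
```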

Figure 1-3. Practical system model of a wireless video communication system.

On the other hand, the quantization distortion is a function of the available source data rate, the complexity requirement, the source encoder structure and the source coding parameters, e.g., the allowable finite set for quantization. Now the problem in a practical system can be formulated as: given the source, channel, resources and system structure, how to tune the system parameters to minimize the end-to-end distortion. This problem is very challenging since 1) the statistical properties of the video source are unknown and the source is usually not stationary; 2) the wireless channel is time varying; 3) all resources are limited; 4) the system is a complex system, e.g., non-linear; and 5) the system parameters in different layers are usually coupled; for example, increasing the channel coding rate in the transmitter will decrease the source data rate for compression. To tackle this complex problem, we need to follow these steps: 1) finding stable video statistics for quantization distortion and deriving the quantization distortion as a function of the source rate constraint (R_s), complexity constraint (C_s), video codec structure and those stable video statistics (θ); 2) finding stable video statistics for transmission distortion and deriving the transmission distortion as a function of packet error probability (PEP), codec structure and those stable video statistics (θ);

3) deriving the PEP as a function of SNR, channel coding rate (R_c), bandwidth, and transmission delay (d_t); and 4) minimizing the end-to-end distortion under the resource constraints.

Thanks to the source channel separation theorem, source coding and channel coding have been extensively studied separately. In other words, the first step has been extensively studied by the source coding community and the third step has been extensively studied by the communication community. In the Open System Interconnection Reference Model (OSI Reference Model or OSI Model), the first step belongs to the application layer, and the third step belongs to the lower layers. Although they have been relatively extensively researched, they still need to be further investigated in order to design a practical system with the minimum end-to-end distortion. On the other hand, the second step, which is in fact a cross-layer problem, has long been omitted by both communities. Until now, there is still no well accepted theoretical analysis for this cross-layer problem. If we can find transmission distortion as a closed-form function of PEP, we may be able to analytically derive the minimum end-to-end distortion for most existing wireless video communication systems, which are designed based on the source channel separation theorem.

1.2 Contributions of This Dissertation

The major contributions of our work are summarized as follows:

1. We analytically derive the transmission distortion formulae as a function of PEP and video statistics for wireless video communication systems.

2. With consideration of spatio-temporal correlation, nonlinear codec and time-varying channel, our formulae provide, for the first time, the following capabilities:
   - support of distortion prediction at different levels (e.g., pixel/frame/GOP level);
   - support of multi-reference picture motion compensated prediction;
   - support of slice data partitioning;
   - support of arbitrary slice-level packetization with the FMO mechanism;
   - applicability to time-varying channels;
   - one unified formula for both I-MBs and P-MBs;

   - support of both low motion and high motion video sequences.

3. Besides deriving the transmission distortion formulae, we also identify two important properties of transmission distortion for the first time:
   - clipping noise, produced by non-linear clipping, causes decay of the propagated error;
   - the correlation between the motion vector concealment error and the propagated error is negative, and has the dominant impact on transmission distortion among all the correlations between any two of the four components of the transmission error.

4. We also discuss the relationship between our formula and existing models, and specify the conditions under which those existing models are accurate.

5. We design algorithms to estimate the correlation ratio and the propagation factor, which facilitates the design of a low-complexity algorithm for estimating the frame-level transmission distortion (FTD).

6. By using the analytically derived formulae and the parameters estimated from statistics, our FTD estimation algorithm, called RMPC-FTD, is more accurate and more robust than existing FTD algorithms.

7. Another advantage of our RMPC-FTD algorithm is that all parameters in the formulae can be estimated by using the instantaneous video frame statistics and channel conditions, which allows the video frame statistics to be time-varying and the transmission error processes to be non-stationary. As a result, our RMPC-FTD algorithm is more suitable for real-time video communication.

8. We also design an estimation algorithm, called RMPC-PTD, for pixel-level transmission distortion (PTD) by utilizing the known values of the MV and the corresponding residual to further improve the estimation accuracy and decrease the estimation complexity.

9. We also extend RMPC-PTD to estimate pixel-level end-to-end distortion (PEED) with an algorithm called RMPC-PEED. Our RMPC-PEED algorithm provides not only more accurate estimation but also lower complexity and a higher degree of extensibility than the existing methods.

10. We apply our RMPC-PEED algorithm to prediction mode decision in H.264; the resulting algorithm is called RMPC-MS. Experimental results show that our RMPC-MS algorithm achieves more than 1 dB gain over existing algorithms.

11. To facilitate the design of subpixel-level Mean Square Error (MSE) distortion estimation for mode decision in H.264 video encoders, we prove a general theorem for calculating the second moment of a weighted sum of correlated random variables without requiring their probability distributions.

12. We apply our theorem to the design of a very low-complexity algorithm, which we call the ERMPC algorithm, for mode decision in H.264. Experimental results show that ERMPC further achieves 0.25 dB PSNR gain over the RMPC-MS algorithm.

13. We derive a more accurate source bit rate model and quantization distortion model than existing parametric models.

14. We improve the performance bound for channel coding with convolutional codes and a Viterbi decoder, and derive its performance under a Rayleigh block fading channel.

15. We design an R-D optimized cross-layer rate control (CLRC) algorithm by jointly choosing the quantization step size and the channel coding rate based on the given instantaneous channel condition, e.g., SNR and channel bandwidth.

1.3 Structure of the Dissertation

In Chapter 2, we analytically derive the transmission distortion formulae as a function of PEP and video statistics for wireless video communication systems. We explain the limitations of existing transmission distortion models, in which the significant effect of clipping noise on the transmission distortion has long been omitted. We then derive both the PTD and the FTD with the clipping noise in the system taken into account. We also discuss the relationship between our formula and existing models; we specify the conditions under which those existing models are accurate.

In Chapter 3, we design practical algorithms to estimate the system parameters, and from the estimated parameters, we may calculate the FTD by using the formulae derived in Chapter 2. For PTD, we utilize the known values, e.g., the residuals, in the video codec in place of the statistics of the corresponding random variables to simplify the PTD estimation and design a low-complexity and high-accuracy PTD estimation algorithm. We also extend the RMPC-PTD algorithm to estimate PEED with a high degree of extensibility. We then apply our RMPC-PEED algorithm to mode decision in H.264 to achieve the minimum R-D cost. The complexity and memory requirements of our RMPC-MS algorithm and existing mode selection algorithms are carefully compared in this chapter.

Experimental results are given to compare the estimation accuracy, robustness, R-D performance and extensibility of our algorithms and existing algorithms.

In Chapter 4, we extend our RMPC-MS algorithm designed in Chapter 3 to support some performance-enhancing parts, e.g., the interpolation filter, in the H.264 codec. We first prove a new theorem for calculating the second moment of a weighted sum of correlated random variables without requiring their probability distributions. Then, we apply the theorem to extend the design of the previous RMPC-MS algorithm to support the interpolation filtering in H.264. We call the new algorithm the ERMPC algorithm. We also discuss the merits and limitations of our ERMPC algorithm. Experimental results are given to compare the R-D performance and subjective performance of ERMPC and existing algorithms.

In Chapter 5, we aim to design a rate-distortion optimized cross-layer rate control (CLRC) algorithm for wireless video communication. To this end, we derive a more accurate source bit rate model and quantization distortion model than existing parametric models. We also improve the performance bound of channel coding with convolutional codes and a Viterbi decoder, and derive its performance under Rayleigh block fading channels. Given the instantaneous channel condition, i.e., SNR and bandwidth, we design the rate-distortion optimized CLRC algorithm by jointly choosing the quantization step size and the channel coding rate. Experimental results are given to compare the accuracy of our models and existing models. We also compare the R-D performance and subjective performance of our algorithms and existing algorithms in this chapter.

Finally, Chapter 6 concludes the dissertation and provides an outlook on our future work.

CHAPTER 2
PREDICTION OF TRANSMISSION DISTORTION FOR WIRELESS VIDEO COMMUNICATION: ANALYSIS

In this chapter, we analytically derive the transmission distortion formulae for wireless video communication systems. We also discuss the relationship between our formulae and existing models.

2.1 Background on Transmission Distortion Prediction

Transmission distortion is caused by packet errors during the transmission of a video sequence, and it is the major part of the end-to-end distortion in delay-sensitive wireless video communication¹ under high packet error probability (PEP), e.g., in a wireless fading channel. The capability of predicting transmission distortion at the transmitter can assist in designing video encoding and transmission schemes that achieve maximum video quality under resource constraints. Specifically, transmission distortion prediction can be used in the following three applications in video encoding and transmission: 1) mode selection, which is to find the best intra/inter-prediction mode for encoding a macroblock (MB) with the minimum rate-distortion (R-D) cost given the instantaneous PEP; 2) cross-layer rate control, which is to control the instantaneously encoded bit rate of a real-time encoder to minimize the frame-level end-to-end distortion given the instantaneous PEP, e.g., in video conferencing; and 3) packet scheduling, which chooses a subset of packets of the pre-coded video to transmit and intentionally discards the remaining packets to minimize the GOP-level (Group of Pictures) end-to-end distortion given the average PEP and average burst length, e.g., in streaming pre-coded video over networks. All three applications require a formula for predicting how transmission distortion is affected by their respective control policy, in order to choose the optimal mode, encoding rate, or transmission schedule.

¹ Delay-sensitive wireless video communication usually does not allow retransmission to correct packet errors since retransmission may cause long delay.

However, predicting transmission distortion poses a great challenge due to the spatio-temporal correlation inside the input video sequence, the nonlinearity of both the encoder and the decoder, and the varying PEP in time-varying channels. In a typical video codec, the temporal correlation among consecutive frames and the spatial correlation among the adjacent pixels of one frame are exploited to improve the coding efficiency. Nevertheless, such a coding scheme brings much difficulty to predicting transmission distortion because a packet error will degrade not only the video quality of the current frame but also that of the following frames due to error propagation. In addition, as we will see in Section 2.3, the nonlinearity of both the encoder and the decoder makes the instantaneous transmission distortion not equal to the sum of the distortions caused by individual error events. Furthermore, in a wireless fading channel, the PEP is time-varying, which makes the error process a non-stationary random process; hence, as a function of the error process, the distortion process is also a non-stationary random process.

According to the aforementioned three applications, the existing algorithms for estimating transmission distortion can be categorized into the following three classes: 1) pixel-level or block-level algorithms (applied to mode selection), e.g., the Recursive Optimal Per-pixel Estimate (ROPE) algorithm [4] and the Law of Large Numbers (LLN) algorithm [5, 6]; 2) frame-level or packet-level or slice-level algorithms (applied to cross-layer rate control) [7-11]; and 3) GOP-level or sequence-level algorithms (applied to packet scheduling) [12-16]. Although the existing distortion estimation algorithms work at different levels, they share some common properties, which come from the inherent characteristics of a wireless video communication system, that is, spatio-temporal correlation, nonlinear codec and time-varying channel. In this chapter, we use a divide-and-conquer approach to decompose the complicated transmission distortion into four components, and analyze their effects on transmission distortion individually.

This divide-and-conquer approach enables us to identify the governing law that describes how the transmission distortion process evolves over time. Stuhlmüller et al. [8] observed that the distortion caused by the propagated error decays over time due to spatial filtering and intra coding of MBs, and analytically derived a formula for estimating transmission distortion under spatial filtering and intra coding. The effect of spatial filtering is analyzed under the implicit assumption that MVs are always correctly received at the receiver, while the effect of intra coding is modeled as a linear decay under another implicit assumption that the I-MBs are also always correctly received at the receiver. However, these two assumptions are usually not valid in realistic delay-sensitive wireless video communication. To address this, this chapter derives the transmission distortion formula under the condition that both I-MBs and MVs may be erroneous at the receiver. In addition, we observe an interesting phenomenon: even without using spatial filtering and intra coding, the distortion caused by the propagated error still decays! We identify, for the first time, that this decay is caused by non-linear clipping, which is used to clip those out-of-range² reconstructed pixels after motion compensation; this is the first of the two properties identified in this chapter. While such out-of-range values produced by the inverse transform of quantized transform coefficients are negligible at the encoder, their counterpart produced by transmission errors at the decoder has a significant impact on transmission distortion.

Some existing works [8, 9] estimate transmission distortion based on a linear time-invariant (LTI) system model, which regards the packet error as input and the transmission distortion as output. The LTI model simplifies the analysis of transmission distortion. However, it sacrifices accuracy in distortion estimation since it neglects the effect of the correlation between the newly induced error and the propagated error.

² A reconstructed pixel value may be out of the range of the original pixel value, e.g., [0, 255].

Liang et al. [16] studied the effect of correlation and observed that the LTI models [8, 9] underestimate transmission distortion due to the positive correlation between two adjacent erroneous frames; however, they did not consider the effect of motion vector (MV) errors on transmission distortion and their algorithm was not tested with high motion videos. To address these issues and find the root cause of that underestimation, this chapter classifies the transmission reconstructed error into three independent random errors, namely, the Residual Concealment Error (RCE), the MV Concealment Error (MVCE), and the propagated error; the first two types of error are called newly induced errors. We identify, for the first time, that the MVCE is negatively correlated with the propagated error and that, for high motion videos, this correlation has the dominant impact on transmission distortion among all the correlations between any two of the three error types; this is the second of the two properties identified in this chapter. For this reason, as long as MV transmission errors exist in high motion videos, the LTI model over-estimates transmission distortion. We also quantify the effect of the individual error types and their correlations on transmission distortion in this chapter. Thanks to the analysis of the correlation effect, our distortion formula is accurate for both low motion video and high motion video, as verified by experimental results. Another merit of considering the effect of MV errors on transmission distortion is the applicability of our results to video communication with slice data partitioning, where the residual and MV could be transmitted under Unequal Error Protection (UEP).

Refs. [4, 5, 10, 11] proposed models to estimate transmission distortion under the consideration that both MVs and I-MBs may experience transmission errors. However, the parameters in the linear models [10, 11] can only be acquired by experimental curve-fitting over multiple frames, which prevents those models from estimating instantaneous distortion. In addition, the linear models [10, 11] still assume there is no correlation between the newly induced error and the propagated error.

In Ref. [4], the ROPE algorithm considers the correlation between the MV concealment error and the propagated error by recursively calculating the second moment of the reconstructed pixel value. However, ROPE neglects the non-linear clipping function and therefore over-estimates the distortion. In addition, the extension of the ROPE algorithm [17] to support averaging operations, such as interpolation and deblocking filtering in H.264, requires intensive computation of correlation coefficients due to the high correlation between the reconstructed values of adjacent pixels, thereby prohibiting its application to H.264. In the H.264 reference code JM14.0³, the LLN algorithm [5] is adopted since it is capable of supporting both clipping and averaging operations. However, in order to predict transmission distortion, all possible error events for each pixel in all frames should be simulated at the encoder, which significantly increases the complexity of the encoder. Different from Refs. [4, 5], the divide-and-conquer approach in this chapter enables our formula to provide not only more accurate prediction but also lower complexity and a higher degree of extensibility. The multiple reference picture motion compensated prediction extended from the single reference is analyzed in Section 2.5, and, for the first time, the effect of multiple references on transmission distortion is quantified. In addition, the transmission distortion formula derived in this chapter is unified for both I-MBs and P-MBs, in contrast to the two different formulae in Refs. [4, 10, 11].

Different from wired channels, wireless channels suffer from multipath fading, which can be regarded as multiplicative random noise. Fading leads to time-varying PEP and burst errors in wireless video communication. Ref. [8] uses a two-state stationary Markov chain to model burst errors. However, even if the channel gain is stationary, the packet error process is a non-stationary random process. Specifically, since PEP is a function of the channel gain [18], which is not constant in a wireless fading channel, the instantaneous PEP is also not constant.

³ jm/jm14.0.zip

This means the probability distribution of the packet error state is time-varying in wireless fading channels, that is, the packet error process is a non-stationary random process. Hence the Markov chain in Ref. [8] is neither stationary nor ergodic for a wireless fading channel. As a result, averaging the burst length and PEP as in Ref. [8] cannot accurately predict the instantaneous distortion. To address this, this chapter derives the formula for Pixel-level Transmission Distortion (PTD) by considering non-stationarity over time. Regarding the Frame-level Transmission Distortion (FTD), since two adjacent MBs may be assigned to two different packets under the slice-level packetization and FMO mechanism in H.264 [19, 20], their error probabilities could be different. However, existing frame-level distortion models [8-11] assume all pixels in the same frame experience the same channel condition. As a result, the applicable scope of those models is limited to video with small resolution. In contrast, this chapter derives the formula for FTD by considering non-stationarity over space. Due to the consideration of non-stationarity over both time and space, our formula provides an accurate prediction of transmission distortion in a time-varying channel.

The rest of the chapter is organized as follows. Section 2.2 presents the preliminaries of the system under study to facilitate the derivations in the later sections, and illustrates the limitations of existing transmission distortion models. In Section 2.3, we derive the transmission distortion formula as a function of video statistics, channel condition, and codec system parameters. Section 2.4 discusses the relationship between our formula and the existing models. In Section 2.5, we extend the formulae for PTD and FTD from single-reference to multi-reference prediction.

2.2 System Description

2.2.1 Structure of a Wireless Video Communication System

Fig. 2-1 shows the structure of a typical wireless video communication system. It consists of an encoder, two channels and a decoder, where residual packets and MV packets are transmitted over their respective channels.

If residual packets or MV packets are erroneous, the error concealment module will be activated. In typical video encoders such as H.263/264 and MPEG-2/4 encoders, the functional blocks can be divided into two classes: 1) basic parts, such as predictive coding, transform, quantization, entropy coding, motion compensation, and clipping; and 2) performance-enhancing parts, such as interpolation filtering, deblocking filtering, B-frames, multi-reference prediction, etc. Although up-to-date video encoders include more and more performance-enhancing parts, the basic parts do not change. In this chapter, we use the structure in Fig. 2-1 for transmission distortion analysis. Note that in this system, both the residual channel and the MV channel are application-layer channels; specifically, both channels consist of entropy coding and entropy decoding, networking layers⁴, and the physical layer (including channel encoding, modulation, wireless fading channel, demodulation, and channel decoding). Although the residual channel and the MV channel usually share the same physical-layer channel, the two application-layer channels may have different parameter settings (e.g., different channel code rates) for slice data partitioning under UEP. For this reason, our formula obtained from the structure in Fig. 2-1 can be used to estimate transmission distortion for an encoder with slice data partitioning.

2.2.2 Clipping Noise

In this subsection, we examine the effect of clipping noise on the reconstructed pixel value along each pixel trajectory over time (frames). All pixel positions in a video sequence form a three-dimensional spatio-temporal domain, i.e., two dimensions in the spatial domain and one dimension in the temporal domain. Each pixel can be uniquely represented by u^k in this three-dimensional time-space, where k means the k-th frame in the temporal domain and u is a two-dimensional vector in the spatial domain. The philosophy behind inter-coding of a video sequence is to represent the video sequence by virtual motion of each pixel, i.e., each pixel recursively moves from position v_u^{k-1} to position u^k.

⁴ Here, networking layers can include any layers other than the physical layer.

Figure 2-1. System structure, where T, Q, Q⁻¹, and T⁻¹ denote transform, quantization, inverse quantization, and inverse transform, respectively.

The difference between these two positions is a two-dimensional vector called the MV of pixel u^k, i.e., mv_u^k = v_u^{k-1} - u^k. The difference between the pixel values at these two positions is called the residual of pixel u^k, that is, e_u^k = f_u^k - f̂^{k-1}_{u+mv}.⁵ Recursively, each pixel u^k in the k-th frame has one and only one reference pixel trajectory backward towards the latest I-frame. At the encoder, after transform, quantization, inverse quantization, and inverse transform of the residual, the reconstructed pixel value may be out of range and should be clipped as

    f̂_u^k = Γ(f̂^{k-1}_{u+mv} + ê_u^k),    (2-1)

⁵ For simplicity of notation, we move the superscript k of u to the superscript k of f whenever u appears in the subscript of f.

where Γ(·) is a clipping function defined by

    Γ(x) = γ_L, if x < γ_L;  x, if γ_L ≤ x ≤ γ_H;  γ_H, if x > γ_H,    (2-2)

and γ_L and γ_H are the user-specified low threshold and high threshold, respectively. Usually, γ_L = 0 and γ_H = 255.

The residual and MV at the decoder may be different from their counterparts at the encoder because of channel impairments. Denote by m̃v_u^k and ẽ_u^k the MV and residual at the decoder, respectively. Then, the reference pixel position for u^k at the decoder is ṽ_u^{k-1} = u^k + m̃v_u^k, and the reconstructed pixel value for u^k at the decoder is

    f̃_u^k = Γ(f̃^{k-1}_{u+m̃v} + ẽ_u^k).    (2-3)

In error-free channels, the reconstructed pixel value at the receiver is exactly the same as the reconstructed pixel value at the transmitter, because there is no transmission error and hence no transmission distortion. However, in error-prone channels, we know from (2-3) that f̃_u^k is a function of three factors: the received residual ẽ_u^k, the received MV m̃v_u^k, and the propagated error in f̃^{k-1}_{u+m̃v}. The received residual ẽ_u^k depends on three factors, namely, 1) the transmitted residual ê_u^k, 2) the residual packet error state, which depends on the instantaneous residual channel condition, and 3) the residual error concealment algorithm if the received residual packet is erroneous. Similarly, the received MV m̃v_u^k depends on 1) the transmitted MV mv_u^k, 2) the MV packet error state, which depends on the instantaneous MV channel condition, and 3) the MV error concealment algorithm if the received MV packet is erroneous. The propagated error in f̃^{k-1}_{u+m̃v} includes the error propagated from the reference frames, and therefore depends on all samples in the previous frames indexed by i < k as well as their reception error states and concealment algorithms.
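The reconstruction recursions (2-1) and (2-3) with the clipping function (2-2) can be written in a few lines. The sketch below is only an illustration of those equations (the function names are ours, not the codec's), using γ_L = 0 and γ_H = 255 as in the text.

```python
def clip(x, lo=0, hi=255):
    """Clipping function Gamma(x) of Eq. (2-2)."""
    return lo if x < lo else hi if x > hi else x

def encoder_reconstruct(ref_pixel, residual):
    """Eq. (2-1): clipped sum of the encoder reference pixel and the reconstructed residual."""
    return clip(ref_pixel + residual)

def decoder_reconstruct(ref_pixel, residual):
    """Eq. (2-3): same recursion with the received (possibly concealed) reference
    pixel and residual, so the result may differ from the encoder's."""
    return clip(ref_pixel + residual)

print(decoder_reconstruct(250, 50))  # -> 255, clipped from 300
```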

Table 2-1. Notations
u^k : three-dimensional vector that denotes a pixel position in a video sequence
f_u^k : value of the pixel u^k
e_u^k : residual of the pixel u^k
mv_u^k : MV of the pixel u^k
Δ_u^k : clipping noise of the pixel u^k
ε_u^k : residual concealment error of the pixel u^k
ξ_u^k : MV concealment error of the pixel u^k
ζ_u^k : transmission reconstructed error of the pixel u^k
S_u^k : error state of the pixel u^k
P_u^k : error probability of the pixel u^k
D_u^k : transmission distortion of the pixel u^k
D^k : transmission distortion of the k-th frame
V^k : set of all the pixels in the k-th frame
|V| : number of elements in set V (cardinality of V)
α_k : propagation factor of the k-th frame
β_k : percentage of I-MBs in the k-th frame
λ_k : correlation ratio of the k-th frame
w_k(j) : percentage of pixels in the k-th frame using frame k-j as reference

The non-linear clipping function within the pixel trajectory makes the distortion estimation more challenging. However, it is interesting to observe that clipping actually reduces transmission distortion. In Section 2.3, we will quantify the effect of clipping on transmission distortion.

Table 2-1 lists the notations used in this chapter. All vectors are in bold font. Note that the encoder needs to reconstruct the compressed video for predictive coding; hence the encoder and the decoder have a similar structure for pixel value reconstruction. To distinguish the variables in the reconstruction module of the encoder from those in the reconstruction module of the decoder, we add a hat (ˆ) to the variables at the encoder and a tilde (˜) to the variables at the decoder.

2.2.3 Definition of Transmission Distortion

In this subsection, we define the PTD and FTD to be derived in Section 2.3. To calculate FTD, we need some notation from set theory. In a video sequence, all pixel positions in the k-th frame form a two-dimensional vector set V^k, and we denote the number of elements in set V^k by |V^k|.

So, for any pixel at position u in the k-th frame, i.e., u ∈ V^k, its reference pixel position is chosen from the set V^{k-1} under single-reference prediction. Usually, the set V^k in a video sequence is the same for all frames k, i.e., V^1 = ... = V^k for all k > 1. Hence, we remove the frame index k and denote the set of pixel positions of an arbitrary frame by V. Note that in H.264, a reference pixel may be in a position outside the picture boundary; however, the set of reference pixels, which is larger than the input pixel set, is still the same for all frames k.

For a transmitter with feedback acknowledgement of whether a packet is correctly received at the receiver (called acknowledgement feedback), f̃_u^k at the decoder side can be perfectly reconstructed by the transmitter, as long as the transmitter knows the error concealment algorithm used by the receiver. Then, the transmission distortion for the k-th frame can be calculated as the mean squared error (MSE)

    MSE^k = (1/|V|) Σ_{u∈V} (f̂_u^k - f̃_u^k)².    (2-4)

For the encoder, every pixel intensity f_u^k of the random input video sequence is a random variable. For any encoder with hybrid coding (see Fig. 2-1), the residual ê_u^k, the MV mv_u^k, and the reconstructed pixel value f̂_u^k are functions of f_u^k; so they are also random variables before motion estimation.⁶ Given the Probability Mass Functions (PMF) of f̂_u^k and f̃_u^k, we define the transmission distortion for pixel u^k, or PTD, by

    D_u^k ≜ E[(f̂_u^k - f̃_u^k)²],    (2-5)

and we define the transmission distortion for the k-th frame, or FTD, by

    D^k ≜ E[(1/|V|) Σ_{u∈V} (f̂_u^k - f̃_u^k)²].    (2-6)

⁶ In applications such as cross-layer encoding rate control, distortion estimation for rate-distortion optimized bit allocation is required before motion estimation.
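Since (2-5) and (2-6) are expectations, a simulation typically approximates them by averaging over many channel realizations. The sketch below is only an illustration of the definitions; estimate_ptd_ftd is a hypothetical helper, and Monte-Carlo averaging is an assumption of the sketch, not the dissertation's estimation algorithm (which is derived analytically in the following sections).

```python
import numpy as np

def estimate_ptd_ftd(f_hat, decoder_runs):
    """f_hat: encoder-reconstructed frame, shape (H, W).
    decoder_runs: decoder reconstructions over N channel realizations, shape (N, H, W).
    Returns the per-pixel PTD map approximating (2-5) and the FTD approximating (2-6)/(2-7)."""
    ptd = np.mean((decoder_runs - f_hat) ** 2, axis=0)  # sample mean of (f_hat - f_tilde)^2 per pixel
    ftd = ptd.mean()                                    # frame-level average of the PTD map
    return ptd, ftd
```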

It is easy to prove that the relationship between FTD and PTD is characterized by

    D^k = (1/|V|) Σ_{u∈V} D_u^k.    (2-7)

If the number of bits used to compress a frame is too large to be contained in one packet, the bits of the frame are split into multiple packets. In a time-varying channel, different packets of the same frame may experience different packet error probabilities (PEP). If pixel u^k and pixel v^k belong to different packets, the PMF of f̃_u^k may be different from the PMF of f̃_v^k even if f̂_u^k and f̂_v^k are identically distributed. In other words, D_u^k may be different from D_v^k even if pixel u^k and pixel v^k are in neighboring MBs when FMO is activated. As a result, the FTD D^k in (2-7) may be different from the PTD D_u^k in (2-5). For this reason, we will derive formulae for both PTD and FTD, respectively. Note that most existing frame-level distortion models [8-11] assume that all pixels in the same frame experience the same channel condition and simply use (2-5) for FTD; however, this assumption is not valid for high-resolution/high-quality video transmission over a time-varying channel.

In fact, (2-7) is a general form for distortions at all levels. If |V| = 1, (2-7) reduces to (2-5). For slice/packet-level distortion, V is the set of the pixels contained in a slice/packet. For GOP-level distortion, V is the set of the pixels contained in a GOP. In this chapter, we only show how to derive formulae for PTD and FTD. Our methodology is also applicable to deriving formulae for slice/packet/GOP-level distortion by using the appropriate V.

2.2.4 Limitations of the Existing Transmission Distortion Models

In this subsection, we show that clipping noise has a significant impact on transmission distortion, and that the neglect of clipping noise in existing models results in inaccurate estimation of transmission distortion. We define the clipping noise for pixel u^k at the encoder as

    Δ̂_u^k ≜ (f̂^{k-1}_{u+mv} + ê_u^k) - Γ(f̂^{k-1}_{u+mv} + ê_u^k),    (2-8)

and the clipping noise for pixel u^k at the decoder as

    Δ̃_u^k ≜ (f̃^{k-1}_{u+m̃v} + ẽ_u^k) - Γ(f̃^{k-1}_{u+m̃v} + ẽ_u^k).    (2-9)

Using (2-1), Eq. (2-8) becomes

    f̂_u^k = f̂^{k-1}_{u+mv} + ê_u^k - Δ̂_u^k,    (2-10)

and using (2-3), Eq. (2-9) becomes

    f̃_u^k = f̃^{k-1}_{u+m̃v} + ẽ_u^k - Δ̃_u^k,    (2-11)

where Δ̂_u^k only depends on the video content and the encoder structure, e.g., motion estimation, quantization, mode decision and the clipping function, while Δ̃_u^k depends not only on the video content and the encoder structure, but also on the channel conditions and the decoder structure, e.g., error concealment and the clipping function.

In most existing works, both Δ̂_u^k and Δ̃_u^k are neglected, i.e., these works assume f̂_u^k = f̂^{k-1}_{u+mv} + ê_u^k and f̃_u^k = f̃^{k-1}_{u+m̃v} + ẽ_u^k. However, this assumption is only valid for stored video or error-free communication. For error-prone communication, the decoder clipping noise Δ̃_u^k has a significant impact on transmission distortion and hence should not be neglected. To illustrate this, Table 2-2 shows an example for the system in Fig. 2-1, where only the residual packet in the (k-1)-th frame is erroneous at the decoder (i.e., ê_v^{k-1} is erroneous), and all other residual packets and all the MV packets are error-free. Suppose the trajectory of pixel u^k in the (k-1)-th frame and the (k-2)-th frame is specified by v^{k-1} = u^k + mv_u^k and w^{k-2} = v^{k-1} + mv_v^{k-1}. Since ê_v^{k-1} is erroneous, the decoder needs to conceal the error; a simple concealment scheme is to let ẽ_v^{k-1} = 0. From this example, we see that neglecting the clipping noise (here Δ̃_u^k = 45) results in a highly inaccurate estimate of the distortion, e.g., the estimated distortion D̂_u^k = 2500 (without considering clipping) is much larger than the true distortion D_u^k = 25.

Table 2-2. An example that shows the effect of clipping noise on transmission distortion.
Encoder
  Transmitted:    f̂_w^{k-2} = 250;  ê_v^{k-1} = -50 (erroneous);  ê_u^k = 50
  Reconstructed:  f̂_w^{k-2} = 250;  f̂_v^{k-1} = Γ(f̂_w^{k-2} + ê_v^{k-1}) = 200;  f̂_u^k = Γ(f̂_v^{k-1} + ê_u^k) = 250
Decoder
  Received:       f̃_w^{k-2} = 250;  ẽ_v^{k-1} = 0 (concealed);  ẽ_u^k = 50
  Reconstructed:  f̃_w^{k-2} = 250;  f̃_v^{k-1} = Γ(f̃_w^{k-2} + ẽ_v^{k-1}) = 250;  f̃_u^k = Γ(f̃_v^{k-1} + ẽ_u^k) = 255
  Clipping noise: Δ̃_w^{k-2} = 0;  Δ̃_v^{k-1} = 0;  Δ̃_u^k = 45
  Distortion:     D_w^{k-2} = 0;  D_v^{k-1} = (f̂_v^{k-1} - f̃_v^{k-1})² = 2500;  D_u^k = (f̂_u^k - f̃_u^k)² = 25
Prediction without clipping
  Received:       f̃_w^{k-2} = 250;  ẽ_v^{k-1} = 0 (concealed);  ẽ_u^k = 50
  Reconstructed:  f̃_w^{k-2} = 250;  f̃_v^{k-1} = f̃_w^{k-2} + ẽ_v^{k-1} = 250;  f̃_u^k = f̃_v^{k-1} + ẽ_u^k = 300
  Distortion:     D̂_w^{k-2} = 0;  D̂_v^{k-1} = (f̂_v^{k-1} - f̃_v^{k-1})² = 2500;  D̂_u^k = (f̂_u^k - f̃_u^k)² = 2500

Note that if an MV is erroneous at the decoder, the pixel trajectory at the decoder will be different from the trajectory at the encoder; the resulting clipping noise Δ̃_u^k may then be much larger than the 45 in this example, and hence the distortion estimates of the existing models that do not consider clipping may be much more inaccurate. On the other hand, the encoder clipping noise Δ̂_u^k has a negligible effect on quantization distortion and transmission distortion. This is due to two reasons: 1) the probability that Δ̂_u^k = 0 is close to one, since the probability that γ_L ≤ f̂^{k-1}_{u+mv} + ê_u^k ≤ γ_H is close to one; and 2) in case Δ̂_u^k ≠ 0, Δ̂_u^k usually takes a value that is much smaller than the residuals. Since Δ̂_u^k is negligible, the clipping function can be removed at the encoder if only quantization distortion needs to be considered, e.g., for stored video or error-free communication. Since Δ̂_u^k is very likely to be a very small value, we neglect it and assume Δ̂_u^k = 0 in deriving our formula for transmission distortion.
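The numbers in Table 2-2 can be reproduced in a few lines with a clip helper like the one sketched after Eq. (2-3). This toy script is only an illustration; it shows how dropping the decoder clipping in (2-11) inflates the predicted distortion of pixel u^k from 25 to 2500.

```python
def clip(x, lo=0, hi=255):
    return lo if x < lo else hi if x > hi else x

# Encoder side (Table 2-2): reference 250, residuals -50 and then +50.
f_hat_v = clip(250 + (-50))        # 200
f_hat_u = clip(f_hat_v + 50)       # 250

# Decoder side: the residual of frame k-1 is lost and concealed as 0.
f_tilde_v = clip(250 + 0)          # 250
f_tilde_u = clip(f_tilde_v + 50)   # 255 (clipped from 300; clipping noise = 45)
print((f_hat_u - f_tilde_u) ** 2)  # true transmission distortion: 25

# Same prediction without the clipping function:
f_tilde_u_noclip = f_tilde_v + 50            # 300
print((f_hat_u - f_tilde_u_noclip) ** 2)     # over-estimated distortion: 2500
```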

2.3 Transmission Distortion Formulae

In this section, we derive the formulae for PTD and FTD. The section is organized as follows. Section 2.3.1 presents an overview of our approach to analyzing PTD and FTD. Then we elaborate on the derivation details in Section 2.3.2 through Section 2.3.5. Specifically, Section 2.3.2 quantifies the effect of RCE on transmission distortion; Section 2.3.3 quantifies the effect of MVCE on transmission distortion; Section 2.3.4 quantifies the effect of propagated error and clipping noise on transmission distortion; and Section 2.3.5 quantifies the effect of the correlations (between any two of the error sources) on transmission distortion. Finally, Section 2.3.6 summarizes the key results of this chapter, i.e., the formulae for PTD and FTD.

2.3.1 Overview of the Approach to Analyzing PTD and FTD

To analyze PTD and FTD, we take a divide-and-conquer approach. We first divide the transmission reconstructed error into four components: three independent random errors (RCE, MVCE and propagated error), classified by their clearly different root causes, and the clipping noise, which is a non-linear function of those three random errors. This error decomposition allows us to further decompose transmission distortion into four terms, i.e., the distortion caused by 1) RCE, 2) MVCE, 3) propagated error plus clipping noise, and 4) the correlations between any two of the error sources. This distortion decomposition facilitates the derivation of a simple and accurate closed-form formula for each of the four distortion terms.

Next, we elaborate on the error decomposition and the distortion decomposition. Define the transmission reconstructed error for pixel $u^k$ by $\zeta_u^k \triangleq \hat f_u^k - \tilde f_u^k$. From (2-10) and (2-11), we obtain
$$\zeta_u^k = \big(\hat e_u^k + \hat f_{u+mv_u^k}^{k-1} - \hat\Delta_u^k\big) - \big(\tilde e_u^k + \tilde f_{u+\widetilde{mv}_u^k}^{k-1} - \tilde\Delta_u^k\big) = \big(\hat e_u^k - \tilde e_u^k\big) + \big(\hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\widetilde{mv}_u^k}^{k-1}\big) + \big(\hat f_{u+\widetilde{mv}_u^k}^{k-1} - \tilde f_{u+\widetilde{mv}_u^k}^{k-1}\big) - \big(\hat\Delta_u^k - \tilde\Delta_u^k\big). \qquad (2\text{-}12)$$
Define the RCE by $\tilde\varepsilon_u^k \triangleq \hat e_u^k - \tilde e_u^k$, and define the MVCE by $\tilde\xi_u^k \triangleq \hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\widetilde{mv}_u^k}^{k-1}$. Note that $\hat f_{u+\widetilde{mv}_u^k}^{k-1} - \tilde f_{u+\widetilde{mv}_u^k}^{k-1} = \zeta_{u+\widetilde{mv}_u^k}^{k-1}$, which is the transmission reconstructed error of the concealed reference pixel in the reference frame; we call $\zeta_{u+\widetilde{mv}_u^k}^{k-1}$ the propagated error. As mentioned in Section 2.2.4, we assume $\hat\Delta_u^k = 0$. Therefore, (2-12) becomes
$$\zeta_u^k = \tilde\varepsilon_u^k + \tilde\xi_u^k + \zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k. \qquad (2\text{-}13)$$
(2-13) is our proposed error decomposition.

Combining (2-5) and (2-13), we have
$$\begin{aligned} D_u^k &= E\Big[\big(\tilde\varepsilon_u^k + \tilde\xi_u^k + \zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k\big)^2\Big] \\ &= E\big[(\tilde\varepsilon_u^k)^2\big] + E\big[(\tilde\xi_u^k)^2\big] + E\Big[\big(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k\big)^2\Big] + 2E\big[\tilde\varepsilon_u^k\,\tilde\xi_u^k\big] + 2E\Big[\tilde\varepsilon_u^k\big(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k\big)\Big] + 2E\Big[\tilde\xi_u^k\big(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k\big)\Big]. \end{aligned} \qquad (2\text{-}14)$$
Denote $D_u^k(r) \triangleq E[(\tilde\varepsilon_u^k)^2]$, $D_u^k(m) \triangleq E[(\tilde\xi_u^k)^2]$, $D_u^k(P) \triangleq E[(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k)^2]$, and $D_u^k(c) \triangleq 2E[\tilde\varepsilon_u^k\,\tilde\xi_u^k] + 2E[\tilde\varepsilon_u^k(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k)] + 2E[\tilde\xi_u^k(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k)]$. Then, (2-14) becomes
$$D_u^k = D_u^k(r) + D_u^k(m) + D_u^k(P) + D_u^k(c). \qquad (2\text{-}15)$$
(2-15) is our proposed distortion decomposition for PTD. The reason why we combine the propagated error and the clipping noise into one term (called the clipped propagated error) is that the clipping noise is mainly caused by the propagated error, and such a decomposition simplifies the formulae.

There are three major reasons for the decompositions in (2-13) and (2-15). First, if we directly substitute (2-10) and (2-11) into (2-5), the expansion produces 5 second moments and 10 cross-correlation terms (assuming $\hat\Delta_u^k = 0$); since the three independent random errors give rise to 8 possible error events, there are a total of 8 × (5+10) = 120 terms for PTD, making the analysis highly complicated. In contrast, the decompositions in (2-13) and (2-15) significantly simplify the analysis. Second, each term in (2-13) and (2-15) has a clear physical meaning, and can therefore be estimated accurately with low complexity. Third, such decompositions allow our formulae to be easily extended to advanced video codecs with additional performance-enhancing tools, e.g., multi-reference prediction and interpolation filtering.

To derive the formula for FTD, from (2-7) and (2-15) we obtain
$$D^k = D^k(r) + D^k(m) + D^k(P) + D^k(c), \qquad (2\text{-}16)$$

where
$$D^k(r) = \frac{1}{|V^k|}\sum_{u\in V^k} D_u^k(r), \qquad (2\text{-}17)$$
$$D^k(m) = \frac{1}{|V^k|}\sum_{u\in V^k} D_u^k(m), \qquad (2\text{-}18)$$
$$D^k(P) = \frac{1}{|V^k|}\sum_{u\in V^k} D_u^k(P), \qquad (2\text{-}19)$$
$$D^k(c) = \frac{1}{|V^k|}\sum_{u\in V^k} D_u^k(c). \qquad (2\text{-}20)$$
(2-16) is our proposed distortion decomposition for FTD. Next, we present the derivation of a closed-form formula for each of the four distortion terms in Section 2.3.2 through Section 2.3.5.

2.3.2 Analysis of Distortion Caused by RCE

In this subsection, we first derive the pixel-level residual-caused distortion $D_u^k(r)$, and then the frame-level residual-caused distortion $D^k(r)$.

2.3.2.1 Pixel-level distortion caused by RCE

We denote by $S_u^k$ the state indicator of whether there is a transmission error for pixel $u^k$ after channel decoding. Note that, as mentioned in Section 2.2.1, both the residual channel and the MV channel contain channel decoding; hence, in this chapter, a transmission error in the residual channel or the MV channel means an error that is uncorrectable by the channel decoding. To distinguish the residual error state from the MV error state, we use $S_u^k(r)$ to denote the residual error state for pixel $u^k$: $S_u^k(r) = 1$ if $\hat e_u^k$ is received with error, and $S_u^k(r) = 0$ if $\hat e_u^k$ is received without error. At the receiver, if there is no residual transmission error for pixel $u^k$, $\tilde e_u^k$ is equal to $\hat e_u^k$. However, if the residual packet is received with error, the receiver needs to conceal the residual error. Denoting by $\check e_u^k$ the concealed residual when $S_u^k(r) = 1$, we have
$$\tilde e_u^k = \begin{cases} \check e_u^k, & S_u^k(r) = 1 \\ \hat e_u^k, & S_u^k(r) = 0. \end{cases} \qquad (2\text{-}21)$$

Note that $\check e_u^k$ depends on $\hat e_u^k$ and the residual concealment method, but does not depend on the channel condition. From the definition of $\tilde\varepsilon_u^k$ and (2-21), we have
$$\tilde\varepsilon_u^k = (\hat e_u^k - \check e_u^k)\,S_u^k(r) + (\hat e_u^k - \hat e_u^k)\,(1 - S_u^k(r)) = (\hat e_u^k - \check e_u^k)\,S_u^k(r). \qquad (2\text{-}22)$$
$\hat e_u^k$ depends on the input video sequence and the encoder structure, while $S_u^k(r)$ depends on communication system parameters such as the delay bound, channel coding rate, transmission power, and channel gain of the wireless channel. Under our framework shown in Fig. 2-1, the input video sequence and the encoder structure are independent of the communication system parameters. Since $\hat e_u^k$ and $S_u^k(r)$ are caused by independent sources, we assume that they are independent. That is, we make the following assumption.

Assumption 1. $S_u^k(r)$ is independent of $\hat e_u^k$.

Assumption 1 means that whether $\hat e_u^k$ is correctly received or not does not depend on the value of $\hat e_u^k$. Denote $\varepsilon_u^k \triangleq \hat e_u^k - \check e_u^k$; then $\tilde\varepsilon_u^k = \varepsilon_u^k\,S_u^k(r)$. Denote by $P_u^k(r)$ the residual pixel error probability (XEP) for pixel $u^k$, that is, $P_u^k(r) \triangleq P\{S_u^k(r) = 1\}$. Then, from (2-22) and Assumption 1, we have
$$D_u^k(r) = E[(\tilde\varepsilon_u^k)^2] = E[(\varepsilon_u^k)^2]\cdot E[(S_u^k(r))^2] = E[(\varepsilon_u^k)^2]\cdot P_u^k(r). \qquad (2\text{-}23)$$
Hence, our formula for the pixel-level residual-caused distortion is
$$D_u^k(r) = E[(\varepsilon_u^k)^2]\cdot P_u^k(r). \qquad (2\text{-}24)$$

2.3.2.2 Frame-level distortion caused by RCE

To derive the frame-level residual-caused distortion, the encoder would need to know the second moment of the RCE for every pixel in the frame. However, if the encoder knows the statistical characteristics of the residual process and the concealment method, the formula becomes much simpler. One simple concealment method is to let $\check e_u^k = 0$ for all erroneous pixels.

A more general concealment method is to use the neighboring pixels to conceal an erroneous pixel. So we make the following assumption.

Assumption 2. The residual $\hat e_u^k$ is stationary with respect to (w.r.t.) the 2D variable $u$ within the same frame. In addition, $\check e_u^k$ only depends on $\{\hat e_{u-v}^k : v \in N\}$, where $N$ is a fixed neighborhood of $u$.

In other words, Assumption 2 assumes that 1) $\hat e_u^k$ is a 2D stationary stochastic process, i.e., the distribution of $\hat e_u^k$ is the same for all $u \in V^k$, and 2) $\check e_u^k$ is also a 2D stationary stochastic process since it only depends on the neighboring residuals. Hence, $\hat e_u^k - \check e_u^k$ is also a 2D stationary stochastic process, and its second moment $E[(\hat e_u^k - \check e_u^k)^2] = E[(\varepsilon_u^k)^2]$ is the same for all $u \in V^k$. Therefore, we can drop $u$ from the notation and let $E[(\varepsilon^k)^2] = E[(\varepsilon_u^k)^2]$ for all $u \in V^k$.

Denote by $N_i^k(r)$ the number of pixels contained in the i-th residual packet of the k-th frame; by $P_i^k(r)$ the PEP of the i-th residual packet of the k-th frame; and by $N^k(r)$ the total number of residual packets of the k-th frame. Since, for all pixels in the same packet, the residual XEP is equal to the PEP of that packet, from (2-17) and (2-24) we have
$$D^k(r) = \frac{1}{|V^k|}\sum_{u\in V^k} E[(\varepsilon_u^k)^2]\,P_u^k(r) \qquad (2\text{-}25)$$
$$= \frac{1}{|V^k|}\sum_{u\in V^k} E[(\varepsilon^k)^2]\,P_u^k(r) \qquad (2\text{-}26)$$
$$\stackrel{(a)}{=} \frac{E[(\varepsilon^k)^2]}{|V^k|}\sum_{i=1}^{N^k(r)}\big(P_i^k(r)\,N_i^k(r)\big) \qquad (2\text{-}27)$$
$$\stackrel{(b)}{=} E[(\varepsilon^k)^2]\cdot \bar P^k(r), \qquad (2\text{-}28)$$
where (a) is due to $P_u^k(r) = P_i^k(r)$ for every pixel $u$ in the i-th residual packet, and (b) is due to the definition
$$\bar P^k(r) \triangleq \frac{1}{|V^k|}\sum_{i=1}^{N^k(r)}\big(P_i^k(r)\,N_i^k(r)\big). \qquad (2\text{-}29)$$
$\bar P^k(r)$ is a weighted average over the PEPs of all residual packets in the k-th frame, in which different packets may contain different numbers of pixels. Hence, our formula for the frame-level residual-caused distortion is
$$D^k(r) = E[(\varepsilon^k)^2]\cdot \bar P^k(r). \qquad (2\text{-}30)$$
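A small sketch of how (2-29) and (2-30) can be evaluated from per-packet PEPs is given below; the packet layout, PEPs, and second moment are made-up illustration values, not measurements from the dissertation. The same helper applies to the MV packets of the next subsection.

```python
# Sketch of D^k(r) = E[(eps^k)^2] * Pbar^k(r), with the weighted-average PEP of
# Eq. (2-29).  Packet sizes, PEPs and the residual second moment are illustrative.

def weighted_avg_pep(peps, pixels_per_packet, num_pixels_in_frame):
    """Pbar^k = (1/|V^k|) * sum_i PEP_i * N_i  -- Eq. (2-29)."""
    return sum(p * n for p, n in zip(peps, pixels_per_packet)) / num_pixels_in_frame

V = 352 * 288                               # pixels in a CIF frame
res_peps  = [0.02, 0.05, 0.01]              # PEP of each residual packet
res_sizes = [V // 3, V // 3, V - 2 * (V // 3)]
E_eps2 = 40.0                               # assumed E[(eps^k)^2] (residual concealment error power)

P_r = weighted_avg_pep(res_peps, res_sizes, V)
D_r = E_eps2 * P_r                          # Eq. (2-30)
print(P_r, D_r)
```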

2.3.3 Analysis of Distortion Caused by MVCE

Similar to the derivations in Section 2.3.2, in this subsection we derive the formula for the pixel-level MV-caused distortion $D_u^k(m)$ and the frame-level MV-caused distortion $D^k(m)$.

2.3.3.1 Pixel-level distortion caused by MVCE

Denote the MV error state for pixel $u^k$ by $S_u^k(m)$, and denote the concealed MV by $\check{mv}_u^k$ when $S_u^k(m) = 1$. Therefore, we have
$$\widetilde{mv}_u^k = \begin{cases} \check{mv}_u^k, & S_u^k(m) = 1 \\ mv_u^k, & S_u^k(m) = 0. \end{cases} \qquad (2\text{-}31)$$
Here, we use temporal error concealment [21] to conceal MV errors. Denote $\xi_u^k \triangleq \hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\check{mv}_u^k}^{k-1}$; we have $\tilde\xi_u^k = \xi_u^k\,S_u^k(m)$, where $\xi_u^k$ depends on the accuracy of the MV concealment and on the spatial correlation between the reference pixel and the concealed reference pixel at the encoder. Denote by $P_u^k(m)$ the MV XEP for pixel $u^k$, that is, $P_u^k(m) \triangleq P\{S_u^k(m) = 1\}$. We make the following assumption.

Assumption 3. $S_u^k(m)$ is independent of $\xi_u^k$.

Following the same derivation process as in Section 2.3.2.1, we obtain
$$D_u^k(m) = E[(\xi_u^k)^2]\cdot P_u^k(m). \qquad (2\text{-}32)$$
Note that $\xi_u^k$ depends on $mv_u^k$ and the MV concealment method, but does not depend on the channel condition. In most cases, given the concealment method, the statistics of $\xi_u^k$ can be easily obtained at the encoder. From the experiments, we observe that $\xi_u^k$ follows a zero-mean Laplacian distribution.

Note that in the H.264 specification there is no slice data partitioning for an instantaneous decoding refresh (IDR) frame [22]; so $S_u^k(r)$ and $S_u^k(m)$ are fully correlated in an IDR frame, that is, $S_u^k(r) = S_u^k(m)$ and hence $P_u^k(r) = P_u^k(m)$. This is also true for I-MBs, and for P-MBs without slice data partitioning. For P-MBs with slice data partitioning in H.264, $S_u^k(r)$ and $S_u^k(m)$ are partially correlated. In other words, if the packet of slice data partition A, which contains the MV information, is lost, the corresponding packet of slice data partition B, which contains the residual information, cannot be decoded even if it is correctly received, since there is no slice header in partition B. Therefore, the residual channel and the MV channel in Fig. 2-1 are actually correlated if the encoder follows the H.264 specification. In this chapter, we study transmission distortion in the more general case where $S_u^k(r)$ and $S_u^k(m)$ can be either independent or correlated. (To support this, we modify the H.264 reference code JM14.0 so that residual packets can be used by the decoder even when the corresponding MV packets are not correctly received, that is, $\hat e_u^k$ can be used to reconstruct $\tilde f_u^k$ even if $mv_u^k$ is not correctly received.)

2.3.3.2 Frame-level distortion caused by MVCE

To derive the frame-level MV-caused distortion, we make the following assumption.

Assumption 4. The second moment of $\xi_u^k$ is the same for all $u \in V^k$.

Under Assumption 4, we can drop $u$ from the notation and let $E[(\xi^k)^2] = E[(\xi_u^k)^2]$ for all $u \in V^k$. Denote by $N_i^k(m)$ the number of pixels contained in the i-th MV packet of the k-th frame; by $P_i^k(m)$ the PEP of the i-th MV packet of the k-th frame; and by $N^k(m)$ the total number of MV packets of the k-th frame. Following the same derivation process as in Section 2.3.2.2, we obtain the frame-level MV-caused distortion of the k-th frame as
$$D^k(m) = E[(\xi^k)^2]\cdot \bar P^k(m), \qquad (2\text{-}33)$$

where $\bar P^k(m) \triangleq \frac{1}{|V^k|}\sum_{i=1}^{N^k(m)}\big(P_i^k(m)\,N_i^k(m)\big)$ is a weighted average over the PEPs of all MV packets in the k-th frame, in which different packets may contain different numbers of pixels.

2.3.4 Analysis of Distortion Caused by Propagated Error Plus Clipping Noise

In this subsection, we derive the distortion caused by error propagation in a non-linear decoder with clipping. We first derive the pixel-level propagation-and-clipping-caused distortion $D_u^k(P)$, and then the frame-level propagation-and-clipping-caused distortion $D^k(P)$.

2.3.4.1 Pixel-level distortion caused by propagated error plus clipping noise

First, we analyze the pixel-level propagation-and-clipping-caused distortion $D_u^k(P)$ in P-MBs. $D_u^k(P)$ depends on the propagated error and the clipping noise; and the clipping noise depends on the propagated error, the RCE, and the MVCE. Hence, $D_u^k(P)$ depends on the propagated error, the RCE, and the MVCE. Let r, m, p denote the events that an RCE, an MV concealment error, and a propagated error occur, respectively, and let $\bar r, \bar m, \bar p$ denote the logical NOT of r, m, p (indicating no error). We use a triplet to denote a joint event of the three error types; e.g., {r, m, p} denotes the event that all three types of error occur, and $\tilde\Delta_u^k\{\bar r, \bar m, \bar p\}$ denotes the clipping noise when pixel $u^k$ experiences none of the three types of error. When only some of the error events are constrained, the notation is simplified following the rules of formal logic; for example, $\tilde\Delta_u^k\{\bar r, \bar m\}$ denotes the clipping noise under the condition that there is neither RCE nor MVCE for pixel $u^k$, while it is not specified whether the reference pixel is in error. Correspondingly, denote by $P_u^k\{\bar r, \bar m\}$ the probability of the event $\{\bar r, \bar m\}$, that is, $P_u^k\{\bar r, \bar m\} = P\{S_u^k(r) = 0 \text{ and } S_u^k(m) = 0\}$. From the definition of $P_u^k(r)$, the marginal probabilities are $P_u^k\{r\} = P_u^k(r)$ and $P_u^k\{\bar r\} = 1 - P_u^k(r)$; likewise, $P_u^k\{m\} = P_u^k(m)$ and $P_u^k\{\bar m\} = 1 - P_u^k(m)$.

Define $D_u^k(p) \triangleq E[(\zeta_{u+mv_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r, \bar m\})^2]$, and define $\alpha_u^k \triangleq \frac{D_u^k(p)}{D_{u+mv_u^k}^{k-1}}$, which is called the propagation factor for pixel $u^k$. The propagation factor $\alpha_u^k$ defined in this chapter is

different from the propagation factor [11], the leakage [8], and the attenuation factor [16], which model the effect of spatial filtering or intra update; our propagation factor $\alpha_u^k$ is also different from the fading factor [9], which models the effect of using only a fraction of the referenced pixels in the reference frame for motion prediction. Note that $D_u^k(p)$ is only a special case of $D_u^k(P)$, namely under the error event $\{\bar r, \bar m\}$ for pixel $u^k$. However, most existing models inappropriately use their propagation factor, obtained under the event $\{\bar r, \bar m\}$, in place of $D_u^k(P)$ for all other error events, without distinguishing the differences among them.

To calculate $E[(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k)^2]$ in (2-14), we need to analyze $\tilde\Delta_u^k$ under four error events for pixel $u^k$: 1) both the residual and the MV are erroneous, denoted $\tilde\Delta_u^k\{r,m\}$; 2) the residual is erroneous but the MV is correct, denoted $\tilde\Delta_u^k\{r,\bar m\}$; 3) the residual is correct but the MV is erroneous, denoted $\tilde\Delta_u^k\{\bar r,m\}$; and 4) both the residual and the MV are correct, denoted $\tilde\Delta_u^k\{\bar r,\bar m\}$. So,
$$\begin{aligned} D_u^k(P) ={}& P_u^k\{r,m\}\,E\big[(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{r,m\})^2\big] + P_u^k\{r,\bar m\}\,E\big[(\zeta_{u+mv_u^k}^{k-1} + \tilde\Delta_u^k\{r,\bar m\})^2\big] \\ &+ P_u^k\{\bar r,m\}\,E\big[(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,m\})^2\big] + P_u^k\{\bar r,\bar m\}\,E\big[(\zeta_{u+mv_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,\bar m\})^2\big]. \end{aligned} \qquad (2\text{-}34)$$
Note that the concealed pixel value is always within the range of the clipping function, that is, $\Gamma(\tilde f_{u+\widetilde{mv}_u^k}^{k-1} + \check e_u^k) = \tilde f_{u+\widetilde{mv}_u^k}^{k-1} + \check e_u^k$, so $\tilde\Delta_u^k\{r\} = 0$. Also note that if the MV channel is independent of the residual channel, we have $P_u^k\{r,m\} = P_u^k(r)\,P_u^k(m)$. However, as mentioned in Section 2.3.3.1, in the H.264 specification these two channels are correlated. In particular, $P_u^k\{\bar r,m\} = 0$ and $P_u^k\{\bar r,\bar m\} = P_u^k\{\bar r\}$ for P-MBs with slice data partitioning in H.264. In such a case, (2-34) simplifies to
$$D_u^k(P) = P_u^k\{r,m\}\,D_{u+\check{mv}_u^k}^{k-1} + P_u^k\{r,\bar m\}\,D_{u+mv_u^k}^{k-1} + P_u^k\{\bar r\}\,D_u^k(p). \qquad (2\text{-}35)$$
In the more general case where $P_u^k\{\bar r,m\} \ne 0$, Eq. (2-35) is still a good approximation. This is because, in that case,

we have $P_u^k\{\bar r,m\} \ll P_u^k\{\bar r,\bar m\}$ and $E[(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,m\})^2] \approx E[(\zeta_{u+mv_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,\bar m\})^2]$ under UEP; therefore, the last two terms in (2-34) are approximately equal to $P_u^k\{\bar r\}\,D_u^k(p)$.

Note that for a P-MB without slice data partitioning in H.264, we have $P_u^k\{r,\bar m\} = P_u^k\{\bar r,m\} = 0$, $P_u^k\{r,m\} = P_u^k\{r\} = P_u^k\{m\} = P_u^k$, and $P_u^k\{\bar r,\bar m\} = P_u^k\{\bar r\} = P_u^k\{\bar m\} = 1 - P_u^k$. Therefore, (2-35) can be further simplified to
$$D_u^k(P) = P_u^k\,D_{u+\check{mv}_u^k}^{k-1} + (1 - P_u^k)\,D_u^k(p). \qquad (2\text{-}36)$$
Also note that for an I-MB there is no transmission distortion if it is correctly received, that is, $D_u^k(p) = 0$. So (2-36) can be further simplified to
$$D_u^k(P) = P_u^k\,D_{u+\check{mv}_u^k}^{k-1}. \qquad (2\text{-}37)$$
Comparing (2-37) with (2-36), we see that an I-MB is a special case of a P-MB with $D_u^k(p) = 0$, that is, with propagation factor $\alpha_u^k = 0$ according to the definition. It is important to note that $D_u^k(P) > 0$ for an I-MB. In other words, an I-MB also contains distortion caused by the propagated error, since $P_u^k \ne 0$. However, the existing LTI models [8, 9] assume that there is no propagation-caused distortion for I-MBs, which under-estimates the transmission distortion.

In the remainder of this subsection, we derive the propagation factor $\alpha_u^k$ for P-MBs and prove some important properties of the clipping noise. To derive $\alpha_u^k$, we first give Lemma 1.

Lemma 1. Given the PMF of the random variable $\zeta_{u+mv_u^k}^{k-1}$ and the value of $\hat f_u^k$, $D_u^k(p)$ can be calculated at the encoder by $D_u^k(p) = E[\Phi^2(\zeta_{u+mv_u^k}^{k-1}, \hat f_u^k)]$, where $\Phi(x,y)$ is called the error reduction function and is defined by
$$\Phi(x,y) \triangleq y - \Gamma(y-x) = \begin{cases} y - \gamma_L, & y - x < \gamma_L \\ x, & \gamma_L \le y - x \le \gamma_H \\ y - \gamma_H, & y - x > \gamma_H. \end{cases} \qquad (2\text{-}38)$$
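A small illustration of the error reduction function just defined is sketched below (hypothetical Python, assuming the usual 8-bit bounds); the numerical check mirrors the property, used in the proof of Proposition 2, that $|\Phi(x,y)| \le |x|$ for $\gamma_L \le y \le \gamma_H$.

```python
# Sketch of Phi(x, y) = y - Gamma(y - x) of Eq. (2-38), where x is the propagated
# error and y the encoder-reconstructed pixel value.  The loop illustrates that
# clipping never enlarges the propagated error magnitude.

GAMMA_L, GAMMA_H = 0, 255

def gamma(x):
    return max(GAMMA_L, min(GAMMA_H, x))

def phi(x, y):
    return y - gamma(y - x)

import random
random.seed(0)
for _ in range(100000):
    y = random.uniform(GAMMA_L, GAMMA_H)    # valid reconstructed value
    x = random.uniform(-500, 500)           # propagated error, possibly large
    assert abs(phi(x, y)) <= abs(x) + 1e-9
print("Phi never increases the error magnitude")
```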

Lemma 1 is proved in Appendix A.1. In fact, we have found in our experiments that, under any error event, $\zeta_{u+mv_u^k}^{k-1}$ approximately follows a Laplacian distribution with zero mean. If we assume that $\zeta_{u+mv_u^k}^{k-1}$ follows a zero-mean Laplacian distribution, the calculation of $D_u^k(p)$ becomes simpler, since the only unknown parameter of the PMF of $\zeta_{u+mv_u^k}^{k-1}$ is its variance. Under this assumption, we have the following proposition.

Proposition 1. The propagation factor $\alpha$ for a propagated error with a zero-mean Laplacian distribution of variance $\sigma^2$ is given by
$$\alpha = 1 - \frac{1}{2}e^{-\frac{y-\gamma_L}{b}}\Big(\frac{y-\gamma_L}{b} + 1\Big) - \frac{1}{2}e^{-\frac{\gamma_H-y}{b}}\Big(\frac{\gamma_H-y}{b} + 1\Big), \qquad (2\text{-}39)$$
where $y$ is the reconstructed pixel value and $b = \frac{\sqrt 2}{2}\sigma$.

Proposition 1 is proved in Appendix A.2. In the zero-mean Laplacian case, $\alpha_u^k$ is a function only of $\hat f_u^k$ and the variance of $\zeta_{u+mv_u^k}^{k-1}$, which in this case equals $D_{u+mv_u^k}^{k-1}$. Since $D_{u+mv_u^k}^{k-1}$ has already been calculated during the prediction of the (k-1)-th frame transmission distortion, $D_u^k(p)$ can be calculated by $D_u^k(p) = \alpha_u^k\,D_{u+mv_u^k}^{k-1}$ via the definition of $\alpha_u^k$. Then $D_u^k(P)$ in (2-35) can be calculated recursively, since both $D_{u+\check{mv}_u^k}^{k-1}$ and $D_{u+mv_u^k}^{k-1}$ have already been calculated for the (k-1)-th frame. (2-39) is very important for designing a low-complexity algorithm to estimate the propagation-and-clipping-caused distortion in FTD, which will be presented in Chapter 3.

Next, we prove an important property of the non-linear clipping function in the following proposition.

Proposition 2. Clipping reduces the propagated error, that is, $D_u^k(p) \le D_{u+mv_u^k}^{k-1}$, or equivalently $\alpha_u^k \le 1$.

Proof. First, from Lemma 5, which is presented and proved in Appendix A.6, we have $\Phi^2(x,y) \le x^2$ for any $\gamma_L \le y \le \gamma_H$. In other words, the function $\Phi(x,y)$ reduces the energy of the propagated error; this is why we call it the error reduction function. With Lemma 1, it is then straightforward to show that, whatever the PMF of $\zeta_{u+mv_u^k}^{k-1}$ is, $E[\Phi^2(\zeta_{u+mv_u^k}^{k-1}, \hat f_u^k)] \le E[(\zeta_{u+mv_u^k}^{k-1})^2]$, that is, $D_u^k(p) \le D_{u+mv_u^k}^{k-1}$, which is equivalent to $\alpha_u^k \le 1$.
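The following sketch evaluates the closed form (2-39) and cross-checks it against a Monte Carlo simulation of clipping a zero-mean Laplacian propagated error; the example values of $y$ and $\sigma$ are arbitrary illustration inputs, not parameters prescribed by this chapter.

```python
# Sketch of Proposition 1: for a zero-mean Laplacian propagated error with
# variance sigma^2 (scale b = sigma/sqrt(2)), alpha(y) = E[Phi^2]/sigma^2.
import numpy as np

GAMMA_L, GAMMA_H = 0.0, 255.0

def alpha_closed_form(y, sigma):
    b = sigma / np.sqrt(2.0)
    t_lo = (y - GAMMA_L) / b
    t_hi = (GAMMA_H - y) / b
    return 1.0 - 0.5 * np.exp(-t_lo) * (t_lo + 1.0) - 0.5 * np.exp(-t_hi) * (t_hi + 1.0)

def alpha_monte_carlo(y, sigma, n=200000, seed=1):
    rng = np.random.default_rng(seed)
    zeta = rng.laplace(0.0, sigma / np.sqrt(2.0), n)      # propagated error samples
    phi = y - np.clip(y - zeta, GAMMA_L, GAMMA_H)         # clipped propagated error
    return np.mean(phi ** 2) / sigma ** 2

for y in (5.0, 128.0, 250.0):
    print(y, alpha_closed_form(y, sigma=20.0), alpha_monte_carlo(y, sigma=20.0))
```

The two columns agree closely, and both stay at or below 1, consistent with Proposition 2.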

[Figure 2-2. The effect of clipping noise on distortion propagation: SSE distortion versus frame index when only the third frame is received with error.]

Proposition 2 tells us that, if no new error is induced in the k-th frame, the transmission distortion decreases from the (k-1)-th frame to the k-th frame. Fig. 2-2 shows the experimental result of transmission distortion propagation for bus_cif.yuv, where transmission errors occur only in the third frame. In fact, if we consider the more general case where new errors may be induced in the k-th frame, we can still prove that $E[(\zeta_{u+\widetilde{mv}_u^k}^{k-1} + \tilde\Delta_u^k)^2] \le E[(\zeta_{u+\widetilde{mv}_u^k}^{k-1})^2]$ using the proof of the following corollary.

Corollary 1. The correlation coefficient between $\zeta_{u+\widetilde{mv}_u^k}^{k-1}$ and $\tilde\Delta_u^k$ is non-positive. Specifically, they are negatively correlated under the condition $\{\bar r, p\}$, and uncorrelated under all other conditions.

Corollary 1 is proved in Appendix A.8. This property is very important for designing a low-complexity algorithm to estimate the propagation-and-clipping-caused distortion in PTD, which will be presented in Chapter 3.

2.3.4.2 Frame-level distortion caused by propagated error plus clipping noise

In (2-35), $D_{u+\check{mv}_u^k}^{k-1} \ne D_{u+mv_u^k}^{k-1}$ due to the non-stationarity of the error process over space. However, both the average of $D_{u+\check{mv}_u^k}^{k-1}$ over all pixels in the (k-1)-th frame and

the average of $D_{u+mv_u^k}^{k-1}$ over all pixels in the (k-1)-th frame converge to $D^{k-1}$ due to the randomness of the MVs. The formula for the frame-level propagation-and-clipping-caused distortion is given in Lemma 2.

Lemma 2. The frame-level propagation-and-clipping-caused distortion in the k-th frame is
$$D^k(P) = D^{k-1}\,\bar P^k(r) + D^k(p)\,(1 - \bar P^k(r))\,(1 - \beta^k), \qquad (2\text{-}40)$$
where $D^k(p) \triangleq \frac{1}{|V^k|}\sum_{u\in V^k} D_u^k(p)$ and $\bar P^k(r)$ is defined in (2-29); $\beta^k$ is the percentage of I-MBs in the k-th frame; and $D^{k-1}$ is the transmission distortion of the (k-1)-th frame.

Lemma 2 is proved in Appendix A.3. Define the propagation factor for the k-th frame by $\bar\alpha^k \triangleq \frac{D^k(p)}{D^{k-1}}$; then we have $\bar\alpha^k = \frac{\sum_{u\in V^k}\alpha_u^k\,D_{u+mv_u^k}^{k-1}}{|V^k|\,D^{k-1}}$. Note that $D_{u+mv_u^k}^{k-1}$ may be different for different pixels in the (k-1)-th frame due to the non-stationarity of the error process over space. However, when the number of pixels in the (k-1)-th frame is sufficiently large, the average of $D_{u+mv_u^k}^{k-1}$ over all the pixels in the (k-1)-th frame converges to $D^{k-1}$. Therefore, we have $\bar\alpha^k = \frac{\sum_{u\in V^k}\alpha_u^k\,D_{u+mv_u^k}^{k-1}}{\sum_{u\in V^k} D_{u+mv_u^k}^{k-1}}$, which is a weighted average of $\alpha_u^k$ with the weights being $D_{u+mv_u^k}^{k-1}$. As a result, $D^k(p) \le D^{k-1}$. When the number of pixels in the (k-1)-th frame is small, $\frac{1}{|V^k|}\sum_{u\in V^k}\alpha_u^k\,D_{u+mv_u^k}^{k-1}$ may exceed $D^{k-1}$, although the probability of this is small, as observed in our experiments. However, most existing works directly use $D^k(P) = D^k(p)$ in predicting transmission distortion. This is another reason why the LTI models [8, 9] under-estimate the transmission distortion when there is no MV error. Details will be discussed in Section 2.4.2.
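For concreteness, a small sketch of the recursion implied by Lemma 2 follows, writing $D^k(p)$ as $\bar\alpha^k D^{k-1}$; the numbers are illustrative only.

```python
# Sketch of Eq. (2-40) with D^k(p) = alpha_bar^k * D^{k-1}:
# D^k(P) = D^{k-1} * Pbar^k(r) + alpha_bar^k * D^{k-1} * (1 - Pbar^k(r)) * (1 - beta^k)

def propagation_term(D_prev, pep_r, alpha_bar, beta):
    return D_prev * pep_r + alpha_bar * D_prev * (1.0 - pep_r) * (1.0 - beta)

D_prev = 120.0      # transmission distortion (MSE) of frame k-1, illustrative
print(propagation_term(D_prev, pep_r=0.03, alpha_bar=0.95, beta=0.05))
```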

From Proposition 1, we know that $\alpha_u^k$ is a function of $\hat f_u^k$. So $\bar\alpha^k$ depends on all samples of $\hat f_u^k$ in the k-th frame. Since the samples of $\hat f_u^k$ usually change from frame to frame due to variation of the video content, the propagation factor $\bar\alpha^k$ also varies from frame to frame, as observed in the experiments. Accurately estimating $\bar\alpha^k$ for each frame is therefore very important for instantaneous distortion estimation. However, existing models assume that the propagation factor is constant over all frames, which makes their distortion estimates inaccurate. We will discuss how to accurately estimate $\bar\alpha^k$ in real time in Chapter 3.

2.3.5 Analysis of Correlation Caused Distortion

In this subsection, we first derive the pixel-level correlation-caused distortion $D_u^k(c)$, and then the frame-level correlation-caused distortion $D^k(c)$.

2.3.5.1 Pixel-level correlation caused distortion

We analyze the correlation-caused distortion $D_u^k(c)$ at the decoder in four cases: i) for $\{\bar r,\bar m\}$, both $\tilde\varepsilon_u^k = 0$ and $\tilde\xi_u^k = 0$, so $D_u^k(c) = 0$; ii) for $\{r,\bar m\}$, $\tilde\xi_u^k = 0$ and $D_u^k(c) = 2E[\varepsilon_u^k(\zeta_{u+mv_u^k}^{k-1} + \tilde\Delta_u^k\{r,\bar m\})]$; iii) for $\{\bar r,m\}$, $\tilde\varepsilon_u^k = 0$ and $D_u^k(c) = 2E[\xi_u^k(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,m\})]$; iv) for $\{r,m\}$, $D_u^k(c) = 2E[\varepsilon_u^k\xi_u^k] + 2E[\varepsilon_u^k(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{r,m\})] + 2E[\xi_u^k(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{r,m\})]$. From Section 2.3.4.1, we know $\tilde\Delta_u^k\{r\} = 0$. So we obtain
$$\begin{aligned} D_u^k(c) ={}& P_u^k\{r,\bar m\}\cdot 2E\big[\varepsilon_u^k\,\zeta_{u+mv_u^k}^{k-1}\big] + P_u^k\{\bar r,m\}\cdot 2E\big[\xi_u^k\big(\zeta_{u+\check{mv}_u^k}^{k-1} + \tilde\Delta_u^k\{\bar r,m\}\big)\big] \\ &+ P_u^k\{r,m\}\cdot\big(2E[\varepsilon_u^k\,\xi_u^k] + 2E[\varepsilon_u^k\,\zeta_{u+\check{mv}_u^k}^{k-1}] + 2E[\xi_u^k\,\zeta_{u+\check{mv}_u^k}^{k-1}]\big). \end{aligned} \qquad (2\text{-}41)$$
In the experiments, we find that, along the trajectory of pixel $u^k$, 1) the residual $\hat e_u^k$ is approximately uncorrelated with the residuals $\hat e_v^i$ of all other frames, $i \ne k$, as shown in Fig. 2-3; and 2) the residual $\hat e_u^k$ is approximately uncorrelated with the MVCE $\xi_u^k$ of the corresponding pixel and with the MVCEs $\xi_v^i$ of all previous frames, $i < k$, as shown in Fig. 2-4. Based on the above observations, we further assume that, for any $i < k$, $\hat e_u^k$ is uncorrelated with $\hat e_v^i$ and $\xi_v^i$ when $v^i$ is not on the trajectory of pixel $u^k$, and make the following assumption.

Assumption 5. $\hat e_u^k$ is uncorrelated with $\xi_u^k$, and is uncorrelated with both $\hat e_v^i$ and $\xi_v^i$ for any $i < k$.

Since $\zeta_{u+mv_u^k}^{k-1}$ and $\zeta_{u+\check{mv}_u^k}^{k-1}$ are transmission reconstructed errors accumulated from all the frames before the k-th frame, $\varepsilon_u^k$ is uncorrelated with $\zeta_{u+mv_u^k}^{k-1}$ and $\zeta_{u+\check{mv}_u^k}^{k-1}$ due to

Assumption 5. Thus, (2-41) becomes
$$D_u^k(c) = 2P_u^k\{m\}\,E\big[\xi_u^k\,\zeta_{u+\check{mv}_u^k}^{k-1}\big] + 2P_u^k\{\bar r,m\}\,E\big[\xi_u^k\,\tilde\Delta_u^k\{\bar r,m\}\big]. \qquad (2\text{-}42)$$
However, we also observe that, along the trajectory of pixel $u^k$, 1) the residual $\hat e_u^k$ is correlated with the MVCEs $\xi_v^i$ of later frames, $i > k$, as seen in Fig. 2-4; and 2) the MVCE $\xi_u^k$ is highly correlated with the MVCEs $\xi_v^i$ of other frames in the trajectory, as shown in Fig. 2-5. This interesting phenomenon could be exploited by an error concealment algorithm and is a subject of our future study.

[Figure 2-3. Temporal correlation between the residuals in one trajectory (correlation coefficient versus frame indices).]

[Figure 2-4. Temporal correlation matrix between residual and MVCE in one trajectory (correlation coefficient versus residual frame index and MV frame index).]

[Figure 2-5. Temporal correlation matrix between MVCEs in one trajectory (correlation coefficient versus frame indices).]

As mentioned in Section 2.3.4.1, for P-MBs with slice data partitioning in H.264, $P_u^k\{\bar r,m\} = 0$. So, (2-42) becomes
$$D_u^k(c) = 2P_u^k\{m\}\,E\big[\xi_u^k\,\big(\hat f_{u+\check{mv}_u^k}^{k-1} - \tilde f_{u+\check{mv}_u^k}^{k-1}\big)\big]. \qquad (2\text{-}43)$$
Note that in the more general case where $P_u^k\{\bar r,m\} \ne 0$, Eq. (2-43) is still valid, since $\xi_u^k$ is almost uncorrelated with $\tilde\Delta_u^k\{\bar r,m\}$, as observed in the experiments. For I-MBs, and for P-MBs without slice data partitioning in H.264, since $P_u^k\{r,\bar m\} = P_u^k\{\bar r,m\} = 0$ and $P_u^k\{r,m\} = P_u^k\{r\} = P_u^k\{m\} = P_u^k$ as mentioned in Section 2.3.4.1, (2-41) can be simplified to
$$D_u^k(c) = P_u^k\,\big(2E[\varepsilon_u^k\,\xi_u^k] + 2E[\varepsilon_u^k\,\zeta_{u+\check{mv}_u^k}^{k-1}] + 2E[\xi_u^k\,\zeta_{u+\check{mv}_u^k}^{k-1}]\big). \qquad (2\text{-}44)$$
Under Assumption 5, (2-44) also reduces to (2-43). Define
$$\lambda_u^k \triangleq \frac{E\big[\xi_u^k\,\tilde f_{u+\check{mv}_u^k}^{k-1}\big]}{E\big[\xi_u^k\,\hat f_{u+\check{mv}_u^k}^{k-1}\big]};$$
$\lambda_u^k$ is a correlation ratio, that is, the ratio of the correlation between the MVCE and the concealed reference pixel value at the receiver to the correlation between the MVCE and the concealed reference pixel value at the transmitter. $\lambda_u^k$ quantifies the effect of the correlation between the MVCE and the propagated error on transmission

distortion. Since $\lambda_u^k$ is a stable statistic of the MVs, estimating $\lambda_u^k$ is much simpler and more accurate than estimating $E[\xi_u^k\,\tilde f_{u+\check{mv}_u^k}^{k-1}]$ directly, thereby resulting in a more accurate distortion estimate. The details of how to estimate $\lambda_u^k$ will be presented in Chapter 3.

[Figure 2-6. Comparison between the measured and estimated correlation coefficients $\rho$ between $\xi_u^k$ and $\hat f_{u+mv_u^k}^{k-1}$, and between $\xi_u^k$ and $\hat f_{u+\check{mv}_u^k}^{k-1}$, versus frame index.]

Although we do not know the exact value of $\lambda_u^k$ at the encoder, its range is
$$\prod_{i=1}^{k-1} P_{T(i)}^i\{\bar r,\bar m\} \;\le\; \lambda_u^k \;\le\; 1, \qquad (2\text{-}45)$$
where $T(i)$ is the pixel position in the i-th frame along the trajectory; for example, $T(k-1) = u + mv_u^k$ and $T(k-2) = v^{k-1} + mv_v^{k-1}$. The left inequality in (2-45) holds in the extreme case where any error along the trajectory causes $\xi_u^k$ and $\tilde f_{u+\check{mv}_u^k}^{k-1}$ to be uncorrelated, which is usually true for high-motion video. The right inequality in (2-45) holds in the other extreme case where no error along the trajectory affects the correlation between $\xi_u^k$ and $\tilde f_{u+\check{mv}_u^k}^{k-1}$, which is usually true for low-motion video. Using the definition of $\lambda_u^k$, (2-43) becomes
$$D_u^k(c) = 2P_u^k\{m\}\,(1 - \lambda_u^k)\,E\big[\xi_u^k\,\hat f_{u+\check{mv}_u^k}^{k-1}\big]. \qquad (2\text{-}46)$$
In our experiments, we observe an interesting phenomenon: $\xi_u^k$ is always positively correlated with $\hat f_{u+mv_u^k}^{k-1}$ and negatively correlated with $\hat f_{u+\check{mv}_u^k}^{k-1}$. This is theoretically

proved in Lemma 3 under Assumption 6, and it is also verified by our experiments, as shown in Fig. 2-6.

Assumption 6. $E[(\hat f_{u+\check{mv}_u^k}^{k-1})^2] = E[(\hat f_{u+mv_u^k}^{k-1})^2]$.

Assumption 6 is valid under the condition that the distance between $mv_u^k$ and $\check{mv}_u^k$ is small; this is also verified by our experiments.

Lemma 3. Under Assumption 6, $E[\xi_u^k\,\hat f_{u+\check{mv}_u^k}^{k-1}] = -\frac{E[(\xi_u^k)^2]}{2}$ and $E[\xi_u^k\,\hat f_{u+mv_u^k}^{k-1}] = \frac{E[(\xi_u^k)^2]}{2}$.

Lemma 3 is proved in Appendix A.4. Under Assumption 6, using Lemma 3, we further simplify (2-46) as
$$D_u^k(c) = (\lambda_u^k - 1)\,E[(\xi_u^k)^2]\,P_u^k(m). \qquad (2\text{-}47)$$
From (2-32), we know that $E[(\xi_u^k)^2]\,P_u^k(m)$ is exactly equal to $D_u^k(m)$. Therefore, (2-47) is further simplified to
$$D_u^k(c) = (\lambda_u^k - 1)\,D_u^k(m). \qquad (2\text{-}48)$$
As mentioned in Section 2.3.3.1, we observe in the experiments that $\xi_u^k$ follows a zero-mean Laplacian distribution. Denote by $\rho$ the correlation coefficient between $\xi_u^k$ and $\hat f_{u+\check{mv}_u^k}^{k-1}$. If we assume $E[\xi_u^k] = 0$, we have
$$\rho = \frac{E[\xi_u^k\,\hat f_{u+\check{mv}_u^k}^{k-1}] - E[\xi_u^k]\,E[\hat f_{u+\check{mv}_u^k}^{k-1}]}{\sigma_{\xi_u^k}\,\sigma_{\hat f}} = -\frac{\sigma_{\xi_u^k}}{2\sigma_{\hat f}}.$$
Similarly, it is easy to prove that the correlation coefficient between $\xi_u^k$ and $\hat f_{u+mv_u^k}^{k-1}$ is $\frac{\sigma_{\xi_u^k}}{2\sigma_{\hat f}}$. This agrees well with the experimental results shown in Fig. 2-6. Via the same derivation process, one can obtain the correlation coefficient between $\hat e_u^k$ and $\hat f_{u+mv_u^k}^{k-1}$, and between $\hat e_u^k$ and $\hat f_u^k$. One possible application of these correlation properties is error concealment with partial information available.
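A numerical illustration of Lemma 3 and of the correlation coefficients $\pm\sigma_{\xi}/(2\sigma_{\hat f})$ is sketched below. The synthetic Gaussian reference-pixel model is an assumption made only for this illustration; it is not the dissertation's data.

```python
# Illustrative check of Lemma 3: with xi = f_mv - f_cmv and E[f_mv^2] = E[f_cmv^2]
# (Assumption 6), E[xi * f_cmv] = -E[xi^2]/2 and E[xi * f_mv] = +E[xi^2]/2.
import numpy as np

rng = np.random.default_rng(7)
n = 500000
common = rng.normal(128.0, 30.0, n)          # shared content of the two reference pixels
f_mv  = common + rng.normal(0.0, 5.0, n)     # reference pixel pointed to by the true MV
f_cmv = common + rng.normal(0.0, 5.0, n)     # reference pixel pointed to by the concealed MV
xi = f_mv - f_cmv                            # MV concealment error

print(np.mean(xi * f_cmv), -np.mean(xi ** 2) / 2)    # approximately equal
print(np.mean(xi * f_mv),   np.mean(xi ** 2) / 2)    # approximately equal
print(np.corrcoef(xi, f_cmv)[0, 1], -np.std(xi) / (2 * np.std(f_cmv)))
```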

2.3.5.2 Frame-level correlation caused distortion

Denote by $V_i^k(m)$ the set of pixels in the i-th MV packet of the k-th frame. From (2-20), (2-47), and Assumption 4, we obtain
$$D^k(c) = \frac{E[(\xi^k)^2]}{|V^k|}\sum_{u\in V^k}(\lambda_u^k - 1)\,P_u^k(m) = \frac{E[(\xi^k)^2]}{|V^k|}\sum_{i=1}^{N^k(m)}\Big\{P_i^k(m)\sum_{u\in V_i^k(m)}(\lambda_u^k - 1)\Big\}. \qquad (2\text{-}49)$$
Define $\bar\lambda^k \triangleq \frac{1}{|V^k|}\sum_{u\in V^k}\lambda_u^k$; due to the randomness of $mv^{k-1}$, $\frac{1}{N_i^k(m)}\sum_{u\in V_i^k(m)}\lambda_u^k$ converges to $\bar\lambda^k$ for any packet that contains a sufficiently large number of pixels. By rearranging (2-49), we obtain
$$D^k(c) = \frac{E[(\xi^k)^2]}{|V^k|}\sum_{i=1}^{N^k(m)}\big\{P_i^k(m)\,N_i^k(m)\,(\bar\lambda^k - 1)\big\} = (\bar\lambda^k - 1)\,E[(\xi^k)^2]\,\bar P^k(m). \qquad (2\text{-}50)$$
From (2-33), we know that $E[(\xi^k)^2]\,\bar P^k(m)$ is exactly equal to $D^k(m)$. Therefore, (2-50) is further simplified to
$$D^k(c) = (\bar\lambda^k - 1)\,D^k(m). \qquad (2\text{-}51)$$

2.3.6 Summary

In Section 2.3.1, we decomposed the transmission distortion into four terms, and we derived a formula for each term in Sections 2.3.2 through 2.3.5. In this section, we combine the formulae for the four terms into a single formula.

2.3.6.1 Pixel-level transmission distortion

Theorem 2.1. Under single-reference prediction, the PTD of pixel $u^k$ is
$$D_u^k = D_u^k(r) + \lambda_u^k\,D_u^k(m) + P_u^k\{r,m\}\,D_{u+\check{mv}_u^k}^{k-1} + P_u^k\{r,\bar m\}\,D_{u+mv_u^k}^{k-1} + P_u^k\{\bar r\}\,\alpha_u^k\,D_{u+mv_u^k}^{k-1}. \qquad (2\text{-}52)$$

Proof. (2-52) is obtained by plugging (2-24), (2-32), (2-35), and (2-47) into (2-15).

Corollary 2. Under single-reference prediction and no slice data partitioning, (2-52) is simplified to
$$D_u^k = P_u^k\,\big(E[(\varepsilon_u^k)^2] + \lambda_u^k\,E[(\xi_u^k)^2] + D_{u+\check{mv}_u^k}^{k-1}\big) + (1 - P_u^k)\,\alpha_u^k\,D_{u+mv_u^k}^{k-1}. \qquad (2\text{-}53)$$

2.3.6.2 Frame-level transmission distortion

Theorem 2.2. Under single-reference prediction, the FTD of the k-th frame is
$$D^k = D^k(r) + \bar\lambda^k\,D^k(m) + \bar P^k(r)\,D^{k-1} + (1 - \bar P^k(r))\,D^k(p)\,(1 - \beta^k). \qquad (2\text{-}54)$$
Proof. (2-54) is obtained by plugging (2-30), (2-33), (2-40), and (2-51) into (2-16).

Corollary 3. Under single-reference prediction and no slice data partitioning, the FTD of the k-th frame is also given by (2-54). (The same formula applies to both cases because both the mean of $D_{u+\check{mv}_u^k}^{k-1}$ and the mean of $D_{u+mv_u^k}^{k-1}$ converge to $D^{k-1}$ when the number of pixels in the k-th frame is sufficiently large, as seen in Appendix A.3.)

2.4 Relationship between Theorem 2.2 and Existing Transmission Distortion Models

As mentioned previously, some existing works have addressed the problem of transmission distortion prediction and have proposed several different models [8], [9], [16], [11] to estimate transmission distortion. In this section, we identify the relationship between Theorem 2.2 and these models, and we specify the conditions under which those models are accurate. Note that, in order to demonstrate the effect of non-linear clipping on transmission distortion propagation, we disable intra update, that is, $\beta^k = 0$, in all the following cases.

2.4.1 Case 1: Only the (k-1)-th Frame Has Errors, and the Subsequent Frames Are All Correctly Received

In this case, the models proposed in Refs. [8]–[11] state that, when there is no intra coding and no spatial filtering, the propagated distortion stays the same for all frames after the (k-1)-th frame, i.e., $D^n(p) = D^{n-1}$ ($\forall n \ge k$). However, this is not true, as we proved in Proposition 2. Due to the clipping function, we have $\bar\alpha^n \le 1$ ($\forall n \ge k$), i.e., $D^n \le D^{n-1}$ ($\forall n \ge k$) whenever the n-th frame is error-free. In fact, from Appendix A.6 we know that equality holds only in the very special case that $\hat f_u^k - \gamma_H \le \zeta_{u+mv_u^k}^{k-1} \le \hat f_u^k - \gamma_L$ for every pixel $u \in V^k$.

2.4.2 Case 2: Burst Errors in Consecutive Frames

In Ref. [16], the authors observe that the transmission distortion caused by accumulated errors in consecutive frames is generally larger than the sum of the distortions caused by the individual frame errors. This is also observed in our experiments when there is no MV error. To explain this phenomenon, first consider the simple case where all residuals of the k-th frame are erroneous while all MVs of the k-th frame are correctly received. In this case, we obtain from (2-54) that $D^k = D^k(r) + \bar P^k(r)\,D^{k-1} + (1 - \bar P^k(r))\,D^k(p)$, which is larger than the simple sum $D^k(r) + D^k(p)$ used in the LTI model; the under-estimation of the LTI model is $D^k - (D^k(r) + D^k(p)) = (1 - \bar\alpha^k)\,\bar P^k(r)\,D^{k-1}$.

However, when the MVs are erroneous, the experimental result is quite different from what is claimed in Ref. [16], especially for high-motion video. In other words, the LTI model then over-estimates the distortion for a burst-error channel. In this case, the predicted transmission distortion is given by (2-54) in Theorem 2.2 as $D_1^k = D^k(r) + \bar\lambda^k D^k(m) + \bar P^k(r)\,D_1^{k-1} + (1 - \bar P^k(r))\,\bar\alpha^k\,D_1^{k-1}$, and by the LTI model as $D_2^k = D^k(r) + D^k(m) + \bar\alpha^k\,D_2^{k-1}$. So, the prediction difference between Theorem 2.2 and the

LTI model is
$$D_1^k - D_2^k = (1 - \bar\alpha^k)\,\bar P^k(r)\,D_1^{k-1} - (1 - \bar\lambda^k)\,\bar P^k(m)\,E[(\xi^k)^2] + \bar\alpha^k\,(D_1^{k-1} - D_2^{k-1}). \qquad (2\text{-}55)$$
At the beginning, $D_1^0 = D_2^0 = 0$, and $D_1^{k-1} \ll E[(\xi^k)^2]$ when k is small. Therefore, the transmission distortion caused by accumulated errors in consecutive frames will be smaller than the sum of the distortions caused by the individual frame errors, that is, $D_1^k < D_2^k$. We also see from (2-55) that, due to the propagation of the over-estimation $D_1^{k-1} - D_2^{k-1}$ from the (k-1)-th frame to the k-th frame, the accumulated difference between $D_1^k$ and $D_2^k$ becomes larger and larger as k increases.

2.4.3 Case 3: Modeling Transmission Distortion as the Output of an LTI System with PEP as Input

In Ref. [9], the authors propose an LTI transmission distortion model based on their experimental observations. This LTI model ignores the effect of the correlation between the newly induced error and the propagated error, that is, it sets $\bar\lambda^k = 1$. This is only valid for low-motion video. From (2-54), we obtain
$$D^k = D^k(r) + D^k(m) + \big(\bar P^k(r) + (1 - \bar P^k(r))\,\bar\alpha^k\big)\,D^{k-1}. \qquad (2\text{-}56)$$
Let $\eta^k = \bar P^k(r) + (1 - \bar P^k(r))\,\bar\alpha^k$. If 1) there is no slice data partitioning, i.e., $\bar P^k(m) = \bar P^k(r) = \bar P^k$, and 2) $P_i^k(r) = \bar P^k(r)$ (which means that one frame is transmitted in one packet, or that different packets experience the same channel condition), then (2-56) becomes $D^k = \{E[(\xi^k)^2] + E[(\varepsilon^k)^2]\}\,\bar P^k + \eta^k\,D^{k-1}$. Let $E^k \triangleq E[(\xi^k)^2] + E[(\varepsilon^k)^2]$. Then the recursion results in
$$D^k = \sum_{l=k-L}^{k}\Big[\Big(\prod_{i=l+1}^{k}\eta^i\Big)\,\big(E^l\,\bar P^l\big)\Big], \qquad (2\text{-}57)$$
where L is the time interval between the k-th frame and the latest correctly received frame.

Denote the system by an operator H that maps the error input sequence $\{\bar P^k\}$, as a function of the frame index k, to the distortion output sequence $\{D^k\}$. Since $D^k(p)$ is in general a nonlinear function of $D^{k-1}$, its ratio to $D^{k-1}$, $\bar\alpha^k$, is still a function of $D^{k-1}$. As a result, $\eta^k$ is a function of $D^{k-1}$. That means the operator H is non-linear, i.e., the system is non-linear. In addition, since $\bar\alpha^k$ varies from frame to frame, as mentioned in Section 2.3.4.2, the system is time-variant. In summary, H is in general a non-linear, time-variant system. The LTI model assumes that 1) the operator H is linear, that is, $H(a\,\bar P_1^k + b\,\bar P_2^k) = a\,H(\bar P_1^k) + b\,H(\bar P_2^k)$, which is valid only when $\eta^k$ does not depend on $D^{k-1}$; and 2) the operator H is time-invariant, that is, $D^{k+\delta} = H(\bar P^{k+\delta})$, which is valid only when $\eta^k$ is constant, i.e., when both $\bar P^k(r)$ and $\bar\alpha^k$ are constant. Under these two assumptions, we have $\eta^i = \eta$, and we obtain $\prod_{i=l+1}^{k}\eta^i = \eta^{\,k-l}$. Let $h[k] = \eta^k$, where $h[k]$ is the impulse response of the LTI model; then we obtain
$$D^k = \sum_{l=k-L}^{k}\big[h[k-l]\cdot(E^l\,\bar P^l)\big]. \qquad (2\text{-}58)$$
From Proposition 2, it is easy to prove that $0 \le \eta \le 1$; so $h[k]$ is a decreasing function of time. We see that (2-58) is a convolution of the error input sequence with the system impulse response. In fact, if we let $h[k] = e^{-\gamma k}$ with $\gamma = -\log\eta$, it is exactly the formula proposed in Ref. [9]. Note that (2-58) is a very special case of (2-54) with the following limitations: 1) the video content has to be low-motion; 2) there is no slice data partitioning, or all pixels in the same frame experience the same channel condition; 3) $\eta^k$ is constant, that is, both $\bar P^k(r)$ and the propagation factor $\bar\alpha^k$ are constant, which requires the probability distribution of the reconstructed pixel values to be the same in all frames. Note that the physical meaning of $\eta^k$ is not the actual propagation factor; it is just a notation for simplifying the formula.
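The equivalence between the recursion with constant $\eta$ and the convolution form (2-58) can be illustrated with a short sketch; the values of $\eta$, $E^k$, and $\bar P^k$ below are arbitrary illustration inputs.

```python
# Sketch of the LTI special case, Eqs. (2-57)/(2-58): with a constant eta, the
# frame-level distortion equals the convolution of the injected distortion
# E^l * P^l with the geometric impulse response h[k] = eta^k.
import numpy as np

eta = 0.9                                    # constant P(r) + (1 - P(r)) * alpha
E = np.full(30, 130.0)                       # E[(xi^k)^2] + E[(eps^k)^2] per frame
P = np.zeros(30); P[3] = 0.05                # packet loss only in frame 3

# direct recursion D^k = E^k * P^k + eta * D^{k-1}
D_rec = np.zeros_like(E)
for k in range(len(E)):
    D_rec[k] = E[k] * P[k] + (eta * D_rec[k - 1] if k > 0 else 0.0)

# equivalent convolution with h[k] = eta^k
h = eta ** np.arange(len(E))
D_conv = np.convolve(E * P, h)[:len(E)]
print(np.allclose(D_rec, D_conv))            # True
```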

2.5 PTD and FTD under Multi-Reference Prediction

The PTD and FTD formulae in Section 2.3 are for single-reference prediction. In this section, we extend the formulae to multi-reference prediction.

2.5.1 Pixel-level Distortion under Multi-Reference Prediction

If multiple frames are allowed to serve as references for motion estimation, the reconstructed pixel value at the encoder in (2-1) becomes
$$\hat f_u^k = \Gamma\big(\hat f_{u+mv_u^k}^{k-j} + \hat e_u^k\big). \qquad (2\text{-}59)$$
The reconstructed pixel value at the decoder in (2-3) changes in a slightly different way. If $mv_u^k$ is correctly received, then $\widetilde{mv}_u^k = mv_u^k$, $\tilde f_{u+\widetilde{mv}_u^k}^{k-j} = \tilde f_{u+mv_u^k}^{k-j}$, and
$$\tilde f_u^k = \Gamma\big(\tilde f_{u+\widetilde{mv}_u^k}^{k-j} + \tilde e_u^k\big). \qquad (2\text{-}60)$$
However, if $mv_u^k$ is received with error, the MV concealment is no different from the single-reference case, that is, $\widetilde{mv}_u^k = \check{mv}_u^k$ and $\tilde f_{u+\widetilde{mv}_u^k}^{k-j} = \tilde f_{u+\check{mv}_u^k}^{k-1}$. As a result, (2-12) becomes
$$\zeta_u^k = \big(\hat e_u^k + \hat f_{u+mv_u^k}^{k-j} - \hat\Delta_u^k\big) - \big(\tilde e_u^k + \tilde f_{u+\widetilde{mv}_u^k}^{k-j} - \tilde\Delta_u^k\big) = \big(\hat e_u^k - \tilde e_u^k\big) + \big(\hat f_{u+mv_u^k}^{k-j} - \hat f_{u+\widetilde{mv}_u^k}^{k-j}\big) + \big(\hat f_{u+\widetilde{mv}_u^k}^{k-j} - \tilde f_{u+\widetilde{mv}_u^k}^{k-j}\big) - \big(\hat\Delta_u^k - \tilde\Delta_u^k\big). \qquad (2\text{-}61)$$
Following the same derivation process from Section 2.3.1 to Section 2.3.5, the formulae for PTD under multi-reference prediction are the same as those under single-reference prediction except for the following changes: 1) the MVCE becomes $\xi_u^k \triangleq \hat f_{u+mv_u^k}^{k-j} - \hat f_{u+\check{mv}_u^k}^{k-1}$ and the clipping noise becomes $\tilde\Delta_u^k \triangleq (\tilde f_{u+\widetilde{mv}_u^k}^{k-j} + \tilde e_u^k) - \Gamma(\tilde f_{u+\widetilde{mv}_u^k}^{k-j} + \tilde e_u^k)$; 2) $D_u^k(m)$ and $D_u^k(c)$ are given by (2-32) and (2-48), respectively, with this new definition of $\xi_u^k$, and $D_u^k(p) \triangleq E[(\zeta_{u+mv_u^k}^{k-j} + \tilde\Delta_u^k\{\bar r,\bar m\})^2]$ with $\alpha_u^k \triangleq \frac{D_u^k(p)}{D_{u+mv_u^k}^{k-j}}$; 3)
$$D_u^k(P) = P_u^k\{r,m\}\,D_{u+\check{mv}_u^k}^{k-1} + P_u^k\{r,\bar m\}\,D_{u+mv_u^k}^{k-j} + P_u^k\{\bar r\}\,D_u^k(p), \qquad (2\text{-}62)$$

compared with (2-35). The generalization of the PTD formulae to multi-reference prediction is straightforward, since the multi-reference case simply has a larger set of candidate reference pixels than the single-reference case. Therefore, we have the following general theorem for PTD.

Theorem 2.3. Under multi-reference prediction, the PTD of pixel $u^k$ is
$$D_u^k = D_u^k(r) + \lambda_u^k\,D_u^k(m) + P_u^k\{r,m\}\,D_{u+\check{mv}_u^k}^{k-1} + P_u^k\{r,\bar m\}\,D_{u+mv_u^k}^{k-j} + P_u^k\{\bar r\}\,\alpha_u^k\,D_{u+mv_u^k}^{k-j}. \qquad (2\text{-}63)$$
Corollary 4. Under multi-reference prediction and no slice data partitioning, (2-63) is simplified to
$$D_u^k = P_u^k\,\big(E[(\varepsilon_u^k)^2] + \lambda_u^k\,E[(\xi_u^k)^2] + D_{u+\check{mv}_u^k}^{k-1}\big) + (1 - P_u^k)\,\alpha_u^k\,D_{u+mv_u^k}^{k-j}. \qquad (2\text{-}64)$$

2.5.2 Frame-level Distortion under Multi-Reference Prediction

Under multi-reference prediction, each block is typically allowed to choose its reference block independently; hence, different pixels in the same frame may have different reference frames. Define $V^k(j)$ as the set of pixels in the k-th frame whose reference pixels are in the (k-j)-th frame, where $j \in \{1, 2, \ldots, J\}$ and J is the number of reference frames. Obviously, $\bigcup_{j=1}^{J} V^k(j) = V^k$ and $\bigcap_{j=1}^{J} V^k(j) = \emptyset$. Define $w^k(j) \triangleq \frac{|V^k(j)|}{|V^k|}$. Note that $V^k$ and $V^k(j)$ have similar physical meanings but different cardinalities. $D^k(m)$ and $D^k(c)$ are given by (2-33) and (2-51), respectively, with the new definition $\xi^k(j) \triangleq \{\xi_u^k : \xi_u^k = \hat f_{u+mv_u^k}^{k-j} - \hat f_{u+\check{mv}_u^k}^{k-1},\; u \in V^k(j)\}$ and $\xi^k = \sum_{j=1}^{J} w^k(j)\,\xi^k(j)$. Define the propagation factor of $V^k(j)$ by $\bar\alpha^k(j) \triangleq \frac{\sum_{u\in V^k(j)}\alpha_u^k\,D_{u+mv_u^k}^{k-j}}{\sum_{u\in V^k(j)} D_{u+mv_u^k}^{k-j}}$. The following lemma gives the formula for $D^k(P)$.

Lemma 4. The frame-level propagation-and-clipping-caused distortion in the k-th frame for the multi-reference case is
$$D^k(P) = D^{k-1}\,\bar P^k\{r,m\} + \sum_{j=1}^{J}\big(\bar P^k(j)\{r,\bar m\}\,w^k(j)\,D^{k-j}\big) + (1 - \beta^k)\sum_{j=1}^{J}\big(\bar P^k(j)\{\bar r\}\,w^k(j)\,\bar\alpha^k(j)\,D^{k-j}\big), \qquad (2\text{-}65)$$
where $\beta^k$ is the percentage of I-MBs in the k-th frame; $\bar P^k(j)\{r,\bar m\}$ is the weighted average of the joint PEPs of the event $\{r,\bar m\}$ over the j-th sub-frame of the k-th frame; and $\bar P^k(j)\{\bar r\}$ is the weighted average of the PEPs of the event $\{\bar r\}$ over the j-th sub-frame of the k-th frame.

Lemma 4 is proved in Appendix A.5. With Lemma 4, we have the following general theorem for FTD.

Theorem 2.4. Under multi-reference prediction, the FTD of the k-th frame is
$$D^k = D^k(r) + \bar\lambda^k\,D^k(m) + D^{k-1}\,\bar P^k\{r,m\} + \sum_{j=1}^{J}\big(\bar P^k(j)\{r,\bar m\}\,w^k(j)\,D^{k-j}\big) + (1 - \beta^k)\sum_{j=1}^{J}\big(\bar P^k(j)\{\bar r\}\,w^k(j)\,\bar\alpha^k(j)\,D^{k-j}\big). \qquad (2\text{-}66)$$
Proof. (2-66) is obtained by plugging (2-30), (2-33), (2-65), and (2-51) into (2-16).

It is easy to prove that (2-54) in Theorem 2.2 is a special case of (2-66) with $J = 1$ and $w^k(1) = 1$. It is also easy to prove that (2-63) in Theorem 2.3 is a special case of (2-66) with $|V^k| = 1$.

Corollary 5. Under multi-reference prediction and no slice data partitioning, (2-66) is simplified to
$$D^k = D^k(r) + \bar\lambda^k\,D^k(m) + D^{k-1}\,\bar P^k\{r\} + (1 - \beta^k)\sum_{j=1}^{J}\big(\bar P^k(j)\{\bar r\}\,w^k(j)\,\bar\alpha^k(j)\,D^{k-j}\big). \qquad (2\text{-}67)$$
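A compact sketch of the propagation term of Lemma 4 follows; all reference-usage fractions, PEPs, propagation factors, and previous-frame distortions are made-up illustration values.

```python
# Sketch of Eq. (2-65): multi-reference propagation-and-clipping term.
# w[j] is the fraction of pixels referencing frame k-j-1 positions D_prev[j],
# i.e. the FTD of frame k-1-j for j = 0..J-1; other inputs are illustrative.

def multi_ref_propagation(D_km1, P_rm, w, D_prev, P_rmbar, P_rbar, alpha, beta):
    term1 = D_km1 * P_rm
    term2 = sum(P_rmbar[j] * w[j] * D_prev[j] for j in range(len(w)))
    term3 = (1.0 - beta) * sum(P_rbar[j] * w[j] * alpha[j] * D_prev[j] for j in range(len(w)))
    return term1 + term2 + term3

J = 3
w      = [0.6, 0.3, 0.1]           # usage of reference frames k-1, k-2, k-3
D_prev = [110.0, 95.0, 80.0]       # FTD of frames k-1, k-2, k-3
print(multi_ref_propagation(D_km1=110.0, P_rm=0.02, w=w, D_prev=D_prev,
                            P_rmbar=[0.0] * J, P_rbar=[0.97] * J,
                            alpha=[0.95] * J, beta=0.05))
```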

CHAPTER 3
PREDICTION OF TRANSMISSION DISTORTION FOR WIRELESS VIDEO COMMUNICATION: ALGORITHM AND APPLICATION

In this chapter, we design algorithms to estimate transmission distortion based on the analysis in Chapter 2. We also apply the algorithms to the rate-distortion optimized mode decision problem and achieve a remarkable performance gain over existing solutions.

3.1 A Literature Review on Estimation Algorithms of Transmission Distortion

Transmitting video over wireless channels with good quality, i.e., low end-to-end distortion, is particularly challenging since the received video is subject not only to quantization distortion but also to transmission distortion (i.e., video distortion caused by packet errors). The capability of predicting transmission distortion can assist in designing video encoding and transmission schemes that achieve maximum video quality or minimum end-to-end video distortion. In Chapter 2, we theoretically derived formulae for transmission distortion. In this chapter, we leverage those analytical results to design algorithms for estimating transmission distortion; we also develop an algorithm for estimating end-to-end distortion and apply it to prediction mode decision in an H.264 encoder.

To estimate frame-level transmission distortion (FTD), several linear-model-based algorithms [8-11] have been proposed. These algorithms use the sum of the newly induced distortion in the current frame and the propagated distortion from previous frames to estimate transmission distortion. The linear-model-based algorithms simplify the analysis of transmission distortion at the cost of prediction accuracy, by neglecting the correlation between the newly induced error and the propagated error. Liang et al. [16] extend the result in Ref. [8] by addressing the effect of correlation. However, they do not consider the effect of motion vector (MV) errors on transmission distortion, and their algorithm is not tested with high-motion video content. Under this condition, they claim that the LTI models [8, 9] under-estimate transmission distortion

due to positive correlation between two adjacent erroneous frames. In Chapter 2, we identified that the MV concealment error is negatively correlated with the propagated error, and that this correlation dominates all other types of correlation, especially for high-motion video. As long as MV transmission errors exist, the transmission distortion estimated by LTI models becomes over-estimated. In Chapter 2, we also quantified the effect of these correlations on transmission distortion through a system parameter called the correlation ratio.

On the other hand, none of the existing works analyzes the impact of clipping noise on transmission distortion. In Chapter 2, we proved that clipping noise reduces the propagated error and quantified its effect by another system parameter called the propagation factor. In this chapter, we design algorithms to estimate the correlation ratio and the propagation factor, which facilitates the design of a low-complexity algorithm, called the RMPC-FTD algorithm, for estimating frame-level transmission distortion. Experimental results demonstrate that our RMPC-FTD algorithm is more accurate and more robust than existing algorithms. Another advantage of our RMPC-FTD algorithm is that all parameters in the formula derived in Chapter 2 can be estimated using the instantaneous video frame statistics and channel conditions, which allows the frame statistics to be time-varying and the error processes to be non-stationary. In contrast, existing algorithms estimate their parameters using statistics averaged over multiple frames and assume these statistics do not change over time; their models all assume the error process is stationary. As a result, our RMPC-FTD algorithm is more suitable for real-time video communication.

For pixel-level transmission distortion (PTD), the estimation algorithm is similar to the FTD estimation algorithm, since the PTD formula is a special case of the FTD formula, as discussed in Chapter 2. However, in some existing video encoders, e.g., the H.264 reference code JM14.0 (jm/jm14.0.zip), motion estimation and prediction mode decision are

considered separately. Therefore, the MV and the corresponding residual are already known when the distortion is estimated for mode decision. In such a case, the PTD estimation algorithm can be simplified by using the known values of the MV and the corresponding residual instead of their statistics. In this chapter, we design a PTD estimation algorithm, called RMPC-PTD, for such a case; we also extend RMPC-PTD to estimate pixel-level end-to-end distortion (PEED). PEED estimation is important for designing optimal encoding and transmission schemes.

Some existing PEED estimation algorithms are proposed in Refs. [4, 5]. In Ref. [4], the recursive optimal per-pixel estimate (ROPE) algorithm is proposed to estimate the PEED by recursively calculating the first and second moments of the reconstructed pixel value. However, the ROPE algorithm neglects the significant effect of clipping noise on transmission distortion, resulting in inaccurate estimates. Furthermore, the ROPE algorithm requires intensive computation of correlation coefficients when pixel averaging operations (e.g., in the interpolation filter and the deblocking filter) are involved [23], which reduces its applicability in an H.264 video encoder. Stockhammer et al. [5] propose a distortion estimation algorithm that simulates K independent decoders at the encoder side during the encoding process and averages the distortions of these K decoders. This algorithm is based on the Law of Large Numbers (LLN), i.e., the estimated distortion asymptotically approaches the expected distortion as K goes to infinity. For this reason, we call the algorithm in Ref. [5] the LLN algorithm. However, for the LLN algorithm, the larger the number of simulated decoders, the higher the computational complexity and the larger the required memory. As a result, the LLN algorithm is not suitable for real-time video communication.

To enhance estimation accuracy, reduce complexity and improve extensibility, in this chapter we extend the RMPC-PTD algorithm to PEED estimation; the resulting algorithm is called RMPC-PEED. Compared to the ROPE algorithm, the RMPC-PEED algorithm is more accurate since the significant effect of clipping noise on transmission distortion is considered. Another advantage over the ROPE algorithm is that the RMPC-PEED

algorithm is much easier to extend to support averaging operations, e.g., the interpolation filter. Compared to the LLN algorithm, the computational complexity and memory requirement of the RMPC-PEED algorithm are much lower, and the estimated distortion has a smaller variance.

In existing video encoders, prediction mode decision chooses the best prediction mode in the sense of minimizing the rate-distortion (R-D) cost for each macroblock (MB) or sub-MB. This requires estimating the MB-level or sub-MB-level end-to-end distortion of the different prediction modes. In inter-prediction, the reference pixels of the same encoding block may belong to different blocks in the reference frame; therefore, PEED estimation is needed for calculating the R-D cost in prediction mode decision. In this chapter, we apply our RMPC-PEED algorithm to prediction mode decision in H.264; the resulting algorithm is called RMPC-MS. Experimental results show that, for prediction mode decision in an H.264 encoder, our RMPC-MS algorithm achieves an average PSNR gain of 1.44 dB over the ROPE algorithm for the foreman sequence under PEP = 5%, and an average PSNR gain of 0.89 dB over the LLN algorithm for the foreman sequence under PEP = 1%.

The rest of this chapter is organized as follows. Section 3.2 presents our algorithms for estimating FTD under two scenarios: one without acknowledgement feedback and one with acknowledgement feedback. In Section 3.3, we develop algorithms for estimating PTD. In Section 3.4, we extend our PTD estimation algorithm to PEED estimation. In Section 3.5, we apply our PEED estimation algorithm to prediction mode decision in an H.264 encoder and compare its complexity with that of existing algorithms. Section 5.5 shows the experimental results that demonstrate the accuracy and robustness of our distortion estimation algorithms and the superior R-D performance of our mode decision scheme over existing schemes.

3.2 Algorithms for Estimating FTD

In this section, we develop our algorithms for estimating FTD under two scenarios: one without acknowledgement feedback and one with acknowledgement feedback, presented in Sections 3.2.1 and 3.2.2, respectively.

3.2.1 FTD Estimation without Feedback Acknowledgement

Chapter 2 derives a formula for FTD under single-reference prediction, i.e.,
$$D^k = D^k(r) + D^k(m) + D^k(P) + D^k(c), \qquad (3\text{-}1)$$
where
$$D^k(r) = E[(\varepsilon^k)^2]\cdot\bar P^k(r); \qquad (3\text{-}2)$$
$$D^k(m) = E[(\xi^k)^2]\cdot\bar P^k(m); \qquad (3\text{-}3)$$
$$D^k(P) = \bar P^k(r)\,D^{k-1} + (1 - \beta^k)\,(1 - \bar P^k(r))\,\bar\alpha^k\,D^{k-1}; \qquad (3\text{-}4)$$
$$D^k(c) = (\bar\lambda^k - 1)\,D^k(m); \qquad (3\text{-}5)$$
$\varepsilon^k$ is the residual concealment error and $\bar P^k(r)$ is the weighted average PEP of all residual packets in the k-th frame; $\xi^k$ is the MV concealment error and $\bar P^k(m)$ is the weighted average PEP of all MV packets in the k-th frame; $\beta^k$ is the percentage of encoded I-MBs in the k-th frame; both the propagation factor $\bar\alpha^k$ and the correlation ratio $\bar\lambda^k$ depend on the video content, the channel condition and the codec structure, and are therefore called system parameters; and $D^{k-1}$ is the transmission distortion of the (k-1)-th frame, which can be calculated iteratively by (3-1). Next, Sections 3.2.1.1 through 3.2.1.4 present methods to estimate each of the four distortion terms in (3-1), respectively.
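Before detailing the per-term estimators, the following compact sketch shows how the estimated terms combine in the frame-level recursion (3-1)-(3-5); the parameter values and variable names are illustrative only and would in practice be supplied by the encoder and the channel estimator.

```python
# Compact sketch of the frame-level recursion (3-1)-(3-5).  Each call consumes the
# current frame's statistics, the weighted-average PEPs, the estimated propagation
# factor and correlation ratio, and the previous frame's distortion, and returns
# the predicted FTD D^k.

def predict_ftd(E_eps2, E_xi2, P_r, P_m, alpha_bar, lambda_bar, beta, D_prev):
    D_r = E_eps2 * P_r                                                 # (3-2)
    D_m = E_xi2 * P_m                                                  # (3-3)
    D_P = P_r * D_prev + (1 - beta) * (1 - P_r) * alpha_bar * D_prev   # (3-4)
    D_c = (lambda_bar - 1.0) * D_m                                     # (3-5)
    return D_r + D_m + D_P + D_c                                       # (3-1)

D = 0.0
for k in range(1, 31):           # 30 frames with constant illustrative inputs
    D = predict_ftd(E_eps2=40.0, E_xi2=90.0, P_r=0.03, P_m=0.03,
                    alpha_bar=0.95, lambda_bar=0.6, beta=0.05, D_prev=D)
print(D)
```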

3.2.1.1 Estimation of residual caused distortion

From the analysis in Chapter 2, $E[(\varepsilon^k)^2] = E[(\varepsilon_u^k)^2] = E[(\hat e_u^k - \check e_u^k)^2]$ for all $u$ in the k-th frame, where $\hat e_u^k$ is the transmitted residual for pixel $u^k$ and $\check e_u^k$ is the concealed residual for pixel $u^k$ at the decoder. $E[(\varepsilon^k)^2]$ can be estimated from the finite samples of $\varepsilon_u^k$ in the k-th frame, i.e., $\hat E[(\varepsilon^k)^2] = \frac{1}{|V^k|}\sum_{u\in V^k}(\hat e_u^k - \hat{\check e}_u^k)^2$, where $\hat{\check e}_u^k$ is the estimate of $\check e_u^k$. From the analysis in Chapter 2, $\bar P^k(r) = \frac{1}{|V^k|}\sum_{i=1}^{N^k(r)}(P_i^k(r)\,N_i^k(r))$, where $P_i^k(r)$ is the PEP of the i-th residual packet in the k-th frame, $N_i^k(r)$ is the number of pixels contained in the i-th residual packet of the k-th frame, and $N^k(r)$ is the number of residual packets in the k-th frame. $P_i^k(r)$ can be estimated from the channel state statistics. Denote the estimated PEP by $\hat P_i^k(r)$ for all $i \in \{1, 2, \ldots, N^k(r)\}$; then $\bar P^k(r)$ can be estimated by $\hat{\bar P}^k(r) = \frac{1}{|V^k|}\sum_{i=1}^{N^k(r)}(\hat P_i^k(r)\,N_i^k(r))$. As a result, $D^k(r)$ can be estimated by
$$\hat D^k(r) = \hat E[(\varepsilon^k)^2]\cdot\hat{\bar P}^k(r) = \frac{1}{|V^k|^2}\sum_{i=1}^{N^k(r)}\big(\hat P_i^k(r)\,N_i^k(r)\big)\cdot\sum_{u\in V^k}\big(\hat e_u^k - \hat{\check e}_u^k\big)^2. \qquad (3\text{-}6)$$
Next, we discuss how to 1) conceal $\hat e_u^k$ at the decoder, 2) estimate $\check e_u^k$ at the encoder, and 3) estimate $P_i^k(r)$ at the encoder.

Concealment of $\hat e_u^k$ at the decoder: At the decoder, if $\hat e_u^k$ is received with error while its neighboring pixels are correctly received, those neighboring pixels could be utilized to conceal $\hat e_u^k$. However, this is possible only if pixel $u^k$ lies on a slice boundary and the pixels on the other side of that boundary are correctly received. In H.264, most pixels of a slice do not lie on the slice boundary. Therefore, if one slice is lost, most of the pixels in that slice will be concealed without information from neighboring pixels. If the same method is used to conceal $\check e_u^k$ for all pixels, it is not difficult to prove that the minimum of $E[(\varepsilon_u^k)^2]$ is achieved when $\check e_u^k = E[\hat e_u^k]$. Note that when $\check e_u^k$ is concealed by $E[\hat e_u^k]$ at the decoder, $E[(\varepsilon_u^k)^2]$ is the variance of $\hat e_u^k$, that is, $E[(\varepsilon_u^k)^2] = \sigma^2_{\hat e_u^k}$. In our experiments, we find that the histogram of $\hat e_u^k$ in each frame approximately follows a Laplacian distribution with zero mean. As proved in Ref. [24], the variance of $e_u^k$ depends on the spatio-temporal correlation of the input video sequence and on the accuracy of motion estimation. Since $\hat e_u^k$ is a function of $e_u^k$, $E[(\varepsilon_u^k)^2]$ also depends on the accuracy of motion estimation. So, for a given video sequence, more accurate residual concealment and more accurate motion

estimation produce a smaller $D^k(r)$. This could serve as a criterion for the design of the encoding algorithm at the encoder and of the residual concealment method at the decoder.

Estimation of $\check e_u^k$ at the encoder: If the encoder has knowledge of the concealment method used at the decoder, as well as feedback acknowledgements for some packets, $\check e_u^k$ can be estimated by applying the same concealment methods as the decoder. That means the methods for estimating $\check e_u^k$ for pixels on slice boundaries differ from those for the other pixels. However, if there is no feedback acknowledgement of which packets are correctly received, the same method may be used to estimate $\check e_u^k$ for all pixels, that is, $\hat{\check e}_u^k = \frac{1}{|V^k|}\sum_{u\in V^k}\hat e_u^k$. Note that even if feedback acknowledgements for some packets are received before the estimation, the estimate obtained by this method at the encoder is still quite accurate, since most pixels of a slice do not lie on the slice boundary. In most cases, for a standard hybrid codec such as H.264, $\frac{1}{|V^k|}\sum_{u\in V^k}\hat e_u^k$ approximately equals zero for P-MBs and B-MBs (this is in fact an objective of predictive coding). Therefore, one simple concealment method is to let $\hat{\check e}_u^k = 0$, as in most transmission distortion models. In this chapter, we still use $\hat{\check e}_u^k$ in case $\frac{1}{|V^k|}\sum_{u\in V^k}\hat e_u^k \ne 0$ due to imperfect predictive coding, or in the general case where some feedback acknowledgements may have been received before the estimation. Note that when $\hat{\check e}_u^k = \frac{1}{|V^k|}\sum_{u\in V^k}\hat e_u^k$ at the encoder, $\hat E[(\varepsilon^k)^2]$ is the sample variance of $\hat e_u^k$ and is in fact a biased estimator of $\sigma^2_{\hat e_u^k}$ [25]. In other words, $\hat E[(\varepsilon^k)^2]$ is a sufficient statistic of all the individual samples $\hat e_u^k$. If the sufficient statistic $\hat E[(\varepsilon^k)^2]$ is known, the FTD estimator does not need the values of $\hat e_u^k$ of all pixels. Therefore, such an FTD estimator incurs much lower complexity than one that uses the values of $\hat e_u^k$ of all pixels.

Estimation of $P_i^k(r)$: In wired communication, the application-layer PEP is usually estimated by the Packet Error Rate (PER), which is the ratio of the number of incorrectly received packets to the number of transmitted packets, that is, $\hat P_i^k(r) = PER_i^k(r)$. In a wireless fading channel, the instantaneous physical-layer PEP is a function of the

instantaneous channel gain $g(t)$ at time $t$ [18], which we denote by $p(g(t))$. At the encoder, there are two cases: 1) the transmitter has perfect knowledge of $g(t)$, and 2) the transmitter has no knowledge of $g(t)$ but knows the probability density function (pdf) of $g(t)$. For Case 1, the estimated PEP is $\hat P_i^k(r) = p(g(t))$, since $g(t)$ is known. Note that since the channel gain is time-varying, the estimated instantaneous PEP is also time-varying (which implies that the pixel error process is non-stationary over both time and space). For Case 2, $p(g(t))$ is a random variable since only the pdf of $g(t)$ is known. Hence, we should use the expected value of $p(g(t))$ to estimate $P_i^k(r)$, that is, $\hat P_i^k(r) = E[p(g(t))]$, where the expectation is taken over the pdf of $g(t)$.

3.2.1.2 Estimation of MV caused distortion

From the analysis in Chapter 2, $E[(\xi^k)^2] = E[(\xi_u^k)^2] = E[(\hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\check{mv}_u^k}^{k-1})^2]$ for all $u$ in the k-th frame, where $mv_u^k$ is the transmitted MV for pixel $u^k$ and $\check{mv}_u^k$ is the concealed MV for pixel $u^k$ at the decoder. $E[(\xi^k)^2]$ can be estimated from the finite samples of $\xi_u^k$ in the k-th frame, i.e., $\hat E[(\xi^k)^2] = \frac{1}{|V^k|}\sum_{u\in V^k}(\hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\hat{\check{mv}}_u^k}^{k-1})^2$, where $\hat{\check{mv}}_u^k$ is the estimate of $\check{mv}_u^k$. Similar to Section 3.2.1.1, $\bar P^k(m) = \frac{1}{|V^k|}\sum_{i=1}^{N^k(m)}(P_i^k(m)\,N_i^k(m))$, where $P_i^k(m)$ is the PEP of the i-th MV packet in the k-th frame, $N_i^k(m)$ is the number of pixels contained in the i-th MV packet of the k-th frame, and $N^k(m)$ is the number of MV packets in the k-th frame. $P_i^k(m)$ can be estimated from the channel state statistics. Denote the estimated PEP by $\hat P_i^k(m)$ for all $i \in \{1, 2, \ldots, N^k(m)\}$; then $\bar P^k(m)$ can be estimated by $\hat{\bar P}^k(m) = \frac{1}{|V^k|}\sum_{i=1}^{N^k(m)}(\hat P_i^k(m)\,N_i^k(m))$. As a result, $D^k(m)$ can be estimated by
$$\hat D^k(m) = \hat E[(\xi^k)^2]\cdot\hat{\bar P}^k(m) = \frac{1}{|V^k|^2}\sum_{i=1}^{N^k(m)}\big(\hat P_i^k(m)\,N_i^k(m)\big)\cdot\sum_{u\in V^k}\big(\hat f_{u+mv_u^k}^{k-1} - \hat f_{u+\hat{\check{mv}}_u^k}^{k-1}\big)^2. \qquad (3\text{-}7)$$
Next, we discuss how to 1) conceal $mv_u^k$ at the decoder, 2) estimate $\check{mv}_u^k$ at the encoder, and 3) estimate $P_i^k(m)$ at the encoder.

Concealment of $mv_u^k$ at the decoder: Unlike residuals, MVs are highly correlated in both the temporal and spatial domains. Hence, the decoder may conceal an MV from a temporally neighboring block if its spatially neighboring blocks are not available. Depending on whether the neighboring blocks are correctly received, there may be several options of MV error concealment for each block, or for each pixel to make it more general. If the neighboring blocks are correctly received, $\check{mv}_u^k$ can be concealed by the median or the average of the MVs of those neighboring blocks. Interested readers may refer to Refs. [21], [26], [27] for discussions of different MV concealment methods. In our experiments, we also observe that the histogram of $\xi_u^k$ in one frame approximately follows a Laplacian distribution with zero mean. For different concealment methods, the variance of $\xi_u^k$ will be different: the more accurate the concealed motion estimation, the smaller $D^k(m)$.

Estimation of $\check{mv}_u^k$ at the encoder: If the encoder knows the concealment method of the current block and the PEP of the neighboring blocks, we can estimate the MV-caused distortion by assigning different probabilities to the different concealment methods at the encoder, as in Ref. [4]. However, if the encoder does not know which concealment method is used by the decoder, or no neighboring blocks can be utilized for error concealment (e.g., both temporal and spatial neighboring blocks are in error), a simple estimation algorithm [9], [10] is to let $\check{mv}_u^k = 0$, that is, to use the pixel value at the same position in the previous frame. In this chapter, we still use $\hat{\check{mv}}_u^k$ to denote the estimate of the concealed motion vector in the general case.

Estimation of $P_i^k(m)$: The estimation of $P_i^k(m)$ is similar to the estimation of $P_i^k(r)$. Note that in the H.264 specification, there is no slice data partitioning for an instantaneous decoding refresh (IDR) frame [22], so $P_i^k(r) = P_i^k(m)$ for all pixels in an IDR-frame. This is also true for I-MBs, and for P-MBs without slice data partitioning. For P-MBs with slice data partitioning in H.264, the error state of the residual and the error state of the MV of the same pixel are partially correlated. To be more specific, if the MV packet is lost, the

corresponding residual packet cannot be decoded even if it is correctly received, since there is no slice header in the residual packet. As a result, $P_i^k(r_{H.264}) = P_i^k(r) + (1-P_i^k(r))\,P_i^k(m)$.

Estimation of propagation- and clipping-caused distortion

To estimate $D^k(P)$, we only need to estimate $\alpha^k$, since $P^k(r)$ has already been estimated. In Chapter 2, we theoretically derive the propagation factor $\alpha_u^k$ of pixel $u$ for a propagated error with a zero-mean Laplacian distribution, i.e.,

$\alpha = 1 - \frac{1}{2}e^{-\frac{y-\gamma_L}{b}}\Big(\frac{y-\gamma_L}{b}+1\Big) - \frac{1}{2}e^{-\frac{\gamma_H-y}{b}}\Big(\frac{\gamma_H-y}{b}+1\Big),$   (3-8)

where $\gamma_L$ and $\gamma_H$ are the user-specified low and high thresholds, respectively; $y$ is the reconstructed pixel value; $b = \sigma/\sqrt{2}$; and $\sigma$ is the standard deviation of the propagated error. Here, we provide three methods to estimate the propagation factor $\alpha^k$, as below.

Estimation of $\alpha^k$ by $\alpha_u^k$: As defined in Chapter 2, $\alpha^k = \frac{\sum_{u\in V}\alpha_u^k\, D^{k-1}_{u+mv_u^k}}{\sum_{u\in V} D^{k-1}_{u+mv_u^k}}$. Therefore, we may first estimate $\alpha_u^k$ by (3-8) and then estimate $\alpha^k$ by its definition. However, this method requires computing the exponentiations and divisions in (3-8) for each pixel, and needs a large memory to store $\hat{D}^{k-1}_{u+mv_u^k}$ for all pixels in all reference frames.

Estimating the average of a function by the function of an average: If we estimate $\alpha^k$ directly from frame statistics instead of pixel values, both the computational complexity and the memory requirement are decreased by a factor equal to the number of pixels per frame. If only $\hat{D}^{k-1}$ instead of $\hat{D}^{k-1}_{u+mv_u^k}$ is stored in memory, we may simplify the estimation of $\alpha^k$ to $\hat{\alpha}^k = \frac{\sum_{u\in V}\hat{\alpha}_u^k\,\hat{D}^{k-1}}{|V|\,\hat{D}^{k-1}} = \frac{1}{|V|}\sum_{u\in V}\hat{\alpha}_u^k$. This is accurate if all packets in the same frame experience the same channel condition. We see from (3-8) that $\alpha_u^k$ is a function of the reconstructed pixel value $\hat{f}_u^k$ and the variance $\sigma^2$ of the propagated error, which is equal to $D^{k-1}_{u+mv_u^k}$ in this case. Denote $\alpha_u^k = g(\hat{f}_u^k, D^{k-1})$; then $\alpha^k = \frac{1}{|V|}\sum_{u\in V} g(\hat{f}_u^k, D^{k-1})$. One simple and intuitive method is to use the function of an average to estimate the average of the function, that is, $\hat{\alpha}^k = g\big(\frac{1}{|V|}\sum_{u\in V}\hat{f}_u^k,\, \hat{D}^{k-1}\big)$.
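To make the mapping in (3-8) concrete, the following is a minimal sketch of the propagation-factor computation, assuming 8-bit video so that $\gamma_L = 0$ and $\gamma_H = 255$ by default; the function name and default arguments are illustrative and not part of the reference implementation.

    // Minimal sketch of the propagation factor in Eq. (3-8) for a zero-mean
    // Laplacian propagated error.
    #include <cmath>

    double propagation_factor(double y, double sigma,
                              double gammaL = 0.0, double gammaH = 255.0) {
        if (sigma <= 0.0) return 1.0;            // no propagated error, nothing is clipped
        const double b = sigma / std::sqrt(2.0); // Laplacian scale: sigma^2 = 2*b^2
        const double dl = (y - gammaL) / b;      // normalized distance to the low clipping bound
        const double dh = (gammaH - y) / b;      // normalized distance to the high clipping bound
        return 1.0 - 0.5 * std::exp(-dl) * (dl + 1.0)
                   - 0.5 * std::exp(-dh) * (dh + 1.0);
    }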

Improving estimation accuracy by using the properties of (3-8): Although the above method dramatically reduces the estimation complexity and memory requirement, that simple approximation is only accurate if $\alpha_u^k$ is a linear function of $\hat{f}_u^k$. In other words, such an approximation causes underestimation for a convex function or overestimation for a concave function [28]. Although (3-8) is neither convex nor concave, it is interesting to see that 1) $\alpha_u^k$ is symmetric about $\hat{f}_u^k = \frac{\gamma_H+\gamma_L}{2}$; 2) $\alpha_u^k$ is a monotonically increasing function of $\hat{f}_u^k$ when $\gamma_L < \hat{f}_u^k < \frac{\gamma_H+\gamma_L}{2}$, and a monotonically decreasing function of $\hat{f}_u^k$ when $\frac{\gamma_H+\gamma_L}{2} < \hat{f}_u^k < \gamma_H$; 3) each half of the function is much more linear than the whole function. So, we propose to use $\frac{1}{|V|}\sum_{u\in V}\big|\hat{f}_u^k - \frac{\gamma_H+\gamma_L}{2}\big| + \frac{\gamma_H+\gamma_L}{2}$ instead of $\frac{1}{|V|}\sum_{u\in V}\hat{f}_u^k$ to estimate $\alpha^k$. Since the symmetry property is exploited, such an algorithm gives a much more accurate estimate $\hat{\alpha}^k$.

From the analysis in Chapter 2, we have $D^k(p) = \alpha^k\, D^{k-1}$; so we can estimate $D^k(p)$ by $\hat{D}^k(p) = \hat{\alpha}^k\,\hat{D}^{k-1}$. To compensate for the accuracy loss of using frame statistics, we may use the following algorithm to estimate $D^k(p)$ without the exponentiation and division for each pixel:

$\hat{D}^k(p) = \big(\hat{D}^{k-1} - \hat{D}^{k-1}(r) - \hat{D}^{k-1}(m)\big)\,\hat{\alpha}^k + \Phi^2(\varepsilon^{k-1}, \hat{f}^k) + \Phi^2(\xi^{k-1}, \hat{f}^k),$   (3-9)

where $\hat{D}^{k-1}(r)$ can be estimated by (3-6); $\hat{D}^{k-1}(m)$ can be estimated by (3-7); $\hat{\alpha}^k$ can be estimated by (3-8); $\Phi^2(\varepsilon^{k-1}, \hat{f}^k) = \frac{1}{|V|}\sum_{u\in V}\Phi^2(\varepsilon_u^{k-1}, \hat{f}_u^k)$ and $\Phi^2(\xi^{k-1}, \hat{f}^k) = \frac{1}{|V|}\sum_{u\in V}\Phi^2(\xi_u^{k-1}, \hat{f}_u^k)$, both of which can be easily calculated by

$\Phi(x, y) \triangleq y - \Gamma(y - x) = \begin{cases} y - \gamma_L, & y - x < \gamma_L \\ x, & \gamma_L \le y - x \le \gamma_H \\ y - \gamma_H, & y - x > \gamma_H. \end{cases}$   (3-10)
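A minimal sketch of the clipping map in (3-10) follows, again assuming 8-bit video ($\gamma_L = 0$, $\gamma_H = 255$); the names are illustrative. Here $x$ plays the role of an error sample, $y$ the reconstructed pixel value, and the inner clamp is the clipping operator $\Gamma(\cdot)$.

    // Minimal sketch of Eq. (3-10): Phi(x, y) = y - Gamma(y - x), i.e., the part
    // of the error x that survives the decoder's clipping to [gammaL, gammaH].
    double clip_value(double v, double gammaL = 0.0, double gammaH = 255.0) {
        return v < gammaL ? gammaL : (v > gammaH ? gammaH : v);   // Gamma(v)
    }

    double phi(double x, double y, double gammaL = 0.0, double gammaH = 255.0) {
        return y - clip_value(y - x, gammaL, gammaH);
    }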

Our experimental results in Section 5.5 show that the proposed algorithm provides an accurate estimate. Finally, it is straightforward to estimate $D^k(P)$ by

$\hat{D}^k(P) = \hat{P}^k(r)\cdot\hat{D}^{k-1} + (1-\beta^k)\cdot(1-\hat{P}^k(r))\cdot\hat{D}^k(p).$   (3-11)

Estimation of correlation-caused distortion

To estimate $D^k(c)$, the only parameter that remains to be estimated is $\lambda^k$, since $D^k(m)$ has already been estimated. As defined in Chapter 2, $\lambda^k = \frac{1}{|V|}\sum_{u\in V}\lambda_u^k$, where $\lambda_u^k = \frac{E[\xi_u^k\, \tilde{f}^{k-1}_{u+\check{mv}_u^k}]}{E[\xi_u^k\, \hat{f}^{k-1}_{u+\check{mv}_u^k}]}$. According to Chapter 2, $\lambda_u^k$ depends on the motion activity of the video content.

In our experiments, we find that $\lambda^k$ is small when the average MV length over the set $V$ in the $k$-th frame is larger than half of the block length, and $\lambda^k \approx 1$ when the average MV length in the $k$-th frame is smaller than half of the block length, or when the propagated error from the reference frames is small. An intuitive explanation for this phenomenon is as follows: 1) if the average MV length is large and the MV packets are received with error, most concealed reference pixels will be in a block different from the block where the corresponding true reference pixels are located; 2) if the average MV length is small, most concealed reference pixels and the corresponding true reference pixels will still be in the same block even if the MV packet is received with error; 3) since the correlation between two pixels inside the same block is much higher than the correlation between two pixels located in different blocks, $\lambda^k$ is small when the average MV length is large, and vice versa; 4) if there is no propagated error from the reference frames, it is easy to prove from the definition that $\lambda_u^k = 1$. Therefore, we propose a low-complexity algorithm that estimates $\lambda^k$ from video frame statistics:

$\hat{\lambda}^k = \begin{cases} \big(1-P^{k-1}(m)\big)\big(1-P^{k-1}(r)\big), & \overline{|mv^k|} > \frac{\text{block size}}{2} \\ 1, & \text{otherwise,} \end{cases}$   (3-12)

where $P^{k-1}(r)$ is defined in (3-4); $P^{k-1}(m)$ is defined in (3-5); $\overline{|mv^k|} = \frac{1}{|V|}\sum_{u\in V}|mv_u^k|$, and $|mv_u^k|$ is the length of $mv_u^k$. As a result,

$\hat{D}^k(c) = (\hat{\lambda}^k - 1)\cdot\hat{D}^k(m).$   (3-13)

Summary

Without feedback acknowledgement, the transmission distortion of the $k$-th frame can be estimated by

$\hat{D}^k = \hat{D}^k(r) + \hat{\lambda}^k\,\hat{D}^k(m) + \hat{P}^k(r)\,\hat{D}^{k-1} + (1-\beta^k)\,(1-\hat{P}^k(r))\,\hat{D}^k(p),$   (3-14)

where $\hat{D}^k(r)$ can be estimated by (3-6); $\hat{D}^k(m)$ can be estimated by (3-7); $\hat{D}^k(p)$ can be estimated by (3-9); $\hat{\lambda}^k$ can be estimated by (3-12); and $\hat{P}^k(r)$ can be estimated from the estimated PEP of all residual packets in the $k$-th frame, as discussed above. We call the resulting algorithm in (3-14) the RMPC-FTD algorithm.

FTD Estimation with Feedback Acknowledgement

In some wireless video communication systems, the receiver may send the transmitter a notification about whether packets are correctly received. This feedback acknowledgement mechanism can be utilized to improve the FTD estimation accuracy, as shown in Algorithm 1.

Algorithm 1. FTD estimation at the transmitter under feedback acknowledgement.
1) Input: $\hat{P}_i^k(r)$ and $\hat{P}_i^k(m)$ for all $i \in \{1, 2, \dots, N^k\}$.
2) Initialization and update. If $k = 1$, do initialization. If $k > 1$, update with feedback information.
   If there are acknowledgements for packets in the $(k-1)$-th frame,
      For $j = 1 : N^{k-1}$
         if an ACK for the $j$-th residual packet is received, update $\hat{P}_j^{k-1}(r) = 0$;
         if a NACK for the $j$-th residual packet is received, update $\hat{P}_j^{k-1}(r) = 1$;

         if an ACK for the $j$-th MV packet is received, update $\hat{P}_j^{k-1}(m) = 0$;
         if a NACK for the $j$-th MV packet is received, update $\hat{P}_j^{k-1}(m) = 1$.
      End
      Update $\hat{D}^{k-1}$.
   Else (neither ACK nor NACK is received), go to 3).
3) Estimate $D^k$ via $\hat{D}^k = \hat{D}^k(r) + \hat{\lambda}^k\,\hat{D}^k(m) + \hat{P}^k(r)\,\hat{D}^{k-1} + (1-\beta^k)\,(1-\hat{P}^k(r))\,\hat{D}^k(p)$, which is (3-14).
4) Output: $\hat{D}^k$.

Algorithm 1 has a low computational complexity, since $\hat{D}^{k-1}$ is updated based only on whether the packets in the $(k-1)$-th frame are correctly received. In the more general case where the encoder can tolerate a feedback delay of $d$ frames, we could update $\hat{D}^{k-1}$ based on the feedback acknowledgements for the $(k-d)$-th frame through the $(k-1)$-th frame. However, this requires extra memory for the encoder to store all the system parameters from the $(k-d)$-th frame to the $(k-1)$-th frame in order to update $\hat{D}^{k-1}$.

3.3 Pixel-level Transmission Distortion Estimation Algorithm

The PTD estimation algorithm is similar to the FTD estimation algorithm presented in Section 3.2. However, the values of some variables in the PTD formula derived in Chapter 2 may be known at the encoder. Taking pixel $u$ in the $k$-th frame as an example: before the prediction mode is selected, the best motion vector $mv_u^k$ of each prediction mode is known once motion estimation is done; hence the residual $\hat{e}_u^k$ and the reconstructed pixel value $\hat{f}_u^k$ of each mode are also known. In such a case, these known values can be used to replace the statistics of the corresponding random variables to simplify the PTD estimation. In this section, we discuss how to use the known values to improve the estimation accuracy and reduce the algorithm complexity.

3.3.1 Estimation of PTD

In this section, we consider the case with no data partitioning; hence, $P_u^k = P_u^k(r) = P_u^k(m)$. For the case with slice data partitioning, the derivation process is similar. From Chapter 2, we know that the PTD can be calculated by

$D_u^k = D_u^k(r) + D_u^k(m) + D_u^k(P) + D_u^k(c),$   (3-15)

where

$D_u^k(r) = E[(\varepsilon_u^k)^2]\cdot P_u^k(r);$   (3-16)

$D_u^k(m) = E[(\xi_u^k)^2]\cdot P_u^k(m);$   (3-17)

$D_u^k(P) = P_u^k\, D^{k-1}_{u+\check{mv}_u^k} + (1-P_u^k)\, D_u^k(p);$   (3-18)

$D_u^k(c) = P_u^k\,\big(2E[\varepsilon_u^k\,\xi_u^k] + 2E[\varepsilon_u^k\,\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}] + 2E[\xi_u^k\,\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]\big);$   (3-19)

where $D_u^k(p) \triangleq E[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2]$ for $j\in\{1,\dots,J\}$; $J$ is the number of previously encoded frames used for inter motion search; and $\Delta_u^k\{\bar{r},\bar{m}\}$ denotes the clipping noise under the error event that the packet is correctly received.

If the values of $mv_u^k$, $\hat{e}_u^k$, and $\hat{f}_u^k$ are known, then, given the error concealment at the encoder, the values of $\varepsilon_u^k = \hat{e}_u^k - \check{e}_u^k$ and $\xi_u^k = \hat{f}^{k-1}_{u+mv_u^k} - \hat{f}^{k-1}_{u+\check{mv}_u^k}$ are also known. Then $D_u^k(r) = (\varepsilon_u^k)^2\cdot P_u^k$, $D_u^k(m) = (\xi_u^k)^2\cdot P_u^k$, and $D_u^k(c) = P_u^k\,(2\varepsilon_u^k\xi_u^k + 2\varepsilon_u^k E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}] + 2\xi_u^k E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}])$. Hence, the formula for the PTD can be simplified to

$D_u^k = E[(\tilde{\zeta}_u^k)^2] = P_u^k\,\big((\varepsilon_u^k + \xi_u^k)^2 + 2(\varepsilon_u^k + \xi_u^k)\,E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}] + D^{k-1}_{u+\check{mv}_u^k}\big) + (1-P_u^k)\, D_u^k(p).$   (3-20)

Denote by $\hat{D}(\cdot)$ the estimate of $D(\cdot)$, and by $\hat{E}(\cdot)$ the estimate of $E(\cdot)$. Then $D_u^k$ can be estimated by $\hat{D}_u^k = \hat{P}_u^k\,\big((\varepsilon_u^k + \xi_u^k)^2 + 2(\varepsilon_u^k + \xi_u^k)\,\hat{E}[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}] + \hat{D}^{k-1}_{u+\check{mv}_u^k}\big) + (1-\hat{P}_u^k)\,\hat{D}_u^k(p)$, where $\hat{P}_u^k$ can be obtained by the PEP estimation algorithm in Section 3.2, and $\hat{D}^{k-1}_{u+\check{mv}_u^k}$ is the estimate from the $(k-1)$-th frame, stored for calculating

$\hat{D}_u^k$. Therefore, the only unknowns are $\hat{E}[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]$ and $\hat{D}_u^k(p)$, which can be calculated by the methods in the following two subsections.

Calculation of $\hat{E}[\tilde{\zeta}_u^k]$

Since $\hat{E}[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]$ from the $(k-1)$-th frame is required for calculating $\hat{D}_u^k$, we should estimate the first moment of $\tilde{\zeta}_u^k$ and store it for the subsequent frame. From Chapter 2, we know that $\tilde{\zeta}_u^k = \varepsilon_u^k + \xi_u^k + \tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}$ when the packet is received with error. For P-MBs, when the MV packet is correctly received, $\varepsilon_u^k = \xi_u^k = 0$ and $\tilde{\zeta}_u^k = \tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}$; when the MV packet is received with error, $\tilde{\zeta}^{k-j}_{u+mv_u^k} = \tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}$, and since the residual and the MV are in the same packet, $\Delta_u^k\{r, m\} = \Delta_u^k\{r\} = 0$ as proved in Chapter 2. Therefore, the first moment of $\tilde{\zeta}_u^k$ can be recursively calculated by

$E[\tilde{\zeta}_u^k] = P_u^k\,\big(\varepsilon_u^k + \xi_u^k + E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]\big) + (1-P_u^k)\, E[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}].$   (3-21)

Consequently, $E[\tilde{\zeta}_u^k]$ can be estimated by $\hat{E}[\tilde{\zeta}_u^k] = \hat{P}_u^k\,(\varepsilon_u^k + \xi_u^k + \hat{E}[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]) + (1-\hat{P}_u^k)\,\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$.

For I-MBs, when the packet is correctly received, $\tilde{\zeta}_u^k = 0$; when the MV packet is received with error, the result is the same as for P-MBs. Therefore, the first moment of $\tilde{\zeta}_u^k$ can be recursively calculated by

$E[\tilde{\zeta}_u^k] = P_u^k\,\big(\varepsilon_u^k + \xi_u^k + E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]\big),$   (3-22)

and $E[\tilde{\zeta}_u^k]$ can be estimated by $\hat{E}[\tilde{\zeta}_u^k] = \hat{P}_u^k\,(\varepsilon_u^k + \xi_u^k + \hat{E}[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}])$.

Calculation of $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$ and $\hat{D}_u^k(p)$

From Chapter 2, we know that for I-MBs, $D_u^k(p) = 0$; for P-MBs, $D_u^k(p) = \alpha_u^k\, D^{k-j}_{u+mv_u^k}$, and it can be estimated by $\hat{D}_u^k(p) = \hat{\alpha}_u^k\,\hat{D}^{k-j}_{u+mv_u^k}$, where $\hat{\alpha}_u^k$ is estimated by (3-8) with $y = \hat{f}_u^k$ and $\sigma^2 = \hat{D}^{k-j}_{u+mv_u^k}$. However, the resulting complexity is too high for prediction mode decision, since every pixel would require such a computation for each mode. To address this, we leverage the property proved in Proposition 3 to design a low-complexity and high-accuracy algorithm that recursively calculates $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$ and $\hat{D}_u^k(p)$ for P-MBs.

Proposition 3. Assume $\gamma_H = 255$ and $\gamma_L = 0$. The propagated error $\tilde{\zeta}^{k-j}_{u+mv_u^k}$ and the clipping noise $\Delta_u^k\{\bar{r},\bar{m}\}$ satisfy

$\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\} = \begin{cases} \hat{f}_u^k - 255, & \tilde{\zeta}^{k-j}_{u+mv_u^k} < \hat{f}_u^k - 255 \\ \hat{f}_u^k, & \tilde{\zeta}^{k-j}_{u+mv_u^k} > \hat{f}_u^k \\ \tilde{\zeta}^{k-j}_{u+mv_u^k}, & \text{otherwise.} \end{cases}$   (3-23)

Proposition 3 is proved in Appendix B.1. Using Proposition 3, $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$ in (4-1) and $\hat{D}_u^k(p)$ in (4-2) can be estimated under the following three cases.

Case 1: If $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] < \hat{f}_u^k - 255$, we have $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{f}_u^k - 255$, and $\hat{D}_u^k(p) = \hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2] = (\hat{f}_u^k - 255)^2$.

Case 2: If $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] > \hat{f}_u^k$, we have $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{f}_u^k$, and $\hat{D}_u^k(p) = (\hat{f}_u^k)^2$.

Case 3: If $\hat{f}_u^k - 255 \le \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] \le \hat{f}_u^k$, we have $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}]$, and $\hat{D}_u^k(p) = \hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k})^2]$.

Summary

The PTD can be recursively estimated by (4-2), together with (4-1) or (3-22); $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$ and $\hat{D}_u^k(p)$ can be calculated by the methods in the preceding subsection. The resulting algorithm is called the RMPC-PTD algorithm.

3.4 Pixel-level End-to-end Distortion Estimation Algorithm

The pixel-level end-to-end distortion (PEED) for each pixel in the $k$-th frame is defined by $D_u^{k,ete} \triangleq E[(f_u^k - \tilde{f}_u^k)^2]$, where $f_u^k$ is the input pixel value at the encoder and $\tilde{f}_u^k$ is the reconstructed pixel value at the decoder. Then we have

$D_u^{k,ete} = E[(f_u^k - \tilde{f}_u^k)^2] = E[(f_u^k - \hat{f}_u^k + \hat{f}_u^k - \tilde{f}_u^k)^2] = E[(f_u^k - \hat{f}_u^k + \tilde{\zeta}_u^k)^2] = (f_u^k - \hat{f}_u^k)^2 + E[(\tilde{\zeta}_u^k)^2] + 2(f_u^k - \hat{f}_u^k)\,E[\tilde{\zeta}_u^k].$   (3-24)
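The per-pixel combination in (3-24) is simple enough to be written out directly. The following minimal sketch assumes the two transmission-error moments have already been produced by the recursion above; the function and parameter names are illustrative only.

    // Minimal sketch of Eq. (3-24): per-pixel end-to-end distortion from the
    // quantization error and the first/second moments of the transmission error.
    double peed(double f_orig,        // f_u^k: input pixel value at the encoder
                double f_rec_enc,     // \hat{f}_u^k: reconstructed value at the encoder
                double zeta_mean,     // E[\tilde{\zeta}_u^k]
                double zeta_sq_mean)  // E[(\tilde{\zeta}_u^k)^2]
    {
        const double q = f_orig - f_rec_enc;               // quantization error
        return q * q + zeta_sq_mean + 2.0 * q * zeta_mean;
    }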

We call $f_u^k - \hat{f}_u^k$ the quantization error and $\tilde{\zeta}_u^k$ the transmission error. While $f_u^k - \hat{f}_u^k$ depends only on the quantization parameter (QP), $\tilde{\zeta}_u^k$ mainly depends on the PEP and the error concealment scheme. If the value of $\hat{f}_u^k$ is known, then the only unknowns in (3-24) are $E[(\tilde{\zeta}_u^k)^2]$ and $E[\tilde{\zeta}_u^k]$, which can be estimated by the methods in Section 3.3. We call the algorithm in (3-24) the RMPC-PEED algorithm.

Compared to the ROPE algorithm [4], which estimates the first and second moments of the reconstructed pixel value $\tilde{f}_u^k$, we have the following observations. First, the RMPC-PEED algorithm estimates the first and second moments of the reconstruction error $\tilde{\zeta}_u^k$; therefore, it is much easier to enhance RMPC-PEED to support the averaging operations in H.264, such as the interpolation filter. Second, estimating the two moments of $\tilde{\zeta}_u^k$ in RMPC-PEED produces a lower distortion estimation error than estimating both moments of $\tilde{f}_u^k$ in ROPE. Third, our experimental results show that ROPE may produce a negative value as the distortion estimate, which violates the requirement that the (true) distortion must be non-negative; the results also show that the negative estimate is caused by not considering clipping, which makes the ROPE estimate inaccurate.

Note that in Chapter 2 we assume the clipping noise at the encoder is zero, that is, $\hat{\Delta}_u^k = 0$. If we use $\hat{f}^{k-j}_{u+mv_u^k} + \hat{e}_u^k$ to replace $\hat{f}_u^k$ in (3-24), we may calculate the quantization error by $f_u^k - (\hat{f}^{k-j}_{u+mv_u^k} + \hat{e}_u^k)$ and the transmission error by

$\tilde{\zeta}_u^k = (\hat{f}^{k-j}_{u+mv_u^k} + \hat{e}_u^k) - \tilde{f}_u^k = (\hat{f}^{k-j}_{u+mv_u^k} + \hat{e}_u^k) - (\tilde{f}^{k-1}_{u+\widetilde{mv}_u^k} + \tilde{e}_u^k) = \varepsilon_u^k + \xi_u^k + \tilde{\zeta}^{k-1}_{u+\widetilde{mv}_u^k},$   (3-25)

which is exactly the formula for the transmission error decomposition in Chapter 2. Therefore, $\hat{\Delta}_u^k$ does not affect the end-to-end distortion $D_u^{k,ete}$ if we use $\hat{f}^{k-j}_{u+mv_u^k} + \hat{e}_u^k$ to replace $\hat{f}_u^k$ in calculating both the quantization error and the transmission error.
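Before applying the estimator to mode decision, the per-pixel recursion of Sections 3.3 and 3.4 can be summarized in code. The following is a minimal sketch of (4-1), (4-2), and the clipping cases of Proposition 3 for a P-MB pixel, assuming no slice data partitioning (so $P_u^k(r) = P_u^k(m) = P_u^k$); the structure and names are illustrative, not the reference implementation.

    // Minimal sketch of the per-pixel RMPC recursion: first and second moments of
    // the transmission error (Eqs. (4-1), (4-2)) with the clipping cases of
    // Proposition 3. Assumes a P-MB pixel whose residual and MV share one packet.
    struct ZetaMoments {
        double m1;  // E[zeta]
        double m2;  // E[zeta^2]
    };

    ZetaMoments update_pixel(double p,                // P_u^k: pixel error probability
                             double eps_plus_xi,      // epsilon_u^k + xi_u^k (known at the encoder)
                             double f_rec,            // \hat{f}_u^k (8-bit reconstructed value)
                             const ZetaMoments& conc, // moments at u + \check{mv}_u^k in frame k-1
                             const ZetaMoments& ref)  // moments at u + mv_u^k in frame k-j
    {
        // Proposition 3: clip the propagated error of the correctly-received branch.
        ZetaMoments prop = ref;
        if (ref.m1 < f_rec - 255.0) {                 // decoder value would exceed 255
            prop.m1 = f_rec - 255.0;
            prop.m2 = prop.m1 * prop.m1;
        } else if (ref.m1 > f_rec) {                  // decoder value would fall below 0
            prop.m1 = f_rec;
            prop.m2 = f_rec * f_rec;
        }

        ZetaMoments out;
        out.m1 = p * (eps_plus_xi + conc.m1) + (1.0 - p) * prop.m1;                      // Eq. (4-1)
        out.m2 = p * (eps_plus_xi * eps_plus_xi + 2.0 * eps_plus_xi * conc.m1 + conc.m2)
               + (1.0 - p) * prop.m2;                                                    // Eq. (4-2)
        return out;
    }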

3.5 Applying the RMPC-PEED Algorithm to H.264 Prediction Mode Decision

Rate-distortion Optimized Prediction Mode Decision

In the H.264 specification, there are two types of prediction modes, i.e., inter prediction and intra prediction (two other encoding modes for P-MBs are defined in H.264, i.e., the skip mode and the I_PCM mode; however, they are usually not involved in the PEED estimation process). In inter prediction, there are 7 modes, i.e., modes for 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 luma blocks. In intra prediction, there are 9 modes for 4x4 luma blocks and 4 modes for 16x16 luma blocks. Hence, there are a total of 7 + 9 + 4 = 20 modes to be selected in mode decision.

For each MB, our proposed Error-Resilient Rate-Distortion Optimized (ERRDO) mode decision consists of two steps. First, the R-D cost is calculated by

$J(\omega_m) = D^k_{ETE}(\omega_m) + \lambda\, R(\omega_m),$   (3-26)

where $D^k_{ETE} = \sum_{u\in V_i^k} D_u^{k,ete}$; $V_i^k$ is the set of pixels in the $i$-th MB (or sub-MB) of the $k$-th frame; $\omega_m$ is the prediction mode ($\omega_m \in \{1, 2, \dots, 20\}$); $R(\omega_m)$ is the encoded bit rate for mode $\omega_m$; and $\lambda$ is the preset Lagrange multiplier. Then, the optimal prediction mode that minimizes the rate-distortion (R-D) cost is found by

$\hat{\omega}_m = \arg\min_{\omega_m} \{J(\omega_m)\}.$   (3-27)

If $D^k_{ETE}(\omega_m)$ in (3-26) is replaced by the source coding distortion, or quantization distortion, we call it Source-Coding Rate-Distortion Optimized (SCRDO) mode decision. Using (3-26) and (3-27), we design Algorithm 2 for ERRDO mode decision in H.264; Algorithm 2 is called the RMPC-MS algorithm.

Algorithm 2. ERRDO mode decision for an MB in the $k$-th frame ($k \ge 1$).
1) Input: QP, PEP.
2) Initialization of $\hat{E}[\tilde{\zeta}^0_u]$ and $\hat{E}[(\tilde{\zeta}^0_u)^2]$ for all pixels.

3) For mode = 1 : 20 (9+4 intra, 7 inter),
   3a) If intra mode, calculate $\hat{E}[\tilde{\zeta}_u^k]$ by (3-22) for all pixels in the MB, go to 3b);
       Else if $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] < \hat{f}_u^k - 255$,
           $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{f}_u^k - 255$,
           $\hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2] = (\hat{f}_u^k - 255)^2$,
       Else if $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] > \hat{f}_u^k$,
           $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{f}_u^k$,
           $\hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2] = (\hat{f}_u^k)^2$,
       Else
           $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}]$,
           $\hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2] = \hat{E}[(\tilde{\zeta}^{k-j}_{u+mv_u^k})^2]$,
       End
       calculate $\hat{E}[\tilde{\zeta}_u^k]$ by (4-1) for all pixels in the MB;
   3b) calculate $\hat{D}_u^k$ by (4-2) for all pixels in the MB;
   3c) estimate $D_u^{k,ete}$ by (3-24) for all pixels in the MB;
   3d) calculate the R-D cost via (3-26) for each mode;
   End
4) Via (3-27), select the mode with the minimum R-D cost as the optimal mode for the MB.
5) Output: the best mode for the MB.

Using Theorem 3.1, we can design another ERRDO mode decision algorithm that produces the same solution as Algorithm 2, as Proposition 4 states.

Theorem 3.1 (Decomposition Theorem). If there is no slice data partitioning, the end-to-end distortion can be decomposed into a mode-dependent term and a mode-independent

term, i.e.,

$D_u^{k,ete}(\omega_m) = \bar{D}_u^{k,ete}(\omega_m) + C_u^k,$   (3-28)

where $C_u^k$ is independent of $\omega_m$ and

$\bar{D}_u^{k,ete}(\omega_m) = (1-P_u^k)\,\big\{(f_u^k - \hat{f}_u^k)^2 + D_u^k(p) + 2(f_u^k - \hat{f}_u^k)\,E[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]\big\}.$   (3-29)

Theorem 3.1 is proved in Appendix B.2.

Using Theorem 3.1, we only need to change two places in Algorithm 2 to obtain a new algorithm, which we call Algorithm A: first, replace Step 3c) in Algorithm 2 by "estimate $\bar{D}_u^{k,ete}$ by (3-29) for all pixels in the MB"; second, replace Step 3d) in Algorithm 2 by "calculate the R-D cost via $\bar{D}^k_{ETE}(\omega_m) + \lambda\, R(\omega_m)$ for each mode", where $\bar{D}^k_{ETE} = \sum_{u\in V_i^k} \bar{D}_u^{k,ete}$.

Proposition 4. If there is no slice data partitioning, Algorithm A and Algorithm 2 produce the same solution, i.e., $\hat{\omega}_m = \arg\min_{\omega_m}\{\hat{D}^k_{ETE}(\omega_m) + \lambda R(\omega_m)\} = \arg\min_{\omega_m}\{\hat{\bar{D}}^k_{ETE}(\omega_m) + \lambda R(\omega_m)\}$.

Proposition 4 is proved in Appendix B.3. Note that $\bar{D}^k_{ETE}$ in (3-29) is not exactly the end-to-end distortion, but the decomposition in (3-28) can help reduce the complexity of some estimation algorithms, for example, the LLN algorithm [29].

Complexity of the RMPC-MS, ROPE, and LLN Algorithms

In this subsection, we compare the complexity of the RMPC-MS algorithm with that of two popular mode decision algorithms, namely, the ROPE algorithm and the LLN algorithm, which are also based on pixel-level distortion estimation. To make a fair comparison, the same conditions should be used for all three algorithms. Assume all three algorithms use an error concealment scheme that conceals an erroneous pixel by the

pixel in the same position of the previous frame; then $\check{e}_u^k = 0$ and $\check{mv}_u^k = 0$; hence, $\varepsilon_u^k + \xi_u^k = \hat{f}_u^k - \hat{f}_u^{k-1}$. Here, the complexity is quantified by the number of additions (ADDs) and multiplications (MULs); minor operations, such as memory copies, shifts, and conditional statements, are neglected for all algorithms. If a subroutine (or the same set of operations) is invoked multiple times, it is counted only once, since the temporary result is saved in memory; for example, $\varepsilon_u^k + \xi_u^k$ in (4-2) and (4-1) is counted as one ADD. A subtraction is counted as an addition. We only consider pixel-level operations; block-level operations, for example MV additions, are neglected. We also ignore the complexity of basic operations whose cost is the same for all three algorithms, such as motion compensation.

RMPC-MS algorithm

Let us first consider the complexity of the RMPC-MS algorithm, i.e., Algorithm 2, for inter modes. In Algorithm 2, the worst case is $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] < \hat{f}_u^k - 255$. In this case, there is one ADD and one square, i.e., one MUL. The other two cases require only two copy operations and so are neglected. Note that $\hat{f}_u^k - 255 \le \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}]$ with high probability, that is, $\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] < \hat{f}_u^k - 255$ is relatively rare. Therefore, in most cases there are only two copy operations in the loop. Calculating the second moment of $\tilde{\zeta}_u^k$ needs 4 ADDs and 4 MULs as in (4-2). Similarly, calculating the first moment of $\tilde{\zeta}_u^k$ needs 2 ADDs and 2 MULs as in (4-1). Finally, calculating the end-to-end distortion needs 3 ADDs and 2 MULs as in (3-24). Hence, in the worst case, calculating the end-to-end distortion for each pixel takes 10 ADDs and 9 MULs; in most cases, the complexity is 9 ADDs and 8 MULs for inter modes, as shown in Table 3-1. Note that since $P_u^k$ is the same for all pixels in one MB, we do not need to calculate $1 - P_u^k$ for each pixel. Multiplying by 2 can be achieved by a shift operation, so it is not counted as a MUL.

For intra modes, we know that $\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\} = 0$ from Chapter 2. Therefore, the complexity of an intra mode is reduced to 3 ADDs and 3 MULs in (4-2), and 1 ADD and 1 MUL in (3-22). As a result, the end-to-end distortion for each pixel takes 7 ADDs and 6 MULs for each intra mode. In H.264, there are 7 inter modes and 13 intra modes; therefore, there are a total of 154 ADDs and 134 MULs per pixel in most cases. In the worst case, there are a total of 161 ADDs and 141 MULs per pixel, where the additional computation comes from the consideration of the clipping effect.

Memory requirement analysis: To estimate the end-to-end distortion by Algorithm 2, the first moment and the second moment of the reconstruction error of the best mode must be stored after the mode decision. Therefore, 2 units of memory are required to store those two moments for each pixel. Note that the first moment takes values in $\{-255, -254, \dots, 255\}$, i.e., 8 bits plus 1 sign bit per pixel, and the second moment takes values in $\{0, 1, \dots, 255^2\}$, i.e., 16 bits per pixel.

ROPE algorithm

In the ROPE algorithm, the moment estimation formulae for inter prediction and intra prediction are different. For inter modes, calculating the first moment needs 2 ADDs and 2 MULs; calculating the second moment needs 3 ADDs and 4 MULs; calculating the end-to-end distortion needs 2 ADDs and 2 MULs. For intra modes, calculating the first moment needs 1 ADD and 2 MULs; calculating the second moment needs 1 ADD and 3 MULs. Hence, an inter mode needs 7 ADDs and 8 MULs, and an intra mode needs 4 ADDs and 7 MULs. For H.264, the total complexity for each pixel is 101 ADDs and 147 MULs.

Note that when we implemented ROPE in JM16.0, we found that the ROPE algorithm produces out-of-range values for both the first moment and the second moment due to the neglect of clipping noise. Experimental results show that ROPE may produce a negative value as the distortion estimate, which violates the requirement that the (true) distortion

must be non-negative. Hence, in a practical system the estimated result from the ROPE algorithm needs to be clipped into a legitimate value; this will incur a higher complexity.

Memory requirement analysis: To estimate the end-to-end distortion by the ROPE algorithm, the first moment and the second moment of the reconstructed pixel value of the best mode must be stored after the mode decision. Therefore, 2 units of memory are required to store the two moments for each pixel. The first moment takes values in $\{0, 1, \dots, 255\}$, i.e., 8 bits per pixel; the second moment takes values in $\{0, 1, \dots, 255^2\}$, i.e., 16 bits per pixel. Note that in the original ROPE algorithm [4], the values of the two moments are not bounded, since the propagated errors are never clipped.

LLN algorithm

In JM16.0, the LLN algorithm uses the same decomposition method as Theorem 3.1 for mode decision [29]. In such a case, for inter modes, reconstructing the pixel value in one simulated decoder needs 1 ADD, and calculating the end-to-end distortion needs 1 ADD and 1 MUL. For intra modes, there is no additional reconstruction in the simulated decoders since the newly induced errors are not considered; therefore, calculating the end-to-end distortion needs 1 ADD and 1 MUL. Suppose the number of simulated decoders at the encoder is $N_d$; the complexity of the LLN algorithm is then $27N_d$ ADDs and $20N_d$ MULs. The default number of simulated decoders in JM16.0 is 30, which means the complexity of the LLN algorithm is 810 ADDs and 600 MULs. Thirty simulated decoders are suggested in Ref. [6]. In our experiments, we find that if the number of simulated decoders is less than 30, the estimated distortion exhibits a high degree of randomness (i.e., it has a large variance); however, if the number of simulated decoders is larger than 50, the estimated distortion is quite stable (i.e., it has a small variance).

Note that the error concealment operations in the LLN algorithm are required but are not counted in the complexity, since they are done after the mode decision. However, even without counting the extra error concealment operations, the complexity of the LLN algorithm is still much higher than that of RMPC-MS and ROPE. Increasing the number of

simulated decoders at the encoder can improve the estimation accuracy, but at the cost of a linear increase in computational complexity.

Memory requirement analysis: To estimate the end-to-end distortion by the LLN algorithm, for each simulated decoder, the reconstructed pixel value of the best mode must be stored after the mode decision. Therefore, the encoder needs $N_d$ units of memory to store the reconstructed pixel values. A reconstructed pixel takes values in $\{0, 1, \dots, 255\}$, i.e., $8N_d$ bits per pixel.

Table 3-1 shows the complexity of the three algorithms.

Table 3-1. Complexity comparison
Algorithm   Mode         Computational complexity                                  Memory requirement
RMPC-MS     inter mode   9 ADDs, 8 MULs (worst case 10 ADDs, 9 MULs)               25 bits/pixel
            intra mode   7 ADDs, 6 MULs
            total        154 ADDs, 134 MULs (worst case 161 ADDs, 141 MULs)
ROPE        inter mode   7 ADDs, 8 MULs (more with clipping)                       24 bits/pixel
            intra mode   4 ADDs, 7 MULs
            total        101 ADDs, 147 MULs (more with clipping)
LLN         inter mode   2N_d ADDs, N_d MULs (more with error concealment)         8N_d bits/pixel
            intra mode   N_d ADDs, N_d MULs
            total        27N_d ADDs, 20N_d MULs (more with error concealment)

3.6 Experimental Results

In Section 3.6.1, we compare the estimation accuracy of the RMPC-FTD algorithm with that of existing models under different channel conditions; we also compare their robustness against imperfect estimates of PEP. In Section 3.6.2, we compare the R-D

performance of RMPC-MS and existing mode decision algorithms for H.264; we also compare them with the interpolation filter and the deblocking filter enabled.

To collect the statistics and test the algorithms, all possible channel conditions should be tested for every video sequence. However, estimating transmission distortion and collecting the statistics of each video sequence under all possible error events are tedious tasks, since only a command-line interface or a configuration file is available in current open-source H.264 reference code, such as JM and x264. To analyze the statistics and verify our algorithm, our lab has developed a software tool, called the Video Distortion Analysis Tool (VDAT), which provides a friendly Graphical User Interface (GUI). VDAT implements a channel simulator, supports different video codecs, computes the statistics, and supports several distortion estimation algorithms (interested readers can download the VDAT source code at: zhifeng/project/vdat/index.htm). VDAT is used in all the experiments in this section.

3.6.1 Estimation Accuracy and Robustness

In this section, we use Algorithm 1 to estimate FTD and compare it with Stuhlmuller's model [8] and Dani's model [9]. To evaluate estimation accuracy, we compare the estimated distortion of the different algorithms against the true distortion over 50 frames under the case of no acknowledgement feedback.

Experiment setup

To implement the estimation algorithms, all transmission-distortion-related statistics should be collected for all random variables, such as the residual, motion vector, reconstructed pixel value, residual concealment error, MV concealment error, propagated error, and clipping noise. All such statistics are collected from the video codec

JM14.0 (jm/jm14.0.zip). All tested video sequences are in CIF format, and each frame is divided into three slices. To support slice data partitioning, we use the extended profile as defined in the H.264 specification, Annex A [22]. To provide unequal error protection (UEP), we let MV packets experience lower PEP than residual packets. The first frame of each coded video sequence is an I-frame, and the subsequent frames are all P-frames. In the experiment, we let the first I-frame be correctly received, and all the following P-frames go through an error-prone channel with controllable PEP. We set QP=28 for all frames. Each video sequence is tested under several channel conditions with UEP. Due to the space limit, we only present the experimental results for the video sequences foreman and stefan; experimental results for other video sequences can be found online. For each sequence, two wireless channel conditions are tested: for the good channel condition, the residual PEP is 2% and the MV PEP is 1%; for the poor channel condition, the residual PEP is 10% and the MV PEP is 5%. For each PEP setting of each frame, we run 600 simulations and take the average to mitigate the effect of the randomness of the simulated channels on the instantaneous distortion.

90 distortion is small for low motion seqence nder low PEP as proved in Chapter 2, linear model is qite accrate as shown in Fig. 3-1(a). Notice that in foreman seqence nder good channel, the estimated distortion different from grond trth is only abot MSE = 12 after accmlated with 50 frames withot feedback. In Ref. [9], athors claim that the larger the fraction of pixels in the reference frame to be sed as reference pixels, the larger the transmission errors propagated from the reference frame. However, de to randomness of motion vectors, the probability that a pixel with error is sed as reference is the same as the probability that a pixel withot error is sed as reference. Therefore, the nmber of pixels in the reference frame being sed for motion prediction has nothing to do with the fading factor. As a reslt, the algorithm in Ref. [9] nder-estimates transmission distortion as shown in Fig. 3-1 and Fig Accmlated MSE distortion Residal PEP: 2%, MV PEP: 1% for foreman_cif Tre distortion RMPC algorithm Sthlmller s model Dani s model Accmlated MSE distortion Residal PEP: 10%, MV PEP: 5% for foreman_cif Tre distortion RMPC algorithm Sthlmller s model Dani s model Frame index Frame index (a) (b) Figre 3-1. Transmission distortion D k vs. frame index k for foreman : (a) good channel, (b) poor channel. In or experiment, we observe that 1) the higher the propagated distortion, the smaller the propagation factor; and 2) the higher percentage of reconstrcted pixel vales near 0 or 255, the smaller the propagation factor. These two phenomena once more verify that the propagation factor is a fnction of all samples of reconstrcted pixel vale and sample variance of propagated error as proved in Chapter 2. These phenomena cold be explained by (3 8) in that 1) α is a decreasing fnction of b for 90

b > 0; and 2) α is an increasing function of y for 0 ≤ y ≤ 127 and a decreasing function of y for 128 ≤ y ≤ 255. We also note that a larger sample variance of the propagated error makes α less sensitive to changes in the reconstructed pixel value, while a larger deviation of the reconstructed pixel value from 128 makes α less sensitive to changes in the sample variance of the propagated error.

To further study estimation accuracy, we test the estimation algorithms under many different channel conditions. Fig. 3-3 and Fig. 3-4 show the estimation accuracy under PEP varying from 1% to 10%. In both figures, the RMPC-FTD algorithm achieves the most accurate distortion estimation under all channel conditions.

[Figure 3-2. Transmission distortion D^k (accumulated MSE) vs. frame index k for stefan: (a) good channel (residual PEP 2%, MV PEP 1%), (b) poor channel (residual PEP 10%, MV PEP 5%).]

[Figure 3-3. Transmission distortion D^k (accumulated MSE) vs. PEP for foreman.]

[Figure 3-4. Transmission distortion D^k (accumulated MSE) vs. PEP for stefan.]

Robustness of different estimation algorithms

In the previous subsection, we assumed that the PEP is perfectly known at the encoder. However, in a real wireless video communication system, the PEP is usually not perfectly known at the encoder; i.e., there is a random estimation error between the true PEP and the estimated PEP. Hence, it is important to evaluate the robustness of the estimation algorithms against PEP estimation error. To simulate imperfect PEP estimation, for a given true PEP denoted by $P_{true}$, we assume the estimated PEP is a random variable uniformly distributed in $[0, 2P_{true}]$; e.g., if $P_{true} = 10\%$, the estimated PEP is uniformly distributed in $[0, 20\%]$. Figs. 3-5 and 3-6 show the estimation accuracy of the three algorithms for foreman and stefan, respectively, under imperfect knowledge of PEP at the encoder. From the two figures, it is observed that, compared to the case with perfect knowledge of PEP at the encoder, imperfect knowledge of PEP may increase or decrease the gap between the estimated distortion and the true distortion for both Stuhlmuller's model and Dani's model. Specifically, for Stuhlmuller's model, if the PEP is under-estimated, the gap between the estimated distortion and the true distortion decreases compared to the case with perfect knowledge of PEP; for Dani's model, if the PEP is over-estimated, the gap decreases compared to the case with perfect knowledge of PEP. In contrast, the RMPC-FTD algorithm is more

robust against PEP estimation error, and provides a more accurate distortion estimate than Stuhlmuller's model and Dani's model.

[Figure 3-5. Transmission distortion D^k (accumulated MSE) vs. frame index k for foreman under imperfect knowledge of PEP: (a) good channel (residual PEP 2%, MV PEP 1%), (b) poor channel (residual PEP 10%, MV PEP 5%).]

[Figure 3-6. Transmission distortion D^k (accumulated MSE) vs. frame index k for stefan under imperfect knowledge of PEP: (a) good channel (residual PEP 2%, MV PEP 1%), (b) poor channel (residual PEP 10%, MV PEP 5%).]

3.6.2 R-D Performance of Mode Decision Algorithms

In this subsection, we compare the R-D performance of Algorithm 2 with that of the ROPE and LLN algorithms for mode decision in H.264. To compare all algorithms under multi-reference-picture motion compensated prediction, we also enhance the original ROPE algorithm [4] with multi-reference capability.

Experiment setup

The JM16.0 encoder and decoder are used in the experiments. To support more advanced techniques in H.264, we use the high profile defined in the H.264 specification, Annex A [22]. We conduct experiments for five schemes: three ERRDO schemes, i.e., RMPC-MS, LLN, and ROPE; random intra update; and the default SCRDO scheme with no transmission distortion estimation. All the tested video sequences are in CIF resolution at 30 fps. Each coded video sequence is tested under different PEP settings from 0.5% to 5%. Each video sequence is coded for its first 100 frames with 3 slices per frame. The error concealment method used is to copy the pixel value at the same position in the previous frame. The first frame is assumed to be correctly received. The encoder settings are as follows: no slice data partitioning is used; constrained intra prediction is enabled; the number of reference frames is 3; B-MBs are not included; only the 4x4 transform is used; CABAC is enabled for entropy coding; and in the LLN algorithm, the number of simulated decoders is 30.

R-D performance under no interpolation filter and no deblocking filter

In the experiments of this subsection, both the deblocking filter and the interpolation filter with fractional MV in H.264 are disabled. Due to the space limit, we only show the plots of PSNR vs. bit rate for the video sequences foreman and football under PEP = 2% and PEP = 5%, with rate control enabled. Figs. 3-7 and 3-8 show PSNR vs. bit rate for foreman and football, respectively. The experimental results show that RMPC-MS achieves the best R-D performance; LLN and ROPE achieve similar performance and the second-best R-D performance; the random intra update scheme (denoted by RANDOM) achieves the third-best R-D performance; and the SCRDO scheme (denoted by NO_EST) achieves the worst R-D performance. LLN has poorer R-D performance than RMPC-MS; this may be because 30 simulated decoders are still not enough to achieve a reliable distortion estimate, although LLN with 30 simulated decoders already incurs much higher complexity than RMPC-MS.

[Figure 3-7. PSNR vs. bit rate for foreman, with no interpolation filter and no deblocking filter: (a) PEP=2%, (b) PEP=5%.]

[Figure 3-8. PSNR vs. bit rate for football, with no interpolation filter and no deblocking filter: (a) PEP=2%, (b) PEP=5%.]

The reason why RMPC-MS achieves better R-D performance than ROPE is the consideration of clipping noise in RMPC-MS. Debug messages show that, without considering the clipping noise, ROPE over-estimates the end-to-end distortion for inter modes; hence ROPE tends to select intra modes more often than RMPC-MS and LLN, which leads to a higher encoding bit rate for ROPE; as a result, the PSNR gain achieved by ROPE is compromised by its higher bit rate. To verify this conjecture, we test all

sequences under the same Quantization Parameter (QP) settings without rate control. We observe that the ROPE algorithm always produces a higher bit rate than the other schemes.

Table 3-2 shows the average PSNR gain (in dB) of RMPC-MS over ROPE and LLN for different video sequences and different PEPs. The average PSNR gain is obtained by the method in Ref. [30], which measures the average distance (in PSNR) between two R-D curves. From Table 3-2, we see that RMPC-MS achieves an average PSNR gain of 1.44 dB over ROPE for foreman under PEP = 5%, and an average PSNR gain of 0.89 dB over LLN for the foreman sequence under PEP = 1%.

[Table 3-2. Average PSNR gain (in dB) of RMPC-MS over ROPE and LLN for the sequences coastguard, football, foreman, and mobile under several PEP settings.]

R-D performance with interpolation filter and deblocking filter

In H.264, the interpolation filter provides a notable objective (PSNR) gain and the deblocking filter provides a notable subjective gain. To support the interpolation filter with fractional MV in H.264 [20], we extend Algorithm 2 by using the nearest neighbor to approximate

the reference pixel pointed to by a fractional MV. In addition, the deblocking filter is also enabled in JM16.0 to compare the RMPC-MS, ROPE, and LLN algorithms. Note that both RMPC-MS and ROPE are derived without considering filtering operations. Due to the high spatial correlation between adjacent pixels, the averaging operation induced by a filter produces many cross-correlation terms when estimating the distortion at a subpixel position. Yang et al. [17] enhance the original ROPE algorithm with the interpolation filter in H.264. However, their algorithm requires 1 square root operation, 1 exponentiation operation, and 2 multiplication operations for calculating each cross-correlation term. Since a six-tap interpolation filter is used in H.264 for subpixel motion vector accuracy, there are 15 cross-correlation terms for calculating each subpixel distortion. Therefore, the complexity of their algorithm is very high and may not be suitable for real-time encoding.

In this subsection, we use a very simple but R-D efficient method to estimate the subpixel distortion. Specifically, we choose the nearest integer pixel around the subpixel, and use the distortion of that nearest integer pixel as the estimated distortion for the subpixel. Note that this simple method is not aimed at extending the RMPC-MS and ROPE algorithms, but only at comparing the R-D performances of these two algorithms for H.264 with fractional-MV motion compensation. We first show the experimental results with the interpolation filter but no deblocking filter in Figs. 3-9 and 3-10. From Figs. 3-9 and 3-10, we observe the same result as shown in Section 4.3.2: RMPC-MS achieves better R-D performance than the LLN and ROPE algorithms. From Figs. 3-9 and 3-10, we also see that each of the five algorithms achieves higher PSNR than its corresponding scheme with no interpolation filter; this means the simple method is valid. We also observe from Table 3-3 that in this case, RMPC-MS achieves an average PSNR gain of 2.97 dB over ROPE for the sequence mobile under PEP = 0.5%, and an average PSNR gain of 1.13 dB over LLN for foreman under PEP = 1%.

[Figure 3-9. PSNR vs. bit rate for foreman, with interpolation and no deblocking: (a) PEP=2%, (b) PEP=5%.]

[Figure 3-10. PSNR vs. bit rate for football, with interpolation and no deblocking: (a) PEP=2%, (b) PEP=5%.]

We also show the experimental results with both the interpolation filter and the deblocking filter, as shown in Figs. 3-11 and 3-12. It is interesting to see that each of the five algorithms with interpolation filter and deblocking filter achieves poorer R-D performance than the corresponding one with interpolation filter and no deblocking filter. That is, adding the deblocking filter degrades the R-D performance of each algorithm, since the estimated distortions become less accurate. In this case, ROPE sometimes performs better than RMPC-MS; this can be seen in Fig. 3-12, which is also the only case we

have observed in which ROPE performs better than RMPC-MS. This may be because RMPC-MS has a higher percentage of inter modes than ROPE. Since the deblocking operation is executed after the error concealment, as in JM16.0, for intra prediction the deblocking filter only affects the estimated distortion if the packet is lost, whereas for inter prediction the deblocking filter always impacts the estimated distortion. Therefore, the estimation accuracy for inter prediction suffers from the deblocking filter more than that for intra prediction. Thus, it is likely that the larger share of inter modes in RMPC-MS causes the higher PSNR drop in Fig. 3-12.

[Table 3-3. Average PSNR gain (in dB) of RMPC-MS over ROPE and LLN under interpolation filtering, for the sequences coastguard, football, foreman, and mobile under several PEP settings.]

[Figure 3-11. PSNR vs. bit rate for foreman, with interpolation and deblocking: (a) PEP=2%, (b) PEP=5%.]

[Figure 3-12. PSNR vs. bit rate for football, with interpolation and deblocking: (a) PEP=2%, (b) PEP=5%.]

CHAPTER 4
THE EXTENDED RMPC ALGORITHM FOR ERROR RESILIENT RATE DISTORTION OPTIMIZED MODE DECISION

In this chapter, we first prove a new theorem for calculating the second moment of a weighted sum of correlated random variables without requiring their probability distribution. Then, we apply the theorem to extend the RMPC-MS algorithm of Chapter 3 to support subpixel-level Mean Square Error (MSE) distortion estimation.

4.1 An Overview of Subpixel-level End-to-end Distortion Estimation for a Practical Video Codec

Existing pixel-level algorithms, e.g., the RMPC algorithm, are based on the integer-pixel MV assumption in deriving an estimate of $D_u^k$. Therefore, their application in state-of-the-art encoders is limited due to the possible use of fractional motion compensation. For the RMPC algorithms, if the MV of a block to be encoded is fractional, the MV has to be rounded to the nearest integer, and the block will use the reference block pointed to by the rounded MV as a reference. However, in state-of-the-art codecs, such as H.264 [22] and the HEVC proposals [31], an interpolation filter is used to interpolate the reference block if the MV is fractional. Therefore, the nearest-neighbor approximation of the distortion is not optimal for such an encoder. As a result, we need to extend the existing RMPC algorithm to optimally estimate the distortion for blocks with interpolation filtering.

Some subpixel-level end-to-end distortion estimation algorithms have been proposed to assist mode decision, as in Refs. [5, 17, 32]. In the H.264/AVC JM reference software [33], the LLN algorithm proposed in Ref. [5] is adopted to estimate the end-to-end distortion for mode decision. However, in the LLN algorithm more simulated decoders lead to higher computational complexity and larger memory requirements. Also, for the same video sequence and the same PEP, different encoders may obtain different estimated distortions due to the randomly produced error events at each encoder. In Ref. [32], the authors extend ROPE for the H.264 encoder by using the upper bound,

obtained from the Cauchy-Schwarz approximation, to approximate the cross-correlation terms. However, such an approximation requires very high complexity. For example, for an N-tap filter interpolation, each subpixel requires N integer multiplications for calculating the second moment terms; N(N−1)/2 floating-point multiplications and N(N−1)/2 square root operations for calculating the cross-correlation terms; and N(N−1)/2 + N − 1 additions and 1 shift for calculating the estimated distortion. (A common way to simplify the multiplication of an integer variable by a fractional constant is as follows: first scale up the fractional constant by a certain factor and round it to an integer; then do an integer multiplication; finally scale down the product.) On the other hand, the upper-bound approximation is not accurate for practical video sequences, since it assumes that the correlation coefficient is 1 for any two neighboring pixels. In Ref. [17], the authors propose correlation coefficient models that approximate the correlation coefficient of two pixels as a function, e.g., an exponentially decaying function, of their distance. However, due to the random behavior of individual pixel samples, such a statistical model does not produce an accurate pixel-level distortion estimate. In addition, such correlation coefficient model approximations incur extra complexity compared to the Cauchy-Schwarz upper-bound approximation, i.e., they need an additional N(N−1)/2 exponential operations and N(N−1)/2 floating-point multiplications for each subpixel. Therefore, the incurred complexity is prohibitively high for real-time video encoders. Moreover, since both the Cauchy-Schwarz upper-bound approximation and the correlation coefficient model approximation need floating-point multiplications, additional round-off errors are unavoidable, which further reduces their estimation accuracy.

In Chapter 2, we propose a divide-and-conquer method to quantify the effects of 1) the residual concealment error, 2) the Motion Vector (MV) concealment error, 3) the propagation error and clipping noise, and 4) the correlations between any two of them, on transmission

distortion. Based on our theoretical results, we proposed the RMPC algorithm in Chapter 3 for rate-distortion optimized mode decision with pixel-level end-to-end distortion estimation. Since the correlation between the transmission errors of neighboring pixels is much smaller and more stable than the correlation between the reconstructed values of neighboring pixels, the RMPC algorithm is easier to extend than ROPE to support subpixel-level end-to-end distortion estimation.

In this chapter, we first theoretically derive the second moment of a weighted sum of correlated random variables as a closed-form function of the second moments of the individual random variables. Then we apply this result to design a very low-complexity but accurate algorithm for mode decision. This algorithm is referred to as Extended RMPC (ERMPC). The ERMPC algorithm requires only N integer multiplications, N−1 additions, and 1 shift to calculate the second moment for each subpixel. Experimental results show that ERMPC achieves an average PSNR gain of 0.25 dB over the existing RMPC algorithm for the mobile sequence when PEP equals 2%, and an average PSNR gain of 1.34 dB over the LLN algorithm for the foreman sequence when PEP equals 1%.

The rest of this chapter is organized as follows. In Section 4.2, we first derive the general theorem for the second moment of a weighted sum of correlated random variables, and then apply this theorem to design a low-complexity and high-accuracy algorithm for mode decision. Section 5.5 shows the experimental results, which demonstrate the better R-D performance and subjective performance of the ERMPC algorithm over existing algorithms for H.264 mode decision in error-prone environments.

4.2 The Extended RMPC Algorithm for Mode Decision

In this section, we first state the problem of pixel-level distortion estimation in a practical video codec. Then we derive a general theorem for the second moment of a weighted sum of correlated random variables to help solve the problem. At last, we

apply the theorem to design a low-complexity and high-accuracy distortion estimation algorithm for mode decision.

Subpixel-level Distortion Estimation

From Chapter 3, we know that $E[\tilde{\zeta}_u^k]$ and $E[(\tilde{\zeta}_u^k)^2]$ can be recursively calculated by

$E[\tilde{\zeta}_u^k] = P_u^k\,\big(\varepsilon_u^k + \xi_u^k + E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]\big) + (1-P_u^k)\, E[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}],$   (4-1)

and

$E[(\tilde{\zeta}_u^k)^2] = P_u^k\,\big((\varepsilon_u^k + \xi_u^k)^2 + 2(\varepsilon_u^k + \xi_u^k)\,E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}] + E[(\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k})^2]\big) + (1-P_u^k)\, E[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2],$   (4-2)

where $\varepsilon_u^k \triangleq \hat{e}_u^k - \check{e}_u^k$ is the residual concealment error when the residual packet is lost; $\xi_u^k \triangleq \hat{f}^{k-1}_{u+mv_u^k} - \hat{f}^{k-1}_{u+\check{mv}_u^k}$ is the MV concealment error when the MV packet is lost; $E[\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k}]$ and $E[(\tilde{\zeta}^{k-1}_{u+\check{mv}_u^k})^2]$ in the $(k-1)$-th frame can be recursively calculated by (4-1) and (4-2); and $P_u^k$ is the pixel error probability. Denote by $\hat{E}(\cdot)$ the estimate of $E(\cdot)$; $E[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}]$ and $E[(\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\})^2]$ can be estimated by

$\hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k} + \Delta_u^k\{\bar{r},\bar{m}\}] = \begin{cases} \hat{f}_u^k - 255, & \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] < \hat{f}_u^k - 255 \\ \hat{f}_u^k, & \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}] > \hat{f}_u^k \\ \hat{E}[\tilde{\zeta}^{k-j}_{u+mv_u^k}], & \text{otherwise,} \end{cases}$   (4-3)

and

105 In H.264, the accracy of motion compensation is in nits of one qarter of the distance between lma samples. 2 The prediction vales at half-sample positions are obtained by applying a one-dimensional 6-tap Finite Implse Response (FIR) filter horizontally and vertically. The prediction vales at qarter-sample positions are generated by averaging samples at integer- and half-sample positions [20]. In sch a case, some variables in the pixel-level distortion estimation now inclde a fractional MV and those variables shold be re-estimated. As a reslt, E[ ζ k 1 ], E[( ζ k 1 + ˇmv k + ˇmv) 2 ], k E[ ζ k j + k +mv { r, m}] and E[( ζ k j + k k +mv { r, m}) 2 ] in (4 1) and (4 2) shold be estimated k based on the fractional MV. Since Ê[ ζ k j + k +mv { r, m}] can be calclated by Ê[ ζ k j k +mv] k as in (4 3) and Ê[( ζ k j + k +mv { r, m}) 2 ] can be calclated by Ê[( ζ k j k +mv) 2 ] as in (4 4), we k only need to determine the first moment and second moment of their neighboring integer pixel positions. Take ζ k j +mv k ζ k 1 + ˇmv k and ζ k j +mv k from for example. Denote v k j = + mv k and v is in a sbpixel position in the k j-th frame. All neighboring pixels in the integer position, sed to interpolate the reconstrcted pixel vale at v, are denoted by i and with a weight w i, i 1, 2,..., N, where N = 6 for the half-sample interpolation, and N = 2 for the qarter-sample interpolation in H.264. Therefore, the interpolated reconstrcted pixel vale at the encoder is ˆf k j v = and the interpolated reconstrcted pixel vale at the decoder is f k j v = N i=1 w i ˆf k j i, (4 5) N w i i=1 f k j i. (4 6) 2 Note that considering the chroma distortion does not always improve the R-D performance bt indces more complexity. Therefore, we only consider lma components in this chapter. 105

As a result, we have

$E[\tilde{\zeta}^{k-j}_v] = E\Big[\sum_{i=1}^{N} w_i\,(\hat{f}^{k-j}_i - \tilde{f}^{k-j}_i)\Big] = \sum_{i=1}^{N} w_i\, E[\tilde{\zeta}^{k-j}_i],$   (4-7)

and

$E[(\tilde{\zeta}^{k-j}_v)^2] = E\Big\{\Big[\sum_{i=1}^{N} w_i\,\hat{f}^{k-j}_i - \sum_{i=1}^{N} w_i\,\tilde{f}^{k-j}_i\Big]^2\Big\} = E\Big[\Big(\sum_{i=1}^{N} w_i\,(\hat{f}^{k-j}_i - \tilde{f}^{k-j}_i)\Big)^2\Big] = E\Big[\Big(\sum_{i=1}^{N} w_i\,\tilde{\zeta}^{k-j}_i\Big)^2\Big].$   (4-8)

Since $\hat{E}[\tilde{\zeta}^{k-j}_i]$ has already been calculated by the RMPC algorithm, $\hat{E}[\tilde{\zeta}^{k-j}_v]$ can be very easily calculated by (4-7). However, calculating $\hat{E}[(\tilde{\zeta}^{k-j}_v)^2]$ is not straightforward, since $E[(\tilde{\zeta}^{k-j}_v)^2]$ in (4-8) is in fact the second moment of a weighted sum of random variables.

A New Theorem for Calculating the Second Moment of a Weighted Sum of Correlated Random Variables

The Moment Generating Function (MGF) can be used to calculate the second moment of a random variable [25]. However, to estimate the second moment of a weighted sum of random variables, the traditional moment generating function usually requires knowing their probability distributions and assuming they are independent. Most random variables involved in the averaging operations of a video codec are not independent, and their probability distributions are unknown. Therefore, some approximations, such as the Cauchy-Schwarz upper-bound approximation [32] or the correlation coefficient model approximation [17], are usually adopted to approximate the second moment of such a complicated random variable. However, those approximations require very high complexity. For example, for each subpixel, with the N-tap filter interpolation, the Cauchy-Schwarz upper-bound approximation requires N integer multiplications for calculating the second moment terms, N(N−1)/2 floating-point multiplications and N(N−1)/2 square root operations for calculating the cross-correlation

terms, and N(N-1)/2 + N - 1 additions and 1 shift for calculating the estimated distortion. The correlation coefficient model requires an additional N(N-1)/2 exponential operations and N(N-1)/2 floating-point multiplications compared to the Cauchy-Schwarz upper bound approximation. In a wireless video communication system, the computational capability of the real-time encoder is usually very limited, and floating-point processing is undesirable, especially on mobile devices. Therefore, the question is how to design a new algorithm that accurately calculates the second moment in (4-8) via only integer multiplications, integer additions, and shifts. We can design a low-complexity and high-accuracy algorithm that extends the RMPC algorithm by using the following theorem.

Theorem 4.1. For any N correlated random variables {X_1, X_2, ..., X_N} and w_i \in R, i \in {1, 2, ..., N}, the second moment of the weighted sum of these random variables is given by

E[(\sum_{i=1}^{N} w_i X_i)^2] = (\sum_{i=1}^{N} w_i) \sum_{j=1}^{N} [w_j E(X_j^2)] - \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} [w_k w_l E((X_k - X_l)^2)].   (4-9)

Theorem 4.1 is proved in Appendix C.

In H.264, most averaging operations, e.g., interpolation, deblocking, and bi-prediction, are special cases of Theorem 4.1 in that \sum_{i=1}^{N} w_i = 1. In (4-9), \sum_{j=1}^{N} [w_j E(X_j^2)] is the weighted sum of E(X_j^2), which has already been estimated by the RMPC algorithm, and the only unknown is \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} [w_k w_l E((X_k - X_l)^2)]. However, we will see that this unknown can be assumed to be negligible for the purposes of mode decision.

The Extended RMPC Algorithm for Mode Decision

Replacing X_k and X_l in (4-9) by \tilde{\zeta}_{u_i}^k and \tilde{\zeta}_{u_j}^k, we obtain

\tilde{\zeta}_{u_i}^k - \tilde{\zeta}_{u_j}^k = (\hat{f}_{u_i}^k - \tilde{f}_{u_i}^k) - (\hat{f}_{u_j}^k - \tilde{f}_{u_j}^k) = (\hat{f}_{u_i}^k - \hat{f}_{u_j}^k) - (\tilde{f}_{u_i}^k - \tilde{f}_{u_j}^k).   (4-10)
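Theorem 4.1 requires no independence or distribution assumption, which can be checked numerically. The sketch below generates correlated Laplacian samples and evaluates both sides of (4-9) with the 6-tap interpolation weights; the sample construction is arbitrary and only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Correlated, non-Gaussian samples: each X_i shares a common component.
N, S = 6, 200000
base = rng.laplace(size=S)
X = np.stack([base + 0.3 * rng.laplace(size=S) for _ in range(N)])  # shape (N, S)
w = np.array([1, -5, 20, 20, -5, 1], dtype=float) / 32.0            # sum(w) = 1

lhs = np.mean((w @ X) ** 2)                      # E[(sum_i w_i X_i)^2]

rhs = w.sum() * np.dot(w, np.mean(X ** 2, axis=1))
for k in range(N - 1):
    for l in range(k + 1, N):
        rhs -= w[k] * w[l] * np.mean((X[k] - X[l]) ** 2)

print(lhs, rhs)   # the two values agree up to floating-point error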

In (4-10), both \hat{f}_{u_i}^k - \hat{f}_{u_j}^k and \tilde{f}_{u_i}^k - \tilde{f}_{u_j}^k depend on the spatial correlation of the reconstructed pixel values at positions u_i and u_j. When u_i and u_j are located in the same neighborhood, they are very likely to be transmitted in the same packet. Therefore, the difference between \hat{f}_{u_i}^k - \hat{f}_{u_j}^k and \tilde{f}_{u_i}^k - \tilde{f}_{u_j}^k is very small, and hence E[(\tilde{\zeta}_{u_i}^k - \tilde{\zeta}_{u_j}^k)^2] is much smaller than E[(\tilde{\zeta}_{u_i}^k)^2] and E[(\tilde{\zeta}_{u_j}^k)^2]. On the other hand, distortion is estimated for one MB or one sub-MB, as in (3-26), for mode decision. When the cardinality |V_l^k| is large, \sum_{v \in V_l^k} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} [w_i w_j E((\tilde{\zeta}_{u_i}^k - \tilde{\zeta}_{u_j}^k)^2)] converges to a constant for all modes with high probability, due to the summation over all pixels in the MB. For simplicity, we will call it the negligible term in the following sections. Therefore, in (4-9) only the first term on the right-hand side needs to be calculated. Since \sum_{i=1}^{N} w_i = 1, we estimate E[(\tilde{\zeta}_v^k)^2] for mode decision by

\hat{E}[(\tilde{\zeta}_v^k)^2] = \sum_{i=1}^{N} [w_i \hat{E}((\tilde{\zeta}_{u_i}^k)^2)].   (4-11)

With (4-11), the complexity of estimating the distortion of each subpixel, with the N-tap filter interpolation, is dramatically reduced to only N integer multiplications, N-1 additions, and 1 shift. Here, we propose the following algorithm to extend the RMPC algorithm for mode decision.

Algorithm 3. Rate-distortion optimized mode decision for an MB in the k-th frame (k >= 1).
1) Input: QP, PEP.
2) Initialization of \hat{E}[\tilde{\zeta}_u^0] and \hat{E}[(\tilde{\zeta}_u^0)^2] for all pixels.
3) Loop over all available modes for each MB:
   estimate E[\tilde{\zeta}_{u+mv_u^k}^{k-j}] via (4-7) and E[(\tilde{\zeta}_{u+mv_u^k}^{k-j})^2] via (4-11) for all pixels in the MB,
   estimate E[\tilde{\zeta}_{u+mv_u^k}^{k-j} + \Delta_u^k\{\bar{r},\bar{m}\}] via (4-3) for all pixels in the MB,
   estimate E[(\tilde{\zeta}_{u+mv_u^k}^{k-j} + \Delta_u^k\{\bar{r},\bar{m}\})^2] via (4-4) for all pixels in the MB,
   estimate E[\tilde{\zeta}_u^k] via (4-1) and E[(\tilde{\zeta}_u^k)^2] via (4-2) for all pixels in the MB,
   estimate D_u^k via (3-24) for all pixels in the MB,

   estimate the R-D cost for the MB via (3-26),
   End loop.
   Via (3-27), select the best mode with the minimum R-D cost for the MB.
4) Output: the best mode for the MB.

In this chapter, Algorithm 3 is referred to as ERMPC. Note that if an MV packet is lost, the ERMPC algorithm conceals the MV with integer accuracy to reduce both the concealment complexity and the estimation complexity.³ Therefore, estimating E[\tilde{\zeta}_{u+\check{mv}_u^k}^{k-1}] and E[(\tilde{\zeta}_{u+\check{mv}_u^k}^{k-1})^2] in (4-1) and (4-2) does not require (4-11), which saves computational cost.

³ Note that \check{mv}_u^k denotes the concealed motion vector for pixel u, in the case that mv_u^k is received with error.

Merits and Limitations of the ERMPC Algorithm

Merits

Since both the Cauchy-Schwarz upper bound approximation [32] and the correlation coefficient model approximation [17] induce floating-point multiplications, round-off error is unavoidable in those algorithms. The algorithm by Yang et al. [17] needs extra complexity to mitigate the effect of round-off error in their distortion estimation algorithm. In contrast, one of the merits of Theorem 4.1 is that it only needs integer multiplications and integer additions. Assuming w_i (and w_i w_j) can be scaled up to integers without any round-off error, we may compare the R-D costs of all modes after applying the same scaling. Therefore, round-off error is avoided in the ERMPC algorithm.

In Ref. [8], the authors prove that a low-pass interpolation filter decreases the frame-level propagated error under some assumptions. In fact, it is easy to prove that when \sum_{i=1}^{N} w_i = 1 and |V_l^k| is large, the negligible term is larger than or equal to zero. Even at the MB level, the negligible term is larger than or equal to zero with very high

probability. From (4-9), we see that the block-level distortion decreases, with very high probability, after the interpolation filter. One additional benefit of (4-9) is to guide the design of the interpolation filter. Traditional interpolation filter design aims to minimize the prediction error. With (4-9), we may instead design an interpolation filter by maximizing \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} [w_k w_l E((X_k - X_l)^2)] under a constraint on \sum_{j=1}^{N} [w_j E(X_j^2)].

Limitations

In Algorithm 3, the second moment of the propagated error E[(\tilde{\zeta}_{u+mv_u^k}^{k-j})^2] is estimated by neglecting the negligible term to reduce the complexity. A more accurate alternative is to estimate E[(\tilde{\zeta}_{u_i}^k - \tilde{\zeta}_{u_j}^k)^2] recursively by storing its value in memory. This will be considered in our future work.

4.3 Experimental Results

In this section, we compare the R-D performance and subjective performance of the ERMPC algorithm with those of the LLN algorithm for mode decision in H.264. We also compare ERMPC with RMPC and ROPE by using the nearest neighbor to approximate the reference pixel pointed to by a fractional MV. To compare all algorithms under multi-reference-picture motion compensated prediction, we also enhance the original ROPE algorithm [4] with multi-reference capability.

Experiment Setup

The JM16.0 encoder and decoder are used in the experiments. The high profile as defined in the H.264 specification Annex A [22] is used. All the tested video sequences are in CIF resolution at 30 fps. Each coded video sequence is tested under different PEP settings from 0.5% to 5%. Each video sequence is coded for its first 100 frames with 3 slices per frame. The error concealment method used for all algorithms is to copy the pixel value in the same position of the previous frame. The first frame is assumed to be correctly received.

The encoder setting is given as follows: constrained intra prediction is enabled; the number of reference frames is 3; B slices are not included; only the 4x4 transform is used; CABAC is enabled for entropy coding; in the LLN algorithm, the number of simulated decoders is 30.

R-D Performance

Due to space limitations, we only show the plots of PSNR vs. bit rate for the video sequences foreman and mobile under PEP = 2% and PEP = 5%. Figs. 4-1 and 4-2 show PSNR vs. bit rate for foreman and mobile, respectively. The experimental results show that ERMPC achieves the best R-D performance and RMPC achieves the second best R-D performance; ROPE achieves better performance than LLN in some cases, such as at high rate in Fig. 4-1, but worse performance than LLN in other cases, such as in Fig. 4-2 and at low rate in Fig. 4-1.

Figure 4-1. PSNR vs. bit rate for foreman: (a) PEP=0.5%, (b) PEP=2%.

It is interesting to see that for some sequences and channel conditions, ERMPC achieves a notable PSNR gain over RMPC. This is, for example, evident with mobile and foreman. For some other cases, however, ERMPC only achieves a marginal PSNR gain over RMPC (e.g., coastguard and football). From the analysis in Section 4.2.1, we know that the only difference between RMPC and ERMPC is the estimate of the error from the reference pixel, i.e., the propagated error, under the condition that there

is no newly induced error in the current pixel. Therefore, the performance gain of ERMPC over RMPC comes only from inter modes, since both use exactly the same estimates for intra modes. Thus, a higher percentage of intra modes in coastguard and football may result in a marginal PSNR gain of ERMPC over RMPC.

Figure 4-2. PSNR vs. bit rate for mobile: (a) PEP=0.5%, (b) PEP=2%.

For most sequences and channel conditions, we observe that the higher the bit rate for encoding, the larger the PSNR gain of ERMPC over RMPC, as in Fig. 4-1 and Fig. 4-2(a). In (3-24), the end-to-end distortion consists of both quantization distortion and transmission distortion. The ERMPC algorithm gives a more accurate estimation of the propagated error in the transmission distortion than the RMPC algorithm. When the bit rate for source encoding is very low, with rate control the controlled QP is large, and hence the quantization distortion becomes the dominant factor in the end-to-end distortion. Therefore, the PSNR gain of ERMPC over RMPC is marginal. On the contrary, when the bit rate for source encoding is high, the transmission distortion becomes the dominant part of the end-to-end distortion. Therefore, the PSNR gain of ERMPC over RMPC is notable. However, this is not always true, as observed in Fig. 4-2(b). In JM16.0, the Lagrange multiplier in (3-26) is a function of QP. A higher bit rate, or a smaller QP, also causes a smaller Lagrange multiplier. Therefore, the rate cost in (3-26) becomes smaller, which may produce a higher percentage of intra modes. In such a case, the

PSNR gain of ERMPC over RMPC decreases when the bit rate becomes higher. As a result, different sequences give different results depending on whether more intra modes are selected as the bit rate increases.

LLN has poorer R-D performance than ERMPC. This may be because 30 simulated decoders are still not enough to achieve a reliable distortion estimate, although LLN with 30 simulated decoders already incurs much higher complexity than ERMPC. Since the original ROPE does not support the interpolation filtering operation and its extensions [17, 32] induce many floating-point operations and round-off errors, we only use the same nearest neighbor approximation to show how its R-D performance differs from ERMPC, RMPC, and LLN. We see that such an extension is valid for some sequences, such as foreman. However, this approximation gives poor R-D performance for some other sequences, such as mobile. Therefore, RMPC is easier to extend than ROPE, since the nearest neighbor approximation for RMPC achieves good performance in all sequences.

Table 4-1 shows the average PSNR gain (in dB) of ERMPC over RMPC, LLN, and ROPE for different video sequences and different PEP. The average PSNR gain is obtained by the method in Ref. [30], which measures the average distance (in PSNR) between two R-D curves. From Table 4-1, we see that ERMPC achieves an average PSNR gain of 0.25 dB over RMPC for the sequence mobile under PEP = 2%; it achieves an average PSNR gain of 1.34 dB over LLN for the foreman sequence under PEP = 1%; and it achieves an average PSNR gain of 3.18 dB over ROPE for the mobile sequence under PEP = 0.5%.

Subjective Performance

Since PSNR can be less meaningful with error concealment, a much more important performance criterion is the subjective performance, which directly relates to the degree of user satisfaction. Fig. 4-3 shows the subjective quality of the 84-th frame and the 99-th frame of the foreman sequence under a PER of 1% and a bit rate of

Table 4-1. Average PSNR gain (in dB) of ERMPC over RMPC, LLN and ROPE (columns: Sequence, PEP, ERMPC vs. RMPC, ERMPC vs. LLN, ERMPC vs. ROPE; rows: coastguard, football, foreman, mobile, each under PEP from 0.5% to 5%).

kbps. From Fig. 4-3, we see a performance result similar to that of the R-D comparison above; that is, ERMPC achieves the best performance.

Discussion

Effect of clipping noise on the mode decision

The experiments show that, since it does not consider clipping noise, ROPE over-estimates the end-to-end distortion for inter modes. Hence, ROPE tends to select intra modes more often than ERMPC, RMPC, and LLN, which leads to higher encoding bit rates. To verify this conjecture, we tested all sequences under the same Quantization Parameter (QP) settings from 20 to 32 without rate control. We observed that the ROPE algorithm always produced a higher bit rate than the other schemes, as shown in Fig. 4-4 and Fig. 4-5.

Figure 4-3. (a) ERMPC at the 84-th frame, (b) RMPC at the 84-th frame, (c) LLN at the 84-th frame, (d) ROPE at the 84-th frame, (e) ERMPC at the 99-th frame, (f) RMPC at the 99-th frame, (g) LLN at the 99-th frame, (h) ROPE at the 99-th frame.

Figure 4-4. PSNR vs. bit rate for foreman: (a) PEP=0.5%, (b) PEP=2%.

Effect of transmission errors on mode decision

Compared to the regular RDO process in JM16.0, which does not consider transmission errors, the ERMPC/RMPC/LLN/ROPE algorithms show three distinctions. 1) The number of intra MBs increases, since the transmission error is accounted for in the

mode decision. 2) The number of MBs with skip mode increases, since the transmission error increases the transmission distortion in all other modes except the skip mode. 3) If we allow the first frame to be erroneous, the second frame will have many intra MBs, since the propagated error from the first frame is much higher than for other frames. This is because only the value 128 can be used to conceal the reconstructed pixel value if the first frame is lost, while if other frames are lost, the collocated pixel in the previous frame can be used to conceal the reconstructed pixel value. Therefore, a transmission error in the first frame gives much higher propagated error than in the other frames.

Figure 4-5. PSNR vs. bit rate for mobile: (a) PEP=0.5%, (b) PEP=2%.

CHAPTER 5
RATE-DISTORTION OPTIMIZED CROSS-LAYER RATE CONTROL IN WIRELESS VIDEO COMMUNICATION

In this chapter, we derive a more accurate source bit rate model and quantization distortion model than existing parametric models. We also improve the performance bound of channel coding with convolutional codes and a Viterbi decoder, and derive its performance under a Rayleigh block fading channel. Given the instantaneous channel condition, i.e., SNR and bandwidth, we design a rate-distortion optimized cross-layer rate control (CLRC) algorithm that jointly chooses the quantization step size and the channel coding rate.

5.1 A Literature Review on Rate-Distortion Models in Wireless Video Communication Systems

With the prevalence of 3G/4G networks and smartphones, real-time mobile video applications, e.g., videophone calls, are becoming more and more popular. However, transmitting video over a mobile phone with good quality is particularly challenging, since mobile channels are subject to multipath fading and the channel condition therefore changes from frame to frame. Given the instantaneous channel condition, e.g., signal-to-noise ratio (SNR) and bandwidth, the minimum end-to-end distortion can be achieved by optimally allocating the transmission bit rate between the source bit rate and the redundant bit rate. In a practical wireless video communication system, this can be achieved by jointly controlling the source encoding parameters, e.g., quantization step size, in the video encoder, and the channel encoding parameters, e.g., channel coding rate, in the channel encoder. Since both the video statistics and the channel condition vary with time, we need to dynamically control those parameters for each frame in real-time video encoding and packet transmission. Therefore, we need to estimate the bit rate and distortion for each possible combination of parameters before encoding each frame. As a result, accurate bit rate and distortion models are very helpful for achieving the minimum end-to-end distortion with low complexity.

There have been many works addressing this problem in recent years. While most of them derive the end-to-end distortion as a function of bit rate and packet error rate [8, 10], others use operational rate-distortion (R-D) functions [34]. Analytical models are more desirable, since it is very difficult for the video encoder to obtain all operational functions for different video statistics and channel conditions before the actual encoding. However, the existing analytical models are still not accurate enough to accommodate the time-varying channel condition. On the other hand, to obtain tractable formulae, the analytical models in Refs. [8, 10] all assume that block codes, i.e., Reed-Solomon codes, are adopted as the forward error correction (FEC) scheme. Based on that FEC scheme, the distortion is derived as a function of channel coding rate and bit error rate. However, this assumption has two limitations: 1) most up-to-date video communication systems use convolutional codes or more advanced codes, e.g., turbo codes, for physical-layer channel coding, due to their flexible choice of channel coding rate without changing the channel coding structure; 2) in the cross-layer optimization problem, selecting the source bit rate and the redundant bit rate based on a given bit error rate is suboptimal, while the optimal solution can be achieved by jointly choosing them based on the given instantaneous channel condition, e.g., SNR and channel bandwidth. In this chapter, we aim to solve the cross-layer optimization problem by deriving a more accurate bit rate model and end-to-end distortion model, where the latter consists of two parts, namely a quantization distortion model and a transmission distortion model.

Plenty of bit rate models have been developed in the existing literature. Most of the existing works derive bit rate as a function of video statistics and quantization step size [35-38], while others model bit rate as a function of video statistics and other parameters [39]. In general, these models come from either experimental observation [37, 39-41] or parametric modeling [38, 42, 43]. However, both approaches have limitations. Experimental modeling usually introduces some model

parameters that can only be estimated from previous frames. Therefore, the model accuracy depends not only on the statistics and coding parameters but also on the estimation accuracy of those model parameters. In theory, however, the instantaneous frame bit rate should be independent of previous frames, given the instantaneous video frame statistics and coding parameters. In addition, the estimation error of those model parameters may have a significant impact on the model accuracy, which can be observed in the H.264/AVC JM reference software [33] and will be explained in detail in the experimental section of this chapter. On the other hand, parametric modeling has the following two limitations: 1) the assumed residual probability distribution, e.g., the Laplacian distribution, may deviate significantly from the true histogram; 2) the implicit assumption that all transform coefficients are identically distributed is not valid if run-length coding is conducted before the entropy coding, as in most practical encoders. Since the model-selection problem may often be more important than having an optimized algorithm [44], simply applying these parametric models to a real encoder may result in poor R-D performance. In this chapter, we improve the bit rate model by modeling the component of run-level mapping plus entropy coding as the process of choosing different codebooks for different quantized transform coefficients. We also compensate the mismatch between the true histogram and the assumed Laplacian distribution in the parametric model by utilizing the estimation deviation of previous frames. Experimental results show that our method achieves a more accurate estimate of bit rate compared to existing models.

Quantization distortion is caused by quantization error under lossy source coding, and it has been extensively explored since Shannon's seminal rate-distortion theory, first proposed in Ref. [1] and later proved in Ref. [2]. Quantization distortion is studied either as a function of bit rate and the source probability distribution, e.g., the classical R-D function for a Gaussian source [28, 45], or as a function of the number of levels and the source probability distribution for a given quantizer, e.g., a uniform scalar

quantizer for a Gaussian source [46]. In the case of minimum mean-square digitization of memoryless Gaussian sources, quantizers with uniformly spaced levels have entropies that exceed the rate-distortion function by approximately 0.25 bits/sample at relatively high rates [47]. In Ref. [48], the performance of optimum quantizers for a wide class of memoryless non-Gaussian sources is investigated, and it is shown that uniform threshold quantizers perform as effectively as optimum quantizers. For this reason, the uniform quantizer is usually adopted in practical video encoders, e.g., H.264 [22]. For a uniform quantizer, the quantization distortion has been derived as a function of the quantization step size (or the corresponding operating point) and the video frame statistics, either from experimental observation [37, 39] or by parametric modeling [38, 42]. Although parametric modeling has achieved quite accurate results, it can be further improved because of the inaccuracy of the assumed source distribution. In this chapter, we improve the estimation accuracy of quantization distortion by utilizing a method similar to that used in the bit rate model. Experimental results show that our quantization distortion model is more accurate than existing models.

Transmission distortion is caused by transmission errors over error-prone channels. Predicting transmission distortion at the transmitter poses a great challenge due to the spatio-temporal correlation inside the input video sequence, the nonlinearity of the video codec, and the varying PEP in time-varying channels. The existing transmission distortion models can be categorized into the following three classes: 1) pixel-level or block-level models (applied to prediction mode selection) [4-6]; 2) frame-level, packet-level, or slice-level models (applied to cross-layer encoding rate control) [7-11]; 3) GOP-level or sequence-level models (applied to packet scheduling) [12-16]. Although the existing transmission distortion models work at different levels, they share some common properties, which come from the inherent characteristics of a wireless video communication system, that is, spatio-temporal correlation, a nonlinear codec, and a time-varying channel. However, none of those works analyzed the effect of

nonlinear clipping noise on the transmission distortion, and therefore they cannot provide accurate transmission distortion estimation. In Chapter 2, we analytically derive, for the first time, the transmission distortion formula as a closed-form function of packet error probability (PEP), video frame statistics, and system parameters; then, in Chapter 3, we design the RMPC algorithm to predict the transmission distortion with low complexity and high accuracy. In this chapter, we further derive the PEP and the transmission distortion as functions of SNR, transmission rate, and channel coding rate for cross-layer optimization.

Channel coding can be considered as the embedding of signal constellation points in a higher-dimensional signaling space than is needed for communication. By mapping to a higher-dimensional space, the distance between points increases, which provides better error correction and detection performance [18]. In general, the performance of soft-decision decoding is about 2-3 dB better than that of hard-decision decoding [18]. Since convolutional decoders have efficient soft-decision decoding algorithms, such as the Viterbi algorithm [49], we choose convolutional codes for physical-layer channel coding in this chapter.¹ In addition, rate-compatible punctured convolutional (RCPC) codes can adaptively change the coding rate without changing the encoder structure, which makes convolutional codes an appropriate choice for real-time video communication over wireless fading channels. In this chapter, we improve the performance bound of convolutional codes by adding a threshold for the low-SNR case, and extend it to support a more flexible SNR threshold for transmitters with channel estimation. For transmitters without channel estimation, we also derive the expected PEP as a simple function of the convolutional encoder structure and the channel condition under a Rayleigh block fading channel.

¹ Our algorithm can also be used for other channel codes, e.g., block codes, turbo codes, and LDPC codes, given their performance for different coding rates.

Given the bit rate function, the quantization distortion function, and the transmission distortion function, minimizing the end-to-end distortion becomes an optimization problem under the transmission bit rate constraint. In this chapter, we also apply our bit rate model, quantization distortion model, and transmission distortion model to cross-layer rate control with rate-distortion optimization (RDO). Due to the discrete characteristics and the possible non-convexity of the distortion function [50], the traditional Lagrange multiplier solution for continuous convex optimization is infeasible in a video communication system. The discrete version of Lagrangian optimization was first introduced in Ref. [51], and then first used in a source coding application in Ref. [50]. Due to its simplicity and effectiveness, this optimization method is de facto adopted by practical video codecs, e.g., the H.264 reference code JM [33]. In this chapter, we use the same method to solve our optimization problem.

The rest of this chapter is organized as follows. In Section 5.2, we formulate the cross-layer optimization problem. In Section 5.3, we derive our bit rate model, quantization distortion model, and transmission distortion model. In Section 5.4, we propose a practical cross-layer rate control algorithm to achieve the minimum end-to-end distortion under the given SNR and channel bandwidth. Section 5.5 shows the experimental results, which demonstrate both the higher accuracy of our models and the better performance of our algorithm over existing algorithms.

5.2 Problem Formulation

Fig. 2-1 shows the structure of a typical wireless video communication system. It consists of an encoder, two channels, and a decoder, where residual packets and MV packets are transmitted over their respective channels. Note that in this system, both the residual channel and the MV channel are application-layer channels. Fig. 5-1 shows the channel details for these two channels.

Figure 5-1. Channel model (entropy coding, channel coding with rate R_c, modulation and power control over a channel with gain g and noise n, followed by demodulation, channel decoding, and entropy decoding; R_s and R_t denote the source and transmission bit rates).

The general RDO problem in a wireless video communication system can be formulated as

min D_ETE^k   s.t.  R_t^k <= R_con^k,   (5-1)

where D_ETE^k is the end-to-end distortion of the k-th frame, R_t^k is the transmitted bit rate of the k-th frame, and R_con^k (which depends on the channel condition) is the bit rate constraint of the k-th frame. From the definition, we have

D_ETE^k \triangleq \frac{1}{|V^k|} E[\sum_{u \in V^k} (f_u^k - \tilde{f}_u^k)^2],   (5-2)

where V^k is the set of pixels in the k-th frame; f_u^k is the original value of pixel u in the k-th frame; and \tilde{f}_u^k is the reconstructed value of the corresponding pixel at the decoder. Define the quantization error as f_u^k - \hat{f}_u^k and the transmission error as \hat{f}_u^k - \tilde{f}_u^k, where \hat{f}_u^k is the reconstructed value of pixel u in the k-th frame at the encoder. While f_u^k - \hat{f}_u^k depends only on the quantization parameter (QP)², \hat{f}_u^k - \tilde{f}_u^k mainly depends on the PEP and the error concealment scheme. In addition, experimental results show that

² In the rate control algorithm design, the quantization offset is often fixed.

f_u^k - \hat{f}_u^k is zero-mean, which is also obvious in theory for encoders designed under the MMSE criterion. Therefore, we make the following assumption.

Assumption 7. f_u^k - \hat{f}_u^k and \hat{f}_u^k - \tilde{f}_u^k are uncorrelated, and E[f_u^k - \hat{f}_u^k] = 0.

Under Assumption 7, from (5-2), we obtain

D_ETE^k = \frac{1}{|V^k|} E[\sum_{u \in V^k} (f_u^k - \hat{f}_u^k)^2] + \frac{1}{|V^k|} E[\sum_{u \in V^k} (\hat{f}_u^k - \tilde{f}_u^k)^2] + \frac{2}{|V^k|} \sum_{u \in V^k} E[f_u^k - \hat{f}_u^k] E[\hat{f}_u^k - \tilde{f}_u^k]
        = \frac{1}{|V^k|} E[\sum_{u \in V^k} (f_u^k - \hat{f}_u^k)^2] + \frac{1}{|V^k|} E[\sum_{u \in V^k} (\hat{f}_u^k - \tilde{f}_u^k)^2]
        = D_Q^k + D_T^k,   (5-3)

where the first term on the right-hand side is called the frame-level quantization distortion (FQD), i.e., D_Q^k \triangleq \frac{1}{|V^k|} E[\sum_{u \in V^k} (f_u^k - \hat{f}_u^k)^2], and the second term is called the frame-level transmission distortion (FTD), i.e., D_T^k \triangleq \frac{1}{|V^k|} E[\sum_{u \in V^k} (\hat{f}_u^k - \tilde{f}_u^k)^2].

In a typical video codec, the spatial and temporal correlation is first removed by intra prediction and inter prediction; then the residual is transformed and quantized. Given the uniform quantizer, D_Q^k depends only on the quantization step size Q^k and the video frame statistics \varphi_f^k. Therefore, we can express D_Q^k as a function of Q^k and \varphi_f^k, i.e., D_Q(Q^k, \varphi_f^k), where D_Q(\cdot) is independent of the frame index k. In Chapter 2, we derived D_T^k as a function of PEP, video frame statistics \varphi_f^k, and system parameters \varphi_s^k, i.e., D_T(PEP^k, \varphi_f^k, \varphi_s^k). Since PEP^k depends on the SNR \gamma(t), the transmission bit rate R_t^k, and the channel coding rate R_c^k, D_T^k also depends on R_c^k. The higher the channel coding rate, the higher PEP^k and thus the larger D_T^k. However, under the same bandwidth limit, a higher channel coding rate also means fewer redundant bits, i.e., a higher source bit rate, and thus a smaller D_Q^k. In order to design the optimum Q^k and R_c^k that achieve the minimum D_ETE^k, we need PEP^k as a function of the SNR \gamma(t), the transmission rate R_t^k, and R_c^k, i.e., P(\gamma(t), R_t^k, R_c^k). Denoting by \varphi_c^k the channel statistics, i.e., \varphi_c^k = {\gamma(t), R_t^k}, we can express D_T^k as a function of R_c^k, \varphi_c^k, \varphi_f^k, and \varphi_s^k, i.e., D_T(R_c^k, \varphi_c^k, \varphi_f^k, \varphi_s^k). On the other hand, R_t^k = R_s^k / R_c^k, where R_s^k denotes the source

bit rate, which is a function of the quantization step size Q^k and the video frame statistics \varphi_f^k, i.e., R_s(Q^k, \varphi_f^k). Therefore, if we can derive closed-form functions for D_Q(Q^k, \varphi_f^k), D_T(PEP^k, \varphi_f^k, \varphi_s^k), and R_s(Q^k, \varphi_f^k), then (5-1) can be solved by finding the best parameter pair {Q^k, R_c^k}. In other words, the problem in (5-1) is equivalent to

min  D_Q(Q^k, \varphi_f^k) + D_T(R_c^k, \varphi_c^k, \varphi_f^k, \varphi_s^k)
s.t. \sum_{i=1}^{N^k} R_s(Q^k, \varphi_{f,i}^k) / R_{c,i}^k <= R_t^k,   (5-4)

where N^k is the total number of packets in the k-th frame, and i is the packet index. In summary, our problem in (5-4) is: given the system structure \varphi_s^k, the time-varying video frame statistics \varphi_f^k, and the time-varying channel statistics \varphi_c^k, minimize D_ETE^k by jointly controlling the parameter pairs {Q^k, R_{c,i}^k}.

5.3 Derivation of the Bit Rate Function, Quantization Distortion Function, and Transmission Distortion Function

In this section, we derive the source rate function R_s(Q^k, \varphi_f^k), the quantization distortion function D_Q(Q^k, \varphi_f^k), and the transmission distortion function D_T(PEP^k, \varphi_f^k, \varphi_s^k).

Derivation of the Source Coding Bit Rate Function

The entropy of quantized transform coefficients for an i.i.d. zero-mean Laplacian source under a uniform quantizer

Following a derivation similar to that in Refs. [38, 42, 43], it is easy to prove that for an independent and identically distributed (i.i.d.) zero-mean Laplacian source under a uniform quantizer with quantization step size Q and quantization offset \theta_2, the entropy of the quantized transform coefficients is

H = -P_0 \log_2 P_0 + (1 - P_0) ( \frac{\theta_1 \log_2 e}{1 - e^{-\theta_1}} - \log_2(1 - e^{-\theta_1}) - \theta_1 \theta_2 \log_2 e + 1 ),   (5-5)

where

\theta_1 = \sqrt{2} Q / \sigma;   (5-6)

Q is the quantization step size; \sigma is the standard deviation of the Laplacian distribution; \theta_2 is the quantization offset; and P_0 = 1 - e^{-\theta_1(1-\theta_2)} is the probability of a quantized transform coefficient being zero. (5-5) is proved in Appendix D.

Improvement with the run-length model

In a video encoder, the quantized transform coefficients are actually not i.i.d. Although we may assume that the DCT or integer transform [22] highly de-correlates neighboring pixels, different transform coefficients have very different variances. For example, in a 4x4 integer transform, the 16 coefficients show decreasing variance in the well-known zigzag scan order used in H.264. As a result, the coefficients at higher frequencies have a higher probability of being zero after quantization. On the other hand, the coefficients at lower frequencies show more randomness across levels even after quantization. Such characteristics are exploited by the run-level mapping after the zigzag scan to further increase the compressibility for entropy coding. We may regard the component of run-level mapping plus entropy coding as choosing different codebooks for different quantized transform coefficients. From information theory, we know that the entropy is a concave function of the distribution (see Ref. [28]). Therefore, not considering the mixture of 16 coefficients with different variances overestimates the entropy of the mixed transform coefficients.

To derive the joint entropy of the 16 coefficients with different variances, we need to model the variance relationship among those 16 coefficients. Having done extensive experiments, we find an interesting phenomenon³, namely that the variance is approximately

³ This phenomenon is found from samples in one frame or one GOP for CIF sequences, i.e., the number of samples is larger than

a function of position in the two-dimensional transform domain, as follows:

\sigma^2_{(x,y)} = 2^{-(x+y)} \sigma^2_0,   (5-7)

where x and y are the position indices in the two-dimensional transform domain, and \sigma^2_0 is the variance of the coefficient at position (0, 0). With (5-7), we can derive the variance \sigma^2_{(x,y)} for all positions given the average variance \sigma^2, as in Appendix D.2. Fig. 5-2 shows the true variances and the variances estimated by (D-5) for all transform coefficients before quantization in the third frame of the foreman sequence with QP = 34. We only show inter prediction modes 8x8 and 4x4 in Fig. 5-2; the results for other inter prediction modes [20] are similar. However, we also notice that, due to the high correlation among all coefficients in intra prediction modes, the true variance of the DC component is much larger than the variance estimated by (D-5). A more accurate variance model for the DC component in intra modes will be investigated in our future work.

Figure 5-2. Variance model (true and estimated variances of the transform coefficients vs. zigzag order, for the 8x8 and 4x4 modes; all blocks of the 3rd frame of foreman CIF, QP=34).
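A small sketch of the variance model (5-7) follows. The normalization that maps the measured average variance to \sigma^2_0 is an assumption standing in for (D-5) of Appendix D.2: we simply force the 16 modeled variances to average to the measured value.

import numpy as np

def coefficient_variances(avg_var, size=4):
    # Per-position variances sigma^2_{(x,y)} from the average variance,
    # using the exponential decay of (5-7).
    x = np.arange(size)
    decay = 2.0 ** -(x[:, None] + x[None, :])        # 2^{-(x+y)} for each position
    sigma0_sq = avg_var * decay.size / decay.sum()   # assumed normalization: mean of model = avg_var
    return sigma0_sq * decay

# Example: an average coefficient variance of 50 gives sigma^2_{(0,0)} of about 227.6
print(coefficient_variances(50.0))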

Then, the estimated joint entropy of the 16 non-identically distributed transform coefficients, obtained by compensating with the run-length coding model, is

H_rlc = \sum_{x=0}^{3} \sum_{y=0}^{3} H_{(x,y)},   (5-8)

where H_{(x,y)} is the entropy of the coefficient at position (x, y), and can be calculated by (D-5), (5-6), and (5-5) with its own \sigma^2_{(x,y)} and \theta_{1(x,y)}.

Practical consideration of the Laplacian assumption

Statistically speaking, (5-8) is only valid for a sufficiently large number of samples. When there are not enough samples or the sample variance is very small, e.g., Q > 3\sigma, the Laplacian assumption for individual coefficients is not accurate. In such cases, we may use the mixed-distribution estimate (5-5) instead of (5-8). That is,

H^k = { estimated by (5-8),  Q <= 3\sigma
      { estimated by (5-5),  otherwise.   (5-9)

Improvement by considering the model inaccuracy

The assumed residual probability distribution, e.g., the Laplacian distribution, may deviate significantly from the true histogram, especially when the number of samples is not sufficient. Therefore, we need to compensate the mismatch between the true residual histogram and the assumed Laplacian distribution to obtain a better estimate. Denote by H_l the entropy for the case of a Laplacian distribution, by H_t the entropy for the case of the true histogram, and let \nu = H_l / H_t. In a video sequence, the changes of residual statistics and quantization step size between adjacent frames have almost the same effect on H_l and H_t. Therefore, we may use the statistics of the previous frame to compensate the estimate from (5-8). Assuming the ratio between H_l^k and H_t^k is approximately \nu^{k-1}, i.e., H_l^k / H_t^k = H_l^{k-1} / H_t^{k-1}, (5-8) can be further compensated as

\hat{H}^k = \frac{H_t^{k-1}}{H_l^{k-1}} H^k.   (5-10)
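The entropy estimate of (5-5) through (5-10) can be sketched as follows. The function names, the \theta_2 = 1/6 rounding offset, and the per-coefficient normalization of the (5-8) branch are assumptions made for illustration only.

import numpy as np

def laplacian_entropy(Q, sigma, theta2=1.0/6.0):
    # Eqs. (5-5) and (5-6): entropy (bits per coefficient) of a uniformly
    # quantized zero-mean Laplacian source.  theta2 = 1/6 is an assumed
    # H.264-style inter rounding offset.
    theta1 = np.sqrt(2.0) * Q / sigma
    P0 = 1.0 - np.exp(-theta1 * (1.0 - theta2))
    log2e = np.log2(np.e)
    tail = (theta1 * log2e / (1.0 - np.exp(-theta1))
            - np.log2(1.0 - np.exp(-theta1))
            - theta1 * theta2 * log2e + 1.0)
    return -P0 * np.log2(P0) + (1.0 - P0) * tail

def frame_entropy(Q, coeff_vars):
    # Eq. (5-9): the (5-8) branch is normalized to bits/coefficient here
    # (an assumed normalization) so both branches are comparable.
    avg_var = np.mean(coeff_vars)
    if Q <= 3.0 * np.sqrt(avg_var):
        return np.mean([laplacian_entropy(Q, np.sqrt(v)) for v in coeff_vars])  # (5-8)
    return laplacian_entropy(Q, np.sqrt(avg_var))                               # (5-5)

def compensated_entropy(H_model_k, H_true_prev, H_model_prev):
    # Eq. (5-10): scale by the previous frame's true/model entropy ratio.
    return H_true_prev / H_model_prev * H_model_k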

Although very simple, (5-8) and (5-10) significantly improve the estimation accuracy of the residual entropy, as shown in Fig. 5-3.

Source coding bit rate estimation for the H.264 encoder

For a hybrid video coder with a block-based coding scheme, e.g., an H.264 encoder, the encoded bit rate R_s consists of residual bits R_resi, motion information bits R_mv, prediction mode bits R_mode, and syntax bits R_syntax. That is,

R_s^k = \hat{H}^k N_{resolution} N_{fps} + R_mv^k + R_mode^k + R_syntax^k,   (5-11)

where N_{resolution} is the normalized video resolution considering the color components, and N_{fps} is the number of frames per second. Compared to R_resi^k, the terms R_mv^k, R_mode^k, and R_syntax^k are less affected by Q. Therefore, R_mv^k, R_mode^k, and R_syntax^k can be estimated from the statistics of the previous frames.

Derivation of the Quantization Distortion Function

In this subsection, we improve the estimation accuracy of quantization distortion by utilizing the same techniques as in the bit rate model above. In Refs. [38, 42], the authors derive the distortion for a zero-mean Laplacian residual distribution under a uniform quantizer as

D_Q = \frac{Q^2 ( 2(1 - e^{-\theta_1}) - \theta_1 e^{\theta_2 \theta_1} (2 + \theta_1 - 2\theta_2\theta_1) e^{-\theta_1} )}{\theta_1^2 (1 - e^{-\theta_1})}.   (5-12)

Since the coefficients after the transform are not identically distributed, we need to derive the overall quantization distortion by considering each coefficient individually. Using the variance relationship among the coefficients in (5-7), we have

D_overall = \sum_{x=0}^{3} \sum_{y=0}^{3} D_{(x,y)},   (5-13)

where D_{(x,y)} is the distortion of the coefficient at position (x, y), and can be calculated by (D-5), (5-6), and (5-12) with its own \sigma^2_{(x,y)} and \theta_{1(x,y)}. When there are not enough samples or the sample variance is very small, e.g., Q > 3\sigma, the Laplacian assumption for individual coefficients is not accurate. In such

cases, we may use the mixed-distribution estimate (5-12) instead of (5-13). That is,

D_Q^k = { estimated by (5-13),  Q <= 3\sigma
        { estimated by (5-12),  otherwise.   (5-14)

Similarly, we need to compensate the mismatch between the true residual histogram and the assumed Laplacian distribution for the quantization distortion estimate. Denote by D_{Q,l} the quantization distortion for the case of a Laplacian distribution, by D_{Q,t} the quantization distortion for the case of the true histogram, and let \mu = D_{Q,l} / D_{Q,t}. Then (5-14) can be compensated as

\hat{D}_Q^k = \frac{D_{Q,t}^{k-1}}{D_{Q,l}^{k-1}} D_{Q,l}^k,   (5-15)

where D_{Q,l}^k is calculated from (5-14). (5-13) and (5-15) significantly improve the estimation accuracy of quantization distortion, as shown in Fig. 5-4.

Derivation of the Transmission Distortion Function

In this subsection, we derive the FTD as a function of SNR, transmission rate, and channel coding rate.

Transmission distortion as a function of PEP

In Chapter 2, we derived the FTD formula under single-reference motion compensation and no slice data partitioning as

D_T^k = P^k (E[(\varepsilon^k)^2] + \lambda^k E[(\xi^k)^2] + D^{k-1}) + (1 - P^k) \alpha^k D^{k-1} (1 - \beta^k).   (5-16)

Here P^k is the weighted average PEP of all packets in the k-th frame; \varepsilon^k is the residual concealment error; \xi^k is the MV concealment error; \beta^k is the percentage of encoded I-MBs in the k-th frame; both the propagation factor \alpha^k and the correlation ratio \lambda^k depend on the video statistics, channel condition, and codec structure, and are therefore called system parameters; and D^{k-1} is the transmission distortion of the (k-1)-th frame, which can be iteratively calculated by (5-16).
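A minimal sketch of the recursion (5-16) is given below; the numerical values in the example loop are arbitrary placeholders, not measured statistics.

def ftd_next(P, eps_sq, xi_sq, D_prev, alpha, beta, lam):
    # One step of the frame-level transmission distortion recursion (5-16).
    #   P       weighted-average packet error probability of the frame
    #   eps_sq  E[(residual concealment error)^2]
    #   xi_sq   E[(MV concealment error)^2]
    #   D_prev  transmission distortion of the previous frame
    #   alpha   propagation factor
    #   beta    fraction of intra-coded MBs in the frame
    #   lam     correlation ratio
    return P * (eps_sq + lam * xi_sq + D_prev) + (1.0 - P) * alpha * (1.0 - beta) * D_prev

# Example: distortion accumulation over 30 frames with constant (made-up) statistics
D = 0.0
for _ in range(30):
    D = ftd_next(P=0.02, eps_sq=40.0, xi_sq=25.0, D_prev=D, alpha=0.9, beta=0.05, lam=1.0)
print(D)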

P^k is defined as P^k \triangleq \frac{1}{|V^k|} \sum_{i=1}^{N^k} (P_i^k N_i^k), where N_i^k is the number of pixels contained in the i-th packet of the k-th frame; P_i^k is the PEP of the i-th packet of the k-th frame; and N^k is the total number of packets in the k-th frame. The other video frame statistics and system parameters can be easily estimated as described in Chapter 3. We describe how to estimate the PEP in the following subsections.

PEP as a function of SNR, transmission rate, and channel coding rate in a fading channel

Below, we analyze the conditional PEP of a convolutional coding scheme over a wireless fading channel, given the SNR. Since convolutional codes are linear codes, the probability of error can be obtained by assuming that the all-zero sequence is transmitted and determining the probability that the decoder decides in favor of a different sequence [18]. The probability of mistaking the transmitted sequence for a sequence at Hamming distance d is called the pairwise error probability and is denoted P_2(d). With soft decision, if the coded symbols output from the convolutional encoder are sent over an AWGN channel using coherent BPSK modulation with energy E_c = R_c E_b, then it can be shown that

P_2(d) = Q(\sqrt{2 E_c d / N_0}) = Q(\sqrt{2 \gamma d}).   (5-17)

Before calculating the PEP, we need to analyze the first-error probability, which is defined as the probability that another path that merges with the all-zero path at a given node has a metric that exceeds the metric of the all-zero path for the first time [52]. According to this definition, the first-error probability can be approximated by its upper bound, i.e., the probability of mistaking the all-zero path for another path through the trellis:

P_fe <= \sum_{d=d_free}^{d_max} W_d P_2(d),   (5-18)

where W_d is the weight spectrum of the specific convolutional code; d_free is the free distance of the specific convolutional code; and d_max is the maximum distance between the

transmitted sequence and the decoded sequence.⁴ As a result, the PEP for a block of L decoded bits and for a given SNR can be upper-bounded as [53, 54]

PEP(\gamma) <= 1 - (1 - P_fe(\gamma))^L \approx L P_fe(\gamma).   (5-19)

However, both upper bounds in (5-18) and (5-19) are only tight when \gamma is large. When \gamma is small, such as in a fading channel, the resulting bound may be much larger than 1, i.e., L P_fe(\gamma) >= 1. From our experimental results, we find that PEP(\gamma) follows a waterfall shape as \gamma increases; that is, there exists a threshold \gamma_th such that, when \gamma > \gamma_th, the bound is quite tight, and when \gamma < \gamma_th, the bound becomes very loose and quickly exceeds 1. Therefore, we improve the performance bound by using the following formula:

PEP(\gamma) <= { \frac{R_t R_c}{N^k N_fps} \sum_{d=d_free}^{d_max} W_d P_2(d, \gamma),  \gamma >= \gamma_th
             { 1,                                                                     \gamma < \gamma_th,   (5-20)

where \gamma_th can be numerically calculated from (5-21), given the convolutional encoder structure (W_d, d_free, and d_max), the coding rate (R_c), and the modulation scheme (P_2(d)). Note that W_d, d_free, and d_max in (5-20) are functions of R_c in RCPC. (5-20) is quite accurate, as shown in Fig. 5-5, where PEP_th = 1.

\frac{R_t R_c}{N^k N_fps} \sum_{d=d_free}^{d_max} W_d P_2(d, \gamma_th) = PEP_th.   (5-21)

Note that a change in the modulation and demodulation technique used to transmit the coded information sequence affects only the computation of P_2(d) [52]. Therefore, (5-20) is general for any modulation scheme.

⁴ d_max differs from the formula in Ref. [52] because the code is truncated by the packet length; this improves the upper bound, but the effect on performance is negligible when the packet length is large [52].
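A sketch of (5-20) and (5-21) for coherent BPSK follows. Here L plays the role of R_t R_c / (N^k N_fps) (the number of decoded bits per packet), PEP_th = 1 as in the text, and the weight spectrum is the truncated rate-2/3 entry of Table 5-1; the SciPy-based root search is an implementation choice, not part of the derivation.

import numpy as np
from scipy.special import erfc
from scipy.optimize import brentq

def qfunc(x):                       # Gaussian tail function used in (5-17)
    return 0.5 * erfc(x / np.sqrt(2.0))

def union_bound(gamma, weights, d_free, L):
    # L * sum_d W_d P_2(d, gamma), i.e., (5-18)-(5-19)
    d = np.arange(d_free, d_free + len(weights))
    return L * np.dot(weights, qfunc(np.sqrt(2.0 * gamma * d)))

def gamma_threshold(weights, d_free, L, pep_th=1.0):
    # Eq. (5-21): SNR at which the union bound equals PEP_th
    return brentq(lambda g: union_bound(g, weights, d_free, L) - pep_th, 1e-3, 100.0)

def pep_estimate(gamma, weights, d_free, L, pep_th=1.0):
    # Eq. (5-20): use the bound above gamma_th, and 1 below it
    g_th = gamma_threshold(weights, d_free, L, pep_th)
    return union_bound(gamma, weights, d_free, L) if gamma >= g_th else 1.0

# First terms of the rate-2/3 weight spectrum from Table 5-1 (truncated)
w_23 = np.array([1, 16, 48, 158, 642, 2435, 9174, 34701], dtype=float)
print(gamma_threshold(w_23, d_free=6, L=2000),
      pep_estimate(4.0, w_23, d_free=6, L=2000))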

In a real-time video communication system, if the estimated PEP(\gamma) is larger than a threshold value, i.e., PEP(\gamma) > PEP_th, the transmitter may discard the packet instead of transmitting it.⁵ The benefit of doing this is threefold: 1) if PEP(\gamma) is large, it is a waste of energy and time to transmit the packet, so using (5-20) saves transmission energy; 2) in cross-layer rate control, since the video encoder has knowledge of the channel condition, the encoder will skip encoding the current frame when the channel gain is very low, which saves encoding energy; 3) if the current frame is skipped, the video encoder will use previously encoded frames as references for encoding the following frames, which reduces reference error propagation.

(5-20) is derived under the condition that \gamma is known at the transmitter through channel estimation. In some wireless systems, \gamma is unknown at the transmitter, e.g., when there is no feedback channel. In such cases, the expected PEP, i.e., E_\gamma[PEP], instead of the PEP is used for estimating the transmission distortion, given the probability distribution of the channel gain. Proposition 5 gives the expected PEP under a Rayleigh block fading channel.

Proposition 5. Under a Rayleigh block fading channel with average SNR \bar{\gamma}, the expected PEP is given by

E_\gamma[PEP] = (1 - e^{-\gamma_th/\bar{\gamma}}) + \frac{e^{-\gamma_th/\bar{\gamma}}}{1 + d_free \bar{\gamma}},   (5-22)

where \gamma_th is defined by (5-21). Proposition 5 is proved in Appendix D.3. We see from (D-7) that if \gamma_th << \bar{\gamma}, then E_\gamma[PEP] \approx 1 - e^{-\gamma_th/\bar{\gamma}}. So, to keep the PEP at a reasonable level, the transmitter should set its transmission power such that \bar{\gamma} >> \gamma_th before transmitting the packet.

⁵ In some delay-insensitive applications, e.g., streaming video, a buffer is used to hold packets when the channel condition is poor. In such cases, the packet is dropped at the transmitter only when the queue buffer is full or the delay bound is violated, which decreases the PEP.
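Rather than relying on the closed form of Proposition 5, the following sketch numerically averages the thresholded bound (5-20) over an exponentially distributed instantaneous SNR, which is the Rayleigh block fading model used here; \gamma_th would come from (5-21), and the value passed in the example is an illustrative placeholder.

import numpy as np
from scipy.special import erfc

def qfunc(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

def expected_pep(avg_snr, gamma_th, weights, d_free, L, samples=200000, seed=1):
    # Monte-Carlo average of the thresholded bound (5-20) over a Rayleigh
    # block fading channel: the instantaneous SNR is exponentially
    # distributed with mean avg_snr.
    rng = np.random.default_rng(seed)
    gamma = rng.exponential(avg_snr, size=samples)
    d = np.arange(d_free, d_free + len(weights))
    bound = L * (qfunc(np.sqrt(2.0 * np.outer(gamma, d))) @ weights)
    bound = np.minimum(bound, 1.0)
    bound[gamma < gamma_th] = 1.0          # below the threshold, PEP is taken as 1
    return bound.mean()

w_23 = np.array([1, 16, 48, 158, 642, 2435, 9174, 34701], dtype=float)
print(expected_pep(avg_snr=10.0, gamma_th=1.4, weights=w_23, d_free=6, L=2000))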

Transmission distortion as a function of SNR, transmission rate, and channel coding rate in a fading channel

In the case of adaptive modulation, adaptive transmission power, and adaptive bandwidth (subcarrier) allocation, P_2(d) is a function of the modulation order M, the transmission power P_t, and the passband bandwidth B. In this chapter, we study the case where modulation, power, and bandwidth are all given during cross-layer rate control. Under such conditions, both the transmission bit rate R_t and the SNR \gamma are known values. For example, with modulation order M and Nyquist pulse shaping, R_t = B \log_2(M) and \gamma = P_t T_t g / N_0. As a result, both the PEP and D_T depend only on the tuning parameter R_c.

5.4 Rate-Distortion Optimized Cross-layer Rate Control and Algorithm Design

In this section, we apply the models derived in Section 5.3 to the cross-layer rate control application. We adopt the discrete version of the Lagrange multiplier method, as used in JM [33], to obtain the R-D optimized parameter pair {Q^k, R_{c,i}^k}. We also design a practical cross-layer rate control algorithm.

Optimization of the Cross-layer Rate Control Problem

To solve (5-4), we may use either Lagrangian approaches or dynamic programming approaches [44]. In terms of complexity, the Lagrangian approach is preferable, since it can be run independently for each coding unit, whereas dynamic programming requires a tree to be grown. Note that the complexity of dynamic programming approaches can grow exponentially with the number of coding units considered, while the Lagrangian approach's complexity grows only linearly [44]. By using the theorem in Refs. [50, 51], we may apply the Lagrangian approach to the i-th packet in the k-th frame independently as

(Q_i^{k*}, R_{c,i}^{k*}) = arg min { D_Q(Q_i^k) + D_T(R_{c,i}^k) + \lambda \frac{R_s(Q_i^k)}{R_{c,i}^k} },   (5-23)

where \lambda is the preset Lagrange multiplier, which can be determined either by bisection search [50, 55] or by modeling [33, 56].
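The per-packet selection (5-23) can be sketched as a direct search over a discrete grid of candidates. D_Q, D_T, and R_s are assumed to be callables wrapping the models of Section 5.3 evaluated for this packet; the candidate lists and the function name are illustrative.

def choose_q_rc(q_candidates, rc_candidates, D_Q, D_T, R_s, lam):
    # Discrete Lagrangian selection of (5-23) for one packet:
    # minimize D_Q(q) + D_T(rc) + lam * R_s(q) / rc over the candidate grid.
    best, best_cost = None, float("inf")
    for q in q_candidates:
        for rc in rc_candidates:
            cost = D_Q(q) + D_T(rc) + lam * R_s(q) / rc
            if cost < best_cost:
                best, best_cost = (q, rc), cost
    return best, best_cost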

For some special cases, e.g., video conferencing, the frame size is usually small. In such cases, each frame is transmitted in one packet, and the bit allocation problem can therefore be simplified. To be more specific, since all bits are allocated to one packet, given \gamma and R_t, every R_c has a corresponding R_s; every R_s has a corresponding Q and therefore a corresponding D_Q (by (5-11), (5-10), (5-5), and (5-14)). As mentioned in the preceding subsection, D_T is also a function of R_c. In other words, the end-to-end distortion D_ETE^k depends only on R_c. Therefore, there exists an optimum R_c^k such that D_ETE is minimized, and the Lagrange multiplier can be omitted. That is, the optimum R_c^k can be obtained by comparing D_ETE for all possible R_c, and the optimum Q^k can be calculated from the corresponding R_c^k.

Algorithm Design

In this subsection, we propose a practical algorithm for cross-layer rate-distortion optimization, as follows.

Algorithm 4. Cross-layer optimized quantization step size Q and channel coding rate R_c decision for the k-th frame.
1) Input: R_t, \gamma, PEP_th.
2) Initialization of Q^0 and R_c^0 for the first frame, i.e., k = 1. If k > 1, go to 3).
3a) If N^k > 1, i.e., the frame is contained in more than one packet:
    Initialize \Lambda_0 by the method proposed in Ref. [50].
    Loop for \Lambda_j = \Lambda_0, \Lambda_1, ..., \Lambda^*:
        For packet index i from 1 to N^k:
            For each packet, loop over all combinations of {Q, R_c} under the given \Lambda_j:
                calculate \gamma_th by (5-21),
                estimate P_i^k for all packets by (5-20),
                estimate D_T by (5-16),
                estimate \theta_1 by (5-6),
                estimate D_Q by (5-15),

                calculate D_ETE(Q, R_c) by (5-3),
                estimate R_{s,i}^k by (5-5), (5-8), (5-10), and (5-11),
            End
            obtain the best {Q_i^k(\Lambda_j), R_{c,i}^k(\Lambda_j)} via (5-23),
        End
        estimate R_t^k from R_{s,i}^k and R_{c,i}^k,
        calculate \Lambda_{j+1},
    End
    obtain the best {Q_i^k, R_{c,i}^k}, i.e., {Q_i^k(\Lambda^*), R_{c,i}^k(\Lambda^*)}, for each packet.
3b) If N^k = 1, i.e., the frame is contained in one packet:
    Loop over all channel coding rates:
        calculate \gamma_th by (5-21),
        estimate the PEP of the k-th frame by (5-20),
        estimate D_T^k by (5-16),
        estimate \hat{H}^k by (5-11),
        estimate Q^k by (5-10), (5-8), (5-5), and (5-6),
        estimate \hat{D}_Q^k by (5-15),
        calculate D_ETE(R_c) by (5-3),
    End
    select the best R_c^k and the corresponding Q^k with the minimum end-to-end distortion.
4) Output: the best {Q_i^k, R_{c,i}^k}.

Algorithm 4 is referred to as CLRC. Note that in Algorithm 4, the iterations to acquire the best Lagrange multiplier \Lambda^* use bisection search [50, 55]. Since the loop over all combinations of {Q, R_c} is executed for each candidate Lagrange multiplier, the complexity is very high. We may also use the modeling method [33, 56] instead of bisection search to design CLRC. In such a case, the R-D optimized {Q, R_c} decision is

similar to the R-D optimized mode decision in Ref. [33], except for three differences: 1) the mode decision is replaced by the channel coding rate decision given the Lagrange multiplier; 2) the quantization distortion is replaced by the end-to-end distortion; 3) the source coding bit rate is replaced by the transmission bit rate. Note that the modeling method reduces the complexity of estimating the best {Q_i^k, R_{c,i}^k} at the cost of accuracy.

5.5 Experimental Results

In Section 5.5.1, we verify the accuracy of our proposed models. Then, in Section 5.5.2, we compare the performance of our CLRC algorithm with that of existing rate control algorithms.

Model Accuracy

In this subsection, we test the bit rate model proposed in (5-10), the distortion model proposed in (5-15), and the PEP model proposed in (5-20).

Bit rate model

The JM16.0 encoder is used to collect the true distortion and the required statistics. Fig. 5-3 shows the true residual bit rate and the estimated residual bit rate for foreman and mobile for the first 20 frames, in order to keep the different curves distinguishable. In Fig. 5-3, "True bpp" means the true bits per pixel (bpp) produced by the JM16.0 encoder; "without rlc" means the bpp estimated by (5-5); "with rlc" means the bpp estimated by (5-8); "without compensation" means the bpp estimated by (5-9); "with compensation" means the bpp estimated by (5-9) and (5-10); "Rho-domain" means the bpp estimated by Refs. [10, 57]; and "Xiang's model" means the bpp estimated by Refs. [38, 58]. We can see that the estimation accuracy is improved by (5-8) when the true bpp is relatively large. However, when the true bpp is small, "without rlc" gives higher estimation accuracy. By utilizing the statistics of the previous frame via (5-10), the estimation accuracy is further improved. We also find that the Rho-domain model is accurate at low bpp but not at high bpp. For Xiang's model, the estimated bpp is smaller than the true bpp in most cases. Note that we also wanted to compare the bit rate model

used in JM16.0. However, due to the estimation error of its model parameters, the first few frames may abnormally underestimate the quantization step size Q. Therefore, the rate control algorithm in JM16.0 uses three parameters, i.e., RCMinQPPSlice, RCMaxQPPSlice, and RCMaxQPChange, to reduce the effect of the estimation error. Their default values are 8, 42, and 4, respectively. However, we believe a good rate control algorithm should depend mainly on the model accuracy rather than on such manually chosen thresholds. When those parameters are set to 0, 51, and 51, the estimated QP can even be 0 in the first few frames. That is, the first few frames consume most of the allocated bits, and only a few bits remain available for the remaining frames in JM. Therefore, we do not test its model accuracy in this subsection; instead, we plot its R-D performance in Section 5.5.2.

Figure 5-3. bpp vs. frame index: (a) foreman, (b) mobile.

Quantization distortion model

Fig. 5-4 shows the quantization distortion corresponding to each bit rate curve in Fig. 5-3. Note that since Refs. [38, 58] directly use (5-12) to estimate the quantization distortion, "without rlc" refers to the quantization distortion estimated by both (5-12) and Xiang's model. Similar to Fig. 5-3, we can see that the estimation accuracy is improved by (5-13) when \theta_1 is small, i.e., when the quantization distortion is relatively small. However, when the quantization step size is large, (5-12) is more accurate than (5-13). Note that this relative comparison is for the same video sequence. For different video sequences, since the

residual variances are different, in order to achieve the same bit rate, sequences with larger variance, e.g., mobile, use a larger quantization step size than sequences with smaller variance, e.g., foreman. Different from the bit rate model, which depends only on \theta_1, the quantization distortion model depends on both Q and \theta_1. Therefore, we cannot use the absolute value of the quantization distortion of two sequences to compare the estimation accuracy of (5-12) and (5-13). After normalizing by the factor Q^2 in (5-12) and (5-13), their relative accuracy holds in most cases. However, in some rare cases, (5-13) is more accurate than (5-12) even when Q > 3\sigma; this can be observed for frame indices 14 to 17 in the foreman sequence. We still need to investigate the reason behind this to further improve our model accuracy. For all cases, the estimation accuracy is improved by utilizing the statistics of the previous frame via (5-15). Similar to Fig. 5-3, the rho-domain model is more accurate at large \theta_1, i.e., at low bit rate or relatively large quantization distortion, than at small \theta_1.

Figure 5-4. Quantization distortion vs. frame index: (a) foreman, (b) mobile.

PEP model

Here we verify the accuracy of the PEP model derived in (5-20). We use the RCPC codes from Tables I-VI in Ref. [59]. To be more specific, we choose a typical convolutional encoder structure with constraint length 7, i.e., 6 memory registers, G1 = 133, and G2 = 171. The channel coding rates are 2/3, 3/4, 4/5, 5/6, 6/7, and 7/8. For completeness, we list all encoder parameters in Table 5-1. The Viterbi algorithm is used to decode the received

bits in the presence of noise. BPSK modulation is used. Each packet contains 2000 information bits. For each SNR and channel coding rate, 1000 packets are simulated to collect the true packet error rate (PER).

Table 5-1. RCPC encoder parameters
Coding rate   Puncturing matrix                     d_free   Weight spectrum
2/3           [1 1, 1 0]                            6        1, 16, 48, 158, 642, 2435, 9174, 34701, ..., ...
3/4           [1 1, 1 0, 0 1]                       5        8, 31, 160, 892, 4512, 23297, ..., ..., ..., ...
4/5           [1 1, 1 0, 1 0, 1 0]                  4        3, 24, 172, 1158, 7408, 48706, ..., ..., ..., ...
5/6           [1 1, 1 0, 0 1, 1 0, 0 1]             4        14, 69, 654, 4996, 39677, ..., ..., ..., ..., ...
6/7           [1 1, 1 0, 1 0, 0 1, 1 0, 0 1]        3        1, 20, 223, 1961, 18084, ..., ..., ..., ..., ...
7/8           [1 1, 1 0, 1 0, 1 0, 0 1, 1 0, 0 1]   3        2, 46, 499, 5291, 56137, ..., ..., ..., ..., ...

Fig. 5-5 shows the true PER and the estimated PEP from the upper bound in (5-20). We can see that the estimated PEP curve is only about 1 dB higher than the corresponding true PER curve.⁶

Performance Comparison

In this subsection, we show both the objective performance and the subjective performance of the CLRC algorithm. In order to see the gain achieved by channel estimation, we also compare the performance achieved by (5-20) with that achieved by (5-22). This result may serve as a guideline for system design to balance performance and cost. Using (5-22), we can also compare the performance gain of our models over existing models.

⁶ From the experimental results, we observe that the estimated PEP curve shows a constant offset from the true PER curve for a given RCPC encoder structure, and different RCPC encoder structures show different offsets. We may utilize this observation to further improve the PEP model in our future work.

Figure 5-5. PEP under different RCPC coding rates. (For each coding rate Rc from 2/3 to 7/8, the plot shows the true PER and the estimated PEP versus Eb/N0; packet length = 2000 bits.)

Experiment setup

The JM16.0 encoder and decoder [33] are used in the experiments. All tested video sequences are in CIF resolution at 30 fps. Each video sequence is encoded for its first 30 frames, where the first frame is an I-frame and the following frames are P-frames. The error concealment method is to copy the pixel value at the same position in the previous frame. The first frame is assumed to be correctly received, thanks to sufficient channel protection or timely acknowledgement feedback. The encoder settings are as follows: constrained intra prediction is enabled; the number of reference frames is 5; B slices are not included; only the 4x4 transform is used; CABAC is enabled for entropy coding. For all rate control algorithms, the first frame uses a fixed QP, i.e., QP = 28. Each coded video sequence is tested under different Rayleigh fading channels, i.e., different combinations of bandwidth from 100 kbps to 1 Mbps and average SNR from 4 dB to 10 dB. For each specific channel condition, we simulate 300 random packet error processes to mitigate the effect of error randomness on each frame. The RCPC codes and modulation are the same as those described above.
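To illustrate the channel model used in these experiments, the sketch below draws one SNR sample per frame timeslot from a Rayleigh fading channel (the instantaneous SNR of a Rayleigh channel is exponentially distributed with mean equal to the average SNR) and then draws one random packet error process from a packet-error-probability model supplied as a callable. The block-fading-per-frame assumption, the function names, and the my_pep_model placeholder are illustrative, not the exact simulator used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def rayleigh_snr_samples_db(avg_snr_db, n_frames=30):
    """One instantaneous-SNR sample (in dB) per frame timeslot for Rayleigh fading."""
    avg_snr_lin = 10.0 ** (avg_snr_db / 10.0)
    return 10.0 * np.log10(rng.exponential(scale=avg_snr_lin, size=n_frames))

def simulate_packet_errors(snr_db_per_frame, packets_per_frame, pep_of_snr_db):
    """One random packet error process: Bernoulli losses drawn from a PEP model."""
    return [rng.random(n_pkt) < pep_of_snr_db(snr_db)
            for snr_db, n_pkt in zip(snr_db_per_frame, packets_per_frame)]

# e.g., repeat 300 times per channel condition, as in the experiments above:
# losses = [simulate_packet_errors(snr_db, pkts_per_frame, my_pep_model)
#           for _ in range(300)]
```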

PSNR performance

Fig. 5-6 shows the Y-component PSNR vs. average SNR for foreman and mobile. In Fig. 5-6, proposed-constant-PEP denotes the performance achieved by our models without channel estimation, i.e., using (5-22); constant-PEP denotes the performance achieved by the default rate control algorithm in JM16.0, i.e., JVT-H017r3 [33, 60, 61], without channel estimation. For each algorithm, we test two parameter settings of (RCMinQPPSlice, RCMaxQPPSlice, RCMaxQPChange), namely (8, 42, 4) and (0, 51, 51), to see how accurate the models are under different manually set thresholds. The experimental results show that, under the same QP-limitation range, CLRC achieves up to 5 dB PSNR gain for foreman and up to 4 dB PSNR gain for mobile over no channel estimation. We also observe that both CLRC and proposed-constant-PEP give very stable results when the QP-limitation range varies, whereas constant-PEP gives very different results under different QP-limitation ranges. To be more specific, for constant-PEP the smaller QP-limitation range gives up to 3 dB PSNR gain over the larger QP-limitation range. This phenomenon further confirms the higher accuracy of our models.

Figure 5-6. PSNR vs. average SNR: (a) foreman, (b) mobile.

Note that in Ref. [10], the authors also propose a bit rate model, a quantization distortion model, and a transmission distortion model for solving the joint source-channel rate control problem. However, in both that bit rate model and that quantization distortion model, only the model parameter, i.e., rho, can be estimated from a given bit rate or quantization distortion. In order to estimate the quantization step size or QP before real encoding, those models require prior knowledge of the residual histogram [62, 63]. Since H.263 encoders usually use the mean squared error (MSE) as the criterion for motion estimation, this kind of prior knowledge is accessible in H.263 after motion estimation and before quantization. However, it is not available in H.264 encoders, since the R-D cost, rather than the MSE, is adopted as the criterion for motion estimation and mode decision. The R-D cost function involves a Lagrange multiplier, which can only be determined after QP is known. Therefore, their bit rate model encounters a chicken-and-egg problem if one tries to apply it to estimate the quantization step size in H.264 encoders. For this reason, we do not implement the models of Ref. [10] for cross-layer rate control in the H.264 encoder [33]. Note that since the model parameters in Ref. [10] are attainable after real encoding, we still compare their model accuracy in the model-validation results above. For the accuracy comparison between our transmission distortion model and the transmission distortion model in Ref. [10], please refer to Chapter 3.
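To make the chicken-and-egg problem concrete, recall that the JM encoder commonly sets the mode-decision Lagrange multiplier to 0.85·2^((QP−12)/3), so the R-D cost that drives motion estimation and mode decision already presupposes a QP, while the rho-domain models of Ref. [10] need the residuals produced by that very motion estimation before they can suggest a QP. The following minimal sketch only illustrates this dependency; the function names are ours.

```python
def lambda_mode(qp):
    """Mode-decision Lagrange multiplier as commonly used in the JM encoder."""
    return 0.85 * 2.0 ** ((qp - 12) / 3.0)

def rd_cost(distortion, bits, qp):
    # J = D + lambda * R: the residuals (and hence their histogram) depend on the
    # motion/mode choices made with this cost, which in turn depends on QP.
    return distortion + lambda_mode(qp) * bits
```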

Fig. 5-7 shows the Y-component PSNR vs. bandwidth for foreman and mobile. We see results similar to those in Fig. 5-6, that is: 1) CLRC achieves the best performance; 2) both CLRC and proposed-constant-PEP give very stable results when the QP-limitation range varies, while constant-PEP gives very different results under different QP-limitation ranges. However, we also observe in our experiments that the PSNR shows more randomness for a given SNR in Fig. 5-7 than in Fig. 5-6. For example, for mobile the PSNR at 400 kbps is even higher than the PSNR at all other bit rates. After investigation, we find that this is due to the randomness of the generated SNR sample sequence for a given average SNR in a fading channel. In other words, in a fading channel the randomness of the SNR sample sequence has more impact on distortion than the bit rate does. To mitigate the effect of this randomness, we would have to simulate a sufficiently large number of SNR sample sequences for each average SNR. Unfortunately, this is prohibitively time-consuming and therefore impractical. For example, if we simulate many SNR sample sequences with 30 frames per sequence for an average SNR of 8 dB, with 300 random packet error processes per SNR sample sequence to mitigate the effect of error randomness on each frame, then plotting PSNR vs. bandwidth for 6 algorithms and settings at 4 bit rates requires an enormous number of frame encoding and decoding operations, taking at least 16,000 hours with JM16.0 [33] on a PC with a 2.29 GHz CPU.
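The following back-of-the-envelope sketch gives one plausible way to count the required operations; the number of SNR sample sequences is a hypothetical placeholder (the exact count is not reproduced here), while the other counts follow the setup described above.

```python
# One plausible accounting of the simulation cost (placeholder values marked).
n_snr_sequences   = 100    # hypothetical placeholder for the number of SNR sample sequences
n_error_processes = 300    # random packet error processes per SNR sequence
n_settings        = 6      # algorithms and parameter settings being compared
n_bitrates        = 4      # bandwidths on the horizontal axis of Fig. 5-7
n_frames          = 30     # frames per coded sequence

encode_ops = n_snr_sequences * n_settings * n_bitrates * n_frames
decode_ops = encode_ops * n_error_processes
print(f"{encode_ops:,} frame encodes and {decode_ops:,} frame decodes")
```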

Figure 5-7. PSNR vs. bandwidth: (a) foreman, (b) mobile.

Subjective performance

Since the PSNR can be less meaningful under error concealment, a much more important performance criterion is the subjective performance, which directly relates to the degree of the user's satisfaction. By utilizing the channel information, i.e., SNR and bandwidth, our CLRC algorithm intelligently chooses the reference frames that are transmitted under the best channel conditions and neglects those reference frames that experience poor channel conditions. As a result, the well-known error propagation problem is prevented already during the encoding process. To illustrate the subjective performance, we plot four frames from the foreman sequence. Fig. 5-8(a) shows a random channel sample under average SNR = 10 dB and bit rate = 1000 kbps; Fig. 5-8(b) shows distortion vs. frame index for foreman CIF under this channel; Fig. 5-9 shows the corresponding subjective quality of the reconstructed frames.

Figure 5-8. A random channel sample under average SNR = 10 dB and bit rate = 1000 kbps: (a) a random SNR sample, (b) distortion vs. frame index for foreman CIF under this channel.

We see that, due to the low channel SNR during the timeslots of the 10-th frame, the encoder with CLRC skips encoding these three frames to save encoding and transmission energy. Since no packets are transmitted, the reconstructed pictures of those three frames at the decoder are the same as at the encoder. Then, when the channel condition recovers at the 11-th frame, the encoder with CLRC uses the 9-th frame as the reference to reconstruct the 11-th frame. Since the channel condition is good in the timeslot of the 11-th frame, there is no transmission distortion at the decoder, and error propagation into the following frames is prevented. For the encoder without channel estimation, the 10-th frame is encoded and transmitted. Due to the low channel SNR during the timeslots of the 10-th frame, the packets are received with errors, and the resulting PSNR is therefore almost the same as that of the encoder with CLRC. However, without channel information, the encoder still uses the 10-th frame as one of the references for encoding the 11-th frame. Therefore, although the 11-th frame is correctly received thanks to the good channel condition, the reconstruction error of the 10-th frame propagates into the 11-th frame at the decoder, which causes both lower subjective quality and lower PSNR compared with the encoder with CLRC. In Fig. 5-9, due to the space limit, we only show the subjective quality for the encoder with constant-PEP under the default QP-limitation range. As we may foresee, the subjective quality for the encoder with constant-PEP under the maximum QP-limitation range is the worst among all cases.

Figure 5-9. For the 10-th frame: (a) original, (b) CLRC, (c) proposed-constant-PEP, (d) constant-PEP-QP-limit; for the 11-th frame: (e) original, (f) CLRC, (g) proposed-constant-PEP, (h) constant-PEP-QP-limit.
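The encoder-side behaviour just described can be summarised by a simple decision rule: skip encoding a frame whose packets would very likely be lost, and keep badly received frames out of the reference list so that mismatch never enters the prediction loop. The sketch below is illustrative only; the threshold, data structures, and function name are our assumptions, not the actual CLRC implementation.

```python
def plan_frame(est_pep, past_frames, pep_skip_threshold=0.9):
    """past_frames: list of (frame_index, estimated PEP when that frame was sent)."""
    if est_pep > pep_skip_threshold:
        # Channel too poor: skip encoding/transmission; encoder and decoder keep
        # the same previous reconstruction, so no encoder-decoder mismatch arises.
        return {"encode": False, "references": []}
    # Channel acceptable: encode, but reference only frames transmitted under good
    # conditions (e.g., the 9-th frame rather than the badly received 10-th frame).
    refs = [idx for idx, pep in past_frames if pep <= pep_skip_threshold]
    return {"encode": True, "references": refs}
```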
