A MODEL OF JERKINESS FOR TEMPORAL IMPAIRMENTS IN VIDEO TRANSMISSION. S. Borer. SwissQual AG 4528 Zuchwil, Switzerland

www.swissqual.com, silvio.borer@swissqual.com

ABSTRACT

Transmission of digital video over band-limited and possibly error-prone channels is an important source of temporal impairments, such as reduced frame rates, frame freezes, and frame drops. These often result in a non-fluent, non-smooth presentation of the video during playback; perceptually, this is called jerkiness. In this paper a model of perceived jerkiness is proposed. It can be applied to video sequences showing any pattern of jerkiness, with constant or variable frame rate, and it includes a parametrisation of the viewing condition, so it can be used to predict jerkiness from small to large resolutions. The model is compared to data from subjective experiments containing a large variety of temporal degradations and spanning the whole range of currently used video resolutions, from QCIF up to HD. On this data it shows excellent performance. The model does not require any comparison to a reference signal and can therefore be applied in no-reference monitoring approaches.

Index Terms: Jerkiness, Video Quality, Digital Video Transmission

1. INTRODUCTION

Digital video transmission systems consist of a chain of a video encoder, a network, a decoder, and a video player. Due to the limited bandwidth of the transmission, the encoder may reduce the frame rate of the video to save bandwidth. In addition, over an error-prone network, video transmission may be delayed or information may get lost. As a result, during playback, the video player may repeat frames, drop frames, or display frames exceedingly long. All of this introduces temporal impairments of the video playback. Technically, temporal variations in the video playback are called jitter, and the exceedingly long display of a frame is called freezing.
For an end-user, these impairments may lead to a perceptual degradation called jerkiness. In this paper we present a jerkiness measure. It uses information from a video sequence captured at the output of a video player. It is a so-called no-reference measure, i.e. no information about the video before encoding and transmission is used. Our proposed jerkiness measure shows a high correspondence with subjective ratings. A typical application for this no-reference jerkiness model is quality monitoring.

Several other studies on the relation between temporal impairments and perceptual quality exist. The effect of frame rate reduction for low-bitrate transmission on perceived video quality was studied in [3] for videos of QCIF resolution. The impact of frame rate reduction, jitter, and freezing was studied in [2] for the larger VGA resolution. Both studies used the same test methodology as the data in this paper, so the model predictions can be compared to their data. The relation between short and long freezing and perceived quality was studied in [6], using a different subjective test methodology; based on this data, a model was proposed in [5]. A different model was proposed in [4], operating on compressed video sequences. As the mathematical forms as well as the evaluation methodologies differ, it is difficult to compare the two.

Our model differs from the above two models in three aspects. First, it includes a dependence on frame resolution and viewing condition, so it can be applied to any frame size from QCIF up to HD. Second, the formulation is more general and can be used for any fixed or variable frame rate video sequence. Third, the mathematical form of the jerkiness measure is different.

The paper is organised as follows: First, section 2 introduces notation and concepts. The central part is section 3, where the jerkiness model is introduced.
Section 4 describes the model's dependence on the viewing condition and the frame resolution. In section 5 the performance of the jerkiness model is evaluated. Finally, section 6 concludes with a summary of the main results and future research topics.

978-1-4244-6960-4/10/$26.00 © 2010 IEEE — QoMEX 2010

2. NOTATION AND CONCEPTS

The input of the jerkiness model is a video sequence, captured at the output of the video player. Each new frame shown by the player is captured as an image together with a time stamp, the display start of the

frame. Depending on the situation, the captured video sequences may have a variable frame rate.

More formally, a video sequence is a sequence of the form

    v = (f_i, t_i)_{i=1,...,n},    (1)

where f_i denotes frame i of the video, displayed from time t_i up to time t_{i+1}, and n is the total number of frames. If the total number of frames in the sequence is clear from the context, or is of minor importance, the shorthand notation (f_i, t_i) will be used. The times t_i will be referred to as time stamps, and

    Δt_i = t_{i+1} - t_i    (2)

is called the display time of frame i. For each video sequence as input, the output of the jerkiness model is a positive value, the estimated or predicted jerkiness; therefore, the term jerkiness measure will also be used. The jerkiness measure has two main variables, the display time of the frames and the motion intensity. The next paragraph describes how motion is measured.

Ideally, motion would be measured by calculating some statistic of the velocity distribution of the moving objects in the video sequence. As this might be very complicated, a very simple motion measure is used here. For a video sequence v = (f_i, t_i) the motion intensity is the root mean squared inter-frame difference over the whole frame,

    m_{i+1}(v) = sqrt( mean_x ( f_{i+1}(x) - f_i(x) )^2 ),    (3)

where f_i(x) denotes the pixel value of the Y-component of frame i at location x. Note that the frames are used in full resolution only for small resolutions, here QCIF. For larger resolutions, the Y-component of the frame is first subsampled to QCIF resolution before the motion calculation.

3. MEASURING JERKINESS

This section is the central part of the paper, introducing the jerkiness model. Let (f_i, t_i) denote a video sequence. Repeated frames are removed from the video sequence beforehand. For ease of presentation, suppose that frames that are perceived as a repetition are exact repetitions, i.e. they are equal.
If this is not the case, an approach for the detection of perceptually repeated frames can be found in [1]. Recall that t_i denotes the display start of frame f_i, and t_{i+1} its display end, equal to the display start of the following frame.

For motivation, consider a sample video sequence containing an isolated freezing event, as shown in figure 1. The plot shows the motion intensity over time; the interval of zero motion corresponds to the freezing of the video sequence.

Fig. 1. The plot shows the motion intensity for a short video sequence with freezing. The dashed lines show the freezing duration and the motion intensity at the end of the freezing.

The main variables of the jerkiness measure are the display time Δt_i = t_{i+1} - t_i (horizontal dashed line) of the freezing frame and the motion intensity m_{i+1} (vertical dashed line) at the end of the freezing interval. This corresponds to the perception of the freezing of the frame and of the jump, i.e. the sudden large motion, at the end of the freezing interval.

The considerations behind the choice of the jerkiness measure (5) are the following: First, if the motion intensity is larger, then the jerkiness is larger; thus jerkiness should be a monotone function, say μ, of motion intensity. Second, if there is no motion, then jerkiness is zero; thus μ(0) = 0 and jerkiness is a product of μ and a second part. Third, the dependence of jerkiness on motion intensity might saturate for large motion intensities, while for very small motion intensity values the perceptual impact of jitter might be small; thus μ is chosen to be a parametrised S-shaped, or sigmoid, function, see appendix 8 for details. Fourth, jerkiness depends on the frame display time Δt_i: the larger the display time, the larger the jerkiness; thus jerkiness depends monotonically on the display time. Fifth, for very small display times there is no jerkiness.
Thus, jerkiness is a product of a display time dependent part and the motion dependent part μ, where the display time dependent part yields small values for small display times. Last, for very large frame display times it is difficult to conduct useful subjective experiments; a simple approach is therefore a close to linear relationship: if a 10 s video sequence freezes for almost 10 seconds, or if it freezes twice for almost 5 seconds, a similar jerkiness value should result. This can be modelled by

    Δt_i · τ(Δt_i),    (4)

for an S-shaped function τ. This is close to linear in the saturated region of τ, is monotone, and in the region where τ is almost zero the term (4) is almost zero as well.
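The behaviour described above can be checked numerically. The following minimal Python sketch (ours, not the author's implementation) implements the parametrised S-shaped function of appendix 8, equation (6), and verifies that Δt·τ(Δt) is nearly linear for long freezes; τ uses the QCIF parameters (0.12, 0.05, 1.5) estimated in section 5.2.

```python
import math

def s_shaped(x, px, py, q):
    """S-shaped function of eq. (6): polynomial growth from the origin up
    to the inflection point (px, py) with slope q, then exponential
    saturation towards 1."""
    b = q * px / py           # exponent of the polynomial part
    a = py / px ** b          # makes the curve pass through (px, py)
    d = 2.0 * (1.0 - py)      # range of the saturating part
    c = 4.0 * q / d           # matches the slope q at the inflection point
    if x <= px:
        return a * x ** b
    return d / (1.0 + math.exp(-c * (x - px))) + 1.0 - d

# tau with the QCIF parameters of section 5.2
tau = lambda dt: s_shaped(dt, 0.12, 0.05, 1.5)

# Below roughly 50 ms the value is almost zero ...
print(tau(0.05))
# ... and for long freezes dt * tau(dt) is close to linear:
# one 10 s freeze scores about like two 5 s freezes.
print(10 * tau(10), 2 * 5 * tau(5))
```

In the saturated region τ ≈ 1, so the term (4) reduces to Δt_i itself, which is exactly the near-linear behaviour motivated above.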

Table 1. Viewing distance (in multiples of the display height) and corresponding viewing angle (under which the display height is seen), as recommended in [8].

    resolution | viewing distance (in display heights) | viewing angle (of display height)
    QCIF       | 6-10                                  | 0.166-0.100
    CIF        | 6-8                                   | 0.166-0.125
    SD         | 4-6                                   | 0.249-0.166
    HD         | 3                                     | 0.330

Fig. 2. The plot shows the S-shaped transform τ used for the dependency of the jerkiness measure on the frame display time, for a CIF video under standard viewing conditions.

The above motivates the following definition of the jerkiness measure: jerkiness is the sum, over all frame times, of the relative display time times a monotone function of the display time times a monotone function of the motion intensity at the display time end,

    J(v) = (1/T) Σ_i Δt_i · τ_α(Δt_i) · μ(m_{i+1}(v)).    (5)

Here, the video sequence is of the form (1), the display time is defined in (2), the motion intensity is defined in (3), and T is the total video duration. The functions τ_α and μ are S-shaped functions, as described in appendix 8. Each has 3 parameters, the position and the slope of the inflection point. For μ these are kept fixed. For τ the 3 parameters are reparametrised by a single parameter α depending on the viewing angle; details and motivation for this reparametrisation are given in section 4.

Figure 2 shows the transformation τ_α under typical viewing conditions for CIF resolution. For frame display times below 50 ms the value is almost zero, so jerkiness values will be very small. The inflection point, shown as a black dot, lies close to 100 ms and is mapped to 0.05, which produces moderate jerkiness values. This is consistent with the findings of [6], where the authors found that at display times around 100 ms there is a high probability for subjects to detect a small degradation.
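Equation (5) can be read off directly in a few lines of Python. The sketch below is an illustrative reading of the definition, not the author's implementation: frames are assumed to be flat lists of Y-component pixel values with repeated frames already removed, the display times are assumed to be given as start stamps plus the display end of the last frame, and the S-shaped function of appendix 8 is used for both μ and τ_α with the parameters later estimated in section 5.2.

```python
import math

def s_shaped(x, px, py, q):
    # S-shaped function of eq. (6), inflection point (px, py), slope q
    b = q * px / py
    a = py / px ** b
    d = 2.0 * (1.0 - py)
    c = 4.0 * q / d
    return a * x ** b if x <= px else d / (1.0 + math.exp(-c * (x - px))) + 1.0 - d

def motion_intensity(f_a, f_b):
    # eq. (3): root mean squared inter-frame difference of the Y plane
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(f_a, f_b)) / len(f_a))

def jerkiness(frames, times, mu=(5.0, 0.5, 0.25), tau=(0.12, 0.05, 1.5)):
    """Eq. (5): J(v) = (1/T) * sum_i dt_i * tau(dt_i) * mu(m_{i+1}(v)).

    times holds the display start of each frame plus the display end of
    the last frame, so len(times) == len(frames) + 1 (an assumption about
    how the captured sequence is represented).  The last frame has no
    following frame, hence no motion term, and is left out of the sum.
    """
    T = times[-1] - times[0]
    J = 0.0
    for i in range(len(frames) - 1):
        dt = times[i + 1] - times[i]                    # display time of frame i
        m = motion_intensity(frames[i], frames[i + 1])  # motion at display end
        J += dt * s_shaped(dt, *tau) * s_shaped(m, *mu)
    return J / T
```

With this reading, a one-second freeze followed by a large jump scores far higher than a fluent 25 fps sequence of the same total duration, as the model intends.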
The functions τ_α and μ take values in [0, 1); therefore the jerkiness measure also takes values in [0, 1), where zero means no jerkiness. Under normal circumstances, dropping more frames results in a higher motion intensity at the end of the frame drop, and therefore in a higher jerkiness value; thus, the lower the frame rate, the higher the jerkiness value. In addition, all functions are continuous, so, intuitively speaking, two similar video sequences will have similar jerkiness values.

Note that one key idea is already present in [5]: the dependence of jerkiness on the motion intensity at the end of the jerkiness interval. Contrary to our formulation as a product of a display time dependent and a motion dependent part, those authors used an additive combination. This has a different interpretation: there, jerkiness is measured if either a long display time or high motion is present; in our formulation, jerkiness is measured only if both a long display time and motion are present.

4. RESOLUTION DEPENDENCE

In this section the jerkiness model is completed by including the dependence on the video resolution, from QCIF up to HD. In this context, resolution means not only the video resolution itself but also the corresponding viewing condition. Recall that the jerkiness model contains a product of two S-shaped functions: μ accounts for the motion dependent part, and τ_α for part of the display time dependent part. Recall also that motion intensity is measured on subsampled frames for larger resolutions; therefore, the motion dependent part already contains an implicit frame resolution dependent component. In this section, the resolution dependence of the display time dependent part τ_α is explained.
There may be different reasons for a resolution dependent perception of jerkiness: different viewing angles, as the viewing angle for HD is typically much larger than for QCIF; different expectations, as viewers in a subjective experiment may react much more strongly to temporal impairments in an HD experiment than when watching and rating QCIF video sequences; and there may be many more. In this paper, only the dependence on the viewing angle is taken into account. Table 1 shows the typical viewing angle and viewing distance used in the subjective experiments performed for the projects of the Video Quality Experts Group (VQEG), see [8]. It can be seen that the viewing angle is

larger for larger resolutions; e.g. the viewing angle for HD is about 2-3 times as large as the one for QCIF.

The dependence of jerkiness on the resolution is included in the model using the following hypothesis: suppose two video sequences of different resolution but with the same content are viewed at the typical viewing distances; then approximately the same jerkiness value should result if the temporal sampling rates are adjusted such that the moving objects have the same angular displacement from one frame to the next.

In the jerkiness model, equation (5), the resolution dependent part is the S-shaped function τ_α, which is parametrised by the position (p_x, p_y) and slope q of the inflection point, see equation (6) in the appendix. If the viewing angle is increased by a factor c, the angular displacement is increased by a factor c, so, under the hypothesis above, the temporal resolution has to be increased by decreasing the display time by 1/c. Thus τ_α is compressed by a factor c along the time axis, i.e. the new parameters of the inflection point are the position (p_x/c, p_y) and the slope q·c. Conversely, for smaller viewing angles, the S-shaped function is stretched along the time axis. The following factors are used: given the parameters (p_x, p_y) and slope q for QCIF, the factor c for CIF is 1.18, for SD it is 1.54, and for HD it is 2.54. These factors are derived from the viewing angles given in table 1.

This model of resolution dependence implies the following for a fixed temporal impairment: the larger the resolution, the larger the viewing angle under typical viewing conditions, and therefore the larger the angular displacement of a moving object before and after the impairment; hence, the higher the jerkiness value.

5. MODEL EVALUATION

To understand the correspondence between modelled jerkiness and perceived quality, subjective tests were performed. This section starts by describing the subjective datasets.
Then, one dataset of CIF resolution was used for parameter estimation. Finally, the other datasets were used for model evaluation.

5.1. Subjective Data

Four different subjective datasets are used: two, of QCIF and CIF video resolution, containing reduced frame rate and frozen frame conditions as a result of transmission errors; and two, for the larger SD and HD resolutions, containing frame freezes only, as reduced frame rates are not common for these resolutions. In more detail, the datasets are composed as follows:

1. QCIF:
   (a) Contents: four contents: speaker, fishes, actors, sport.
   (b) Conditions: reduced frame rates of 25 fps, 12.5 fps, 8 fps, 5 fps; freezing, with 1, 2, or 3 freezing intervals of fixed duration at random positions. All uncompressed.

2. CIF:
   (a) Contents: the same contents as in the QCIF dataset, plus an average motion movie sequence.
   (b) Conditions: frame rates of 30 fps or 25 fps, 12.5 fps, 5 fps; freezing, with 1, 2, or 3 freezing intervals of fixed duration at random positions. All uncompressed.

3. SD:
   (a) Contents: five contents: interview, music video, comedy, action movie, sport.
   (b) Conditions: freezing as a result of 0.5%, 1%, or 2% of random (bursty) packet loss; freezing as a result of 0.06% of uniform packet loss. All 25 fps and encoded at 4 Mbit/s.

4. HD:
   (a) Contents: the five contents of SD: interview, music video, comedy, action movie, sport.
   (b) Conditions: freezing as a result of 0.5%, 1%, or 2% of random (bursty) packet loss; freezing as a result of 0.06% of uniform packet loss. All 25 fps and encoded at 16 Mbit/s.

Fig. 3. The scatter plot shows the MOS values of the subjective data against the model's jerkiness predictions. Data is shown for the formats QCIF, CIF, SD, HD.

Note that the datasets were created using a full matrix design, i.e. one sample was generated for each combination of content and condition. Each dataset contained a low motion content, speaker or interview, and at least one high motion content, sport. Additional samples containing other types of distortions were added to generate equilibrated datasets suitable for subjective testing. Note that the SD and HD data was compressed, at a high bitrate. This was done because the data was generated with additional intentions in mind, outside the scope of this paper; nevertheless, it was checked that no spatial artifacts influenced the present analysis.

Subjective tests were performed following recommendation [7] and the best practice of [8]. Of course, the viewing distance was adjusted to match the values in table 1. For each subjective test, 24 subjects were used. The subjects rated the videos using the scores 1 (bad), 2 (poor), 3 (fair), 4 (good), 5 (excellent). The subjective ratings were averaged to generate the mean opinion score (MOS).

5.2. Parameter Estimation

Model parameters were estimated using the CIF dataset. Recall the jerkiness measure, equation (5): there are six parameters, three for each of the S-shaped functions τ_α and μ. The following parameters were determined: for μ the parameters (p_x, p_y, q) = (5, 0.5, 0.25), and for τ_α the parameters (p_x, p_y, q) = (0.12, 0.05, 1.5) for QCIF resolution. The parameters of τ_α for the other resolutions are given by (p_x, p_y, q) = (0.12/c, 0.05, 1.5·c), as explained in section 4, where the values of c are also given.

5.3. Model Evaluation

Keeping these parameters fixed, jerkiness was calculated and compared to the subjective ratings. Figure 3 shows the result: the scatter plot shows all data samples for the four resolutions. There is a non-linear relationship between the mean opinion score and the predicted jerkiness.
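The per-resolution reparametrisation of τ_α described in sections 4 and 5.2 amounts to a one-line scaling rule. A minimal sketch (the helper name is ours, not the author's; the numbers are the values given in the paper):

```python
# Viewing-angle scale factors c from section 4 (QCIF is the reference)
C_FACTORS = {"QCIF": 1.0, "CIF": 1.18, "SD": 1.54, "HD": 2.54}

# tau_alpha parameters (p_x, p_y, q) estimated on the CIF dataset, for QCIF
QCIF_TAU = (0.12, 0.05, 1.5)

def tau_params(resolution):
    """Compress the S-curve along the time axis by the factor c of the
    given resolution: (p_x, p_y, q) -> (p_x / c, p_y, q * c)."""
    px, py, q = QCIF_TAU
    c = C_FACTORS[resolution]
    return (px / c, py, q * c)

for res in C_FACTORS:
    print(res, tau_params(res))
```

Dividing p_x by c while multiplying q by c keeps p_y fixed, so the inflection point only moves towards shorter display times as the viewing angle grows.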
To quantify the correspondence between the predicted jerkiness and the subjective scores, the predicted values were fitted to the MOS using a 3rd order polynomial. The residual RMSE is shown in table 2. By comparison, [5] reports similar error values for CIF; the errors for QCIF in [4] are considerably larger. Note that this comparison serves as an indication only, as the video data as well as the subjective testing methodologies differ.

Furthermore, the model predictions can be compared to the values reported in other studies: in [2], for VGA resolution, an average content at 12.5 fps has a score of 4.0, and a 5 fps sequence a score of 2.26; our model predicts a score of 3.73 for a high motion scene at 12 fps and a score of 2.27 for a 5 fps sequence. For a high motion QCIF sequence, the jerkiness prediction at 8.5 fps is 3.68, and a value of 3.8 is reported in [3].

Table 2. Jerkiness values are fitted to MOS using a 3rd order polynomial. The table shows the residual root mean squared error for all data samples from figure 3 except CIF, and for each resolution separately. Recall that the CIF data was used to fit the model parameters.

    Resolution  | RMSE
    All but CIF | 0.26
    QCIF        | 0.27
    SD          | 0.27
    HD          | 0.22
    (CIF)       | (0.31)

6. CONCLUSION

In this paper a model of perceived jerkiness was proposed. It is general enough to be applicable from small to large resolutions and to variable frame rate video sequences containing any type of temporal impairment. The model showed excellent correspondence to the subjective data. As it has relatively few parameters, model over-fitting can almost be excluded. In addition, the model is consistent with the findings of subjective experiments performed by others: the frame dropping experiments of [6] and the frame rate reduction experiments of [3] and [2].

Unfortunately, a detailed comparison to other models is difficult. First, the models in [5] and [4] can be applied to one resolution only, CIF and QCIF, respectively. Second, the model in [4] takes compressed video sequences only as input.
Furthermore, the performance cannot be compared to [5], as their data is not available and the values of the model parameters are not given in the paper.

Still open is a systematic analysis of the relation between jerkiness and motion intensity; the subjective experiments used in this paper were not designed to analyse this relation in more detail. Note that in the present approach the jerkiness measure depends on the possibly large jump of the motion intensity at the end of a freezing interval. It does not depend on the motion intensity at the beginning of a freezing interval, i.e. on the sudden stopping of motion of the video playback. Whether an additional dependence on this sudden stopping of motion improves performance is still an open research question.

7. ACKNOWLEDGEMENT

Many thanks go to Alexander Raake and Marie-Neige Garcia from Deutsche Telekom Laboratories, who were involved in the design, execution, and evaluation of the subjective tests for the SD and HD data.

Fig. 4. The plot shows two sample S-shaped functions. The dot shows the inflection point. The parameters (p_x, p_y, q) are given in parentheses: (0.12, 0.1, 1.5) and (0.5, 0.2, 10).

8. APPENDIX

This appendix describes the details of the S-shaped functions used for the jerkiness model. The function, denoted here by S, is parametrised by the location (p_x, p_y) and slope q of the inflection point, in detail

    S(x) = a·x^b                                  if x <= p_x,
    S(x) = d / (1 + exp(-c·(x - p_x))) + 1 - d    else,    (6)

where a = p_y / p_x^(q·p_x/p_y), b = q·p_x/p_y, c = 4q/d, and d = 2·(1 - p_y). The S-shaped function starts at the origin, grows polynomially up to the inflection point, and saturates exponentially towards 1. See figure 4 for two sample plots using different parameters.

9. REFERENCES

[1] Stephen Wolf, "A no reference (NR) and reduced reference (RR) metric for detecting dropped video frames," in Proc. of the International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2009.

[2] Q. Huynh-Thu and M. Ghanbari, "Impact of jitter and jerkiness on perceived video quality," in Proc. of the Second International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2006.

[3] Q. Huynh-Thu and M. Ghanbari, "Perceived quality of the variation of the video temporal resolution for low bit rate coding," in Picture Coding Symposium, 2007.

[4] Kai-Chieh Yang, Khaled El-Maleh, Clark C. Guest, and Pankaj K. Das, "Perceptual temporal quality metric for compressed video," in Proc. of the International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2007.

[5] Ricardo R. Pastrana-Vidal and Jean-Charles Gicquel, "Automatic quality assessment of video fluidity impairments using a no-reference metric," in Proc. of the Second International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), 2006.

[6] R. R. Pastrana-Vidal, J. C. Gicquel, C. Colomes, and H. Cherifi, "Sporadic frame dropping impact on quality perception," in Proceedings of SPIE, 2004, vol. 5292, p. 182.

[7] ITU-T, "Subjective video quality assessment methods for multimedia applications," ITU-T Recommendation P.910, 1999.

[8] VQEG, "HDTV and Multimedia project," http://www.its.bldrdoc.gov/vqeg/projects, 2009.