Resource allocation for H.264 video encoder: Sequence-level controller


Alexander Bronstein
Department of Computer Science, Technion - Israel Institute of Technology, Haifa 32000, Israel
September 7, 2009

Abstract

Tradeoff between different resources (bitrate, computational time, etc.) and compression quality, often referred to as rate-distortion optimization (RDO), is a key problem in video coding. With a little exaggeration, it can be claimed that what distinguishes a good codec from a bad one is how optimally it finds such a tradeoff. This is the first part of our paper presenting an algorithm for resource allocation in a scalable H.264 video codec. Here, we discuss a sequence-level resource allocation and control scheme. The scheme allows for optimal frame rate and frame dropping decisions, optimal reference frame selection, adaptation to a wide range of bitrates, and computational load balancing between different frames.

1 Introduction

The goal of video compression, in simple words, is to deliver video of as high quality as possible using as few bits and as few computations as possible. Since these desiderata are conflicting, the tradeoff between different resources and compression quality is a key problem in video coding. Usually referred to as rate-distortion optimization (RDO), the problem is actually broader and, in addition to the bitrate, can include additional resources such as computational time or power dissipation [2, 1]. With a little exaggeration, it can be said that what distinguishes a good video codec from a bad one is the mechanism performing such a tradeoff.

The target application we consider as a motivation for this paper is video distribution in a modern networked ecosystem of consumer electronic devices. A typical high-definition (HD) video arrives to our home from a cable or satellite broadcast, usually compressed using the MPEG-2 standard, at a bitrate of up

to 20 Mbps. After transcoding, this video source can then be distributed to multiple destinations: over the local wireless network to a networked or portable device, or over the broadband connection to a remote Internet-connected location. The target bitrates in these two cases can vary dramatically: the bandwidth of a wireless network can typically range from 2 to 10 Mbps (whether using the 802.11g or 802.11n standards), while the typical bandwidth of a broadband connection is measured in Kbps. To make the problem even more complicated, the available resources may change dynamically: the bandwidth may drop depending on the network utilization by other users and applications, and the available computational resources may vary depending, for example, on the number of channels simultaneously encoded.

The described video distribution scenario imposes several important requirements on the encoder. First, there are multiple operating points in the resource space (corresponding to different bandwidth and computational complexity). The encoder must be able to work at any of these points, or, in other words, be scalable. Second, since the resources may change in time, the encoder must be able to transition from one operating point to another as quickly as possible. Finally, given the available resources, the quality of the video should be as high as possible. All of the above make resource allocation and control a challenging and non-trivial problem. Here, we discuss a sequence-level resource allocation and control scheme. As a reference, we use the H.264 standard, though the concepts are generic and can be applied to other MPEG-type coding standards.

This paper is organized as follows. Section 2 presents a general background on video compression and the problem of resource allocation and control. In Section 3, we present a system-level view and the main features of the proposed resource allocation and control scheme, focusing on the sequence-level controller. Section 4 describes the models of PSNR, bit production, and encoding complexity, as well as their training and adaptation mechanisms. Section 5 describes the buffer models. In Section 6, we describe the flow of the sequence-level control algorithm. Section 7 deals with sequence resource allocation. Sections 9 and 10 are dedicated to frame type decision and encoding order optimization.

2 Background

2.1 General principles of video coding

Video data contains a significant amount of redundancy: neighboring pixels within a frame and corresponding pixels in nearby frames are strongly correlated. Video compression takes advantage of these redundancies, trying to reduce the size of the video data while maintaining the content almost without visible loss. Such compression is referred to as lossy (as opposed to lossless), since the original data cannot be recovered exactly. Most modern codecs operate by constructing

a prediction of the video frame using the redundant information. The prediction parameters are usually much more compactly represented than the actual pixels. The predicted frame is usually similar to the original one; yet, differences may appear at some pixels. For this reason, the residual (prediction error) is transmitted together with the prediction parameters. The error is also usually compactly represented using a spatial transform, like the discrete cosine transform (DCT), with subsequent quantization of the transform coefficients. The quantized coefficients and the prediction parameters are entropy-coded to further reduce their redundancy.

Inter-prediction accounts for the major part of the reduction of the video data size in video codecs. It takes advantage of temporal redundancy, predicting a frame from nearby frames (referred to as reference frames) by moving parts of these frames, thus trying to compensate for the motion of the objects in the scene. The description of the motion (the motion vectors) constitutes the parameters of the prediction which are transmitted. This type of prediction is used in MPEG-like codecs. The process of producing inter-prediction is called motion estimation. Typically, it is performed in a block-wise manner, and many modern codecs support motion estimation with blocks of adaptive size.

At the other end, intra-prediction takes advantage of spatial redundancy and predicts portions of a frame (blocks) from neighboring blocks within the same frame. Such prediction is usually aware of spatial structures that may occur in a frame, namely smooth regions and edges. In the latter case, the prediction is directional. This type of prediction is used in the recent H.264 Advanced Video Coding (AVC) standard [?] and does not appear in previous versions of the MPEG standard.

The actual reduction in the amount of transmitted information is performed in the transmission of the residual. The residual frame is divided into blocks, each of which undergoes a DCT-like transform. The transform coefficients undergo quantization, usually performed by scaling and rounding. Quantization allows representing the coefficients with less precision, thus reducing the amount of information required. The quantized transform coefficients are transmitted using entropy coding. This type of coding is lossless and exploits further redundancy that may exist in the transform coefficients.

The encoder usually operates with video represented in the YCbCr color space, where the achromatic Y channel is called the luma and the chromatic channels (Cb and Cr) are called chroma. The chroma is typically downsampled to half the frame size in each direction, which is referred to as the 4:2:0 format.

The performance of a lossy video codec is measured as the tradeoff between the amount of bits required to describe the data and the distortion introduced by the compression, referred to as the rate-distortion (RD) curve. The RD curve of a better codec is closer to the idealistic and theoretically impossible case of zero distortion and zero rate. As the distortion criterion, the mean squared error (MSE) of the luma channel is

usually used. The MSE is often converted into logarithmic units and represented as the peak signal-to-noise ratio (PSNR),

p = 10 log10 (y_max^2 / d),   (1)

where d is the MSE and y_max is the maximum allowed value of the luma pixels, typically 255 if the luma data has an 8-bit precision and is represented in the range 0, ..., 255.

2.2 H.264 video codec

The H.264 AVC is one of the most recent standards in video compression, offering significantly better compression rate and quality compared to the previous MPEG-2 and MPEG-4 standards, and targeted at high-definition (HD) content. For example, H.264 delivers the same quality as MPEG-2 at a third to a half of the data rate.

The encoding process goes as follows: a frame undergoing encoding is divided into non-overlapping macroblocks, each containing 16 x 16 luma pixels and 8 x 8 chroma pixels (in the most widely used 4:2:0 format). Within each frame, macroblocks are arranged into slices, where a slice is a continuous raster scan of macroblocks. Each slice can be encoded independently of the others. The main slice types are P and I. An I slice may contain only I macroblocks; a P slice may contain P or I macroblocks. The macroblock type determines the way it is predicted. P refers to inter-predicted macroblocks; such macroblocks are subdivided into smaller blocks. I refers to intra-predicted macroblocks; such macroblocks are divided into 4 x 4 blocks (the luma component is divided into 16 blocks; each chroma component is divided into 4 blocks).

In I mode (intra-prediction), the prediction macroblock is formed from pixels of the neighboring blocks in the current slice that have been previously encoded, decoded and reconstructed, prior to applying the in-loop deblocking filter. The reconstructed macroblock is formed by imitating the decoder operation within the encoder loop. In P mode (inter-prediction), the prediction macroblock is formed by motion compensation from reference frames. The prediction macroblock is subtracted from the current macroblock.
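As a quick illustration, the MSE-to-PSNR conversion of Eq. (1) can be sketched in Python (the function name is ours, not part of the paper):

```python
import math

def mse_to_psnr(mse: float, y_max: float = 255.0) -> float:
    # Eq. (1): p = 10*log10(y_max^2 / d), with d the luma MSE and
    # y_max the peak luma value (255 for 8-bit data).
    if mse <= 0.0:
        return float("inf")  # identical frames: no distortion
    return 10.0 * math.log10(y_max ** 2 / mse)
```

For instance, an MSE of 1.0 on 8-bit luma corresponds to roughly 48.1 dB.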
The error undergoes transform, quantization and entropy coding. According to the length of the entropy code, the best prediction mode is selected (i.e., the choice between an I or a P macroblock, the motion vectors in the case of a P macroblock, and the prediction mode in the case of an I macroblock). The coded residual for the macroblock in the selected best mode is sent to the bitstream. A special type of frame, referred to as instantaneous decoder refresh (IDR), is used as a synchronization

mechanism, in which the reference buffers are reset as if the decoder started freshly from the beginning of the sequence. An IDR frame is always an I-frame. The use of IDR allows, for example, starting to decode a bitstream not necessarily from its beginning. A sequence of P and I frames between two IDRs is called a group of pictures (GOP). A GOP always starts with an IDR frame. The maximum GOP size is limited by the standard.

The operating point on the RD curve is controlled by the quantization parameter, determining the aggressiveness of the residual quantization and the resulting distortion. In the H.264 standard, the quantization parameter is an integer in the range 0, ..., 51, denoted here by q. The quantization step doubles for every increment of 6 in q. Sometimes it is more convenient to use the quantization step rather than the quantization parameter, computed according to

q' = 2^((q - 4)/6).   (2)

In the following, we will use q and q' interchangeably.

2.3 Resource allocation and control

Since only the decoder is standardized in the MPEG standards, many aspects of the way in which the encoder produces a standard-compliant stream are left to the discretion of a specific encoder design. The main difference between existing encoders is the decision process carried out by the bitrate controller that produces the encoder control parameters. Usually, the encoder parameters are selected in a way that achieves the best tradeoff between video quality and bitrate of the produced stream. Controllers of this type are referred to as RDO. Parameters controlled by the bitrate controller typically include: frame type and reference frame or frames (if the frame is a P-frame) on the sequence level, and macroblock type and quantization parameter for each macroblock on the frame level. For this reason, it is natural and common to distinguish between two levels of bitrate control: sequence- and frame-level.
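The parameter-to-step mapping can be sketched as follows; the normalization q'(4) = 1 matches the step table of the H.264 standard and is our assumption about the constant hidden in Eq. (2):

```python
def quant_step(qp: int) -> float:
    # The step doubles for every increment of 6 in the quantization
    # parameter; q'(4) = 1.0 anchors the scale (H.264 step table).
    assert 0 <= qp <= 51, "H.264 quantization parameter range"
    return 2.0 ** ((qp - 4) / 6.0)
```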
The sequence-level controller is usually responsible for the frame type selection and the allocation of the bit budget for the frame, while the frame-level controller is responsible for the selection of the quantization parameter for each macroblock within the frame. In a broader perspective, RD optimization is a particular case of the optimal resource allocation problem, in which the available resources (the bitrate, computational time, power dissipation, etc.) are distributed in a way that maximizes some quality criterion. Hence, hereinafter we use the term resource allocation when referring to the RDO-type problems discussed in this paper. The resources we consider

specifically are computational time and bitrate.

Theoretically, optimal resource allocation requires running the encoding with different sets of parameters and selecting the best outcome. Such an approach is impossible due to the very large number of possible combinations of parameters, which results in a prohibitive computational complexity. Suboptimal resource allocation approaches usually try to model some typical behavior of the encoder as a function of the parameters; if the model has an analytical expression which can be efficiently computed, the optimization problem becomes practically solvable. However, since the model only approximates the behavior of the encoder, the parameters selected using it may be suboptimal.

3 System-level view

The presented resource allocation and control mechanisms are generic and in principle can be used with any MPEG-type encoder. To be more specific, in the following we assume that the underlying encoding pipeline implements the H.264 AVC standard, with independent slice encoders working simultaneously. Such an implementation is typical of real-time hardware encoders. A system-level block diagram of the proposed controllers is shown in Figure 1. The system consists of the sequence-level controller, responsible for frame type and reference selection and for the allocation of the bit budget and the initial quantization parameter for the currently encoded frame, and the frame-level controller, which allocates the resources on the sub-frame level in order to utilize the total frame bit budget and achieve the highest perceptual visual quality. For this purpose, a perceptual quality model and a PSNR shaping mechanism are employed. Both controllers are based on the encoder model, which predicts the behavior of the encoder given a set of encoding parameters. The model, in turn, utilizes visual clues, a set of descriptors of the video data. The main focus of this paper is the encoder models and the sequence-level controller.

Among the most important features of the proposed sequence-level controller are:

- Content-aware encoder model: an accurate encoder model based on visual clues, allowing to optimize the resource allocation prior to encoding;

- Optimal frame type decision, including IDR insertion at scene cuts; optimal frame dropping, capable of detecting and removing redundant frames, e.g. due to telecine conversion; and optimal reference frame selection based on frame predictability, allowing to achieve compression ratios close to those of codecs using multiple reference frames at a significantly lower computational complexity;

Figure 1: System-level block diagram.

- Optimal encoding order decision, allowing out-of-order encoding to achieve better predictability;

- Optimal frame bit allocation;

- Optimal encoding time allocation, allowing to achieve better encoding quality by distributing the encoding time between frames according to their coding complexity.

4 Encoder model

The purpose of the encoder model is to estimate the amount of resources used by the encoder without resorting to the expensive process of encoding itself. Specifically, we predict the amount of bits produced by the encoder, the distortion resulting from the encoding, and the time complexity of the encoding process. First, the model gives explicit expressions for the predicted values (amount of bits, distortion, time) as functions of the encoder parameters (q or q', frame type, etc.). Second, the model is adaptive, and can be adjusted to the current data by means of training. Third, the model is content-aware, i.e. it depends on the data on which the encoder operates. The content dependence is captured by means of visual clues, a measure of content complexity. Specifically, we use the texture and the motion complexity as the visual clues. Finally, since

the same model is used by the sequence- and the frame-level controllers, the predictors are linear and can be applied to data units of different granularity (frame, region of macroblocks, or macroblock). Specifically for the sequence-level controller, we are interested in frame-wise predictors, which are obtained by summing up or averaging the macroblock-wise predictors. In general, if x is the quantity to be predicted (amount of bits, distortion, time) for the current frame, the predicted value according to the encoder model is computed by

x̂ = K Σ_{i=1}^{N_MB} x̂_i(θ_i; z_i),   (3)

where x̂_i is a macroblock-wise predictor, θ_i are the encoding parameters and z_i are the visual clues in macroblock i, N_MB is the number of macroblocks in the frame, and K is some normalization factor. Specifically, we use the following models.

Bit production model: in each macroblock, the amount of bits produced is proportional to the inverse quantization step q'_i^{-1} and to the macroblock complexity z_i. Depending on the frame type, either the texture complexity (in I-frames) or the motion complexity (in P-frames) is used. The macroblock-wise predictor is given by

b̂_i(q'_i, z_i) = α_1 + α_2 z_i q'_i^{-1},   (4)

where α_1, α_2 ≥ 0 are the model parameters obtained by training. Assuming that q' is constant for the entire frame, the frame-wise predictor of the amount of bits can therefore be expressed as

b̂ = Σ_{i=1}^{N_MB} b̂_i(q', z_i) = N_MB α_1 + (Σ_{i=1}^{N_MB} α_2 z_i) q'^{-1} = α_b + β_b q'^{-1},   (5)

where the coefficients α_b = N_MB α_1 and β_b = Σ_{i=1}^{N_MB} α_2 z_i depend on the frame content and frame type.

Distortion model: we use the luma PSNR as an objective distortion criterion. In each macroblock, the PSNR decreases linearly with the quantization parameter q_i and with the macroblock complexity z_i (the texture complexity is used for both P- and I-frames),

p̂_i(q_i, z_i) = β_1 − β_2 z_i − β_3 q_i,   (6)

where β_1, β_2, β_3 ≥ 0 are the model parameters obtained by training. Assuming that q is constant for the entire frame, the predictor of the frame-wise average PSNR can therefore be expressed as

p̂ = (1/N_MB) Σ_{i=1}^{N_MB} p̂_i(q, z_i) = β_1 − (1/N_MB) Σ_{i=1}^{N_MB} β_2 z_i − β_3 q = α_p + β_p q,   (7)

where the coefficients α_p = β_1 − (1/N_MB) Σ_{i=1}^{N_MB} β_2 z_i and β_p = −β_3 depend on the frame content and frame type.

Distortion model for dropped frames: in some cases, the sequence-level controller may decide to drop a frame. In such a case, it is assumed that the decoder delays the previous frame for a longer duration, equal to the total duration of the previous and the current (dropped) frame. The distortion due to the drop is computed as the average PSNR between a downsampled (16 times in each axis) version of the dropped frame and the previous frame. In our implementation, the average PSNR is computed by first calculating the average MSE and then converting it into logarithmic scale (rather than averaging the PSNR values directly). For example, dropping a frame at the beginning of a new shot would result in a low PSNR, while dropping a frame in a sequence of frames with slow motion will result in a high PSNR. It may happen that though the average PSNR is high, in some small portions of the image the distortion due to the frame drop is very large (such a situation is typical for a frame with abrupt motion of a foreground object while the background is static). In order to take such situations into consideration, together with the average PSNR we measure the 0.1% lowest quantile of the PSNR values, denoted by p̂_min.

Time complexity model: the underlying assumption of our time complexity model is that the encoding pipeline used to encode a macroblock consists of multiple stages, some of which are bypassed according to the encoded macroblock type, using some decision tree. Consequently, it can be simplistically assumed that encoding a macroblock of a certain type takes a fixed amount of time.
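A minimal sketch of the frame-wise bit and PSNR predictors of Eqs. (5) and (7); the parameter values passed in are illustrative placeholders for trained ones:

```python
def predict_frame_bits(qstep: float, clues, a1: float, a2: float) -> float:
    # Eq. (5): b = alpha_b + beta_b / q', with alpha_b = N_MB * a1
    # and beta_b = a2 * sum(z_i) over the macroblock visual clues.
    alpha_b = len(clues) * a1
    beta_b = a2 * sum(clues)
    return alpha_b + beta_b / qstep

def predict_frame_psnr(qp: float, clues, b1: float, b2: float, b3: float) -> float:
    # Eq. (7): p = alpha_p + beta_p * q, with alpha_p = b1 - b2 * mean(z_i)
    # and beta_p = -b3; PSNR decreases with both q and content complexity.
    alpha_p = b1 - b2 * sum(clues) / len(clues)
    return alpha_p - b3 * qp
```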
Let cT denote the maximum amount of time it takes to encode a macroblock (the nominal encoding time), which depends on the frame type and on the complexity scale c. We denote by τ'_i the normalized encoding time of macroblock i, taking values in [0, T], such that τ_i = c τ'_i. According to our model, the

normalized encoding time for macroblock i is given by

τ̂'_i(q_i, z_i) = max{min{γ_1 + γ_2 z_i − γ_3 q_i, 1}, 0} T,   (8)

where γ_1, γ_2, γ_3 ≥ 0 are the model parameters obtained by training, and z_i is the texture complexity (for I-frames) or the motion complexity (for P-frames). The minimum and maximum are taken in order to ensure that the predicted normalized time value is within the valid range [0, T] (for a reasonable range of z_i and q_i, this assumption will hold). Assuming that q is constant for the entire frame, the predictor of the encoding time for the entire frame is approximated (neglecting the possible nonlinearity due to the minimum and maximum functions) as

τ̂ ≈ Σ_{i=1}^{N_MB} τ̂'_i(q, z_i) c = (N_MB γ_1 T + Σ_{i=1}^{N_MB} γ_2 z_i T − N_MB γ_3 T q) c = α_t c + β_t c q,   (9)

where the coefficients α_t = (N_MB γ_1 + Σ_{i=1}^{N_MB} γ_2 z_i) T and β_t = −N_MB γ_3 T depend on the frame content and frame type.

5 Buffer models

The purpose of the buffer models is to provide a simple means of modeling the encoder input and output, as well as the hypothetical decoder input behavior, imposing constraints on bit and time allocation. The encoder model comprises a raw frame buffer, constituting the input of the system, to which raw frames are written, and the encoder bit buffer, constituting the output of the system, to which the encoded bitstream is written. The hypothetical decoder model comprises a decoder bit buffer, connected to the encoder bit buffer by a channel responsible for the transport of the coded stream.

5.1 Raw frame buffer model

The fullness of the raw frame buffer at time t is denoted by l_raw(t). We assume that initially l_raw(0) = 0, and raw frames are written at constant time intervals of 1/F, where F denotes the input sequence frame rate. Though l_raw may assume integer values only, in most parts of our treatment we will relax this restriction, treating l_raw as a continuous quantity.
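The raw-buffer bookkeeping of this subsection can be sketched as follows (the function names are ours, and the frame rate and lookahead values used below are illustrative):

```python
import math

def encoding_start_time(lookahead: int, frame_rate: float) -> float:
    # The encoder starts once the raw buffer holds l_raw_min frames,
    # which arrive at constant intervals of 1/F.
    return lookahead / frame_rate

def raw_buffer_level(t: float, frames_encoded: int, frame_rate: float) -> int:
    # Fullness = frames produced by time t minus frames already consumed.
    return math.floor(t * frame_rate) - frames_encoded
```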

The encoder starts reading the first frame from the raw buffer as soon as the raw buffer level reaches l_raw ≥ l_raw_min, where l_raw_min ≥ 1 denotes the minimum amount of lookahead required by the encoder in our implementation. The first frame in the sequence starts being encoded at the time

t_0 = l_raw_min / F.   (10)

The encoding of the first frame takes τ_1 seconds, after which the frame is removed from the raw buffer, decreasing its level by 1. The moment this happens is the encoding end time, denoted by

t_1 = t_0 + τ_1.   (11)

If l_raw(t_n + ε) < l_raw_min for infinitesimal ε > 0, an encoder raw buffer underflow occurs; in this case, the encoder will stall until the minimum buffer level is reached. We denote this idle time by

τ_1^idle = (l_raw_min − l_raw(t_1)) / F.   (12)

In this notation, the encoding of the next frame will start at time t_1 + τ_1^idle = t_0 + τ_1 + τ_1^idle. The raw frame buffer is assumed to be capable of holding up to l_raw_max raw frames; when the buffer level reaches this limit, an encoder raw buffer overflow occurs, and the input process stalls. Whenever the duration of such a stall is not negligible, input data may be lost. The buffer level has to be checked for overflow immediately before the encoder finishes encoding a frame, and for underflow immediately before the encoder starts encoding a frame. Combining the contributions of frame production by the input subsystem and consumption by the encoder, the raw frame buffer level is given by

l_raw(t_n − ε) = ⌊t_n F⌋ − n + 1,   (13)

as depicted in Figure 2 (first row).

5.2 Encoder bit buffer model

The fullness of the encoder bit buffer at time t is denoted by l_enc(t). We assume that initially l_enc(0) = 0, and at the time t_n when the encoder completes encoding frame n, b_n bits corresponding to the access unit of the coded frame are added to the buffer. After the first time the buffer level exceeds l_enc_init, the buffer

starts being drained at a rate r(t) ≤ r_max(t), determined by the transport subsystem. If at a given time the buffer is empty, the instantaneous draining rate drops to r(t) = 0. This situation is referred to as encoder bit buffer underflow. Except for the inefficiency of not utilizing the full channel capacity, encoder buffer underflow poses no danger to the system. The complementary situation of encoder bit buffer overflow occurs when l_enc(t_n) + b_n > l_tot. As a consequence, the encoder will stall until the bit buffer contains enough space to accommodate the encoded bits. If the stall lasts for a non-negligible amount of time, unpredictable results (including input raw frame buffer overflow) may occur. The buffer level has to be checked for underflow immediately before the encoder finishes encoding a frame, and for overflow immediately after the encoding is finished. Combining the contributions of bit production by the encoder and consumption by the transport subsystem, the encoder bit buffer levels are given by

l_enc(t_n − ε) = l_enc(t_{n−1} + ε) − ∫_{t_{n−1}}^{t_n} r(t) dt,   (14)
l_enc(t_n + ε) = l_enc(t_n − ε) + b_n,   (15)

as depicted in Figure 2 (second row). For convenience, let us denote by

r_n = (1/(t_n − t_{n−1})) ∫_{t_{n−1}}^{t_n} r(t) dt   (16)

the average rate while frame n is being encoded. Assuming that r(t) = r_max(t) if l_enc(t) > 0 and r(t) = 0 otherwise, we obtain

r_n = (1/(t_n − t_{n−1})) min{ l_enc(t_{n−1} + ε), ∫_{t_{n−1}}^{t_n} r_max(t) dt }.   (17)

In these terms, we may write

l_enc(t_n − ε) = l_enc(t_{n−1} + ε) − (t_n − t_{n−1}) r_n.   (18)

5.3 Decoder bit buffer model

Bits drained from the encoder bit buffer at time t at rate r(t) are transported by the channel and appended to the decoder bit buffer at time t + δ_chan at the same rate. The channel delay δ_chan need not be constant;

in this case, the maximum channel delay may be assumed. The fullness of the decoder bit buffer at time t is denoted by l_dec(t). Initially, l_dec(0) = 0; the buffer level remains zero until t = t_1 + δ_chan, when the first bits of the first frame start arriving. The decoder remains idle until l_dec ≥ l_dec_init; once the bit buffer is sufficiently full, the decoder removes b_1 bits from the buffer and starts decoding the first frame of the sequence. The time lapsing between the arrival of the first bit and the removal of the first access unit from the bit buffer is given by the smallest δ' satisfying

∫_{t_1}^{t_1+δ'} r(t') dt' ≥ l_dec_init.   (19)

We denote by

t'_1 = t_1 + δ_chan + δ'   (20)

the time when the first frame starts being decoded. The delay passing between the production and the consumption of the access unit corresponding to a frame is denoted by

δ = δ_chan + δ'.   (21)

The decoder removes access units corresponding to the coded frames at a constant rate F, resulting in the following decoding schedule:

t'_n = t'_1 + (n − 1)/F.   (22)

The decoder bit buffer level assumes its highest and lowest values immediately before a frame decoding starts and immediately after the decoding has started, respectively. Combining the contributions of bit production by the transport subsystem and consumption by the decoder, the decoder bit buffer levels are given by

l_dec(t'_n − ε) = l_dec(t'_{n−1} + ε) + ∫_{t_{n−1}+δ}^{t_n+δ} r(t) dt,   (23)
l_dec(t'_n + ε) = l_dec(t'_n − ε) − b_n,   (24)

as depicted in Figure 2 (third row).
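The decoding schedule of Eq. (22) and the buffer update around a decoding event, Eqs. (23)-(24), can be sketched as follows (function names are ours):

```python
def decoding_times(t1_dec: float, n_frames: int, frame_rate: float):
    # Eq. (22): access units are consumed at the constant rate F,
    # starting at the first decoding time t'_1.
    return [t1_dec + (n - 1) / frame_rate for n in range(1, n_frames + 1)]

def decoder_buffer_after(level: float, arrived_bits: float, frame_bits: float) -> float:
    # Bits arrive from the channel, then one access unit is removed;
    # a negative balance would mean a decoder bit buffer underflow.
    level += arrived_bits      # level just before decoding starts
    assert level >= frame_bits, "decoder bit buffer underflow"
    return level - frame_bits  # level just after decoding starts
```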

6 Controller architecture

The sequence-level controller flow diagram is depicted in Figure 3. The encoder looks ahead over n frames and decides the control sequence for their encoding based on the current state of the buffers and the channel.

1. Encoding order optimization (optional, Section 10) is invoked to find the best order π of encoding the set of frames currently residing in the raw frame buffer. Decoded picture buffer size constraints are imposed to rule out some of the orderings. For the remaining orderings, the order optimization algorithm invokes the frame type decision algorithm.

2. Frame type decision (Section 9) is invoked with a frame reordering π to find the best sequence a of encoding parameters, which includes the frame types, quantizer values, and complexity scales. The buffer model is invoked to obtain the current status of the encoder and estimated decoder buffers. A branch-and-bound algorithm is employed to efficiently search through different frame type combinations. For every feasible frame type assignment f, the sequence resource allocation algorithm is invoked.

3. Sequence resource allocation (Section 7) receives the current buffer status from the buffer model, and invokes the encoder model with the set of frame types f to obtain the content-dependent frame-level bit, PSNR, and encoding time model parameters α_b, β_b, α_p, β_p, α_t, β_t. The sequence resource allocation algorithm establishes the budgets b_T and τ_T of bits and encoding time, respectively, for the sequence of n frames, and invokes the frame resource allocation algorithm.

4. Frame resource allocation (Section 8) receives the frame-level bit, PSNR, and encoding time model parameters α_b, β_b, α_p, β_p, α_t, β_t and the budgets b_T and τ_T, and finds the optimal sequences q and c of quantizer and complexity scales, respectively. Together with the sequence f of frame types, this provides the entire set a of encoding parameters (up to an optional frame reordering π) required to encode the sequence of n frames.

5. Frame encoding: once the best sequence a of encoding parameters is found and returned by the highest-level encoding order optimization algorithm, the first set a_1 of control parameters is used to encode the first frame in the encoding order π.

6. Update: once the encoder has finished encoding one frame, the actual encoding time and amount of produced bits, as well as the actual rate and future rate estimates, are observed and used to update the buffer models. The encoded frame is discarded from the raw frame buffer, and the process starts over.
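One iteration of the flow above can be sketched as follows; the callbacks are hypothetical stand-ins for the algorithms of Sections 7-10, and the plan object is assumed to expose a quality score and the per-frame parameter sets:

```python
def control_step(orderings, frame_types_for, allocate, encode, update):
    # Steps 1-4: search encoding orders, frame-type assignments and
    # resource allocations for the best encoding plan.
    best = None
    for pi in orderings:                  # encoding order optimization
        for f in frame_types_for(pi):     # frame type decision
            plan = allocate(pi, f)        # sequence/frame resource allocation
            if best is None or plan.quality > best.quality:
                best = plan
    # Step 5: encode the first frame of the winning plan.
    stats = encode(best.params[0])
    # Step 6: feed the observed statistics back into the models.
    update(stats)
    return best
```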

7 Sequence resource allocation

Assuming that the encoder is encoding the i-th frame in the current GOP, and is capable of observing n − 1 additional future frames i+1, ..., i+n−1 queued in the raw input buffer, the problem of sequence resource allocation consists of establishing a resource budget for the latter sequence of n frames. Here, we consider the allocation of bit and encoding time budgets, denoted by b_T and τ_T, respectively. We denote the i-th frame's display time stamp relative to the beginning of the current GOP by t_i. Assuming that an estimate δ̂ of the decoder delay is available, we denote by

t_max = max{t_max^gop, δ̂}   (25)

the maximum GOP duration (if the duration is smaller than the estimated delay, t_max = δ̂ is used). The minimum GOP duration is set to the minimum temporal distance between two consecutive IDR frames,

t_min = t_min^idr.   (26)

The maximum amount of time remaining till the GOP end is denoted by

t_rem = t_max − t_i,   (27)

from which the number of remaining frames

n_rem = t_rem F   (28)

is obtained.

7.1 Effective bit rate estimation

We denote by r the current rate at which the encoder bit buffer is being drained, and by r̂ the estimated rate at the decoder at the time when the i-th frame will be decoded,

r̂ ≈ r(t + δ).   (29)

Assuming the simplistic model of the rate remaining constant at r in the interval [t, t + δ̂] and then remaining constant at r̂, the average bit rate of the decoder buffer drain in the interval [t, t + t_max] is

given by

r̄ = (δ̂ r + (t_max − δ̂) r̂) / t_max.   (30)

Under the assumption of a nearly constant channel rate, the encoder and decoder bit buffers remain synchronized (up to an initial transient), connected by the relation

l_enc(t) + l_dec(t + δ) = l_tot.   (31)

However, bit rate fluctuations result in a loss of such synchronization. For example, in the case of a future decrease in the channel capacity (r̂ < r), encoding the next frames at the rate r will cause the decoder buffer level to drop and, eventually, to underflow, while keeping the encoder buffer level balanced. In the converse case of a future increase in the channel capacity (r̂ > r), keeping the video stream rate at r will result in a decoder bit buffer overflow. The solution proposed here consists of modifying the rate at which the stream is encoded such that both buffers remain as balanced as possible. We refer to such a modified rate as the effective bit rate,

r_eff = max{min{0.5 (r + r̂) + Δr, max{r, r̂}}, min{r, r̂}},   (32)

where

Δr = (1/(2 t_max)) · { (l̂_dec − l_dec_min) − (l_enc − l_enc_min) : r̂ ≥ r
                       (l_enc_max − l_enc) − (l_dec_max − l̂_dec) : r̂ < r,   (33)

and l̂_dec denotes the estimated decoder bit buffer fullness. Using r_eff instead of r in our first example will cause both the encoder and the decoder buffer levels to drop to a lesser extent.

7.2 Bit budget allocation

Given an estimated effective encoding bit rate r_eff, the bit budget remaining till the end of the GOP is given by

b_rem = r_eff t_rem.   (34)
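The effective-rate clamping of Eq. (32) and the remaining budget of Eq. (34) can be sketched as follows (delta_r stands for the balancing term of Eq. (33)):

```python
def effective_rate(r: float, r_hat: float, delta_r: float) -> float:
    # Eq. (32): blend the current and the estimated future channel rates,
    # then clamp the result to the interval [min(r, r_hat), max(r, r_hat)].
    lo, hi = min(r, r_hat), max(r, r_hat)
    return max(min(0.5 * (r + r_hat) + delta_r, hi), lo)

def remaining_bit_budget(r_eff: float, t_rem: float) -> float:
    # Eq. (34): bits available until the end of the current GOP.
    return r_eff * t_rem
```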

According to the bit production model (6), the estimated number of bits generated by the encoder for the i-th frame given a quantizer q_i is

    b̂_i = α_i^b + β_i^b / q_i,    (35)

where the coefficients α^b = (α_1^b, ..., α_n^b)^T and β^b = (β_1^b, ..., β_n^b)^T depend on the frame content and the frame type f_i. The latter is assumed to be assigned by a higher-level frame type decision algorithm detailed in the sequel. We write

    ᾱ^b = Σ_{i=1}^n α_i^b,    β̄^b = Σ_{i=1}^n β_i^b    (36)

to denote the sums of the model coefficients of the observed n frames.¹ Substituting ᾱ^b and β̄^b into the bit production model yields the total estimate of the bits produced by encoding the observed frames i, ..., i + n − 1 with a constant quantizer. Similarly, we denote by

    α_rem^b = (n_rem − n) α̃^b,    β_rem^b = (n_rem − n) β̃^b    (37)

the model coefficients for estimating the number of bits of the remaining frames in the current GOP; α̃^b and β̃^b are decaying moving averages of the α_i^b and β_i^b, respectively, of the previously encoded frames.

The goal of sequence bit budget allocation is to maintain a fair distribution of the encoding bits throughout the sequence, considering the long-term effect of the decided allocation, as well as reacting to changes in the channel bit rate and in the frame texture and motion complexity. Under the simplifying assumption that at the sequence level the visual quality is a function of the quantizer only, not depending on the frame content, the problem of sequence bit budget allocation can be translated into finding a single quantization scale q which, assigned to the remainder of the GOP frames, produces b_rem bits. Substituting the appropriate bit production

¹ If i is an IDR frame, the contribution of α_i^b and β_i^b to the above sums can be increased by a multiplicative factor of (1 + γ^b), where γ^b ∈ [0, 1]. This ensures that more bits are granted to IDR frames, thus resulting in a slight quality boost at the beginning of each GOP.
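The coefficient sums of Eq. (36), together with the IDR boost described in the footnote, can be sketched as follows. Applying the boost per frame via a flag list is a generalization of the footnote (which boosts only the current frame when it is an IDR), and all names are assumptions.

```python
def model_coefficient_sums(alpha, beta, is_idr, gamma_b=0.5):
    """Eq. (36) with the footnote's IDR boost: coefficients of IDR
    frames are weighted by (1 + gamma_b), gamma_b in [0, 1].

    The default gamma_b value is illustrative only.
    """
    weights = [(1.0 + gamma_b) if idr else 1.0 for idr in is_idr]
    alpha_sum = sum(w * a for w, a in zip(weights, alpha))
    beta_sum = sum(w * b for w, b in zip(weights, beta))
    return alpha_sum, beta_sum
```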

models,

    (ᾱ^b + α_rem^b) + (β̄^b + β_rem^b) (1/q) = b_rem,    (38)

and solving for 1/q yields

    1/q = (b_rem − ᾱ^b − α_rem^b) / (β̄^b + β_rem^b).    (39)

Substituting the latter result into the bit production model yields the following bit budget for the sequence of the observed n frames:

    b_T = s · { ᾱ^b + β̄^b (b_rem − ᾱ^b − α_rem^b) / (β̄^b + β_rem^b)  :  β̄^b + β_rem^b > 0
                (n / n_rem) b_rem                                      :  else,    (40)

where

    s = ( l^tot + (ε_1^h / (ε_1 − 1)) l̂^dec ) / ( l^tot + (ε_1^{1−h} / (ε_1 − 1)) l̂^dec )    (41)

is a scaling factor decreasing the bit budget for higher, and increasing it for lower, decoder bit buffer levels; l̂^dec = l̂^dec(t_i − ε) is the estimated decoder bit buffer fullness immediately prior to decoding the i-th frame, ε_1 = 3, and h is the target relative decoder bit buffer level, defined as

    h = max{ min{ (t_i − t_min) / (t_max − t_min^idr) + (r̂ − r_eff) t_max / l^tot, 0.6 }, 0.4 }.    (42)

To guarantee some level of bit allocation fairness, the budget is constrained to be within 25% to 200% of the average bit budget, resulting in

    b̄_T = n r_eff / F,    (43)

    b_T = min{ max{ b_T, 0.25 b̄_T }, 2 b̄_T }.    (44)
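The budget computation of Eqs. (39)-(40) and the fairness clamp of Eqs. (43)-(44) can be sketched as below. The buffer-dependent scaling factor s of Eq. (41) is passed in precomputed, and all names are assumptions.

```python
def sequence_bit_budget(alpha_sum, beta_sum, alpha_rem, beta_rem,
                        b_rem, n, n_rem, r_eff, F, s=1.0):
    """Sketch of Eqs. (39)-(40) and (43)-(44): sequence bit budget.

    alpha_sum/beta_sum are the sums of Eq. (36), alpha_rem/beta_rem
    the remaining-frame coefficients of Eq. (37), and s the scaling
    factor of Eq. (41), supplied by the caller.
    """
    if beta_sum + beta_rem > 0:
        inv_q = (b_rem - alpha_sum - alpha_rem) / (beta_sum + beta_rem)  # Eq. (39)
        b_T = s * (alpha_sum + beta_sum * inv_q)                         # Eq. (40)
    else:
        b_T = s * b_rem * n / n_rem                                      # Eq. (40), else branch
    b_T_avg = n * r_eff / F                                              # Eq. (43)
    return min(max(b_T, 0.25 * b_T_avg), 2.0 * b_T_avg)                  # Eq. (44)
```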

Encoder and decoder bit buffer constraints are further imposed, yielding the following final expression for the sequence bit budget:

    b_T = min{ b_T, l_max^enc − l^enc + n r̄ / F, l̂^dec − l_min^dec + n r̂ / F },    (45)

where l^enc is the encoder bit buffer level immediately prior to encoding the i-th frame.

7.3 Encoding time budget allocation

Similarly to the bit budget allocation, the time budget allocation problem consists of assigning the target encoding time τ_T for the observed n frames. However, due to the stricter constraints imposed by the encoder raw frame buffer, encoding time allocation operates on shorter terms. Since frames are added to the encoder input buffer at the rate of F frames per second, the average encoding time of a frame is 1/F, resulting in the average budget

    τ̄_T = n / F    (46)

for the sequence of n observed frames. Using this budget and assuming the encoder has encoded n′ frames so far (including the dropped ones), the time at the encoder should be n′/F. However, since the actual encoding time may differ from the allocated budget, the time at the encoder immediately prior to encoding the i-th frame (relative to the time t_0 when the first frame in the sequence started being encoded) usually differs from the ideal value. We denote by

    t_dif = n′/F − (t − t_0)    (47)

the time difference between the ideal and the actual encoding time; if t_dif > 0, the encoder is faster than the input raw frame rate, and the encoding time budget has to be increased in order to avoid a raw buffer underflow. Similarly, if t_dif < 0, the encoder is lagging behind the input, and the time budget has to be decreased in order to prevent an overflow. Demanding that the encoder close the time gap within n_resp frames (n_resp/F seconds) yields the following encoding time budget:

    τ_T = τ̄_T + n t_dif / n_resp.    (48)
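The drift-correcting time budget of Eqs. (46)-(48) can be sketched as follows; the function and parameter names are assumptions.

```python
def encoding_time_budget(n, F, n_encoded, t, t0, n_resp=5):
    """Sketch of Eqs. (46)-(48): sequence encoding time budget.

    n_encoded is the number of frames encoded so far (n' in the text,
    dropped frames included), t the current encoder time, t0 the time
    the first frame started being encoded, and n_resp the number of
    frames over which the drift is to be closed (about 5 per the text).
    """
    tau_avg = n / F                       # Eq. (46): average budget
    t_dif = n_encoded / F - (t - t0)      # Eq. (47): ideal minus actual time
    return tau_avg + n * t_dif / n_resp   # Eq. (48): drift-corrected budget
```

A positive t_dif (encoder ahead of the input) enlarges the budget; a negative one shrinks it, as described in the text.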

Typically, n_resp ≈ 5, depending on l_max^raw. To guarantee some level of fairness, the budget is constrained by

    τ_T = max{ min{ τ_T, 1.5 τ̄_T }, 0.75 τ̄_T }.    (49)

Encoder bit buffer constraints are further imposed, yielding the final encoding time budget

    τ_T = max{ min{ τ_T, τ_max }, τ_min },    (50)

where

    τ_min = ( l^enc + Σ_{i=1}^n b̂_i − l_max^enc ) / r̄,
    τ_max = max{ ( l^enc + Σ_{i=1}^{n−1} b̂_i − l_min^enc ) / r̄, τ_min },    (51)

and b̂_i are the expected bits produced by encoding the i-th frame. Due to the dependence on b̂_i, the time allocation must be preceded by the bit allocation.

8 Frame resource allocation

The problem of frame resource allocation consists of distributing a given budget of resources between the n frames in a sequence. More formally, we say that the vector x = (x_1, ..., x_n)^T is an allocation of a resource x, where each x_i quantifies the amount of that resource allocated to the frame i. For example, x can represent the amount of coded bits or the encoding time. Ideally, we would like to maximize x_i for each frame; however, the allocation has to satisfy some constraints, one of which is

    Σ_{i=1}^n x_i = x_T,    (52)

where x_T is the resource budget assigned to the sequence by a higher-level sequence resource allocation algorithm. Other constraints, stemming, for example, from the encoder and decoder buffer conditions, apply as well. Formally, we say that an allocation vector x is feasible if it satisfies that set of conditions, and infeasible otherwise. Resource allocation can therefore be thought of as finding a feasible solution to the maximization
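The fairness and encoder-buffer clamps of Eqs. (49)-(51) can be sketched as below; names are assumptions, and b_hat is the list of expected per-frame bits, so the bit allocation must run first, as the text notes.

```python
def clamp_time_budget(tau_T, tau_avg, l_enc, b_hat, l_max, l_min, r_avg):
    """Sketch of Eqs. (49)-(51): fairness clamp followed by encoder
    bit buffer constraints on the encoding time budget.

    l_enc is the current encoder bit buffer level, l_max/l_min its
    limits, and r_avg the average drain rate (r-bar in the text).
    """
    # Eq. (49): fairness clamp around the average budget
    tau_T = max(min(tau_T, 1.5 * tau_avg), 0.75 * tau_avg)
    # Eq. (51): too-short encoding overflows the buffer, too-long underflows it
    tau_min = (l_enc + sum(b_hat) - l_max) / r_avg
    tau_max = max((l_enc + sum(b_hat[:-1]) - l_min) / r_avg, tau_min)
    # Eq. (50): final clamped budget
    return max(min(tau_T, tau_max), tau_min)
```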

problem

    max_x x  s.t.  x is feasible.    (53)

Since the encoder is usually controlled using a set of parameters different from the allocated resource itself (e.g., though coded bits are allocated as a resource, the amount of produced bits is controlled through the quantization parameter), it is more convenient to express x as a function of some vector of parameters θ = (θ_1, ..., θ_m)^T. This yields the following optimization problem:

    max_θ x(θ)  s.t.  x(θ) is feasible.    (54)

Note, however, that since the maximized objective is a vector rather than a scalar quantity, there exists no unique way to define what is meant by a maximum of x. Here, we adopt the following notion of vector objective optimality.

Definition 1 (Max-min optimality). A vector of resources x = (x_1, ..., x_n)^T is said to be max-min optimal if it is feasible, and for any 1 ≤ i ≤ n and any feasible y = (y_1, ..., y_n)^T for which x_i < y_i, there is some j with x_j ≤ x_i and x_j > y_j.

Informally, this means that it is impossible to increase the resource allocated to a frame i without decreasing the resources allocated to frames that already have a smaller resource allocation than x_i, and without violating the feasibility constraints. For this reason, a max-min optimal resource allocation can be thought of as fair. As to notation, given the vector of resources x as a function of some parameters θ and the feasibility constraints θ ∈ Ω, we will henceforth interpret the vector objective maximization problem

    θ* = arg max_{θ ∈ Ω} x(θ)    (55)

as finding a max-min optimal resource allocation x* = x(θ*). In the remainder of this section, we explore and formulate the allocation problems for the two types of resources considered here, namely coded bits and encoding time.

8.1 Bit allocation

The first problem we address is the distribution of the budget of b_T bits between the frames, which can be denoted as a bit allocation vector b = (b_1, ..., b_n)^T. However, since the encoder is not

controlled directly by b, but rather by the quantizer value, we reformulate the problem as allocating a vector q = (q_1, ..., q_n)^T of quantizer scales (or, equivalently, a vector of quantization parameters). Though the quantizer scales and parameters may assume only discrete values, we relax this restriction by making them continuous variables. In these terms, the quantizer allocation problem can be formulated as a max-min allocation of coding quality (in our case, of the average frame PSNR values p = (p_1, ..., p_n)^T) as a function of q, subject to feasibility constraints,

    max_q p(q)  s.t.  b(q) is feasible.    (56)

Note that the feasibility constraints are imposed on the amount of bits produced by the encoder with the control sequence q. Since in practice the exact amount of bits produced for a given quantizer is unknown a priori, b(q) has to be substituted with the estimated coded bits b̂(q). Such an estimate depends on the frame types, which are assumed to be decided by a higher-level frame type decision algorithm, detailed in the sequel.

One of the feasibility constraints is, clearly, 1^T b̂(q) = b_T. However, this condition is insufficient, as it may happen that the produced bits violate the buffer constraints for a frame i < n. In order to formulate the buffer constraints on the allocated bits, let us denote by t = (t_1, ..., t_n)^T the decoding timestamps of the frames, assuming without loss of generality that t_1 = 0 and that the frames are numbered in increasing order of t_i. We denote the estimated decoder bit buffer fullness immediately before and immediately after the frame i is decoded by l̂_{i−} = l̂^dec(t_i − ε) and l̂_{i+} = l̂^dec(t_i + ε), respectively. We assume that the initial decoder buffer fullness l̂_{1−} = l_0^dec is given as an input. In vector notation, we can write

    l̂_−(q) = l_0^dec 1 − K b̂(q) + r̂ t    (57)
    l̂_+(q) = l_0^dec 1 − J b̂(q) + r̂ t,    (58)

where

    J = [ 1 0 ... 0
          1 1 ... 0
          ...
          1 1 ... 1 ]    (59)

is the n × n lower-triangular matrix of ones, K = J − I, and r̂ is the estimated average decoder bit buffer filling rate over the time interval [t_1, t_n + 1/F].
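The buffer trajectories of Eqs. (57)-(58) can be sketched without forming J and K explicitly, since multiplying by J and K = J − I amounts to inclusive and exclusive prefix sums of the frame bits; names are assumptions.

```python
def decoder_buffer_levels(b_hat, t, l0, r_hat):
    """Sketch of Eqs. (57)-(58): estimated decoder bit buffer fullness
    immediately before and after each frame is decoded.

    b_hat are the estimated per-frame bits, t the decoding timestamps
    (t[0] = 0), l0 the initial buffer level, r_hat the filling rate.
    """
    before, after = [], []
    drained = 0.0                                   # bits of frames decoded so far
    for i in range(len(b_hat)):
        before.append(l0 + r_hat * t[i] - drained)  # (K b)_i: frames j < i
        drained += b_hat[i]
        after.append(l0 + r_hat * t[i] - drained)   # (J b)_i: frames j <= i
    return before, after
```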

The constrained optimal allocation problem becomes

    max_q p(q)  s.t.  q_min ≤ q ≤ q_max,
                      1^T b̂(q) = b_T,
                      l̂_−(q) ≤ l_max^dec,
                      l̂_+(q) ≥ l_min^dec,    (60)

where l_min^dec and l_max^dec are the minimum and the maximum decoder bit buffer levels, respectively. Since it is reasonable to assume a monotonically decreasing dependence of the PSNR on the quantization parameter (or the quantizer scale), the maximizer of (60) coincides with the minimizer of

    min_q q  s.t.  q_min ≤ q ≤ q_max,
                   1^T b̂(q) = b_T,
                   l̂_−(q) ≤ l_max^dec,
                   l̂_+(q) ≥ l_min^dec.    (61)

A numerical solution of the max-min allocation problem (61) can be carried out using a variant of the bottleneck link algorithm, summarized in Algorithm 1. In the algorithm, we assume that the amount of coded bits produced for a frame i as a function of q_i is given by the model

    b̂_i = α_i^b + β_i^b / q_i,    (62)

where the coefficients α^b = (α_1^b, ..., α_n^b)^T and β^b = (β_1^b, ..., β_n^b)^T depend on the frame content and frame types. In vector notation, this yields

    b̂ = α^b + β^b / q,    (63)

where the vector division is interpreted as an element-wise operation. The algorithm can be easily adapted to other models as well by replacing the closed-form solutions in Steps 1, 8, and 9 with the more general equation

    b_T = Σ_{i=1}^n b̂_i(q)    (64)

solved with respect to the scalar q.
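Under the linear model (62)-(63), the scalar equation (64) has the closed form used in Step 1 of Algorithm 1: q = Σβ / (b_T − Σα), clamped to the valid quantizer range. A sketch, with illustrative names and default range:

```python
def equal_quantizer(alpha, beta, b_T, q_min=1.0, q_max=100.0):
    """Closed-form solution of Eq. (64) under the model (62)-(63):
    a single q satisfying sum(alpha) + sum(beta)/q = b_T, clamped to
    [q_min, q_max] (range values are illustrative assumptions)."""
    denom = b_T - sum(alpha)
    if denom <= 0:
        # Budget below the model floor: fall back to the coarsest quantizer.
        return q_max
    return max(min(sum(beta) / denom, q_max), q_min)
```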

Input: n × 1 vectors α^b, β^b of bit model parameters; n × 1 vector of decoding time stamps t starting with t_1 = 0; initial decoder bit buffer level l_0^dec; bit budget b_T; estimated average decoder buffer filling rate r̂.
Output: n × 1 optimal max-min quantizer allocation vector q*.

1.  Find the optimal equal-quantizer allocation
        q = max{ min{ 1^T β^b / (b_T − 1^T α^b), q_max }, q_min }
    and set q = q 1.
2.  if l̂_−(q) ≤ l_max^dec and l̂_+(q) ≥ l_min^dec then
3.      Set q* = q.
4.  else
5.      Find the smallest 1 ≤ m ≤ n for which one of the constraints is violated.
6.      if l̂_{m−}(q) ≥ l_max^dec then set b_T^0 = l_0^dec + r̂ t_{m+1} − l_max^dec
        else set b_T^0 = l_0^dec + r̂ t_m − l_min^dec.
7.      Set b_T^0 = max{ b_T^0, 0 }.
8.      Find
            q′ = max{ min{ 1^T (β_1^b, ..., β_m^b)^T / (b_T^0 − 1^T (α_1^b, ..., α_m^b)^T), q_max }, q_min }
9.      and set q_1* = ... = q_m* = q′.
10.     Compute the remaining bit budget
            b_T = max{ b_T − 1^T (α_1^b, ..., α_m^b)^T − (1/q′) 1^T (β_1^b, ..., β_m^b)^T, 0 }.
11.     Recursively invoke the algorithm with α^b = (α_{m+1}^b, ..., α_n^b)^T, β^b = (β_{m+1}^b, ..., β_n^b)^T, t = (t_{m+1}, ..., t_n)^T − t_{m+1} 1, and l_0^dec = l_0^dec − b_T^0 + r̂ t_{m+1} to fix the remaining quantizer values (q_{m+1}*, ..., q_n*).
    end

Algorithm 1: Max-min frame bit allocation.
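A runnable sketch of Algorithm 1 follows, assuming the linear model (62) and simulating the decoder buffer by prefix sums. All names, the quantizer range, and the handling of the edge case m = n in Step 6 (where t_{m+1} does not exist) are assumptions of this sketch, not the paper's.

```python
def maxmin_bit_allocation(alpha, beta, t, l0, b_T, r_hat,
                          l_min, l_max, q_min=1.0, q_max=100.0):
    """Sketch of Algorithm 1: recursive bottleneck max-min bit
    allocation under the model b_i = alpha_i + beta_i / q_i."""
    n = len(alpha)
    if n == 0:
        return []

    def equal_q(a, b, budget):
        # Steps 1/8: closed-form equal quantizer, clamped to the range.
        d = budget - sum(a)
        if d <= 0:
            return q_max
        return max(min(sum(b) / d, q_max), q_min)

    # Step 1: equal-quantizer allocation.
    q = equal_q(alpha, beta, b_T)
    bits = [alpha[i] + beta[i] / q for i in range(n)]

    # Step 2: scan decoder buffer constraints (Eqs. (57)-(58)).
    drained, m, violation = 0.0, None, None
    for i in range(n):
        before = l0 + r_hat * t[i] - drained
        drained += bits[i]
        after = l0 + r_hat * t[i] - drained
        if before > l_max:
            m, violation = i, 'overflow'
            break
        if after < l_min:
            m, violation = i, 'underflow'
            break
    if m is None:
        return [q] * n  # Step 3: feasible, done.

    # Steps 6-7: bits forced on the bottleneck prefix 1..m.
    # Edge-case fallback for t_{m+1} when m is the last frame (assumption).
    t_next = t[m + 1] if m + 1 < n else t[m] + (t[m] - t[m - 1] if m > 0 else 0.0)
    if violation == 'overflow':
        b0 = l0 + r_hat * t_next - l_max
    else:
        b0 = l0 + r_hat * t[m] - l_min
    b0 = max(b0, 0.0)

    # Steps 8-9: equal quantizer over the bottleneck prefix.
    q_pre = equal_q(alpha[:m + 1], beta[:m + 1], b0)
    spent = sum(alpha[i] + beta[i] / q_pre for i in range(m + 1))

    # Steps 10-11: recurse on the remaining frames with the leftover budget.
    rest = maxmin_bit_allocation(
        alpha[m + 1:], beta[m + 1:],
        [ti - t_next for ti in t[m + 1:]],
        l0 - b0 + r_hat * t_next,
        max(b_T - spent, 0.0), r_hat, l_min, l_max, q_min, q_max)
    return [q_pre] * (m + 1) + rest
```

When no constraint is violated, the recursion terminates at Step 2 with a constant quantizer; otherwise the bottleneck prefix is fixed and the tail is re-solved with the reduced budget.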

reformulate the problem as allocating a vector c = (c_1, ..., c_n)^T of complexity scales. Again, we think of c as a continuous variable, though in practice it may be restricted to a set of discrete values. We denote by t̂ = (t̂_1, ..., t̂_n)^T the times at which each frame's encoding is complete, assuming without loss of generality that the first frame starts being encoded at time 0. Furthermore, we assume that the encoding end time of a frame i coincides with the encoding start time of the frame i + 1. In this notation,

    t̂ = J τ̂(c),    (65)

where τ̂ = (τ̂_1, ..., τ̂_n)^T denotes the estimated encoding times. Assuming the quantizer allocation has fixed the estimated amount of bits b̂ produced by encoding each frame, our goal is to maximize the coding quality p(c) subject to the encoder buffer constraints. The encoder buffer constraints include both the encoder bit buffer constraints, applying to the output bit buffer, and the encoder raw buffer constraints, applying to the input raw frame buffer. Similarly to the bit allocation problem, we denote the estimated encoder bit buffer fullness immediately before and immediately after frame i's encoding is complete by l̂_{i−} = l̂^enc(t̂_i − ε) and l̂_{i+} = l̂^enc(t̂_i + ε), respectively. The initial buffer level l_0^enc = l^enc(0) is assumed to be known from a direct observation. Using this notation, the buffer levels are given by

    l̂_−(c) = l_0^enc 1 + K b̂ − r̄ J τ̂(c)    (66)
    l̂_+(c) = l_0^enc 1 + J b̂ − r̄ J τ̂(c).    (67)

In the same manner, l̂_{i−}^raw and l̂_{i+}^raw denote the estimated encoder raw frame buffer fullness immediately before and immediately after frame i's encoding is complete. The initial buffer level at time 0 is denoted by l_0^raw and is available from a direct observation. Since the filling rate of the input buffer is F, we have

    l̂^raw(t̂_i − ε) = l_0^raw + t̂_i F − (i − 1)    (68)

and l̂^raw(t̂_i + ε) = l̂^raw(t̂_i − ε) − 1. In vector form, this yields

    l̂_−^raw(c) = l_0^raw 1 + F J τ̂(c) − K 1    (69)
    l̂_+^raw(c) = l_0^raw 1 + F J τ̂(c) − J 1.    (70)
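The raw frame buffer trajectories of Eqs. (69)-(70) can be sketched with running sums in place of the matrix products (J τ̂ is the cumulative encoding time, K 1 and J 1 count the frames removed); names are assumptions.

```python
def raw_buffer_levels(tau_hat, l0_raw, F):
    """Sketch of Eqs. (69)-(70): encoder raw (input) frame buffer
    fullness immediately before and after each frame's encoding
    completes. The buffer fills at F frames per second and loses one
    frame per encoded frame."""
    before, after = [], []
    t = 0.0
    for i, tau in enumerate(tau_hat):
        t += tau                                   # t_hat_i = (J tau)_i
        before.append(l0_raw + F * t - i)          # (K 1)_i = i frames removed
        after.append(l0_raw + F * t - (i + 1))     # (J 1)_i = i + 1 frames removed
    return before, after
```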

The buffer-constrained encoding complexity allocation problem can be expressed as

    max_c c  s.t.  c_min ≤ c ≤ c_max,
                   1^T τ̂(c) = τ_T,
                   l̂_+(c) ≤ l_max^enc,
                   l̂_−(c) ≥ l_min^enc,
                   l̂_−^raw(c) ≤ l_max^raw,
                   l̂_+^raw(c) ≥ l_min^raw.    (71)

However, this formulation suffers from two potential problems. The first drawback stems from our assumption that the encoding end time of a frame i coincides with the encoding start time of the frame i + 1. For example, if the i-th frame is dropped and the input buffer level falls below l_min^raw, the encoder will stall until the minimum buffer level is reached. This will make τ_i non-negligible, which in turn will require c_i to be very high (or even infinite, if we assume that the nominal encoding time of a dropped frame is strictly zero). We overcome this problem by relaxing the constraint c ≤ c_max. If some elements of the optimal complexity allocation vector c* exceed the maximum complexity scale, we will say that the encoder yields a portion of its CPU time to other processes potentially competing with it over the CPU resources. This fact can be quantified by introducing a vector of CPU utilization η = (η_1, ..., η_n)^T, where 0 ≤ η_i ≤ 1 expresses the fraction of the CPU time used by the encoder process in the time interval [t̂_{i−1}, t̂_i], t̂_0 = 0. Setting

    η_i = c_max / c_i*    (72)

and c = min{c*, c_max}, the encoding complexity scale of the frames with c_i* exceeding c_max will be set to c_max, and the utilized CPU time will be lower than 100%.

The second difficulty in the allocation problem (71) stems from the fact that sometimes the input and output buffer constraints may be conflicting (or, more formally, the feasible region may be empty).

Definition 2 (Ordered constraints). Given an ordered m-tuple of indicator functions χ = (χ_1, ..., χ_m)^T, χ_i : R^n → {0, 1}, a vector x is said to satisfy the ordered constraints χ(x) if there exists no y with χ(y) ≻ χ(x), where ≻ denotes the lexicographic order relation between binary strings.
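The clamping and CPU-utilization mapping of Eq. (72) can be sketched as follows; the function name is an assumption.

```python
def clamp_complexity(c_star, c_max):
    """Sketch of Eq. (72): complexity scales above c_max are clamped
    to c_max, and the excess is expressed as a CPU utilization
    eta_i = c_max / c_i* < 1, i.e. the encoder is assumed to yield
    part of the CPU to competing processes on those frames."""
    c = [min(ci, c_max) for ci in c_star]
    eta = [min(c_max / ci, 1.0) for ci in c_star]  # eta_i = 1 when no clamping occurs
    return c, eta
```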
Informally, this definition implies that in the case of conflicting constraints, satisfying the first constraint is more important than satisfying the second one, and so on. Using this relaxed notion of constrained optimality,

the allocation problem (71) can be rewritten as

    max_c c  s.t.  c ≥ c_min,
                   l̂_+^raw(c) ≥ l_min^raw,
                   l̂_−^raw(c) ≤ l_max^raw,
                   l̂_+(c) ≤ l_max^enc,
                   l̂_−(c) ≥ l_min^enc,
                   1^T τ̂(c) = τ_T,    (73)

where the constraints are interpreted as ordered constraints. The time complexity allocation problem (73) is solved using Algorithm 2, similar to Algorithm 1 for bit allocation. In the algorithm, we assume that the encoding time of a frame i as a function of c_i is given by the model

    τ̂_i = (α_i^t + β_i^t q_i) / (c_i η_i) = γ_i^t / (c_i η_i),    (74)

where the coefficients α^t = (α_1^t, ..., α_n^t)^T and β^t = (β_1^t, ..., β_n^t)^T depend on the frame content and frame types, and γ^t = (α_1^t + β_1^t q_1, ..., α_n^t + β_n^t q_n)^T.

8.3 Joint bit and encoding time allocation

The allocation problems (61) and (73) tacitly assume that the allocations of the quantization and the encoder time complexity scales are independent. However, while it is reasonable to assume that the coding quality p is a function of q only (and, thus, is independent of c), the amount of produced bits b clearly depends on both parameters. This dependence couples the two problems together through the more accurate expressions for the encoder bit buffer levels

    l̂_−(q, c) = l_0^enc 1 + K b̂(q, c) − r̄ J τ̂(c)    (75)
    l̂_+(q, c) = l_0^enc 1 + J b̂(q, c) − r̄ J τ̂(c)    (76)

and the decoder bit buffer levels

    l̂_−(q, c) = l_0^dec 1 − K b̂(q, c) + r̂ t    (77)
    l̂_+(q, c) = l_0^dec 1 − J b̂(q, c) + r̂ t.    (78)


More information

On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation

On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation Mikael Fallgren Royal Institute of Technology December, 2009 Abstract

More information

Wavelets and Multiresolution Processing

Wavelets and Multiresolution Processing Wavelets and Multiresolution Processing Wavelets Fourier transform has it basis functions in sinusoids Wavelets based on small waves of varying frequency and limited duration In addition to frequency,

More information

Perceptual Feedback in Multigrid Motion Estimation Using an Improved DCT Quantization

Perceptual Feedback in Multigrid Motion Estimation Using an Improved DCT Quantization IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 10, OCTOBER 2001 1411 Perceptual Feedback in Multigrid Motion Estimation Using an Improved DCT Quantization Jesús Malo, Juan Gutiérrez, I. Epifanio,

More information

MULTIPLE CHOICE QUESTIONS DECISION SCIENCE

MULTIPLE CHOICE QUESTIONS DECISION SCIENCE MULTIPLE CHOICE QUESTIONS DECISION SCIENCE 1. Decision Science approach is a. Multi-disciplinary b. Scientific c. Intuitive 2. For analyzing a problem, decision-makers should study a. Its qualitative aspects

More information

Chapter 9 Fundamental Limits in Information Theory

Chapter 9 Fundamental Limits in Information Theory Chapter 9 Fundamental Limits in Information Theory Information Theory is the fundamental theory behind information manipulation, including data compression and data transmission. 9.1 Introduction o For

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

18.2 Continuous Alphabet (discrete-time, memoryless) Channel

18.2 Continuous Alphabet (discrete-time, memoryless) Channel 0-704: Information Processing and Learning Spring 0 Lecture 8: Gaussian channel, Parallel channels and Rate-distortion theory Lecturer: Aarti Singh Scribe: Danai Koutra Disclaimer: These notes have not

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Lesson 7 Delta Modulation and DPCM Instructional Objectives At the end of this lesson, the students should be able to: 1. Describe a lossy predictive coding scheme.

More information

Intraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding

Intraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding Intraframe Prediction with Intraframe Update Step for Motion-Compensated Lifted Wavelet Video Coding Aditya Mavlankar, Chuo-Ling Chang, and Bernd Girod Information Systems Laboratory, Department of Electrical

More information

Dynamic resource sharing

Dynamic resource sharing J. Virtamo 38.34 Teletraffic Theory / Dynamic resource sharing and balanced fairness Dynamic resource sharing In previous lectures we have studied different notions of fair resource sharing. Our focus

More information

AN INFORMATION THEORY APPROACH TO WIRELESS SENSOR NETWORK DESIGN

AN INFORMATION THEORY APPROACH TO WIRELESS SENSOR NETWORK DESIGN AN INFORMATION THEORY APPROACH TO WIRELESS SENSOR NETWORK DESIGN A Thesis Presented to The Academic Faculty by Bryan Larish In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

Image Compression. Fundamentals: Coding redundancy. The gray level histogram of an image can reveal a great deal of information about the image

Image Compression. Fundamentals: Coding redundancy. The gray level histogram of an image can reveal a great deal of information about the image Fundamentals: Coding redundancy The gray level histogram of an image can reveal a great deal of information about the image That probability (frequency) of occurrence of gray level r k is p(r k ), p n

More information

Thor update. High Efficiency, Moderate Complexity Video Codec using only RF IPR

Thor update. High Efficiency, Moderate Complexity Video Codec using only RF IPR Thor update High Efficiency, Moderate Complexity Video Codec using only RF IPR draft-fuldseth-netvc-thor-01 Steinar Midtskogen (Cisco) IETF 94 Yokohama, JP November 2015 1 IPR note https://datatracker.ietf.org/ipr/2636/

More information

Digital Image Processing Lectures 25 & 26

Digital Image Processing Lectures 25 & 26 Lectures 25 & 26, Professor Department of Electrical and Computer Engineering Colorado State University Spring 2015 Area 4: Image Encoding and Compression Goal: To exploit the redundancies in the image

More information

Multimedia Communications Fall 07 Midterm Exam (Close Book)

Multimedia Communications Fall 07 Midterm Exam (Close Book) Multimedia Communications Fall 07 Midterm Exam (Close Book) 1. (20%) (a) For video compression using motion compensated predictive coding, compare the advantages and disadvantages of using a large block-size

More information

CSE 126 Multimedia Systems Midterm Exam (Form A)

CSE 126 Multimedia Systems Midterm Exam (Form A) University of California, San Diego Inst: Prof P. V. Rangan CSE 126 Multimedia Systems Midterm Exam (Form A) Spring 2003 Solution Assume the following input (before encoding) frame sequence (note that

More information

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University

On Compression Encrypted Data part 2. Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University On Compression Encrypted Data part 2 Prof. Ja-Ling Wu The Graduate Institute of Networking and Multimedia National Taiwan University 1 Brief Summary of Information-theoretic Prescription At a functional

More information

Multiple Description Coding for quincunx images.

Multiple Description Coding for quincunx images. Multiple Description Coding for quincunx images. Application to satellite transmission. Manuela Pereira, Annabelle Gouze, Marc Antonini and Michel Barlaud { pereira, gouze, am, barlaud }@i3s.unice.fr I3S

More information

ECE472/572 - Lecture 11. Roadmap. Roadmap. Image Compression Fundamentals and Lossless Compression Techniques 11/03/11.

ECE472/572 - Lecture 11. Roadmap. Roadmap. Image Compression Fundamentals and Lossless Compression Techniques 11/03/11. ECE47/57 - Lecture Image Compression Fundamentals and Lossless Compression Techniques /03/ Roadmap Preprocessing low level Image Enhancement Image Restoration Image Segmentation Image Acquisition Image

More information

Enhanced Stochastic Bit Reshuffling for Fine Granular Scalable Video Coding

Enhanced Stochastic Bit Reshuffling for Fine Granular Scalable Video Coding Enhanced Stochastic Bit Reshuffling for Fine Granular Scalable Video Coding Wen-Hsiao Peng, Tihao Chiang, Hsueh-Ming Hang, and Chen-Yi Lee National Chiao-Tung University 1001 Ta-Hsueh Rd., HsinChu 30010,

More information

Real-Time Audio and Video

Real-Time Audio and Video MM- Multimedia Payloads MM-2 Raw Audio (uncompressed audio) Real-Time Audio and Video Telephony: Speech signal: 2 Hz 3.4 khz! 4 khz PCM (Pulse Coded Modulation)! samples/sec x bits = 64 kbps Teleconferencing:

More information

Lossless Image and Intra-frame Compression with Integer-to-Integer DST

Lossless Image and Intra-frame Compression with Integer-to-Integer DST 1 Lossless Image and Intra-frame Compression with Integer-to-Integer DST Fatih Kamisli, Member, IEEE arxiv:1708.07154v1 [cs.mm] 3 Aug 017 Abstract Video coding standards are primarily designed for efficient

More information

A DISTRIBUTED VIDEO CODER BASED ON THE H.264/AVC STANDARD

A DISTRIBUTED VIDEO CODER BASED ON THE H.264/AVC STANDARD 5th European Signal Processing Conference (EUSIPCO 27), Poznan, Poland, September 3-7, 27, copyright by EURASIP A DISTRIBUTED VIDEO CODER BASED ON THE /AVC STANDARD Simone Milani and Giancarlo Calvagno

More information

UNIT I INFORMATION THEORY. I k log 2

UNIT I INFORMATION THEORY. I k log 2 UNIT I INFORMATION THEORY Claude Shannon 1916-2001 Creator of Information Theory, lays the foundation for implementing logic in digital circuits as part of his Masters Thesis! (1939) and published a paper

More information

Robust Network Codes for Unicast Connections: A Case Study

Robust Network Codes for Unicast Connections: A Case Study Robust Network Codes for Unicast Connections: A Case Study Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades Department of Electrical and Computer Engineering Texas A&M University College Station,

More information

Single Frame Rate-Quantization Model for MPEG-4 AVC/H.264 Video Encoders

Single Frame Rate-Quantization Model for MPEG-4 AVC/H.264 Video Encoders Single Frame Rate-Quantization Model for MPEG-4 AVC/H.264 Video Encoders Tomasz Grajek and Marek Domański Poznan University of Technology Chair of Multimedia Telecommunications and Microelectronics ul.

More information

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006) MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)

More information

Towards control over fading channels

Towards control over fading channels Towards control over fading channels Paolo Minero, Massimo Franceschetti Advanced Network Science University of California San Diego, CA, USA mail: {minero,massimo}@ucsd.edu Invited Paper) Subhrakanti

More information

Rate-distortion Analysis and Control in DCT-based Scalable Video Coding. Xie Jun

Rate-distortion Analysis and Control in DCT-based Scalable Video Coding. Xie Jun Rate-distortion Analysis and Control in DCT-based Scalable Video Coding Xie Jun School of Computer Engineering A thesis submitted to the Nanyang Technological University in fulfillment of the requirement

More information

arxiv: v1 [cs.mm] 10 Mar 2016

arxiv: v1 [cs.mm] 10 Mar 2016 Predicting Chroma from Luma with Frequency Domain Intra Prediction Nathan E. Egge and Jean-Marc Valin Mozilla, Mountain View, USA Xiph.Org Foundation arxiv:1603.03482v1 [cs.mm] 10 Mar 2016 ABSTRACT This

More information

Reduce the amount of data required to represent a given quantity of information Data vs information R = 1 1 C

Reduce the amount of data required to represent a given quantity of information Data vs information R = 1 1 C Image Compression Background Reduce the amount of data to represent a digital image Storage and transmission Consider the live streaming of a movie at standard definition video A color frame is 720 480

More information

SPEECH ANALYSIS AND SYNTHESIS

SPEECH ANALYSIS AND SYNTHESIS 16 Chapter 2 SPEECH ANALYSIS AND SYNTHESIS 2.1 INTRODUCTION: Speech signal analysis is used to characterize the spectral information of an input speech signal. Speech signal analysis [52-53] techniques

More information

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak

4. Quantization and Data Compression. ECE 302 Spring 2012 Purdue University, School of ECE Prof. Ilya Pollak 4. Quantization and Data Compression ECE 32 Spring 22 Purdue University, School of ECE Prof. What is data compression? Reducing the file size without compromising the quality of the data stored in the

More information

Optimization of Selective Enhancement for MPEG-4 Fine Granularity Scalability

Optimization of Selective Enhancement for MPEG-4 Fine Granularity Scalability Optimization of Selective Enhancement for MPEG-4 Fine Granularity Scalability Wen-Shiaw Peng, H.C. Huang and Tihao Chiang Dept. of Electronics Engineering, National Chiao Tung University, 1001, University

More information

PREDICTIVE quantization is one of the most widely-used

PREDICTIVE quantization is one of the most widely-used 618 IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 1, NO. 4, DECEMBER 2007 Robust Predictive Quantization: Analysis and Design Via Convex Optimization Alyson K. Fletcher, Member, IEEE, Sundeep

More information

EDF Feasibility and Hardware Accelerators

EDF Feasibility and Hardware Accelerators EDF Feasibility and Hardware Accelerators Andrew Morton University of Waterloo, Waterloo, Canada, arrmorton@uwaterloo.ca Wayne M. Loucks University of Waterloo, Waterloo, Canada, wmloucks@pads.uwaterloo.ca

More information

Optimal Media Streaming in a Rate-Distortion Sense For Guaranteed Service Networks

Optimal Media Streaming in a Rate-Distortion Sense For Guaranteed Service Networks Optimal Media Streaming in a Rate-Distortion Sense For Guaranteed Service Networks Olivier Verscheure and Pascal Frossard Jean-Yves Le Boudec IBM Watson Research Center Swiss Federal Institute of Tech.

More information

Digital Baseband Systems. Reference: Digital Communications John G. Proakis

Digital Baseband Systems. Reference: Digital Communications John G. Proakis Digital Baseband Systems Reference: Digital Communications John G. Proais Baseband Pulse Transmission Baseband digital signals - signals whose spectrum extend down to or near zero frequency. Model of the

More information

at Some sort of quantization is necessary to represent continuous signals in digital form

at Some sort of quantization is necessary to represent continuous signals in digital form Quantization at Some sort of quantization is necessary to represent continuous signals in digital form x(n 1,n ) x(t 1,tt ) D Sampler Quantizer x q (n 1,nn ) Digitizer (A/D) Quantization is also used for

More information

Introduction p. 1 Compression Techniques p. 3 Lossless Compression p. 4 Lossy Compression p. 5 Measures of Performance p. 5 Modeling and Coding p.

Introduction p. 1 Compression Techniques p. 3 Lossless Compression p. 4 Lossy Compression p. 5 Measures of Performance p. 5 Modeling and Coding p. Preface p. xvii Introduction p. 1 Compression Techniques p. 3 Lossless Compression p. 4 Lossy Compression p. 5 Measures of Performance p. 5 Modeling and Coding p. 6 Summary p. 10 Projects and Problems

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Multimedia & Computer Visualization. Exercise #5. JPEG compression

Multimedia & Computer Visualization. Exercise #5. JPEG compression dr inż. Jacek Jarnicki, dr inż. Marek Woda Institute of Computer Engineering, Control and Robotics Wroclaw University of Technology {jacek.jarnicki, marek.woda}@pwr.wroc.pl Exercise #5 JPEG compression

More information

LATTICE VECTOR QUANTIZATION FOR IMAGE CODING USING EXPANSION OF CODEBOOK

LATTICE VECTOR QUANTIZATION FOR IMAGE CODING USING EXPANSION OF CODEBOOK LATTICE VECTOR QUANTIZATION FOR IMAGE CODING USING EXPANSION OF CODEBOOK R. R. Khandelwal 1, P. K. Purohit 2 and S. K. Shriwastava 3 1 Shri Ramdeobaba College Of Engineering and Management, Nagpur richareema@rediffmail.com

More information

Hyper-Trellis Decoding of Pixel-Domain Wyner-Ziv Video Coding

Hyper-Trellis Decoding of Pixel-Domain Wyner-Ziv Video Coding 1 Hyper-Trellis Decoding of Pixel-Domain Wyner-Ziv Video Coding Arun Avudainayagam, John M. Shea, and Dapeng Wu Wireless Information Networking Group (WING) Department of Electrical and Computer Engineering

More information

arxiv: v1 [cs.it] 20 Jan 2018

arxiv: v1 [cs.it] 20 Jan 2018 1 Analog-to-Digital Compression: A New Paradigm for Converting Signals to Bits Alon Kipnis, Yonina C. Eldar and Andrea J. Goldsmith fs arxiv:181.6718v1 [cs.it] Jan 18 X(t) sampler smp sec encoder R[ bits

More information

On Common Information and the Encoding of Sources that are Not Successively Refinable

On Common Information and the Encoding of Sources that are Not Successively Refinable On Common Information and the Encoding of Sources that are Not Successively Refinable Kumar Viswanatha, Emrah Akyol, Tejaswi Nanjundaswamy and Kenneth Rose ECE Department, University of California - Santa

More information

Information and Entropy

Information and Entropy Information and Entropy Shannon s Separation Principle Source Coding Principles Entropy Variable Length Codes Huffman Codes Joint Sources Arithmetic Codes Adaptive Codes Thomas Wiegand: Digital Image Communication

More information

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding SIGNAL COMPRESSION Lecture 3 4.9.2007 Shannon-Fano-Elias Codes and Arithmetic Coding 1 Shannon-Fano-Elias Coding We discuss how to encode the symbols {a 1, a 2,..., a m }, knowing their probabilities,

More information

Trajectory planning and feedforward design for electromechanical motion systems version 2

Trajectory planning and feedforward design for electromechanical motion systems version 2 2 Trajectory planning and feedforward design for electromechanical motion systems version 2 Report nr. DCT 2003-8 Paul Lambrechts Email: P.F.Lambrechts@tue.nl April, 2003 Abstract This report considers

More information

A polynomial-time approximation scheme for the two-machine flow shop scheduling problem with an availability constraint

A polynomial-time approximation scheme for the two-machine flow shop scheduling problem with an availability constraint A polynomial-time approximation scheme for the two-machine flow shop scheduling problem with an availability constraint Joachim Breit Department of Information and Technology Management, Saarland University,

More information

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London Distributed Data Fusion with Kalman Filters Simon Julier Computer Science Department University College London S.Julier@cs.ucl.ac.uk Structure of Talk Motivation Kalman Filters Double Counting Optimal

More information

3 rd Generation Approach to Video Compression for Multimedia

3 rd Generation Approach to Video Compression for Multimedia 3 rd Generation Approach to Video Compression for Multimedia Pavel Hanzlík, Petr Páta Dept. of Radioelectronics, Czech Technical University in Prague, Technická 2, 166 27, Praha 6, Czech Republic Hanzlip@feld.cvut.cz,

More information

Converting DCT Coefficients to H.264/AVC

Converting DCT Coefficients to H.264/AVC MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Converting DCT Coefficients to H.264/AVC Jun Xin, Anthony Vetro, Huifang Sun TR2004-058 June 2004 Abstract Many video coding schemes, including

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

EE368B Image and Video Compression

EE368B Image and Video Compression EE368B Image and Video Compression Homework Set #2 due Friday, October 20, 2000, 9 a.m. Introduction The Lloyd-Max quantizer is a scalar quantizer which can be seen as a special case of a vector quantizer

More information

The Effect of Stale Ranging Data on Indoor 2-D Passive Localization

The Effect of Stale Ranging Data on Indoor 2-D Passive Localization The Effect of Stale Ranging Data on Indoor 2-D Passive Localization Chen Xia and Lance C. Pérez Department of Electrical Engineering University of Nebraska-Lincoln, USA chenxia@mariner.unl.edu lperez@unl.edu

More information