IEEE Transactions on Information Theory, vol. 39, no. 5, September 1993

Block Arithmetic Coding for Source Compression

Charles G. Boncelet Jr., Member, IEEE

Abstract - We introduce Block Arithmetic Coding (BAC), a technique for entropy coding that combines many of the advantages of ordinary stream arithmetic coding with the simplicity of block codes. The code is variable length in to fixed out (V to F), unlike Huffman coding which is fixed in to variable out (F to V). We develop two versions of the coder: 1) an optimal encoder based on dynamic programming arguments, and 2) a suboptimal heuristic based on arithmetic coding. The optimal coder is optimal over all V to F complete and proper block codes. We show that the suboptimal coder achieves compression that is within a constant of a perfect entropy coder for independent and identically distributed inputs. BAC is easily implemented, even with large codebooks, because the algorithms for coding and decoding are regular. For instance, codebooks with 2^32 entries are feasible. BAC also does not suffer catastrophic failure in the presence of channel errors. Decoding errors are confined to the block in question. The encoding is in practice reasonably efficient. With i.i.d. binary inputs with P(1) = 0.95 and 16 bit codes, entropy arguments indicate at most 55.8 bits can be encoded per codeword; the BAC heuristic achieves 53.0 and the optimal BAC does slightly better. Finally, BAC appears to be much faster than ordinary arithmetic coding.

Index Terms - Arithmetic codes, block codes, variable to fixed codes, entropy compression.

(Manuscript received September 9, 1991. This paper was presented in part at the 1992 Conference on Information Sciences and Systems, Princeton, NJ, March 1992. This work was performed in part while the author was visiting the University of Michigan. The author is with the Department of Electrical Engineering, University of Delaware, Newark, DE.)

I. INTRODUCTION

We introduce a method of source coding called Block Arithmetic Coding (BAC). BAC is a variable in to fixed out block coder (V to F), unlike Huffman codes which are fixed in to variable out (F to V). BAC is simple to implement, efficient, and usable in a wide variety of applications. Furthermore, BAC codes do not suffer catastrophic failure in the presence of channel errors. Decoding errors are confined to the block in question.

Consider the input, X = x_1 x_2 x_3 ..., to be a sequence of independent and identically distributed input symbols taken from some input alphabet A = {a_1, a_2, ..., a_m}. The symbols obey probabilities p_j = Pr(x_i = a_j). Assume that p_j > 0 for all j. The entropy of the source is

    H(X) = - \sum_{j=1}^{m} p_j \log p_j.    (1)

(We will measure all entropies in bits.) The compression efficiency of entropy encoders is bounded below by the entropy. For F to V encoders, each input symbol requires on average at least H(X) output bits; for V to F encoders, each output bit can on average represent at most 1/H(X) input symbols.

The most important entropy compressor is Huffman coding [9]. Huffman codes are F to V block codes. They block the input into strings of q letters and encode these strings with variable length output strings. It is well known that the number of output bits of a Huffman code is bounded above by qH(X) + 1 for all block sizes. In special cases, this bound can be tightened [5], [17]. As a result the relative efficiency of Huffman codes can be made as high as desired by taking the block size to be large enough. In practice, however, large block sizes are difficult to implement. Codebooks are usually stored and must be searched for each input string. This storage and searching limits block sizes to small values. (However, see Jakobsson [11].)
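As a quick check of the numbers quoted in the abstract, the sketch below evaluates (1) in bits for a binary source with P(1) = 0.95 and divides the 16 bit codeword length by the result; the 55.8 figure is just 16/H(X), up to rounding. The code is purely illustrative and is not taken from the paper's own software [4].

#include <math.h>
#include <stdio.h>

/* Entropy in bits of an m-ary source with probabilities p[0..m-1]. */
static double entropy_bits(const double *p, int m)
{
    double h = 0.0;
    for (int j = 0; j < m; j++)
        h -= p[j] * log2(p[j]);
    return h;
}

int main(void)
{
    double p[2] = { 0.05, 0.95 };   /* P(0) = 0.05, P(1) = 0.95 */
    double h = entropy_bits(p, 2);
    /* A V to F coder with 16 bit codewords can represent at most
       16/H(X) input symbols per codeword, on average. */
    printf("H(X) = %.4f bits, 16/H(X) = %.2f input bits per codeword\n",
           h, 16.0 / h);            /* roughly 0.2864 and 55.87 */
    return 0;
}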
Huffman codes can be implemented adaptively, but it is difficult to do so. The process of generating Huffman code trees is sufficiently cumbersome that adapting the tree on a symbol by symbol basis is rarely computationally efficient.

In recent years, a class of entropy coders called arithmetic coders has been studied and developed [8], [13], [18]-[21], [25]. Arithmetic coders are stream coders. They take in an arbitrarily long input and output a corresponding output stream. Arithmetic coders have many advantages. On average, the ratio of output length to input length can be very close to the source entropy. The input probabilities can be changed on a symbol by symbol basis and can be estimated adaptively. However, arithmetic codes have certain disadvantages. In some applications, the encoding and decoding steps are too complicated to be done in real time.

The entropy coder most similar to BAC is due to Tunstall [23], attributed in [12]. Tunstall's encoder is a V to F block coder that operates as follows: Given a codebook with K output symbols, a codebook with K + m - 1 output symbols is generated by splitting the most probable input sequence into m successors, each formed by appending one of the a_j. Unfortunately, the encoding of input sequences appears to require searching a codebook. Thus, large block sizes appear to be prohibitive, both in creating the code and in searching it. This has the practical effect of limiting block sizes to small values.

Recently, a V to F block arithmetic coder has been proposed by Teuhola and Raita [22]. While similar in spirit to this work, their coder, named FIXAR, differs in many details from BAC. FIXAR parses the input differently from BAC and encodes each string differently. The analysis of FIXAR is entirely different from the analysis of BAC presented in Section III. FIXAR does appear to be fast.
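For concreteness, here is a small sketch of the Tunstall construction just described, specialized to a binary alphabet: it repeatedly splits the most probable leaf of the parsing tree until the codebook is full. The flat arrays, the linear search for the most probable leaf, and the function name are illustrative choices, not details taken from [23].

#include <stdio.h>

#define MAXLEAVES 4096

/* Build a binary Tunstall parse with K leaves (codewords), K <= MAXLEAVES.
   prob[i] is the probability of leaf i (an input string); len[i] its length.
   Returns the number of leaves actually built. */
static int tunstall_binary(double p1, int K, double *prob, int *len)
{
    double p0 = 1.0 - p1;
    int n = 2;
    prob[0] = p0; len[0] = 1;            /* leaf "0" */
    prob[1] = p1; len[1] = 1;            /* leaf "1" */
    while (n < K && n < MAXLEAVES) {     /* each split adds one codeword */
        int best = 0;
        for (int i = 1; i < n; i++)      /* most probable leaf (linear scan) */
            if (prob[i] > prob[best]) best = i;
        /* replace it by its two one-symbol extensions */
        prob[n] = prob[best] * p1; len[n] = len[best] + 1;
        prob[best] *= p0;          len[best] += 1;
        n++;
    }
    return n;
}

int main(void)
{
    static double prob[MAXLEAVES];
    static int len[MAXLEAVES];
    int n = tunstall_binary(0.95, 256, prob, len);     /* 8 bit codewords */
    double avg = 0.0;                    /* expected input symbols per codeword */
    for (int i = 0; i < n; i++) avg += prob[i] * len[i];
    printf("%d codewords, %.2f input symbols per codeword on average\n", n, avg);
    return 0;
}

The quadratic search over the leaves is exactly the kind of codebook manipulation that, as the text notes, becomes burdensome for large block sizes.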

Other V to F entropy coders are runlength [7] and Ziv-Lempel [14], [24], [27]. Runlength codes count the runs of single symbols and encode the runs by transmitting the number in each run. They work well when the probabilities are highly skewed or when there is significant positive correlation between symbols. Runlength codes are often used for small alphabets, especially binary inputs. Then the runlength code is particularly simple since, by necessity, the runs alternate between 1's and 0's. Runlength codes are occasionally combined with Huffman coding to form a variable in to variable out (V to V) scheme. For example, consider the CCITT facsimile code [10] or the more general V to V runlength codes of Elias [6].

Ziv-Lempel codes work on different principles than BAC codes or the other codes considered so far. They find substrings of input symbols that appear frequently and encode each with a fixed length codeword. Ziv-Lempel codes work particularly well on English text, as they can discover complex dependencies between the symbols.

One advantage of most fixed output length block coders, including BAC, is that the effects of transmission errors (symbols received incorrectly) are confined to the block in question. The decoder does not lose synchronism with the incoming bit stream. While the corrupted block may be decoded incorrectly, remaining blocks will be decoded correctly. However, there may be insertions or deletions in the decoded stream. These insertions and deletions can usually be dealt with at the application level. In contrast, Huffman codes and arithmetic codes can lose synchronism and suffer catastrophic failure in the presence of channel errors. (It is often argued that Huffman codes are self-synchronizing, but this is not guaranteed. Various techniques have been proposed to combat this problem. See, e.g., [16].)

One subtle point is that fixed output length codes can still suffer catastrophic failure if implemented in an adaptive fashion. If the probability estimates depend on previously transmitted symbols, then a single error can affect other blocks. In particular, Ziv-Lempel codes are usually implemented in an adaptive fashion and are susceptible to catastrophic failure if their dictionaries get corrupted. BAC codes will be vulnerable to the same catastrophic failure if the probabilities are estimated adaptively based on prior symbols.

Finally, there is a growing recognition that V to F codes can outperform F to V codes when there are strong dependencies between symbols [15], [26]. The idea is that, for a fixed number of symbols, V to F codes can match long input strings. For example, a runlength coder with K codewords can match a run of up to K - 1 symbols. This run can be encoded with approximately log K bits, resulting in an exponential gain. In contrast, Huffman codes only give arithmetic gains. For K large enough, the V to F coder is more efficient.

This paper is organized as follows: Section II develops the optimal and heuristic BAC's. In Section III, we present the principal theoretical result of this paper: that BAC's are within an additive constant of the entropy for all code sizes. Section IV presents computations illustrating the efficiency of BAC's. Section V discusses extensions to time varying and Markov symbols, Section VI discusses the EOF problem and proposes two new solutions, and Section VII presents conclusions and discussion.
II. DEVELOPMENT OF BLOCK ARITHMETIC CODING

The BAC encoder parses the input into variable length input strings and encodes each with a single fixed length output codeword. The following definitions are taken from Jelinek and Schneider [12]: Let W(K) = {w_i : 1 <= i <= K} be a K element collection of words, w_i, with each word a substring of X. W(K) is proper if, for any two words w_i and w_j with i != j, w_i is not a prefix of w_j. W(K) is complete if every infinite length input string has a prefix in W(K). The characterization in [15] is succinct: a code is complete and proper if every infinite length input string has one and only one prefix in W(K).

It is useful to think of complete and proper from the viewpoint of parsing trees. Proper means that the codewords are all leafs of the tree; complete means that all the nodes in the tree are either leafs or internal nodes with m children. The number of codewords in complete and proper sets is 1 + L(m - 1) for L = 1, 2, ... [12]. This result is also easy to derive from the viewpoint of trees. The root has m children (L = 1). Each time a leaf node is replaced by m children, a net gain of m - 1 leafs results. Below, we will allow L = 0 so that one codeword can be allowed in appropriate sets.

We assume that to each output codeword is assigned an integer index i with i = 1, 2, ..., K_S. Let S be the set of output codewords. Consider a subset of codewords with contiguous indices. Denote the first index by A and the last by B. The number of codewords in the subset is K = B - A + 1. BAC proceeds by recursively splitting S into disjoint subsets. With each input symbol, the current subset is split into m nonempty, disjoint subsets, one for each possible input letter. The new subset corresponding to the actual letter is chosen and the process continues recursively. When the subset has fewer than m codewords, BAC stops splitting, outputs any of the codewords in the subset, reinitializes, and continues. The ideal situation is that each final subset contain only 1 codeword; otherwise, the extra codewords are wasted.

Denote the number of codewords in each of the m disjoint subsets by K_l(K), where K is the number of codewords in the current set. Usually, we will suppress the dependence on K and denote the number simply by K_l. The question of how the K_l should be selected is central to the remainder of this paper. We can identify five criteria that are necessary or desirable:

1) K_l > 0.
2) \sum_{l=1}^{m} K_l <= K.
3) \sum_{l=1}^{m} K_l = K.
4) K_l = 1 + L_l(m - 1) for some L_l = 0, 1, 2, ....
5) If p_j > p_i, then K_j >= K_i.

Criteria Q1 and Q2 are necessary to ensure that the parsing is complete and proper and that the encoding can be decoded. Each subset must have at least one codeword and the sum of all the subsets cannot exceed the total available. Criteria Q3, Q4, and Q5 are desirable in that meeting each results in a more efficient encoding. In effect, Q3 says that one ought to use all the codewords that are available. Q4 assures that the number of codewords in each subset is equal to one (a leaf) or is the number in a complete and proper set. For example, consider m = 3 and K = 5. The subsets should be divided into some permutation of 1, 1, 3, and not 1, 2, 2. In the former case, the set with 3 elements can be divided one more time; in the latter case, none of the subsets can. Note, Q3 and Q4 can both be satisfied if the initial K satisfies K = 1 + L(m - 1) and \sum_{l=1}^{m} L_l = L - 1. Q5 merely reflects the intuitive goal that more probable letters should get more codewords. In theory, Q5 can always be met by sorting the K_l's in the same order as the p_l's. Sometimes in practice, e.g., in adaptive models, it may be computationally difficult to know the sorted order and meeting Q5 may be problematic.
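Criteria Q1-Q4 are mechanical enough to be checked in a few lines of code; the following sketch, whose interface is purely illustrative, tests a proposed split {K_l} of K codewords.

#include <stdio.h>

/* Check a proposed split Kl[0..m-1] of K codewords over an m-letter alphabet.
   Returns 1 if criteria Q1-Q4 all hold, 0 otherwise (m >= 2 assumed). */
static int split_ok(const int *Kl, int m, int K)
{
    int sum = 0;
    for (int l = 0; l < m; l++) {
        if (Kl[l] <= 0) return 0;                 /* Q1: every subset nonempty   */
        if ((Kl[l] - 1) % (m - 1) != 0) return 0; /* Q4: K_l = 1 + L_l(m - 1)    */
        sum += Kl[l];
    }
    if (sum > K) return 0;                        /* Q2: do not exceed the total */
    return sum == K;                              /* Q3: use all the codewords   */
}

int main(void)
{
    int good[3] = { 1, 1, 3 };   /* m = 3, K = 5: a permissible split         */
    int bad[3]  = { 1, 2, 2 };   /* violates Q4: 2 != 1 + L(m - 1) for m = 3  */
    printf("%d %d\n", split_ok(good, 3, 5), split_ok(bad, 3, 5));   /* 1 0 */
    return 0;
}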

The BAC encoding algorithm is given in Fig. 1 and the BAC decoding algorithm in Fig. 2. We assume that the various functions behave as follows:

[getinput()] Returns the index of the next input symbol to be coded. If a_r is the next letter, then it returns r.
[code(A)] Returns the codeword corresponding to codeword index A.
[doeof(A, K)] Handles input end-of-file (EOF). Discussed below in Section VI.
[getindexcode()] Returns the index of the next codeword taken from the channel.
[undoeof()] Undoes the effect of doeof().

/* BAC Encoder */
K = K_S; A = 1;
while ((l = getinput()) != EOF) {
    Compute K_1, K_2, ..., K_m;
    A = A + sum_{j<l} K_j;
    K = K_l;
    if (K < m) {
        output code(A);
        A = 1; K = K_S;
    }
}
doeof(A, K);

Fig. 1. Basic BAC encoder.

/* BAC Decoder */
while ((C = getindexcode()) != EOF) {
    K = K_S; A = 1;
    while (K >= m) {
        Compute K_1, K_2, ..., K_m;
        Find l such that A + sum_{j<l} K_j <= C < A + sum_{j<=l} K_j;
        output a_l;
        A = A + sum_{j<l} K_j;
        K = K_l;
    }
    undoeof(A, K);
}

Fig. 2. Basic BAC decoder.
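To make Figs. 1 and 2 concrete, here is a self-contained sketch of the encoder and decoder loops for binary inputs. The proportional split used here anticipates the arithmetic-coding heuristic developed below; the 16 bit codeword size, buffering the data in memory, and handling EOF by passing the symbol count (the simplest scheme of Section VI) are choices of this sketch, not requirements of BAC.

#include <stdio.h>
#include <math.h>

#define KS 65536                     /* 16 bit codewords */

/* Proportional split for a binary alphabet: K1 codewords go to symbol 0 and
   K - K1 to symbol 1, clamped so that both parts are nonempty (criterion Q1). */
static int split(int K, double p0)
{
    int K1 = (int)floor(p0 * K + 0.5);
    if (K1 < 1) K1 = 1;
    if (K1 > K - 1) K1 = K - 1;
    return K1;
}

/* Encode n bits; returns the number of codeword indices written to out[]. */
static int bac_encode(const int *bits, int n, double p0, int *out)
{
    int A = 1, K = KS, nc = 0, pending = 0;
    for (int i = 0; i < n; i++) {
        int K1 = split(K, p0);
        if (bits[i] == 1) { A += K1; K -= K1; } else { K = K1; }
        pending = 1;
        if (K < 2) {                 /* fewer than m codewords: emit and restart */
            out[nc++] = A;
            A = 1; K = KS; pending = 0;
        }
    }
    if (pending) out[nc++] = A;      /* flush a partially selected final block */
    return nc;
}

/* Decode until n bits have been produced; the known symbol count plays the
   role of the simplest EOF scheme discussed in Section VI. */
static void bac_decode(const int *codes, int nc, int n, double p0, int *bits)
{
    int produced = 0;
    for (int c = 0; c < nc && produced < n; c++) {
        int A = 1, K = KS, C = codes[c];
        while (K >= 2 && produced < n) {
            int K1 = split(K, p0);
            if (C < A + K1) { bits[produced++] = 0; K = K1; }
            else            { bits[produced++] = 1; A += K1; K -= K1; }
        }
    }
}

int main(void)
{
    enum { N = 1000 };
    static int x[N], y[N], codes[N];
    double p0 = 0.05;                          /* P(1) = 0.95, as in the abstract */
    unsigned s = 1;
    for (int i = 0; i < N; i++) {              /* crude pseudo-random test input  */
        s = s * 1103515245u + 12345u;
        x[i] = ((s >> 16) % 100) < 95;
    }
    int nc = bac_encode(x, N, p0, codes);
    bac_decode(codes, nc, N, p0, y);
    int ok = 1;
    for (int i = 0; i < N; i++) if (x[i] != y[i]) ok = 0;
    printf("%d bits -> %d 16 bit codewords, round trip %s\n",
           N, nc, ok ? "ok" : "FAILED");
    return 0;
}

In a real coder the indices in codes[] would be written to the channel as 16 bit integers; they are kept in memory here only so the round trip can be checked in one program.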
As examples of the kinds of parsings available, consider the following for binary inputs (a_1 = 0, a_2 = 1):

Let K_1 = K - 1 and K_2 = 1. These are runlength codes for runs of 0's. Similarly, if K_1 = 1 and K_2 = K - 1, we get runlength codes for runs of 1's. The parsing trees are as much left or right as possible.

Let K_1(K_S) = K_2(K_S) = K_S/2. If the first symbol is a 1, then let K_1 = K - 1 and K_2 = 1 for all other K's; else, let K_1 = 1 and K_2 = K - 1. These are symmetric runlength codes, useful when the starting value is unknown.

Let K_1 = K/2 and K_2 = K/2. These codes map the input to output in an identity-like fashion.

With the conditions above, we can state the following theorem:

Theorem 1: Block Arithmetic Codes are complete and proper. Furthermore, all complete and proper parsings can be developed as BAC's.

Proof: It is easiest to develop this equivalence with parsing trees. BAC codes are complete and proper from their construction. They are proper because no outputs are generated at nonleaf nodes; they are complete because each terminal node has m - 1 siblings. For the converse, consider any complete and proper parsing as a tree. At each node of the tree, a BAC coder can reproduce the same tree by appropriately choosing the subset sizes, K_l. For instance, at the root node, count the number of leafs in each branch. Use each of these numbers for K_l, respectively. The process can clearly continue at each node.

In the case considered so far of independent and identically distributed input symbols, it is straightforward to derive an expression for the efficiency of a code. Let N(K) denote the expected number of input symbols encoded with K output symbols. Due to the recursive decomposition, one obtains the following recursive formula:

    N(K) = 1 + \sum_{l=1}^{m} p_l N(K_l).    (2)

The 1 is for the current symbol. The sum follows from considering all possible choices for that symbol. With probability p_l, the input is a_l and the subset chosen has K_l codewords. The expected number of input symbols encoded from here onward is N(K_l).

Since this equation is crucial to the rest of this development, we also present an alternate derivation. N(K), as the expected number of input symbols encoded with K output symbols, can be written as

    N(K) = \sum_{i=1}^{K} Pr(w_i) n_i,

where n_i is the number of input symbols in w_i. Parse w_i = a_{1,i} w'_i, where a_{1,i} is the first letter in w_i. Then n'_i = n_i - 1 and, by independence, Pr(w_i) = Pr(a_{1,i}) Pr(w'_i). Thus,

    N(K) = \sum_{i=1}^{K} Pr(a_{1,i}) Pr(w'_i)(n'_i + 1)
         = \sum_{l=1}^{m} p_l \sum_{i : a_{1,i} = a_l} Pr(w'_i)(n'_i + 1)
         = 1 + \sum_{l=1}^{m} p_l N(K_l),    (3)

where the last step uses the fact that, for each l, the K_l words w'_i with a_{1,i} = a_l form a complete and proper set (so their probabilities sum to one); (3) follows because K_l codewords start with a_l.

The boundary conditions are easy and follow from the simple requirement that at least m codewords are needed to encode a single symbol:

    N(K) = 0,    K = 0, 1, ..., m - 1.    (4)

Note, one application of (2) yields N(m) = 1. While N(K) is hard to compute in closed form, except in special cases, (2) and (4) are easy to program.

For an example of an easy case, consider ordinary runlength coding for binary inputs. Letting N_r(K) refer to the expected number of input symbols encoded with K output codewords using runlength encoding, the recursion simplifies to

    N_r(K) = 1 + p N_r(K - 1) + q N_r(1) = 1 + p N_r(K - 1),    (5)

where p is the probability of a symbol in the current run and q = 1 - p. (5) follows since N_r(1) = 0. The solution to this linear difference equation is

    N_r(K) = (1 - p^{K-1}) / (1 - p).    (6)

Interestingly, this solution asymptotically approaches 1/(1 - p). We arrive at the well known conclusion that V to F runlength codes do not work well for large block sizes.

We propose two specific rules for determining the size of each subset. The first is an optimal dynamic programming approach. Letting N_o(K) be the optimal expected number of input symbols encoded using K output codewords,

    N_o(K) = 1 + \max_{K_1, ..., K_m} \sum_{l=1}^{m} p_l N_o(K_l),    (7)

subject to the constraints K_l >= 1 and \sum_{l=1}^{m} K_l = K. This optimization is readily solved by dynamic programming since all the K_l obey K_l < K. However, if m is large, a full search may be computationally prohibitive. The optimal solution can be stored in tabular form and implemented with a table lookup. Note, the optimal codebook does not have to be stored, only the optimal subset sizes.
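For the binary case, the dynamic program behind (7) and (4) fits in a few lines; the plain O(K^2) table fill below and the function names are illustrative choices, not the paper's implementation [4].

#include <stdio.h>
#include <stdlib.h>

/* Tabulate the optimal expected parse length No[2..K] of (7) for a binary
   source, together with the optimal split K1 at each K.  The boundary
   condition (4) gives No[0] = No[1] = 0. */
static void optimal_bac_binary(int K, double p1, double *No, int *bestK1)
{
    double p0 = 1.0 - p1;
    No[0] = No[1] = 0.0;
    for (int k = 2; k <= K; k++) {
        double best = -1.0;
        int arg = 1;
        for (int k1 = 1; k1 <= k - 1; k1++) {   /* K1 >= 1 and K2 = k - k1 >= 1 */
            double v = 1.0 + p0 * No[k1] + p1 * No[k - k1];
            if (v > best) { best = v; arg = k1; }
        }
        No[k] = best;
        bestK1[k] = arg;
    }
}

int main(void)
{
    int K = 4096;                                /* 12 bit codewords */
    double *No = malloc((K + 1) * sizeof *No);
    int *bestK1 = malloc((K + 1) * sizeof *bestK1);
    optimal_bac_binary(K, 0.95, No, bestK1);
    printf("No(%d) = %.2f input symbols, optimal first split K1 = %d\n",
           K, No[K], bestK1[K]);
    free(No); free(bestK1);
    return 0;
}

Only the table of optimal split sizes needs to be kept at coding time, which is the point made in the text about not having to store the optimal codebook itself.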
The second is a heuristic based on ordinary arithmetic coding. In arithmetic coding, an interval is subdivided into disjoint subintervals with the size of each subinterval proportional to the probability of the corresponding symbol. We propose to do the same for BAC. K_l should be chosen as close to p_l K as possible. Let K_l = [p_l K], where [s] denotes the quantization of s, with the proviso that each K_l >= 1. One way to do the quantization consistent with Q1-Q4 above is described in Fig. 3. The idea is to process the input letters from least probable to most probable and, for each letter, to keep track of the number of codewords remaining and the total probability remaining. The algorithm actually computes L_l, not K_l, as the former is somewhat easier.

/* Good heuristic to determine the L_l's (K_l = 1 + L_l(m - 1)) */
Sort the symbols so that p_1 <= p_2 <= ... <= p_m.
q = 1.0;  L_r = L;
for (l = 1; l <= m; l++) {
    f_l = p_l / q;
    L_l = [ f_l L_r + (f_l (m - l + 1) - 1) / (m - 1) + 0.5 ];
    if (L_l < 0) L_l = 0;
    L_r = L_r - L_l;
    q = q - p_l;
}

Fig. 3. Good probability quantizer. Computes L_l for l = 1, 2, ..., m given L. Note, all quantities not involving L can be precomputed.

Note, Q5 may not be satisfied due to quantization effects. However, our computations and simulations indicate that it almost always is. Furthermore, Q5 can be satisfied if desired by sorting the L_l's.

One drawback of the Good quantizer in Fig. 3 is that the L_l's are computed one at a time, from the least probable letter up to the input letter. Potentially, this loop may execute m times for each input symbol. For large m, this may be too slow. One way to circumvent this problem is the Fast quantizer described below. The idea behind the Fast quantizer is to compute a cumulative L function, R_l, or, equivalently, a cumulative K function, Q_l, and form either L_l or K_l = 1 + (m - 1)L_l by taking a difference. The equations for R and L take the form:

    R_0 = 0,    (8)
    R_l = [ (L - 1) \sum_{j=1}^{l} p_j ],    (9)
    L_l = R_l - R_{l-1}.    (10)

Those for Q and K are as follows:

    Q_0 = 0,    (11)
    Q_l = l + (m - 1) R_l,    (12)
    K_l = Q_l - Q_{l-1}.    (13)

The Fast quantizer satisfies Q1-Q4, though not necessarily Q5. For large alphabets it is slightly less efficient than the Good quantizer, but is much faster because there is no loop over all the symbols.
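A sketch of the Fast quantizer of (8)-(13) for a general alphabet; the cumulative probabilities are computed once up front, so each requested K_l costs a constant amount of work. The nearest-integer rounding and the interface are assumptions of this sketch.

#include <math.h>
#include <stdio.h>

/* Fast quantizer, (8)-(13): given the current set size K = 1 + L(m - 1),
   return K_l for letter l (1-based).  cum[l] = p_1 + ... + p_l, cum[0] = 0. */
static int fast_Kl(int l, int L, int m, const double *cum)
{
    long Rl  = lround((double)(L - 1) * cum[l]);       /* (9)  */
    long Rl1 = lround((double)(L - 1) * cum[l - 1]);
    long Ql  = l + (long)(m - 1) * Rl;                 /* (12) */
    long Ql1 = (l - 1) + (long)(m - 1) * Rl1;
    return (int)(Ql - Ql1);                            /* (13) */
}

int main(void)
{
    double p[] = { 0.1, 0.2, 0.3, 0.4 };               /* a 4-letter example */
    int m = 4, L = 5000, K = 1 + L * (m - 1);          /* K = 15001          */
    double cum[5] = { 0.0 };
    for (int l = 1; l <= m; l++) cum[l] = cum[l - 1] + p[l - 1];

    int sum = 0;
    for (int l = 1; l <= m; l++) {
        int Kl = fast_Kl(l, L, m, cum);
        printf("K_%d = %d\n", l, Kl);                  /* each is 1 mod (m - 1) */
        sum += Kl;
    }
    printf("sum = %d (equals K = %d)\n", sum, K);      /* Q3 holds by construction */
    return 0;
}

Because R_l is a nondecreasing function of l, every K_l computed this way is at least 1 and of the form 1 + L_l(m - 1), and the K_l sum to K, so Q1-Q4 are met without any per-symbol loop over the alphabet.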

The algorithmic complexity of BAC can be summarized as follows: Using the Good heuristic, the encoder can require up to m - 1 multiplications per input symbol, and a similar number of additions; using the Fast heuristic, it only requires two multiplications and a fixed number of additions and comparisons per input symbol. The decoder does the same computations of K_l (or L_l) as the encoder, but also has to search for the proper interval. This searching can be done in O(log m) operations. Thus, the encoder can be implemented in O(1) or O(m) operations per input symbol and the decoder in O(log m) or O(m) operations, depending on whether the Fast or the Good heuristic is used. In the special case of binary inputs, there is little difference, both in complexity and performance, between the two heuristics. After the K_l's have been computed, the optimal BAC can be implemented with no multiplications and a number of additions similar to the Fast heuristic. However, the one time cost of computing the optimal K_l's may be high (a brute force search requires on the order of L_S^m operations if m is large, where K_S = 1 + L_S(m - 1)).

In some applications, it is convenient to combine the optimal and heuristic approaches in one hybrid approach: use the optimal sizes for small values of K and use the heuristic for large values. As we argue in Section IV, both encoders incur their greatest inefficiency for small values of K. For instance, an encoder can begin with the heuristic until only a small number of codewords are left and then switch to the optimal.

Letting N_h(K) represent the expected number of input symbols encoded with K output codewords under the heuristic, the expression for N(K) becomes

    N_h(K) = 1 + \sum_{l=1}^{m} p_l N_h([p_l K]).    (14)

For the moment, if we ignore the quantizing above, relax the boundary conditions, and consider N(.) to be a function of a continuous variable, say s, we get the following equation:

    N_e(s) = 1 + \sum_{l=1}^{m} p_l N_e(p_l s),    (15)

where the subscript e refers to entropy. The entropy rate for V to F codes, N_e(s) = log s / H(X), satisfies this equation. (We believe this entropy solution uniquely solves (15), but have been unable to prove uniqueness.) Thus we see that, in the absence of quantization effects, BAC codes are entropy codes. As we argued above, for large K the quantization effects are small. Furthermore, log K is a slowly varying function of K. We might expect the heuristic to perform close to the optimal, entropy rate. In fact, we prove in Section III below that N_h(K) is within an additive constant (which depends on the p_l's and m) of the entropy rate for all K.
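That the entropy rate satisfies (15) can be verified in one line, using \sum_l p_l = 1 and \sum_l p_l \log p_l = -H(X):

    1 + \sum_{l=1}^{m} p_l \frac{\log(p_l s)}{H(X)}
      = 1 + \frac{\sum_l p_l \log p_l + \log s}{H(X)}
      = 1 + \frac{-H(X) + \log s}{H(X)}
      = \frac{\log s}{H(X)} = N_e(s).

(Here the logarithm and H(X) are taken in the same base, as in Section III.)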
III. ASYMPTOTIC RESULTS

Consider the heuristic encoder discussed above. Assume the quantization is done so that, for K large enough, [p_l K] = p_l K + \delta_l, where the quantization errors \delta_l remain bounded as K grows. Assume further that \sum_{l=1}^{m} \delta_l = 0. (Such quantization can always be achieved for K large enough. See, e.g., Fig. 3.) In this section, we will take all logs to be natural logs to the base e.

Theorem 2: For independent and identically distributed inputs,

    \log K - H(X) N_h(K) <= C,    (17)

for all K and some constant C. (C depends on the probabilities, the quantization, and on the size of the alphabet, but not on K.)

Proof: We will actually show a stronger intermediate result, namely that

    \log K - H(X) N_h(K) <= G(K) = C - \frac{D}{K},    (18)

where D > 0. Clearly C - D/K <= C. The proof is inductive. There are three constants to be determined: C and D, and a splitting constant, K_C. First, select any K_C such that K_C > 2/p_1, where p_1 > 0 is the minimum of the p_l's. The basis is as follows: For all K <= K_C, and for any D > 0, we can choose C so that

    C >= \log K - H(X) N_h(K) + \frac{D}{K};    (19)

since N_h(K) >= 0 and D/K <= D, it suffices to take C >= \log K_C + D. At this point, K_C has been determined and C has been specified in terms of D, and the proposition holds for all K <= K_C.

For the inductive step, we now assume that (18) applies for all values up to and including K - 1, for some K > K_C, and find a condition on D so that (18) applies for K as well. Using (14), the identity \log K = \sum_l p_l \log(p_l K) + H(X), and the inductive hypothesis (each [p_l K] <= K - 1),

    \log K - H(X) N_h(K)
      = \sum_{l=1}^{m} p_l ( \log(p_l K) - \log([p_l K]) )
        + \sum_{l=1}^{m} p_l ( \log([p_l K]) - H(X) N_h([p_l K]) )
      <= \sum_{l=1}^{m} p_l ( \log(p_l K) - \log([p_l K]) ) + \sum_{l=1}^{m} p_l G([p_l K]).    (20)

The last term above is easily bounded since G(.) is nondecreasing:

If \delta_l >= 0, then G([p_l K]) = G(p_l K + \delta_l) <= G(K - 1), since p_l K + \delta_l <= K - 1 for K > K_C; similarly, if \delta_l < 0, then G(p_l K + \delta_l) <= G(p_l K) <= G(K - 1). (By construction, p_l K + \delta_l > 0.) Hence

    \sum_{l=1}^{m} p_l G([p_l K]) <= G(K - 1).    (21)

The first term of (20) can be bounded as follows:

    \sum_{l=1}^{m} p_l ( \log(p_l K) - \log([p_l K]) ) = - \sum_{l=1}^{m} p_l \log( 1 + \frac{\delta_l}{p_l K} ).    (25)

Expanding the logarithms in powers of K^{-1} and using \sum_l \delta_l = 0, which makes the K^{-1} term vanish, gives

    - \sum_{l=1}^{m} p_l \log( 1 + \frac{\delta_l}{p_l K} ) = \sum_{j >= 2} \beta_j K^{-j},    (26)

where the coefficients \beta_j are polynomial functions of the \delta_l's and do not otherwise depend on K. By construction, 1 + \delta_l/(p_l K) > 1/2, so the expansion is valid and its tail can be bounded:

    | \sum_{j >= 2} \beta_j K^{-j} | <= (m - 1) 2^m \max_j |\beta_j| K^{-2} = D' K^{-2}.    (27)-(28)

Combining (21), (27), and (28), we get the following:

    \log K - H(X) N_h(K) <= G(K - 1) + \frac{D'}{K^2}.    (29)

Substituting in for G(.), we need to select D so that the inequality

    C - \frac{D}{K - 1} + \frac{D'}{K^2} <= C - \frac{D}{K}    (30)

is valid or, equivalently,

    D' \frac{K - 1}{K} <= D,    (31)

which is valid for all D >= D'. This completes the proof.

Clearly, the optimal approach is better than the heuristic and worse than entropy. So, combining with the theorem above, we have the following ordering:

    N_e(K) = \frac{\log K}{H(X)} >= N_o(K) >= N_h(K) >= \frac{\log K - C}{H(X)}.    (32)

IV. COMPUTATIONAL RESULTS

In this section, we compute BAC efficiencies for the heuristic and optimal implementations and compare these with entropy values. We also compare an implementation of BAC and the arithmetic coder of Witten, Neal, and Cleary [25], both operating on a binary input, and show that BAC is much faster. Source code for the tests presented here can be found in [4].

[Fig. 4. Calculated N(K) versus log K for binary inputs with p = 0.95. Curves shown: entropy, optimal, heuristic, and runlength.]

In Fig. 4, we present calculations of N(K) for a hypothetical entropy coder, the optimal BAC, the good heuristic BAC, and a simple V to F runlength coder versus log K = 0, 1, ..., 16. The input is binary with p = Pr(x = 1) = 0.95. Note that the optimal and heuristic curves closely follow the entropy line. The maximum difference between both curves and entropy is 4.4 and occurs at log K = 3. In contrast, for log K = 16, the optimal is within 2.3 input symbols of the entropy and the heuristic within 2.8. This behavior seems to be consistent over a wide range of values for p. The maximum difference occurs for a small value of log K. The coders improve slightly in an absolute sense as log K grows. At some point the difference seems to level off. The relative efficiency increases as the number of codewords grows. At log K = 3, it is 58% for both and, at log K = 16, it is 96% for the optimal BAC and 95% for the heuristic BAC.
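The heuristic curve of Fig. 4 is easy to regenerate from (14) and (4); the sketch below does so for the binary case and prints N_h(2^16) together with the gap log2 K - H(X) N_h(K), the quantity C' discussed with Table I below. The nearest-integer rounding used for [p K] is an assumption of the sketch.

#include <stdio.h>
#include <math.h>

/* Fill Nh[0..K] with the expected parse length of the binary heuristic BAC,
   from recursion (14) with boundary condition (4): Nh[0] = Nh[1] = 0.
   K1 = [p0 * k], rounded to the nearest integer and clamped to [1, k-1]. */
static void heuristic_table(int K, double p1, double *Nh)
{
    double p0 = 1.0 - p1;
    Nh[0] = Nh[1] = 0.0;
    for (int k = 2; k <= K; k++) {
        int K1 = (int)floor(p0 * k + 0.5);
        if (K1 < 1) K1 = 1;
        if (K1 > k - 1) K1 = k - 1;
        Nh[k] = 1.0 + p0 * Nh[K1] + p1 * Nh[k - K1];
    }
}

int main(void)
{
    const int K = 1 << 16;                      /* 16 bit codewords */
    const double p1 = 0.95;
    static double Nh[(1 << 16) + 1];
    heuristic_table(K, p1, Nh);

    double H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1));   /* bits per symbol */
    printf("Nh(2^16) = %.1f symbols, entropy bound = %.1f, gap = %.2f bits\n",
           Nh[K], log2((double)K) / H, log2((double)K) - H * Nh[K]);
    return 0;
}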

[Fig. 5. Calculated N(K) versus log K for a 27 letter English alphabet. Curves shown: entropy, optimal, good heuristic, and fast heuristic.]

In Fig. 5, we present calculations of N(K) versus log K for a 27 letter English alphabet taken from Blahut [1]. Plotted are N(K) for the entropy, optimal, and good and fast heuristic BAC's. On this example, the curves for the optimal and the good heuristic are almost indistinguishable.

[Table I. Calculated N_e(K), N_o(K), N_h(K), and N_R(K) for binary inputs at selected values of p, with 16 bit codewords (log K = 16), together with computed values of C'.]

In Table I, we present calculated N(K) for binary inputs and selected p's with log K = 16, i.e., 16 bit codewords. One can see that both the heuristic and optimal BAC's are efficient across the whole range. For instance, with p = 0.80, both exceed 98.7% of entropy; with p = 0.95, both exceed 95.0%; and with p = 0.99, both exceed 91.5%.

The computational results support an approximation. For all K large enough, H(X) N_h(K) is approximately log K - C'. In general, it seems that C' < C, where C is taken from Theorem 2. Also given in Table I are computed values of C' = log K - H(X) N_h(K) for K = 2^16.

As another experiment, we implemented BAC and the coder of Witten, Neal, and Cleary (referred to as WNC) [25] to assess the relative speeds. Source code for WNC appears in their paper, so it makes a good comparison. Speed comparisons are admittedly tricky. To make the best comparison, both BAC and WNC were optimized for a binary input; the encoders for both share the same input routine and the decoders the same output routine. There are two versions of WNC, a straightforward C version and an optimized C version. We implemented both and found that, after optimizing for binary inputs, the straightforward version was actually slightly faster than the optimized version. Both were tested on the same input file of 2^20 independent and identically distributed bits. The execution times on a SPARC IPC are listed in Table II. The times are repeatable to within 0.1 seconds. Also listed are the times for input and output (I/O Speed), i.e., to read in the input one bit at a time and write out the appropriate number of bytes, and to read in bytes and write out the appropriate number of bits. The I/O Speed numbers do not reflect any computation, just the input and output necessary to both BAC and WNC.

TABLE II
EXECUTION TIMES FOR BAC AND WNC FOR A 1 MILLION BIT FILE

    Version               Time (secs)
    BAC Encoder            4.2
    BAC Decoder            3.7
    WNC Encoder           21.6
    WNC Decoder           29.5
    WNC Encoder (fast)    22.2
    WNC Decoder (fast)    29.8
    I/O Speed (encoder)    2.0
    I/O Speed (decoder)    1.7

We see that in this test, the BAC encoder is approximately 5 times faster than the WNC encoder and the BAC decoder is approximately 8 times faster than the WNC decoder. Indeed, the BAC times are not much longer than the input/output operations alone.

V. EXTENSIONS

BAC's can be extended to work in more complicated environments than that of i.i.d. symbols considered so far. For instance, consider the symbols to be independent, but with probabilities that are time varying:

    p(l, j) = Pr(x_j = a_l).    (33)

Denote the number of input symbols encoded with K output symbols starting at time j by N(K, j). Then, in analogy to (2), we get the following:

    N(K(j), j) = 1 + \sum_{l=1}^{m} p(l, j) N(K_l(j + 1), j + 1),    (34)

where K_l(j) is the number of codewords assigned to a_l at time j. The heuristic is easy: Choose K_l(j + 1) = [p(l, j) K(j)]. We can also present an optimal dynamic programming solution:

    N_o(K(j), j) = 1 + \max_{K_1, ..., K_m} \sum_{l=1}^{m} p(l, j) N_o(K_l(j + 1), j + 1).    (35)

This problem can be solved backward in time, from a maximum j down to a minimum j = 1. As another example, consider a time invariant, first order Markov source.
Let p(i|l) = Pr(x_j = a_i | x_{j-1} = a_l). Then the recursion for N(K) splits into two parts. The first is for the first input symbol; the second is for all other input symbols:

    N(K) = 1 + \sum_{i=1}^{m} p_i N(K_i | i),    (36)

    N(K | l) = 1 + \sum_{i=1}^{m} p(i|l) N(K_i | i),    (37)

where N(K | i) is the expected number of input symbols encoded using K codewords given that the current input symbol is a_i.

The heuristic again is easy: Choose K_i = [p(i|l) K]. The optimal solution is harder, although it can be done with dynamic programming. For every l, the optimal N_o(K | l) can be found because each K_i < K. In [3], we show a similar optimality result to Theorem 2 for a class of first order Markov sources.

In some applications it is desirable to adaptively estimate the probabilities. As with stream arithmetic coding, BAC encodes the input in a first-in first-out fashion. The only requirement is that the adaptive formula depend only on previous symbols, not on present or future symbols.

VI. THE EOF PROBLEM

One practical problem that BAC and other arithmetic coders have is denoting the end of the input sequence. In particular, the last codeword may be only partially selected. The simplest solution to this problem is to count the number of symbols to be encoded and to send this count before encoding any. It is an easy matter for the decoder to count the number of symbols decoded and stop when the appropriate number is reached. However, this scheme requires that the encoder process the data twice and incurs a transmission overhead to send the count.

The EOF solution proposed in the stream arithmetic coder of Witten, Neal, and Cleary [25] and used in FIXAR [22] is to create an artificial letter with minimal probability. This letter is only transmitted once, denoting the end of the input. When the decoder decodes this special letter, it stops. We computed BAC's lost efficiency under this scheme for binary inputs for a variety of probabilities and codebook sizes and found that it averages about 7%. For the English text example, the loss averaged about 3.5% over a wide range of codebook sizes.

We propose two alternatives to the simple schemes above (see also [2]). The first assumes the channel can tell the decoder that no more codewords remain and that the decoder is able to look ahead a modest number of bits. The idea is for the encoder to append to the input sequence as many least probable symbols as necessary (possibly zero) to flush out the last codeword. After transmitting the last codeword, the encoder transmits the number of appended symbols to the decoder. Even with 2^32 codewords, at most 31 extra input symbols are needed. This number can be encoded with 5 bits. The decoder looks ahead 5 bits until it detects that no more symbols are present. It then discards the appropriate number of appended symbols. The overhead caused by this scheme is modest: 5 bits at the end and one possibly wasteful codeword.

The second scheme assumes the channel cannot tell the decoder that no more codewords remain. We suggest devoting one codeword to specify the end of the codeword sequence. (Note, we are not suggesting an extra input letter, but one of the K codewords.) Then append the extra number of bits as above. The inefficiency here includes the 5 bits above and the loss due to one less codeword. Using the approximation discussed in Section IV above, one computes the relative inefficiency as follows:

    \frac{(\log K - C') - (\log(K - 1) - C')}{\log K - C'} \approx \frac{1}{K (\log K - C')}.    (38)

For K large enough, this overhead is negligible.
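The first scheme is simple enough to sketch in code. Below, the partially selected block [A, A + K - 1] of the earlier binary encoder sketch is flushed by pretending the remaining input is a run of the less probable symbol (symbol 0 when P(1) > 1/2), and the number of padded symbols is returned so it can be sent in the 5 bit field described above. The proportional split, the example encoder state, and the function name are illustrative assumptions.

#include <stdio.h>
#include <math.h>

/* Flush a partially selected block by padding with symbol 0 (assumed to be
   the less probable symbol).  Returns the pad count, at most 31 even for
   2^32 codewords, and writes the flushed codeword index through *codeword. */
static int flush_with_padding(int A, int K, double p0, int *codeword)
{
    int pads = 0;
    while (K >= 2) {
        int K1 = (int)floor(p0 * K + 0.5);     /* proportional split, clamped */
        if (K1 < 1) K1 = 1;
        if (K1 > K - 1) K1 = K - 1;
        K = K1;                                /* take symbol 0's branch; A is
                                                  unchanged because that branch
                                                  starts at index A            */
        pads++;
    }
    *codeword = A;
    return pads;                               /* transmitted in 5 bits */
}

int main(void)
{
    int codeword;
    /* Suppose the input ran out with the encoder state A = 40000, K = 20000. */
    int pads = flush_with_padding(40000, 20000, 0.05, &codeword);
    printf("padded %d symbols, flushed codeword %d\n", pads, codeword);
    return 0;
}

On the decoding side, undoeof() would read the 5 bit count after the last codeword and drop that many symbols from the end of the decoded sequence.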
VII. CONCLUSIONS AND DISCUSSION

We believe that BAC represents an interesting alternative in entropy coding. BAC is simple to implement, even for large block sizes, because the algorithm is top-down and regular. In contrast, Huffman's and Tunstall's algorithms are not top-down and generally require storing and searching a codebook.

BAC is efficient, though probably slightly less efficient than ordinary arithmetic coding. The comparison with Huffman is a little more complicated. The asymptotic result for BAC is, on the surface, weaker than for Huffman. The BAC bound uses C, which must be computed on a case by case basis, while the Huffman bound is 1. Both coders can be made as relatively efficient as desired by selecting the block size large enough. However, BAC can use much larger block sizes than is practical for Huffman.

The BAC asymptotic result is much stronger than that for Tunstall's coder. For BAC, the difference between entropy and the heuristic rates is bounded for all K. Tunstall shows only that the ratio of the rates of entropy and his coder approaches 1 as K grows.

In the proof of Theorem 2, we have shown that a constant, C, exists that bounds the difference between entropy and the heuristic BAC for all K. We have made no effort to actually evaluate the constant from the conditions given in the proof. This is because such an evaluation would be worthless in evaluating the performance of BAC. As shown in Section IV, the constant in practice is quite small. One important need for future research is to provide tight bounds for C and, perhaps, to characterize the difference between the entropy rate and BAC more accurately.

One advantage of BAC compared to Huffman and stream arithmetic coding is that BAC uses fixed length output codewords. In the presence of channel errors, BAC will not suffer catastrophic failure. The other two might. We have also argued that BAC can accommodate more complicated situations. Certainly the heuristic can handle time varying and Markov probabilities. It can estimate the probabilities adaptively. It remains for future work to prove optimality results for these more complicated situations.

REFERENCES

[1] R. E. Blahut, Principles and Practice of Information Theory. Reading, MA: Addison-Wesley, 1987.
[2] C. G. Boncelet, Jr., "Extensions to block arithmetic coding," in Proc. 1992 Conf. Inform. Sci. and Syst., Princeton, NJ, Mar. 1992.
[3] C. G. Boncelet, Jr., "Block arithmetic coding for Markov sources," in Proc. IEEE Int. Symp. Inform. Theory, San Antonio, TX, Jan. 1993.
[4] C. G. Boncelet, Jr., "Experiments in block arithmetic coding," Univ. Delaware, Dep. Elec. Eng., Tech. Rep., 1993.
[5] R. M. Capocelli and A. De Santis, "New bounds on the redundancy of Huffman codes," IEEE Trans. Inform. Theory, vol. 37, July 1991.
[6] P. Elias, "Universal codeword sets and representations of the integers," IEEE Trans. Inform. Theory, vol. IT-21, Mar. 1975.
[7] S. W. Golomb, "Run-length encodings," IEEE Trans. Inform. Theory, vol. IT-12, July 1966.

[8] M. Guazzo, "A general minimum-redundancy source-coding algorithm," IEEE Trans. Inform. Theory, vol. IT-26, 1980.
[9] D. A. Huffman, "A method for the construction of minimum-redundancy codes," Proc. IRE, vol. 40, Sept. 1952.
[10] R. Hunter and A. H. Robinson, "International digital facsimile coding standards," Proc. IEEE, vol. 68, July 1980.
[11] M. Jakobsson, "Huffman coding in bit-vector compression," Inform. Process. Lett., vol. 7, no. 6, Oct. 1978.
[12] F. Jelinek and K. Schneider, "On variable-length-to-block coding," IEEE Trans. Inform. Theory, vol. IT-18, Nov. 1972.
[13] G. G. Langdon, "An introduction to arithmetic coding," IBM J. Res. Develop., vol. 28, Mar. 1984.
[14] A. Lempel and J. Ziv, "On the complexity of finite sequences," IEEE Trans. Inform. Theory, vol. IT-22, Jan. 1976.
[15] N. Merhav and D. L. Neuhoff, "Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes," IEEE Trans. Inform. Theory, vol. 38, Jan. 1992.
[16] B. L. Montgomery and J. Abrahams, "Synchronization of binary source codes," IEEE Trans. Inform. Theory, vol. IT-32, Nov. 1986.
[17] B. L. Montgomery and J. Abrahams, "On the redundancy of optimal binary prefix-condition codes for finite and infinite sources," IEEE Trans. Inform. Theory, vol. IT-33, Jan. 1987.
[18] W. B. Pennebaker, J. L. Mitchell, G. G. Langdon, Jr., and R. B. Arps, "An overview of the basic principles of the Q-coder adaptive binary arithmetic coder," IBM J. Res. Develop., vol. 32, no. 6, Nov. 1988.
[19] J. Rissanen, "Generalized Kraft inequality and arithmetic coding," IBM J. Res. Develop., vol. 20, May 1976.
[20] J. Rissanen and G. G. Langdon, "Arithmetic coding," IBM J. Res. Develop., vol. 23, no. 2, Mar. 1979.
[21] J. Rissanen and K. M. Mohiuddin, "A multiplication-free multialphabet arithmetic code," IEEE Trans. Commun., vol. 37, Feb. 1989.
[22] J. Teuhola and T. Raita, "Piecewise arithmetic coding," in Proc. Data Compression Conf., J. A. Storer and J. H. Reif, Eds., Snow Bird, UT, Apr. 1991, pp. 33-42.
[23] B. P. Tunstall, "Synthesis of noiseless compression codes," Ph.D. dissertation, Georgia Inst. Technol., 1967.
[24] T. A. Welch, "A technique for high-performance data compression," IEEE Computer, June 1984.
[25] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Commun. ACM, vol. 30, June 1987.
[26] J. Ziv, "Variable-to-fixed length codes are better than fixed-to-variable length codes for Markov sources," IEEE Trans. Inform. Theory, vol. 36, July 1990.
[27] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Inform. Theory, vol. IT-24, Sept. 1978.


More information

Kinematics and dynamics, a computational approach

Kinematics and dynamics, a computational approach Kineatics and dynaics, a coputational approach We begin the discussion of nuerical approaches to echanics with the definition for the velocity r r ( t t) r ( t) v( t) li li or r( t t) r( t) v( t) t for

More information

KONINKL. NEDERL. AKADEMIE VAN WETENSCHAPPEN AMSTERDAM Reprinted from Proceedings, Series A, 61, No. 1 and Indag. Math., 20, No.

KONINKL. NEDERL. AKADEMIE VAN WETENSCHAPPEN AMSTERDAM Reprinted from Proceedings, Series A, 61, No. 1 and Indag. Math., 20, No. KONINKL. NEDERL. AKADEMIE VAN WETENSCHAPPEN AMSTERDAM Reprinted fro Proceedings, Series A, 6, No. and Indag. Math., 20, No., 95 8 MATHEMATIC S ON SEQUENCES OF INTEGERS GENERATED BY A SIEVIN G PROCES S

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Figure 1: Equivalent electric (RC) circuit of a neurons membrane

Figure 1: Equivalent electric (RC) circuit of a neurons membrane Exercise: Leaky integrate and fire odel of neural spike generation This exercise investigates a siplified odel of how neurons spike in response to current inputs, one of the ost fundaental properties of

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

The Wilson Model of Cortical Neurons Richard B. Wells

The Wilson Model of Cortical Neurons Richard B. Wells The Wilson Model of Cortical Neurons Richard B. Wells I. Refineents on the odgkin-uxley Model The years since odgkin s and uxley s pioneering work have produced a nuber of derivative odgkin-uxley-like

More information

Low complexity bit parallel multiplier for GF(2 m ) generated by equally-spaced trinomials

Low complexity bit parallel multiplier for GF(2 m ) generated by equally-spaced trinomials Inforation Processing Letters 107 008 11 15 www.elsevier.co/locate/ipl Low coplexity bit parallel ultiplier for GF generated by equally-spaced trinoials Haibin Shen a,, Yier Jin a,b a Institute of VLSI

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Birthday Paradox Calculations and Approximation

Birthday Paradox Calculations and Approximation Birthday Paradox Calculations and Approxiation Joshua E. Hill InfoGard Laboratories -March- v. Birthday Proble In the birthday proble, we have a group of n randoly selected people. If we assue that birthdays

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

The Euler-Maclaurin Formula and Sums of Powers

The Euler-Maclaurin Formula and Sums of Powers DRAFT VOL 79, NO 1, FEBRUARY 26 1 The Euler-Maclaurin Forula and Sus of Powers Michael Z Spivey University of Puget Sound Tacoa, WA 98416 spivey@upsedu Matheaticians have long been intrigued by the su

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding

Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding IEEE TRANSACTIONS ON INFORMATION THEORY (SUBMITTED PAPER) 1 Design of Spatially Coupled LDPC Codes over GF(q) for Windowed Decoding Lai Wei, Student Meber, IEEE, David G. M. Mitchell, Meber, IEEE, Thoas

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

NBN Algorithm Introduction Computational Fundamentals. Bogdan M. Wilamoswki Auburn University. Hao Yu Auburn University

NBN Algorithm Introduction Computational Fundamentals. Bogdan M. Wilamoswki Auburn University. Hao Yu Auburn University NBN Algorith Bogdan M. Wilaoswki Auburn University Hao Yu Auburn University Nicholas Cotton Auburn University. Introduction. -. Coputational Fundaentals - Definition of Basic Concepts in Neural Network

More information

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

The Simplex Method is Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate The Siplex Method is Strongly Polynoial for the Markov Decision Proble with a Fixed Discount Rate Yinyu Ye April 20, 2010 Abstract In this note we prove that the classic siplex ethod with the ost-negativereduced-cost

More information

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple

More information

PHY 171. Lecture 14. (February 16, 2012)

PHY 171. Lecture 14. (February 16, 2012) PHY 171 Lecture 14 (February 16, 212) In the last lecture, we looked at a quantitative connection between acroscopic and icroscopic quantities by deriving an expression for pressure based on the assuptions

More information

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network 565 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 59, 07 Guest Editors: Zhuo Yang, Junie Ba, Jing Pan Copyright 07, AIDIC Servizi S.r.l. ISBN 978-88-95608-49-5; ISSN 83-96 The Italian Association

More information

arxiv: v3 [cs.ds] 22 Mar 2016

arxiv: v3 [cs.ds] 22 Mar 2016 A Shifting Bloo Filter Fraewor for Set Queries arxiv:1510.03019v3 [cs.ds] Mar 01 ABSTRACT Tong Yang Peing University, China yangtongeail@gail.co Yuanun Zhong Nanjing University, China un@sail.nju.edu.cn

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks 1 Optial Resource Allocation in Multicast Device-to-Device Counications Underlaying LTE Networks Hadi Meshgi 1, Dongei Zhao 1 and Rong Zheng 2 1 Departent of Electrical and Coputer Engineering, McMaster

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Symbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm

Symbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm Acta Polytechnica Hungarica Vol., No., 04 Sybolic Analysis as Universal Tool for Deriving Properties of Non-linear Algoriths Case study of EM Algorith Vladiir Mladenović, Miroslav Lutovac, Dana Porrat

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

N-Point. DFTs of Two Length-N Real Sequences

N-Point. DFTs of Two Length-N Real Sequences Coputation of the DFT of In ost practical applications, sequences of interest are real In such cases, the syetry properties of the DFT given in Table 5. can be exploited to ake the DFT coputations ore

More information

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China 6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information