Universal Finite Memory Coding of Binary Sequences


Department of Electrical Engineering - Systems

Universal Finite Memory Coding of Binary Sequences

Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv University by Doron Rajwan.

This research work was carried out at Tel-Aviv University in the Department of Electrical Engineering - Systems, Faculty of Engineering, under the supervision of Prof. Meir Feder.

December 2000

Acknowledgments

This whole work was made possible due to the devoted guidance and support of Prof. Meir Feder, my supervisor. His enthusiasm and ideas inspired me during this research, and I thank him for that. Also, I would like to thank my family for their love and support, especially my wife, Ofira.

Abstract

This work considers the problem of universal coding of binary sequences, where the universal encoder has limited memory. Universal coding refers to a situation where a single, universal, encoder can achieve the optimal performance for a large class of models or data sequences, without knowing the model in advance, and without tuning the encoder to the data. In previous work on universal coding, specific universal machines whose performance attained the theoretical limits were suggested. However, these machines require an unlimited amount of memory. This work investigates the case where the universal machines have limited resources, i.e., are finite-state machines. To simplify the problem, this work considers universal finite-state machines that assign probabilities to the next bit. It ignores the additional number of states needed to translate the assigned probabilities into code bits. In most cases examined, this work provides lower bounds on the performance, which describe the optimal rate (as a function of the number of states) at which the entropy, or the empirical entropy, can be attained. In addition, in most cases, this work presents specific machines whose performance is compared to these bounds. While the general problem of universal coding with limited resources is still open, this work provides a set of basic results that will be useful in analyzing the general problem. These results thus provide an important step in understanding the problem of limited-resource universal coding.

Contents

1 Introduction
  1.1 Review of Universal Source Coding
  1.2 Universal Coding with Limited Resources
  1.3 Thesis Outline
2 Data Settings and Machine Types
  2.1 System Architecture
  2.2 Data Settings
  2.3 Machine Types
  2.4 Minimum Redundancy Goal
3 Optimal Probability Axis Quantization
  3.1 Min-Max Criterion
  3.2 High Resolution Asymptotic Quantization
  3.3 Comparison with Uniform Point Allocation
4 Bernoulli Setting
  4.1 Random Machine
  4.2 Deterministic Machine
  4.3 Time-Variant Machine
5 Markovian Setting (q-th order)
  5.1 Multi-Dimensional Quantization Limit
  5.2 Deterministic Machine
  5.3 Random Machine
6 Deterministic Setting (single-state reference)
  6.1 Minimal Circles
  6.2 Deterministic Machine
  6.3 Random Machine
  6.4 Time-Variant Machine
7 Conclusion and Further Work

List of Figures

2.1 System architecture
3.1 Optimal quantization points
4.1 Random machine for the Bernoulli setting
4.2 Simulation of probabilistic transitions by deterministic machine
4.3 Cover's 4-state process
6.1 Minimal circles
6.2 Deterministic machine for the deterministic setting
6.3 Time-variant machine for the deterministic setting

Chapter 1: Introduction

This work considers the problem of universal coding of binary sequences, where the universal encoder is a finite-state machine. More accurately, the problem considered in this work is universal probability assignment for the next outcome of a binary sequence, under the self-information loss. The self-information is the ideal codelength, associated with the assigned probability model, for encoding the next outcome. Thus, ignoring the number of states required to convert the assigned probability to code bits, say, by an arithmetic encoder, the work essentially considers the universal finite memory lossless coding problem. In recent years the universal coding problem has been extensively investigated. As is well known, optimal lossless coding is achieved by designing an encoder that fits the probability model of the data, to achieve the minimal codelength, i.e., the entropy. Universal coding refers to a situation where a single, universal, encoder can achieve the optimal performance for a large class of models, without knowing the model in advance. In previous work on universal coding, specific universal machines, whose performance attained the theoretical limits, were suggested. However, these machines

require an unlimited amount of memory. This work investigates the case where the universal machines have limited resources, i.e., are finite-state machines. The universal coding problem can traditionally be presented in two different settings. In the probabilistic setting the data sequence is generated by an unknown probabilistic source, e.g., a Bernoulli i.i.d. source or a q-th order Markov source with unknown probabilities, and the goal is to attain the source entropy. In the deterministic setting, the data sequence is an arbitrary deterministic individual sequence, and the goal is to attain, e.g., the sequence's empirical entropy, which is the minimal codelength of an encoder tuned to this sequence. This work considers both settings, and examines how to attain the source entropy, or the sequence's empirical entropy, with a finite-state encoder. While the work restricts the encoder to be finite-state, it examines various cases where the universal encoder is deterministic, randomized, time-invariant or time-variant. In most cases it provides lower bounds on the performance, which describe the optimal rate (as a function of the number of states) at which the entropy, or the empirical entropy, can be attained. In addition, in most cases, the work presents specific machines whose performance is compared to these bounds. Many of the results presented in this work have been summarized in a paper presented at the 2000 Data Compression Conference (DCC) [19].

1.1 Review of Universal Source Coding

Before describing the specific results of the thesis, we present in the following section a review of universal source coding, including an elaboration on the stochastic and deterministic settings, and the relation between data

compression and prediction. A reader familiar with these topics may skip to Section 1.2.

Source coding

Lossless source coding is a process in which an encoder assigns a sequence of bits to an information source. Later, a decoder, related to the encoder, can decode the sequence of bits into its original form. For example, an information source can be a document written in English. This document can be converted into a sequence of bits using ASCII encoding. In this encoding, each letter gets an 8-bit representation. For example, the letter A is converted into the bits 01000001. Also, control information, like the space between words, end of line, end of paragraph, etc., is converted into bits. Later, the English characters are decoded by a decoder specialized in converting the bit sequence to English text. These English characters can be displayed on-screen, printed, or even read aloud through a speaker using a voice coder. Although translation of the characters into bits, without loss of information, is performed by this method, it is not the most effective way to represent the English language. First, the 8-bit sequences do not appear the same number of times, on average. In almost all English documents the 8 bits representing the letter e will appear more often than the 8 bits representing the letter q, which will appear much more often than the 8 bits representing a vertical tabulation (a control character which is not in use today). Another inefficiency is that the English letters are not independent. There are dependencies within a single word, e.g., the pair qu is more likely to appear than the pair uq in English words. Also, there are dependencies between words, because of special linguistic rules applied to the language.

Data compression

Lossless data compression is a method to take a sequence of symbols from a known source, like the ASCII-encoded English above, and convert it into a more compact form, meaning, fewer bits on average. By a simple counting argument, this cannot be done for all sequences. While some sequences get shorter, others get longer. The complicated part is to encode the English bit-stream in such a way that almost all English documents will be shortened. The decoder, naturally, expands the bit sequence from the compact form into the original form, with perfect accuracy. A simple way to compress this type of data is to use a predefined dictionary. For example, the encoder and decoder will use an English dictionary with 32,000 words (which also contains all the single-letter characters). The encoder will encode each of the words that exist in this dictionary into 15 bits, which is suitable for selecting a specific dictionary entry. This is less than the number of bits allocated for 2 letters in the raw data. Each unknown word will be encoded, letter by letter, into 15 bits per letter, which is more than the original form; a sketch of this scheme is given below. This process has several drawbacks. First, different sectors use different vocabulary. For example, the English language used by an engineer is different from the English language used by a doctor (not to mention the handwriting...). Thus, a dictionary specific to each subject will probably contain more than 32,000 words, causing each word to use more bits in the encoded sequence. It seems that one dictionary cannot fit both. Second, this process is not capable of learning. For example, where a document repeatedly uses a name, this process will encode it to 15 bits per letter each time, which is way too much. A more efficient way is to add this word to the dictionary.
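The following is a minimal sketch of the fixed-dictionary coder just described. The 15-bit codeword width is taken from the example above; the tiny word list and the escape convention are hypothetical placeholders standing in for the 32,000-word dictionary.

```python
WORD_BITS = 15  # a 15-bit index can address 2**15 = 32,768 dictionary slots

def encoding_cost(words, dictionary, word_bits=WORD_BITS):
    """Bits needed by the fixed-dictionary scheme: one index per known
    word, and an escape marker plus one index per letter for an unknown
    word (the inefficiency discussed in the text)."""
    index = set(dictionary)
    bits = 0
    for w in words:
        if w in index:
            bits += word_bits                  # one dictionary index
        else:
            bits += word_bits * (len(w) + 1)   # escape + per-letter coding
    return bits

# Hypothetical toy dictionary; "zebra" must be spelled out letter by letter.
print(encoding_cost(["the", "quick", "zebra"], ["the", "quick", "brown"]))
```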

Third, this process is not robust; it can fail miserably. For example, a document with the letter x repeated 1,000,000 times has a raw size of 8 Mbit, and a compressed size of 15 Mbit. This means that a document described in a single sentence will have a huge raw size, and will become almost 2 times bigger after compression. These three examples show the enormous need for a more efficient, learning, coding method.

Universal coding

Universal coding is a coding scheme in which the encoder can accept more than one type of input, leading to a more robust process. A universal encoder can efficiently compress different types of English vocabulary, used by different sectors, and maybe even other languages. A simple-to-understand universal encoder can be a combination of several (N) non-universal encoders, in a method known as a two-step code. The encoder will first collect all the raw data. Then, it will encode the data using all N non-universal encoders. It will select the encoder with the most compact representation of the data. Then, it will encode it using a two-step code: first, it will transmit the selected encoder, using $\log_2(N)$ bits; then, it will transmit the encoded data, given the selected encoder. The universal decoder in this case is a trivial extension of the non-universal decoders. First, it detects the decoder to use, from the first $\log_2(N)$ coded bits. Then, it passes the rest of the data to the selected (non-universal) decoder. This basic example demonstrates some well-known properties of every universal code. First, each universal code has a tradeoff between features and compactness. In this example, one can enlarge N, enabling detection

of a more suitable non-universal code, but the increase in flexibility results in a payment in the form of larger redundancy for the first part. Second, each universal code can represent only a few data types effectively, while it cannot efficiently represent other types. Third, each universal code has some inherent redundancy over a non-universal code that is tuned to a specific type of data. It turns out that it is possible to define an effective universal code, with the following features:

- It is effective for many data types, or even an asymptotically infinite number of data types.
- It has a small, almost zero, normalized redundancy over a non-universal code tuned for each of these data types.
- It is general; not a collection of different codes.
- Data can be encoded on-the-fly, with only one pass, maintaining a relatively small state.
- When it fails, data is not expanded much more than its original size.
- Some codes even have a structure that enables in-depth theoretical inspection.

Entropy, and the divergence distance measure

When the probability of a, say, binary event is known in advance (to both the encoder and decoder), one can design the perfect code for it. Denote by $p$ the probability of a bit $x$ to be one. Also, denote the codelength associated with $x$ as $l_1$ bits if $x$ turns out to be one, and $l_0$ bits if $x$ turns out to be zero.

Both codelengths must be positive numbers, although they may be fractions. Following the Kraft inequality [13], $l_1$ and $l_0$ cannot be arbitrarily small:

$$\sum_{i=0}^{1} 2^{-l_i} \le 1. \qquad (1.1)$$

In order to have minimal redundancy over the optimal code, a code should at least satisfy this inequality with equality. It means that one can define some arbitrary parameter, $q$, which is a number in the range $[0, 1]$, and define $l_1 = -\log_2(q)$ bits and $l_0 = -\log_2(1-q)$ bits, so that Equation (1.1) is satisfied with equality. In this case, the average codelength is:

$$-p \log_2(q) - (1-p) \log_2(1-q). \qquad (1.2)$$

Also, define the binary entropy function:

$$h(p) \triangleq -p \log_2(p) - (1-p) \log_2(1-p) \qquad (1.3)$$

which is nonnegative because $0 \le p \le 1$, and the divergence distance measure between $p$ and $q$:

$$D(p \| q) \triangleq p \log_2\left(\frac{p}{q}\right) + (1-p) \log_2\left(\frac{1-p}{1-q}\right). \qquad (1.4)$$

Combining these definitions, the average codelength is the sum of the binary entropy function of the real probability, and the divergence between this probability and the arbitrary parameter $q$. By proving that $D(p \| q) \ge 0$, and that it is zero iff $p = q$, we show that the minimal codelength is $h(p)$ bits, and it is achieved by using the codelength which is the self-information associated with $p$, thus, $l_1 = -\log_2(p)$ bits and $l_0 = -\log_2(1-p)$ bits. So, the parameter $q$ is actually the probability assigned by the universal coding process, and the extra bit-rate above the entropy is $D(p \| q)$. This codelength, although not an integer, is the number of bits that the encoder sends to the decoder for encoding each source bit. For example, if

$p = 0.9$, it is possible to encode a result of $x = 1$ by $-\log_2(0.9) \approx 0.152$ bits, and a result of $x = 0$ by $-\log_2(0.1) \approx 3.322$ bits, which gives $h(0.9) \approx 0.469$ bits on average. If the universal coding process assumes that $q = 0.89$, the extra bit-rate above the entropy is $D(0.9 \| 0.89) \approx 0.00076$ bits per source bit, and if $q = 0.91$, the extra bit-rate is $D(0.9 \| 0.91) \approx 0.00085$ bits per source bit. Thus, there is some penalty for misestimation of $p$. The proof that $D(p \| q) \ge 0$ follows from the Jensen inequality:

$$E[\log_2(X)] \le \log_2[E(X)] \qquad (1.5)$$

for any positive random variable $X$, with equality iff $X$ is a constant. Using this inequality, and defining $X$ as a random variable with value $q/p$ in probability $p$ and $(1-q)/(1-p)$ in probability $1-p$:

$$-D(p \| q) = p \log_2\left(\frac{q}{p}\right) + (1-p) \log_2\left(\frac{1-q}{1-p}\right) = E[\log_2(X)] \le \log_2[E(X)] = \log_2\left(p \cdot \frac{q}{p} + (1-p) \cdot \frac{1-q}{1-p}\right) = \log_2(1) = 0. \qquad (1.6)$$

Probability assignment

Any lossless coding method is essentially a method for sequential probability assignment. In order to implement an encoder, one can take the assigned probability $p_t$ for the next bit $x_t$ to be one, given the previous observations $x_1, \ldots, x_{t-1}$, and convert it into code, sequentially. If this assigned probability is the true probability, the resulting code is optimal. On the other hand, any encoder is actually assigning probabilities. If the code for a given sequence of symbols is $n$ bits, it means that the encoder assigned a probability of $2^{-n}$ for that sequence of bits to appear at this position.
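The numbers in the example above can be checked with a short sketch implementing Equations (1.3) and (1.4):

```python
from math import log2

def h(p):
    """Binary entropy, Equation (1.3), in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def D(p, q):
    """Divergence D(p||q), Equation (1.4): the extra bits per source bit
    paid for coding a Bernoulli(p) source with parameter q."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

print(h(0.9))        # ~0.469 bits per source bit
print(D(0.9, 0.89))  # ~0.00076 extra bits per source bit
print(D(0.9, 0.91))  # ~0.00085 extra bits per source bit
```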

A non-universal encoder has fixed, predefined probabilities. For example, the probability that the ASCII encoder assigns to the letter A is 1/256, without regard to the context. A universal encoder, on the other hand, has to learn the probability assignment while it encodes the bits, in order to give a more accurate estimate for the next bit. When the probabilities are known, it is easy to encode the data into bits. This translation can be done by the technique of [12] and other related techniques.

Stochastic setting

What is the true probability anyway? Is it possible to compress a sequence of bits by any factor, or is there a lower limit? In the stochastic setting, data is modeled as the output of a stochastic process, with a deterministic, predefined set of probabilities. In this case, the true probability is the probabilities as defined by the data model. There is a lower limit on the average compression rate of the data: the Shannon entropy [1]. If the encoder (and the decoder) knows the data model, it is possible to use a simple, non-universal code. When fed with the right data, this encoder will encode the data perfectly, to its entropy. But if the data model is incorrect, or inaccurate, this encoder will fail. A universal encoder, on the other hand, will try to learn the data model, and then assign probabilities sequentially. Meaning, each universal encoder tries to fit the data to a set of models. The simplest example is the universal encoder for Bernoulli data models. In these models, data bits are independently and identically distributed, with fixed probability $p \triangleq \Pr\{x_t = 1\}$. Every universal coding process should estimate this unknown parameter. For example, it can do so by using the Krichevsky-

Trofimov estimate [9]:

$$\hat{p}_{t+1} = \frac{N_t(1) + \frac{1}{2}}{t + 1} \qquad (1.7)$$

where $N_t(1)$ is the number of ones in $x_1, \ldots, x_t$. Unlike the two-step code described above, this encoder works on-the-fly, utilizing a single pass over the data. The estimate of $p$ gets more accurate over time, as needed in order to compress efficiently. For example, if the data length is 1,000 bits, there is no need for the encoder to describe $p$ with more than 3 decimal digits of accuracy. This is exactly what this process does: the encoder implicitly sends $p$ to the decoder with an accuracy of $1/n$. If the data is Bernoulli, the redundancy of this process, per bit, will converge to zero. But if, for example, this encoder is fed with the non-Bernoulli data 0, 1, 0, 1, 0, 1, ..., it will not be able to compress it at all! At least it will never expand the data too much. A stronger universal encoder uses the Markovian model of some fixed order $q$. In this model, the probability of each bit depends on the previous $q$ bits. Thus, in order to fully describe the data model, learning $2^q$ independent parameters is needed. A universal encoder (for order $q$) can assign probabilities using Equation (1.7) for each of the $2^q$ parameters independently. In the above example, if the code is designed for $q \ge 1$, data will be compressed asymptotically to zero code bits per source bit. This scheme implies setting the Markovian order, $q$, in the encoder and the decoder before seeing the actual data. This is a complicated task. One advantage of using a small $q$ is the fast convergence rate. A disadvantage is the convergence to a sub-optimal point, if the actual data has higher order Markovian elements. Lempel and Ziv suggested a well-known method [7, 8] that effectively uses a higher order $q$ as it collects more data. This universal encoder is widely implemented today, being used in programs like gzip and others.
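The following sketch runs the estimate of Equation (1.7) sequentially and accumulates the ideal codelength; note the two unbounded counters, which are exactly the infinite-memory requirement revisited in Section 1.2.

```python
import random
from math import log2

def kt_codelength(bits):
    """Ideal codelength of a binary sequence under the sequential
    Krichevsky-Trofimov estimate of Equation (1.7)."""
    ones = 0           # N_t(1): number of ones seen so far
    t = 0              # bits seen so far; both counters grow without bound
    codelength = 0.0
    for x in bits:
        p_hat = (ones + 0.5) / (t + 1)   # assigned P(next bit = 1)
        codelength += -log2(p_hat) if x else -log2(1 - p_hat)
        ones += x
        t += 1
    return codelength

data = [int(random.random() < 0.9) for _ in range(10_000)]  # Bernoulli p = 0.9
print(kt_codelength(data) / len(data))  # approaches h(0.9) ~ 0.469 bits/bit
```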

Still, it is hard to create a universal encoder that can encode any stochastic sequence. Consider the following: the probability of the next bit ($x_{t+1}$) to be one is 0.01, 0.5 or 0.99, depending on the binary number represented by $(x_1, \ldots, x_t)$ modulo 3. Because this sequence is not Markovian, of any finite order $q$, it will take a very heavy Markovian encoder to encode this sequence efficiently, although it will converge eventually, for high enough $q$, as proven in [15].

Deterministic setting

A different setting is the deterministic data model. This setting is somewhat more subtle to define and understand. In this setting, data is an arbitrary individual sequence, with a finite or infinite size. For example, Windows NT Service Pack 7 (SP7), this LaTeX file, or the infinite binary representation of the number π. Since there is no statistical model for these bit sequences, there is no defined Shannon entropy, and there is no compression lower limit. When calculating the size of a compressed file, should the size of the decompression program itself be added? Normally, the answer should be no. But then, one can create a decompression program that expands the bits 0, 0, 0 to SP7 (by integrating SP7 into the decompression program). Does it mean that SP7 is compressible to 3 bits? Counting compression program size, one can create a CPU that, when executing a single assembly command, will expand the whole SP7 into its memory. Should the number of transistors in the CPU be counted as well? The answer was given by Kolmogorov [2]. He defined a quantity, analogous to the entropy, for deterministic sequences, as the minimal size of a

program that runs on a universal Turing machine, needed to generate this sequence. Thus, the entropy of SP7 is the size of the smallest executable that will expand itself to SP7, when running on a standard Turing machine. Of course, in practice, no one can compute the Kolmogorov entropy of SP7. It turns out that it cannot be done even in theory, by running all possible programs and checking if their output is SP7, due to the halting problem. It is impossible to compute the Kolmogorov entropy of π as well, but it is easy to prove that it is finite, meaning that the entropy per bit, of the number π, is zero. If the process is stochastic, there is a strict relation between the Kolmogorov entropy and the Shannon entropy: they are (almost) the same, with probability one, for long enough data sequences. A more feasible method to measure the complexity of deterministic sequences is to check the empirical Markovian entropy of order $q$. For each of the $2^q$ histories, one needs to compute the empirical entropy of the next bit. The weighted average of these computations is the empirical entropy of the sequence, of order $q$. It is easy to prove that the empirical Markovian entropy is a monotonic non-increasing function of $q$. Also, for any finite sequence, with only $N$ bits, the empirical Markovian entropy will converge to zero as $q$ approaches $N$. This is exactly what happened in the example above of compressing SP7 into 3 bits. On the other hand, when the sequence is infinite, the empirical Markovian entropy will usually not converge to zero. If the process is stochastic and ergodic, there is a strict relation between the empirical Markovian entropy of high enough order ($q \to \infty$) and the Shannon entropy: they are (almost) the same, with probability one, for long enough data sequences.
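A short sketch of the quantity just defined: group the bits by their length-q history, compute the empirical entropy of the following bit in each group, and take the weighted average.

```python
from math import log2
from collections import defaultdict

def h(p):
    """Binary entropy; zero at the endpoints by convention."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def empirical_markov_entropy(bits, q):
    """q-th order empirical entropy: weighted average, over the 2**q
    histories, of the empirical entropy of the next bit."""
    counts = defaultdict(lambda: [0, 0])   # history -> [#zeros, #ones]
    for t in range(q, len(bits)):
        counts[tuple(bits[t - q:t])][bits[t]] += 1
    n = len(bits) - q
    return sum((z + o) / n * h(o / (z + o)) for z, o in counts.values())

seq = [0, 1] * 500                            # the 0,1,0,1,... example
print(empirical_markov_entropy(seq, 0))       # 1.0: looks incompressible
print(empirical_markov_entropy(seq, 1))       # 0.0: fully predictable at order 1
```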

If π is a normal number (a number whose digits are random-like, in any sequence length), its empirical Markovian entropy is maximal, i.e., 1 coded bit per source bit, meaning that it cannot be compressed at all using a finite-state machine. On the other hand, as discussed before, its Kolmogorov entropy per bit is 0, meaning that it is compressible to zero coded bits per source bit, using a Turing machine. This demonstrates, to the extreme, the limitations of the empirical Markovian entropy, of any finite order. Personally, I doubt that any practical encoder can ever break the empirical Markovian entropy limit.

Compression vs. prediction

In the compression problem, or the probability assignment problem, the encoder has to sequentially assign probabilities for the next bit to be one, with a self-information criterion. In the prediction problem, on the other hand, the predictor has to predict whether the next bit is going to be one or zero, with, say, a criterion of minimal number of errors. The prediction problem is somewhat simpler. For example, in the compression problem the encoder needs to distinguish between cases where the probability for the next bit to be one is, say, 0.124 or a nearby value, while in the prediction problem the predictor should predict zero in both cases. There is no one-to-one relation between the compressibility and predictability of a sequence using a finite-state machine (as the number of states grows). Only in the extreme cases does such a relation exist: a sequence is fully predictable iff it is compressible to zero; a sequence is fully unpredictable iff it is also uncompressible. As found in [15], the compressibility of a sequence, $\rho(x)$, is bounded

from above and below by its predictability, $\pi(x)$:

$$2\pi(x) \le \rho(x) \le h(\pi(x)) \qquad (1.8)$$

where $h(\cdot)$ is the binary entropy function, as defined in Equation (1.3). Also, it is noted there that both the lower bound and the upper bound are achievable. Consider, for example, a sequence that contains 4 zero bits, and then a single random bit, with probability $p = 0.5$ to be one. Consider another sequence that is Bernoulli with $p = 0.1$. In both cases, the predictability is $\pi(x) = 0.1$, meaning that, on average, there is a 10% prediction error rate per bit. The compressibility, however, is different for the two sequences. In the first case, the compressibility is $\rho(x) = 0.2$, which is the lower limit. In the second case, the compressibility is $\rho(x) = h(0.1) \approx 0.469$, which is the upper limit.

1.2 Universal Coding with Limited Resources

In real life the complexity of the coding process is always limited. However, the universal encoders presented above, even in the most simple case where data comes from a Bernoulli process, require unlimited resources. Specifically, the suggested universal coding scheme assigns probabilities according to Equation (1.7). This equation implies that the encoder and the decoder should know $t$ and $N_t(1)$, for every $t$, meaning that they need to have an infinitely growing memory in order to store these two infinitely-large integer numbers with perfect accuracy. In order to overcome the infinite memory requirement of the accurate implementation of Equation (1.7), two straightforward approximate solutions can be suggested. One solution is to use simple counting when possible, but then freeze both counters when $t$ reaches its upper limit, i.e., when there are no more states to hold bigger values of $t$ and $N_t(1)$. In this solution,

all successive data bits do not modify the estimate $\hat{p}$ of $p$. This means, however, that if the data model is slightly wrong, and $p$ is slowly changing in time, the encoder cannot track the variations and may fail. Another solution is to reset both counters to zero when $t$ reaches its upper limit. In this case, the encoder works in blocks, returning to $\hat{p} = \frac{1}{2}$ from block to block, independent of the data. This encoder will perform better if the model is incorrect, thus, it is more robust, at the expense of its accuracy. Later on in the thesis it will be shown that both solutions are far from optimal, and another, better, solution will be presented for this Bernoulli case. In order to analyze the limited-resources coding problem we should first define how to measure the complexity of a coding process. Should the complexity measure be the size of the binary code implementing it on a PC? Should it be the die size, or the number of gates, in a hardware implementation? There can be many measurements. In this work, as in other previous works on limited-resource universal machines, the complexity limitation is chosen to be limited memory. The machine is constrained to be a finite-state machine with K states.

Previous results: limited memory universal machines

Probably the first set of results on data inference using finite-state machines was obtained by Cover [3] and Hellman and Cover [4, 5, 6]. This work considers hypothesis testing with finite memory. It determined the minimal extra error probability, as a function of the number of states, associated with the finite-state limitation. Later on, Leighton and Rivest considered a problem of assigning a probability to a sequence, using finite-state machines, where the goal was to approach the true probability that governs the data, with a minimal square

error criterion [11]. These results are extensively utilized in this thesis. As noted above, universal prediction is closely related to universal coding. The finite memory universal prediction problem has been investigated in [17, 18]. The main result there is the introduction and analysis of two finite memory universal predictors: the sliding window predictor and the saturated counter predictor. Both predictors attain the predictability under the Bernoulli stochastic setting and the single-state predictability for the deterministic setting, but the saturated counter achieves that at a higher rate. The encoders suggested in this thesis have some similarities to these predictors.

1.3 Thesis Outline

We are now ready to present the thesis outline in specific terms. The thesis considers universal coding, or universal probability assignment, of binary sequences, where the universal encoder is constrained to have a finite memory, i.e., it is constrained to be a finite-state machine with K states. It actually considers the problem of universal probability assignment using a finite memory machine. As noted above, the complexity, or the number of states, needed to translate the assigned probability into code bits, is ignored. This translation to code bits has a known cost per number of states and can be done, e.g., by the technique of [12]. The work considers three different settings for the binary sequences: the sequence can be stochastic or an individual sequence; if stochastic, it can come from a Bernoulli source, or a q-th order Markov source; if deterministic, the goal is to compete with a single-state reference machine. The universal finite-state machines analyzed in the thesis can be either deterministic or randomized,

and can be either time-variant or time-invariant. The purpose is to find, for each case, lower and upper bounds on the redundancy, and to specifically describe a machine that attains the upper bound. So far, the behavior in some of the cases is not completely known. Unlike the classical work in universal coding, which analyzes the redundancy as a function of the observation length, this work focuses on the effect of the number of states, K, on the redundancy at steady-state, i.e., after long enough (infinite) time. It turns out that, in all the cases considered in this work, the redundancy goes to zero as the number of states goes to infinity. This is not a trivial result; in the prediction problem, only a random machine can achieve it, as shown in [15]. This thesis is organized as follows. In chapter 2, the architecture of the coding system is described, as well as the specification of the various cases considered in the thesis. Then, chapter 3 discusses the optimal way to quantize the probability values that are assigned to states. Chapter 4 analyzes universal K-state encoders for the Bernoulli case, and in chapter 5, this analysis is carried on for the q-th order stochastic Markovian case. Chapter 6 considers the deterministic setting, where the goal (reference) is to compete with a single-state machine tuned to the data. The thesis is concluded in chapter 7.

Chapter 2: Data Settings and Machine Types

The universal coding problem, with limited resources, can be defined in different settings, reflecting assumptions on the data. For example, in the stochastic setting the data comes from an unknown stochastic process and the goal is to reach the entropy of that process. In the deterministic setting, the data is arbitrary, but the goal is to reach the performance of a constrained batch encoder. In addition, the problem is defined by the specification of the limited-resources universal finite-state machine, which can be deterministic, stochastic, time-invariant or time-variant. This chapter begins with a schematic description of the architecture of the finite-state coding system analyzed in the thesis. The presented scheme emphasizes that the finite-state machine is used for probability estimation. Then, the chapter summarizes the various cases (data settings and machine types) that are analyzed later in the work. The chapter ends with the definition of the minimum redundancy goal.

2.1 System Architecture

The coding system analyzed in this work consists of an encoder, which compresses the data bits into coded bits, and a decoder, which expands the compressed bits to their original form, with perfect accuracy. The encoder assigns probabilities using a K-state universal probability estimate, and then translates the actual bit $x_t$ into coded bits, using an arithmetic encoder [12]. The decoder decodes the coded bits, using an arithmetic decoder, and assigns probabilities for the next bit, sequentially. The block denoted by D is a single-bit delay block. The system architecture is illustrated in Figure 2.1. The K-state universal probability estimate block is common to the encoder and the decoder. It accepts the data bits, with a single bit delay, i.e., $x_1, \ldots, x_{t-1}$, and assigns a probability for $x_t$ to be one. In both places, it is initialized with the same values, and fed with the same bits. Thus, it will provide the same probability estimation, as needed, say, for the arithmetic coding process. Note that when the encoder is a randomized machine, we assume that the encoder and the decoder have the same pseudo-random seed generator, which is not known to the source of the data sequence. The scheme above can correspond to both universal and non-universal coding processes. In this thesis universal schemes are considered, in which the probability estimate does not depend on assumptions about the data model, and should provide good performance for a large set of possible models. Thus, it should have some mechanism that essentially learns the relevant parameters of the incoming data, and assigns probabilities accordingly.

[Figure 2.1: System architecture. In the encoder, $x_t$ feeds an arithmetic encoder whose probability input $P^*(x_t)$ comes from the K-state universal probability estimate driven by the delayed bit $x_{t-1}$; the decoder mirrors this structure around an arithmetic decoder.]

2.2 Data Settings

Bernoulli setting

The simplest setting for universal coding is when data comes from an i.i.d. Bernoulli source, with unknown probability $p \triangleq \Pr\{x_t = 1\}$. The reference compression rate is the source entropy, as defined in Equation (1.3):

$$\rho_{\mathrm{reference}} \triangleq h(p). \qquad (2.1)$$

Markovian setting

A more advanced stochastic setting is when data has a Markov distribution of a known order $q$, with unknown values of the conditional probabilities $\Pr(x_t = 1 \mid x_{t-1}, \ldots, x_{t-q}) = p(x_t \mid s)$, at each Markov state $s$. The reference compression rate is the entropy of the source [13]:

$$\rho_{\mathrm{reference}} \triangleq H_q = \sum_{s=1}^{2^q} P_s \, h(p(x_t \mid s)) \qquad (2.2)$$

where $s$ denotes the Markovian state, specified by the previous $q$ symbols, and $P_s$ is the stationary probability of being at state $s$.

Deterministic setting

In the deterministic setting, the data is an arbitrary infinite individual sequence. However, the performance of the universal encoder is compared to a reference value, which can be the empirical entropy of the sequence. This represents the best possible compression of a constrained batch encoder. The reference compression rate is the single-state empirical entropy:

$$\rho_{\mathrm{reference}} \triangleq \rho_1(x) = \limsup_{t \to \infty} h\left(\frac{N_t(1)}{t}\right). \qquad (2.3)$$

The goal is to attain this rate for ANY sequence. A higher order finite-state empirical entropy, or finite-state compressibility, is also defined [15]:

$$\rho(x) \triangleq \lim_{S \to \infty} \limsup_{t \to \infty} \min_{g \in G_S} \rho\left(g; x_1^t\right), \qquad (2.4)$$

where

$$\rho\left(g; x_1^t\right) = \sum_{s=1}^{S} \frac{N_t(s)}{t} \, h\left(\frac{N_t(s, 0)}{N_t(s)}\right). \qquad (2.5)$$

This thesis does not consider this higher order deterministic reference.

2.3 Machine Types

The main constraint imposed on the universal machine analyzed here is that it has only K states. This work shall distinguish between three cases. The simplest case is where the machine is deterministic and time-invariant. A more flexible machine is a random machine, where the state transition function may be stochastic. This work shall also consider a time-variant machine, where the state transition function, and the probability assigned to each node, can vary in time.

Deterministic machine

A finite-state machine is a machine with a finite number of states, K. Following the processing of $t$ data bits, $x_1, \ldots, x_t$, the machine remembers only $S_t$, the state at time $t$. $S_0$ is the initial state of the machine. The probability that the machine assigns for the bit $x_{t+1}$ depends only on $S_t$, meaning that each state has a fixed probability assigned to it, by the function:

$$\hat{p}_{t+1} = f(S_t). \qquad (2.6)$$

The machine has a state-transition function, which is a function of the current state and the next bit:

$$S_{t+1} = g(S_t, x_{t+1}). \qquad (2.7)$$

Random machine

In a random machine, the initial state, $S_0$, and the state-transition function, $g(\cdot)$, are both random functions, i.e.,

$$S_0 = S_0(\Omega_0) \qquad (2.8)$$

$$S_{t+1} = g(S_t, x_{t+1}, \Omega_{t+1}) \qquad (2.9)$$

where the $\Omega_i$ are independent random variables. As shown in [6], random machines having the same K-state limitation as deterministic machines can provide better results. We shall see the same phenomenon in this case as well.

Time-variant machine

In a time-variant machine, the probability function, $f(\cdot)$, and the state-transition function, $g(\cdot)$, are both time-variant functions, i.e.,

$$\hat{p}_{t+1} = f(S_t, t) \qquad (2.10)$$

$$S_{t+1} = g(S_t, x_{t+1}, t). \qquad (2.11)$$

2.4 Minimum Redundancy Goal

The self-information of an event $x$ is the amount of information revealed by the fact that $x = 0$ or $x = 1$, given the a-priori probability $p$ for this event

to occur:

$$i(x \mid p) \triangleq \begin{cases} -\log_2(1-p) & x = 0 \\ -\log_2(p) & x = 1 \end{cases} \qquad (2.12)$$

The actual compression rate, using an ideal code, of a given sequence is the sum of the self-information of all the events, divided by the length of the sequence:

$$\rho_{\mathrm{actual}} \triangleq \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} i(x_t \mid p_t) \qquad (2.13)$$

and the redundancy is:

$$\mathrm{redundancy} \triangleq \rho_{\mathrm{actual}} - \rho_{\mathrm{reference}}. \qquad (2.14)$$

The goal is to minimize the redundancy, universally, by selecting the optimal machine, i.e., selecting $S_0$, $f(\cdot)$ and $g(\cdot)$.
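A finite-sample sketch of Equations (2.12) to (2.14); the reference rate is passed in, e.g., h(p) for the Bernoulli setting:

```python
from math import log2

def self_information(x, p):
    """Equation (2.12): ideal codelength of outcome x under assigned probability p."""
    return -log2(p) if x == 1 else -log2(1 - p)

def redundancy(bits, assigned_probs, reference_rate):
    """Finite-sample version of Equations (2.13)-(2.14): average ideal
    codelength under the assigned probabilities, minus the reference rate."""
    actual = sum(self_information(x, p)
                 for x, p in zip(bits, assigned_probs)) / len(bits)
    return actual - reference_rate

# Example: a machine that always assigns 0.8 to bits from a Bernoulli(0.9) source.
bits = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]
probs = [0.8] * len(bits)
h9 = -(0.9 * log2(0.9) + 0.1 * log2(0.1))  # reference rate h(0.9)
print(redundancy(bits, probs, h9))
```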

Chapter 3: Optimal Probability Axis Quantization

This chapter analyzes an optimal quantization of the probability axis, where the goal is to minimize a worst-case redundancy associated with the quantization process. This redundancy is an essential ingredient of any problem where a finite-state machine assigns probabilities. This is because a finite-state machine, with only K states, can assign at most K distinct values of the probabilities $p_t$. Thus, the possible probability range $[0, 1]$ has to be quantized into K values. Assume data is coming from a Bernoulli source with (real) probability $p$, and this probability is estimated, after quantization, by some value $p_i$, where $i \in \{1, \ldots, K\}$. When encoding this data into coded bits, the encoder will have an inaccurate data model, in the case that $p \ne p_i$, causing it to be sub-optimal. Since $p$ can have any value, and $p_i$ can have only K values, this happens often. The average extra bit-rate over the entropy is the divergence between $p$ and $p_i$, $D(p \| p_i)$, as defined in Equation (1.4). This provides a fundamental limitation on the achievable performance of

any finite-state machine. There is always a value of $p$ for which the difference between $p$ and the closest estimate $p_i$ is at least $1/2K$.

3.1 Min-Max Criterion

One possible way to design an optimal quantization for probability assignment is to use the following min-max criterion:

$$[p_1, \ldots, p_K]_{\mathrm{opt}} = \arg\min_{[p_1, \ldots, p_K]} \max_p \left( \min_i D(p \| p_i) \right) \qquad (3.1)$$

where the inner minimum is simply the selection of the closer (in the divergence sense) point between the two closest quantization points of $p$ from both sides. Denote in the sequel:

$$D_{\max} \triangleq \max_p \left( \min_i D(p \| p_i) \right) \qquad (3.2)$$

which is the maximal divergence given $[p_1, \ldots, p_K]$. The optimal quantization is defined as the one that minimizes $D_{\max}$.

3.2 High Resolution Asymptotic Quantization

Theorem 3.1 In the Bernoulli setting, for $K \gg 1$,

$$\min_{[p_1, \ldots, p_K]} D_{\max} = \frac{\pi^2}{8 \ln(2) K^2} + \text{higher order terms}, \qquad (3.3)$$

and it is achieved by the quantization points:

$$p_i^{\mathrm{opt}} = \sin^2\left(\frac{\pi \left(i - \frac{1}{2}\right)}{2K}\right); \quad i = 1, \ldots, K. \qquad (3.4)$$

Proof: At high resolution, the divergence function can be approximated as:

$$D(p \| p \pm \epsilon) \approx \frac{\epsilon^2}{2 \ln(2) p (1-p)}, \qquad \epsilon \ll p \le 1 - \epsilon. \qquad (3.5)$$

Define a density function, $\lambda(p)$, which is the density of the quantization points near the point $p$:

$$\lambda(p) \ge 0, \qquad \int_0^1 \lambda(p) \, dp = K, \qquad (3.6)$$

and we get:

$$\epsilon_{\max}(p) \approx \frac{1}{2\lambda(p)} \qquad (3.7)$$

$$D_{\max}(p) \approx \frac{1}{8 \ln(2) p (1-p) \lambda^2(p)}. \qquad (3.8)$$

Since the min-max criterion causes $D_{\max}(p)$ to be constant for every point $p$:

$$\lambda(p) \propto (p(1-p))^{-1/2} \qquad (3.9)$$

which gives:

$$\lambda(p) = \frac{K}{\pi \sqrt{p(1-p)}} \qquad (3.10)$$

$$D_{\max}(p) \approx \frac{\pi^2}{8 \ln(2) K^2}. \qquad (3.11)$$

The restoration points, $\{p_i\}_1^K$, can be located by integration:

$$\int_0^{p_i} \lambda(p) \, dp = i - \frac{1}{2}; \quad i = 1, \ldots, K \qquad (3.12)$$

which leads to Equation (3.4). This is a lower bound on the performance of any finite-state encoder. To achieve this limit the encoder should lock on the right state at all times. As will be seen later, a time-variant machine can do that!

Graphical interpretation

It turns out that there is a graphical interpretation for these restoration points. The points are arranged on the trajectory of a linearly-spaced circle over

the probability axis, as demonstrated in Figure 3.1. Note that the decision points of the quantizer are not the trajectory of the slices; they are slightly different.

[Figure 3.1: Optimal quantization points $P_1, \ldots, P_5$, obtained by projecting equally spaced points on a semicircle onto the probability axis.]

Empirical results

When comparing this high-resolution quantization with the optimal one, for a given finite K, it seems that the differences are minor (even for reasonable K). For example, for K = 100, the distance between the optimal points and the points given by Equation (3.4) is only about 1% of the distance between adjacent points. The optimal points were calculated using an algorithm which is similar to Lloyd's scalar quantization algorithm [13]. A Java program created for comparing these results can be requested via e-mail from the author.
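The construction of Equation (3.4) and the bound (3.11) can be checked numerically with a short sketch; the grid scan below is a brute-force stand-in for the analytical worst case.

```python
from math import sin, pi, log, log2
from bisect import bisect

def D(p, q):
    """Divergence of Equation (1.4)."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

def optimal_points(K):
    """Quantization points of Equation (3.4), sorted ascending."""
    return [sin(pi * (i - 0.5) / (2 * K)) ** 2 for i in range(1, K + 1)]

K = 100
pts = optimal_points(K)

def nearest_divergence(p):
    """min_i D(p || p_i), checking the two points that bracket p."""
    i = bisect(pts, p)
    return min(D(p, q) for q in pts[max(i - 1, 0):i + 1])

grid = [j / 10000 for j in range(1, 10000)]       # true probabilities p
print(max(nearest_divergence(p) for p in grid))   # empirical D_max
print(pi ** 2 / (8 * log(2) * K ** 2))            # bound (3.11): ~1.78e-4
```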

3.3 Comparison with Uniform Point Allocation

Under the divergence difference measure, the uniform point allocation performs poorly. The maximal divergence between a quantization point and a true probability point occurs at the edges, i.e., when $p = 0$ or $p = 1$. The divergence at these points is:

$$D_{\max}(0) = -\log_2\left(1 - \frac{1}{K}\right) \qquad (3.13)$$

which is $\Theta\left(\frac{1}{K}\right)$ instead of the $\Theta\left(\frac{1}{K^2}\right)$ of the optimal quantization. Under a Bayesian criterion, where it is assumed that $p$ is uniform in the range $[0, 1]$, quantization with uniform point allocation does not perform well either. This is because the divergence at the edges is, as above, $\Theta\left(\frac{1}{K}\right)$. Approximating the difference for each point by $\epsilon = \frac{1}{2K}$, which is the worst case, leads to:

$$D_{\mathrm{average}} \approx 2 \int_0^{1/2} D(p \| p + \epsilon) \, dp = \frac{\ln(K)}{8 \ln(2) K^2}. \qquad (3.14)$$

While this calculation is only approximate, we easily observe that for the uniform distribution $D_{\mathrm{average}}$ behaves as:

$$D_{\mathrm{average}} = \Theta\left(\frac{\log K}{K^2}\right). \qquad (3.15)$$

Chapter 4: Bernoulli Setting

This chapter considers the Bernoulli setting, and describes various finite-state machines that cope with it. It turns out that in this case there is an optimal redundancy rate for time-invariant machines of $\Theta\left(\frac{1}{K}\right)$. This work first introduces a random finite-state machine, and proves that it achieves this optimal convergence rate. Then, it provides a deterministic machine whose rate is $\Theta\left(\frac{\log K}{K}\right)$. This machine is constructed out of the random machine, by sacrificing a few states, and using the true randomness of the Bernoulli data. Finally, this work provides a time-variant machine whose rate is $\frac{1.78}{(K-2)^2}$, i.e., it achieves the quantization limit itself(!), including the exact optimal constant.

4.1 Random Machine

Leighton and Rivest investigated an estimation problem using finite-state machines [11], related to our problem, but in a different context. In their paper, they used a mean square error criterion, i.e., minimizing $(\hat{p} - p)^2$, and not the coding redundancy, i.e., minimum $D(p \| \hat{p})$. It is shown there that the best finite-state machine (either random or deterministic) cannot do better

than $\Theta\left(\frac{1}{K}\right)$, for every $p$. They also presented a random machine that attains this rate. This machine has a uniform quantization structure. In our case, the same results are obtained by following their derivations, where instead of using a uniform quantization of the probability axis, we use the optimal quantization, given by Equation (3.4). This can be done because in high-resolution quantization the divergence function is actually a square error function, as shown by Equation (3.5), since the factor $p(1-p)$ becomes a constant. This gives a lower bound on the redundancy of $\Theta\left(\frac{1}{K}\right)$. The suggested universal random machine with K states is given in Figure 4.1. This machine is essentially the optimal random machine suggested in [11], where the probability value at each node, and the transition probabilities, are set using the optimal quantization formula, given by Equation (3.4). These probabilities are denoted $r_1, \ldots, r_K$ in the figure, where $r_i + r_{K-i+1} = 1$. This machine attains the optimal $\Theta\left(\frac{1}{K}\right)$ error, for every $p$.

[Figure 4.1: Random machine for the Bernoulli setting: a chain of K states with probabilistic transitions governed by $r_1, \ldots, r_K$.]

Sliding-window interpretation

The machine above yields approximately the same results as a sliding-window deterministic machine that remembers the last $K-1$ data bits explicitly, using $2^{K-1}$ states. The machine state, $i$, reflects that there are

$(K-1) p_i$ set bits in the buffer. Then, if the machine gets a set bit, $x_{t+1} = 1$, it should remove $x_{t-K+2}$ from the buffer, and append the set bit. Because the machine does not know the value of the removed bit, it assumes that it is one with probability $p_i$. Combined with the set bit we need to add, the machine remains in the same state with probability $p_i$, and advances one state with probability $1 - p_i = p_{K-i+1}$.

Stable equilibrium interpretation

In order to continuously move towards the correct state, the machine should be in a stable equilibrium at that state. At any time $t$, the pressure on the machine is defined as the expectation of the state transition, given the current state:

$$\mathrm{pressure} \triangleq E\{S_{t+1} - S_t \mid S_t\}. \qquad (4.1)$$

By defining the estimation error at time $t$, $\epsilon_t \triangleq \hat{p}_t - p$, the pressure can be calculated:

$$\mathrm{pressure} = p (1 - \hat{p}_t) - (1 - p) \hat{p}_t = p (1 - p - \epsilon_t) - (1 - p)(p + \epsilon_t) = -\epsilon_t. \qquad (4.2)$$

When $\hat{p}_t = p$, the pressure is zero. When $\hat{p}_t \ne p$ the pressure on $\hat{p}_t$ is towards $p$, at the magnitude of the difference between them. Effectively, this gives a Gaussian distribution for $\hat{p}$ whose average is $p$ and whose variance is of the order of $\Theta\left(\frac{1}{K}\right)$.
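A Monte Carlo sketch of the machine of Figure 4.1, written in its sliding-window interpretation: from state $i$ (whose assigned probability is $p_i$), a one advances the state with probability $1 - p_i$ and a zero retreats with probability $p_i$, reproducing the pressure of Equation (4.2). The redundancy measurement at the end is a check, not part of the machine.

```python
import random
from math import sin, pi, log2

def D(p, q):
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

K = 32
p_hat = [sin(pi * (i - 0.5) / (2 * K)) ** 2 for i in range(1, K + 1)]  # Eq. (3.4)

def step(state, x):
    """One probabilistic transition (sliding-window interpretation):
    a one advances with probability 1 - p_i, a zero retreats with p_i."""
    if x == 1 and random.random() < 1 - p_hat[state]:
        return min(state + 1, K - 1)
    if x == 0 and random.random() < p_hat[state]:
        return max(state - 1, 0)
    return state

p, T = 0.3, 200_000
state, excess = K // 2, 0.0
for _ in range(T):
    x = int(random.random() < p)     # Bernoulli(p) source bit
    excess += D(p, p_hat[state])     # redundancy paid at this step
    state = step(state, x)
print(excess / T)                    # average redundancy, Theta(1/K)
```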

4.2 Deterministic Machine

Following [11], the random machine above can be converted into a deterministic machine, by using the true randomness of the Bernoulli data. This conversion expands each of the K states into $\Theta(\log K)$ states, thus reducing the convergence rate to $\Theta\left(\frac{\log K}{K}\right)$.

[Figure 4.2: Simulation of probabilistic transitions by a deterministic machine (a probabilistic transition on the left, its deterministic simulation via intermediate states on the right).]

The basic idea is to look at two data bits at a time, skipping all occurrences of 00 and 11, and using only 01 and 10. Since the data is Bernoulli, the probability to get 01 before 10, and vice versa, is the same (50%), independent of $p$. Figure 4.2 demonstrates this method, for $p_i = 1/2$. In order to implement it for, say, $p_i = 5/8$, we need 3 stages, or 9 intermediate states. In general, in order to get an accuracy of $1/Q$ we need $\Theta(\log Q)$ states. In order to implement the random machine suggested in Figure 4.1, we need an accuracy, based on Equation (3.4), of $Q \approx K^2$ at the edges of the probability axis. This means that each state is expanded into $\Theta(\log K)$ states, thus reducing the convergence rate to $\Theta\left(\frac{\log K}{K}\right)$, as described above.
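The following sketch illustrates the two ideas in this construction: extracting fair bits from pairs of Bernoulli bits, and composing a few fair bits into a transition of dyadic probability such as the 5/8 of the example.

```python
import random

def fair_bits(bits):
    """Von Neumann style extraction used by the deterministic machine:
    read disjoint pairs, skip 00 and 11, map 01 -> 0 and 10 -> 1.
    Each output bit is fair for any Bernoulli parameter p."""
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            yield a                      # 01 yields 0, 10 yields 1

def take_transition(fair, num=5, stages=3):
    """Decide a transition of probability num / 2**stages (e.g. 5/8)
    by reading 'stages' fair bits as a binary fraction."""
    value = 0
    for _ in range(stages):
        value = 2 * value + next(fair)   # uniform over {0, ..., 2**stages - 1}
    return value < num

data = [int(random.random() < 0.9) for _ in range(100_000)]  # biased source
fair = fair_bits(data)
taken = sum(take_transition(fair) for _ in range(1000))
print(taken / 1000)                      # close to 5/8 despite p = 0.9
```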

4.3 Time-Variant Machine

One of the main results of this thesis is that the quantization limit, as derived in the previous chapter, is asymptotically achievable by a time-variant machine. This means that a time-variant machine can have the same performance as a time-invariant machine, but with a smaller, square root, number of states.

Theorem 4.1 There exists a time-variant machine whose redundancy is, for every $p$, at most

$$\frac{1.78}{(K-2)^2} + \epsilon; \quad \forall \epsilon > 0. \qquad (4.3)$$

Proof: Given K, set $K-2$ quantization points using the optimal quantization scheme, as defined in Equation (3.4). These restoration points slice the $[0, 1]$ range into $K-2$ intervals, each interval corresponding to all probability values associated with a specific quantization value. For every positive $\epsilon$, the machine converges in a finite time and stays at the correct interval with probability $1 - \epsilon$. The worst divergence, under the assumption that the machine is at the correct interval, was shown to be $\frac{1.78}{(K-2)^2}$. Since it converges in finite time to this interval, the average (over time) of the divergence will also not exceed this value. To prove that the convergence occurs in finite time, one can utilize the results of Cover [3]. It was shown there that with only 4 states and infinite time, it is possible to recognize, with probability one, whether the source probability is above or below a given decision point. This is done using iteration between 4 time-groups, as described in Figure 4.3. In order for this process to converge to the right point with probability one, there is a need to enlarge $N_0$ and $N_1$ towards infinity exactly at the rate specified in [3].

[Figure 4.3: Cover's 4-state process: the machine alternates between a check phase that decides "$p <$ limit" after detecting $N_0$ zeros and a check phase that decides "$p >$ limit" after detecting $N_1$ ones.]

Fix $\delta > 0$. It was also shown in [3] that there exists a finite time interval, $T(\delta, \alpha)$, so that with probability $1 - \delta$ the machine can determine within this time interval whether the source probability is below or above $\alpha$. In this case there are $K-3$ such threshold points, each between two quantization points. Using this, it can be demonstrated how a time-variant machine can determine the correct interval with high enough probability, and within a finite duration (which is actually the sum of the time durations required to check all these thresholds). Specifically, the machine first checks whether $p$ is below or above the first decision point $d_1$. After the finite time $T(\delta, d_1)$, and using the 4 states, it makes the decision whether $p$ is below or above $d_1$, and the probability of error in this decision is at most $\delta$. If the machine decides that $p$ is below $d_1$, it goes into a state that serves as a sink for the first quantization value. Otherwise it moves to the second stage. At the $n$-th stage the machine checks whether $p$ is below or above the $n$-th decision point $d_n$. Only $n-1$ states are needed in order to remember if the machine entered a sink at any previous decision, and an extra 4 states for making the threshold decision regarding $d_n$. This takes $T(\delta, d_n)$ time. At the last stage, where $n = K-3$, the machine needs $K-4$ states in order to remember previous sinks, and an extra 4 states for the decision procedure, a total of K states. After $T(\delta, d_{K-3})$ time it ends up at a sink, i.e., it is locked at a state, and ignores the rest of the data. Each stage has a probability of no more than $\delta$ to err. By setting $\delta = \frac{\epsilon}{K-3}$, the total probability of error will be no more than $\epsilon$ throughout this process.


More information

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have

More information

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process P. Mantalos a1, K. Mattheou b, A. Karagrigoriou b a.deartment of Statistics University of Lund

More information

Notes on Instrumental Variables Methods

Notes on Instrumental Variables Methods Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of

More information

Distributed Rule-Based Inference in the Presence of Redundant Information

Distributed Rule-Based Inference in the Presence of Redundant Information istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced

More information

Real Analysis 1 Fall Homework 3. a n.

Real Analysis 1 Fall Homework 3. a n. eal Analysis Fall 06 Homework 3. Let and consider the measure sace N, P, µ, where µ is counting measure. That is, if N, then µ equals the number of elements in if is finite; µ = otherwise. One usually

More information

On Code Design for Simultaneous Energy and Information Transfer

On Code Design for Simultaneous Energy and Information Transfer On Code Design for Simultaneous Energy and Information Transfer Anshoo Tandon Electrical and Comuter Engineering National University of Singaore Email: anshoo@nus.edu.sg Mehul Motani Electrical and Comuter

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

General Linear Model Introduction, Classes of Linear models and Estimation

General Linear Model Introduction, Classes of Linear models and Estimation Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

q-ary Symmetric Channel for Large q

q-ary Symmetric Channel for Large q List-Message Passing Achieves Caacity on the q-ary Symmetric Channel for Large q Fan Zhang and Henry D Pfister Deartment of Electrical and Comuter Engineering, Texas A&M University {fanzhang,hfister}@tamuedu

More information

CHAPTER 5 STATISTICAL INFERENCE. 1.0 Hypothesis Testing. 2.0 Decision Errors. 3.0 How a Hypothesis is Tested. 4.0 Test for Goodness of Fit

CHAPTER 5 STATISTICAL INFERENCE. 1.0 Hypothesis Testing. 2.0 Decision Errors. 3.0 How a Hypothesis is Tested. 4.0 Test for Goodness of Fit Chater 5 Statistical Inference 69 CHAPTER 5 STATISTICAL INFERENCE.0 Hyothesis Testing.0 Decision Errors 3.0 How a Hyothesis is Tested 4.0 Test for Goodness of Fit 5.0 Inferences about Two Means It ain't

More information

CSE 599d - Quantum Computing When Quantum Computers Fall Apart

CSE 599d - Quantum Computing When Quantum Computers Fall Apart CSE 599d - Quantum Comuting When Quantum Comuters Fall Aart Dave Bacon Deartment of Comuter Science & Engineering, University of Washington In this lecture we are going to begin discussing what haens to

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Journal of Modern Alied Statistical Methods Volume Issue Article 7 --03 A Comarison between Biased and Unbiased Estimators in Ordinary Least Squares Regression Ghadban Khalaf King Khalid University, Saudi

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

1-way quantum finite automata: strengths, weaknesses and generalizations

1-way quantum finite automata: strengths, weaknesses and generalizations 1-way quantum finite automata: strengths, weaknesses and generalizations arxiv:quant-h/9802062v3 30 Se 1998 Andris Ambainis UC Berkeley Abstract Rūsiņš Freivalds University of Latvia We study 1-way quantum

More information

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split A Bound on the Error of Cross Validation Using the Aroximation and Estimation Rates, with Consequences for the Training-Test Slit Michael Kearns AT&T Bell Laboratories Murray Hill, NJ 7974 mkearns@research.att.com

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

Interactive Hypothesis Testing Against Independence

Interactive Hypothesis Testing Against Independence 013 IEEE International Symosium on Information Theory Interactive Hyothesis Testing Against Indeendence Yu Xiang and Young-Han Kim Deartment of Electrical and Comuter Engineering University of California,

More information

Analysis of execution time for parallel algorithm to dertmine if it is worth the effort to code and debug in parallel

Analysis of execution time for parallel algorithm to dertmine if it is worth the effort to code and debug in parallel Performance Analysis Introduction Analysis of execution time for arallel algorithm to dertmine if it is worth the effort to code and debug in arallel Understanding barriers to high erformance and redict

More information

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule

The Graph Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule The Grah Accessibility Problem and the Universality of the Collision CRCW Conflict Resolution Rule STEFAN D. BRUDA Deartment of Comuter Science Bisho s University Lennoxville, Quebec J1M 1Z7 CANADA bruda@cs.ubishos.ca

More information

On a Markov Game with Incomplete Information

On a Markov Game with Incomplete Information On a Markov Game with Incomlete Information Johannes Hörner, Dinah Rosenberg y, Eilon Solan z and Nicolas Vieille x{ January 24, 26 Abstract We consider an examle of a Markov game with lack of information

More information

Round-off Errors and Computer Arithmetic - (1.2)

Round-off Errors and Computer Arithmetic - (1.2) Round-off Errors and Comuter Arithmetic - (.). Round-off Errors: Round-off errors is roduced when a calculator or comuter is used to erform real number calculations. That is because the arithmetic erformed

More information

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE

ON POLYNOMIAL SELECTION FOR THE GENERAL NUMBER FIELD SIEVE MATHEMATICS OF COMPUTATIO Volume 75, umber 256, October 26, Pages 237 247 S 25-5718(6)187-9 Article electronically ublished on June 28, 26 O POLYOMIAL SELECTIO FOR THE GEERAL UMBER FIELD SIEVE THORSTE

More information

Periodic scheduling 05/06/

Periodic scheduling 05/06/ Periodic scheduling T T or eriodic scheduling, the best that we can do is to design an algorithm which will always find a schedule if one exists. A scheduler is defined to be otimal iff it will find a

More information

On the Role of Finite Queues in Cooperative Cognitive Radio Networks with Energy Harvesting

On the Role of Finite Queues in Cooperative Cognitive Radio Networks with Energy Harvesting On the Role of Finite Queues in Cooerative Cognitive Radio Networks with Energy Harvesting Mohamed A. Abd-Elmagid, Tamer Elatt, and Karim G. Seddik Wireless Intelligent Networks Center (WINC), Nile University,

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

Statistics II Logistic Regression. So far... Two-way repeated measures ANOVA: an example. RM-ANOVA example: the data after log transform

Statistics II Logistic Regression. So far... Two-way repeated measures ANOVA: an example. RM-ANOVA example: the data after log transform Statistics II Logistic Regression Çağrı Çöltekin Exam date & time: June 21, 10:00 13:00 (The same day/time lanned at the beginning of the semester) University of Groningen, Det of Information Science May

More information

INTRODUCTION. Please write to us at if you have any comments or ideas. We love to hear from you.

INTRODUCTION. Please write to us at if you have any comments or ideas. We love to hear from you. Casio FX-570ES One-Page Wonder INTRODUCTION Welcome to the world of Casio s Natural Dislay scientific calculators. Our exeriences of working with eole have us understand more about obstacles eole face

More information

Outline. Markov Chains and Markov Models. Outline. Markov Chains. Markov Chains Definitions Huizhen Yu

Outline. Markov Chains and Markov Models. Outline. Markov Chains. Markov Chains Definitions Huizhen Yu and Markov Models Huizhen Yu janey.yu@cs.helsinki.fi Det. Comuter Science, Univ. of Helsinki Some Proerties of Probabilistic Models, Sring, 200 Huizhen Yu (U.H.) and Markov Models Jan. 2 / 32 Huizhen Yu

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

On Wald-Type Optimal Stopping for Brownian Motion

On Wald-Type Optimal Stopping for Brownian Motion J Al Probab Vol 34, No 1, 1997, (66-73) Prerint Ser No 1, 1994, Math Inst Aarhus On Wald-Tye Otimal Stoing for Brownian Motion S RAVRSN and PSKIR The solution is resented to all otimal stoing roblems of

More information

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm

On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm On Line Parameter Estimation of Electric Systems using the Bacterial Foraging Algorithm Gabriel Noriega, José Restreo, Víctor Guzmán, Maribel Giménez and José Aller Universidad Simón Bolívar Valle de Sartenejas,

More information

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test) Chater 225 Tests for Two Proortions in a Stratified Design (Cochran/Mantel-Haenszel Test) Introduction In a stratified design, the subects are selected from two or more strata which are formed from imortant

More information

Lecture 21: Quantum Communication

Lecture 21: Quantum Communication CS 880: Quantum Information Processing 0/6/00 Lecture : Quantum Communication Instructor: Dieter van Melkebeek Scribe: Mark Wellons Last lecture, we introduced the EPR airs which we will use in this lecture

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding

Outline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift

More information

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables

Improved Bounds on Bell Numbers and on Moments of Sums of Random Variables Imroved Bounds on Bell Numbers and on Moments of Sums of Random Variables Daniel Berend Tamir Tassa Abstract We rovide bounds for moments of sums of sequences of indeendent random variables. Concentrating

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deriving ndicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V. Deutsch Centre for Comutational Geostatistics Deartment of Civil &

More information

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018 5-45/65: Design & Analysis of Algorithms October 23, 208 Lecture #7: Prediction from Exert Advice last changed: October 25, 208 Prediction with Exert Advice Today we ll study the roblem of making redictions

More information

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points. Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction

GOOD MODELS FOR CUBIC SURFACES. 1. Introduction GOOD MODELS FOR CUBIC SURFACES ANDREAS-STEPHAN ELSENHANS Abstract. This article describes an algorithm for finding a model of a hyersurface with small coefficients. It is shown that the aroach works in

More information

Introduction to Probability and Statistics

Introduction to Probability and Statistics Introduction to Probability and Statistics Chater 8 Ammar M. Sarhan, asarhan@mathstat.dal.ca Deartment of Mathematics and Statistics, Dalhousie University Fall Semester 28 Chater 8 Tests of Hyotheses Based

More information

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract

A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS. 1. Abstract A CONCRETE EXAMPLE OF PRIME BEHAVIOR IN QUADRATIC FIELDS CASEY BRUCK 1. Abstract The goal of this aer is to rovide a concise way for undergraduate mathematics students to learn about how rime numbers behave

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Radial Basis Function Networks: Algorithms

Radial Basis Function Networks: Algorithms Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

FE FORMULATIONS FOR PLASTICITY

FE FORMULATIONS FOR PLASTICITY G These slides are designed based on the book: Finite Elements in Plasticity Theory and Practice, D.R.J. Owen and E. Hinton, 1970, Pineridge Press Ltd., Swansea, UK. 1 Course Content: A INTRODUCTION AND

More information

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms

New Schedulability Test Conditions for Non-preemptive Scheduling on Multiprocessor Platforms New Schedulability Test Conditions for Non-reemtive Scheduling on Multirocessor Platforms Technical Reort May 2008 Nan Guan 1, Wang Yi 2, Zonghua Gu 3 and Ge Yu 1 1 Northeastern University, Shenyang, China

More information

Bayesian Model Averaging Kriging Jize Zhang and Alexandros Taflanidis

Bayesian Model Averaging Kriging Jize Zhang and Alexandros Taflanidis HIPAD LAB: HIGH PERFORMANCE SYSTEMS LABORATORY DEPARTMENT OF CIVIL AND ENVIRONMENTAL ENGINEERING AND EARTH SCIENCES Bayesian Model Averaging Kriging Jize Zhang and Alexandros Taflanidis Why use metamodeling

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

Homework Set #3 Rates definitions, Channel Coding, Source-Channel coding

Homework Set #3 Rates definitions, Channel Coding, Source-Channel coding Homework Set # Rates definitions, Channel Coding, Source-Channel coding. Rates (a) Channels coding Rate: Assuming you are sending 4 different messages using usages of a channel. What is the rate (in bits

More information

Algorithms for Air Traffic Flow Management under Stochastic Environments

Algorithms for Air Traffic Flow Management under Stochastic Environments Algorithms for Air Traffic Flow Management under Stochastic Environments Arnab Nilim and Laurent El Ghaoui Abstract A major ortion of the delay in the Air Traffic Management Systems (ATMS) in US arises

More information

Elliptic Curves and Cryptography

Elliptic Curves and Cryptography Ellitic Curves and Crytograhy Background in Ellitic Curves We'll now turn to the fascinating theory of ellitic curves. For simlicity, we'll restrict our discussion to ellitic curves over Z, where is a

More information

7.2 Inference for comparing means of two populations where the samples are independent

7.2 Inference for comparing means of two populations where the samples are independent Objectives 7.2 Inference for comaring means of two oulations where the samles are indeendent Two-samle t significance test (we give three examles) Two-samle t confidence interval htt://onlinestatbook.com/2/tests_of_means/difference_means.ht

More information

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation

Paper C Exact Volume Balance Versus Exact Mass Balance in Compositional Reservoir Simulation Paer C Exact Volume Balance Versus Exact Mass Balance in Comositional Reservoir Simulation Submitted to Comutational Geosciences, December 2005. Exact Volume Balance Versus Exact Mass Balance in Comositional

More information

16.2. Infinite Series. Introduction. Prerequisites. Learning Outcomes

16.2. Infinite Series. Introduction. Prerequisites. Learning Outcomes Infinite Series 6. Introduction We extend the concet of a finite series, met in section, to the situation in which the number of terms increase without bound. We define what is meant by an infinite series

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information