
Six Lectures on Information Theory

John Kieffer

2 Prediction & Information Theory

Prediction is important in communications, control, forecasting, investment, and other areas. When the data model is known, it is understood how to do optimal prediction. When the data model is not known, one needs a prediction algorithm, called a universal prediction algorithm, that will perform well no matter what the data model turns out to be. These few pages represent a brief introduction to the interesting subject of universal prediction.

2.1 Shannon's Work in Prediction

Information theorists find prediction particularly interesting because of Claude Shannon's interest in this subject. In Section 2.1, we discuss Shannon's contributions to prediction.

2.1.1 Compression of English text via prediction

Shannon [16] performed an experiment in which a human subject was asked to try to predict the next letter in an English text, given all the previous letters. If the subject made a wrong prediction, the subject was allowed to make further guesses until the next letter was guessed correctly. The number of guesses needed to determine each letter was then recorded. Shannon reported one outcome of this experiment as follows:

T H E R E   I S   N O   R E V E R S E   O N   A   M O T O R C Y C L E   A   F R I E N D   O F   M I N E   F O U N D   T H I S   O U T   R A T H E R   D R A M A T I C A L L Y   T H E   O T H E R   D A Y

Beneath each letter of the text is given the number of guesses that were needed to guess that letter correctly. (The row of guess counts is the sequence (1) below.) Suppose the subject always makes predictions in the same way, based on what the subject has seen in the past. Then the text is completely recoverable from the sequence consisting of the numbers of guesses:

$$1, 1, 1, 5, 1, 1, 2, \ldots, 1, 1, 1, 1, 1 \tag{1}$$

For example, the subject looks at the first integer in this sequence and sees a "1". The subject then knows that the first letter of the text is the letter that the subject would make as his/her first guess as the first letter of any text. Presumably, the subject has a rule that tells him/her to always make the initial guess for the first letter of any text to be "T". The first letter of the text is therefore decoded correctly.

One can arrive at a data compression algorithm based on Shannon's idea. If $x_1, x_2, \ldots, x_{i-1}$ represents the first $i-1$ letters of any text, then you have to specify a permutation

$$\sigma(x_1, \ldots, x_{i-1}) = (\sigma_1, \sigma_2, \ldots, \sigma_{27}) \tag{2}$$

of the set of letters $\{a, b, c, \ldots, z, \mathrm{space}\}$ such that $\sigma_1$ is your first prediction (guess) for $x_i$, the $i$-th letter of the text, $\sigma_2$ is your second prediction (guess) for $x_i$, and so on. The family of all these permutations $\sigma(x_1, \ldots, x_{i-1})$, as $i$ varies and $x_1, x_2, \ldots, x_{i-1}$ vary, determines a data compression algorithm. You first represent your text $x_1, x_2, \ldots, x_N$ as the sequence of positive integers $j_1, j_2, \ldots, j_N$ in which $j_i = j$ if and only if the $j$-th term of the sequence $\sigma(x_1, \ldots, x_{i-1})$ in (2) is equal to $x_i$. Then, one compresses the sequence $j_1, j_2, \ldots, j_N$ using an adaptive arithmetic coder.
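
The permutation-plus-arithmetic-coding scheme just described is easy to prototype. The sketch below is a minimal illustration, assuming one simple concrete choice of the permutations $\sigma(x_1, \ldots, x_{i-1})$ (rank symbols by decreasing frequency in the prefix, ties broken by character order); it only produces the integer sequence $j_1, \ldots, j_N$ and its inverse, with the adaptive arithmetic coding stage left out.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # 27 symbols: the letters plus space

def guess_order(prefix):
    """One hypothetical choice of sigma(x_1,...,x_{i-1}): rank the symbols by
    decreasing frequency in the prefix, breaking ties by character order."""
    counts = {a: 0 for a in ALPHABET}
    for ch in prefix:
        counts[ch] += 1
    return sorted(ALPHABET, key=lambda a: (-counts[a], a))

def text_to_ranks(text):
    """Encoder: j_i is the 1-based position of x_i in sigma(x_1,...,x_{i-1})."""
    return [guess_order(text[:i]).index(ch) + 1 for i, ch in enumerate(text)]

def ranks_to_text(ranks):
    """Decoder: rebuild the same permutations step by step and invert the map."""
    text = ""
    for j in ranks:
        text += guess_order(text)[j - 1]
    return text

if __name__ == "__main__":
    msg = "there is no reverse on a motorcycle"
    ranks = text_to_ranks(msg)
    assert ranks_to_text(ranks) == msg        # the representation is lossless
    print(ranks)                              # mostly small integers, hence compressible
```

Because the guessing rule is deterministic and depends only on the prefix, the decoder rebuilds exactly the same permutation at every step, which is what makes the map invertible.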

P. Fenwick [5] has recently implemented a compression algorithm for English text of this type, with good results.

EXERCISE. For those of you that know how to do adaptive arithmetic coding, arithmetically encode the sequence (1), and thereby determine the number of codebits needed to represent Shannon's text above. (If you don't know how to do adaptive arithmetic coding, you can get an approximate upper bound on the arithmetic codeword length by computing $\log_2 J$, where $J$ is the product of the terms in the sequence (1).)

2.1.2 Mind-reading machines

At Bell Laboratories, David Hagelbarger [8] built a simple mind-reading machine whose purpose was to play the "penny matching" game. In this game, a player chooses head or tail, while a mind-reading machine tries to predict and match his/her choice. Hagelbarger's simple 8-state machine was able to match the "pennies" of its human opponent 5218 times over the course of 9795 plays. Hagelbarger's machine recorded and analyzed its human opponent's past choices, looking for patterns that would foretell the next choice; since it is almost impossible for a human to avoid falling into such patterns, Hagelbarger's machine could be successful more than 50% of the time. Inspired by Hagelbarger's success, Shannon then built a different 8-state mind-reading machine. In a duel between the two machines, Shannon's machine won. A description of Shannon's machine may be found in [17]. An account of Shannon's philosophy in building mind-reading machines and other game-playing machines may be found in [18].

2.2 Predictors for Known and Unknown Data Models

If one is dealing with a data sequence whose probabilistic model is known, then one can optimally predict the samples in the sequence. This is discussed in Section 2.2.1. If one does not know the data model, one must use a predictor that works well no matter what the data model is; that is, one must use a universal predictor. Section 2.2.2 pins down for us the notion of a universal predictor.

2.2.1 Optimum predictor for a known data model

Let $A$ be a finite alphabet. Let $X_1, X_2, X_3, \ldots$ be an infinite random data sequence in which each random data sample $X_i$ takes its value in $A$. A predictor applied to this sequence shall generate a sequence of predictions $\hat{X}_1, \hat{X}_2, \ldots$, in which $\hat{X}_i$ is the predicted value of $X_i$ for every $i$. The predictions must be nonanticipatory; that is, for each $i \ge 2$, $\hat{X}_i$ is determined by $X_1, X_2, \ldots, X_{i-1}$.

A predictor can be either deterministic or random. To define a deterministic predictor, you specify for each $i$ a function $f_i$ from $A^{i-1}$ into $A$. You then define the predictions by

$$\hat{X}_i = f_i(X_1, X_2, \ldots, X_{i-1}).$$

To define a random predictor, you specify a probability distribution $q_i(\cdot \mid x_1, x_2, \ldots, x_{i-1})$ on $A$ for each sequence of values $x_1, x_2, \ldots, x_{i-1}$ of $X_1, X_2, \ldots, X_{i-1}$, respectively. Then, the conditional distribution of $\hat{X}_i$ given $X_1 = x_1, X_2 = x_2, \ldots, X_{i-1} = x_{i-1}$ is taken to be $q_i(\cdot \mid x_1, \ldots, x_{i-1})$.
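
To make the distinction between the two kinds of predictors concrete, here is a small sketch for the binary alphabet $A = \{0, 1\}$; the particular rules used (predict the majority symbol seen so far, or randomize according to smoothed empirical frequencies) are illustrative choices, not ones prescribed by the notes.

```python
import random

A = (0, 1)   # binary alphabet

def f_i(past):
    """A deterministic predictor: a fixed function f_i of the past samples.
    This illustrative choice predicts the majority symbol seen so far."""
    return 1 if sum(past) > len(past) / 2 else 0

def q_i(past):
    """A random predictor: a probability distribution q_i(.|x_1,...,x_{i-1}) on A.
    This illustrative choice uses smoothed empirical frequencies of the past."""
    p1 = (sum(past) + 1) / (len(past) + 2)
    return {0: 1 - p1, 1: p1}

def predict(past, deterministic=True, rng=random):
    """Produce the prediction hat{X}_i from the past samples x_1,...,x_{i-1}."""
    if deterministic:
        return f_i(past)
    dist = q_i(past)
    return rng.choices(A, weights=[dist[a] for a in A])[0]

if __name__ == "__main__":
    past = [0, 1, 0, 0, 1]
    print(predict(past), predict(past, deterministic=False))
```
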
At this point, we need to make some technical comments. Having specified a predictor, one needs to be able to compute the prediction error probability $\Pr[\hat{X}_i \neq X_i]$ for each $i$. To accomplish this, we make the implicit assumption that the conditional probability

$$\Pr[\hat{X}_i = \hat{x}_i \mid X_1 = x_1, X_2 = x_2, \ldots, X_{i-1} = x_{i-1}] \tag{3}$$

coincides with the conditional probability

$$\Pr[\hat{X}_i = \hat{x}_i \mid X_1 = x_1, X_2 = x_2, \ldots, X_i = x_i].$$

This allows one to compute the joint conditional probability

$$\Pr[X_i = x_i, \hat{X}_i = \hat{x}_i \mid X_1 = x_1, X_2 = x_2, \ldots, X_{i-1} = x_{i-1}] \tag{4}$$

as the product of the conditional probability $\Pr[X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}]$ and the conditional probability in (3) (which comes from $q_i$). From the probabilities (4), one can compute the joint probability distribution of $X_i$ and $\hat{X}_i$, from which one determines $\Pr[\hat{X}_i \neq X_i]$.

The aim of predictor design is to design a predictor so that $\Pr[\hat{X}_i \neq X_i]$ is minimized for every $i$. Such a predictor is called an optimum predictor. If the probabilistic model of the random sequence $\{X_i\}$ is known (that is, the probability distribution of $(X_1, X_2, \ldots, X_i)$ is known for every $i$), then an optimum predictor is defined as follows:

$$\hat{X}_i = \arg\max_{a \in A} \Pr[X_i = a \mid X_1, X_2, \ldots, X_{i-1}].$$

Note that this optimum predictor is a deterministic predictor. We see that, in the known data model case, there is no need to consider random predictors.

EXAMPLE. Let $p(\cdot)$ be a probability distribution on $A$. Suppose the model for the data sequence $\{X_i\}$ is known to be the i.i.d. model in which each $X_i$ has the distribution $p(\cdot)$. Then the best predictor, as given by the formula above, can be defined in the following way. Pick $a \in A$ so that $p(a) \ge p(x)$ for every $x \in A$. Define

$$\hat{X}_i = a, \qquad i = 1, 2, 3, \ldots$$

2.2.2 Prediction for an unknown data model

If the model for the data sequence $\{X_i\}$ is not known, then the optimum predictor may not be known, since it may depend on the model. In this case, one can try to find a "universal predictor", a predictor that will make $\Pr[\hat{X}_i \neq X_i]$ asymptotically small as $i \to \infty$, no matter what the model may be.

In order to make precise the notion of a universal predictor, let us simplify matters by assuming that our random data sequence $\{X_i\}$ is a stationary sequence. Suppose we pick the best predictor in the sense that $\Pr[\hat{X}_i \neq X_i]$ is minimized for every $i$. Then it can be shown that the average prediction error probability

$$n^{-1} \sum_{i=1}^{n} \Pr[\hat{X}_i \neq X_i]$$

converges as $n \to \infty$ to a number we shall denote as $\pi(M)$, where $M$ is the probabilistic model of the sequence $\{X_i\}$. (You could take $M$ to be the family of finite-dimensional distributions of the random sequence $\{X_i\}$.)

We suppose now that the model $M$ of the data sequence $\{X_i\}$ is unknown, and lies in some known class $\mathcal{M}$ of stationary data models. (For example, $\mathcal{M}$ could be the class of i.i.d. models, or the class of stationary Markov models.) We say that a predictor is a universal predictor for the class $\mathcal{M}$ if

$$n^{-1} \sum_{i=1}^{n} \Pr[\hat{X}_i \neq X_i] \to \pi(M) \quad \text{as } n \to \infty \tag{5}$$

where the data sequence $\{X_i\}$ employed in (5) is allowed to have a model $M$ which can be any model in $\mathcal{M}$.

EXAMPLE. Suppose we have a binary alphabet ($A = \{0, 1\}$). Let $N_0(x_1, x_2, \ldots, x_{i-1})$ denote the number of zeroes in the binary sequence $(x_1, x_2, \ldots, x_{i-1})$ and let $N_1(x_1, x_2, \ldots, x_{i-1})$ denote the number of ones in this same sequence. Consider the following predictor:

$$\hat{X}_i = \begin{cases} 0, & N_0(X_1, \ldots, X_{i-1}) \ge N_1(X_1, \ldots, X_{i-1}) \\ 1, & \text{otherwise} \end{cases} \tag{6}$$

It is easy to show that this predictor is a universal predictor for the class of i.i.d. binary data models. (Hint: Let $\{X_i\}$ be an i.i.d. binary data sequence, and let $p = \Pr[X_1 = 0]$; it is easy to show that $\pi(M) = \min(p, 1-p)$, and using the law of large numbers, it is easily seen that the limit on the left side of (5) is also $\min(p, 1-p)$.)
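
As a quick numerical sanity check of this example (a sketch, not part of the original notes), the following simulates the predictor (6) on i.i.d. binary data and compares its empirical average error with $\min(p, 1-p)$:

```python
import random

def average_error(p, n, seed=0):
    """Empirical average prediction error of the predictor (6) over n i.i.d.
    binary samples with Pr[X = 0] = p."""
    rng = random.Random(seed)
    n0 = n1 = errors = 0
    for _ in range(n):
        prediction = 0 if n0 >= n1 else 1       # the rule (6)
        x = 0 if rng.random() < p else 1
        errors += (prediction != x)
        if x == 0:
            n0 += 1
        else:
            n1 += 1
    return errors / n

if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        # for large n the empirical average error should be close to min(p, 1 - p)
        print(p, round(average_error(p, 100_000), 4), min(p, 1 - p))
```
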

No matter what the class of stationary data models $\mathcal{M}$ may be, there will exist at least one universal predictor for $\mathcal{M}$. Of these universal predictors, which are the best? One should look for universal predictors for the class $\mathcal{M}$ for which the speed of convergence to the limit in (5) is fast rather than slow. For example, Merhav et al. [11] have shown that for the binary i.i.d. model class and for the universal predictor in (6), the convergence in (5) takes place with speed $O(1/n)$ as $n \to \infty$. In addition, they obtain similar results for a universal predictor for the class of binary Markov models. Results of this type for more general model classes are needed.

2.3 Construction of Universal Predictors

We discuss two information-theoretic methods for constructing universal predictors. In Section 2.3.1, we show how to construct universal predictors from arithmetic encoders. In Section 2.3.2, we discuss the MDL method for constructing universal predictors.

2.3.1 Universal predictors from arithmetic encoders

Prediction and data compression are connected in the sense that good data compressors can be used to construct good predictors (and vice versa). Data compression can be accomplished using what are called arithmetic encoders. From an arithmetic encoder that works well, we show in this section how to construct a universal predictor. The reader need not know anything about the mechanics of arithmetic encoding in order to understand this section.

Let $\{q_i : i = 1, 2, \ldots\}$ denote a family of conditional probability distributions in which, for each $i$ and each $x_1, x_2, \ldots, x_{i-1}$ from the data alphabet $A$, $q_i(\cdot \mid x_1, x_2, \ldots, x_{i-1})$ is a probability distribution on $A$. The family $\{q_i\}$ defines an arithmetic encoder which encodes the first $n$ samples $X_1, X_2, \ldots, X_n$ of any random data sequence $\{X_i : i = 1, 2, \ldots\}$ using roughly

$$\sum_{i=1}^{n} -\log q_i(X_i \mid X_1, X_2, \ldots, X_{i-1}) \tag{7}$$

codebits, where the logarithm throughout is to base two. (The actual number of codebits differs from the quantity in (7) by at most 2.) The distributions $\{q_i\}$ used to define an arithmetic encoder shall be called the coding distributions for the arithmetic encoder.

Suppose the model of the data sequence $\{X_i\}$ is known. What is the optimum arithmetic encoder for this data sequence? For each $i$, let $p_i$ denote the conditional distribution of $X_i$ given $X_1, X_2, \ldots, X_{i-1}$. In other words,

$$p_i(x_i \mid x_1, x_2, \ldots, x_{i-1}) = \Pr[X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}].$$

It can be shown that the optimal arithmetic encoder for the data sequence $\{X_i\}$ is the arithmetic encoder whose coding distributions $\{q_i\}$ satisfy $q_i = p_i$ for every $i$.

We suppose now that the model $M$ of the data sequence $\{X_i\}$ is stationary. Let $H(M)$ (called the entropy of the model $M$) be the number defined by

$$H(M) = \lim_{n \to \infty} n^{-1} E\left[ \sum_{i=1}^{n} -\log p_i(X_i \mid X_1, \ldots, X_{i-1}) \right].$$

Let $\mathcal{M}$ be a family of stationary models. An arithmetic encoder with coding distributions $\{q_i\}$ is called a universal arithmetic encoder for $\mathcal{M}$ if

$$n^{-1} E\left[ \sum_{i=1}^{n} -\log q_i(X_i \mid X_1, X_2, \ldots, X_{i-1}) \right] \to H(M) \quad \text{as } n \to \infty$$

for all models $M$ in $\mathcal{M}$. There exist universal arithmetic encoders for the family of all stationary data models.
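
The quantity (7) is easy to compute once the coding distributions are given. The sketch below does exactly that; the distribution `kt_binary` used in the demo is the binary i.i.d. example given next (a Krichevsky-Trofimov-style estimator).

```python
import math

def ideal_codelength(x, q):
    """The ideal codelength (7): the sum of -log2 q_i(x_i | x_1,...,x_{i-1}).
    `q(past, a)` must return the coding probability of symbol a given the past;
    the actual arithmetic codeword is at most 2 bits longer than this total."""
    return sum(-math.log2(q(x[:i], a)) for i, a in enumerate(x))

def kt_binary(past, a):
    """Coding distribution of the binary i.i.d. example below (a Krichevsky-
    Trofimov-style estimator): q_i(a | past) = (N_a(past) + 1/2) / i."""
    return (past.count(a) + 0.5) / (len(past) + 1)

if __name__ == "__main__":
    x = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
    print(ideal_codelength(x, kt_binary), "bits for", len(x), "samples")
```
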

EXAMPLE. The following is known to define a universal arithmetic encoder for the family of all binary i.i.d. data models:

$$q_i(a \mid x_1, \ldots, x_{i-1}) = \begin{cases} \dfrac{N_0(x_1, \ldots, x_{i-1}) + 1/2}{i}, & a = 0 \\[1ex] \dfrac{N_1(x_1, \ldots, x_{i-1}) + 1/2}{i}, & a = 1 \end{cases}$$

The following result shows how a universal predictor can be obtained from a universal arithmetic encoder.

Theorem 1. Let $\mathcal{M}$ be any family of stationary data models. Let $\{q_i\}$ be the family of coding distributions for a universal arithmetic encoder for $\mathcal{M}$. For each $i$, let $\mu_i$ be the conditional probability distribution in which

$$\mu_i(a \mid x_1, \ldots, x_{i-1}) = (i-1)^{-1} \sum_{j=1}^{i-1} q_j(a \mid x_{i-j}, \ldots, x_{i-1}).$$

Consider the predictor defined by

$$\hat{X}_i = \arg\max_{a \in A} \mu_i(a \mid X_1, \ldots, X_{i-1}).$$

Then this predictor is a universal predictor for the class $\mathcal{M}$.

EXAMPLE. Ryabko [14] [15] discovered a universal predictor for the class of stationary data models which he constructed from an arithmetic encoder. We shall not present Ryabko's predictor here, because it is too complicated. However, we do present the results of Ryabko's case study, in which he compared the performance of his universal predictor to the performance of the well-known Laplace predictor. Ryabko's performance criterion was to see how well a predictor performs as a betting strategy in betting on the outcome of the World Cup championship games from 1950 through 1990. First, the performance of Laplace's predictor was tested. Laplace's predictor is a random predictor which yields predictions $\{\hat{X}_i\}$ for a binary sequence $\{X_i\}$ as follows:

$$\Pr[\hat{X}_i = a \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}] = \begin{cases} \dfrac{N_0(x_1, \ldots, x_{i-1}) + 1}{i + 1}, & a = 0 \\[1ex] \dfrac{N_1(x_1, \ldots, x_{i-1}) + 1}{i + 1}, & a = 1 \end{cases}$$

In each World Cup year, a prediction was made as to whether the winner would be a European team (E) or a non-European team (N). Bets were made by spreading the gambler's current capital according to the probabilities of the two possible predictions. An initial capital of $1000 was assumed. The unrealistic assumption was made that the betting odds were equal in each year.

Table 1: Performance of Laplace's Predictor

Year     1950  1954  1958  1962  1966  1970  1974  1978  1982  1986  1990
Winner    N     E     N     N     E     N     E     N     E     N     E
Pr[E]    1/2   1/3   1/2   2/5   1/3   3/7   3/8   4/9   2/5   5/11  5/12
Pr[N]    1/2   2/3   1/2   3/5   2/3   4/7   5/8   5/9   3/5   6/11  7/12
Capital  1000  667   667   800   533   610   457   508   406   443   369

The amount of capital after the termination of each year's World Cup is listed. The results show that the Laplace predictor does not yield a good betting strategy, since the capital dwindled from $1000 to $369 over the course of the 11 World Cups.

Ryabko's predictor gave the following results when applied to World Cup betting:

Table 2: Performance of Ryabko's Predictor

Year     1950  1954  1958  1962  1966  1970  1974  1978  1982  1986  1990
Winner    N     E     N     N     E     N     E     N     E     N     E
[Probability and capital entries omitted.]
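
The Laplace capital trajectory in Table 1 can be reproduced directly from the rule stated above. The sketch below interprets "equal betting odds" as a 2-for-1 payoff on either outcome, so that spreading the capital in proportion to the predicted probabilities multiplies it by $2\Pr[\text{actual winner}]$ each year; with the probabilities of Table 1 this ends at roughly $369.

```python
from fractions import Fraction

def laplace_prob(history, outcome):
    """Laplace predictor: Pr[next = outcome] = (count(outcome) + 1) / (i + 1),
    where i - 1 = len(history) is the number of past outcomes."""
    return Fraction(history.count(outcome) + 1, len(history) + 2)

def run_betting(winners, initial=1000):
    """Spread the current capital over E and N in proportion to the Laplace
    probabilities; at equal (2-for-1) odds the capital is multiplied each year
    by 2 * Pr[actual winner]."""
    capital, history = Fraction(initial), []
    for w in winners:
        capital *= 2 * laplace_prob(history, w)
        history.append(w)
        print(w, round(float(capital)))
    return capital

if __name__ == "__main__":
    winners = "N E N N E N E N E N E".split()          # World Cup winners, 1950-1990
    print("final capital:", float(run_betting(winners)))  # about $369, as in Table 1
```
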

Note that Ryabko's predictor greatly outperformed Laplace's predictor in this case study. Using Ryabko's predictor to govern the betting strategy, the initial capital increased from $1000 to $1920!

2.3.2 MDL-based universal predictor design

In information theory, "MDL" stands for "minimum description length". The minimum description length criterion, developed largely by J. Rissanen, selects a data model from a parametric family of data models so that the total number of codebits needed to encode the model parameter together with the observed data is minimized. The selected model can then be used to construct a predictor for the next data sample. The predictor selected by this MDL-based technique can be a universal predictor in some cases. We now furnish the details of the MDL method.

Let $\Theta$ be a parameter space. Let $\{M_\theta : \theta \in \Theta\}$ be a family of data models indexed by $\theta$. Each model $M_\theta$ is a sequence of probability distributions $\{p_\theta(x^n) : n = 1, 2, \ldots\}$, where each $p_\theta(x^n)$ is the distribution of the vector $x^n$ consisting of the first $n$ random data samples from the finite alphabet $A$, assuming that the true value of the model parameter is $\theta$.

One partitions the parameter space $\Theta$ into finitely many subsets, and selects a parameter value from each subset in the partition. The set of selected parameters forms a finite subset $\hat\Theta$ of the parameter space. Let $\theta \to \hat\theta$ be the mapping which maps each parameter $\theta$ in $\Theta$ into the parameter $\hat\theta$ from $\hat\Theta$ which lies in the same subset of the partition of $\Theta$ as does $\theta$. The mapping $\theta \to \hat\theta$ quantizes the parameters in the parameter space. Next, one selects a binary codeword of length $L(\hat\theta)$ to represent each parameter $\hat\theta$ in $\hat\Theta$.

Suppose now that the first $i-1$ data samples $x^{i-1}$ have been observed, and it is desired to form the prediction $\hat{x}_i$ of the next sample. To do this, one first selects $\theta$ so that

$$\theta = \arg\min_{\theta \in \Theta} \left[ L(\hat\theta) - \log p_{\hat\theta}(x^{i-1}) \right].$$

Then, one assumes that the true data model is the model $M_\theta$, and forms the prediction $\hat{x}_i$ based on this model:

$$\hat{x}_i = \arg\max_{a \in A} p_\theta(a \mid x^{i-1})$$

where the conditional probability $p_\theta(a \mid x^{i-1})$ is computed as the ratio

$$p_\theta(a \mid x^{i-1}) = p_\theta(x^{i-1}, a) / p_\theta(x^{i-1}).$$

The reader can consult the works [12] [13] to see instances in which the predictor defined as above turns out to be a universal predictor for the family of models $\{M_\theta\}$.

2.4 Nonparametric Universal Predictors

In nonparametric universal prediction theory, one wishes to predict as well as possible the terms in completely arbitrary data sequences, not governed by any probabilistic model. Universal predictors can be found in this nonparametric context, but they must be random predictors. One can draw the following moral from this: for a random data model, known or unknown, one can always choose a good deterministic predictor, provided that the model is sufficiently "smooth" (such as a stationary model); for a completely arbitrary data model, to get a good predictor, you must choose a nondeterministic predictor.

In order to describe nonparametric prediction theory, we need the concept of a finite-state predictor. For a given deterministic predictor, suppose there is a finite set $S$ and two functions $f : S \times A \to A$ and $g : S \times A \to S$ such that the predictions $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n$ arising from applying the predictor to an arbitrary finite-length sequence $x_1, x_2, \ldots, x_n$ are generated by the formulae

$$\hat{x}_i = f(s_{i-1}, x_{i-1}), \qquad s_i = g(s_{i-1}, x_{i-1}),$$

where $s_0$ is some fixed element of $S$. Then the predictor is said to be a finite-state predictor.
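
As a concrete instance of this definition (an illustrative sketch; the particular two-state machine is not taken from the notes), here is a finite-state predictor over $A = \{0, 1\}$ whose state remembers the previous sample and whose prediction repeats it:

```python
def run_fsp(x, f, g, s0, x0=0):
    """Finite-state predictor: hat{x}_i = f(s_{i-1}, x_{i-1}), s_i = g(s_{i-1}, x_{i-1}),
    with fixed initial state s0 (x0 stands in for the unobserved symbol x_0)."""
    predictions, s, prev = [], s0, x0
    for xi in x:
        predictions.append(f(s, prev))
        s, prev = g(s, prev), xi
    return predictions

# An illustrative two-state predictor over A = {0, 1}: the state stores the most
# recent input symbol, and the prediction simply repeats that symbol.
f = lambda s, a: a     # predict the previous symbol
g = lambda s, a: a     # next state = previous symbol

if __name__ == "__main__":
    x = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
    print(run_fsp(x, f, g, s0=0))
```
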

We are now ready to pose a general nonparametric prediction design problem. Let $x^n$ denote an arbitrary binary sequence $(x_1, x_2, \ldots, x_n)$ of length $n$. If some prediction rule applied to $x^n$ yields the sequence of predictions $\hat{x}^n = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n)$, then the average prediction error $d_n(x^n, \hat{x}^n)$, which we must control, is defined by

$$d_n(x^n, \hat{x}^n) = n^{-1} \sum_{i=1}^{n} d(x_i, \hat{x}_i) \tag{8}$$

where $d(x, \hat{x})$ is the Hamming distance function given by

$$d(x, \hat{x}) = \begin{cases} 0, & x = \hat{x} \\ 1, & x \neq \hat{x}. \end{cases}$$

Let $\Lambda$ be a finite set of finite-state predictors. For each predictor $\lambda \in \Lambda$, let $d_\lambda(x^n)$ denote the average prediction error arising when $\lambda$ is applied to the sequence $x^n$. Can we design a random predictor so that, if $\hat{x}^n$ is the sequence of random predictions resulting from applying the predictor to $x^n$, then

$$\max_{x^n} \left| E[d_n(x^n, \hat{x}^n)] - \min_{\lambda \in \Lambda} d_\lambda(x^n) \right| \to 0 \quad \text{as } n \to \infty?$$

The answer is yes!

The design of the random predictor asked for in the preceding is somewhat tricky. We content ourselves here with an elaboration of the form of this universal random predictor in a special case. Let the data alphabet be binary. We consider two elementary deterministic prediction rules: Rule 0 is the prediction rule in which every prediction is taken to be 0, and Rule 1 is the prediction rule in which every prediction is taken to be 1. For Rule 0 applied to $x^n$, the average prediction error is $N_1(x^n)/n$, and for Rule 1, the average prediction error is $N_0(x^n)/n$. There must be a random predictor for which

$$\max_{x^n} \left| E[d_n(x^n, \hat{x}^n)] - \min(N_0(x^n)/n, N_1(x^n)/n) \right| \to 0 \quad \text{as } n \to \infty.$$

What form does this predictor take? Letting $q_i(\cdot \mid x_1, x_2, \ldots, x_{i-1})$ denote the distribution of $\hat{x}_i$ as determined by the past data samples $x_1, x_2, \ldots, x_{i-1}$, this universal predictor takes the form

$$q_i(1 \mid x_1, \ldots, x_{i-1}) = \begin{cases} 0, & N_0(x_1, \ldots, x_{i-1}) > (1/2 + \epsilon_i)(i-1) \\ 1, & N_0(x_1, \ldots, x_{i-1}) < (1/2 - \epsilon_i)(i-1) \\ \dfrac{1/2 + \epsilon_i - N_0(x_1, \ldots, x_{i-1})/(i-1)}{2\epsilon_i}, & \text{otherwise} \end{cases}$$

where the numbers $\epsilon_i$ are positive numbers which $\to 0$ as $i \to \infty$. In other words, if the fraction of ones seen in the past lies within the bounds $1/2 \pm \epsilon_i$, a random estimate is made for $x_i$; otherwise, a deterministic estimate is taken.

For a treatment of nonparametric universal prediction theory, the reader is invited to consult the references [7], [2], [3], and [9].
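
The randomized rule $q_i(1 \mid x_1, \ldots, x_{i-1})$ displayed above translates directly into code. In the sketch below, the choice $\epsilon_i = i^{-1/3}$ is just one admissible sequence tending to 0, and the uniform guess at $i = 1$ is a convention for the empty past:

```python
import random

def q1(past, eps):
    """Probability that the randomized predictor guesses hat{x}_i = 1, given the
    past samples and the slack eps = eps_i (the rule displayed above)."""
    i, n0 = len(past) + 1, past.count(0)
    if n0 > (0.5 + eps) * (i - 1):
        return 0.0
    if n0 < (0.5 - eps) * (i - 1):
        return 1.0
    if i == 1:
        return 0.5                         # empty past: guess uniformly
    return (0.5 + eps - n0 / (i - 1)) / (2 * eps)

def predict(past, rng=random):
    eps = (len(past) + 1) ** (-1 / 3)      # one admissible choice of eps_i -> 0
    return 1 if rng.random() < q1(past, eps) else 0

if __name__ == "__main__":
    past = [0, 1, 1, 0, 1, 0, 1, 1]
    print([predict(past) for _ in range(10)])   # random estimates for a near-balanced past
```
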

2.5 String-Matching Predictors

In the examples of universal predictors presented so far, prediction of the value of a sample is based upon frequencies of patterns appearing in the sequence of past samples. A string-matching type predictor operates in a simpler manner: one simply looks in the past to find one pattern that agrees with a pattern formed by the most recently observed samples, taking the prediction to be the sample appearing right after the pattern found. We illustrate one example of a string-matching type predictor. Suppose the samples that have been observed are

$$x_1, x_2, \ldots, x_{i-1}. \tag{9}$$

The prediction $\hat{x}_i$ is made by following these three steps:

Step 1: Find the longest pattern appearing at the end of the sequence (9) which has a matching pattern in the past.

Step 2: Identify the first place where a matching pattern is found as one moves back into the past, and construct the reduced sequence obtained by deleting all symbols from (9) appearing to the left of the position of this matching pattern.

Step 3: Take $\hat{x}_i$ to be the sample following the pattern at the beginning of the reduced sequence.

EXAMPLE. Let the past samples be

1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0

and let us predict the next sample. Step 1 identifies the longest matching pattern to be "0, 0, 1, 0". Step 2 gives us the reduced sequence

0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0

Step 3 tells us to take our predicted value to be "1", because this value follows the pattern "0, 0, 1, 0" at the beginning of the reduced sequence above.

The predictor defined above can be shown to be a universal predictor for the class of all i.i.d. data models with alphabet $A$. It is an open problem to find the largest class of stationary data models for which this predictor is a universal predictor.
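
A minimal implementation of the three steps (a sketch; it tries every suffix length by brute force, which is fine for short sequences such as the one in the example):

```python
def string_match_predict(past):
    """String-matching predictor: find the longest suffix of `past` that also occurs
    earlier (Step 1), take its most recent earlier occurrence (Step 2), and predict
    the symbol immediately following that occurrence (Step 3)."""
    n = len(past)
    for length in range(n - 1, 0, -1):                # longest suffix first
        suffix = past[n - length:]
        for start in range(n - length - 1, -1, -1):   # first match moving back in time
            if past[start:start + length] == suffix:
                return past[start + length]
    return past[-1] if past else 0                    # fallback: no match anywhere

if __name__ == "__main__":
    past = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
    print(string_match_predict(past))                 # prints 1, as in the example
```
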

References

[1] P. Algoet, "Universal schemes for prediction, gambling and portfolio selection," Ann. Prob., vol. 20.

[2] D. Blackwell, "Controlled random walk," in Proceedings of the 1954 Congress of Mathematicians, vol. III, Amsterdam: North Holland.

[3] T. Cover, "Behavior of sequential predictors of binary sequences," in Proc. 4th Prague Conference on Information Theory, Stat. Dec. Functions, and Random Processes, Prague.

[4] T. Cover, "Universal portfolios," Math. Finance, vol. 1, pp. 1-29.

[5] P. Fenwick, Technical Report, Dept. of Computer Science, University of Auckland, New Zealand.

[6] M. Feder, "Gambling using a finite-state machine," IEEE Trans. Inform. Theory, vol. 37.

[7] M. Feder, N. Merhav, and M. Gutman, "Universal prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 38.

[8] D. Hagelbarger, "SEER, A SEquence Extrapolating Robot," I.R.E. Trans. Electronic Computers, vol. 5, pp. 1-7.

[9] F. Hannan, "Approximation to Bayes risk in repeated plays," in Contributions to the Theory of Games, vol. III, Annals of Mathematics Studies No. 39, Princeton.

[10] J. Kelly, "A new interpretation of information rate," Bell Sys. Tech. J., vol. 35.

[11] N. Merhav, M. Feder, and M. Gutman, "Some properties of sequential predictors for binary Markov sources," IEEE Trans. Inform. Theory, vol. 39.

[12] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inform. Theory, vol. 30.

[13] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

[14] B. Ryabko, "Prediction of random sequences and universal coding," Problemy Peredachi Informatsii, vol. 24, pp. 3-14.

[15] B. Ryabko, "The complexity and effectiveness of prediction algorithms," Journal of Complexity, vol. 10.

[16] C. Shannon, "Prediction and entropy of printed English," Bell System Tech. J., vol. 30, pp. 50-64.

[17] C. Shannon, "A mind-reading machine," Bell Laboratories memorandum. (Reprinted in the Collected Papers of Claude Elwood Shannon, IEEE Press.)

[18] C. Shannon, "Game Playing Machines," Journal of the Franklin Institute, vol. 260.
