Information Theory: Entropy, Markov Chains, and Huffman Coding

The University of Notre Dame A senior thesis submitted to the Department of Mathematics and the Glynn Family Honors Program Information Theory: Entropy, Markov Chains, and Huffman Coding Patrick LeBlanc Approved: Professor Liviu Nicolaescu

Contents Notation and convention 2. Introduction 3 2. Entropy: basic concepts and properties 3 2.. Entropy 3 2.2. Joint Entropy and Conditional Entropy 5 2.3. Relative Entropy and Mutual Information 6 2.4. Chain Rules for Entropy, Relative Entropy, and Mutual Information 8 2.5. Jensen s Inequality and its Consequences 9 2.6. Log Sum Inequality and its Applications 2 3. The Asymptotic Equipartition Property for Sequences of i.i.d. Random Variables 3 3.. Asymptotic Equipartition Property Theorem 3 3.2. Consequences of the AEP: Data Compression 5 3.3. High-probability Sets and the Typical Set 6 4. Asymptotic Equipartition Property for Markov Chains 7 4.. Background Information 7 4.2. Entropy of Markov Chains 9 4.3. Asymptotic Equipartition 20 5. Coding and Data Compression 23 5.. Examples of Codes 23 5.2. Kraft Inequality 25 5.3. Optimal Codes 25 5.4. Bounds on the Optimal Code Length 27 5.5. Kraft Inequality for Uniquely Decodable Codes 29 5.6. Huffman Codes 30 5.7. Huffman Coding and the Twenty-Question Game 33 5.8. Generation of Discrete Distributions from Fair Coins 34 References 36

2 Notation and convention We will denote by S the cardinality of a finite set S. We denote by N the set of natural numbers, N = {, 2,... } and we set N 0 := N {0}. We will use capital letters to denote random variables. We will use the symbol E when referring to expectation and the symbol P when referring to probability For any set S, contained in some ambient space X, we denote by I S the indicator function of S {, x S, I S : X {0, }, I S (x) = 0, x S. For a discrete random variable X we will use the notation X p to indicate thet p is the probability mass function (pmf) of X. We will often refer to the range of a discrete random variable as its alphabet.

. Introduction Information theory is the study of how information is quantified, stored, and communicated. If we attempt to quantify, store, or communicate information via electronic means then we inevitably run into the issue of encoding that is, how to store words, letters, and numbers as strings of binary digits. However, in doing so we might encounter a trade off between efficiency and reliability. To illustrate this trade-off consider an example: sending a binary bit over a noisy channel. That is, whenever we send a bit there is a chance p that the bit we send will flip: e.g., if we send a there is a p chance that the receiver will receive a 0. The most efficient way to send a message say the digit for sake of argument would be to send a single ; however, there is a possibly significant chance p that our message will be lost. We could make our message more reliable by sending or instead, but this vastly decreases the efficiency of the message. Claude Shannon attacked this problem, and incidentally established the entire discipline of information theory, in his groundbreaking 948 paper A Mathematical Theory of Communication. But what does information mean here? Abstractly, we can think of it as the resolving of uncertainty in a system. Before an event occurs we are unsure which of many possible outcomes could be realized; the specific realization of one of these outcomes resolves the uncertainty associated with the event. In this paper we first formalize a a measure of information: entropy. In doing so, we explore various related measures such as conditional or joint entropy, and several foundational inequalities such as Jensen s Inequality and the log-sum inequality. The first major result we cover is the Asymptotic Equipartition Property, which states the outcomes of a stochastic process can be divided into two groups: one of small cardinality and high probability and one of large cardinality and low probability. The high probability sets, known as typical sets, offer hope of escaping the trade-off described above by assigning the most efficient code to the sets with the highest-probability. We then examine similar results for Markov Chains, which are important because important processes, e.g. English language communication, can be modeled as Markov Chains. Having examined Markov Chains, we then examine how to optimally encode messages and examine some useful applications. 3 2. Entropy: basic concepts and properties 2.. Entropy. There are two natural ways of measuring the information conveyed in a statement: meaning and surprise. Of these meaning is perhaps the more intuitive after all, the primary means of conveying information is to convey meaning. Measuring meaning, however, is at least as much philosophy as mathematics and difficult to do in a systemic way. This leaves us with surprise as a measure of information. We can relate the amount of surprise in a statement to how probable that statement is; the more probable a statement the less surprising, and the less probable the more surprising. Consider the three statements below: The dog barked yesterday. The dog saved a child drowning in a well yesterday. QQXXJLOOCXQXIKMXQQLMQXQ The second statement ought to be more surprising than the first. Dogs bark every day; that one should do so at some point over the course of a day is quite likely and so unsurprising. On the other hand dogs rarely save drowning children, so the occurrence of such an event is both unlikely and surprising. So far so clear. But how might the first two statements compare with the third statement? The third statement appears to be total gibberish, and so we might be tempted to disregard it entirely. However, in another sense it is actually quite surprising. The letters Q and X do not often come up in English words, so the fact that a message was conveyed with so many Q s and X s ought to be a surprise. But how to measure this surprise?

4 Entropy is one such measure. A formal definition is offered below. We use the convention that 0 log(0) = 0 based on the well known fact lim x log x = 0. x 0 Definition 2. (Entropy). Fix b > 0 and let p be a probability distribution on a finite or countable set X. We define its entropy (in base b) to be the non-negative number H b (p) := x χ p(x) log b (p(x)). (2.) Let X be a discrete random variable with range (or alphabet) contained in a finite or countable set X. We define then entropy of X to be the entropy of p X, the probability mass function of X H b (X) = H b (p X ) = x X p X (x) log b p X (x). (2.2) Remark 2.2. For the rest of this paper, we will use the simpler notation log when referring to log 2. Also we will write H(X) instead of H 2 (X). As the logarithm is taken base 2, the entropy is expressed in terms of bits of information. The entropy of a random variable X does not depend on the values taken by X but rather on the probability that X takes on its values. The law of subconscious statistician shows that if X p(x), the expected value of the random variable g(x) is E p [ g(x) ] = x χ g(x)p(x). (2.3) We can relate the expected value of a transformation of p(x) to the entropy associated with the random variable X. Proposition 2.3. Let X be a discrete random variable with range (or alphabet) contained in a finite or countable set X with probability mass function p : X R. Then, [ ] H(X) = E p log. (2.4) p(x) Proof. E p (log( p(x) ) = p(x) log p(x) = p(x) log p(x) = H(X). x χ x χ Let us return to the idea of entropy as the measure of surprise in the realization of a random variable. For any outcome x X we can think of log p(x) as a how surprised we ought to be at observing x; note that this agrees with our intuitive conception of surprise, for the smaller the probability p(x) the larger the surprise upon its observation. The entropy of p is then the amount of surprise we expect to observe when sampling X according to p. More properly stated, then, entropy is not a measure of surprise upon the realization of a random variable, but the amount of uncertainty present in the probability distribution of said random variable. Obviously H(X) 0. (2.5) From the equality log b (x) = log b (a) log a (x) we deduce immediately that H b (x) = log b (a)h a (x). (2.6)

2.2. Joint Entropy and Conditional Entropy. Thus far we have presented entropy only as a measure of uncertainty of the probability distribution of one random variable. For practical applications it will be useful to have a measure of uncertainty of the joint probability distribution of a sequence of random variables. This leads us to define the notions of joint and conditional entropy. We use the latter idea to derive the Chain Rule for Entropy, Theorem 2.6, which gives us another way to calculate the joint entropy of two or more random variables. Definition 2.4 (Joint Entropy). Consider a pair of discrete random variables (X, Y ) with finite or countable ranges X and Y, and joint probability mass function p(x, y). We view the pair (X, Y ) as a discrete random variable with alphabet X Y. The joint entropy H(X, Y ) of X, Y is the entropy of the random variable (X, Y ). More precisely entropy is H(X, Y ) = x X y Y p(x, y) log(p(x, y)) = E p [ log 5 ]. (2.7) p(x, y) Having defined joint entropy, it is natural to define the conditional entropy of two probability distributions. Definition 2.5 (Conditional Entropy). Let (X, Y ) be a pair of discrete random variables with finite or countable ranges X and Y respectively, joint probability mass function p(x, y), and individual probability mass functions p X (x) and p Y (y). Then the conditional entropy of Y given X, denoted by H(Y X), is defined as H(Y X) := x X p X (x)h(y X = x) = p X (x) p Y (y x) log(p Y (y x)) x X y Y = p(x, y) log(p Y (y x)) x X y Y = E p [ log(p(y X))]. Unsurprisingly, the joint and conditional entropies are intimately related to each other. Theorem 2.6 (The Chain Rule for Entropy). Let (X, Y ) be a pair of discrete random variables with finite or countable ranges X and Y respectively, joint probability mass function p(x, y), and individual probability mass functions p X (x) and p Y (y). Then (2.8) H(X, Y ) = H(X) + H(Y X). (2.9) Proof. We follow the approach in [5, p.7]. H(X, Y ) = p(x, y) log(p(x, y)) x X y Y = p(x, y) log(p X (x)p Y (y x)) x X y Y = p(x, y) log(p X (x)) p(x, y) log(p Y (y x)) x X y Y x X y Y = p X (x) log(p X (x)) p(x, y) log(p Y (y x)) x X x X y Y = H(X) + H(Y X)

6 NB: By symmetry, we also have that H(X, Y ) = H(Y ) + H(X Y ). Corollary 2.7. let (X, Y, Z) be discrete random variables with finite or countable ranges X, Y, and Z respectively. Then H(X, Y Z) = H(X Z) + H(Y X, Z). (2.0) Proof. Follows the proof given of the theorem. Remark 2.8. In general, H(Y X) H(X Y ). However, we do have that Indeed, This will be of use later. H(X) H(X Y ) = H(Y ) H(Y X). H(X) + H(Y X) = H(X, Y ) = H(Y ) + H(X Y ). 2.3. Relative Entropy and Mutual Information. Relative entropy measures the divergence between two distributions of random variables: the inefficiency of assuming that a distribution is q when in actuality it is p. If two distributions are not divergent, then we can broadly expect them to behave in the same manner; if they are divergent then they can behave in such a different manner that we cannot tell much about one distribution given the other. Definition 2.9 (Relative Entropy). The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) defined on a finite or countable space X is D(p q) = x X p(x) log p(x) q(x). (2.) If X is a random variable with alphabet X and probability mass function p, [ D(p q) = E p log p(x) ]. (2.2) q(x) We use the convention that 0 log 0 0 = 0, 0 log 0 q = 0, and that p log p 0 = based on continuity arguments. Thus if there is an x such that q(x) = 0 and p(x) > 0, then D(p q) =. Remark 2.0. We think of D(p q) as measuring the distance between two probability distributions; however, it is not quite a metric. While the Information Inequality (Theorem 2.20) will show that it is non-negative and zero if and only if p = q, it can in general fail the other criteria. Nevertheless, it is useful to think of it as measuring the distance between two distributions. We present one intuitive way to think of relative entropy. Let the events A, A 2,..., A n be the possible mutually exclusive outcomes of an experiment. Assume that q(x) consists of the probabilities of these outcomes, while p(x) consists of the probabilities of these outcomes under different circumstances. Then ( log log ) q k p k (2.3) denotes the chance in the unexpectedness of A k due to the different experimental circumstances. D(p q) is equal to the expected value of this change. The interpretation of relative entropy presented above implies that two probability distributions may be related so that knowing one distribution gives us some information about the other. We shall refer to this quantity as mutual information; it measures the amount of information that one probability distribution contains about another. Mutual information be thought of as the expected change in uncertainty in one variable brought about by observing another variable.

7 Definition 2. (Mutual Information). Consider two discrete random variables X, Y with ranges finite or countable ranges X and Y, a joint probability mass function p(x, y) and marginal probability mass functions p X (x) and p Y (y). The mutual information I(X; Y ) is the relative entropy between the joint distribution and the product of the marginal distributions p X (x)p Y (y). I(X; Y ) := x X p(x, y) p(x, y) log p X (x)p Y (y) y Y = D(p(x, y) p X (x)p Y (y)) = E p(x,y) log p(x, y) p X (x)p Y (y). (2.4) The above definition shows that the mutual information enjoys the very important symmetry property I(X; Y ) = I(Y ; X). (2.5) To gain a better understanding of the mutual information note that we can rewrite I(X; Y ) as I(X; Y ) = x,y p(x, y) log p(x, y) p X (x)p Y (y) = x,y = x,y p(x, y) log p X(x y) p X (x) p(x, y) log p X (x) + x,y p(x, y) log p X (x y) (2.6) = x p X (x) log p X (x) + x,y p(x, y) log p X (x y) = H(X) H(X Y ). Thus H(X) = H(X Y ) + I(X, Y ). We may interpret H(X Y ) as the uncertainty in X given the occurrence of Y ; however, this does not capture the whole of the uncertainty in the probability distribution of X. The missing piece is the mutual information I(X, Y ) which measures the reduction in uncertainty due to the knowledge of Y. Also, by the symmetry (2.5) we have I(X; Y ) = H(Y ) H(Y X) so the mutual information between X and Y is also the reduction in uncertainty in Y due to knowledge of X. One consequence of this is that X contains as much information about Y as Y does about X. Moreover, since H(X, Y ) = H(X) + H(Y X) by (2.6), we have that Finally, note that I(X; Y ) = H(X) + H(Y ) H(X, Y ). I(X; X) = H(X) + H(X) H(X X) = H(X), so the mutual information of a variable with itself is the entropy of that variable s probability distribution. Following this logic, entropy is sometimes referred to as self-information. The following theorem summarizes these facts. Theorem 2.2. Let (X, Y ) two discrete random variables with with finite or countable ranges X and Y, joint probability mass function p(x, y), and individual probability mass functions p X (x) and

8 p Y (y). Then we have that I(X; Y ) = H(X) H(X Y ) (2.7) I(X; Y ) = H(Y ) H(Y X) (2.8) I(X; Y ) = H(X) + H(Y ) H(X, Y ) (2.9) I(X; Y ) = I(Y ; X) (2.20) I(X; X) = H(X) (2.2) Proof. Presented above, followed [5, p.2]. 2.4. Chain Rules for Entropy, Relative Entropy, and Mutual Information. We now generalize the previous results and definitions to sequences of n random variables and associated probability distributions. These results will be of use later, particularly when considering the entropy of a stochastic process. Theorem 2.3 (Chain Rule for Entropy). Let the discrete random variables X,X 2,...,X n have finite or countable ranges X, X 2,..., X n, have joint mass function p(x, x 2,..., x n ), and individual probability mass functions p X (x), p X2 (x),..., p Xn (x). Then H(X, X 2,..., X n ) = n H(X i X i,..., X ). (2.22) Proof. We follow [5, p.22]. Recall that p(x,..., x n ) = n i= p(x i x i,..., x ) and consider: H(X, X 2,..., X n ) = p(x, x 2,..., x n ) log p(x, x 2,..., x n ) x,x 2,...,x n = i= x,x 2,...,x n p(x, x 2,..., x n ) log = x,x 2,...,x n i= n p Xi (x i x i,..., x ) i= n p(x, x 2,..., x n ) log p Xi (x i x i,..., x ) n = p(x, x 2,..., x n ) log p Xi (x i x i,..., x ) i= x,x 2,...,x n n = p(x, x 2,..., x i ) log p Xi (x i x i,..., x ) i= x,x 2,...,x i n = H(X i X i,..., X ). i= To generalize the concept of mutual information, we will define conditional mutual information the reduction of uncertainty of X due to knowledge of Y when Z is given and use this quantity to prove a chain rule as we did with entropy. Definition 2.4 (Conditional Mutual Information). Let X, Y, and Z by discrete random variables with finite or countable ranges. Then the conditional mutual information of random variables X and Y given Z is I(X, Y Z) = H(X Z) H(X Y, Z). (2.23)

Theorem 2.5 (The Chain Rule for Information). Let X,X 2,...,X n, Y be discrete random variables with finite or countable ranges. Then n I(X, X 2,..., X n Y ) = I(X i ; Y X i,..., X ). (2.24) Proof. We follow [5, p.24]. i= I(X, X 2,..., X n Y ) = H(X, X 2,..., X n ) H(X, X 2,..., X n Y ) (2.25) n n = H(X i X i,..., X ) H(X i X i,..., X, Y ) (2.26) = i= i= n I(X i ; Y X, X 2,..., X n ). (2.27) i= 2.5. Jensen s Inequality and its Consequences. Jensen s Inequality is a foundational inequality for convex functions. Recall that we have defined entropy to be the sum of functions of the form log( ) as log( ) is convex, general convexity results will be useful in the study of entropy. In particular, Jensen s Inequality can be used to derive important results such as the Independence Bound on Entropy, Theorem 2.24. We begin with a definition. Definition 2.6 (Convexity). Let I R be an interval. (i) A function f : I R is called convex if for every x, x 2 I and λ [0, ] we have f(λx + ( λ)x 2 ) λf(x ) + ( λ)f(x 2 ). (2.28) (ii) A function f : I R is strictly convex if we have equality in the above expression only when λ {0, }. (iii) A function f : I R is called concave if f is convex. Let us observe that when f : I R is convex then, for any x,..., x N I and any p,..., p n 0 such that i p i = we have ( ) f p i x i p i f(x i ). (2.29) i i If f is strictly convex, then above we have equality if and only if x = = x n. It can be awkward to prove that functions are convex from the definition, so we shall present a sufficient criterion. Theorem 2.7. Let I R be an interval, and f : I R be a function. If the function f has a second derivative that is non-negative (resp. positive) over I, then the function is convex (resp. strictly convex) over I. Theorem 2.7 shows that the functions are convex, while the functions x 2, x. a x, a >, x log x (x 0), log x, x, x > 0, are concave. Convexity will prove to be a useful property which is fundamental to many properties of entropy, mutual information, and other such quantities. The next result is well known. 9

0 Theorem 2.8 (Jensen s Inequality). Let f : I R. If f : I R is a convex function and X is a random variable with range contained in I, then E( f(x) ) f( E(X) ). (2.30) Moreover, if f is strictly convex and in (2.30) we have equality, then X = E(X) with probability, i.e. X is a.s. constant. Proof. When the range of X is finite, the inequality (2.30) is the classical Jensen s inequality (2.29). The general case follows from the finite case by passing to the limit. Jensen s Inequality is the key result used to show several important properties of entropy. Theorem 2.9 (Gibb s Inequality). Let p, q : X [0, ], be two probability mass functions on a finite or countable space X. Then D(p q) 0, (2.3) with equality if and only if p(x) = q(x) for all x X. Proof. We follow [5, p.28]. Fix a random variable with Let be the support set of p(x). Then D(p q) = x A A := {x X : p(x) > 0 } p(x) log p(x) q(x) = p(x) log q(x) (2.29) log p(x) q(x) p(x) p(x) x A x A ( ) = log x A q(x) log x X q(x) = log = 0. ( ) Observe that since log t is a strictly convex function of t, we have equality sign in ( ) if and only if the function A x q(x) p(x) R is constant, i.e., there exists c R such that q(x) = cp(x), x A. This implies that p(x) = c. q(x) = c x A x A Further, we have equality in ( ) only if q(x) =, x A q(x) = x X which implies that c =. Thus, D(p q) = 0 if and only if p(x) = q(x) for all x. Corollary 2.20 (Information Inequality). Let X, Y be discrete random variables with finite or countable ranges. Then I(X; Y ) 0, (2.32) with equality if and only if X and Y are independent. Proof. We follow [5, p.28]. Let p(x, y) denote the joint probability mass function of X, Y. Then with equality if and only if p(x, y) = p(x)p(y). I(X; Y ) = D(p(x, y) p(x)p(y)) 0, We may think of a deep cause for this result. If we find two random variables which exhibit some dependence on each other then we cannot mathematically determine which is the cause and which the effect; we can only note the degree to which they are correlated.

Theorem 2.2 (Conditioning Reduces Entropy). Let X, Y be discrete random variables with finite or countable ranges. Then H(X Y ) H(X), (2.33) with equality if and only if X and Y are independent. Proof. We follow [5, p.29]. 0 I(X; Y ) = H(X) H(X Y ) (2.34) This theorem states that knowing another random variable, Y can only reduce and not increase the uncertainty in X. It is important, however, to note that this is only on average. Example 2.22 below shows that it is possible for H(X Y = y) to be greater than, less than, or equal to H(X), but on the whole we have H(X Y ) = y p(y)h(x Y = y) H(X). Example 2.22. Let X and Y have the following joint distribution X 2 Y 3 0 2 8 Then H(X) H( 8, 7 8 ) = 0.544 bit, H(X Y = ) = 0 bits, and H(X Y = 2) = bit. We calculate H(X Y ) = 3 4 H(X Y = ) + 4H(X Y = 2) = 0.25 bit. Thus, the uncertainty in X is increased if Y = 2 is observed and decreased if Y = is observed, but uncertainty decreases on the average. Theorem 2.23. Let X be a discrete random variable with finite or countable range X. Then 4 8 H(X) log X, with equality if and only if X has a uniform distribution over X. Proof. We follow [5, p.29]. Let u = X be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then D(p u) = p(x) log p(x) = log X H(X) u(x) Thus by the non-negativity of relative entropy we have that 0 D(p u) = log X H(X) Theorem 2.24 (Independence Bound on Entropy). Let X, X 2,..., X n be discrete random variables taking values in a finite or countable range. Then n H(X, X 2,..., X n ) H(X i ), (2.35) with equality if and only if the X i are independent. Proof. We follow [5, p.30]. By the chain rule for entropies, n H(X, X 2,..., X n ) = H(X i X i,..., X ) i= i= n H(X i ), i=

2 which follows from Theorem 2.2. Equality follows if and only if each X i is independent of X i,..., X, i.e. if and only if the X i s are independent. 2.6. Log Sum Inequality and its Applications. In this section we continue to show consequences of the concavity of the logarithm function, the most important of which is the log-sum inequality. This will allow us to prove convexity and concavity for several of our measures including entropy and mutual information. Theorem 2.25 (Log Sum Inequality). For a, a 2,..., a n [0, ) and b, b 2,..., b n [0, ) we have n a i log a ( n ) n i i= a i log a i b n i= i i= i= b, (2.36) i with equality if and only if a i b i is constant. Proof. We follow [5, p.3]. Without loss of generality, we may assume that a i > 0 and b i > 0. The function f(t) = t log t = t log t is strictly convex, since f (t) = t log 2 e > 0, t (0, ). Hence by Jensen s inequality (2.29) we have ( ) αi f(t i ) f αi t i, for α i 0, i α i =. Setting B := n j= we obtain ai bj log a i b i which is the log-sum inequality as b j =, α i = b i B, t i = a i b i, a i bj log a i bj, The log sum inequality will allow us to prove various convexity related results. The first will be a reproving of Theorem 2.9, which states that D(p q) 0 with equality if and only if p(x) = q(x). By the log sum inequality we have that D(p q) = p(x) log p(x) q(x) ( ) p(x) p(x) log = log q(x) = 0, with equality if and only if p(x) q(x) = c. As both p(x) and q(x) are probability mass functions, we have that c =. Thus we have that D(p q) = 0 if and only if p(x) = q(x). Armed with the log-sum inequality, we can now show that relative entropy is a convex function, and that entropy is concave. Theorem 2.26 (Convexity of Relative Entropy). D(p q) is convex in the pair (p, q); that is, if (p, q ) and (p 2, q 2 ) are two pairs of probability mass functions on finite or countable spaces then, for λ [0, ]. D(λp + ( λ)p 2 λq + ( λ)q 2 ) λd(p q ) + ( λ)d(p 2 q 2 ) (2.37) Proof. We follow [5, p.32]. We can apply the log sum inequality to a term on the left hand side of (2.37) to arrive at: (λp (x)+( λ)p 2 (x)) log λp (x) + ( λ)p 2 (x) λq (x) + ( λ)q 2 (x) λp (x) log λp (x) λq (x) +( λ)p 2(x) log ( λ)p 2(x) ( λ)q 2 (x). We use the convention that 0 log 0 = 0, a log a = if a > 0, and 0 log 0 0 0 arguments. = 0, which follow from continuity

3 Summing this over all values of x gives us (2.37). Theorem 2.27 (Concavity of Entropy). H(p) is a concave function of p where p is a probability mass function on a finite or countable space. Proof. We follow [5, p.32]. By Theorem 2.23 we have that H(p) = log X D(p u), where u is the uniform distribution on X outcomes. Then, since D is convex, we have that H is concave. 3. The Asymptotic Equipartition Property for Sequences of i.i.d. Random Variables The Asymptotic Equipartition Property or AEP is central to information theory. In broad terms, it claims that the outcomes of certain stochastic processes can be divided into two groups: one group is small and is far more likely to occur than the other, larger, group. This property allows for efficient data compression by ensuring that some sequences generated by a stochastic process are more likely than others. In this section we show that AEP holds for sequences of i.i.d. random variables 3.. Asymptotic Equipartition Property Theorem. To approach the Asymptotic Equipartition Property, we first need to cover several notions of convergence. Definition 3. (Convergence of Random Variables). Given a sequence of random variables {X i }, we say that the sequence {X i } converges to a random variable X (i) in probability, if ɛ > 0, P( X n X > ɛ ) 0, (ii) in mean square, if E(X n X) 2 0, (iii) with probability or almost surely, if P{lim n X n = X} =. We can now state the Asymptotic Equipartition Theorem. Theorem 3.2 (Asymptotic Equipartition Property). If {X i } is a sequence of independent identically distributed (i.i.d.) variables, with finite or countable range X, associated with the probability mass function p(x), then in probability. n log p(x, X 2,..., X n ) H(X), (3.) Proof. We follow [5, p. 58]. Recall that functions of independent random variables are themselves independent random variables. Then, since the X i are i.i.d. the log p(x i ) are i.i.d. as well. Thus we have n log p(x, X 2,..., X n ) = log p(x i ) n i E log p(x) in probability (by the weak law of large numbers) = H(X). Remark 3.3. We note that if X, X are i.i.d. discrete random variables taking values in a finite or countable space, then P(X = X ) 2 H(X). To see this denote by p the common probability mass function of these two random variables and by X their common alphabet. Then P(X = X ) = x X P(X = X = x) = x X p(x) 2 = x X p(x)2 log p(x) = E[ 2 log p(x)].

4 Using Jensen s inequality for the convex function 2 x we deduce E[ 2 log p(x)] 2 E[log p(x)] = 2 H(X). Definition 3.4. The ɛ-typical set A (n) ɛ, for some ɛ > 0, with respect to p(x) is the set of sequences (x, x 2,..., x n ) X n with the property 2 n(h(x)+ɛ) p(x, x 2,..., x n ) 2 n(h(x) ɛ). (3.2) We can use the AEP to prove several properties about the probability of sets in A (n) ɛ cardinality of A (n) ɛ and the Theorem 3.5. If {X i } is a sequence of independent identically distributed (i.i.d.) variables, with finite or countable range X, associated with the probability mass function p(x), and for some ɛ > 0 A (n) ɛ is the ɛ-typical set with respect to p(x), then (i) If (x, x 2,..., x n ) A (n) ɛ, then H(X) ɛ n log p(x, x 2,..., x n ) H(X) + ɛ. (ii) P(A (n) ɛ ) > ɛ for sufficiently large n. (iii) A (n) ɛ 2 n(h(x)+ɛ), where S denotes the cardinality of the set S. (iv) A (n) ɛ ( ɛ)2 n(h(x) ɛ) for sufficiently large n. Thus the typical set has probability nearly, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly 2 nh(x). That is, the typical sets have high probability and are small if the entropy is small. Proof. We follow [5, p.59]. The inequality (i) follows from the condition (3.2) in the definition of A (n) ɛ : 2 n(h(x)+ɛ) p(x, x 2,..., x n ) 2 n(h(x) ɛ) = n(h(x) + ɛ) log p(x, x 2,..., x n ) n(h(x) ɛ) = H(X) ɛ n log p(x, X 2,..., X n ) H(X) + ɛ (ii) The weak law of large numbers implies that the probability of the event (X, X 2,..., X n ) A (n) ɛ tends to as n. Indeed, for any ɛ > 0 we have ( lim n P(A(n) ɛ ) = lim P ) n n log p(x, X 2,..., X n ) H(X) < ɛ =. To prove (iii) we note that = and thus we have x X n p(x) x A (n) ɛ p(x) 2 n(h(x)+ɛ) A (n) ɛ, A (n) ɛ 2 n(h(x)+ɛ). (3.3) Finally, to prove (iv) note that, for sufficiently large n, we have that P( A (n) ɛ ) > ɛ. Thus, and thus ɛ < P( A (n) ɛ ) x A (n) ɛ 2 n(h(x) ɛ) = 2 n(h(x) ɛ) A (n) ɛ, A (n) ɛ ( ɛ)2 n(h(x) ɛ).

3.2. Consequences of the AEP: Data Compression. Let X, X 2,..., X n be independent, identically distributed random variables with vales in the finite alphabet X drawn according to the density function p(x). We seek to find an efficient encoding system for such sequences of random variables. Note that, ɛ > 0 we can partition all sequences in X n into two sets: the ɛ-typical set A (n) ɛ and its complement, (A (n) ɛ ) c. Let there be a linear ordering on each of these sets, e.g. induced by lexicographic order on X n defined by some linear ordering on X. Then each sequence in A (n) ɛ can be represented by giving the index of the sequence in the set. As there are at most 2 n(h(x)+ɛ) sequences in A (n) ɛ by Theorem 3.5, the indexing ought to require no more than n(h(x) + ɛ) + bits the extra bit is necessary if n(h(x) + ɛ) is not an integer. These sequences are prefixed by a 0 to distinguish them from the sequences in (A (n) ɛ ) c, raising the total number of bits to n(h(x) + ɛ) + 2. Likewise, we can index each sequence not in (A (n) ɛ ) c using at most n log X + bits. Adding a for identification purposes increases the maximum number of bits to n log X + 2 Remark 3.6. We note the following about the above coding scheme: The code is one-to-one and easily decodable, assuming perfect transmission. The initial bit acts as a flag to signal the following length of code. We have been less efficient than we could have been, since we have used a brute force method to encode the sequences in (A (n) ɛ the fact that the cardinality of (A (n) ɛ ) c. In particular, we could have taken into account ) c is less than the cardinality of X n. However, this method is still efficient enough to yield an efficient description. The typical sequences have short descriptions of length approximately nh(x) Let x n denote a sequence (x, x 2,..., x n ) X n and let l(x n ) be the length of the codeword corresponding to x n. If n is large enough that P( A ɛ (n) ) ɛ, the expected value of the codeword is E[ l(x n ) ] = x n p(x n )l(x n ) = x n A ɛ (n) x n A (n) ɛ p(x n )l(x n ) + p(x n )l(x n ) x n (A (n) ɛ ) c p(x n )(n(h(x) + ɛ) + 2) + p(x x n (A (n) ɛ ) c n )( n log X + 2 ) = P( A (n) ɛ )( n(h(x) + ɛ) + 2 ) + P( (A (n) ɛ ) c )(n log X + 2 ) n( H(X) + ɛ ) + ɛn(log X ) + 2 = n( H(X) + ɛ ), where we have that ɛ = ɛ + ɛ log X + 2 n can be made arbitrarily small by the correct choices of ɛ and n. We have thus proved the following result. Theorem 3.7. Let X,..., X n be i.i.d. random variables with finite or countable range X and associated with the probability mass function p(x). Let ɛ > 0. Then there exists a code that maps sequences x n of length n into binary strings such that the mapping is one-to-one (and therefore invertible) and [ ] E n l(xn ) H(X) + ɛ, for sufficiently large n. Proof. Above, follows [5, p.6]. 5

6 3.3. High-probability Sets and the Typical Set. We have defined A (n) ɛ so that it is a small set which contains a lot of the probability of X n. However, we are left with the question of whether it is the smallest such set. We will prove that the typical set has the same number of elements as the smallest set to a first order approximation in the exponent. Definition 3.8 (Smallest Set). For each n =, 2,..., let B (n) δ P( B (n) δ ) δ. X n be the smallest set such that We shall argue that the typical set has approximately the same cardinality as the smallest set. Theorem 3.9. Let X,..., X n be i.i.d. random variables with finite or countable range X and associated with the probability mass function p(x). For δ < 2 and any δ > 0, if P( B (n) δ ) > δ, then log B(n) n δ > H(X) δ for sufficiently large n. Proof. First we prove a lemma. Lemma 3.0. Given any two sets A, B, such that P(A) > ɛ and P(B) > ɛ 2, we have that P(A B) > ɛ ɛ 2. Proof. By the Inclusion-Exclusion Principle we have that The above lemma shows that P( A (n) ɛ of inequalities. From here we conclude that P(A B) = P(A) + P(B) P(A B) > ɛ + ɛ 2 = ɛ ɛ 2. ( ɛ δ P A (n) ɛ A ɛ (n) B (n) δ B (n) δ ) ɛ δ. This leads to the following sequences 2 n(h(x) ɛ) = A (n) ɛ ) B (n) δ = A (n) ɛ B (n) δ p(x n ) B (n) δ 2 n(h(x) ɛ) B (n) δ 2 n(h(x) ɛ). B (n) δ 2 n(h(x) ɛ) ( ɛ δ) = log B (n) δ n(h(x) ɛ) + log ɛ δ = log B(n) n δ (δ>/2) > H(X) ɛ + log 2 ɛ n = n log B(n) δ H(X) δ, for δ = ɛ log 2 ɛ n which is small for appropriate ɛ and n. The theorem demonstrates that, to first order in the exponent, the set B (n) δ, must have at least 2 nh(x) elements. However, A (n) ɛ has 2 n(h(x)±ɛ. Therefore, A (n) ɛ is about the same size as the smallest high-probability set. We will now introduce notation to express equality to the first order in the exponent.

Definition 3. (First Order Exponential Equality). Let a n and b n be sequences of real numbers. Then we write a n =b n if lim n n log a n = 0. b n Thus, a n =b n implies that a n and b n are equal to the first order in the exponent. We can now restate the above results: If δ n 0 and ɛ n 0, then B (n) δ n = A (n) ɛ n =2 nh(x). 4. Asymptotic Equipartition Property for Markov Chains We now formally extend the results of the previous section from a sequence of i.i.d. random variables to a stochastic process. In particular, we will prove a version of the asymptotic equipartition property for Markov Chains. Throughout this section the random variables will be assumed valued in a finite alphabet X. 4.. Background Information. A (discrete time) stochastic process with state space X is a sequence {X i } i N0 of X -valued random variables. We can characterize the process by the joint probability mass functions p n (x,..., x n ) = P(X = x,..., X n = x n ), (x,..., x n ) X n, n N. One desirable property for a stochastic process to possess is stationarity. Definition 4. (Stationarity). A stochastic process is called stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index. That is, P(X = x, X 2 = x 2..., X n = x n ) = P(X +l = x, X 2+l = x 2..., X n+l = x n ), for every n, for every shift l, and for all x, x 2,..., x n X. A famous example of a stochastic process is one in which random variable is dependent only on the random variable which precedes it and is conditionally independent of all other preceding random variables. Such a stochastic process is called a Markov Chain, and we now move to offer a formal definition. Definition 4.2 (Markov Chain). A discrete stochastic process {X i } i N0 with state space X is said to be a Markov Chain if, for any n N, and any x, x 2,..., x n, x n+ X, we have P(X n+ = x n+ X n = x n, X n = x n,..., X = x ) = P(X n+ = x n+ X n = x n ). Here we can deduce that in a Markov Chain the joint probability mass function of a sequence of random variables is p(x, x 2,..., x n ) = p(x )p (x 2 x )p 2 (x 3 x 2 )... p n (x n x n ), p k (x x) = P( X k+ = x X k = x ). This motivates the importance of the next definition. Definition 4.3 (Homogeneity). A Markov Chain is called homogeneous if the conditional probability p n (x x) does not depend on n. That is, for n N 0 P( X n+ = x X n = x ) = P( X 2 = x X = x ), x, x X. Now, if {X i } i N0 is a Markov Chain with finite state space X = {x,..., x m }, X n is called the state at time n. A homogeneous Markov Chain is characterized by its initial state X 0 and a probability transition matrix P = [P i,j ] i,j m, P i,j = P( X = x j X 0 = x i, ). 7

8 If it is possible to transition from any state of a Markov Chain to any other with nonzero probability in a finite number of steps, then the Markov Chain is said to be irreducible. If the the greatest common denominator of the lengths of nonzero probability different paths from a state to itself is, then the Markov Chain is said to be aperiodic. Further, if the probability mass function of X n is p n (i) = P(X n = x i ), then p n+ (j) = x n p n (i)p i,j. In particular, this shows that the probability mass function of X n is completely determined by the probability mass function of the initial state X 0. A probability distribution π on the state space X is called a stationary distribution with respect to a homogeneous Markov chain (X n ) n N0 if P(X n = x) = π(x), n N 0, x X. If X = {,..., m} and the probability transition matrix is (P ij ) then we see that π is stationary if and only if m π(j) = π(i)p i,j, j =,..., m. i= The next theorem describes the central fact in the theory of Markov chains. For a proof we refer to [3, 4]. Theorem 4.4. Suppose that {X n } is an irreducible Markov chain with finite state space X. Then the following hold. (i) There exists a unique stationary distribution π on X, and for any initial data X 0 and any function f : X R we have the ergodic limit n lim f(x k ) = f(x)π(x), almost surely. (4.) n n k= x X (ii) If additionally {X n } is aperiodic, then the distribution of X n tends to the stationary distribution as n, i.e., lim P(X n = x) = π(x), x X. n The ergodic limit (4.) is very similar to the strong law of large numbers. We want to mention a special case of (4.). Fix a state x 0 X and let f : X R be the indicator function of the set {x 0 }, {, x = x 0, f(x) = 0, x x 0. In this case, the random variable, M n (x) = n f(x k ), k= represents the number of times the Markov chain visits the state x in the discrete time interval {,..., n}. The ratio F n (x) = M n(x) n then can be viewed as the frequency of these visits in this time interval. The ergodic limit then states that M n (x) lim = π(x), almost surely. (4.2) n n

As an example to keep in mind suppose that X consists of all the letters of the English alphabet (both capitalized and uncapitalized) and all the signs of punctuation. By parsing a very long English text we can produce a probability mass function π on X that describes the frequency of occurrence of the various symbols. We can also produce a transition probability matrix P = ( p xy ) x,y X, where p xy describes how frequently the symbol x is followed by the symbol y.the string of symbols in the long text can be viewed as one evolution of a Markov chain with state space X with stationary distribution π. 4.2. Entropy of Markov Chains. Suppose we have a simple Markov Chain with a finite number of states X = {x, x 2,..., x m }, with a transition probability matrix p i,j for i, j {, 2,..., m} and a stationary distribution π, π i = π(x i ). If the system is in state x i, then it transitions to state x k with probability p i,k. We can regard the transition from x i to x k as a random variable T i drawn according to the probability distribution {p i,, p i,2,..., p i,n }, and the entropy of X i is n H(T i ) = p i,k log p i,k. k= The entropy of T i will depend on i, and can be intuitively thought of as a measure of the amount of information obtained when the Markov Chain moves one step forward from the starting state of x i. Next, we define the average over the initial states of this quantity with respect to the stationary measure π, n n n H = π i H i = π i p i,k log p i,k. i= i= k= This can be interpreted as the average amount of uncertainty removed when the Markov chain moves one step ahead. H is then defined to be the entropy of the Markov Chain. Moreover, H is uniquely determined by the chosen stationary probability measure π and the transition probabilities p i,k. We now seek to generalize these concepts to the case where we move r steps ahead for r N. If the systems is in state x i, then it is easy to calculate the probability that in the next r trials we find it in the states x k, x k2,..., x kr in turn, where k, k 2,..., k r are arbitrary numbers from to n. Thus the subsequent fate of a system initially in the state A i in the next r trials can be captured by an X r -valued random variable Ti r. Moreover, the associated entropy is Hr i and is regarded as a measure of the amount of information obtained by moving ahead r steps in the chain after starting at x i. Then, in parallel to the above, H (r) = n i= π i H (r) i. This is the r-step entropy of the Markov Chain, which is the average amount of information given by moving ahead r steps in the chain. The one step entropy defined above can be written as H (). Since we have defined H (r) to be the average amount of information obtained in moving ahead r steps in a Markov Chain, then it is natural to expect that for r, s, N we have H (r+s) = H (r) + H (s), or, equivalently, H (r) = rh (). We shall show that this later condition is true by induction on r. For r = the definition holds trivially, as H () = H (). Now suppose that H (r) = rh () for a given r. We will show that H (r+) = (r + )H (). If the system is in a state x i, then the random variable which describes the 9

20 system s behavior over the next r + transitions can be regarded as the product of two dependent variables: (A) the variable corresponding to the transition immediately following x i with entropy H () i and (B) the random variable describing the behavior of the system in the next r trials. The entropy of this variable is H (r) k if the outcome of A was x k. Recall the relation H(A, B) = H(A) + H(B A). This gives us n H (r+) i = H () i + p i,k H (r) k. (4.3) k= Combining this relation with the definition H (r) = n i= π ih (r) = H () + = n k= n i= H (r+) = π i H () i + n i= n k= π i H (r+) i H (r) k i n π i p i,k i= gives us π k H (r) k = H () + H (r) = (r + )H (r+). 4.3. Asymptotic Equipartition. Consider a homogeneous Markov Chain {X n } n N0 with finite state space X = { x,..., x n }, and a stationary distribution π : X [0, ] which obeys the law of large numbers (4.2). That is, in a sufficiently long sequence of s consecutive trials the relative frequency Mn(x) s of the occurrence of the state xx will converge to π(x) in probability, i.e., for any δ > 0 we have ( ) lim P M s (x) π(x) s s > δ = 0. (4.4) We set P k := π(x k ), and we denote by p ij the probability of transition from state x i to the state x j. A result x s X s of a sequence of s consecutive transitions in a Markov Chain be written as a sequence x s = x k x k2... x ks, k,..., k s {, 2,..., n}. Note that the probability of x s materializing depends not on the part of the chain where the sequence begins due to stationarity, and is equal to P( x s ) = P k p k k 2 p k2 k 3... p ks k s. If i, l {, 2,..., n}, and m il is the number of pairs of the form k r k r+ for r s in which k r = i and k r+ = l, then the probability of the sequence x s can be written as n n P( x s ) = P k (p m il i,l ). (4.5) i= l= Theorem 4.5. Given ɛ > 0 and η > 0, there exists S = S(ɛ, η), sufficiently large, such that for all s > S(ɛ, η) the sequences ( x s ) X s can be divided into two groups with the following properties: (i) The probability of any sequence x s in the first group satisfies the inequality log P( x s ) s H < η (4.6)

2 (ii) The sum of the probabilities of all sequences of the second group is less than ɛ. That is, all sequences with the exception of a very low probability group have probabilities lying between a s(h+η) and a s(h η), where a is the base of the logarithm used and H is the one-step entropy of the given chain. This parallels the partition of a probability space into typical sets and non-typical sets. Proof. We follow the approach in [6, p.6]. With δ = δ(s, η) > small enough, to be specified later, we include in the first group the sequences x s X s satisfying the following two conditions (i) P( x s ) > 0, (ii) for any i, l {, 2,..., n} the inequality holds. m i,l sp i p i,l < sδ, i, l {, 2,..., n}. (4.7) All other sequences shall be assigned to group two, and we shall show that this assignment satisfies the theorem if δ is sufficiently small and s sufficiently large. We now consider the first requirement. Suppose that x s belongs to the first group. It follows from (4.7) that m i,l = sp i p i,l + sδθ i,l for θ i,l <, and i, l {, 2,..., n}. We now substitute the expressions for m i,l into (4.5). While doing so, we note that the requirement that the given sequence be a possible outcome requires m i,l = 0 when p i,l = 0. Thus, in (4.5) we must restrict ourselves to nonzero p i,l, which we will denote by an asterisk on the product. Thus P( x s ) = P k (pi,l ) sp ip i,l +sδθ i,l It follows that = log P(x s ) = log P k s i i l = log P k + sh sδ i log P( x s ) s H < s log Pi p i,l log p i,l sδ l i θi,l log p i,l. P k l l + δ i Thus for sufficiently large s and small δ defined by η = δ log p i i,l l log p i,l. θ i,l log p i,l we have that log P( x s ) H s < η. Thus the first requirement of the theorem is satisfied. We now consider the second group. For book-keeping purposes, we note that when we sum the probabilities of the sequences in this group, we exclude the probabilities of those sequences which are impossible as their probabilities are zero. Thus we are interested in the probability P ( x s ) for sequences for which the inequality (4.7) fails for at least one a pair of indices i, l {, 2,..., n}. It suffices to calculate the quantity n n P( m i,l sp i p i,l sδ). i= l= l

22 Fix the indices i and l. By (4.4) we have that sufficiently large s ( ) P m i sp i < δ 2 s > ɛ. Then if the inequality m i sp i < δ 2 s is satisfied, and if we make s is sufficiently large, then m i can be made arbitrarily large. Thinking of the transition i l as a Bernoulli trial with success probability p il we expect that during m i trials we would observe m i p il such transitions. The weak law of large numbers then shows that if s is sufficiently large and thus m i is sufficiently large we have ) m i,l P( p i,l m i < δ > ɛ. 2 Combining these results gives us that the probability of satisfying the inequalities m i sp i < δ 2 s (4.8) exceeds ( ɛ) 2 > 2ɛ. However, it follows from (4.8) that p i,l m i sp i p i,l < p i,l δ 2 s < δ 2 s, which together with (4.9) and the triangle inequality gives us Thus for any i and l, we have that for sufficiently large s ( ) P m i,l sp i p i,l < δs > 2ɛ, which implies that and from this it follows that n i= l= m i,l m i p i,l < δ 2 m i δ 2 s (4.9) m i,l sp i p i,l < δs. (4.0) ( ) P m i,l sp i p i,l > δs < 2ɛ, n ( ) P m i,l sp i p i,l > δs < 2n 2 ɛ. As the right hand side of the above inequality can be made arbitrarily small as ɛ becomes arbitrarily small, we have that the sum of the probabilities of all the sequences in the second group can be made arbitrarily small for sufficiently large s. In the remainder of this section we will assume that the base of the logarithms employed is a = 2. We denote by Xɛ,η s the collection of sequences in the first group and we will refer to such sequences as (ɛ, η)-typical. By construction P( Xɛ,η s ) > ɛ if s is sufficiently large. Now observe that P( x s ) 2 s(h+η) Xɛ,η, s and we conclude x s X s ɛ,η X s ɛ,η 2 s(h+η).

Finally observe that ( ɛ) < P( X s ɛ,η ) = x s X s ɛ,η P( x s ) X s ɛ,η 2 (H η), so that Xɛ,η s > ( ɛ)2 (H η). We see that the (ɛ, η)-typical sets Xɛ,η s satisfy all the properties of the ɛ-typical sets in the Asymptotic Equipartition Theorem 3.5. Arguing as in the proof of Theorem 3.7 we deduce the following result. Theorem 4.6. Consider a homogeneous ergodic Markov chain (X n ) n on the finite set X with stationary distribution π and associated entropy H. Fix ɛ > 0 and denote by B the set of finite strings of 0-s and -s. Then, for every s > 0 sufficiently large there exists an injection c s : X s B such that if l s ( x s ) denotes the length of the string c s ( x s ) we have where X s is the random vector (X,..., X s ). E [ s l s( X s ) ] H + ɛ, 5. Coding and Data Compression We now begin to establish the limit for information compression by assigning short descriptions to the most probable messages, and assigning longer descriptions to less likely messages. In this chapter we will find the shortest average description length of a random variable by examining its entropy. 5.. Examples of Codes. In the sequel X will denote a finite set called alphabet. It will be equipped with a probability distribution p. We will often refer to the elements of X as the symbols or the letters of the alphabet. Think for example that X consists of all the letters (small and large caps) of the English alphabet together with all the signs of punctuation and a symbol for a blank space (separating words). One can produce a probability distribution on such a set by analyzing a very long text and measuring the frequency with each each of these symbols appears in this text. Definition 5. (Codes). Suppose that X and D are finite sets. (i) We denote by D the collection of words with alphabet D, D := n N D n. The length of a word w D is the natural number n such that w D n. (ii) A source code C or simply code for X based on D is a mapping C : X D. For x X we denote by l C (x) the length of C(x). Example 5.2. Suppose X = {yes, no} and D = {0, }. Then C(yes) = 00 and C(no) = is a source code. Definition 5.3 (Expected Length). Let p be a probability mass function on the finite alphabet X, and let D be a finite set. The expected length L(C) of a source code C : X D is the expectation of the random variable l C : X N, i.e., L(C) = E p [ l C ] = x X p(x)l C (x). 23