Information Theory: Entropy, Markov Chains, and Huffman Coding

Size: px
Start display at page:

Download "Information Theory: Entropy, Markov Chains, and Huffman Coding"

Transcription

1 The University of Notre Dame A senior thesis submitted to the Department of Mathematics and the Glynn Family Honors Program Information Theory: Entropy, Markov Chains, and Huffman Coding Patrick LeBlanc Approved: Professor Liviu Nicolaescu

2 Contents Notation and convention 2. Introduction 3 2. Entropy: basic concepts and properties Entropy Joint Entropy and Conditional Entropy Relative Entropy and Mutual Information Chain Rules for Entropy, Relative Entropy, and Mutual Information Jensen s Inequality and its Consequences Log Sum Inequality and its Applications 2 3. The Asymptotic Equipartition Property for Sequences of i.i.d. Random Variables Asymptotic Equipartition Property Theorem Consequences of the AEP: Data Compression High-probability Sets and the Typical Set 6 4. Asymptotic Equipartition Property for Markov Chains Background Information Entropy of Markov Chains Asymptotic Equipartition Coding and Data Compression Examples of Codes Kraft Inequality Optimal Codes Bounds on the Optimal Code Length Kraft Inequality for Uniquely Decodable Codes Huffman Codes Huffman Coding and the Twenty-Question Game Generation of Discrete Distributions from Fair Coins 34 References 36

3 2 Notation and convention We will denote by S the cardinality of a finite set S. We denote by N the set of natural numbers, N = {, 2,... } and we set N 0 := N {0}. We will use capital letters to denote random variables. We will use the symbol E when referring to expectation and the symbol P when referring to probability For any set S, contained in some ambient space X, we denote by I S the indicator function of S {, x S, I S : X {0, }, I S (x) = 0, x S. For a discrete random variable X we will use the notation X p to indicate thet p is the probability mass function (pmf) of X. We will often refer to the range of a discrete random variable as its alphabet.

4 . Introduction Information theory is the study of how information is quantified, stored, and communicated. If we attempt to quantify, store, or communicate information via electronic means then we inevitably run into the issue of encoding that is, how to store words, letters, and numbers as strings of binary digits. However, in doing so we might encounter a trade off between efficiency and reliability. To illustrate this trade-off consider an example: sending a binary bit over a noisy channel. That is, whenever we send a bit there is a chance p that the bit we send will flip: e.g., if we send a there is a p chance that the receiver will receive a 0. The most efficient way to send a message say the digit for sake of argument would be to send a single ; however, there is a possibly significant chance p that our message will be lost. We could make our message more reliable by sending or instead, but this vastly decreases the efficiency of the message. Claude Shannon attacked this problem, and incidentally established the entire discipline of information theory, in his groundbreaking 948 paper A Mathematical Theory of Communication. But what does information mean here? Abstractly, we can think of it as the resolving of uncertainty in a system. Before an event occurs we are unsure which of many possible outcomes could be realized; the specific realization of one of these outcomes resolves the uncertainty associated with the event. In this paper we first formalize a a measure of information: entropy. In doing so, we explore various related measures such as conditional or joint entropy, and several foundational inequalities such as Jensen s Inequality and the log-sum inequality. The first major result we cover is the Asymptotic Equipartition Property, which states the outcomes of a stochastic process can be divided into two groups: one of small cardinality and high probability and one of large cardinality and low probability. The high probability sets, known as typical sets, offer hope of escaping the trade-off described above by assigning the most efficient code to the sets with the highest-probability. We then examine similar results for Markov Chains, which are important because important processes, e.g. English language communication, can be modeled as Markov Chains. Having examined Markov Chains, we then examine how to optimally encode messages and examine some useful applications Entropy: basic concepts and properties 2.. Entropy. There are two natural ways of measuring the information conveyed in a statement: meaning and surprise. Of these meaning is perhaps the more intuitive after all, the primary means of conveying information is to convey meaning. Measuring meaning, however, is at least as much philosophy as mathematics and difficult to do in a systemic way. This leaves us with surprise as a measure of information. We can relate the amount of surprise in a statement to how probable that statement is; the more probable a statement the less surprising, and the less probable the more surprising. Consider the three statements below: The dog barked yesterday. The dog saved a child drowning in a well yesterday. QQXXJLOOCXQXIKMXQQLMQXQ The second statement ought to be more surprising than the first. Dogs bark every day; that one should do so at some point over the course of a day is quite likely and so unsurprising. On the other hand dogs rarely save drowning children, so the occurrence of such an event is both unlikely and surprising. So far so clear. But how might the first two statements compare with the third statement? The third statement appears to be total gibberish, and so we might be tempted to disregard it entirely. However, in another sense it is actually quite surprising. The letters Q and X do not often come up in English words, so the fact that a message was conveyed with so many Q s and X s ought to be a surprise. But how to measure this surprise?

5 4 Entropy is one such measure. A formal definition is offered below. We use the convention that 0 log(0) = 0 based on the well known fact lim x log x = 0. x 0 Definition 2. (Entropy). Fix b > 0 and let p be a probability distribution on a finite or countable set X. We define its entropy (in base b) to be the non-negative number H b (p) := x χ p(x) log b (p(x)). (2.) Let X be a discrete random variable with range (or alphabet) contained in a finite or countable set X. We define then entropy of X to be the entropy of p X, the probability mass function of X H b (X) = H b (p X ) = x X p X (x) log b p X (x). (2.2) Remark 2.2. For the rest of this paper, we will use the simpler notation log when referring to log 2. Also we will write H(X) instead of H 2 (X). As the logarithm is taken base 2, the entropy is expressed in terms of bits of information. The entropy of a random variable X does not depend on the values taken by X but rather on the probability that X takes on its values. The law of subconscious statistician shows that if X p(x), the expected value of the random variable g(x) is E p [ g(x) ] = x χ g(x)p(x). (2.3) We can relate the expected value of a transformation of p(x) to the entropy associated with the random variable X. Proposition 2.3. Let X be a discrete random variable with range (or alphabet) contained in a finite or countable set X with probability mass function p : X R. Then, [ ] H(X) = E p log. (2.4) p(x) Proof. E p (log( p(x) ) = p(x) log p(x) = p(x) log p(x) = H(X). x χ x χ Let us return to the idea of entropy as the measure of surprise in the realization of a random variable. For any outcome x X we can think of log p(x) as a how surprised we ought to be at observing x; note that this agrees with our intuitive conception of surprise, for the smaller the probability p(x) the larger the surprise upon its observation. The entropy of p is then the amount of surprise we expect to observe when sampling X according to p. More properly stated, then, entropy is not a measure of surprise upon the realization of a random variable, but the amount of uncertainty present in the probability distribution of said random variable. Obviously H(X) 0. (2.5) From the equality log b (x) = log b (a) log a (x) we deduce immediately that H b (x) = log b (a)h a (x). (2.6)

6 2.2. Joint Entropy and Conditional Entropy. Thus far we have presented entropy only as a measure of uncertainty of the probability distribution of one random variable. For practical applications it will be useful to have a measure of uncertainty of the joint probability distribution of a sequence of random variables. This leads us to define the notions of joint and conditional entropy. We use the latter idea to derive the Chain Rule for Entropy, Theorem 2.6, which gives us another way to calculate the joint entropy of two or more random variables. Definition 2.4 (Joint Entropy). Consider a pair of discrete random variables (X, Y ) with finite or countable ranges X and Y, and joint probability mass function p(x, y). We view the pair (X, Y ) as a discrete random variable with alphabet X Y. The joint entropy H(X, Y ) of X, Y is the entropy of the random variable (X, Y ). More precisely entropy is H(X, Y ) = x X y Y p(x, y) log(p(x, y)) = E p [ log 5 ]. (2.7) p(x, y) Having defined joint entropy, it is natural to define the conditional entropy of two probability distributions. Definition 2.5 (Conditional Entropy). Let (X, Y ) be a pair of discrete random variables with finite or countable ranges X and Y respectively, joint probability mass function p(x, y), and individual probability mass functions p X (x) and p Y (y). Then the conditional entropy of Y given X, denoted by H(Y X), is defined as H(Y X) := x X p X (x)h(y X = x) = p X (x) p Y (y x) log(p Y (y x)) x X y Y = p(x, y) log(p Y (y x)) x X y Y = E p [ log(p(y X))]. Unsurprisingly, the joint and conditional entropies are intimately related to each other. Theorem 2.6 (The Chain Rule for Entropy). Let (X, Y ) be a pair of discrete random variables with finite or countable ranges X and Y respectively, joint probability mass function p(x, y), and individual probability mass functions p X (x) and p Y (y). Then (2.8) H(X, Y ) = H(X) + H(Y X). (2.9) Proof. We follow the approach in [5, p.7]. H(X, Y ) = p(x, y) log(p(x, y)) x X y Y = p(x, y) log(p X (x)p Y (y x)) x X y Y = p(x, y) log(p X (x)) p(x, y) log(p Y (y x)) x X y Y x X y Y = p X (x) log(p X (x)) p(x, y) log(p Y (y x)) x X x X y Y = H(X) + H(Y X)

7 6 NB: By symmetry, we also have that H(X, Y ) = H(Y ) + H(X Y ). Corollary 2.7. let (X, Y, Z) be discrete random variables with finite or countable ranges X, Y, and Z respectively. Then H(X, Y Z) = H(X Z) + H(Y X, Z). (2.0) Proof. Follows the proof given of the theorem. Remark 2.8. In general, H(Y X) H(X Y ). However, we do have that Indeed, This will be of use later. H(X) H(X Y ) = H(Y ) H(Y X). H(X) + H(Y X) = H(X, Y ) = H(Y ) + H(X Y ) Relative Entropy and Mutual Information. Relative entropy measures the divergence between two distributions of random variables: the inefficiency of assuming that a distribution is q when in actuality it is p. If two distributions are not divergent, then we can broadly expect them to behave in the same manner; if they are divergent then they can behave in such a different manner that we cannot tell much about one distribution given the other. Definition 2.9 (Relative Entropy). The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) defined on a finite or countable space X is D(p q) = x X p(x) log p(x) q(x). (2.) If X is a random variable with alphabet X and probability mass function p, [ D(p q) = E p log p(x) ]. (2.2) q(x) We use the convention that 0 log 0 0 = 0, 0 log 0 q = 0, and that p log p 0 = based on continuity arguments. Thus if there is an x such that q(x) = 0 and p(x) > 0, then D(p q) =. Remark 2.0. We think of D(p q) as measuring the distance between two probability distributions; however, it is not quite a metric. While the Information Inequality (Theorem 2.20) will show that it is non-negative and zero if and only if p = q, it can in general fail the other criteria. Nevertheless, it is useful to think of it as measuring the distance between two distributions. We present one intuitive way to think of relative entropy. Let the events A, A 2,..., A n be the possible mutually exclusive outcomes of an experiment. Assume that q(x) consists of the probabilities of these outcomes, while p(x) consists of the probabilities of these outcomes under different circumstances. Then ( log log ) q k p k (2.3) denotes the chance in the unexpectedness of A k due to the different experimental circumstances. D(p q) is equal to the expected value of this change. The interpretation of relative entropy presented above implies that two probability distributions may be related so that knowing one distribution gives us some information about the other. We shall refer to this quantity as mutual information; it measures the amount of information that one probability distribution contains about another. Mutual information be thought of as the expected change in uncertainty in one variable brought about by observing another variable.

8 7 Definition 2. (Mutual Information). Consider two discrete random variables X, Y with ranges finite or countable ranges X and Y, a joint probability mass function p(x, y) and marginal probability mass functions p X (x) and p Y (y). The mutual information I(X; Y ) is the relative entropy between the joint distribution and the product of the marginal distributions p X (x)p Y (y). I(X; Y ) := x X p(x, y) p(x, y) log p X (x)p Y (y) y Y = D(p(x, y) p X (x)p Y (y)) = E p(x,y) log p(x, y) p X (x)p Y (y). (2.4) The above definition shows that the mutual information enjoys the very important symmetry property I(X; Y ) = I(Y ; X). (2.5) To gain a better understanding of the mutual information note that we can rewrite I(X; Y ) as I(X; Y ) = x,y p(x, y) log p(x, y) p X (x)p Y (y) = x,y = x,y p(x, y) log p X(x y) p X (x) p(x, y) log p X (x) + x,y p(x, y) log p X (x y) (2.6) = x p X (x) log p X (x) + x,y p(x, y) log p X (x y) = H(X) H(X Y ). Thus H(X) = H(X Y ) + I(X, Y ). We may interpret H(X Y ) as the uncertainty in X given the occurrence of Y ; however, this does not capture the whole of the uncertainty in the probability distribution of X. The missing piece is the mutual information I(X, Y ) which measures the reduction in uncertainty due to the knowledge of Y. Also, by the symmetry (2.5) we have I(X; Y ) = H(Y ) H(Y X) so the mutual information between X and Y is also the reduction in uncertainty in Y due to knowledge of X. One consequence of this is that X contains as much information about Y as Y does about X. Moreover, since H(X, Y ) = H(X) + H(Y X) by (2.6), we have that Finally, note that I(X; Y ) = H(X) + H(Y ) H(X, Y ). I(X; X) = H(X) + H(X) H(X X) = H(X), so the mutual information of a variable with itself is the entropy of that variable s probability distribution. Following this logic, entropy is sometimes referred to as self-information. The following theorem summarizes these facts. Theorem 2.2. Let (X, Y ) two discrete random variables with with finite or countable ranges X and Y, joint probability mass function p(x, y), and individual probability mass functions p X (x) and

9 8 p Y (y). Then we have that I(X; Y ) = H(X) H(X Y ) (2.7) I(X; Y ) = H(Y ) H(Y X) (2.8) I(X; Y ) = H(X) + H(Y ) H(X, Y ) (2.9) I(X; Y ) = I(Y ; X) (2.20) I(X; X) = H(X) (2.2) Proof. Presented above, followed [5, p.2] Chain Rules for Entropy, Relative Entropy, and Mutual Information. We now generalize the previous results and definitions to sequences of n random variables and associated probability distributions. These results will be of use later, particularly when considering the entropy of a stochastic process. Theorem 2.3 (Chain Rule for Entropy). Let the discrete random variables X,X 2,...,X n have finite or countable ranges X, X 2,..., X n, have joint mass function p(x, x 2,..., x n ), and individual probability mass functions p X (x), p X2 (x),..., p Xn (x). Then H(X, X 2,..., X n ) = n H(X i X i,..., X ). (2.22) Proof. We follow [5, p.22]. Recall that p(x,..., x n ) = n i= p(x i x i,..., x ) and consider: H(X, X 2,..., X n ) = p(x, x 2,..., x n ) log p(x, x 2,..., x n ) x,x 2,...,x n = i= x,x 2,...,x n p(x, x 2,..., x n ) log = x,x 2,...,x n i= n p Xi (x i x i,..., x ) i= n p(x, x 2,..., x n ) log p Xi (x i x i,..., x ) n = p(x, x 2,..., x n ) log p Xi (x i x i,..., x ) i= x,x 2,...,x n n = p(x, x 2,..., x i ) log p Xi (x i x i,..., x ) i= x,x 2,...,x i n = H(X i X i,..., X ). i= To generalize the concept of mutual information, we will define conditional mutual information the reduction of uncertainty of X due to knowledge of Y when Z is given and use this quantity to prove a chain rule as we did with entropy. Definition 2.4 (Conditional Mutual Information). Let X, Y, and Z by discrete random variables with finite or countable ranges. Then the conditional mutual information of random variables X and Y given Z is I(X, Y Z) = H(X Z) H(X Y, Z). (2.23)

10 Theorem 2.5 (The Chain Rule for Information). Let X,X 2,...,X n, Y be discrete random variables with finite or countable ranges. Then n I(X, X 2,..., X n Y ) = I(X i ; Y X i,..., X ). (2.24) Proof. We follow [5, p.24]. i= I(X, X 2,..., X n Y ) = H(X, X 2,..., X n ) H(X, X 2,..., X n Y ) (2.25) n n = H(X i X i,..., X ) H(X i X i,..., X, Y ) (2.26) = i= i= n I(X i ; Y X, X 2,..., X n ). (2.27) i= 2.5. Jensen s Inequality and its Consequences. Jensen s Inequality is a foundational inequality for convex functions. Recall that we have defined entropy to be the sum of functions of the form log( ) as log( ) is convex, general convexity results will be useful in the study of entropy. In particular, Jensen s Inequality can be used to derive important results such as the Independence Bound on Entropy, Theorem We begin with a definition. Definition 2.6 (Convexity). Let I R be an interval. (i) A function f : I R is called convex if for every x, x 2 I and λ [0, ] we have f(λx + ( λ)x 2 ) λf(x ) + ( λ)f(x 2 ). (2.28) (ii) A function f : I R is strictly convex if we have equality in the above expression only when λ {0, }. (iii) A function f : I R is called concave if f is convex. Let us observe that when f : I R is convex then, for any x,..., x N I and any p,..., p n 0 such that i p i = we have ( ) f p i x i p i f(x i ). (2.29) i i If f is strictly convex, then above we have equality if and only if x = = x n. It can be awkward to prove that functions are convex from the definition, so we shall present a sufficient criterion. Theorem 2.7. Let I R be an interval, and f : I R be a function. If the function f has a second derivative that is non-negative (resp. positive) over I, then the function is convex (resp. strictly convex) over I. Theorem 2.7 shows that the functions are convex, while the functions x 2, x. a x, a >, x log x (x 0), log x, x, x > 0, are concave. Convexity will prove to be a useful property which is fundamental to many properties of entropy, mutual information, and other such quantities. The next result is well known. 9

11 0 Theorem 2.8 (Jensen s Inequality). Let f : I R. If f : I R is a convex function and X is a random variable with range contained in I, then E( f(x) ) f( E(X) ). (2.30) Moreover, if f is strictly convex and in (2.30) we have equality, then X = E(X) with probability, i.e. X is a.s. constant. Proof. When the range of X is finite, the inequality (2.30) is the classical Jensen s inequality (2.29). The general case follows from the finite case by passing to the limit. Jensen s Inequality is the key result used to show several important properties of entropy. Theorem 2.9 (Gibb s Inequality). Let p, q : X [0, ], be two probability mass functions on a finite or countable space X. Then D(p q) 0, (2.3) with equality if and only if p(x) = q(x) for all x X. Proof. We follow [5, p.28]. Fix a random variable with Let be the support set of p(x). Then D(p q) = x A A := {x X : p(x) > 0 } p(x) log p(x) q(x) = p(x) log q(x) (2.29) log p(x) q(x) p(x) p(x) x A x A ( ) = log x A q(x) log x X q(x) = log = 0. ( ) Observe that since log t is a strictly convex function of t, we have equality sign in ( ) if and only if the function A x q(x) p(x) R is constant, i.e., there exists c R such that q(x) = cp(x), x A. This implies that p(x) = c. q(x) = c x A x A Further, we have equality in ( ) only if q(x) =, x A q(x) = x X which implies that c =. Thus, D(p q) = 0 if and only if p(x) = q(x) for all x. Corollary 2.20 (Information Inequality). Let X, Y be discrete random variables with finite or countable ranges. Then I(X; Y ) 0, (2.32) with equality if and only if X and Y are independent. Proof. We follow [5, p.28]. Let p(x, y) denote the joint probability mass function of X, Y. Then with equality if and only if p(x, y) = p(x)p(y). I(X; Y ) = D(p(x, y) p(x)p(y)) 0, We may think of a deep cause for this result. If we find two random variables which exhibit some dependence on each other then we cannot mathematically determine which is the cause and which the effect; we can only note the degree to which they are correlated.

12 Theorem 2.2 (Conditioning Reduces Entropy). Let X, Y be discrete random variables with finite or countable ranges. Then H(X Y ) H(X), (2.33) with equality if and only if X and Y are independent. Proof. We follow [5, p.29]. 0 I(X; Y ) = H(X) H(X Y ) (2.34) This theorem states that knowing another random variable, Y can only reduce and not increase the uncertainty in X. It is important, however, to note that this is only on average. Example 2.22 below shows that it is possible for H(X Y = y) to be greater than, less than, or equal to H(X), but on the whole we have H(X Y ) = y p(y)h(x Y = y) H(X). Example Let X and Y have the following joint distribution X 2 Y Then H(X) H( 8, 7 8 ) = bit, H(X Y = ) = 0 bits, and H(X Y = 2) = bit. We calculate H(X Y ) = 3 4 H(X Y = ) + 4H(X Y = 2) = 0.25 bit. Thus, the uncertainty in X is increased if Y = 2 is observed and decreased if Y = is observed, but uncertainty decreases on the average. Theorem Let X be a discrete random variable with finite or countable range X. Then 4 8 H(X) log X, with equality if and only if X has a uniform distribution over X. Proof. We follow [5, p.29]. Let u = X be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then D(p u) = p(x) log p(x) = log X H(X) u(x) Thus by the non-negativity of relative entropy we have that 0 D(p u) = log X H(X) Theorem 2.24 (Independence Bound on Entropy). Let X, X 2,..., X n be discrete random variables taking values in a finite or countable range. Then n H(X, X 2,..., X n ) H(X i ), (2.35) with equality if and only if the X i are independent. Proof. We follow [5, p.30]. By the chain rule for entropies, n H(X, X 2,..., X n ) = H(X i X i,..., X ) i= i= n H(X i ), i=

13 2 which follows from Theorem 2.2. Equality follows if and only if each X i is independent of X i,..., X, i.e. if and only if the X i s are independent Log Sum Inequality and its Applications. In this section we continue to show consequences of the concavity of the logarithm function, the most important of which is the log-sum inequality. This will allow us to prove convexity and concavity for several of our measures including entropy and mutual information. Theorem 2.25 (Log Sum Inequality). For a, a 2,..., a n [0, ) and b, b 2,..., b n [0, ) we have n a i log a ( n ) n i i= a i log a i b n i= i i= i= b, (2.36) i with equality if and only if a i b i is constant. Proof. We follow [5, p.3]. Without loss of generality, we may assume that a i > 0 and b i > 0. The function f(t) = t log t = t log t is strictly convex, since f (t) = t log 2 e > 0, t (0, ). Hence by Jensen s inequality (2.29) we have ( ) αi f(t i ) f αi t i, for α i 0, i α i =. Setting B := n j= we obtain ai bj log a i b i which is the log-sum inequality as b j =, α i = b i B, t i = a i b i, a i bj log a i bj, The log sum inequality will allow us to prove various convexity related results. The first will be a reproving of Theorem 2.9, which states that D(p q) 0 with equality if and only if p(x) = q(x). By the log sum inequality we have that D(p q) = p(x) log p(x) q(x) ( ) p(x) p(x) log = log q(x) = 0, with equality if and only if p(x) q(x) = c. As both p(x) and q(x) are probability mass functions, we have that c =. Thus we have that D(p q) = 0 if and only if p(x) = q(x). Armed with the log-sum inequality, we can now show that relative entropy is a convex function, and that entropy is concave. Theorem 2.26 (Convexity of Relative Entropy). D(p q) is convex in the pair (p, q); that is, if (p, q ) and (p 2, q 2 ) are two pairs of probability mass functions on finite or countable spaces then, for λ [0, ]. D(λp + ( λ)p 2 λq + ( λ)q 2 ) λd(p q ) + ( λ)d(p 2 q 2 ) (2.37) Proof. We follow [5, p.32]. We can apply the log sum inequality to a term on the left hand side of (2.37) to arrive at: (λp (x)+( λ)p 2 (x)) log λp (x) + ( λ)p 2 (x) λq (x) + ( λ)q 2 (x) λp (x) log λp (x) λq (x) +( λ)p 2(x) log ( λ)p 2(x) ( λ)q 2 (x). We use the convention that 0 log 0 = 0, a log a = if a > 0, and 0 log arguments. = 0, which follow from continuity

14 3 Summing this over all values of x gives us (2.37). Theorem 2.27 (Concavity of Entropy). H(p) is a concave function of p where p is a probability mass function on a finite or countable space. Proof. We follow [5, p.32]. By Theorem 2.23 we have that H(p) = log X D(p u), where u is the uniform distribution on X outcomes. Then, since D is convex, we have that H is concave. 3. The Asymptotic Equipartition Property for Sequences of i.i.d. Random Variables The Asymptotic Equipartition Property or AEP is central to information theory. In broad terms, it claims that the outcomes of certain stochastic processes can be divided into two groups: one group is small and is far more likely to occur than the other, larger, group. This property allows for efficient data compression by ensuring that some sequences generated by a stochastic process are more likely than others. In this section we show that AEP holds for sequences of i.i.d. random variables 3.. Asymptotic Equipartition Property Theorem. To approach the Asymptotic Equipartition Property, we first need to cover several notions of convergence. Definition 3. (Convergence of Random Variables). Given a sequence of random variables {X i }, we say that the sequence {X i } converges to a random variable X (i) in probability, if ɛ > 0, P( X n X > ɛ ) 0, (ii) in mean square, if E(X n X) 2 0, (iii) with probability or almost surely, if P{lim n X n = X} =. We can now state the Asymptotic Equipartition Theorem. Theorem 3.2 (Asymptotic Equipartition Property). If {X i } is a sequence of independent identically distributed (i.i.d.) variables, with finite or countable range X, associated with the probability mass function p(x), then in probability. n log p(x, X 2,..., X n ) H(X), (3.) Proof. We follow [5, p. 58]. Recall that functions of independent random variables are themselves independent random variables. Then, since the X i are i.i.d. the log p(x i ) are i.i.d. as well. Thus we have n log p(x, X 2,..., X n ) = log p(x i ) n i E log p(x) in probability (by the weak law of large numbers) = H(X). Remark 3.3. We note that if X, X are i.i.d. discrete random variables taking values in a finite or countable space, then P(X = X ) 2 H(X). To see this denote by p the common probability mass function of these two random variables and by X their common alphabet. Then P(X = X ) = x X P(X = X = x) = x X p(x) 2 = x X p(x)2 log p(x) = E[ 2 log p(x)].

15 4 Using Jensen s inequality for the convex function 2 x we deduce E[ 2 log p(x)] 2 E[log p(x)] = 2 H(X). Definition 3.4. The ɛ-typical set A (n) ɛ, for some ɛ > 0, with respect to p(x) is the set of sequences (x, x 2,..., x n ) X n with the property 2 n(h(x)+ɛ) p(x, x 2,..., x n ) 2 n(h(x) ɛ). (3.2) We can use the AEP to prove several properties about the probability of sets in A (n) ɛ cardinality of A (n) ɛ and the Theorem 3.5. If {X i } is a sequence of independent identically distributed (i.i.d.) variables, with finite or countable range X, associated with the probability mass function p(x), and for some ɛ > 0 A (n) ɛ is the ɛ-typical set with respect to p(x), then (i) If (x, x 2,..., x n ) A (n) ɛ, then H(X) ɛ n log p(x, x 2,..., x n ) H(X) + ɛ. (ii) P(A (n) ɛ ) > ɛ for sufficiently large n. (iii) A (n) ɛ 2 n(h(x)+ɛ), where S denotes the cardinality of the set S. (iv) A (n) ɛ ( ɛ)2 n(h(x) ɛ) for sufficiently large n. Thus the typical set has probability nearly, all elements of the typical set are nearly equiprobable, and the number of elements in the typical set is nearly 2 nh(x). That is, the typical sets have high probability and are small if the entropy is small. Proof. We follow [5, p.59]. The inequality (i) follows from the condition (3.2) in the definition of A (n) ɛ : 2 n(h(x)+ɛ) p(x, x 2,..., x n ) 2 n(h(x) ɛ) = n(h(x) + ɛ) log p(x, x 2,..., x n ) n(h(x) ɛ) = H(X) ɛ n log p(x, X 2,..., X n ) H(X) + ɛ (ii) The weak law of large numbers implies that the probability of the event (X, X 2,..., X n ) A (n) ɛ tends to as n. Indeed, for any ɛ > 0 we have ( lim n P(A(n) ɛ ) = lim P ) n n log p(x, X 2,..., X n ) H(X) < ɛ =. To prove (iii) we note that = and thus we have x X n p(x) x A (n) ɛ p(x) 2 n(h(x)+ɛ) A (n) ɛ, A (n) ɛ 2 n(h(x)+ɛ). (3.3) Finally, to prove (iv) note that, for sufficiently large n, we have that P( A (n) ɛ ) > ɛ. Thus, and thus ɛ < P( A (n) ɛ ) x A (n) ɛ 2 n(h(x) ɛ) = 2 n(h(x) ɛ) A (n) ɛ, A (n) ɛ ( ɛ)2 n(h(x) ɛ).

16 3.2. Consequences of the AEP: Data Compression. Let X, X 2,..., X n be independent, identically distributed random variables with vales in the finite alphabet X drawn according to the density function p(x). We seek to find an efficient encoding system for such sequences of random variables. Note that, ɛ > 0 we can partition all sequences in X n into two sets: the ɛ-typical set A (n) ɛ and its complement, (A (n) ɛ ) c. Let there be a linear ordering on each of these sets, e.g. induced by lexicographic order on X n defined by some linear ordering on X. Then each sequence in A (n) ɛ can be represented by giving the index of the sequence in the set. As there are at most 2 n(h(x)+ɛ) sequences in A (n) ɛ by Theorem 3.5, the indexing ought to require no more than n(h(x) + ɛ) + bits the extra bit is necessary if n(h(x) + ɛ) is not an integer. These sequences are prefixed by a 0 to distinguish them from the sequences in (A (n) ɛ ) c, raising the total number of bits to n(h(x) + ɛ) + 2. Likewise, we can index each sequence not in (A (n) ɛ ) c using at most n log X + bits. Adding a for identification purposes increases the maximum number of bits to n log X + 2 Remark 3.6. We note the following about the above coding scheme: The code is one-to-one and easily decodable, assuming perfect transmission. The initial bit acts as a flag to signal the following length of code. We have been less efficient than we could have been, since we have used a brute force method to encode the sequences in (A (n) ɛ the fact that the cardinality of (A (n) ɛ ) c. In particular, we could have taken into account ) c is less than the cardinality of X n. However, this method is still efficient enough to yield an efficient description. The typical sequences have short descriptions of length approximately nh(x) Let x n denote a sequence (x, x 2,..., x n ) X n and let l(x n ) be the length of the codeword corresponding to x n. If n is large enough that P( A ɛ (n) ) ɛ, the expected value of the codeword is E[ l(x n ) ] = x n p(x n )l(x n ) = x n A ɛ (n) x n A (n) ɛ p(x n )l(x n ) + p(x n )l(x n ) x n (A (n) ɛ ) c p(x n )(n(h(x) + ɛ) + 2) + p(x x n (A (n) ɛ ) c n )( n log X + 2 ) = P( A (n) ɛ )( n(h(x) + ɛ) + 2 ) + P( (A (n) ɛ ) c )(n log X + 2 ) n( H(X) + ɛ ) + ɛn(log X ) + 2 = n( H(X) + ɛ ), where we have that ɛ = ɛ + ɛ log X + 2 n can be made arbitrarily small by the correct choices of ɛ and n. We have thus proved the following result. Theorem 3.7. Let X,..., X n be i.i.d. random variables with finite or countable range X and associated with the probability mass function p(x). Let ɛ > 0. Then there exists a code that maps sequences x n of length n into binary strings such that the mapping is one-to-one (and therefore invertible) and [ ] E n l(xn ) H(X) + ɛ, for sufficiently large n. Proof. Above, follows [5, p.6]. 5

17 High-probability Sets and the Typical Set. We have defined A (n) ɛ so that it is a small set which contains a lot of the probability of X n. However, we are left with the question of whether it is the smallest such set. We will prove that the typical set has the same number of elements as the smallest set to a first order approximation in the exponent. Definition 3.8 (Smallest Set). For each n =, 2,..., let B (n) δ P( B (n) δ ) δ. X n be the smallest set such that We shall argue that the typical set has approximately the same cardinality as the smallest set. Theorem 3.9. Let X,..., X n be i.i.d. random variables with finite or countable range X and associated with the probability mass function p(x). For δ < 2 and any δ > 0, if P( B (n) δ ) > δ, then log B(n) n δ > H(X) δ for sufficiently large n. Proof. First we prove a lemma. Lemma 3.0. Given any two sets A, B, such that P(A) > ɛ and P(B) > ɛ 2, we have that P(A B) > ɛ ɛ 2. Proof. By the Inclusion-Exclusion Principle we have that The above lemma shows that P( A (n) ɛ of inequalities. From here we conclude that P(A B) = P(A) + P(B) P(A B) > ɛ + ɛ 2 = ɛ ɛ 2. ( ɛ δ P A (n) ɛ A ɛ (n) B (n) δ B (n) δ ) ɛ δ. This leads to the following sequences 2 n(h(x) ɛ) = A (n) ɛ ) B (n) δ = A (n) ɛ B (n) δ p(x n ) B (n) δ 2 n(h(x) ɛ) B (n) δ 2 n(h(x) ɛ). B (n) δ 2 n(h(x) ɛ) ( ɛ δ) = log B (n) δ n(h(x) ɛ) + log ɛ δ = log B(n) n δ (δ>/2) > H(X) ɛ + log 2 ɛ n = n log B(n) δ H(X) δ, for δ = ɛ log 2 ɛ n which is small for appropriate ɛ and n. The theorem demonstrates that, to first order in the exponent, the set B (n) δ, must have at least 2 nh(x) elements. However, A (n) ɛ has 2 n(h(x)±ɛ. Therefore, A (n) ɛ is about the same size as the smallest high-probability set. We will now introduce notation to express equality to the first order in the exponent.

18 Definition 3. (First Order Exponential Equality). Let a n and b n be sequences of real numbers. Then we write a n =b n if lim n n log a n = 0. b n Thus, a n =b n implies that a n and b n are equal to the first order in the exponent. We can now restate the above results: If δ n 0 and ɛ n 0, then B (n) δ n = A (n) ɛ n =2 nh(x). 4. Asymptotic Equipartition Property for Markov Chains We now formally extend the results of the previous section from a sequence of i.i.d. random variables to a stochastic process. In particular, we will prove a version of the asymptotic equipartition property for Markov Chains. Throughout this section the random variables will be assumed valued in a finite alphabet X. 4.. Background Information. A (discrete time) stochastic process with state space X is a sequence {X i } i N0 of X -valued random variables. We can characterize the process by the joint probability mass functions p n (x,..., x n ) = P(X = x,..., X n = x n ), (x,..., x n ) X n, n N. One desirable property for a stochastic process to possess is stationarity. Definition 4. (Stationarity). A stochastic process is called stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index. That is, P(X = x, X 2 = x 2..., X n = x n ) = P(X +l = x, X 2+l = x 2..., X n+l = x n ), for every n, for every shift l, and for all x, x 2,..., x n X. A famous example of a stochastic process is one in which random variable is dependent only on the random variable which precedes it and is conditionally independent of all other preceding random variables. Such a stochastic process is called a Markov Chain, and we now move to offer a formal definition. Definition 4.2 (Markov Chain). A discrete stochastic process {X i } i N0 with state space X is said to be a Markov Chain if, for any n N, and any x, x 2,..., x n, x n+ X, we have P(X n+ = x n+ X n = x n, X n = x n,..., X = x ) = P(X n+ = x n+ X n = x n ). Here we can deduce that in a Markov Chain the joint probability mass function of a sequence of random variables is p(x, x 2,..., x n ) = p(x )p (x 2 x )p 2 (x 3 x 2 )... p n (x n x n ), p k (x x) = P( X k+ = x X k = x ). This motivates the importance of the next definition. Definition 4.3 (Homogeneity). A Markov Chain is called homogeneous if the conditional probability p n (x x) does not depend on n. That is, for n N 0 P( X n+ = x X n = x ) = P( X 2 = x X = x ), x, x X. Now, if {X i } i N0 is a Markov Chain with finite state space X = {x,..., x m }, X n is called the state at time n. A homogeneous Markov Chain is characterized by its initial state X 0 and a probability transition matrix P = [P i,j ] i,j m, P i,j = P( X = x j X 0 = x i, ). 7

19 8 If it is possible to transition from any state of a Markov Chain to any other with nonzero probability in a finite number of steps, then the Markov Chain is said to be irreducible. If the the greatest common denominator of the lengths of nonzero probability different paths from a state to itself is, then the Markov Chain is said to be aperiodic. Further, if the probability mass function of X n is p n (i) = P(X n = x i ), then p n+ (j) = x n p n (i)p i,j. In particular, this shows that the probability mass function of X n is completely determined by the probability mass function of the initial state X 0. A probability distribution π on the state space X is called a stationary distribution with respect to a homogeneous Markov chain (X n ) n N0 if P(X n = x) = π(x), n N 0, x X. If X = {,..., m} and the probability transition matrix is (P ij ) then we see that π is stationary if and only if m π(j) = π(i)p i,j, j =,..., m. i= The next theorem describes the central fact in the theory of Markov chains. For a proof we refer to [3, 4]. Theorem 4.4. Suppose that {X n } is an irreducible Markov chain with finite state space X. Then the following hold. (i) There exists a unique stationary distribution π on X, and for any initial data X 0 and any function f : X R we have the ergodic limit n lim f(x k ) = f(x)π(x), almost surely. (4.) n n k= x X (ii) If additionally {X n } is aperiodic, then the distribution of X n tends to the stationary distribution as n, i.e., lim P(X n = x) = π(x), x X. n The ergodic limit (4.) is very similar to the strong law of large numbers. We want to mention a special case of (4.). Fix a state x 0 X and let f : X R be the indicator function of the set {x 0 }, {, x = x 0, f(x) = 0, x x 0. In this case, the random variable, M n (x) = n f(x k ), k= represents the number of times the Markov chain visits the state x in the discrete time interval {,..., n}. The ratio F n (x) = M n(x) n then can be viewed as the frequency of these visits in this time interval. The ergodic limit then states that M n (x) lim = π(x), almost surely. (4.2) n n

20 As an example to keep in mind suppose that X consists of all the letters of the English alphabet (both capitalized and uncapitalized) and all the signs of punctuation. By parsing a very long English text we can produce a probability mass function π on X that describes the frequency of occurrence of the various symbols. We can also produce a transition probability matrix P = ( p xy ) x,y X, where p xy describes how frequently the symbol x is followed by the symbol y.the string of symbols in the long text can be viewed as one evolution of a Markov chain with state space X with stationary distribution π Entropy of Markov Chains. Suppose we have a simple Markov Chain with a finite number of states X = {x, x 2,..., x m }, with a transition probability matrix p i,j for i, j {, 2,..., m} and a stationary distribution π, π i = π(x i ). If the system is in state x i, then it transitions to state x k with probability p i,k. We can regard the transition from x i to x k as a random variable T i drawn according to the probability distribution {p i,, p i,2,..., p i,n }, and the entropy of X i is n H(T i ) = p i,k log p i,k. k= The entropy of T i will depend on i, and can be intuitively thought of as a measure of the amount of information obtained when the Markov Chain moves one step forward from the starting state of x i. Next, we define the average over the initial states of this quantity with respect to the stationary measure π, n n n H = π i H i = π i p i,k log p i,k. i= i= k= This can be interpreted as the average amount of uncertainty removed when the Markov chain moves one step ahead. H is then defined to be the entropy of the Markov Chain. Moreover, H is uniquely determined by the chosen stationary probability measure π and the transition probabilities p i,k. We now seek to generalize these concepts to the case where we move r steps ahead for r N. If the systems is in state x i, then it is easy to calculate the probability that in the next r trials we find it in the states x k, x k2,..., x kr in turn, where k, k 2,..., k r are arbitrary numbers from to n. Thus the subsequent fate of a system initially in the state A i in the next r trials can be captured by an X r -valued random variable Ti r. Moreover, the associated entropy is Hr i and is regarded as a measure of the amount of information obtained by moving ahead r steps in the chain after starting at x i. Then, in parallel to the above, H (r) = n i= π i H (r) i. This is the r-step entropy of the Markov Chain, which is the average amount of information given by moving ahead r steps in the chain. The one step entropy defined above can be written as H (). Since we have defined H (r) to be the average amount of information obtained in moving ahead r steps in a Markov Chain, then it is natural to expect that for r, s, N we have H (r+s) = H (r) + H (s), or, equivalently, H (r) = rh (). We shall show that this later condition is true by induction on r. For r = the definition holds trivially, as H () = H (). Now suppose that H (r) = rh () for a given r. We will show that H (r+) = (r + )H (). If the system is in a state x i, then the random variable which describes the 9

21 20 system s behavior over the next r + transitions can be regarded as the product of two dependent variables: (A) the variable corresponding to the transition immediately following x i with entropy H () i and (B) the random variable describing the behavior of the system in the next r trials. The entropy of this variable is H (r) k if the outcome of A was x k. Recall the relation H(A, B) = H(A) + H(B A). This gives us n H (r+) i = H () i + p i,k H (r) k. (4.3) k= Combining this relation with the definition H (r) = n i= π ih (r) = H () + = n k= n i= H (r+) = π i H () i + n i= n k= π i H (r+) i H (r) k i n π i p i,k i= gives us π k H (r) k = H () + H (r) = (r + )H (r+) Asymptotic Equipartition. Consider a homogeneous Markov Chain {X n } n N0 with finite state space X = { x,..., x n }, and a stationary distribution π : X [0, ] which obeys the law of large numbers (4.2). That is, in a sufficiently long sequence of s consecutive trials the relative frequency Mn(x) s of the occurrence of the state xx will converge to π(x) in probability, i.e., for any δ > 0 we have ( ) lim P M s (x) π(x) s s > δ = 0. (4.4) We set P k := π(x k ), and we denote by p ij the probability of transition from state x i to the state x j. A result x s X s of a sequence of s consecutive transitions in a Markov Chain be written as a sequence x s = x k x k2... x ks, k,..., k s {, 2,..., n}. Note that the probability of x s materializing depends not on the part of the chain where the sequence begins due to stationarity, and is equal to P( x s ) = P k p k k 2 p k2 k 3... p ks k s. If i, l {, 2,..., n}, and m il is the number of pairs of the form k r k r+ for r s in which k r = i and k r+ = l, then the probability of the sequence x s can be written as n n P( x s ) = P k (p m il i,l ). (4.5) i= l= Theorem 4.5. Given ɛ > 0 and η > 0, there exists S = S(ɛ, η), sufficiently large, such that for all s > S(ɛ, η) the sequences ( x s ) X s can be divided into two groups with the following properties: (i) The probability of any sequence x s in the first group satisfies the inequality log P( x s ) s H < η (4.6)

22 2 (ii) The sum of the probabilities of all sequences of the second group is less than ɛ. That is, all sequences with the exception of a very low probability group have probabilities lying between a s(h+η) and a s(h η), where a is the base of the logarithm used and H is the one-step entropy of the given chain. This parallels the partition of a probability space into typical sets and non-typical sets. Proof. We follow the approach in [6, p.6]. With δ = δ(s, η) > small enough, to be specified later, we include in the first group the sequences x s X s satisfying the following two conditions (i) P( x s ) > 0, (ii) for any i, l {, 2,..., n} the inequality holds. m i,l sp i p i,l < sδ, i, l {, 2,..., n}. (4.7) All other sequences shall be assigned to group two, and we shall show that this assignment satisfies the theorem if δ is sufficiently small and s sufficiently large. We now consider the first requirement. Suppose that x s belongs to the first group. It follows from (4.7) that m i,l = sp i p i,l + sδθ i,l for θ i,l <, and i, l {, 2,..., n}. We now substitute the expressions for m i,l into (4.5). While doing so, we note that the requirement that the given sequence be a possible outcome requires m i,l = 0 when p i,l = 0. Thus, in (4.5) we must restrict ourselves to nonzero p i,l, which we will denote by an asterisk on the product. Thus P( x s ) = P k (pi,l ) sp ip i,l +sδθ i,l It follows that = log P(x s ) = log P k s i i l = log P k + sh sδ i log P( x s ) s H < s log Pi p i,l log p i,l sδ l i θi,l log p i,l. P k l l + δ i Thus for sufficiently large s and small δ defined by η = δ log p i i,l l log p i,l. θ i,l log p i,l we have that log P( x s ) H s < η. Thus the first requirement of the theorem is satisfied. We now consider the second group. For book-keeping purposes, we note that when we sum the probabilities of the sequences in this group, we exclude the probabilities of those sequences which are impossible as their probabilities are zero. Thus we are interested in the probability P ( x s ) for sequences for which the inequality (4.7) fails for at least one a pair of indices i, l {, 2,..., n}. It suffices to calculate the quantity n n P( m i,l sp i p i,l sδ). i= l= l

23 22 Fix the indices i and l. By (4.4) we have that sufficiently large s ( ) P m i sp i < δ 2 s > ɛ. Then if the inequality m i sp i < δ 2 s is satisfied, and if we make s is sufficiently large, then m i can be made arbitrarily large. Thinking of the transition i l as a Bernoulli trial with success probability p il we expect that during m i trials we would observe m i p il such transitions. The weak law of large numbers then shows that if s is sufficiently large and thus m i is sufficiently large we have ) m i,l P( p i,l m i < δ > ɛ. 2 Combining these results gives us that the probability of satisfying the inequalities m i sp i < δ 2 s (4.8) exceeds ( ɛ) 2 > 2ɛ. However, it follows from (4.8) that p i,l m i sp i p i,l < p i,l δ 2 s < δ 2 s, which together with (4.9) and the triangle inequality gives us Thus for any i and l, we have that for sufficiently large s ( ) P m i,l sp i p i,l < δs > 2ɛ, which implies that and from this it follows that n i= l= m i,l m i p i,l < δ 2 m i δ 2 s (4.9) m i,l sp i p i,l < δs. (4.0) ( ) P m i,l sp i p i,l > δs < 2ɛ, n ( ) P m i,l sp i p i,l > δs < 2n 2 ɛ. As the right hand side of the above inequality can be made arbitrarily small as ɛ becomes arbitrarily small, we have that the sum of the probabilities of all the sequences in the second group can be made arbitrarily small for sufficiently large s. In the remainder of this section we will assume that the base of the logarithms employed is a = 2. We denote by Xɛ,η s the collection of sequences in the first group and we will refer to such sequences as (ɛ, η)-typical. By construction P( Xɛ,η s ) > ɛ if s is sufficiently large. Now observe that P( x s ) 2 s(h+η) Xɛ,η, s and we conclude x s X s ɛ,η X s ɛ,η 2 s(h+η).

24 Finally observe that ( ɛ) < P( X s ɛ,η ) = x s X s ɛ,η P( x s ) X s ɛ,η 2 (H η), so that Xɛ,η s > ( ɛ)2 (H η). We see that the (ɛ, η)-typical sets Xɛ,η s satisfy all the properties of the ɛ-typical sets in the Asymptotic Equipartition Theorem 3.5. Arguing as in the proof of Theorem 3.7 we deduce the following result. Theorem 4.6. Consider a homogeneous ergodic Markov chain (X n ) n on the finite set X with stationary distribution π and associated entropy H. Fix ɛ > 0 and denote by B the set of finite strings of 0-s and -s. Then, for every s > 0 sufficiently large there exists an injection c s : X s B such that if l s ( x s ) denotes the length of the string c s ( x s ) we have where X s is the random vector (X,..., X s ). E [ s l s( X s ) ] H + ɛ, 5. Coding and Data Compression We now begin to establish the limit for information compression by assigning short descriptions to the most probable messages, and assigning longer descriptions to less likely messages. In this chapter we will find the shortest average description length of a random variable by examining its entropy. 5.. Examples of Codes. In the sequel X will denote a finite set called alphabet. It will be equipped with a probability distribution p. We will often refer to the elements of X as the symbols or the letters of the alphabet. Think for example that X consists of all the letters (small and large caps) of the English alphabet together with all the signs of punctuation and a symbol for a blank space (separating words). One can produce a probability distribution on such a set by analyzing a very long text and measuring the frequency with each each of these symbols appears in this text. Definition 5. (Codes). Suppose that X and D are finite sets. (i) We denote by D the collection of words with alphabet D, D := n N D n. The length of a word w D is the natural number n such that w D n. (ii) A source code C or simply code for X based on D is a mapping C : X D. For x X we denote by l C (x) the length of C(x). Example 5.2. Suppose X = {yes, no} and D = {0, }. Then C(yes) = 00 and C(no) = is a source code. Definition 5.3 (Expected Length). Let p be a probability mass function on the finite alphabet X, and let D be a finite set. The expected length L(C) of a source code C : X D is the expectation of the random variable l C : X N, i.e., L(C) = E p [ l C ] = x X p(x)l C (x). 23

Lecture 5 - Information theory

Lecture 5 - Information theory Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information

More information

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity

Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Complex Systems Methods 2. Conditional mutual information, entropy rate and algorithmic complexity Eckehard Olbrich MPI MiS Leipzig Potsdam WS 2007/08 Olbrich (Leipzig) 26.10.2007 1 / 18 Overview 1 Summary

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information

(Classical) Information Theory II: Source coding

(Classical) Information Theory II: Source coding (Classical) Information Theory II: Source coding Sibasish Ghosh The Institute of Mathematical Sciences CIT Campus, Taramani, Chennai 600 113, India. p. 1 Abstract The information content of a random variable

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

10-704: Information Processing and Learning Fall Lecture 9: Sept 28

10-704: Information Processing and Learning Fall Lecture 9: Sept 28 10-704: Information Processing and Learning Fall 2016 Lecturer: Siheng Chen Lecture 9: Sept 28 Note: These notes are based on scribed notes from Spring15 offering of this course. LaTeX template courtesy

More information

4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information

4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information 4F5: Advanced Communications and Coding Handout 2: The Typical Set, Compression, Mutual Information Ramji Venkataramanan Signal Processing and Communications Lab Department of Engineering ramji.v@eng.cam.ac.uk

More information

Communications Theory and Engineering

Communications Theory and Engineering Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of

More information

An introduction to basic information theory. Hampus Wessman

An introduction to basic information theory. Hampus Wessman An introduction to basic information theory Hampus Wessman Abstract We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on

More information

Exercises with solutions (Set B)

Exercises with solutions (Set B) Exercises with solutions (Set B) 3. A fair coin is tossed an infinite number of times. Let Y n be a random variable, with n Z, that describes the outcome of the n-th coin toss. If the outcome of the n-th

More information

Information Theory. M1 Informatique (parcours recherche et innovation) Aline Roumy. January INRIA Rennes 1/ 73

Information Theory. M1 Informatique (parcours recherche et innovation) Aline Roumy. January INRIA Rennes 1/ 73 1/ 73 Information Theory M1 Informatique (parcours recherche et innovation) Aline Roumy INRIA Rennes January 2018 Outline 2/ 73 1 Non mathematical introduction 2 Mathematical introduction: definitions

More information

Coding of memoryless sources 1/35

Coding of memoryless sources 1/35 Coding of memoryless sources 1/35 Outline 1. Morse coding ; 2. Definitions : encoding, encoding efficiency ; 3. fixed length codes, encoding integers ; 4. prefix condition ; 5. Kraft and Mac Millan theorems

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

10-704: Information Processing and Learning Fall Lecture 10: Oct 3

10-704: Information Processing and Learning Fall Lecture 10: Oct 3 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 0: Oct 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy of

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

EE376A: Homework #2 Solutions Due by 11:59pm Thursday, February 1st, 2018

EE376A: Homework #2 Solutions Due by 11:59pm Thursday, February 1st, 2018 Please submit the solutions on Gradescope. Some definitions that may be useful: EE376A: Homework #2 Solutions Due by 11:59pm Thursday, February 1st, 2018 Definition 1: A sequence of random variables X

More information

Information Theory. David Rosenberg. June 15, New York University. David Rosenberg (New York University) DS-GA 1003 June 15, / 18

Information Theory. David Rosenberg. June 15, New York University. David Rosenberg (New York University) DS-GA 1003 June 15, / 18 Information Theory David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 18 A Measure of Information? Consider a discrete random variable

More information

Information Theory and Statistics Lecture 2: Source coding

Information Theory and Statistics Lecture 2: Source coding Information Theory and Statistics Lecture 2: Source coding Łukasz Dębowski ldebowsk@ipipan.waw.pl Ph. D. Programme 2013/2014 Injections and codes Definition (injection) Function f is called an injection

More information

Information measures in simple coding problems

Information measures in simple coding problems Part I Information measures in simple coding problems in this web service in this web service Source coding and hypothesis testing; information measures A(discrete)source is a sequence {X i } i= of random

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16 EE539R: Problem Set 4 Assigned: 3/08/6, Due: 07/09/6. Cover and Thomas: Problem 3.5 Sets defined by probabilities: Define the set C n (t = {x n : P X n(x n 2 nt } (a We have = P X n(x n P X n(x n 2 nt

More information

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University

Chapter 4. Data Transmission and Channel Capacity. Po-Ning Chen, Professor. Department of Communications Engineering. National Chiao Tung University Chapter 4 Data Transmission and Channel Capacity Po-Ning Chen, Professor Department of Communications Engineering National Chiao Tung University Hsin Chu, Taiwan 30050, R.O.C. Principle of Data Transmission

More information

Lecture 11: Quantum Information III - Source Coding

Lecture 11: Quantum Information III - Source Coding CSCI5370 Quantum Computing November 25, 203 Lecture : Quantum Information III - Source Coding Lecturer: Shengyu Zhang Scribe: Hing Yin Tsang. Holevo s bound Suppose Alice has an information source X that

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

Lecture 22: Final Review

Lecture 22: Final Review Lecture 22: Final Review Nuts and bolts Fundamental questions and limits Tools Practical algorithms Future topics Dr Yao Xie, ECE587, Information Theory, Duke University Basics Dr Yao Xie, ECE587, Information

More information

Homework Set #2 Data Compression, Huffman code and AEP

Homework Set #2 Data Compression, Huffman code and AEP Homework Set #2 Data Compression, Huffman code and AEP 1. Huffman coding. Consider the random variable ( x1 x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0.11 0.04 0.04 0.03 0.02 (a Find a binary Huffman code

More information

Information Theory Primer:

Information Theory Primer: Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen s inequality Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,

More information

Entropies & Information Theory

Entropies & Information Theory Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on: http://www.qi.damtp.cam.ac.uk/node/223 Quantum Information Theory Born out of Classical Information

More information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information

Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information Entropy and Ergodic Theory Lecture 4: Conditional entropy and mutual information 1 Conditional entropy Let (Ω, F, P) be a probability space, let X be a RV taking values in some finite set A. In this lecture

More information

Lecture 5: Asymptotic Equipartition Property

Lecture 5: Asymptotic Equipartition Property Lecture 5: Asymptotic Equipartition Property Law of large number for product of random variables AEP and consequences Dr. Yao Xie, ECE587, Information Theory, Duke University Stock market Initial investment

More information

Entropy as a measure of surprise

Entropy as a measure of surprise Entropy as a measure of surprise Lecture 5: Sam Roweis September 26, 25 What does information do? It removes uncertainty. Information Conveyed = Uncertainty Removed = Surprise Yielded. How should we quantify

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Lecture Notes on Digital Transmission Source and Channel Coding. José Manuel Bioucas Dias

Lecture Notes on Digital Transmission Source and Channel Coding. José Manuel Bioucas Dias Lecture Notes on Digital Transmission Source and Channel Coding José Manuel Bioucas Dias February 2015 CHAPTER 1 Source and Channel Coding Contents 1 Source and Channel Coding 1 1.1 Introduction......................................

More information

Introduction to information theory and coding

Introduction to information theory and coding Introduction to information theory and coding Louis WEHENKEL Set of slides No 4 Source modeling and source coding Stochastic processes and models for information sources First Shannon theorem : data compression

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Information Theory and Communication

Information Theory and Communication Information Theory and Communication Ritwik Banerjee rbanerjee@cs.stonybrook.edu c Ritwik Banerjee Information Theory and Communication 1/8 General Chain Rules Definition Conditional mutual information

More information

Lecture Notes for Communication Theory

Lecture Notes for Communication Theory Lecture Notes for Communication Theory February 15, 2011 Please let me know of any errors, typos, or poorly expressed arguments. And please do not reproduce or distribute this document outside Oxford University.

More information

Intro to Information Theory

Intro to Information Theory Intro to Information Theory Math Circle February 11, 2018 1. Random variables Let us review discrete random variables and some notation. A random variable X takes value a A with probability P (a) 0. Here

More information

Chapter 2: Source coding

Chapter 2: Source coding Chapter 2: meghdadi@ensil.unilim.fr University of Limoges Chapter 2: Entropy of Markov Source Chapter 2: Entropy of Markov Source Markov model for information sources Given the present, the future is independent

More information

COMM901 Source Coding and Compression. Quiz 1

COMM901 Source Coding and Compression. Quiz 1 German University in Cairo - GUC Faculty of Information Engineering & Technology - IET Department of Communication Engineering Winter Semester 2013/2014 Students Name: Students ID: COMM901 Source Coding

More information

Shannon s Noisy-Channel Coding Theorem

Shannon s Noisy-Channel Coding Theorem Shannon s Noisy-Channel Coding Theorem Lucas Slot Sebastian Zur February 2015 Abstract In information theory, Shannon s Noisy-Channel Coding Theorem states that it is possible to communicate over a noisy

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Lecture 11: Continuous-valued signals and differential entropy

Lecture 11: Continuous-valued signals and differential entropy Lecture 11: Continuous-valued signals and differential entropy Biology 429 Carl Bergstrom September 20, 2008 Sources: Parts of today s lecture follow Chapter 8 from Cover and Thomas (2007). Some components

More information

MAGIC010 Ergodic Theory Lecture Entropy

MAGIC010 Ergodic Theory Lecture Entropy 7. Entropy 7. Introduction A natural question in mathematics is the so-called isomorphism problem : when are two mathematical objects of the same class the same (in some appropriately defined sense of

More information

The binary entropy function

The binary entropy function ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

PART III. Outline. Codes and Cryptography. Sources. Optimal Codes (I) Jorge L. Villar. MAMME, Fall 2015

PART III. Outline. Codes and Cryptography. Sources. Optimal Codes (I) Jorge L. Villar. MAMME, Fall 2015 Outline Codes and Cryptography 1 Information Sources and Optimal Codes 2 Building Optimal Codes: Huffman Codes MAMME, Fall 2015 3 Shannon Entropy and Mutual Information PART III Sources Information source:

More information

Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006

Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006 Classical Information Theory Notes from the lectures by prof Suhov Trieste - june 2006 Fabio Grazioso... July 3, 2006 1 2 Contents 1 Lecture 1, Entropy 4 1.1 Random variable...............................

More information

Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code

Chapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code Chapter 2 Date Compression: Source Coding 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code 2.1 An Introduction to Source Coding Source coding can be seen as an efficient way

More information

Solutions to Set #2 Data Compression, Huffman code and AEP

Solutions to Set #2 Data Compression, Huffman code and AEP Solutions to Set #2 Data Compression, Huffman code and AEP. Huffman coding. Consider the random variable ( ) x x X = 2 x 3 x 4 x 5 x 6 x 7 0.50 0.26 0. 0.04 0.04 0.03 0.02 (a) Find a binary Huffman code

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#8:(November-08-2010) Cancer and Signals Outline 1 Bayesian Interpretation of Probabilities Information Theory Outline Bayesian

More information

Lecture 1: September 25, A quick reminder about random variables and convexity

Lecture 1: September 25, A quick reminder about random variables and convexity Information and Coding Theory Autumn 207 Lecturer: Madhur Tulsiani Lecture : September 25, 207 Administrivia This course will cover some basic concepts in information and coding theory, and their applications

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 2017 Author: Galen Reeves Last Modified: October 18, 2017 Outline of lecture: 5.1 Introduction to Lossless Source

More information

LECTURE 13. Last time: Lecture outline

LECTURE 13. Last time: Lecture outline LECTURE 13 Last time: Strong coding theorem Revisiting channel and codes Bound on probability of error Error exponent Lecture outline Fano s Lemma revisited Fano s inequality for codewords Converse to

More information

Bioinformatics: Biology X

Bioinformatics: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA Model Building/Checking, Reverse Engineering, Causality Outline 1 Bayesian Interpretation of Probabilities 2 Where (or of what)

More information

Upper Bounds on the Capacity of Binary Intermittent Communication

Upper Bounds on the Capacity of Binary Intermittent Communication Upper Bounds on the Capacity of Binary Intermittent Communication Mostafa Khoshnevisan and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame, Indiana 46556 Email:{mhoshne,

More information

ECE 587 / STA 563: Lecture 5 Lossless Compression

ECE 587 / STA 563: Lecture 5 Lossless Compression ECE 587 / STA 563: Lecture 5 Lossless Compression Information Theory Duke University, Fall 28 Author: Galen Reeves Last Modified: September 27, 28 Outline of lecture: 5. Introduction to Lossless Source

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

EECS 750. Hypothesis Testing with Communication Constraints

EECS 750. Hypothesis Testing with Communication Constraints EECS 750 Hypothesis Testing with Communication Constraints Name: Dinesh Krithivasan Abstract In this report, we study a modification of the classical statistical problem of bivariate hypothesis testing.

More information

Example: Letter Frequencies

Example: Letter Frequencies Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o

More information

lossless, optimal compressor

lossless, optimal compressor 6. Variable-length Lossless Compression The principal engineering goal of compression is to represent a given sequence a, a 2,..., a n produced by a source as a sequence of bits of minimal possible length.

More information

Example: Letter Frequencies

Example: Letter Frequencies Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o

More information

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST

Information Theory. Lecture 5 Entropy rate and Markov sources STEFAN HÖST Information Theory Lecture 5 Entropy rate and Markov sources STEFAN HÖST Universal Source Coding Huffman coding is optimal, what is the problem? In the previous coding schemes (Huffman and Shannon-Fano)it

More information

10-704: Information Processing and Learning Spring Lecture 8: Feb 5

10-704: Information Processing and Learning Spring Lecture 8: Feb 5 10-704: Information Processing and Learning Spring 2015 Lecture 8: Feb 5 Lecturer: Aarti Singh Scribe: Siheng Chen Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal

More information

Introduction to Information Entropy Adapted from Papoulis (1991)

Introduction to Information Entropy Adapted from Papoulis (1991) Introduction to Information Entropy Adapted from Papoulis (1991) Federico Lombardo Papoulis, A., Probability, Random Variables and Stochastic Processes, 3rd edition, McGraw ill, 1991. 1 1. INTRODUCTION

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Network coding for multicast relation to compression and generalization of Slepian-Wolf

Network coding for multicast relation to compression and generalization of Slepian-Wolf Network coding for multicast relation to compression and generalization of Slepian-Wolf 1 Overview Review of Slepian-Wolf Distributed network compression Error exponents Source-channel separation issues

More information

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016

MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Entropy Rate of Stochastic Processes

Entropy Rate of Stochastic Processes Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be

More information

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline

LECTURE 2. Convexity and related notions. Last time: mutual information: definitions and properties. Lecture outline LECTURE 2 Convexity and related notions Last time: Goals and mechanics of the class notation entropy: definitions and properties mutual information: definitions and properties Lecture outline Convexity

More information

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H. Problem sheet Ex. Verify that the function H(p,..., p n ) = k p k log p k satisfies all 8 axioms on H. Ex. (Not to be handed in). looking at the notes). List as many of the 8 axioms as you can, (without

More information

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2

COMPSCI 650 Applied Information Theory Jan 21, Lecture 2 COMPSCI 650 Applied Information Theory Jan 21, 2016 Lecture 2 Instructor: Arya Mazumdar Scribe: Gayane Vardoyan, Jong-Chyi Su 1 Entropy Definition: Entropy is a measure of uncertainty of a random variable.

More information

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding SIGNAL COMPRESSION Lecture 3 4.9.2007 Shannon-Fano-Elias Codes and Arithmetic Coding 1 Shannon-Fano-Elias Coding We discuss how to encode the symbols {a 1, a 2,..., a m }, knowing their probabilities,

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science Transmission of Information Spring 2006

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science Transmission of Information Spring 2006 MASSACHUSETTS INSTITUTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.44 Transmission of Information Spring 2006 Homework 2 Solution name username April 4, 2006 Reading: Chapter

More information

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute ENEE 739C: Advanced Topics in Signal Processing: Coding Theory Instructor: Alexander Barg Lecture 6 (draft; 9/6/03. Error exponents for Discrete Memoryless Channels http://www.enee.umd.edu/ abarg/enee739c/course.html

More information

Chapter 5: Data Compression

Chapter 5: Data Compression Chapter 5: Data Compression Definition. A source code C for a random variable X is a mapping from the range of X to the set of finite length strings of symbols from a D-ary alphabet. ˆX: source alphabet,

More information

Lecture 3: Channel Capacity

Lecture 3: Channel Capacity Lecture 3: Channel Capacity 1 Definitions Channel capacity is a measure of maximum information per channel usage one can get through a channel. This one of the fundamental concepts in information theory.

More information

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner

More information

Chaos, Complexity, and Inference (36-462)

Chaos, Complexity, and Inference (36-462) Chaos, Complexity, and Inference (36-462) Lecture 7: Information Theory Cosma Shalizi 3 February 2009 Entropy and Information Measuring randomness and dependence in bits The connection to statistics Long-run

More information

Some Concepts in Probability and Information Theory

Some Concepts in Probability and Information Theory PHYS 476Q: An Introduction to Entanglement Theory (Spring 2018) Eric Chitambar Some Concepts in Probability and Information Theory We begin this course with a condensed survey of basic concepts in probability

More information

ELEMENT OF INFORMATION THEORY

ELEMENT OF INFORMATION THEORY History Table of Content ELEMENT OF INFORMATION THEORY O. Le Meur olemeur@irisa.fr Univ. of Rennes 1 http://www.irisa.fr/temics/staff/lemeur/ October 2010 1 History Table of Content VERSION: 2009-2010:

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

EE5319R: Problem Set 3 Assigned: 24/08/16, Due: 31/08/16

EE5319R: Problem Set 3 Assigned: 24/08/16, Due: 31/08/16 EE539R: Problem Set 3 Assigned: 24/08/6, Due: 3/08/6. Cover and Thomas: Problem 2.30 (Maimum Entropy): Solution: We are required to maimize H(P X ) over all distributions P X on the non-negative integers

More information

3F1 Information Theory, Lecture 3

3F1 Information Theory, Lecture 3 3F1 Information Theory, Lecture 3 Jossy Sayir Department of Engineering Michaelmas 2013, 29 November 2013 Memoryless Sources Arithmetic Coding Sources with Memory Markov Example 2 / 21 Encoding the output

More information

Noisy-Channel Coding

Noisy-Channel Coding Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/05264298 Part II Noisy-Channel Coding Copyright Cambridge University Press 2003.

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

Motivation for Arithmetic Coding

Motivation for Arithmetic Coding Motivation for Arithmetic Coding Motivations for arithmetic coding: 1) Huffman coding algorithm can generate prefix codes with a minimum average codeword length. But this length is usually strictly greater

More information

ELEC546 Review of Information Theory

ELEC546 Review of Information Theory ELEC546 Review of Information Theory Vincent Lau 1/1/004 1 Review of Information Theory Entropy: Measure of uncertainty of a random variable X. The entropy of X, H(X), is given by: If X is a discrete random

More information