An introduction to basic information theory
Hampus Wessman

Abstract

We give a short and simple introduction to basic information theory, by stripping away all the non-essentials. Theoretical bounds on how efficiently information can be reliably communicated, with and without noise, are presented with proofs. All the required background is also discussed.

Contents

1 Introduction
2 Information and entropy
  2.1 Information sources
  2.2 What is entropy?
  2.3 A measure of information
3 Noiseless communication
  3.1 Introduction
  3.2 Instantaneous codes
  3.3 The noiseless coding theorem
4 Noisy communication
  4.1 Noisy channels
  4.2 Channel capacity
  4.3 Decoding
  4.4 The noisy-channel coding theorem
Appendix

1 Introduction

Information theory is a branch of mathematics which has its roots in the groundbreaking paper [3], published by Claude Shannon in 1948. Many of the basic results were already introduced by Shannon back then. Some proofs are quite sketchy in Shannon's paper, but more elaborate proofs have been published since. There are a few introductory books on the subject (e.g. [1]).

We seek here to present some of the most basic results of information theory in as simple a way as possible. Both Shannon's classic paper and many later books on the subject present the theory in a quite general way. This can be very useful, but it also makes the proofs and presentation longer and harder to comprehend. By restricting ourselves to the simplest possible cases (that are still interesting, of course) and constructing short and simple proofs, we hope to give a shorter and clearer introduction to the subject. The basic results will still be similar, and further details can easily be found elsewhere.

This text is loosely based on [1] and [3]. Most of the theorems, definitions and ideas are variations of things presented in these two works. The basic proof ideas are often similar too, but the actual proofs are quite different. Whenever possible, the proofs use less general and more compact arguments. Some new concepts are also introduced to simplify the presentation. The basic theory discussed here is fairly well known by now; most of it originates from [3]. The text also makes use of other well-known terminology and results from both basic mathematics and computer science (we try to explain everything that is not entirely obvious). We don't include references in the text for this, but that does not mean that we claim it to be original. The proof of theorem 7 is based on the classic approach used by Shannon in [3]. Shannon's proof is quite sketchy and a bit more general, though, so this proof is a more detailed variant. The introduction of definition 10 simplifies the presentation somewhat. The rest of the theory is fairly standard and not very complicated. Instantaneous codes are sometimes called prefix codes and are a well-known concept (see e.g. [1] and [2]).

In section 2, we introduce the fundamental concepts of information and entropy. Section 3 then discusses how efficiently information can be sent over noiseless channels, and section 4 discusses the same thing when the channel is noisy. The main results can be found at the end of sections 3 and 4.

2 Information and entropy

2.1 Information sources

We begin by discussing what we mean by information. The fundamental problem that we are interested in is how to communicate a message that is selected from a set of possible messages. If there is only one possible message, then the problem is trivial, because we always know what message will be or was chosen. A message like that doesn't convey any information. A more interesting information source would be one that chooses messages from a set of several possible messages. One way to generate such messages in real life would be to repeatedly throw a die and choose the number you get as your message.

In this case, the message that you threw a 2 does convey some information. We will soon define a measure of how much information is produced by a simple information source like this.

For the rest of this text, the message chosen by an information source will simply be a random variable $X$ that takes on one of the values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ respectively. This simple model is sufficient for our purposes. Sometimes, we will make several independent observations of the random variable.

2.2 What is entropy?

If someone tosses a fair coin without revealing the result, then you can't be sure what the result was. There is a certain amount of uncertainty involved. Let's say that the same person tosses a two-headed coin instead. This time, you can be completely sure what the result was. In other words, the result is not uncertain at all. Entropy is a measure of this uncertainty. Its usefulness will be seen later. Right now, we simply give the definition.

Definition 1. The entropy of the probabilities $p_1, p_2, \ldots, p_n$ is
$$H(X) = H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i.$$

Note that we use the binary logarithm here. Entropy is measured in bits. A few simple examples follow. The entropy of a fair coin toss is exactly 1.0 bit. Tossing a two-headed coin gives 0.0 bits of entropy. The entropy of throwing a six-sided die is approximately 2.58 bits.

We will also have use for conditional entropies. Suppose we have two random variables $X$ and $Y$ and we know that $Y = y$. We then define the conditional entropy of $X$ given that $Y = y$ as follows.

Definition 2.
$$H(X \mid Y = y) = -\sum_{i=1}^{n} P(X = x_i \mid Y = y) \log_2 P(X = x_i \mid Y = y),$$
where $P(X = x_i \mid Y = y)$ is the conditional probability that $X = x_i$ given that $Y = y$.

Let us also define the conditional entropy of $X$ given $Y$.

Definition 3.
$$H(X \mid Y) = \sum_{i=1}^{m} P(Y = y_i) H(X \mid Y = y_i),$$
where $P(Y = y_i)$ is the probability that $Y = y_i$. Here, we assume that $Y$ is a discrete random variable taking one of the values $y_1, y_2, \ldots, y_m$. $P$ will denote probabilities throughout the text.
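
These definitions are easy to check numerically. The following sketch in Python uses a toy example of our own choosing (not from the text): $X$ is a fair die roll and $Y$ only reveals whether the roll was even or odd. It computes the entropy of definition 1 and the conditional entropy of definitions 2 and 3.

```python
import math

def entropy(probs):
    # H(p1,...,pn) = -sum_i p_i log2 p_i, with the convention 0*log2(0) = 0
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy([1.0]))           # two-headed coin: 0.0 bits
print(entropy([1/6] * 6))       # six-sided die: about 2.58 bits

def cond_entropy(p_y, p_x_given_y):
    # H(X|Y) per definitions 2 and 3:
    # p_y[j] = P(Y = y_j), p_x_given_y[j][i] = P(X = x_i | Y = y_j)
    return sum(py * entropy(row) for py, row in zip(p_y, p_x_given_y))

p_y = [0.5, 0.5]                       # P(roll is even), P(roll is odd)
p_x_given_y = [[1/3, 1/3, 1/3],        # given "even": 2, 4 or 6, equally likely
               [1/3, 1/3, 1/3]]        # given "odd":  1, 3 or 5, equally likely
print(cond_entropy(p_y, p_x_given_y))  # log2(3), about 1.58 bits
```

Learning the parity of the roll leaves three equally likely outcomes, so the uncertainty drops from about 2.58 bits to about 1.58 bits in this example.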

2.3 A measure of information

We can now define a measure of information. Assume that we have two random variables $X$ and $Y$, both of whose values are unknown. If the value of $Y$ is revealed to us, how much will that tell us about $X$? We define the amount of information conveyed by doing so to be the corresponding decrease in entropy.

Definition 4. The amount of information revealed about $X$ when given $Y$ is
$$I(X \mid Y) = H(X) - H(X \mid Y).$$

This value will always be non-negative, but we don't prove that here.

3 Noiseless communication

3.1 Introduction

Given an information source, we now turn to the problem of encoding the messages from the information source so that they can be efficiently communicated over a noiseless channel. By a channel, we simply mean something that transmits binary data by accepting it at some point and reproducing it at some other point. A noiseless channel does this without introducing any errors. In reality, we can think of sending data through a computer network (e.g. over the Internet) or storing it on a DVD to later be read back again (DVD discs use error correction codes internally; imagine here that we just store a file and assume that we don't need to handle read errors). It doesn't really matter what we do with the data, as long as we assume that it can be perfectly recalled later.

The only requirement we have on the encoded message is that the original message should be possible to recreate. In particular, we don't care if decoding is expensive. Because we don't need to handle communication failures of any kind (we assume there will be no errors), it will be most efficient to encode the information as compactly as possible. At least, this is what we will strive for here. We will investigate the theoretical limits of how compactly data can be encoded on average. In real-world applications, it may not be feasible to go that far, for various reasons. The theory can be made more general, but we restrict ourselves to binary codes here.

Let us have a random variable $X$, as earlier. For each $x_i$ we will assign a codeword $c_i$ that consists of a sequence of one or more bits (a bit is a binary digit, that is, a 0 or a 1). Together, these code words make up a code. The code words can have different lengths. We observe the random variable one or more times, independently, to generate a sequence. Such a sequence will be called a message.

3.2 Instantaneous codes

Not all codes are of interest, because they can't always be uniquely decoded. Let's define what we mean by that.

Definition 5. A code is uniquely decodable if every finite sequence of bits corresponds to at most one message, according to that code.

Here we assume that a message is encoded by concatenating the code words for each symbol in the message, creating a sequence of bits that corresponds to that message. Decoding simply tries to do the reverse.
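
A small brute-force check can make definition 5 concrete. The sketch below (the two example codes are ours, not from the text) counts in how many ways a bit string can be split into code words: the code {0, 01, 10} is not uniquely decodable, since 010 has two valid parses, while the prefix code {0, 10, 11} never admits more than one.

```python
from functools import lru_cache

def count_parses(bits, codewords):
    """Number of ways to write the bit string as a concatenation of code words."""
    @lru_cache(maxsize=None)
    def go(i):
        if i == len(bits):
            return 1
        return sum(go(i + len(c)) for c in codewords if bits.startswith(c, i))
    return go(0)

ambiguous = ["0", "01", "10"]   # "0" is a prefix of "01": not an instantaneous code
prefix    = ["0", "10", "11"]   # no code word is a prefix of another

print(count_parses("010", ambiguous))  # 2: decodes as 0|10 or as 01|0
print(count_parses("010", prefix))     # 1: only 0|10
```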

We will take a closer look at one type of uniquely decodable codes.

Definition 6. A code is called an instantaneous code if no code word is a prefix of any other code word. (A code word of length $n_1$ is called a prefix of another code word of length $n_2$ if $n_1 \le n_2$ and the first $n_1$ bits of each code word form identical subsequences.)

In particular, two code words can't be equal in an instantaneous code. We now show that codes of this type are always uniquely decodable.

Theorem 1. Instantaneous codes are uniquely decodable.

Proof. Let's start by noting that there is a one-to-one correspondence between a sequence of code words and a message. Assume that we have a finite sequence of bits, an instantaneous code and two sequences of code words from that code, $m_1$ and $m_2$, which both agree with the sequence of bits. We need to show that $m_1 = m_2$ or, in other words, that there can't be two different such sequences. Consider the first code word in $m_1$ and $m_2$. Could they differ? No, because then one (the shorter, or both) would be a prefix of the other, and then the code would not be instantaneous, which we assumed. We can now ignore the first code words (because they are identical) and look at the rest of the code words and the rest of the bits. The same argument can be applied again, and by induction the whole sequences $m_1$ and $m_2$ must be the same.

The following theorem tells us when it is possible to construct an instantaneous code.

Theorem 2. There exists an instantaneous code with code words of the lengths $n_1, n_2, \ldots, n_m$ if and only if
$$\sum_{i=1}^{m} 2^{-n_i} \le 1.$$

Proof. Without loss of generality, assume that $n_1 \le n_2 \le \ldots \le n_m = k$. There are $2^k$ possible code words of length $k$. Let's call these base words. Any code word of length $n \le k$ can be constructed by selecting one of these base words and taking a prefix of length $n$ from it.

Now, let's construct a code by selecting a code word of length $n_1$, then one of length $n_2$, and so on. When we are selecting the first code word, any code word is possible (because there are no other code words that it can conflict with). Choose any code word of length $n_1$. This code word will be a prefix of $2^{k - n_1}$ base words. No code word after this can be a prefix of these base words, so they are now excluded. Now, select a code word of length $n_2$. We can choose any non-excluded base word and take a prefix of length $n_2$ from it. This is possible, because we know that no other code word (so far) is a prefix of these base words, and all chosen code words are at most the same length as this one, so this is enough. On the other hand, no other code words are possible, because for any other choice there already exists a code word that is a prefix of it. After choosing a code word of length $n_2$ (if possible), we need to exclude $2^{k - n_2}$ new base words. We continue like this until we have chosen all the code words we need. At every step we pick a prefix of any non-excluded base word, if there is any. No other choices are possible.

It will be possible to select all $m$ code words, so that they form an instantaneous code, if and only if at most $2^k$ base words are excluded after selecting them all. It doesn't really matter which code words we actually choose along the way. This is the same as requiring that
$$\sum_{i=1}^{m} 2^{k - n_i} \le 2^k,$$
which is equivalent to what we wanted to prove.

We will only discuss instantaneous codes, but the following theory can be generalized to all uniquely decodable codes. The results will be similar.
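
The condition in theorem 2 is easy to test, and the greedy argument in the proof translates directly into a construction. The sketch below uses the standard "canonical" assignment, which realizes the same idea: process the lengths in increasing order and give each one the smallest binary word of that length that has no earlier code word as a prefix. The example lengths are our own.

```python
def kraft_sum(lengths):
    # Theorem 2: an instantaneous code with these word lengths exists iff this sum is <= 1
    return sum(2.0 ** -l for l in lengths)

def build_instantaneous_code(lengths):
    """Build an instantaneous (prefix) code with the given word lengths,
    assuming the sum above is at most 1."""
    assert kraft_sum(lengths) <= 1, "no instantaneous code with these lengths"
    code, val, prev = [], 0, None
    for l in sorted(lengths):
        if prev is not None:
            val = (val + 1) << (l - prev)   # next available word, extended to the new length
        code.append(format(val, "0%db" % l))
        prev = l
    return code

print(kraft_sum([1, 2, 3, 3]))                 # 1.0 -> a code exists
print(build_instantaneous_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
print(kraft_sum([1, 1, 2]))                    # 1.25 -> no instantaneous code exists
```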

3.3 The noiseless coding theorem

The following theorem establishes a lower bound on how compactly information can be encoded. We prove later that there exist codes close to this bound.

Definition 7. We call
$$\bar{n} = \sum_{i=1}^{n} p_i n_i$$
the average code word length of a code, where $n_i$ is the length of the code word $c_i$.

Theorem 3 (Noiseless coding theorem). Given a random variable $X$ and a corresponding instantaneous code, $\bar{n}$ is bounded below by $H(X)$:
$$H(X) \le \bar{n}.$$

Proof. The following facts will be needed (the first is from theorem 2):
$$\sum_{i=1}^{n} 2^{-n_i} \le 1 \qquad (1)$$
$$\ln(x) \le x - 1 \;\Longrightarrow\; \log_2(x) \le \log_2(e)\,(x - 1) \qquad (2)$$

First rewrite the inequality from the theorem a bit:
$$H(X) \le \bar{n}
\;\Longleftrightarrow\; -\sum_{i=1}^{n} p_i \log_2(p_i) \le \sum_{i=1}^{n} p_i n_i
\;\Longleftrightarrow\; -\sum_{i=1}^{n} p_i \log_2(p_i) + \sum_{i=1}^{n} p_i \log_2\!\left(2^{-n_i}\right) \le 0
\;\Longleftrightarrow\; \sum_{i=1}^{n} p_i \log_2\!\left(\frac{2^{-n_i}}{p_i}\right) \le 0.$$

Now we are almost done. It follows from (1) and (2) that
$$\sum_{i=1}^{n} p_i \log_2\!\left(\frac{2^{-n_i}}{p_i}\right) \le \log_2(e) \sum_{i=1}^{n} p_i \left(\frac{2^{-n_i}}{p_i} - 1\right) = \log_2(e) \left(\sum_{i=1}^{n} 2^{-n_i} - \sum_{i=1}^{n} p_i\right) = \log_2(e) \left(\sum_{i=1}^{n} 2^{-n_i} - 1\right) \le 0.$$
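
As a quick numerical sanity check of theorem 3 (with distributions and code lengths chosen by us for illustration), the sketch below compares $H(X)$ with the average code word length $\bar{n}$ for the instantaneous code {0, 10, 110, 111}.

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def average_length(probs, lengths):
    # n-bar = sum_i p_i * n_i (definition 7)
    return sum(p * l for p, l in zip(probs, lengths))

# Probabilities that are negative powers of two: the bound is met with equality.
p1, lengths = [0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3]
print(entropy(p1), average_length(p1, lengths))   # 1.75 and 1.75

# Other probabilities with the same code: the average length exceeds H(X).
p2 = [0.4, 0.3, 0.2, 0.1]
print(entropy(p2), average_length(p2, lengths))   # about 1.85 and 1.9
```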

We still don't know how close we can get to this bound in general. It turns out that we can get very close.

Theorem 4. Given a random variable $X$, there exists an instantaneous code such that
$$H(X) \le \bar{n} < H(X) + 1.$$

Proof. For each $i$, choose $n_i$ as the integer that satisfies
$$-\log_2(p_i) \le n_i < -\log_2(p_i) + 1.$$
It follows from theorem 2 that there exists an instantaneous code with these code word lengths, because
$$\sum_{i=1}^{n} 2^{-n_i} \le \sum_{i=1}^{n} 2^{\log_2(p_i)} = \sum_{i=1}^{n} p_i = 1.$$
Furthermore, we note that by multiplying by $p_i$ above and summing over all code words, we get that
$$-\sum_{i=1}^{n} p_i \log_2(p_i) \le \sum_{i=1}^{n} p_i n_i < -\sum_{i=1}^{n} p_i \log_2(p_i) + \sum_{i=1}^{n} p_i,$$
which is equivalent to the inequality in the theorem.

It is now easy to see that it is possible, relatively speaking, to get arbitrarily close to the lower bound, by encoding several symbols together as a block and letting the length of this block tend towards infinity. If we encode a block of $N$ symbols together, using an instantaneous code, then it is possible for each symbol to use less than $1/N$ bits above the lower bound on average. Optimal instantaneous codes can be efficiently constructed using Huffman's algorithm; see almost any book about algorithms, e.g. [2] (also [1]).

4 Noisy communication

4.1 Noisy channels

We will now discuss the problem of sending information over an unreliable communication channel. The discussion will be restricted to binary memoryless symmetric channels. The channel accepts a sequence $m_1$ of $n$ bits at one point and delivers another sequence $m_2$ of $n$ bits at another point. $m_2$ is constructed by adding noise to $m_1$. The noise is produced independently for each bit: with probability $p$ the bit is unchanged and with probability $q = 1 - p$ the bit is flipped (flipping a bit simply means changing its value; there is only one way to do that).

A channel with $p = \frac{1}{2}$ is completely useless (the received sequence is independent of the sequence that was sent). When $p = 1$, the channel is noiseless and the theory from the last section applies. If $p < q$, then we can just flip all bits in $m_2$ before delivering it, to get $p > q$ instead. We will therefore assume that $\frac{1}{2} < p < 1$. This means that $0 < q < \frac{1}{2}$. We call $q$ the bitwise error probability.
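
The channel just described is easy to simulate. The following sketch (the parameters are arbitrary demo values) sends a block of random bits through a binary memoryless symmetric channel and checks that the fraction of flipped bits is close to $q$.

```python
import random

def bsc(bits, q):
    """Binary memoryless symmetric channel: each bit is flipped
    independently with probability q = 1 - p."""
    return [b ^ 1 if random.random() < q else b for b in bits]

random.seed(1)                 # arbitrary seed, for a repeatable demo
q = 0.1                        # demo value; any 0 < q < 1/2 will do
sent = [random.randint(0, 1) for _ in range(10000)]
received = bsc(sent, q)
errors = sum(s != r for s, r in zip(sent, received))
print(errors / len(sent))      # should be close to q, here roughly 0.1
```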

Let $B$ be the set of all possible bit sequences of length $n$. There are $2^n$ elements in $B$. We will always send a block of $n$ bits over the channel at a time (but we may choose to vary $n$). Furthermore, assume that there is an information source which randomly selects a message from a set of $2^{\lfloor Rn \rfloor}$ possible messages, where $R$ is a real number, $0 \le R \le 1$, and $\lfloor x \rfloor$ is the floor function of $x$. Let $M$ denote the set of all possible messages. $R$ is called the rate of the information source.

Assign an element $c(m) \in B$ to each $m \in M$. These are called code words. The mapping of messages to code words is called a code. Note that we don't require the code to assign a unique code word to each message. Our goal is to send $c(m)$ over the channel and recover $m$ from the received sequence $r \in B$ with high probability. We will later show when this is possible.

4.2 Channel capacity

The following definition will be very important. Let $X$ be a random variable that takes on one of the values 0 and 1 with the corresponding probabilities $p_0$ and $p_1$. We send the value of $X$ over our noisy channel. Let $Y$ be the received bit. Then $I(X \mid Y)$ is a measure of the information conveyed by sending this bit over the channel. It depends on $p_0$ and $p_1$ (and the channel).

Definition 8. The channel capacity (for this kind of channel) is
$$C = \max_{p_0, p_1} I(X \mid Y) = \max_{p_0, p_1} \left( H(X) - H(X \mid Y) \right).$$

The channel capacity is defined similarly for more complicated channels. In our case, the maximum is always achieved when $p_0 = p_1 = \frac{1}{2}$, and with those probabilities we get that
$$C = H(X) - H(X \mid Y) = 1 + p \log_2(p) + q \log_2(q).$$
We will continue to assume that $X$ and $Y$ are as above (with $p_0 = p_1 = \frac{1}{2}$), so that
$$H(X \mid Y) = -(p \log_2(p) + q \log_2(q)). \qquad (3)$$

4.3 Decoding

We will present one way to decode the received bit sequence here. It will not necessarily be the best possible way, but it will be sufficient for our purposes. Without any noise, it would be fairly trivial to decode the received sequence. Let us therefore examine the noise that is introduced by the channel. We are mainly interested in the number of bits that are flipped. We will call this the number of errors.

Definition 9 (Hamming distance). For $x, y \in B$, let $d(x, y)$ be the number of bits that are different in $x$ and $y$, when comparing bits at the same position in each sequence.

When we send a message $m \in M$ and receive a binary sequence $r \in B$, the number of errors can be written as $e = d(c(m), r)$. Both $e$ and $r$ are random variables here (they are functions of the random noise and $m$). For each of the $n$ bits in the code word, the probability of a transmission error for that bit is $q$ (as discussed above).
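
Definition 8 and the capacity formula can be checked numerically for this channel. The sketch below (with a demo value $p = 0.9$ of our own choosing) computes $I(X \mid Y) = H(X) - H(X \mid Y)$ for a range of input distributions $(p_0, 1 - p_0)$ and compares the maximum with $1 + p \log_2(p) + q \log_2(q)$; the maximum is indeed attained at $p_0 = \frac{1}{2}$.

```python
import math

def h(probs):
    return sum(-x * math.log2(x) for x in probs if x > 0)

def info_x_given_y(p0, p):
    """I(X|Y) = H(X) - H(X|Y) for the binary symmetric channel with
    crossover probability q = 1 - p and input distribution (p0, 1 - p0)."""
    q, p1 = 1 - p, 1 - p0
    py0 = p0 * p + p1 * q                    # P(Y = 0)
    py1 = p0 * q + p1 * p                    # P(Y = 1)
    hxy = py0 * h([p0 * p / py0, p1 * q / py0]) + py1 * h([p0 * q / py1, p1 * p / py1])
    return h([p0, p1]) - hxy

p = 0.9                                                     # demo value, so q = 0.1
best = max(info_x_given_y(p0 / 1000, p) for p0 in range(1, 1000))
print(best)                                                 # maximum over p0, attained at p0 = 1/2
print(1 + p * math.log2(p) + (1 - p) * math.log2(1 - p))    # C from the formula above, same value
```

Returning to the decoding problem: since each of the $n$ bits is flipped independently with probability $q$, the number of errors $e$ is binomially distributed with mean $nq$.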

The weak law of large numbers directly gives us the following result.

Theorem 5. Let $e$ be the number of errors when sending an arbitrary code word of length $n$ over the channel. Then, for any $\delta > 0$,
$$\lim_{n \to \infty} P\big(n(q - \delta) < e < n(q + \delta)\big) = 1,$$
where $P(\cdot)$ denotes probability.

This is very useful. Let us make a related definition and then see how this may help us decode received sequences.

Definition 10. For any $b \in B$ and $\delta > 0$, define the set of potential code words to be
$$S(b; \delta) = \{\, x \in B \mid n(q - \delta) < d(b, x) < n(q + \delta) \,\}.$$

It follows from theorem 5 that $\lim_{n \to \infty} P(c(m) \in S(r; \delta)) = 1$ (with $m$ and $r$ as above and $\delta > 0$). We can thus assume that $c(m) \in S(r; \delta)$ and make the probability that we are wrong arbitrarily small by choosing a large enough $n$. We therefore look at all messages $x$ such that $c(x) \in S(r; \delta)$. If there is only one such message, we will assume that it is $m$ and decode the received sequence as this message. If there is more than one such message, then we simply choose to fail. We have made very few assumptions about the code so far, so we don't know how close together the code words lie. It is even possible that the decoding will always fail. This will be discussed further later.

It will be useful to know the number of elements in $S(m; \delta)$. The following theorem gives an estimate of that.

Theorem 6. Let $|S(m; \delta)|$ denote the number of elements in $S(m; \delta)$ and let $H(X \mid Y)$ be as in equation (3) above. Then, for any $\epsilon > 0$ and $m \in B$ there is a $\delta > 0$ such that
$$|S(m; \delta)| \le 2^{n(H(X \mid Y) + \epsilon)}.$$

Proof. Assume that we send $m \in B$ over the noisy channel and receive a sequence $r \in B$ (so that $r$ is a random variable, because of the noise). Consider an element $x \in S(m; \delta)$. The probability that $x$ is the received sequence depends on $d(m, x)$ (this is the number of bit errors needed to turn $m$ into $x$). A sequence with more errors is always less likely. For all $x \in S(m; \delta)$ we know that $d(m, x) < n(q + \delta)$, so that
$$P(r = x) \ge p^{n(p - \delta)} q^{n(q + \delta)}, \qquad x \in S(m; \delta),$$
where $P(r = x)$ is the probability that the received sequence is $x$ (when sending $m$). This, in turn, gives us a lower bound on the following average probability:
$$|S(m; \delta)|^{-1} \sum_{x \in S(m; \delta)} P(r = x) \ge p^{n(p - \delta)} q^{n(q + \delta)}.$$
We also know that the probability that $r \in S(m; \delta)$ is at most 1, so that
$$\sum_{x \in S(m; \delta)} P(r = x) \le 1.$$
All of the above shows that
$$|S(m; \delta)| = \frac{\sum_{x \in S(m; \delta)} P(r = x)}{|S(m; \delta)|^{-1} \sum_{x \in S(m; \delta)} P(r = x)} \le \frac{1}{p^{n(p - \delta)} q^{n(q + \delta)}} = 2^{\,n\left(-p \log_2(p) - q \log_2(q) + \delta \log_2(p/q)\right)}.$$
Finally, note that $H(X \mid Y) = -(p \log_2(p) + q \log_2(q))$ (from (3) above) and choose $\delta > 0$ such that $\epsilon = \delta \log_2(p/q)$ (this is possible, because $p > q$). Then we get that
$$|S(m; \delta)| \le 2^{\,n\left(-p \log_2(p) - q \log_2(q) + \delta \log_2(p/q)\right)} = 2^{n(H(X \mid Y) + \epsilon)}.$$
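
The bound in theorem 6 can be checked numerically: $|S(m; \delta)|$ is just a sum of binomial coefficients, since it counts the words at each Hamming distance $d$ with $n(q - \delta) < d < n(q + \delta)$. The parameters below ($q = 0.1$, $\delta = 0.0185$, $n = 1000$) are demo values of our own choosing, with $\epsilon = \delta \log_2(p/q)$ chosen as at the end of the proof.

```python
import math

def h_cond(q):
    # H(X|Y) = -(p log2 p + q log2 q), as in equation (3)
    p = 1 - q
    return -(p * math.log2(p) + q * math.log2(q))

q, delta, n = 0.1, 0.0185, 1000               # demo parameters, not from the text
eps = delta * math.log2((1 - q) / q)          # the choice made at the end of the proof

# |S(m; delta)| counts words at distance d from m with n(q - delta) < d < n(q + delta).
lo, hi = n * (q - delta), n * (q + delta)
size = sum(math.comb(n, d) for d in range(n + 1) if lo < d < hi)

print(math.log2(size))                        # about 519 here
print(n * (h_cond(q) + eps))                  # about 528 here: the bound of theorem 6 holds
```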

4.4 The noisy-channel coding theorem

For this theorem, we will assume that the situation is as described above. An information source randomly selects a message $m$ out of $2^{\lfloor Rn \rfloor}$ possible messages (where $\lfloor x \rfloor$ is the floor function of $x$). A code assigns a code word $c(m) \in B$ to the message, and the code word is sent over the channel. The received sequence $r \in B$ is then decoded as described above. The channel capacity is denoted $C$ as before. We say that the communication was successful if we manage to decode $r$ and the decoded message is the sent message $m$. Otherwise, we say that there was a communication error. Each message $m_i$ will be selected with probability $p_i$. For a certain code $c$, let
$$P(\text{error} \mid c) = \sum_i p_i P(\text{error} \mid m_i, c)$$
be the (average) probability of error, where $P(\text{error} \mid m_i, c)$ is the probability of error when sending the message $m_i$ using the code $c$.

Theorem 7 (Noisy-channel coding theorem). Assume that $0 < R < C$, where $R$ is the rate of the information source and $C$ is the channel capacity (see section 4.1 for a description of $R$ and definition 8 for a definition of $C$). Then there exists a sequence of codes $c_1, c_2, \ldots, c_n, \ldots$ with code word lengths $1, 2, \ldots, n, \ldots$, such that
$$\lim_{n \to \infty} P(\text{error} \mid c_n) = 0.$$

Proof. The basic idea is as follows:

1. Select a random code, by independently assigning a random code word of length $n$ to each message (with all potential code words being equally likely and duplicates being allowed).

2. Show that the expected value of the error probability (for a randomly chosen code) can be made arbitrarily small by selecting a large enough $n$.

3. Conclude that, for each $n$, there must be at least one code whose probability of error is less than or equal to the expected value. Create a sequence of such codes and we are done.

The rest of the proof will elaborate on step 2. Let us begin by taking a closer look at the expected value mentioned above. Note that there are $N_m = 2^{\lfloor Rn \rfloor}$ possible messages and $N_c = (2^n)^{N_m}$ possible codes. Let $D$ be the set of all possible codes. For a random code $c$ (that is, $c$ is not a fixed code here), we are interested in
$$E\big(P(\text{error} \mid c)\big) = \sum_{d \in D} N_c^{-1} P(\text{error} \mid d) = \sum_{d \in D} N_c^{-1} \sum_{i=1}^{N_m} p_i P(\text{error} \mid m_i, d) = \sum_{i=1}^{N_m} p_i \sum_{d \in D} N_c^{-1} P(\text{error} \mid m_i, d).$$

Let's choose an arbitrary message $m$ and see what $\sum_{d \in D} N_c^{-1} P(\text{error} \mid m, d)$ will be. This is simply the probability of error when randomly choosing a code (as above) and then sending the message using that code. We decode the received sequence $r$ as before. It will be more convenient to calculate the probability of success, so we will focus on that. For the communication to be successful, two things need to happen. First, we must have that $c(m) \in S(r; \delta)$. If that is true, then for all other messages $x \in M \setminus \{m\}$ we must have that $c(x) \notin S(r; \delta)$. If both these things are true, then the communication will be successful. The probability of success is thus the product of the probabilities of each of these things happening. We will let $P(\cdot)$ denote probabilities. In other words, when using a random code $c$, we have that
$$1 - \sum_{d \in D} N_c^{-1} P(\text{error} \mid m, d) = P\big(c(m) \in S(r; \delta)\big) \, P\big(\forall x \in M \setminus \{m\}: c(x) \notin S(r; \delta)\big).$$

We already know that $\lim_{n \to \infty} P(c(m) \in S(r; \delta)) = 1$ for any $\delta > 0$. It remains to be shown that also $\lim_{n \to \infty} P(\forall x \in M \setminus \{m\}: c(x) \notin S(r; \delta)) = 1$, at least for one $\delta > 0$. In that case, it is clear that $\lim_{n \to \infty} E(P(\text{error} \mid c)) = 0$, as required above.

What is the probability that no other message belongs to $S(r; \delta)$? The code is selected by independently choosing a random code word for each message. It is, therefore, quite easy to calculate this. There are $2^{\lfloor Rn \rfloor} - 1$ other messages and $2^n$ possible code words. Note that $|S(r; \delta)| \le 2^{n(H(X \mid Y) + \epsilon)}$ and $H(X \mid Y) = 1 - C$ (see equation (3) above). The probability is
$$P\big(\forall x \in M \setminus \{m\}: c(x) \notin S(r; \delta)\big) = \left(1 - \frac{|S(r; \delta)|}{2^n}\right)^{2^{\lfloor Rn \rfloor} - 1} \ge \left(1 - \frac{2^{n(H(X \mid Y) + \epsilon)}}{2^n}\right)^{2^{\lfloor Rn \rfloor}} = \left(1 - 2^{n(H(X \mid Y) - 1 + \epsilon)}\right)^{2^{\lfloor Rn \rfloor}} = \left(1 - 2^{-n(C - \epsilon)}\right)^{2^{\lfloor Rn \rfloor}}.$$

Now, choose $0 < \epsilon < C - R$ and a corresponding $\delta > 0$ (see theorem 6), and let $t = 2^{\lfloor Rn \rfloor}$ and $k = (C - \epsilon)/R$.

Then $k > 1$ and $t \to \infty$ as $n \to \infty$. We get that
$$\left(1 - 2^{-n(C - \epsilon)}\right)^{2^{\lfloor Rn \rfloor}} \ge \left(1 - t^{-k}\right)^t,$$
since $2^{-n(C - \epsilon)} \le 2^{-\lfloor Rn \rfloor (C - \epsilon)/R} = t^{-k}$. It is easy to show that $\lim_{t \to \infty} (1 - t^{-k})^t = 1$ (see the appendix). We can therefore make the probability of error arbitrarily small, by choosing a large enough $n$ and suitable $\epsilon$ and $\delta$. This completes our proof.
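
The random-coding argument lends itself to a small Monte Carlo illustration. The sketch below (all parameters are demo values of our own choosing: $q = 0.05$, $R = 0.3$, $\delta = 0.1$, so that $R$ is well below $C \approx 0.71$) draws a fresh random code for every trial, sends a random message over the simulated channel, applies the decoder of section 4.3, and estimates the average error probability for a few block lengths $n$.

```python
import math, random

def bsc_noise(n, q):
    # an n-bit noise pattern: each bit set (flipped) independently with probability q
    e = 0
    for i in range(n):
        if random.random() < q:
            e |= 1 << i
    return e

def hamming(a, b):
    return bin(a ^ b).count("1")

def error_rate(n, R, q, delta, trials=200):
    errors = 0
    for _ in range(trials):
        num_msgs = 2 ** math.floor(R * n)
        code = [random.getrandbits(n) for _ in range(num_msgs)]  # random code, duplicates allowed
        m = random.randrange(num_msgs)                           # message to send
        r = code[m] ^ bsc_noise(n, q)                            # received sequence
        lo, hi = n * (q - delta), n * (q + delta)
        candidates = [i for i, c in enumerate(code) if lo < hamming(c, r) < hi]
        if candidates != [m]:      # decoding succeeds only if m is the unique candidate
            errors += 1
    return errors / trials

random.seed(0)
q, R, delta = 0.05, 0.3, 0.1       # R is well below C = 1 - H(q), roughly 0.71
for n in (10, 20, 30, 40):
    print(n, error_rate(n, R, q, delta))   # the error rate should drift towards 0 as n grows
```

Because the codes are drawn at random, the estimates fluctuate from run to run, but the downward trend as $n$ grows matches step 2 of the proof.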

Appendix

We will show here that $\lim_{t \to \infty} (1 - t^{-k})^t = 1$ when $k > 1$. We make the change of variables $t = s^{-1}$, and in the third step we make use of l'Hôpital's rule (here $\log$ denotes the natural logarithm):
$$\lim_{t \to \infty} \left(1 - t^{-k}\right)^t = \exp\left(\lim_{t \to \infty} t \log\left(1 - t^{-k}\right)\right) = \exp\left(\lim_{s \to 0^+} \frac{\log\left(1 - s^k\right)}{s}\right) = \exp\left(\lim_{s \to 0^+} \frac{-k s^{k-1}}{1 - s^k}\right) = \exp\left(-\frac{\lim_{s \to 0^+} k s^{k-1}}{\lim_{s \to 0^+} \left(1 - s^k\right)}\right) = \exp\left(-\frac{0}{1}\right) = e^0 = 1.$$
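
A quick numerical check of this limit, with an arbitrary $k = 1.5$:

```python
k = 1.5                              # any k > 1 works
for t in (10, 100, 10_000, 1_000_000):
    print(t, (1 - t ** -k) ** t)     # tends to 1 as t grows
```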

References

[1] Robert B. Ash, Information Theory. Dover Publications, New York, 1990.

[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, Introduction to Algorithms. MIT Press and McGraw-Hill, 2009.

[3] Claude E. Shannon, A Mathematical Theory of Communication. Originally published in The Bell System Technical Journal, 1948.