An Introduction to Information Theory: Notes

Jon Shlens
jonshlens@ucsd.edu
03 February 2003

1 Preliminaries

1.1 Goals

1. Define the basic set-up of information theory.
2. Derive why entropy is the measure of information capacity.
3. Discuss the basics of mutual information.
4. Solve the binary symmetric channel.

1.2 Probability Theory

Probability distribution relations (a short numerical check of these relations appears at the end of Section 2.1):

joint : P(X, Y) \neq P(X) P(Y) (in general)

marginal : P(X) = \sum_{y_i \in Y} P(X, Y = y_i)

conditional : P(X) = \sum_{y_i \in Y} P(Y = y_i) P(X | Y = y_i)

Bayes' rule : P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

iid = independently and identically distributed:

P(X_1, X_2) = P(X_1) P(X_2)

2 A Simple Example

2.1 Situation A

Pretend we would like to buy and sell a particular commodity, how about pork bellies at the Chicago Mercantile Exchange in 1860. We talk to our trader every day and tell him one action a day: BUY, SELL or HOLD. One day we decide that we are going on a trip to Europe, but we would like to keep trading. Because phones don't exist, we decide on a simple system. We use a telegraph line to send a Morse code signal of a dot (denoted 0) or a dash (denoted 1). Here is what we agree on: we will BUY and SELL exactly one half of the days; we will never HOLD. We will send a 0 repeatedly if it is a BUY and a 1 repeatedly if it is a SELL.
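As a quick numerical check of the probability relations in Section 1.2, here is a minimal Python sketch (my own illustration, not part of the original notes) that encodes Situation A as a joint probability table and verifies the marginal and Bayes' rule identities. The array layout and variable names are assumptions of the sketch.

    import numpy as np

    # Joint table P(X, Y) for Situation A (noiseless channel).
    # Rows: X in {BUY, SELL}; columns: Y in {0, 1}.
    P_XY = np.array([[0.5, 0.0],   # BUY  -> sent and received as 0
                     [0.0, 0.5]])  # SELL -> sent and received as 1

    # Marginals: P(X) = sum_y P(X, Y = y) and P(Y) = sum_x P(X = x, Y).
    P_X = P_XY.sum(axis=1)
    P_Y = P_XY.sum(axis=0)
    print(P_X, P_Y)                      # [0.5 0.5] [0.5 0.5]

    # Bayes' rule: P(X | Y) P(Y) = P(Y | X) P(X) = P(X, Y).
    P_X_given_Y = P_XY / P_Y             # column y holds P(X | Y = y)
    P_Y_given_X = P_XY / P_X[:, None]    # row x holds P(Y | X = x)
    assert np.allclose(P_X_given_Y * P_Y, P_XY)
    assert np.allclose(P_Y_given_X * P_X[:, None], P_XY)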
2.2 Qualitative Analysis of A

question: what does the trader learn by receiving a 0 or 1?

before signal: equal chance of a BUY or a SELL, but never a HOLD.

after signal: a 0 or 1 denotes, with 100% certainty, either a BUY or a SELL.

2.3 Situation B

Same situation as above, but now let us say that our telegraph machine is noisy. Most of the time that we press a 0 or 1, the trader receives a 0 or 1, respectively. But occasionally, say 10% of the time, the trader receives the opposite.

2.4 Qualitative Analysis of B

question: what does the trader learn by receiving a 0 or 1?

before signal: same as Situation A.

after signal: not 100% certain what the order was. However, the trader does have a good hunch.

2.5 Statements about Classical Information Theory

1. There exists a preset, agreed-upon model between the sender and the receiver.

2. Information is usually measured in bits: 1 bit = 1 question in the game of 20 questions.

3. Information is selection between possible alternatives. deep point: the quantity of information does not depend on the complexity of the preset alternatives.

3 Intuitive Example

Pretend we have a set of possible messages X = {x_1, x_2, ..., x_N}, all with equal probability {p_i = 1/N for i = 1, 2, ..., N}. We plan to send only one message x_i through our channel.

3.1 A Simple Game

Each element of X is labeled with a number j = 1, 2, ..., N. Pretend that you are the sender and you are about to transmit one symbol x_i. Your friend will be the receiver. Let your friend try to guess which symbol you will send. This game is a formal version of 20 questions (a sketch of the halving strategy follows below). question: how many questions does your friend need to select among N equally probable numbers?
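The following Python sketch (my own illustration; the halving strategy is the standard binary-search approach, not something the notes spell out) plays the game by always asking whether the label lies in the lower half of the remaining candidates. The worst-case number of questions equals log2 N when N is a power of two, which motivates the definition in the next section.

    import math

    def questions_needed(target: int, n: int) -> int:
        """Count the yes/no questions a halving strategy asks to find
        `target` among n equally probable labels 0 .. n-1."""
        lo, hi, questions = 0, n, 0
        while hi - lo > 1:
            mid = (lo + hi) // 2
            questions += 1               # ask: "is the label less than mid?"
            if target < mid:
                hi = mid
            else:
                lo = mid
        return questions

    for n in (2, 8, 64):
        worst = max(questions_needed(t, n) for t in range(n))
        print(n, worst, math.log2(n))    # worst case matches log2(n)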
3.2 Define the uncertainty

If the answers to your questions are yes or no, then we can attach an equation to this situation. Let H be the average minimum number of questions your friend needs to guess which symbol you will send. Since each yes/no question at best halves the set of alternatives, H questions distinguish 2^H alternatives, so

2^H = N

H = \log_2 N

We define H as the Shannon entropy.

4 Entropy

1. The Shannon entropy is one and the same as the entropy from thermodynamics.

2. Entropy measures the number of possible states in a system. It is equivalent to a measure of uncertainty, variability or even concentration in a pdf.

3. In base 2 the units of entropy are bits. In many theoretical treatments base e is used, and entropy is measured in nats.

4. The most general form of entropy for X = {x_1, x_2, ..., x_N} and P = {p_1, p_2, ..., p_N} (non-equal probabilities) is (a numerical sketch of this formula appears after Section 5.1):

H(P(X)) \equiv \sum_i p_i \log_2 \frac{1}{p_i} = -\sum_{i=1}^{N} p_i \log_2 p_i

5. Entropies can be generalized to continuous distributions.

discrete distributions: H \geq 0

continuous distributions: H can be negative

5 Properties of Entropy

5.1 Sending two symbols

The same equiprobable situation as the previous example; however, this time we will send two symbols x_i and x_j. What is the entropy of sending two symbols x_i and x_j? Since the symbols are independent and each pair (x_i, x_j) has probability 1/N^2,

H(x_i, x_j) = -\sum_{i,j=1}^{N} \frac{1}{N^2} \log_2 \frac{1}{N^2}

H(x_i, x_j) = \log_2 N^2

H(x_i, x_j) = 2 \log_2 N

H(x_i, x_j) = \log_2 N + \log_2 N
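Here is a minimal Python sketch of the general entropy formula from Section 4 (the helper name entropy and the test values are assumptions of the sketch); it also checks the two-symbol additivity just derived.

    import numpy as np

    def entropy(p):
        """Shannon entropy in bits: H = -sum_i p_i log2 p_i, with 0 log 0 = 0."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]                     # drop zero-probability outcomes
        return float(-np.sum(p * np.log2(p)))

    N = 8
    p = np.ones(N) / N
    print(entropy(p))                    # log2(8) = 3.0 bits for one symbol

    # Two independent equiprobable symbols: a joint over N * N pairs.
    joint = np.outer(p, p)
    print(entropy(joint))                # 2 * log2(8) = 6.0 bits: additive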
5.2 Conclusion

1. Information is additive.

2. Entropy grows as more symbols are sent. Entropy is an extensive quantity.

6 Mutual Information

The goal is to formally quantify the reduction in uncertainty by examining the appropriate subtraction of entropies. Let us first look at the probability distributions of the receiver before and after one symbol is sent.

beforehand: P(X)

afterwards: P(X | Y = y_i), where outcome y_i is received with probability P(Y = y_i)

By our definition of uncertainty, the reduction in entropy between the two probability states is defined as the mutual information:

I(X; Y) = H(P(X)) - H(P(X | Y))

or

I(X; Y) = H(X) - H(X | Y)

where H(X | Y) = \sum_i P(Y = y_i) H(X | Y = y_i) is the entropy of X averaged over the received symbols. Mutual information is also measured in bits.

6.1 Relations between entropies

I will not justify these statements here, but they are easy to work out. Regardless of whether one remembers the details of these equations, it is much easier to remember the Venn diagram in Figure 1.

Figure 1: Venn diagram of the relations between variable entropies

mutual information is symmetric:

I(X; Y) = I(Y; X)

mutual information can be defined many ways:

I(X; Y) = H(X) - H(X | Y)

I(X; Y) = H(Y) - H(Y | X)

I(X; Y) = H(X) + H(Y) - H(X, Y)

The last identity also gives a convenient way to compute mutual information from a joint table, as in the sketch below.

7 Simple Examples, Returned

We will now come full circle and calculate the mutual information I in the two beginning examples. In other words, we will formally quantify our previous qualitative notions.
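As a computational companion to the identities in Section 6.1, here is a minimal sketch that computes I(X; Y) = H(X) + H(Y) - H(X, Y) from a joint table (the helper names and the example table are my own). Because the formula treats X and Y symmetrically, transposing the table leaves the result unchanged, matching I(X; Y) = I(Y; X).

    import numpy as np

    def entropy(p):
        """Shannon entropy in bits, with the convention 0 log 0 = 0."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(P_XY):
        """I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint table."""
        H_X = entropy(P_XY.sum(axis=1))  # entropy of the X marginal
        H_Y = entropy(P_XY.sum(axis=0))  # entropy of the Y marginal
        return H_X + H_Y - entropy(P_XY)

    P_XY = np.array([[0.45, 0.05],       # a noisy channel, for illustration
                     [0.05, 0.45]])
    print(mutual_information(P_XY))      # ~0.531 bits
    print(mutual_information(P_XY.T))    # identical: I(X;Y) = I(Y;X)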
7.1 Example A: Returned

X = {BUY, SELL, HOLD}, P(X) = {1/2, 1/2, 0}

The entropy beforehand is H(X):

H(X) = -\left[ \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} + 0 \log_2 0 \right]

But notice that 0 \log_2 0 is ill-defined as written, since \log_2 0 diverges. This brings up the complicated issue of support, which authors go to great lengths to address. The simple, ad-hoc way of avoiding these proofs is just to state that, in the context of information theory, 0 \log_b 0 \equiv 0. Thus

H(X) = -\left[ \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} + 0 \right] = -\left[ -\frac{1}{2} - \frac{1}{2} \right] = 1

Before calculating H(X | Y), we need to compute P(X | Y):

P(X | Y = 0) = {1, 0, 0}

P(X | Y = 1) = {0, 1, 0}

Now we can compute the associated entropy:

H(X | Y) = P(Y = 0) H(X | Y = 0) + P(Y = 1) H(X | Y = 1) = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 0 = 0

Therefore, the mutual information is I(X; Y) = H(X) - H(X | Y) = 1 - 0 = 1 bit.
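Example A can also be checked numerically. The sketch below (self-contained; the variable names are my own) mirrors the hand calculation: the entropy of the X marginal minus the average conditional entropy.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]                     # convention: 0 log 0 = 0
        return float(-np.sum(p * np.log2(p)))

    # Example A's noiseless channel as a joint table
    # (rows: BUY, SELL; columns: received 0, 1; HOLD never occurs).
    P_XY = np.array([[0.5, 0.0],
                     [0.0, 0.5]])
    H_X = entropy(P_XY.sum(axis=1))      # H(X) = 1 bit beforehand
    H_X_given_Y = sum(P_XY[:, y].sum() * entropy(P_XY[:, y] / P_XY[:, y].sum())
                      for y in (0, 1))   # H(X|Y) = 0 bits afterwards
    print(H_X - H_X_given_Y)             # I(X;Y) = 1.0 bit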
Figure 2: Mutual information I(X; Y) in a binary symmetric channel, plotted as a function of the error probability p.

7.2 Example B: Returned

First of all, I need to state beforehand that this problem is a famous first-chapter problem in any information theory textbook. It is often called the binary symmetric channel or, more colloquially, the noisy typewriter. Let us just state all of the probability distributions before calculating the entropies. Let the variable p = 0.1 be the probability of incorrect transmission.

P(X) = {1/2, 1/2, 0}

P(X | Y = 0) = {1 - p, p, 0}

P(X | Y = 1) = {p, 1 - p, 0}

We can now calculate all of the entropies.

H(X) = 1

H(X | Y = 0) = -\left[ (1 - p) \log_2 (1 - p) + p \log_2 p \right] = (1 - p) \log_2 \frac{1}{1 - p} + p \log_2 \frac{1}{p}

H(X | Y = 1) = -\left[ (1 - p) \log_2 (1 - p) + p \log_2 p \right] = (1 - p) \log_2 \frac{1}{1 - p} + p \log_2 \frac{1}{p}

Finally we can calculate the mutual information.

I(X; Y) = H(X) - \left[ P(Y = 0) H(X | Y = 0) + P(Y = 1) H(X | Y = 1) \right]

I(X; Y) = 1 - \left[ \frac{1}{2} \left( (1 - p) \log_2 \frac{1}{1 - p} + p \log_2 \frac{1}{p} \right) + \frac{1}{2} \left( (1 - p) \log_2 \frac{1}{1 - p} + p \log_2 \frac{1}{p} \right) \right]

I(X; Y) = 1 - \left[ (1 - p) \log_2 \frac{1}{1 - p} + p \log_2 \frac{1}{p} \right]
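A minimal sketch evaluating this final expression numerically (the function name is my own). The printed values match the endpoints and midpoint of the curve in Figure 2.

    import numpy as np

    def bsc_mutual_information(p):
        """I(X;Y) = 1 - [(1-p) log2(1/(1-p)) + p log2(1/p)] for a binary
        symmetric channel with equiprobable inputs; 0 log 0 = 0."""
        h = sum(-q * np.log2(q) for q in (p, 1.0 - p) if q > 0)
        return 1.0 - h

    for p in (0.0, 0.1, 0.5):
        print(p, round(bsc_mutual_information(p), 3))   # 1.0, 0.531, 0.0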
As a consistency check, notice that if p = 0, we recover the solution for Example A of 1 bit. This function is plotted in Figure 2.

8 Conclusions

1. Classical information theory requires a preset probability model.

2. Information is selection between possibilities.

3. Entropy is an extensive measure of uncertainty.

4. Food for thought: if entropy is an extensive quantity, what is an invariant of a system?