Lecture 11: Information theory THURSDAY, FEBRUARY 21, 2019


Lecture 11: Information theory DANIEL WELLER THURSDAY, FEBRUARY 21, 2019

Agenda: Information and probability. Entropy and coding. Mutual information and capacity.

Both images contain the same fraction of black and white pixels, but the one on the right is far more disordered than the first. We would need far more information to describe the pattern of pixels on the right than on the left. This idea is connected to the signal's entropy.

Probability and information: Recall we determined that the information conveyed by an event occurring is related to its likelihood. Fundamental Tenet III: Information content is inversely proportional to probability. Today, we will formally define the information I(E) of an event E, based on the properties information should have. This definition of information will also extend naturally to discrete random variables, allowing us to describe information in terms of probability distributions.

Information as a quantity: To begin to quantify information mathematically, what properties should information have?
1) It should be nonnegative: observing something that occurs should not increase uncertainty.
2) The amount of information should decrease as the likelihood increases, as a rare occurrence is more informative than a common one.
3) The amount of information should decrease to zero as the likelihood increases to one, since we already know something certain to occur will occur.
4) If we observe two independent events in succession, the information conveyed by both occurring should be the same as the sum of the information obtained from each event.
Recall: what happens to the probabilities of independent events occurring together? What is P(AB)?

Information as a quantity: Proposition: For an event E with probability P(E), we define the self-information I(E) as

I(E) = log2(1 / P(E))

Let's confirm it satisfies the four properties:
1) I(E) is nonnegative since P(E) ≤ 1.
2) I(E) is decreasing as P(E) increases.
3) I(E) decreases to 0 as P(E) increases to 1.
4) If A and B are independent, then P(AB) = P(A)P(B), and I(AB) = log2(1 / (P(A)P(B))) = log2(1 / P(A)) + log2(1 / P(B)) = I(A) + I(B).
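
The definition is easy to compute directly. Here is a minimal sketch in Python (the helper name self_information is mine, not from the lecture) that also checks the additivity property numerically:

import math

def self_information(p, base=2):
    """Self-information I(E) = log_base(1 / P(E)) for an event with probability p."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1.0 / p, base)

print(self_information(1.0))    # certain event: 0.0 bits
print(self_information(0.5))    # 1.0 bit
print(self_information(0.25))   # 2.0 bits: rarer events carry more information

# Property 4: for independent events A and B, I(AB) = I(A) + I(B).
p_a, p_b = 0.5, 0.25
assert math.isclose(self_information(p_a * p_b),
                    self_information(p_a) + self_information(p_b))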

Information as a quantity: Is this function unique? Essentially, yes. We can change the base of the logarithm, but otherwise, Alfréd Rényi showed that it is the only choice for discrete events that satisfies these properties. If the base is 2, we call the corresponding unit of information a bit. If the base is e, we call the corresponding unit of information a nat. How do we apply this to a discrete random variable X? Replace P(E) in the formula with the probability mass for the value x associated with the desired outcome. Rényi is also credited (along with Erdős) with the quip that "a mathematician is a device for turning coffee into theorems."

Proof (optional): We have I(AB) = I(A) + I(B) for independent events with probabilities A and B. Take a derivative with respect to A:

d/dA [I(AB)] = d/dA [I(A) + I(B)]
B I'(AB) = I'(A)

Let A = 1 to get B I'(B) = I'(1) for all B, and then integrate:

I(x) − I(1) = ∫ from 1 to x of I'(B) dB = I'(1) ∫ from 1 to x of (1/B) dB = I'(1) ln(x)

If we additionally constrain I(1) = 0, then I(x) = I'(1) ln(x). Since I(x) is decreasing, I'(1) < 0 and does not depend on x. Set I'(1) = −1/ln(b) for base b > 1 to get I(x) = log_b(1/x). QED.

Information examples: Calculate the self-information I(E) for the following events:
Draw a king of clubs:
Roll a pair of dice and get a pair of 1's:
Time to play a game.
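
To check your answers afterwards, a short sketch in Python (assuming a standard 52-card deck and fair six-sided dice):

import math

print(math.log2(52))   # king of clubs: P(E) = 1/52, so I(E) = log2(52), about 5.70 bits
print(math.log2(36))   # a pair of 1's: P(E) = 1/36, so I(E) = log2(36), about 5.17 bits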

Information of a binary variable: First, suppose we have a fair coin flip: P(0) = P(1) = 0.5. How much information is conveyed by a single coin flip? For X = 0, we have I(0) = log2(1/0.5) = 1 bit. For X = 1, we have I(1) = log2(1/0.5) = 1 bit. In this case, a single binary coin flip (a bit) gives us 1 bit of information. Now, suppose the coin were not fair: P(0) = 0.75, P(1) = 0.25. For X = 0, we have I(0) = log2(1/0.75) = 0.415 bits. For X = 1, we have I(1) = log2(1/0.25) = 2 bits. So in one case, the coin flip conveys less than 1 bit, and in the other (rarer) case, more than 1 bit. What about multiple coin flips?

Activity: Guess What (or Who)? The object of the classic game is to guess your opponent's secret identity through a series of yes/no questions. We will be playing a simplified game with a collection of 12 food emojis. Track how many questions you need to narrow down to your opponent's food item.

Activity: Guess What? How many questions did you need to ask? What is the average among the class? Question: How does this compare to log2(12) = 3.585?

Average information and entropy: Computing the average information of a random variable is as simple as combining the information for each possible outcome, weighted by the likelihood of that outcome:

E[I(X)] = Σ_x I(X = x) P(X = x)

This average information is also called the entropy H(X) of the random variable. The term arises from statistical physics: "My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'" (Claude Shannon, Scientific American (1971), volume 225, page 180.)

Average information and entropy: Let's go back to the coin flip example. Fair coin: I(0) = I(1) = 1 bit, so H = 1 bit as well. For a biased coin: I(0) = 0.415 bits, I(1) = 2 bits, so H = (0.75)(0.415) + (0.25)(2) = 0.811 bits. On average, a biased coin conveys less information than a fair coin! Why is this intuitive? How about a fair six-sided die? I(1) = I(2) = I(3) = I(4) = I(5) = I(6) = log2(6) = 2.585 bits, so the entropy is the average, 2.585 bits. What about drawing a playing card?
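
A small sketch in Python (the function name entropy is mine) that reproduces these numbers; the playing-card line assumes all 52 cards are equally likely:

import math

def entropy(probs, base=2):
    """Average self-information: H = sum over x of P(x) * log_base(1 / P(x))."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))      # fair coin: 1.0 bit
print(entropy([0.75, 0.25]))    # biased coin: about 0.811 bits
print(entropy([1/6] * 6))       # fair six-sided die: about 2.585 bits
print(entropy([1/52] * 52))     # uniform draw from a 52-card deck: about 5.70 bits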

Average information and entropy: The minimum entropy value is zero. How can this be attained? One outcome is certain: on average, no information is gained by observing it. The maximum entropy value occurs when all values are equally likely (we call this a uniform distribution). For a fair coin, this is 1 bit. For a fair six-sided die, this is log2(6) = 2.585 bits. For N equally likely outcomes, H = log2(N) bits. However, realistic signals rarely have either absolute certainty or uniform distributions, so their entropy is somewhere in between. This is good news!

Entropy of real signals: Let's examine a standard image: lots of details. Zoom in closer: lots of redundancy! Data compression will take advantage of this redundancy to reduce storage/transmission requirements.

Entropy of real signals: Let's simplify things to a black/white example. [Figure: a black-and-white image, 31 x 24 pixels.] Total pixels: 31 x 24 = 744 pixels. We could store each pixel as a separate bit, totaling 744 bits of storage. This would be consistent with equiprobable 0's and 1's. Instead, let's ask how much information is stored in this image: 571 black and 173 white pixels.

Entropy of real signals: For each outcome (B or W), count the relative frequencies: P(B) = 571/744 = 0.767 and P(W) = 173/744 = 0.233. For each outcome, also compute the self-information: I(B) = 0.382 bits and I(W) = 2.104 bits. Average to find the entropy, or average information: H = 0.783 bits per pixel. Multiply by the number of pixels to find the total information: (0.783 bits/pixel)(744 pixels) = 583 bits. Two interpretations of entropy:
Abstract: average information among outcomes according to a probability distribution.
Empirical: average information among realizations (e.g., pixels) according to relative frequencies.
Question for next time: how do we code or compress data to minimize redundancy?
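
A quick numerical check of this slide's arithmetic in Python (the pixel counts are taken from the slide; nothing else is assumed):

import math

black, white = 571, 173
total = black + white                      # 744 pixels

p_black, p_white = black / total, white / total
i_black = math.log2(1 / p_black)           # about 0.382 bits
i_white = math.log2(1 / p_white)           # about 2.104 bits

h = p_black * i_black + p_white * i_white  # about 0.783 bits per pixel
print(h, h * total)                        # about 0.783 bits/pixel and 583 bits total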

Sharing information: In communications and storage, we want the signal we transmit or write to be the same as the signal received or read from the medium. In general, we cannot expect an error-free representation, since real signals get corrupted by noise. One way of quantifying information loss is to measure how much information is shared between the transmitted and received signals. This measure of shared information is called mutual information.

Mutual information: To compute mutual information, we first define the conditional entropy H(X|Y), which represents the average information remaining to be gained from X after observing a related signal Y. First, define a conditional expected value E_{X|Y}{f(X) | Y = y}, which is a function of the random variable Y and thus is itself random:

E_{X|Y}{f(X) | Y = y} = Σ_x f(x) P(X = x | Y = y)

The conditional entropy H(X|Y) is the expected value (with respect to the remaining random variable Y) of this random function, with f(x) = log2(1 / P(X = x | Y = y)):

H(X|Y) = E[ E_{X|Y}{ log2(1 / P(X = x | Y = y)) } ] = Σ_y P(Y = y) Σ_x P(X = x | Y = y) log2(1 / P(X = x | Y = y))
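
This definition translates directly into code. A sketch in Python (the dictionary representation of the joint PMF and the function name are my own choices for illustration):

import math

def conditional_entropy(joint):
    """H(X|Y) = sum over y of P(y) * sum over x of P(x|y) * log2(1 / P(x|y)),
    where joint is a dict mapping (x, y) pairs to probabilities."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    h = 0.0
    for (x, y), p in joint.items():
        if p > 0:
            p_x_given_y = p / p_y[y]
            h += p_y[y] * p_x_given_y * math.log2(1 / p_x_given_y)
    return h

# Illustration: X fair, and Y agrees with X 90% of the time.
joint = {(0, 0): 0.45, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.45}
print(conditional_entropy(joint))   # about 0.469 bits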

Mutual information: The mutual information I(X;Y) = H(X) − H(X|Y). This is the difference between the information available from observing just X, and the remaining information in X after observing Y. Mutual information is symmetric, but conditional entropy is not: H(X|Y) is not equal to H(Y|X), but I(X;Y) = I(Y;X) = H(Y) − H(Y|X). Let's try an example: Y = X ⊕ E, where X and E are independent binary random variables, and ⊕ is the binary exclusive-or operator. We'll go over the logical interpretation of this operator later in the semester, but here we just think of it mathematically:

Y = X if E = 0, and Y = 1 − X if E = 1

Let's try two cases: P(E = 1) = 0.5 and P(E = 1) = 0.1. Assume P(X = 0) = P(X = 1) = 0.5.

Mutual information (example): First, P(E = 1) = 0.5. What is P(Y|X)? What is P(Y)? Y depends on X through E, which has P(E=1) = 0.5:
P(Y=0|X=0) = P(E=0) = 0.5, and P(Y=1|X=0) = P(E=1) = 0.5.
P(Y=0|X=1) = P(E=1) = 0.5, and P(Y=1|X=1) = P(E=0) = 0.5.
To find P(Y), we use P(Y=0) = P(X=0) P(Y=0|X=0) + P(X=1) P(Y=0|X=1) = 0.5*0.5 + 0.5*0.5 = 0.5.
What is H(Y|X)? What is H(Y)? What is I(X;Y)?
H(Y|X) = P(X=0) [ P(Y=0|X=0) log2(1/P(Y=0|X=0)) + P(Y=1|X=0) log2(1/P(Y=1|X=0)) ] + P(X=1) [ P(Y=0|X=1) log2(1/P(Y=0|X=1)) + P(Y=1|X=1) log2(1/P(Y=1|X=1)) ] = 1 bit.
H(Y) = P(Y=0) log2(1/P(Y=0)) + P(Y=1) log2(1/P(Y=1)) = 1 bit, so I(X;Y) = 0 bits (i.e., X and Y share no information, because X = Y is just as likely as X ≠ Y).

Mutual information (example): Now, P(E = 1) = 0.1. What is P(Y|X)? What is P(Y)? Y depends on X through E, which has P(E=1) = 0.1:
P(Y=0|X=0) = P(E=0) = 0.9, and P(Y=1|X=0) = P(E=1) = 0.1.
P(Y=0|X=1) = P(E=1) = 0.1, and P(Y=1|X=1) = P(E=0) = 0.9.
To find P(Y), we use P(Y=0) = P(X=0) P(Y=0|X=0) + P(X=1) P(Y=0|X=1) = 0.5*0.9 + 0.5*0.1 = 0.5 (same as before).
What is H(Y|X)? What is H(Y)? What is I(X;Y) now?
H(Y|X) = P(X=0) [ P(Y=0|X=0) log2(1/P(Y=0|X=0)) + P(Y=1|X=0) log2(1/P(Y=1|X=0)) ] + P(X=1) [ P(Y=0|X=1) log2(1/P(Y=0|X=1)) + P(Y=1|X=1) log2(1/P(Y=1|X=1)) ] = 0.469 bits.
H(Y) = P(Y=0) log2(1/P(Y=0)) + P(Y=1) log2(1/P(Y=1)) = 1 bit (same as before), so I(X;Y) = 0.531 bits (i.e., X and Y share some information).
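
A compact numerical check of both cases in Python (the function names are mine; the setup follows the slide's Y = X ⊕ E model with a fair X):

import math

def h_binary(p):
    """Entropy of a binary variable with P(1) = p, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def mutual_information_xor(p_e):
    """I(X;Y) = H(Y) - H(Y|X) for Y = X xor E, with X fair and P(E=1) = p_e."""
    # Given X, Y is flipped with probability p_e, so H(Y|X) = h_binary(p_e).
    h_y_given_x = h_binary(p_e)
    # With X fair, Y is also fair regardless of p_e, so H(Y) = 1 bit.
    p_y1 = 0.5 * (1 - p_e) + 0.5 * p_e
    return h_binary(p_y1) - h_y_given_x

print(mutual_information_xor(0.5))   # 0.0 bits: the output is pure noise
print(mutual_information_xor(0.1))   # about 0.531 bits: some information gets through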

Mutual information and capacity: In the example, we've assumed our source signal X is a simple fair binary variable, which we've seen requires a lot of information to encode. A natural question is how much information can be communicated from transmitter to receiver in general. Assuming we can optimize P(X) for our communication system, we define the capacity of the system to communicate information as the maximum of the mutual information over P(X). Fundamental Tenet IV: The capacity of a communications channel (or system) is the maximum mutual information between source and receiver that can be communicated, maximized with respect to the source distribution P(X). We will see other ways to measure capacity later in the semester.
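
As an illustration only (not from the slides), here is a sketch that approximates this maximization numerically for the exclusive-or channel from the previous example with P(E=1) = 0.1; the grid search over P(X=1) and the function names are my own:

import math

def h_binary(p):
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def mutual_information(p_x1, p_e):
    """I(X;Y) = H(Y) - H(Y|X) for Y = X xor E with P(X=1) = p_x1 and P(E=1) = p_e."""
    p_y1 = p_x1 * (1 - p_e) + (1 - p_x1) * p_e
    return h_binary(p_y1) - h_binary(p_e)

# Capacity C = max over P(X) of I(X;Y); a fine grid over P(X=1) approximates it.
p_e = 0.1
best = max((mutual_information(px / 1000, p_e), px / 1000) for px in range(1001))
print(best)   # about (0.531, 0.5): the maximum is attained by a fair source here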

More about capacity: This definition motivates a number of questions we will address later in the course: How do we design P(X)? How do we characterize a communications (or storage) system this way? Is this an absolute limit, or can we exceed it? How close can we get to capacity in practice? The key innovation in mathematical communications theory was not only the formulation of this limit, but also the strategies that followed to achieve it! Image credits: (left) Marie Gallager; (right) Télécom Bretagne. Robert Gallager and Claude Berrou developed low-density parity-check codes and turbo codes, respectively, which provide performance close to the information-theoretic capacity.

Announcements: Next time: compression and coding methods. Homework #5 will go out next week.