Entropy. Probability and Computing. Presentation 22. Probability and Computing Presentation 22 Entropy 1/39

Introduction Why are randomness and information related? An event that is almost certain to occur (very high probability) carries almost no information when it happens. Is it news when the sun rises in the morning? An event that happens seldom (very low probability) is interesting and thus informative. Probability and Computing Presentation 22 Entropy 2/39

Entropy We want to measure what intuitively is information / randomness. By analogy with physics, the entity to be measured will be called entropy. In thermodynamics, entropy is used as a measure of disorder of a physical system. Example: a compressed gas has less entropy than the same gas after dissipation. Shouldn't it have more entropy? It has more pressure! Probability and Computing Presentation 22 Entropy 3/39

Enter bits Consider a uniform probability distribution on a probability space with 2^n events. Postulate: each occurrence of such an event carries n units of entropy/information/randomness. Why: we need to concatenate n symbols in the simplest alphabet {0, 1} to create a message to describe the event: 2^n messages and 2^n events. Probability and Computing Presentation 22 Entropy 4/39

Generalizing bits If p = 2^{-n} is the probability of an elementary event, then n = lg(1/p). Postulate: occurrence of an event A, with Pr(A) > 0, brings this amount of information: lg(1/Pr(A)) = -lg Pr(A). Probability and Computing Presentation 22 Entropy 5/39

Random variables A random variable X : Ω → R determines a partitioning of the sample space Ω into events of the form X = x, for all x in the range of values of X. Define the entropy of random variable X: H(X) = Σ_x Pr(X = x) lg(1/Pr(X = x)), where x ranges over all values of X. Probability and Computing Presentation 22 Entropy 6/39
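As a small illustration (not part of the slides), here is a minimal Python sketch of this definition; representing the distribution of X as a dictionary from values to probabilities is my choice.

    from math import log2

    def entropy(dist):
        # Entropy H(X) in bits, for a distribution given as {value: probability}.
        # Terms with probability 0 are skipped, following the convention 0 * lg(1/0) = 0.
        return sum(p * log2(1 / p) for p in dist.values() if p > 0)

    # A uniform distribution on 2^3 = 8 values carries 3 bits of entropy.
    uniform8 = {k: 1 / 8 for k in range(8)}
    print(entropy(uniform8))  # 3.0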

Intuition for H(X) The entropy H(X) of a random variable X is the average number of bits needed to encode an outcome of an experiment that determines the value of X. Probability and Computing Presentation 22 Entropy 7/39

Entropy of a random variable X as expectation Let X be a random variable with values in a countable set V_X. Treat V_X as a sample space with X's probability distribution: for r ∈ V_X define Pr_X(r) = Pr(X = r). Now define a random variable Y : V_X → R by Y(r) = lg(1/Pr_X(r)). Interpretation of entropy as expectation: H(X) = Σ_{r ∈ V_X} Pr(X = r) lg(1/Pr(X = r)) = Σ_{r ∈ V_X} Pr_X(r) Y(r) = E[Y]. Probability and Computing Presentation 22 Entropy 8/39

Random variables with Bernoulli distributions Random variable X_p with Bernoulli distribution, p the probability of success. X_p has entropy: H(X_p) = p lg(1/p) + (1 - p) lg(1/(1 - p)). This is a function of p, denoted H(p), and called the (binary) entropy function: H(p) = -p lg p - (1 - p) lg(1 - p). Probability and Computing Presentation 22 Entropy 9/39
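As a quick check of the values used on the following slides, here is a hedged Python sketch of the binary entropy function (the name binary_entropy is mine, not from the lecture):

    from math import log2

    def binary_entropy(p):
        # H(p) = -p lg p - (1 - p) lg(1 - p), with H(0) = H(1) = 0 by convention.
        if p in (0.0, 1.0):
            return 0.0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    print(binary_entropy(0.5))   # 1.0: a toss of a fair coin is worth one bit
    print(binary_entropy(0.25))  # about 0.811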

The essential properties of the entropy function It is continuous and converges to 0 as p → 1 and as p → 0. It is symmetric with respect to p = 1/2: for 0 < p ≤ 1/2, H(p) = H(1 - p). Entropy is increasing for 0 < p ≤ 1/2 and decreasing for 1/2 ≤ p < 1. Probability and Computing Presentation 22 Entropy 10/39

Bits The maximum of the entropy function H is attained at p = 1/2, where H(1/2) = (1/2) lg 2 + (1/2) lg 2 = 1. This represents the intuition: the outcome of a toss of a fair coin gives a unit of information, called a bit. Probability and Computing Presentation 22 Entropy 11/39

Compression reflects information Example: take p = 1/4; then H(1/4) = (1/4) lg 4 + (3/4) lg(4/3) ≈ 0.81 < 1. How to interpret the property that H(1/4) < 1? What is troubling is that we cannot encode the outcome of a single experiment with fewer than one bit. Take a loooong sequence of some n outcomes of such experiments. Represent them by a corresponding sequence of n zeros and ones. This sequence could be compressed to fewer than n bits. Like roughly 0.81 n bits? Probability and Computing Presentation 22 Entropy 12/39

A connection with combinatorics The entropy function occurs in combinatorial calculations when estimating a sum of consecutive binomial coefficients. Let 0 ≤ a ≤ 1/2 and let n be a natural number. Then Σ_{0 ≤ k ≤ an} (n choose k) ≤ 2^{H(a) n}. (see the lecture notes) Probability and Computing Presentation 22 Entropy 13/39
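A small numerical sanity check of this bound, written as a Python sketch (the particular values of n and a below are arbitrary):

    from math import comb, floor, log2

    def check_bound(n, a):
        # Compare sum_{0 <= k <= a*n} (n choose k) with 2^(H(a)*n), for 0 < a <= 1/2.
        lhs = sum(comb(n, k) for k in range(floor(a * n) + 1))
        h = -a * log2(a) - (1 - a) * log2(1 - a)
        return lhs, 2 ** (h * n)

    print(check_bound(100, 0.25))  # the first number is below the second, as the bound states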

Realization of random variables Imagine that we carry out experiments to obtain a sequence x_1, x_2, ... as a realization of X_1, X_2, ..., where X_i = x_i according to the probability distribution of X_i. We want to process this sequence of outcomes of experiments to produce a new sequence of values which is to be, with respect to its statistical properties, as if it were a realization of a sequence of random variables Y_1, Y_2, .... This means simulating one sequence of random variables given a sequence of values taken on by a sequence of some other random variables. Probability and Computing Presentation 22 Entropy 14/39

Simulating a biased coin Suppose that X_1, X_2, ... is a sequence of outcomes of independent tosses of a fair coin. We want to simulate a sequence of independent Bernoulli trials Y_1, Y_2, ..., each with probability 0 < p < 1 of success, for a given p. This could be interpreted as simulating a biased coin using a fair coin. Probability and Computing Presentation 22 Entropy 15/39

An intuition of the simulation Suppose we can draw a random real number r from the interval (0, 1). Now if we draw r such that r ≤ p then this is a success and we output 1, and if we draw r > p then this is a failure and we output 0. Probability and Computing Presentation 22 Entropy 16/39

How to simulate drawing a random real number? We can represent such a number r by its binary expansion r = 0.r_1 r_2 r_3 ..., where r_i ∈ {0, 1} are bits. For example, 1/2 = 0.1000..., 1/4 = 0.01000.... Random bits r_1, r_2, r_3, ... are taken from the sequence X_1, X_2, .... We use each one of these bits only once. Probability and Computing Presentation 22 Entropy 17/39

Towards a simulation Let p = 0.p_1 p_2 p_3 ... be a binary representation of p. Suppose we produce a randomly selected r = 0.r_1 r_2 r_3 .... What does it mean that r ≤ p? This means that, for i the first bit on which p and r differ: r_i < p_i. (simulation next) Probability and Computing Presentation 22 Entropy 18/39

The simulation Proceed bit by bit. Begin by comparing r_1 with p_1: if r_1 < p_1 then this indicates success, if r_1 > p_1 then this indicates failure, and if r_1 = p_1 then this bit does not help and we proceed to consider the second bit. This process continues until we find the first bit i such that r_i ≠ p_i. Probability and Computing Presentation 22 Entropy 19/39
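A possible Python rendering of this bit-by-bit procedure, as a sketch: the helper fair_bit stands in for the fair-coin stream X_1, X_2, ..., and the bits p_1, p_2, ... of p are generated on the fly from a floating-point p, which is accurate enough for illustration.

    import random

    def fair_bit():
        # One unbiased random bit; stands in for the fair-coin stream X_1, X_2, ...
        return random.getrandbits(1)

    def biased_coin(p):
        # Simulate one Bernoulli(p) trial: compare r = 0.r1 r2 r3... with
        # p = 0.p1 p2 p3... bit by bit, and decide at the first differing bit.
        x = p
        while True:
            x *= 2
            p_i = 1 if x >= 1 else 0
            x -= p_i
            r_i = fair_bit()
            if r_i != p_i:
                return 1 if r_i < p_i else 0  # success exactly when r < p

    # Rough check: the empirical frequency of success should be close to p.
    print(sum(biased_coin(1 / 3) for _ in range(100_000)) / 100_000)  # about 0.333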

Extracting randomness We discuss simulating Y_1, Y_2, ..., which is a sequence of outcomes of independent tosses of a fair coin. This is called extracting randomness from X_1, X_2, .... Probability and Computing Presentation 22 Entropy 20/39

Generalizing extraction Consider a procedure R that for a given X produces R(X), which is a string of bits. Given a sequence X_1, X_2, ... of random variables, a new simulated sequence is defined to be R(X_1), R(X_2), .... We say that the bits of R(X_i) are produced simultaneously, while the bits in R(X_i) with respect to R(X_j), for i ≠ j, are produced separately. For R to be an extractor it needs to have additional properties, which we define next in two ways and then argue are equivalent. Probability and Computing Presentation 22 Entropy 21/39

One take on an extractor R We want the sequence obtained by concatenating R(X_1), R(X_2), ... to be such that, when it is interpreted as a sequence of bits r_1, r_2, ..., that is, r_i ∈ {0, 1}, the values r_i are independent of each other and satisfy Pr(r_i = 0) = Pr(r_i = 1) = 1/2. Probability and Computing Presentation 22 Entropy 22/39

Another take on an extractor R It has two properties: Sequences produced separately are independent of each other. For any integer k > 0, if it is possible to extract k bits simultaneously, meaning that R(X) = (r_1, ..., r_k) for some sequence of k bits (r_1, ..., r_k), then, for any sequence (z_1, ..., z_k) of k bits, the probability of extracting (z_1, ..., z_k) as a value of R(X) is the same as the probability of extracting (r_1, ..., r_k). Probability and Computing Presentation 22 Entropy 23/39

Equivalence of two definitions These two takes on an extractor define the same concept. Probability and Computing Presentation 22 Entropy 24/39

An example of extracting randomness An experiment (realization) returns an integer selected uniformly at random from the interval [0, 7]. Fix a one-to-one correspondence between the integers in the interval [0, 7] and the 8 strings of 3 bits each. Given a number in [0, 7], output the corresponding string. This is a clean situation, as there are precisely 2^3 = 8 binary strings, each of length 3. I always liked 8, it is a nice number. Probability and Computing Presentation 22 Entropy 25/39

From 8 to 10 Suppose Y ∈ [0, 9] is an integer selected uniformly at random from among 10 integers. Since 10 = 8 + 2, we may assign a string of 3 bits to some 8 numbers in the interval, which leaves two integers in the interval. To them we may assign two different bits, one bit per number. Sometimes we output three bits and sometimes just one bit? Probability and Computing Presentation 22 Entropy 26/39
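One way to code this up (the function name and the particular assignment of codewords are my choices, not from the lecture):

    def extract_from_ten(y):
        # Extract fair bits from one draw of Y uniform on {0, ..., 9}:
        # values 0..7 map to their 3-bit binary representations,
        # and the remaining two values map to a single bit each.
        if y < 8:
            return format(y, '03b')   # e.g. 5 -> '101'
        return '0' if y == 8 else '1'

    print(extract_from_ten(5))  # '101'
    print(extract_from_ten(9))  # '1'

Conditioned on the number of bits produced, every bit string of that length is equally likely, matching the second take on an extractor from the earlier slide.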

From 10 to 11 What if Z ∈ [0, 10] is an integer selected uniformly at random from among 11 integers? Now 11 = 2^3 + 2^1 + 2^0 = 8 + 2 + 1. There is no apparent way to extend the construction for 10 integers, unless we modify the method significantly. Sometimes we output three bits, sometimes just one bit, and sometimes nothing at all? Probability and Computing Presentation 22 Entropy 27/39

Limits on extraction Fact (upper bound on the rate of extraction): No extraction function can produce more than H(X) bits simultaneously on average, if each random variable among X_1, X_2, ... has the probability distribution of X. Probability and Computing Presentation 22 Entropy 28/39
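For the uniform [0, 9] example from the previous slides, this bound is easy to check numerically (a sketch; the arithmetic can also be done by hand):

    from math import log2

    # Average number of bits extracted by the 3-bits-or-1-bit scheme for Y uniform on {0, ..., 9}:
    avg_bits = (8 / 10) * 3 + (2 / 10) * 1   # = 2.6
    print(avg_bits, log2(10))                # 2.6 versus H(Y) ~ 3.32
    print(avg_bits <= log2(10))              # True, consistent with the fact above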

Extracting randomness from a biased coin We want to process outcomes of tosses of a coin. Such outcomes of coin tosses form a Bernoulli sequence, that is, a sequence of outcomes of independent Bernoulli trials. There is some probability p of heads coming up (which we call success), and q = 1 - p is the probability of tails (or failure). The number p is not known: we want an extractor that works similarly for any 0 < p < 1 without having p as part of its code. This is obviously impossible! Probability and Computing Presentation 22 Entropy 29/39

The main insight The input is made of outcomes of experiments x_1, x_2, x_3, ..., each either heads H or tails T, produced by independent tosses of a coin (which we do not know to be fair). Partition the input into pairs (x_1, x_2), (x_3, x_4), (x_5, x_6), .... There are four possibilities for a pair: HH, TT, HT, TH, which occur with the respective probabilities p^2, q^2, pq and qp. Probability and Computing Presentation 22 Entropy 30/39

We experience illumination The probability of HT is the same as of TH: pq = qp. Output T for each pair TH. Output H for each pair HT. Ignore pairs TT and HH. The guy who invented this simply knew God s phone number. Probability and Computing Presentation 22 Entropy 31/39
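This pairing trick is commonly attributed to von Neumann. A minimal Python sketch, with heads and tails represented by the characters 'H' and 'T' (my choice of encoding):

    def von_neumann_extract(tosses):
        # Turn biased, independent coin tosses into unbiased bits:
        # group the tosses into consecutive pairs, let HT yield H and TH yield T,
        # and discard the HH and TT pairs.
        it = iter(tosses)
        return [a for a, b in zip(it, it) if a != b]

    print(von_neumann_extract("HHTHTTHT"))  # ['T', 'H']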

Efficiency of this extraction How many input bits do we use per one output bit on average? Each pair HT or TH can be considered a success at producing output, so such a success occurs with probability 2pq. The average waiting time for such a success is 1/(2pq) pairs, or twice as many, 1/(pq), coin tosses. When p = q, four coin tosses are needed to produce one output bit on average. It could be worse, as p(1 - p) < 1/4 for 0 < p < 1/2. Example: for p = 1/3, the expected number of coin tosses per output bit is 9/2 > 4. Probability and Computing Presentation 22 Entropy 32/39
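A quick empirical check of this accounting, as a self-contained sketch (the exact ratio will fluctuate from run to run):

    import random

    random.seed(0)
    p, n = 1 / 3, 200_000
    tosses = ['H' if random.random() < p else 'T' for _ in range(n)]

    it = iter(tosses)
    out = [a for a, b in zip(it, it) if a != b]   # the HT/TH pairing from the previous slides

    print(n / len(out))  # close to 1/(p*q) = 9/2 = 4.5 input tosses per output bit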

A streamlined extraction procedure We partition the input, consisting of consecutive realizations of the probability distribution of X, into consecutive pairs (x_1, x_2), (x_3, x_4), (x_5, x_6), .... We will be sending bits to three streams; one of them is the output, and the other two streams are denoted Y and Z. The bits making up Y and Z are to have the property that they are outcomes of independent tosses of coins of some unknown biases, one such coin for Y with its bias, and another coin for Z with its bias. Probability and Computing Presentation 22 Entropy 33/39

Three possible actions There are three actions to perform for a consecutive pair (x_i, x_{i+1}) of inputs. Some of these actions may be void. 1. Send to output: If (x_i, x_{i+1}) = HT then output H, and if (x_i, x_{i+1}) = TH then output T. 2. Send to Y: If (x_i, x_{i+1}) = HH then add H to Y, and if (x_i, x_{i+1}) = TT then add T to Y. 3. Send to Z: If either (x_i, x_{i+1}) = HH or (x_i, x_{i+1}) = TT then add H to Z, and if either (x_i, x_{i+1}) = HT or (x_i, x_{i+1}) = TH then add T to Z. Probability and Computing Presentation 22 Entropy 34/39

Final comments The streams Y and Z are to be processed recursively, and the outcomes of this processing are to be interleaved with what we send directly to the output. The ultimate output bits, obtained by interleaving the directly produced output with the bits produced recursively from Y and Z, are to be independent and unbiased. As the process continues, the number of streams increases, with no upper bound on the number of recursively processed streams. Probability and Computing Presentation 22 Entropy 35/39
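A hedged Python sketch of one way to realize this recursion. It simply concatenates the three resulting streams in a fixed order instead of interleaving them, which is one common way to present the procedure; everything else follows the three actions of the previous slide.

    def streamlined_extract(tosses):
        # Recursively extract bits from independent tosses given as 'H'/'T' symbols.
        # Unequal pairs go straight to the output; the derived streams Y and Z
        # are processed recursively, and the three outputs are concatenated.
        if len(tosses) < 2:
            return []
        out, y, z = [], [], []
        it = iter(tosses)
        for a, b in zip(it, it):
            if a != b:
                out.append(a)    # HT -> H, TH -> T on the output stream
                z.append('T')    # an unequal pair adds T to Z
            else:
                y.append(a)      # HH adds H to Y, TT adds T to Y
                z.append('H')    # an equal pair adds H to Z
        return out + streamlined_extract(y) + streamlined_extract(z)

    print(streamlined_extract("HHTHTTHT"))  # ['T', 'H', 'H', 'H', 'H']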

Explaining the algorithm Let us look at the occurrences of tails in the three streams, to be able to argue why they are independent. They can be obtained in the following three ways: 1. Tails in the output come from input pairs of the form TH. 2. Tails in Y come from input pairs of the form TT. 3. Tails in Z come from input pairs of the form HT or TH. The pairs TT are independent of the pairs TH or HT, as they are produced by different coin tosses, with no overlap. An occurrence of tails in the output is independent of those in Z: given tails in Z, it could have been produced by either HT, which adds H to the output, or TH, which adds T to the output, and these are equally likely since pq = qp. Probability and Computing Presentation 22 Entropy 36/39

Quality The extractor we described is optimal: it extracts all the randomness that is there. (see lecture notes for details) Probability and Computing Presentation 22 Entropy 37/39

Homework Your friend flips a fair coin repeatedly until the first heads occurs. Let X be a random variable equal to the number of flips. You want to determine how many flips were performed in a specific experiment. You are allowed to ask a series of yes/no questions of the following form: you give your friend a set of integers, and your friend answers yes if the number of flips is in the set and no otherwise. Probability and Computing Presentation 22 Entropy 38/39

Questions with hints 1. Give a formula for H(X). Hint: This is a specific random variable; apply the definition of entropy. 2. Describe a strategy such that the expected number of questions you ask before determining the number of flips is H(X). Hint: Find a strategy with a formula for the expected number of questions that looks the same as the formula for H(X). 3. Give an intuitive explanation of why there is no strategy that would allow you to ask fewer than H(X) questions on average. Hint: Just verbalize your intuitions, referring to entropy. Probability and Computing Presentation 22 Entropy 39/39