Entropy
Probability and Computing, Presentation 22
Introduction
Why are randomness and information related? An event that is almost certain to occur (very high probability) carries almost no information when it happens. Is it news when the sun rises in the morning? An event that happens seldom (very low probability) is interesting, and so informative.
Entropy
We want to measure what intuitively is information / randomness. By analogy with physics, the entity to be measured will be called entropy. In thermodynamics, entropy is used as a measure of the disorder of a physical system. Example: a compressed gas has less entropy than the same gas after it dissipates. Shouldn't it have more entropy? It has more pressure!
Enter bits
Consider a uniform probability distribution on a probability space with 2^n events. Postulate: each occurrence of such an event carries n units of entropy/information/randomness. Why: we need to concatenate n symbols over the simplest alphabet {0, 1} to create a message describing the event: 2^n messages for 2^n events.
Generalizing bits
If p = 2^(-n) is the probability of an elementary event, then n = lg(1/p). Postulate: the occurrence of an event A, with Pr(A) > 0, brings this amount of information: lg(1/Pr(A)) = -lg Pr(A).
Random variables
A random variable X : Ω → R determines a partitioning of the sample space Ω into events of the form X = x, for all x in the range of values of X. Define the entropy of the random variable X: H(X) = Σ_x Pr(X = x) · lg(1/Pr(X = x)), where x ranges over all values of X.
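The definition above can be computed directly. A minimal sketch in Python; the function name `entropy` and the dictionary representation of the distribution are illustrative choices, not from the slides:

```python
import math

def entropy(dist):
    """H(X) = sum over x of Pr(X = x) * lg(1 / Pr(X = x)), in bits.
    `dist` maps each value x to Pr(X = x); zero-probability values add 0."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

# A uniform distribution on 2^2 = 4 values: each outcome carries 2 bits,
# matching the earlier postulate for uniform spaces.
print(entropy({"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}))  # 2.0
```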
Intuition for H(X)
The entropy H(X) of a random variable X is the average number of bits needed to encode an outcome of an experiment that determines the value of X.
Entropy of a random variable X as expectation
Let X be a random variable with values in a countable set V_X. Treat V_X as a sample space with X's probability distribution: for r ∈ V_X define Pr_X(r) = Pr(X = r). Now define a random variable Y : V_X → R by Y(r) = lg(1/Pr_X(r)). Interpretation of entropy as expectation: H(X) = Σ_{r ∈ V_X} Pr(X = r) · lg(1/Pr(X = r)) = Σ_{r ∈ V_X} Pr_X(r) · Y(r) = E[Y].
Random variables with Bernoulli distributions
Let X_p be a random variable with the Bernoulli distribution, where p is the probability of success. X_p has entropy H(X_p) = p · lg(1/p) + (1 - p) · lg(1/(1 - p)). This is a function of p, denoted H(p) and called the (binary) entropy function: H(p) = -p lg p - (1 - p) lg(1 - p).
The essential properties of the entropy function
It is continuous and converges to 0 as p → 1 and as p → 0. It is symmetric with respect to p = 1/2: H(p) = H(1 - p) for 0 < p ≤ 1/2. Entropy is increasing for 0 < p ≤ 1/2 and decreasing for 1/2 ≤ p < 1.
Bits
The maximum of the entropy function H is attained at p = 1/2, where H(1/2) = (1/2) lg 2 + (1/2) lg 2 = 1. This represents the intuition: the outcome of a toss of a fair coin gives one unit of information, called a bit.
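These values can be checked numerically; a sketch (the name `binary_entropy` is ours):

```python
import math

def binary_entropy(p):
    """H(p) = -p lg p - (1 - p) lg(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0: a fair coin toss is worth one bit
print(binary_entropy(0.25))  # about 0.811: a biased coin carries less
```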
Compression reflects information
Example: take p = 1/4; then H(1/4) = (1/4) lg 4 + (3/4) lg(4/3) ≈ 0.81 < 1. How to interpret the property that H(1/4) < 1? What is troubling is that we cannot encode the outcome of a single experiment with fewer than one bit. Take a loooong sequence of some n outcomes of such experiments. Represent them by a corresponding sequence of n zeros and ones. This sequence could be compressed to fewer than n bits. Like 0.81·n bits?
A connection with combinatorics
The entropy function occurs in combinatorial calculations when estimating a sum of consecutive binomial coefficients. Let 0 ≤ a ≤ 1/2 and let n be a natural number. Then Σ_{0 ≤ k ≤ an} C(n, k) ≤ 2^{H(a)·n}. (See the lecture notes.)
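The inequality can be checked numerically for small cases. A sketch; both helper names are ours, and the inequality itself is the fact stated above:

```python
import math

def binom_tail(n, a):
    """Sum of the binomial coefficients C(n, k) for 0 <= k <= a*n."""
    return sum(math.comb(n, k) for k in range(int(a * n) + 1))

def entropy_bound(n, a):
    """2^(H(a) * n), with H the binary entropy function."""
    h = -a * math.log2(a) - (1 - a) * math.log2(1 - a)
    return 2.0 ** (h * n)

# The stated inequality, checked for a sample case:
print(binom_tail(100, 0.25) <= entropy_bound(100, 0.25))  # True
```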
Realization of random variables
Imagine that we carry out experiments to obtain a sequence x_1, x_2, ... as a realization of X_1, X_2, ..., where X_i = x_i according to the probability distribution of X_i. We want to process this sequence of outcomes to produce a new sequence of values which is, with respect to its statistical properties, as if it were a realization of a sequence of random variables Y_1, Y_2, .... This means simulating one sequence of random variables given the values taken on by a sequence of some other random variables.
Simulating a biased coin
Suppose that X_1, X_2, ... is a sequence of outcomes of independent tosses of a fair coin. We want to simulate a sequence of independent Bernoulli trials Y_1, Y_2, ..., each with probability 0 < p < 1 of success, for a given p. This can be interpreted as simulating a biased coin using a fair coin.
An intuition for the simulation
Suppose we can draw a random real number r from the interval (0, 1). If we draw r ≤ p then this is a success and we output 1, and if we draw r > p then this is a failure and we output 0.
How to simulate drawing a random real number?
We can represent such a number r by its binary expansion r = 0.r_1 r_2 r_3 ..., where the r_i ∈ {0, 1} are bits. For example, 1/2 = 0.1000... and 1/4 = 0.01000.... The random bits r_1 r_2 r_3 ... are taken from the sequence X_1, X_2, .... We use each of these bits only once.
Towards a simulation
Let p = 0.p_1 p_2 p_3 ... be a binary representation of p. Suppose we produce a randomly selected r = 0.r_1 r_2 r_3 .... What does it mean that r ≤ p? It means that, for i the first position on which p and r differ, we have r_i < p_i. (Simulation next.)
The simulation
Proceed bit by bit. Begin by comparing r_1 with p_1: if r_1 < p_1 then this indicates success; if r_1 > p_1 then this indicates failure; and if r_1 = p_1 then this bit does not help and we proceed to the second bit. The process continues until we find the first position i such that r_i ≠ p_i.
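The bit-by-bit comparison can be sketched in Python. Here `fair_bit` supplies the fair-coin bits X_1, X_2, ...; the function name and interface are illustrative choices:

```python
import random

def biased_bit(p, fair_bit=lambda: random.getrandbits(1)):
    """Simulate one Bernoulli(p) trial from fair coin flips: compare the
    random bits r_1 r_2 ... with the binary expansion 0.p_1 p_2 ... of p,
    and let the first differing position decide."""
    while True:
        p *= 2                       # shift the next bit of p into the integer part
        p_i, p = int(p), p - int(p)  # p_i is the current bit of p's expansion
        r_i = fair_bit()
        if r_i < p_i:
            return 1                 # r < p: success
        if r_i > p_i:
            return 0                 # r > p: failure
        # r_i == p_i: uninformative, consume the next fair bit

random.seed(0)
samples = [biased_bit(1 / 3) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 1/3
```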
Extracting randomness
We now discuss simulating Y_1, Y_2, ..., a sequence of outcomes of independent tosses of a fair coin, given the sequence X_1, X_2, .... This is called extracting randomness from X_1, X_2, ....
Generalizing extraction
Consider a procedure R that for a given X produces R(X), which is a string of bits. Given a sequence X_1, X_2, ... of random variables, a new simulated sequence is defined to be R(X_1), R(X_2), .... We say that the bits of R(X_i) are produced simultaneously, while the bits of R(X_i) with respect to those of R(X_j), for i ≠ j, are produced separately. For R to be an extractor it needs additional properties, which we define next in two ways, and then argue why the two definitions are equivalent.
One take on an extractor R
We want the sequence obtained by concatenating R(X_1), R(X_2), ... to be such that, when interpreted as a sequence of bits r_1, r_2, ... with r_i ∈ {0, 1}, the values r_i are independent of each other and satisfy Pr(r_i = 0) = Pr(r_i = 1) = 1/2.
Another take on an extractor R
It has two properties:
1. Sequences produced separately are independent of each other.
2. For any integer k > 0, if it is possible to extract k bits simultaneously, meaning that R(X) = (r_1, ..., r_k) for some sequence of k bits (r_1, ..., r_k), then, for any sequence (z_1, ..., z_k) of k bits, the probability of extracting (z_1, ..., z_k) as the value of R(X) is the same as the probability of extracting (r_1, ..., r_k).
Equivalence of the two definitions
These two takes on an extractor define the same concept.
An example of extracting randomness
An experiment (realization) returns an integer selected uniformly at random from the interval [0, 7]. Fix a one-to-one correspondence between the integers in [0, 7] and the 2^3 = 8 strings of 3 bits each. Given a number in [0, 7], output the corresponding string. This is a clean situation, as there are precisely as many binary strings of length 3 as there are events. I always liked 8, it is a nice number.
From 8 to 10
Suppose Y ∈ [0, 9] is an integer selected uniformly at random from among 10 integers. Since 10 = 8 + 2, we may assign a string of 3 bits to some 8 numbers in the interval, which leaves two integers. To them we may assign two different single bits, one bit per number. So sometimes we output three bits and sometimes just one bit?
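The two-case scheme above can be written out; the function name `extract_from_ten` is ours. Conditioned on which case occurs, the emitted bits are uniform, which is exactly the "every k-bit string equally likely" property required of an extractor:

```python
def extract_from_ten(y):
    """Extract fair bits from y drawn uniformly from {0, ..., 9}:
    values 0..7 yield their 3-bit binary representation; 8 and 9,
    which remain equally likely, yield a single bit each."""
    if y < 8:
        return format(y, "03b")  # three simultaneously produced fair bits
    return str(y - 8)            # one fair bit: '0' for 8, '1' for 9

print(extract_from_ten(5))  # 101
print(extract_from_ten(9))  # 1
```

On average this emits (8/10)·3 + (2/10)·1 = 2.6 bits per draw, below H(Y) = lg 10 ≈ 3.32, consistent with the upper bound on the rate of extraction stated later.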
From 10 to 11
What if Z ∈ [0, 10] is an integer selected uniformly at random from among 11 integers? Now 11 = 2^3 + 2^1 + 2^0 = 8 + 2 + 1. There is no apparent way to extend the construction for 10 integers, unless we modify the method significantly. Sometimes we output three bits, sometimes just one bit, and sometimes nothing at all?
Limits on extraction
Fact (upper bound on the rate of extraction): no extraction function can produce more than H(X) bits simultaneously on average, if each random variable among X_1, X_2, ... has the probability distribution of X.
Extracting randomness from a biased coin
We want to process outcomes of tosses of a coin. Such outcomes represent a Bernoulli sequence, that is, a sequence of outcomes of independent Bernoulli trials. There is some probability p of heads coming up (which we call success), and q = 1 - p is the probability of tails (failure). The number p is not known: we want an extractor that works similarly for any 0 < p < 1 without having p as part of its code. This is obviously impossible!
The main insight
The input is made of outcomes of experiments x_1, x_2, x_3, ..., each either heads H or tails T, produced by independent tosses of a coin (we do not know whether it is fair). Partition the input into pairs (x_1, x_2), (x_3, x_4), (x_5, x_6), .... There are four possibilities for a pair: HH, TT, HT, TH, which occur with the respective probabilities p^2, q^2, pq and qp.
We experience illumination
The probability of HT is the same as that of TH: pq = qp. Output T for each pair TH. Output H for each pair HT. Ignore the pairs TT and HH. The guy who invented this simply knew God's phone number.
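This is von Neumann's trick; a sketch of it, with a small statistical check (the names and the H/T string encoding are illustrative choices):

```python
import random

def von_neumann(flips):
    """Von Neumann's trick: turn independent tosses of a coin of unknown
    bias into unbiased bits. Process non-overlapping pairs; output H for
    HT, T for TH, and discard HH and TT."""
    out = []
    for a, b in zip(flips[0::2], flips[1::2]):
        if a != b:
            out.append(a)  # a == 'H' exactly when the pair was HT
    return out

random.seed(1)
flips = random.choices("HT", weights=[0.3, 0.7], k=200_000)
bits = von_neumann(flips)
print(sum(b == "H" for b in bits) / len(bits))  # close to 1/2
```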
Efficiency of this extraction
How many input bits do we use per output bit on average? Each pair HT or TH can be considered a success in producing output, so a success occurs with probability 2pq. The average waiting time for a success is 1/(2pq) pairs, or twice as many, 1/(pq), coin tosses. When p = q, four coin tosses are needed on average to produce one output bit. It could be worse, as p(1 - p) < 1/4 for p ≠ 1/2. Example: for p = 1/3, the expected number of coin tosses per output bit is 9/2 > 4.
A streamlined extraction procedure
We partition the input, consisting of consecutive realizations of the probability distribution of X, into consecutive pairs (x_1, x_2), (x_3, x_4), (x_5, x_6), .... We send bits to three streams: one of them is the output, and the other two are denoted Y and Z. The bits making up Y and Z have the property that they are outcomes of independent tosses of coins of some unknown biases: one coin for Y with its bias, and another coin for Z with its bias.
Three possible actions
There are three actions to perform for each consecutive pair (x_i, x_{i+1}) of inputs; some of these actions may be void.
1. Send to output: if (x_i, x_{i+1}) = HT then output H, and if (x_i, x_{i+1}) = TH then output T.
2. Send to Y: if (x_i, x_{i+1}) = HH then add H to Y, and if (x_i, x_{i+1}) = TT then add T to Y.
3. Send to Z: if (x_i, x_{i+1}) is HH or TT then add H to Z, and if (x_i, x_{i+1}) is HT or TH then add T to Z.
Final comments
The streams Y and Z are processed recursively, and the outcomes of this processing are interleaved with what we send directly to the output. The ultimate output bits, obtained from the directly produced output interleaved with the bits produced recursively from Y and Z, are independent and unbiased. As the process continues, the number of streams increases, with no upper bound on the number of recursively processed streams.
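The three actions plus the recursion can be sketched as follows (in the literature this iterated scheme is due to Peres). One simplification relative to the slides: this sketch concatenates the recursive outputs after the direct output instead of interleaving them; each emitted bit is still unbiased:

```python
import random

def peres(flips):
    """Iterated von Neumann extraction. For each pair: HT/TH sends a bit
    to the output, HH/TT sends the repeated symbol to stream Y, and every
    pair sends to stream Z a record of whether its two symbols were equal
    ('H') or different ('T'). Y and Z are then processed recursively."""
    if len(flips) < 2:
        return []
    out, y, z = [], [], []
    for a, b in zip(flips[0::2], flips[1::2]):
        if a != b:
            out.append(a)   # HT -> H, TH -> T: equally likely, so unbiased
            z.append("T")
        else:
            y.append(a)     # HH adds H to Y, TT adds T to Y
            z.append("H")
    return out + peres(y) + peres(z)

random.seed(2)
flips = random.choices("HT", weights=[0.3, 0.7], k=100_000)
bits = peres(flips)
print(sum(b == "H" for b in bits) / len(bits))  # close to 1/2
```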
Explaining the algorithm
Let us look at the occurrences of tails in the three streams, to argue why they are independent. Tails can be obtained in the following three ways:
1. tails in the output come from input pairs of the form TH;
2. tails in Y come from input pairs of the form TT;
3. tails in Z come from input pairs of the form HT or TH.
The pairs TT are independent of the pairs TH and HT, as they are produced by different coin tosses, with no overlap. An occurrence of tails in the output is independent of those in Z: given tails in Z, the pair could be either HT, which adds H to the output, or TH, which adds T to the output.
Quality
The extractor we described is optimal: it extracts all the randomness that is there. (See the lecture notes for details.)
Homework
Your friend flips a fair coin repeatedly until the first heads occurs. Let X be a random variable equal to the number of flips. You want to determine how many flips were performed in a specific experiment. You are allowed to ask a series of yes/no questions of the following form: you give your friend a set of integers, and your friend answers yes if the number of flips is in the set and no otherwise.
Questions with hints
1. Give a formula for H(X). Hint: this is a specific random variable; apply the definition of entropy.
2. Describe a strategy such that the expected number of questions you ask before determining the number of flips is H(X). Hint: find a strategy with a formula for the expected number of questions that looks the same as the formula for H(X).
3. Give an intuitive explanation of why there is no strategy that would allow you to ask fewer than H(X) questions on average. Hint: just verbalize your intuitions, referring to entropy.