COS597D: Information Theory in Computer Science, September 21, 2011, Lecture 2

Lecturer: Mark Braverman    Scribe: Mark Braverman (based on lecture notes by Anup Rao and Kevin Zatloukal)

In the last lecture, we introduced entropy H(X) and conditional entropy H(X | Y), and showed how they are related via the chain rule. We also proved the inequality H(X_1, ..., X_n) ≤ H(X_1) + ... + H(X_n), and we showed that H(X | Y) ≤ H(X), with equality if and only if X and Y are independent. Similarly to this inequality, one can show more generally:

Lemma 1. H(X | YZ) ≤ H(X | Y).

In this lecture, we will derive one more useful inequality and then give some examples of applying entropy in combinatorics. We start with some simple examples.

1  Warm-up examples

Example 2. Prove that for any deterministic function g(Y), H(X | Y) ≤ H(X | g(Y)).

Solution. We have H(X | Y) = H(X | Y, g(Y)) ≤ H(X | g(Y)). The equality holds because the value of Y completely determines the value of g(Y); the inequality is an application of Lemma 1.

Example 3. Consider a bin with n balls of various colors. In one experiment, k ≤ n balls are taken out with replacement; in another, the balls are taken out without replacement. In which case does the resulting sequence have higher entropy?

Solution. One quick way to get the answer (but not the proof) is to test the question on very small numbers. Suppose n = k = 2 and the bin contains one red and one blue ball. Then the possible outcomes of the first experiment are RR, RB, BR, and BB, all with equal probability, so the entropy of the first experiment is 2 bits. The possible outcomes of the second experiment are RB and BR, so its entropy is only 1 bit. Thus we should try to prove that the first experiment has the higher entropy.

Denote the random variables representing the colors of the balls in the first experiment by X_1, ..., X_k and in the second experiment by Y_1, ..., Y_k. The first experiment is run with replacement, so the X_i are independent and identically distributed. Thus we have

H(X_1 ... X_k) = H(X_1) + H(X_2 | X_1) + ... + H(X_k | X_1 ... X_{k-1}) = H(X_1) + H(X_2) + ... + H(X_k) = k · H(X_1),

where the first equality is just the chain rule, and the second follows from the fact that the variables are independent. Next, we observe that the distribution of the i-th ball drawn in the second experiment is the same as the original distribution of the colors in the bin, which is the same as the distribution of X_1. Thus H(Y_i) = H(X_1) for all i. Finally, we get

H(Y_1 ... Y_k) = H(Y_1) + H(Y_2 | Y_1) + ... + H(Y_k | Y_1 ... Y_{k-1}) ≤ H(Y_1) + H(Y_2) + ... + H(Y_k) = k · H(X_1),

which shows that the second experiment has the lower entropy. Note that equality holds only when the Y_i are independent, which happens only if all the balls have the same color, in which case both entropies are 0!
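A quick way to convince oneself of the conclusion of Example 3 is to enumerate both experiments by brute force for a small bin. The following Python sketch is an illustrative check only (the bin contents and the value of k below are arbitrary choices, not from the notes); it compares the entropy of the color sequence with and without replacement.

```python
# Numerical check of Example 3 (a sketch, not part of the original notes):
# the entropy of k draws with replacement should be at least the entropy
# of k draws without replacement.
from collections import Counter
from itertools import product, permutations
from math import log2

def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

bin_colors = ['R', 'R', 'B', 'G']   # n = 4 balls, not all distinct
k = 3

# With replacement: each of the n^k ordered sequences of ball indices is equally likely.
with_repl = Counter(tuple(bin_colors[i] for i in seq)
                    for seq in product(range(len(bin_colors)), repeat=k))
total = sum(with_repl.values())
H_with = entropy({s: c / total for s, c in with_repl.items()})

# Without replacement: each ordered k-tuple of distinct ball indices is equally likely.
without_repl = Counter(tuple(bin_colors[i] for i in seq)
                       for seq in permutations(range(len(bin_colors)), k))
total = sum(without_repl.values())
H_without = entropy({s: c / total for s, c in without_repl.items()})

print(f"H(with replacement)    = {H_with:.4f} bits")
print(f"H(without replacement) = {H_without:.4f} bits")
assert H_with >= H_without - 1e-9
```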

Example 4. Let X and Y be two (not necessarily independent) random variables. Let Z be a random variable obtained by tossing a fair coin and setting Z = X if the coin comes up heads and Z = Y if the coin comes up tails. What can we say about H(Z) in terms of H(X) and H(Y)?

Solution. We claim that

(H(X) + H(Y))/2 ≤ H(Z) ≤ (H(X) + H(Y))/2 + 1.

These bounds are tight. To see this, consider the case where X = 0 and Y = 1 are just two constant random variables, so H(X) = H(Y) = 0. We then have Z ~ B_{1/2}, distributed as a fair coin, with H(Z) = 1 = (H(X) + H(Y))/2 + 1. At the same time, if X and Y are any i.i.d. (independent, identically distributed) random variables, then Z has the same distribution as X and Y, and thus H(Z) = H(X) = H(Y) = (H(X) + H(Y))/2.

To prove the inequalities, denote by S the outcome of the coin toss that selects whether Z = X or Z = Y. Then we have

H(Z) ≥ H(Z | S) = (1/2) H(Z | S = 0) + (1/2) H(Z | S = 1) = (1/2) H(X) + (1/2) H(Y).

For the other direction, note that

H(Z) ≤ H(Z, S) = H(S) + H(Z | S) = 1 + (1/2) H(X) + (1/2) H(Y).

2  One more inequality

We show that the uniform distribution has the highest entropy. In fact, we will see that a distribution has the maximum entropy if and only if it is uniform.

Lemma 5. Let supp(X) be the support of X. Then H(X) ≤ log |supp(X)|.

Proof. We write H(X) as an expectation and then apply Jensen's inequality (from Lecture 1), using the concavity of the function log t:

H(X) = ∑_x p(x) log(1/p(x)) ≤ log( ∑_x p(x) · (1/p(x)) ) = log ∑_x 1 = log |supp(X)|,

where the sums range over x in supp(X).
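Since H(Z) in Example 4 depends only on the marginal distributions of X and Y, both bounds are easy to test numerically. The sketch below is an illustrative check, not part of the notes (the alphabet size and number of trials are arbitrary choices); it draws random joint distributions on a common 4-letter alphabet and verifies the two inequalities.

```python
# Randomized sanity check of Example 4 (a sketch, not part of the original notes):
# for random joint distributions of (X, Y) on a small alphabet, verify
#   (H(X)+H(Y))/2 <= H(Z) <= (H(X)+H(Y))/2 + 1,
# where Z equals X or Y according to an independent fair coin.
import random
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def marginals_and_mixture(joint):
    """joint[a][b] = Pr[X=a, Y=b]; returns H(X), H(Y), H(Z) for the coin mixture Z."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pz = [0.5 * x + 0.5 * y for x, y in zip(px, py)]   # Pr[Z=a] = (Pr[X=a]+Pr[Y=a])/2
    return H(px), H(py), H(pz)

random.seed(0)
for _ in range(1000):
    size = 4
    weights = [[random.random() for _ in range(size)] for _ in range(size)]
    total = sum(map(sum, weights))
    joint = [[w / total for w in row] for row in weights]
    hx, hy, hz = marginals_and_mixture(joint)
    assert (hx + hy) / 2 - 1e-9 <= hz <= (hx + hy) / 2 + 1 + 1e-9
print("Example 4 bounds held on 1000 random joint distributions.")
```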

3  Applications

3.1  Bounding the Binomial Tail

Suppose 2^n + 1 people have each watched a subset of n movies. Since there are only 2^n possible subsets of these movies, by the pigeonhole principle there must be two people who have watched exactly the same subset. We shall show how to argue something similar in less trivial cases where the pigeonhole principle does not apply directly. This first example shows one way to do that using information theory.

Suppose now that 2^{n/2} people have each watched a subset of the n movies, and every person has watched at least 90% of the movies. If the number of possible subsets meeting this constraint is less than 2^{n/2}, then we must have two people who have watched exactly the same subset of movies, as before. The following result gives us what we need.

Lemma 6. If k ≤ n/2, then ∑_{i=0}^{k} C(n, i) ≤ 2^{n H(k/n)}, where C(n, i) denotes the binomial coefficient.

We would like to bound ∑_{i=0.9n}^{n} C(n, i). Since C(n, i) = C(n, n-i), this sum equals ∑_{i=0}^{0.1n} C(n, i) ≤ 2^{n H(1/10)} < 2^{n/2}, since we can compute H(0.1) < 0.469 < 0.5. It remains only to prove the lemma.

Proof. Let X_1 X_2 ... X_n be a uniformly random string sampled from the set of n-bit strings of weight at most k. Thus H(X_1 X_2 ... X_n) = log( ∑_{i=0}^{k} C(n, i) ). Further, we have Pr[X_i = 1] = E[X_i]. By symmetry, this probability is the same for every i, so it equals (1/n) ∑_{j=1}^{n} E[X_j] = (1/n) E[ ∑_{j=1}^{n} X_j ] ≤ k/n, where we have used linearity of expectation. We can relate this to entropy using the fact that p ↦ H(p) is increasing for 0 ≤ p ≤ 1/2: since Pr[X_i = 1] ≤ k/n ≤ 1/2, we have H(X_i) = H(Pr[X_i = 1]) ≤ H(k/n). Finally, applying our inequality from last time, we see that

H(X_1 X_2 ... X_n) ≤ H(X_1) + ... + H(X_n) ≤ n H(k/n).

Thus we have shown that log( ∑_{i=0}^{k} C(n, i) ) ≤ n H(k/n), proving the lemma.
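Lemma 6 can also be checked numerically for small parameters. The following sketch is an illustrative check, not part of the notes; it verifies the bound exhaustively for n < 60 and evaluates H(0.1), the constant used in the movie example above.

```python
# Numerical check of Lemma 6 (a sketch, not part of the original notes):
# for k <= n/2, sum_{i<=k} C(n, i) <= 2^(n*H(k/n)).
from math import comb, log2

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for n in range(1, 60):
    for k in range(0, n // 2 + 1):
        lhs = sum(comb(n, i) for i in range(k + 1))
        rhs = 2 ** (n * binary_entropy(k / n))
        assert lhs <= rhs * (1 + 1e-9), (n, k, lhs, rhs)

# The concrete constant used above: H(0.1) < 0.469, so the number of subsets
# missing at most 10% of the n movies is below 2^(n/2).
print(f"H(0.1) = {binary_entropy(0.1):.4f}")
```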

3.2  Triangles and Vees [KR10]

Let G = (V, E) be a directed graph. We say that vertices (x, y, z) (not necessarily distinct) form a triangle if {(x, y), (y, z), (z, x)} ⊆ E. Similarly, we say they form a vee if {(x, y), (x, z)} ⊆ E. Let T be the number of triangles in the graph, and V the number of vees. We are interested in the relationship between V and T.

From any particular triangle, we can get one vee from each edge, say (x, y), by repeating the second vertex: (x, y, y) is a vee. If the vertices of a triangle are distinct, the number of vees obtained this way is equal to the number of triangles contributed by the vertices of the triangle, since the three cyclic permutations of (x, y, z) are distinct triangles. However, the same edge could be used in many different triangles, so this simple counting argument does not tell us anything about the relationship between V and T. We shall use an information-theoretic argument to show:

Lemma 7. In any directed graph, V ≥ T.

Proof. Let (X, Y, Z) be a uniformly random triangle. Then by the chain rule,

log T = H(X, Y, Z) = H(X) + H(Y | X) + H(Z | X, Y).

We will construct a distribution on vees with at least log T entropy, which, together with Lemma 5, implies that log V ≥ log T, i.e., V ≥ T. Since conditioning can only reduce entropy, we have

H(X, Y, Z) ≤ H(X) + H(Y | X) + H(Z | Y).

Now observe that by symmetry (a cyclic shift of a uniformly random triangle is again a uniformly random triangle), the joint distribution of (Y, Z) is exactly the same as that of (X, Y), so H(Z | Y) = H(Y | X). Thus we can simplify the bound to

log T ≤ H(X) + 2 H(Y | X).

Sample a random vee (A, B, C) with the distribution

q(a, b, c) = Pr[X = a] · Pr[Y = b | X = a] · Pr[Y = c | X = a].

In words, we sample the first vertex with the same distribution as X, and then sample two independent copies of Y to use as the second and third vertices. Observe that if q(a, b, c) > 0, then (a, b) and (a, c) are edges, so (a, b, c) must be a vee; hence this is a distribution on vees. On the other hand, the entropy of this distribution is

H(A) + H(B | A) + H(C | A, B) = H(A) + H(B | A) + H(C | A) = H(X) + 2 H(Y | X),

which is at least log T, as required.

The lemma is tight. Consider a graph with 3n vertices partitioned into three sets A, B, C with |A| = |B| = |C| = n. Suppose that the edges are {(a, b) : a ∈ A, b ∈ B}, and similarly for B, C and for C, A. For each triple (a, b, c) ∈ A × B × C, we get three triangles (a, b, c), (b, c, a), and (c, a, b), so T = 3n^3. On the other hand, each vertex a ∈ A is involved in n^2 vees of the form (a, b_1, b_2), and similarly for B and C. So V = 3n^3.

3.3  Counting Perfect Matchings [Rad97]

Suppose that we have a bipartite graph G = (A ∪ B, E) with |A| = |B| = n. A perfect matching is a subset of E that is incident to every vertex exactly once; hence it is a bijection between the sets A and B. How many perfect matchings can there be in a given graph? If we let d_v be the degree of vertex v, then a trivial bound on the number of perfect matchings is ∏_{i ∈ A} d_i. A tighter bound was proved by Brégman:

Theorem 8 (Brégman). The number of perfect matchings is at most ∏_{i ∈ A} (d_i!)^{1/d_i}.

To see that this is tight, consider the complete bipartite graph. Any bijection can be chosen, and the number of bijections is the number of permutations of n letters, which is n!. In this case, the bound of the theorem is ∏_{i=1}^{n} (n!)^{1/n} = (n!)^{(1/n) · n} = n!.

We give a simple proof of this theorem using information theory, due to Radhakrishnan [Rad97].

Proof. Let ρ be a uniformly random perfect matching, and for simplicity assume that A = {1, 2, ..., n}. Write ρ(i) for the neighbor of i under the matching ρ. Given a permutation τ, we write τ(≤ i) to denote the sequence τ(i), τ(i-1), ..., τ(1). Then, using the chain rule, the fact that conditioning only reduces entropy, and the fact that the entropy of a variable is at most the logarithm of the size of its support,

H(ρ) = ∑_{i=1}^{n} H(ρ(i) | ρ(≤ i-1)) ≤ ∑_{i=1}^{n} H(ρ(i)) ≤ ∑_{i=1}^{n} log d_i = log ∏_{i=1}^{n} d_i.

This proves that the number of matchings is at most ∏_{i=1}^{n} d_i.

Can we improve the proof? In bounding H(ρ(i) | ρ(≤ i-1)) by H(ρ(i)), we are losing too much by throwing away all the previously revealed values. To improve the bound, let us start by symmetrizing over the order in which we condition on the individual values of ρ. If π is any permutation, then conditioning in the order π(1), π(2), ..., π(n) shows that

H(ρ) = ∑_{i=1}^{n} H(ρπ(i) | ρπ(≤ i-1)),

where ρπ(i) denotes ρ(π(i)) and ρπ(≤ i-1) denotes the sequence ρ(π(i-1)), ..., ρ(π(1)). Since this is true for any choice of π, we can take the expectation over a uniformly random choice of π without changing the value. Let L be a uniformly random index in {1, 2, ..., n}, chosen independently of π. Then

H(ρπ(L) | L, π, ρπ(≤ L-1)) = (1/n) ∑_{i=1}^{n} H(ρπ(i) | π, ρπ(≤ i-1)) = (1/n) H(ρ).

Now we rewrite this quantity according to the contribution of each vertex in A:

H(ρπ(L) | L, π, ρπ(≤ L-1)) = ∑_{i=1}^{n} Pr[π(L) = i] · E_{π, L : π(L) = i} [ H(ρπ(L) | L, π, ρπ(≤ L-1)) ] = (1/n) ∑_{i=1}^{n} E_{π, L : π(L) = i} [ H(ρπ(L) | L, π, ρπ(≤ L-1)) ].

Consider any fixed perfect matching ρ. We are interested in the number of possible choices for ρπ(L), conditioned on π(L) = i, after ρπ(1), ..., ρπ(L-1) have been revealed. Let a_1, a_2, ..., a_{d_i - 1} be such that the set of neighbors of i in the graph is exactly {ρ(a_1), ρ(a_2), ..., ρ(a_{d_i - 1}), ρ(i)}. The pair (π, L) in the expectation can be sampled by first sampling a uniformly random permutation π and then setting L so that π(L) = i. Thus the ordering of {a_1, a_2, ..., a_{d_i - 1}, i} induced by π is uniformly random, and

|{ρ(a_1), ρ(a_2), ..., ρ(a_{d_i - 1})} ∩ {ρπ(L-1), ..., ρπ(1)}| = |{a_1, a_2, ..., a_{d_i - 1}} ∩ {π(L-1), ..., π(1)}|

is equally likely to be 0, 1, 2, ..., d_i - 1. Hence the number of available choices for ρπ(L) is equally likely to be bounded by 1, 2, ..., d_i. This allows us to bound

(1/n) H(ρ) = (1/n) ∑_{i=1}^{n} E_{π, L : π(L) = i} [ H(ρπ(L) | L, π, ρπ(≤ L-1)) ] ≤ (1/n) ∑_{i=1}^{n} (1/d_i) ∑_{j=1}^{d_i} log j = (1/n) ∑_{i=1}^{n} log( (d_i!)^{1/d_i} ) = (1/n) log ∏_{i ∈ A} (d_i!)^{1/d_i},

which proves that the number of perfect matchings is at most ∏_{i ∈ A} (d_i!)^{1/d_i}.
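Lemma 7 can be verified by brute force on small random digraphs. The sketch below is an illustrative check only (the choices n = 6 and edge probability 0.4 are arbitrary, not from the notes); it counts ordered triples directly from the definitions of triangles and vees.

```python
# Brute-force check of Lemma 7 (a sketch, not part of the original notes):
# in random directed graphs the number of vees is at least the number of triangles,
# where (x, y, z) is a triangle if (x,y), (y,z), (z,x) are all edges, and a vee if
# (x,y) and (x,z) are both edges (vertices need not be distinct).
import random
from itertools import product

def count_triangles_and_vees(n, edges):
    T = sum(1 for x, y, z in product(range(n), repeat=3)
            if (x, y) in edges and (y, z) in edges and (z, x) in edges)
    V = sum(1 for x, y, z in product(range(n), repeat=3)
            if (x, y) in edges and (x, z) in edges)
    return T, V

random.seed(1)
for _ in range(200):
    n = 6
    edges = {(u, v) for u in range(n) for v in range(n)
             if u != v and random.random() < 0.4}
    T, V = count_triangles_and_vees(n, edges)
    assert V >= T, (T, V)
print("V >= T held on 200 random digraphs.")
```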
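Theorem 8 can likewise be sanity-checked by enumerating all bijections in small random bipartite graphs and comparing the count with Brégman's bound. The sketch below is an illustrative check only; the parameters n = 5 and edge probability 0.6 are arbitrary choices, not from the notes.

```python
# Numerical illustration of Theorem 8 (a sketch, not part of the original notes):
# count perfect matchings of small random bipartite graphs by brute force and
# compare against Bregman's bound  prod_i (d_i!)^(1/d_i).
import random
from itertools import permutations
from math import factorial, prod

def count_perfect_matchings(adj):
    """adj[i] = set of neighbors of left vertex i; counts bijections contained in E."""
    n = len(adj)
    return sum(1 for perm in permutations(range(n))
               if all(perm[i] in adj[i] for i in range(n)))

def bregman_bound(adj):
    # A vertex of degree 0 admits no perfect matching; treat its factor as 0,
    # which keeps the bound valid in that degenerate case.
    return prod(factorial(len(nbrs)) ** (1 / len(nbrs)) if nbrs else 0 for nbrs in adj)

random.seed(2)
for _ in range(200):
    n = 5
    adj = [{j for j in range(n) if random.random() < 0.6} for _ in range(n)]
    m = count_perfect_matchings(adj)
    assert m <= bregman_bound(adj) + 1e-9, (adj, m)
print("Bregman's bound held on 200 random bipartite graphs.")
```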

References

[KR10] Swastik Kopparty and Benjamin Rossman. The homomorphism domination exponent. Technical report, arXiv, April 14, 2010.

[Rad97] Jaikumar Radhakrishnan. An entropy proof of Brégman's theorem. J. Comb. Theory, Ser. A, 77(1):161-164, 1997.