Entropy and Limit theorems in Probability Theory
Shigeki Aida

(This is one of the lectures of Mathematics B in the Graduate School of Science at Tohoku University.)

1 Introduction

Important Notice: Solve at least one problem from the following Problems 1-8 and submit the report to me by June 9.

What is entropy? Entropy represents the uncertainty of probabilistic phenomena. The following definition is due to Shannon.

Definition 1.1 (Shannon). Let us consider a finite set E = {A_1, ..., A_N}. A nonnegative function P on E is called a probability distribution if ∑_{i=1}^N P({A_i}) = 1. Each A_i is called an elementary event, and a subset of E is called an event. For this probability distribution P, we define the entropy by

  H(P) = -∑_{i=1}^N P({A_i}) log P({A_i}).  (1.1)

Remark 1.2. We use the convention 0 log 0 = 0. If we do not mention the base of the logarithm, log means the natural logarithm log_e. For a nonnegative sequence {p_i}_{i=1}^N with ∑_{i=1}^N p_i = 1 we also write

  H(p_1, ..., p_N) = -∑_{i=1}^N p_i log p_i.  (1.2)

Example 1.3.
(1) Coin tossing: E = {H, T} and P_1({H}) = P_1({T}) = 1/2. We have H(P_1) = log 2.
(2) Dice: E = {1, 2, 3, 4, 5, 6} and P_2({i}) = 1/6 (1 ≤ i ≤ 6). Then we have H(P_2) = log 6.
(3) Unfair dice: E = {1, 2, 3, 4, 5, 6}, P_3({1}) = 9/10, P_3({i}) = 1/50 (2 ≤ i ≤ 6). Then

  H(P_3) = (1/10) log 50 + (9/10) log(10/9) < log 6 = H(P_2).

Problem 1. For an unfair dice E = {1, 2, 3, 4, 5, 6} with the probability P_4({1}) = 8/10, P_4({2}) = 1/10, P_4({i}) = 1/40 (i = 3, 4, 5, 6), calculate the entropy H(P_4). Is H(P_4) bigger than H(P_3)?

In the above examples (1) and (2), the entropies are nothing but log #(all elementary events), because all elementary events have equal probabilities. The notion of entropy also appeared in statistical mechanics; of course, that discovery came before the one in information theory. The following is a basic property of the entropy.

Theorem 1.4. Suppose that #E = N. Then for any probability distribution P, we have

  0 ≤ H(P) ≤ log N.  (1.4)

The minimum value is attained by the probability measures such that P({A_i}) = 1 for some i. The maximum is attained by the uniform distribution P, namely, P({A_i}) = 1/N for all 1 ≤ i ≤ N.
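As a quick numerical illustration (a Python sketch added here for convenience, not part of the original problem set; the distributions are those of Example 1.3 and the helper name entropy is ours), one can compute these entropies and check the bounds of Theorem 1.4 directly:

    import math

    def entropy(p):
        """H(P) = -sum_i p_i log p_i with the convention 0 log 0 = 0 (natural log)."""
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    P1 = [1/2, 1/2]               # coin tossing
    P2 = [1/6] * 6                # fair dice
    P3 = [9/10] + [1/50] * 5      # unfair dice of Example 1.3 (3)

    for name, p in (("P1", P1), ("P2", P2), ("P3", P3)):
        # Theorem 1.4: 0 <= H(P) <= log N, with the maximum only for the uniform distribution.
        print(f"H({name}) = {entropy(p):.4f},  log N = {math.log(len(p)):.4f}")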
For the proof, we refer to the proof of Theorem 3.1 below. The notion of entropy can be used to solve the following problem.

Problem. Here are eight gold coins and a balance. One of the coins is an imitation and it is slightly lighter than the others. How many times do you need to use the balance to find the imitation?

Solution: In information theory, the entropy stands for the quantity of information. In the above problem, there are eight equally likely possibilities for which coin is the imitation, so the entropy is log 8. We gain information by using the balance. One use of the balance gives one of three outcomes: (1) both pans have the same weight, (2) the left pan is lighter, (3) the right pan is lighter. So one weighing carries the information log 3. Thus, by using the balance k times, we obtain an amount of information k log 3. If k log 3 < log 8, we do not get full information, so we need k ≥ 2. It is also not difficult to see that two weighings are enough. In general, if the number of coins N satisfies 3^{n-1} < N ≤ 3^n, then n weighings are enough. We refer to [15] for the details and various other examples.

Problem 2. In the above problem, how many times do you need to use the balance in the case where N = 7? Also present a method of using the balance.
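The counting in the solution is easy to sketch in code (an illustration only; the helper name min_weighings and the sample coin counts are ours): the function returns the smallest k with 3^k ≥ N, which is the number of weighings the entropy argument singles out. The matching weighing strategy still has to be exhibited by hand, as Problem 2 asks.

    import math

    def min_weighings(num_coins):
        """Smallest k with 3**k >= num_coins, i.e. k log 3 >= log(num_coins)."""
        k = 0
        while 3 ** k < num_coins:
            k += 1
        return k

    for n in (7, 8, 27):
        print(f"{n} coins: entropy log {n} = {math.log(n):.3f}, one weighing gives "
              f"log 3 = {math.log(3):.3f}, so {min_weighings(n)} weighings are needed")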
In order to go into more detail, we recall basic notions in probability theory.

2 Basic notions in probability theory

Mathematically, a probability space is defined to be a measure space whose total measure equals 1. We refer the audience to textbooks, e.g. [4], [11], [13], for the precise definitions.

Definition 2.1. A triplet (Ω, F, P) is called a probability space if the following hold.
(1) Ω is a set and F is a family of subsets of Ω satisfying:
  (i) if A_1, A_2, ..., A_i, ... ∈ F, then ∪_{i=1}^∞ A_i ∈ F;
  (ii) if A ∈ F, then A^c ∈ F;
  (iii) Ω, ∅ ∈ F.
(2) To each A ∈ F a nonnegative number P(A) is assigned, satisfying:
  (i) 0 ≤ P(A) ≤ 1 for all A ∈ F;
  (ii) P(Ω) = 1;
  (iii) (σ-additivity) if A_1, A_2, ..., A_i, ... ∈ F and A_i ∩ A_j = ∅ (i ≠ j), then P(∪_{i=1}^∞ A_i) = ∑_{i=1}^∞ P(A_i).

The nonnegative function P: F → [0, 1] is called a probability measure on Ω. A ∈ F is called an event and P(A) is called the probability of A. A function X on Ω is called a random variable (or a measurable function) if for any interval I = [a, b],

  X^{-1}(I) := {ω ∈ Ω | X(ω) ∈ I} ∈ F

holds. For a random variable X, define

  µ_X(I) = P(X ∈ I),  (2.1)

where I denotes an interval in ℝ. µ_X is also a probability measure on ℝ and it is called the law of X or the distribution of X.

Problem 3. Let (Ω, F, P) be a probability space. Prove that if A_i ∈ F (i = 1, 2, ...), then ∩_{i=1}^∞ A_i ∈ F.

Problem 4. Let (Ω, F, P) be a probability space and let A, B ∈ F. Under the assumption that A ⊂ B, prove that P(A) ≤ P(B).

In this course, we consider the following types of random variables.
(1) The case where X is a discrete-type random variable: namely, X takes finitely many values {a_1, ..., a_n}. Then p_i := P({ω ∈ Ω | X(ω) = a_i}) =: P(X = a_i) satisfies ∑_{i=1}^n p_i = 1. The probability distribution µ_X of X is given by µ_X({a_i}) = p_i (1 ≤ i ≤ n).
(2) The case where X has a density function: that is, there exists a nonnegative function f(x) on ℝ such that for any interval [a, b],

  P({ω ∈ Ω | X(ω) ∈ [a, b]}) = ∫_a^b f(x) dx.

Definition 2.2. For a random variable X, we denote the expectation and the variance by m and σ² respectively. Namely:
(i) the case where X is a discrete-type random variable taking values a_1, ..., a_n:

  m = E[X] = ∑_{i=1}^n a_i P(X = a_i),  (2.2)
  σ² = E[(X - m)²] = ∑_{i=1}^n (a_i - m)² P(X = a_i);  (2.3)

(ii) the case where X is a continuous-type random variable with density function f:

  m = E[X] = ∫_{-∞}^{∞} x f(x) dx,  (2.4)
  σ² = E[(X - m)²] = ∫_{-∞}^{∞} (x - m)² f(x) dx.  (2.5)
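As a small illustration of Definition 2.2 (a sketch; the exponential density e^{-x} on [0, ∞) is an assumed example, not one used in the notes), the discrete formulas are finite sums and the continuous ones can be approximated by a Riemann sum:

    import math

    # Discrete case (Definition 2.2 (i)): a fair dice.
    values = [1, 2, 3, 4, 5, 6]
    probs = [1/6] * 6
    m = sum(a * p for a, p in zip(values, probs))
    var = sum((a - m) ** 2 * p for a, p in zip(values, probs))
    print(f"dice: m = {m:.4f}, sigma^2 = {var:.4f}")          # 3.5 and 35/12

    # Continuous case (Definition 2.2 (ii)): assumed density f(x) = e^{-x} on [0, infinity),
    # integrated by a crude midpoint rule on [0, 40].
    dx = 1e-3
    xs = [(i + 0.5) * dx for i in range(int(40 / dx))]
    fs = [math.exp(-x) for x in xs]
    m = sum(x * f for x, f in zip(xs, fs)) * dx
    var = sum((x - m) ** 2 * f for x, f in zip(xs, fs)) * dx
    print(f"exponential: m = {m:.4f}, sigma^2 = {var:.4f}")   # both close to 1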
Definition 2.3 (Independence of random variables). Let {X_i}_{i=1}^N be random variables on a probability space (Ω, F, P), where N is a natural number or N = ∞. {X_i}_{i=1}^N are said to be independent if for any finite subfamily {X_{i_k}}_{k=1}^m ⊂ {X_i}_{i=1}^N (m ≤ N) and any -∞ < a_k < b_k < ∞, the following holds:

  P(X_{i_1} ∈ [a_1, b_1], ..., X_{i_m} ∈ [a_m, b_m]) = ∏_{k=1}^m P(X_{i_k} ∈ [a_k, b_k]).  (2.6)

Definition 2.4. Let µ be a probability distribution on ℝ. Let {X_i}_{i=1}^∞ be independent random variables such that the probability distribution of X_i equals the same distribution µ for all i. Then {X_i}_{i=1}^∞ is said to be i.i.d. (= independent and identically distributed) with the distribution µ.

3 Entropy and the law of large numbers: Shannon-McMillan's theorem

Suppose we are given a set of numbers A = {1, ..., N} (N ≥ 2). We call A the alphabet and its elements are called letters. A finite sequence (ω_1, ω_2, ..., ω_n) (ω_i ∈ A) is called a sentence of length n. The set of sentences of length n is the product space A^n := {(ω_1, ..., ω_n) | ω_i ∈ A}. Let P be a probability distribution on A. We write P({i}) = p_i. In this section, we define the entropy of P by using the logarithm to the base N:

  H(P) = -∑_{i=1}^N P({i}) log_N P({i}).  (3.1)

We can prove:

Theorem 3.1. For every P, 0 ≤ H(P) ≤ 1 holds. H(P) = 0 holds if and only if P({i}) = 1 for some i ∈ A. H(P) = 1 holds if and only if P is the uniform distribution, that is, p_i = 1/N for all i.

Lemma 3.2. Let g(x) = x log x or g(x) = -log x. Then for any {m_i}_{i=1}^N with m_i ≥ 0 and ∑_{i=1}^N m_i = 1 and any nonnegative sequence {x_i}_{i=1}^N, we have

  g(∑_{i=1}^N m_i x_i) ≤ ∑_{i=1}^N m_i g(x_i).  (3.2)

Furthermore, when m_i > 0 for all i, the equality in (3.2) holds if and only if x_1 = ... = x_N.
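Lemma 3.2 is Jensen's inequality for the convex functions x log x and -log x. The following sketch (a pure numerical sanity check; the weights, points and helper names are ours) verifies (3.2) on many randomly generated instances:

    import math
    import random

    def g1(x):                     # x log x, with the convention g1(0) = 0
        return x * math.log(x) if x > 0 else 0.0

    def g2(x):                     # -log x (only evaluated at x > 0 below)
        return -math.log(x)

    random.seed(0)
    N = 6
    for g in (g1, g2):
        for _ in range(1000):
            m = [random.random() for _ in range(N)]
            s = sum(m)
            m = [mi / s for mi in m]                            # weights summing to 1
            x = [random.uniform(0.01, 5.0) for _ in range(N)]   # positive points
            lhs = g(sum(mi * xi for mi, xi in zip(m, x)))
            rhs = sum(mi * g(xi) for mi, xi in zip(m, x))
            assert lhs <= rhs + 1e-12, (lhs, rhs)
    print("inequality (3.2) holds in all sampled cases")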
Proof of Theorem 3.1: First, we consider the lower bound. Applying (3.2) to the case where m_i = x_i = p_i and g(x) = -log x, we have

  H(p_1, ..., p_N) = -∑_{i=1}^N p_i log_N p_i ≥ -log_N(∑_{i=1}^N p_i²) ≥ -log_N 1 = 0.  (3.3)

Clearly, the equality in (3.3) holds if and only if p_i = 1 for some i. Next, we consider the upper bound. Applying Lemma 3.2 to the case where m_i = 1/N, x_i = p_i and g(x) = x log x, for any probability distribution {p_i} we have

  g(∑_{i=1}^N p_i / N) ≤ (1/N) ∑_{i=1}^N g(p_i).  (3.4)

Since ∑_{i=1}^N p_i = 1, this implies

  (1/N) log(1/N) ≤ (1/N) ∑_{i=1}^N p_i log p_i.

Thus -∑_{i=1}^N p_i log p_i ≤ log N, and hence -∑_{i=1}^N p_i log_N p_i ≤ 1. By the last assertion of Lemma 3.2, the equality holds iff p_i = 1/N for all i. □

Now we consider the following situation. There is a memoryless information source S which sends out a letter according to the probability distribution P at each time, independently. Mathematically, we consider i.i.d. {X_i}_{i=1}^∞ with the distribution P. We consider the coding problem for the sequence of letters.

Basic observation:
(1) Suppose that P({1}) = 1 and P({i}) = 0 (2 ≤ i ≤ N). Then the random sequence X_i is actually the deterministic sequence (1, 1, ..., 1, ...), so there is no variety in the sequences at all. In this case we do not need to send the whole sequence. In fact, if we know that the entropy of P is 0, then immediately after receiving the first letter we know that all subsequent letters are 1. Namely, we can encode all sentences, whatever their lengths, into just one letter.
(2) Suppose that N ≥ 3 and consider a probability measure such that P({1}) = P({2}) = 1/2 and P({i}) = 0 for 3 ≤ i ≤ N. Then sequences containing a letter i ≥ 3 are never sent out, so the number of possible sequences of length n under P is 2^n. On the other hand, the number of all sequences over the alphabet A of length k is N^k. Thus, if N^k ≥ 2^n, that is, k/n ≥ log_N 2 = H(P), then all possible sentences of length n can be encoded into sentences of length k (≤ n), and decoding is also possible. More precisely, we can prove the following claim.

Claim. If k/n ≥ H(P), then there exist an encoder ϕ: A^n → A^k and a decoder ψ: A^k → A^n such that

  P(ψ(ϕ(X_1, ..., X_n)) ≠ (X_1, ..., X_n)) = 0.  (3.5)

The probability P(ψ(ϕ(X_1, ..., X_n)) ≠ (X_1, ..., X_n)) is called the error probability. For general P, we can prove the following theorem [6].

Theorem 3.3 (Shannon-McMillan). Take a positive number R > H(P). For any ε > 0, there exists M ∈ ℕ such that for all n ≥ M and all k satisfying k ≥ nR, there exist ϕ: A^n → A^k and ψ: A^k → A^n such that

  P(ψ(ϕ(X_1, ..., X_n)) ≠ (X_1, ..., X_n)) ≤ ε.  (3.6)

R is called the coding rate.
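Before the proof, here is a small numerical illustration of the mechanism behind Theorem 3.3 (a sketch only; the alphabet size, the distribution P and the rate R below are assumed values, not taken from the notes). It enumerates, for a 3-letter alphabet, the set of "high-probability" sentences that will be called C_n in the proof below, and checks that this set is small enough to be encoded at rate R while the remaining sentences carry little probability:

    import itertools
    import math

    P = [0.7, 0.2, 0.1]                          # assumed source distribution
    N = len(P)
    H = -sum(p * math.log(p, N) for p in P)      # entropy to the base N, as in (3.1)
    R = H + 0.15                                 # an assumed coding rate R > H(P)

    for n in (4, 8, 12):
        threshold = N ** (-n * R)
        mass_outside, size_C = 0.0, 0
        for seq in itertools.product(range(N), repeat=n):
            prob = math.prod(P[i] for i in seq)
            if prob >= threshold:                # the sentence belongs to C_n
                size_C += 1
            else:
                mass_outside += prob
        print(f"n={n:2d}: #C_n = {size_C:6d} <= N^(nR) = {N ** (n * R):9.1f}, "
              f"error probability = {mass_outside:.4f}")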
We need the following estimate for the proof of the above theorem.

Lemma 3.4. Let {Z_i}_{i=1}^∞ be i.i.d. Suppose that E[|Z_i|] < ∞ and E[|Z_i|²] < ∞. Then

  P( |(Z_1 + ... + Z_n)/n - m| ≥ δ ) ≤ σ²/(n δ²),  (3.7)

where m = E[Z_i] and σ² = E[(Z_i - m)²].

Problem 5. Prove Lemma 3.4 using Chebyshev's inequality.

This lemma immediately implies the following weak law of large numbers.

Theorem 3.5. Under the same assumptions on {Z_i} as in Lemma 3.4, for any δ > 0,

  lim_{n→∞} P( |(Z_1 + ... + Z_n)/n - m| ≥ δ ) = 0.  (3.8)
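A quick Monte Carlo check of (3.7) (a sketch; fair-dice rolls are used as an assumed choice of the i.i.d. variables Z_i, and the tolerance δ and trial count are ours): the empirical frequency of a large deviation of the sample mean is compared with the bound σ²/(nδ²):

    import random

    random.seed(1)
    m, var = 3.5, 35 / 12            # mean and variance of one fair-dice roll
    delta, trials = 0.25, 5000

    for n in (50, 200, 800):
        bad = 0
        for _ in range(trials):
            mean = sum(random.randint(1, 6) for _ in range(n)) / n
            if abs(mean - m) >= delta:
                bad += 1
        bound = var / (n * delta ** 2)
        print(f"n={n:3d}: empirical P(|mean - m| >= {delta}) = {bad / trials:.4f}, "
              f"bound (3.7) = {bound:.4f}")

The bound is far from sharp, but it tends to 0 as n → ∞, which is all the weak law of large numbers (3.8) needs.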
Proof of Theorem 3.3: Take n ∈ ℕ. The probability distribution of the i.i.d. finite sequence {X_i}_{i=1}^n is the probability distribution P_n on A^n such that for any {a_i}_{i=1}^n,

  P_n({ω_1 = a_1, ..., ω_n = a_n}) = ∏_{i=1}^n P({a_i}).  (3.9)

Let us consider the random variables on A^n

  Z_i(ω) = -log_N P({ω_i})  (1 ≤ i ≤ n).

Then {Z_i}_{i=1}^n are i.i.d. and their expectation and variance are finite. In fact, we have

  m = E[Z_i] = -∑_{i=1}^N P({i}) log_N P({i}) = H(P),
  σ² = E[(Z_i - E[Z_i])²] = ∑_{i=1}^N p_i (log_N p_i)² - H(P)².  (3.10)

Take δ > 0 such that R > H(P) + δ. By Lemma 3.4,

  P_n( | -(1/n) ∑_{i=1}^n log_N P({ω_i}) - H(P) | ≥ δ ) ≤ σ²/(n δ²).  (3.11)

Hence, for any ε > 0, there exists M ∈ ℕ such that

  P_n( -(1/n) ∑_{i=1}^n log_N P({ω_i}) ≥ H(P) + δ ) ≤ ε  for all n ≥ M.  (3.12)

Noting that

  { (ω_1, ..., ω_n) | -(1/n) ∑_{i=1}^n log_N P({ω_i}) < H(P) + δ }
    = { (ω_1, ..., ω_n) | ∏_{i=1}^n P({ω_i}) > N^{-n(H(P)+δ)} }
    ⊂ { (ω_1, ..., ω_n) | ∏_{i=1}^n P({ω_i}) ≥ N^{-nR} } =: C_n,  (3.13)

by (3.12) we have, for n ≥ M,

  P( (X_1, ..., X_n) ∉ C_n ) = P_n({ (ω_1, ..., ω_n) ∈ A^n | ∏_{i=1}^n P({ω_i}) < N^{-nR} })
    ≤ P_n({ (ω_1, ..., ω_n) ∈ A^n | ∏_{i=1}^n P({ω_i}) ≤ N^{-n(H(P)+δ)} }) ≤ ε.  (3.14)

On the other hand, we have

  1 ≥ P_n(C_n) = P_n({ (ω_1, ..., ω_n) ∈ A^n | ∏_{i=1}^n P({ω_i}) ≥ N^{-nR} }) ≥ #C_n · N^{-nR},  (3.15)

hence

  #C_n ≤ N^{nR}.  (3.16)

By this estimate, if k ≥ nR, then #A^k = N^k ≥ N^{nR} ≥ #C_n, so there exist an injective map φ: C_n → A^k and a map ψ: A^k → C_n such that ψ(φ(ω_1, ..., ω_n)) = (ω_1, ..., ω_n) for any (ω_1, ..., ω_n) ∈ C_n. By taking a map ϕ: A^n → A^k which satisfies ϕ|_{C_n} = φ, we have

  P(ψ(ϕ(X_1, ..., X_n)) ≠ (X_1, ..., X_n)) ≤ P((X_1, ..., X_n) ∉ C_n) ≤ ε.  (3.17)

This completes the proof. □

4 Entropy and the central limit theorem

Let {X_i}_{i=1}^∞ be i.i.d. such that m = E[X_i] = 0 and σ² = E[X_i²] = 1. Let

  S_n = (X_1 + ... + X_n)/√n.

Then we have:

Theorem 4.1 (Central limit theorem = CLT). For any -∞ < a < b < ∞,

  lim_{n→∞} P(S_n ∈ [a, b]) = ∫_a^b (1/√(2π)) e^{-x²/2} dx.  (4.1)

The probability distribution whose density is G(x) = (1/√(2π)) e^{-x²/2} is called the normal distribution (Gaussian distribution) with mean 0 and variance 1, and the notation N(0, 1) stands for this distribution. A standard proof of the CLT uses the characteristic function of S_n, ϕ(t) = E[e^{itS_n}] (t ∈ ℝ).
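A Monte Carlo sketch of Theorem 4.1 (the choice X_i uniform on [-√3, √3], which has mean 0 and variance 1, and the interval [a, b] = [-1, 1] are assumptions made only for this illustration): the empirical probability P(S_n ∈ [a, b]) is compared with the Gaussian integral, evaluated via the error function:

    import math
    import random

    def gaussian_prob(a, b):
        """Integral of (1/sqrt(2 pi)) e^{-x^2/2} over [a, b], via the error function."""
        Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
        return Phi(b) - Phi(a)

    random.seed(2)
    a, b, trials = -1.0, 1.0, 20000
    c = math.sqrt(3.0)                 # X_i uniform on [-sqrt(3), sqrt(3)]: mean 0, variance 1
    for n in (2, 8, 64):
        hits = sum(
            1 for _ in range(trials)
            if a <= sum(random.uniform(-c, c) for _ in range(n)) / math.sqrt(n) <= b
        )
        print(f"n={n:2d}: P(S_n in [{a}, {b}]) ~ {hits / trials:.4f},  "
              f"N(0,1) gives {gaussian_prob(a, b):.4f}")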
Below, we assume:

Assumption 4.2. The distribution of X_i has the density function f, namely, P(X_i ∈ [a, b]) = ∫_a^b f(x) dx.

Then we can prove that the distribution of S_n also has a density function f_n(x), by Lemma 4.3 (1). In this case, (4.1) is equivalent to

  lim_{n→∞} ∫_a^b f_n(x) dx = ∫_a^b G(x) dx.  (4.2)

By using the entropy of S_n, we can prove the stronger result

  lim_{n→∞} ∫_{-∞}^{∞} |f_n(x) - G(x)| dx = 0  (4.3)

under additional assumptions on f. For a distribution P which has the density f(x), and a random variable X whose distribution is P, we define the entropy H and Fisher's information I by

  H(P) = -∫_{-∞}^{∞} f(x) log f(x) dx,  (4.4)
  I(P) = ∫_{-∞}^{∞} f'(x)²/f(x) dx.  (4.5)

We may denote H(P) by H(X) or H(f), and I(P) by I(X) or I(f). We summarize basic results on H and I.

Lemma 4.3.
(1) If independent random variables X and Y have the density functions f and g respectively, then aX + Y (a > 0) has the density function

  h(x) = ∫_{-∞}^{∞} (1/a) f((x - y)/a) g(y) dy.

(2) (Gibbs's lemma) Let f(x) be the density of a probability distribution whose mean is 0 and whose variance is 1, that is,

  ∫_{-∞}^{∞} x f(x) dx = 0,  (4.6)
  ∫_{-∞}^{∞} x² f(x) dx = 1.  (4.7)

Then we have

  H(f) ≤ H(G).  (4.8)

The equality holds if and only if f(x) = G(x) for all x.

(3) (Shannon-Stam inequality) Let X, Y be independent random variables whose density functions satisfy (4.6) and (4.7). Then for a, b with a² + b² = 1, we have

  a² H(X) + b² H(Y) ≤ H(aX + bY).  (4.9)

The equality holds iff the laws of X and Y are N(0, 1).

(4) (Blachman-Stam inequality) Let X, Y be independent random variables whose density functions satisfy (4.6) and (4.7). Then for a, b with a² + b² = 1,

  I(aX + bY) ≤ a² I(X) + b² I(Y).  (4.10)

(5) (Csiszár-Kullback-Pinsker inequality) For a probability density function f satisfying (4.6) and (4.7), we have

  (1/2) ( ∫_{-∞}^{∞} |f(x) - G(x)| dx )² ≤ H(G) - H(f).  (4.11)

(6) (Stam's inequality) For a probability density function f, we have

  (1/(2πe)) e^{2H(f)} ≥ 1/I(f).  (4.12)

(7) (Gross's inequality) For a probability density function f, we have

  H(f) ≥ (1/2) log(2πe) - (1/2)(I(f) - 1).  (4.13)
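The quantities (4.4)-(4.5) and several of these inequalities are easy to check numerically. The sketch below (the bimodal Gaussian mixture is an assumed test density with mean 0 and variance 1, not an example from the notes, and the grid parameters are ours) integrates H and I on a grid; one should observe H(G) = (1/2) log(2πe) ≈ 1.4189, I(G) = 1, and H(f) ≤ H(G) as Gibbs's lemma requires:

    import math

    def gaussian(x, mean=0.0, sd=1.0):
        return math.exp(-(x - mean) ** 2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    # Assumed test density with mean 0 and variance 1: an equal mixture of
    # N(-0.8, 0.36) and N(+0.8, 0.36); note 0.8**2 + 0.36 = 1.
    def f(x):
        return 0.5 * gaussian(x, -0.8, 0.6) + 0.5 * gaussian(x, 0.8, 0.6)

    def entropy(density, lo=-12.0, hi=12.0, n=100000):
        """H of (4.4) by the midpoint rule."""
        dx = (hi - lo) / n
        return -sum(p * math.log(p) * dx
                    for p in (density(lo + (i + 0.5) * dx) for i in range(n)) if p > 0)

    def fisher(density, lo=-12.0, hi=12.0, n=100000, eps=1e-5):
        """I of (4.5) by the midpoint rule, with a central-difference derivative."""
        dx = (hi - lo) / n
        total = 0.0
        for i in range(n):
            x = lo + (i + 0.5) * dx
            p = density(x)
            dp = (density(x + eps) - density(x - eps)) / (2 * eps)
            total += dp * dp / p * dx
        return total

    print(f"H(G) = {entropy(gaussian):.4f}  (exact value 0.5*log(2*pi*e) = "
          f"{0.5 * math.log(2 * math.pi * math.e):.4f})")
    print(f"H(f) = {entropy(f):.4f}  <= H(G), as in (4.8)")
    print(f"I(G) = {fisher(gaussian):.4f}  (exact value 1)")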
Problem 6. Using Stam's inequality and the elementary inequality log x ≤ x - 1 (x > 0), prove Gross's inequality.

Problem 7. Prove Stam's inequality by using Gross's inequality in the following way. Show that f_t(x) = t f(tx) is a probability density function for any t > 0. Next, substituting f_t into Gross's inequality, obtain an estimate for H(f). Optimize the estimate for H(f) by choosing a suitable t and prove Stam's inequality.

Problem 8. Prove that Gross's inequality is equivalent to the following logarithmic Sobolev inequality of Gross: for all u = u(x) with ∫_{-∞}^{∞} u(x)² dµ(x) = 1, we have

  ∫_{-∞}^{∞} u(x)² log u(x)² dµ(x) ≤ 2 ∫_{-∞}^{∞} u'(x)² dµ(x),  (4.14)

where dµ(x) = (1/√(2π)) e^{-x²/2} dx.

Remark 4.4. N(f) = (1/(2πe)) e^{2H(f)} is called Shannon's entropy power functional. Stam [9] proved his inequality in 1959. This inequality reveals a relation between the Fisher information and the Shannon entropy. Later, Gross [5] proved his log-Sobolev inequality in the form (4.14) in 1975. He also proved that the log-Sobolev inequality is equivalent to the hypercontractivity of the corresponding Ornstein-Uhlenbeck semigroup. Hypercontractivity is very important in the study of quantum field theory, and Gross's motivation came from the study of quantum field theory. After Gross's work, the importance of the log-Sobolev inequality became widely known. Meanwhile, Stam's contribution had been forgotten, but some people noted it, as in [2, 3], and some people call the inequality the Stam-Gross logarithmic Sobolev inequality, as in [10]. My recent work [1] is related to the semiclassical problem for P(φ)₂-Hamiltonians which appear in constructive quantum field theory. In that work, I used the logarithmic Sobolev inequality to determine the asymptotic behavior of the ground state energy (= the lowest eigenvalue of the Hamiltonian) in the semiclassical limit ℏ → 0.

By Lemma 4.3 (1), S_n has a density function f_n. Also note that H(S_{nm}) ≥ H(S_n) for any n, m ∈ ℕ. We can prove that f_n converges to G.

Theorem 4.5. Assume that f is a C¹-function and I(f) < ∞. Then for any n, f_n is a continuous function. Moreover we have

  lim_{n→∞} f_n(x) = G(x) for all x,  (4.15)
  lim_{n→∞} H(f_n) = H(G),  (4.16)
  lim_{n→∞} ∫_{-∞}^{∞} |f_n(x) - G(x)| dx = 0.  (4.17)

Sketch of proof: By the Shannon-Stam inequality, we can prove (4.16). This and the Csiszár-Kullback-Pinsker inequality imply (4.17). On the other hand, the Blachman-Stam inequality implies I(f_n) ≤ I(f). This and (4.17) imply (4.15). See [2], [7] and the references therein for the details.
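To close this section, Stam's inequality (4.12) and the entropy power functional N(f) of Remark 4.4 can be checked numerically on a concrete density (again, the bimodal mixture below is only an assumed example, and the grid parameters are ours):

    import math

    def gaussian(x, mean=0.0, sd=1.0):
        return math.exp(-(x - mean) ** 2 / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

    # Assumed test density (mean 0, variance 1): equal mixture of N(-0.8, 0.36), N(+0.8, 0.36).
    f = lambda x: 0.5 * gaussian(x, -0.8, 0.6) + 0.5 * gaussian(x, 0.8, 0.6)

    lo, hi, n, eps = -12.0, 12.0, 100000, 1e-5
    dx = (hi - lo) / n
    xs = [lo + (i + 0.5) * dx for i in range(n)]

    H = -sum(f(x) * math.log(f(x)) for x in xs) * dx                  # entropy (4.4)
    I = sum((f(x + eps) - f(x - eps)) ** 2 / (4 * eps * eps) / f(x)   # Fisher information (4.5)
            for x in xs) * dx

    entropy_power = math.exp(2 * H) / (2 * math.pi * math.e)          # Shannon's entropy power N(f)
    print(f"H(f) = {H:.4f}, I(f) = {I:.4f}")
    print(f"Stam (4.12): N(f) = {entropy_power:.4f} >= 1/I(f) = {1 / I:.4f}")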
5 Boltzmann's H-theorem, Markov chains and entropy

We recall the kinetic theory of rarefied gases. Let (v_x^i(t), v_y^i(t), v_z^i(t)) be the velocity of the i-th molecule at time t (1 ≤ i ≤ N), where N denotes the number of molecules. The velocities v^i(t) = (v_x^i(t), v_y^i(t), v_z^i(t)) obey Newton's equation of motion, but N is very big and it is almost meaningless to follow the behavior of each v^i(t). Boltzmann considered the probability distribution of the velocity, say f_t(v_x, v_y, v_z) dv_x dv_y dv_z, and derived the following H-theorem:

Theorem 5.1 (Boltzmann). Let

  H(t) = ∫ f_t(v_x, v_y, v_z) log f_t(v_x, v_y, v_z) dv_x dv_y dv_z.

Then

  (d/dt) H(t) ≤ 0.

Remark 5.2. In statistical mechanics, the entropy is nothing but -kH(t), where k is Boltzmann's constant (= 1.38 × 10⁻²³ J/K). Therefore, the above theorem implies that the entropy increases.

Some people raised questions about the H-theorem.
(i) Newton's equations of motion for the particles x(t) = (x_i(t))_{i=1}^N moving in a potential U read as follows:

  m_i ẍ_i(t) = -∂U/∂x_i(x(t)),  (5.1)
  x_i(0) = x_{i,0},  (5.2)
  ẋ_i(0) = v_i,  (5.3)

where x_{i,0}, v_i are the initial positions and the initial velocities, and m_i is the mass of the i-th particle. The time-reversed dynamics x̃(t) is the solution of (5.1) with initial velocity -v_i. Clearly, it is impossible that the entropies of both x(t) and x̃(t) increase. This is a contradiction.
(ii) The second question is based on Poincaré's recurrence theorem: there exists a time t_0 such that x(t_0) is close to x(0). Then at that time t_0 the entropy should also be close to its initial value. Therefore, the monotone property of the entropy cannot hold.
The reason why the above paradoxes appear lies in the statistical treatment of rarefied gases. We refer to [4] for recent progress based on probabilistic models. (Note: Poincaré's recurrence theorem holds under some additional assumptions on U.)

From now on, we consider stochastic dynamics called Markov chains, and prove that the entropy of the dynamics increases as time goes on.
Example 5.3. There are data on the weather in a certain city, in the form of transition probabilities:

    Today          Tomorrow: fine day     Tomorrow: rainy day
    Fine day       probability ...        probability ...
    Rainy day      probability ...        probability ...

If today (= the 0-th day) is fine, how much is the probability that the n-th day is also fine?

Solution: Let p_k be the probability that the k-th day is fine and set q_k = 1 - p_k, that is, the probability that the k-th day is rainy. Then (p_k, q_k) satisfies the recurrence relation

  (p_k, q_k) = (p_{k-1}, q_{k-1}) P,  (p_0, q_0) = (1, 0),

where P is the 2 × 2 matrix formed by the transition probabilities in the table. So we obtain

  (p_k, q_k) = (1, 0) P^k.

Definition 5.4. Let E = {S_1, ..., S_N} be a finite set. E is called a state space. We consider a random motion of a particle on E. Let {p_{ij}}_{i,j∈E} be nonnegative numbers satisfying

  ∑_{j=1}^N p_{ij} = 1 for all 1 ≤ i ≤ N.  (5.5)

p_{ij} denotes the probability that the particle moves from S_i to S_j. If the probability that the particle is located at S_i (1 ≤ i ≤ N) at time n is π_i, then the probability that the particle is located at S_j at time n + 1 is ∑_{i=1}^N π_i p_{ij}. That is, if the initial distribution of the particle is π(0) = (π_1, ..., π_N), then the distribution π(n) = (π_1(n), ..., π_N(n)) at time n is given by

  π(n) = π(0) P^n,  (5.6)

where P denotes the N × N matrix whose (i, j)-element is p_{ij}. Below, we denote by p^{(n)}_{ij} the (i, j)-element of P^n.
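The computation π(n) = π(0)P^n of (5.6) is straightforward to carry out; in the sketch below the transition probabilities (fine → fine = 0.8, rainy → fine = 0.4) and the helper names are assumed sample choices made only for this illustration, since Example 5.3 can be stated with any such table:

    def mat_mul(A, B):
        """Product of two matrices given as lists of rows."""
        return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
                for i in range(len(A))]

    def distribution_at(pi0, P, n):
        """pi(n) = pi(0) P^n, computed by repeated right multiplication as in (5.6)."""
        pi = [list(pi0)]                        # a 1 x N row vector
        for _ in range(n):
            pi = mat_mul(pi, P)
        return pi[0]

    # Assumed transition matrix: rows = today (fine, rainy), columns = tomorrow (fine, rainy).
    P = [[0.8, 0.2],
         [0.4, 0.6]]
    pi0 = [1.0, 0.0]                            # day 0 is fine
    for n in (1, 2, 5, 20):
        p_fine, p_rainy = distribution_at(pi0, P, n)
        print(f"day {n:2d}: P(fine) = {p_fine:.4f}, P(rainy) = {p_rainy:.4f}")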
We prove the following.

Theorem 5.5. Assume that
(A1) ∑_{i=1}^N p_{ij} = 1 for all 1 ≤ j ≤ N;
(A2) (mixing property of the Markov chain) there exists n_0 ∈ ℕ such that p^{(n_0)}_{ij} > 0 for any i, j ∈ {1, ..., N}.
Then for any initial distribution π = (π_1, ..., π_N), we have

  lim_{n→∞} π_i(n) = 1/N for all 1 ≤ i ≤ N,  (5.7)

where π(n) = (π_i(n))_{1≤i≤N} = π P^n.

Remark 5.6. (A1) is equivalent to

  (1/N, ..., 1/N) = (1/N, ..., 1/N) P,

that is, the uniform distribution is invariant under P. Also, (A1) holds if p_{ij} = p_{ji} for all i, j ∈ E.

This theorem can be proved by using the entropy of π(n) and Lemma 3.2. Recall

  H(π) = -∑_{i=1}^N π_i log π_i.

Lemma 5.7. Let Q = (q_{ij}) be an (N, N)-probability matrix, that is, q_{ij} ≥ 0 for all i, j and ∑_{j=1}^N q_{ij} = 1 for all i. Assume ∑_{i=1}^N q_{ij} = 1 for any j. Then for any probability distribution π, H(πQ) ≥ H(π). In addition, if q_{ij} > 0 for all i, j, then H(πQ) > H(π) for all π ≠ (1/N, ..., 1/N).

Proof. We have

  H(πQ) = -∑_{i=1}^N (πQ)_i log (πQ)_i = -∑_{i=1}^N ( ∑_{k=1}^N π_k q_{ki} ) log( ∑_{k=1}^N π_k q_{ki} ).  (5.8)

Since ∑_{k=1}^N q_{ki} = 1, by applying Lemma 3.2 (with g(x) = x log x) to the case where m_k = q_{ki}, x_k = π_k,

  ( ∑_{k=1}^N π_k q_{ki} ) log( ∑_{k=1}^N π_k q_{ki} ) ≤ ∑_{k=1}^N q_{ki} π_k log π_k.  (5.9)

(5.8), (5.9) and ∑_{i=1}^N q_{ki} = 1 imply H(πQ) ≥ H(π). The last assertion follows from the last assertion of Lemma 3.2. □

By using this lemma, we prove Theorem 5.5.

Proof of Theorem 5.5. Since π(n) moves in a bounded subset of ℝ^N, accumulation points exist. That is, there exist x = (x_1, ..., x_N) and a subsequence {π(n_k)}_{k=1}^∞ such that lim_{k→∞} π(n_k) = x. x is also a probability distribution and satisfies H(x) = lim_{k→∞} H(π(n_k)). Note that P^{n_0} is a probability matrix and all elements of P^{n_0} are positive. So if x ≠ (1/N, ..., 1/N), then H(xP^{n_0}) > H(x). But xP^{n_0} = lim_{k→∞} π P^{n_k + n_0} and H(xP^{n_0}) = lim_{k→∞} H(π P^{n_k + n_0}). Since H(π(n)) is a non-decreasing (hence convergent) sequence, both limits equal lim_{n→∞} H(π(n)), so H(xP^{n_0}) = H(x). This is a contradiction. So x = (1/N, ..., 1/N), which completes the proof. □
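Lemma 5.7 and Theorem 5.5 can be observed numerically. In the sketch below the symmetric (hence, by Remark 5.6, doubly stochastic) 3 × 3 matrix is an assumed example with all entries positive, so (A1) and (A2) hold; one sees H(π(n)) increase and π(n) approach the uniform distribution:

    import math

    def entropy(p):
        """H(pi) = -sum_i pi_i log pi_i, with 0 log 0 = 0."""
        return -sum(x * math.log(x) for x in p if x > 0)

    def step(pi, P):
        """One step pi -> pi P of the Markov chain."""
        N = len(pi)
        return [sum(pi[k] * P[k][j] for k in range(N)) for j in range(N)]

    # Assumed symmetric (doubly stochastic) transition matrix on 3 states, all entries > 0.
    P = [[0.5, 0.3, 0.2],
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]

    pi = [1.0, 0.0, 0.0]                   # start concentrated at state S_1
    for n in range(8):
        print(f"n={n}: pi(n) = {[round(x, 4) for x in pi]},  H(pi(n)) = {entropy(pi):.4f}")
        pi = step(pi, P)
    print(f"log 3 = {math.log(3):.4f}  (entropy of the uniform distribution)")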
References

[1] S. Aida, Semi-classical limit of the lowest eigenvalue of a Schrödinger operator on a Wiener space. II. P(φ)₂-model on a finite volume. J. Funct. Anal., no. 10.
[2] A. Barron, Entropy and the central limit theorem. Ann. Probab. 14 (1986), no. 1, 336-342.
[3] E. Carlen, Superadditivity of Fisher's information and logarithmic Sobolev inequalities, J. Funct. Anal. 101 (1991), 194-211.
[4] (In Japanese.)
[5] L. Gross, Logarithmic Sobolev inequalities. Amer. J. Math. 97 (1975), no. 4, 1061-1083.
[6] A. I. Khinchin, Mathematical Foundations of Information Theory, Dover Books on Advanced Mathematics, 1957.
[7] P. L. Lions and G. Toscani, A strengthened central limit theorem for smooth densities, J. Funct. Anal. 129 (1995).
[8] C. E. Shannon, The mathematical theory of communication, Bell Syst. Techn. Journ. 27 (1948), 379-423.
[9] A. J. Stam, Some inequalities satisfied by the quantities of information of Fisher and Shannon, Information and Control 2 (1959), 101-112.
[10] C. Villani, Topics in Optimal Transportation. Graduate Studies in Mathematics, 58. American Mathematical Society, Providence, RI, 2003.
[11] D. Williams, Probability with Martingales, Cambridge Mathematical Textbooks, 1991.
[12] K. Yoshida, Markoff process with a stable distribution, Proc. Imp. Acad. Tokyo 16 (1940).
[13] (In Japanese.)
[14] (In Japanese.)
[15] (In Japanese.)