Phenomena in High Dimensions in Geometric Analysis, Random Matrices, and Computational Geometry, Roscoff, France, June 25-29, 2012

BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

Sergey G. Bobkov, University of Minnesota, Minneapolis, USA

Joint work with Gennadiy P. Chistyakov and Friedrich Götze, Bielefeld University, Bielefeld, Germany
1 Fisher's quantity of information

X a random variable with values in $\mathbb{R}$.

Definition. If X has an absolutely continuous density p, its Fisher information is defined by
$$I(X) = I(p) = \int_{-\infty}^{+\infty} \frac{p'(x)^2}{p(x)}\,dx,$$
where $p'$ is a Radon-Nikodym derivative of p. In all other cases, $I(X) = +\infty$. Equivalently,
$$I(X) = \mathbb{E}\,\Big(\frac{p'(X)}{p(X)}\Big)^2.$$

Remarks.
1) $P\{p(X) > 0\} = 1$, so the definition makes sense. Integration is over $\{x : p(x) > 0\}$.
2) Assume $I(X) < +\infty$. Then $p(x) = 0 \Rightarrow p'(x) = 0$.
3) Translation invariance and homogeneity: $I(a + bX) = b^{-2}\, I(X)$ ($a \in \mathbb{R}$, $b \neq 0$).
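As a quick numerical illustration of the definition, here is a minimal Python sketch (the grid and the Gaussian test densities are illustrative choices, not from the slides): it discretizes $I(p) = \int p'^2/p$ and recovers $I(X) = 1/\sigma^2$ for $X \sim N(0,\sigma^2)$, making the homogeneity property visible.

```python
# Minimal sketch: numerical Fisher information on a grid; for X ~ N(0, sigma^2)
# the exact value is I(X) = 1/sigma^2.
import numpy as np

def fisher_information(p, x):
    """Approximate I(p) = integral of p'(x)^2 / p(x) over {p > 0}."""
    dp = np.gradient(p, x[1] - x[0])   # finite-difference derivative p'
    m = p > 1e-300                     # integrate only where p(x) > 0
    return np.sum(dp[m]**2 / p[m]) * (x[1] - x[0])

x = np.linspace(-12, 12, 200_001)
for sigma in (0.5, 1.0, 2.0):
    p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    print(sigma, fisher_information(p, x), 1 / sigma**2)
# Doubling sigma (i.e., b = 2) divides I by 4, as I(a + bX) = b^{-2} I(X).
```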
2 When the Fisher information appears naturally

Statistics: estimation of the shift parameter in $p(x - \theta)$.

Probability: shifts of product measures, distinguishing a sequence of i.i.d. random variables from a translate of itself. Let $\mu$ be a probability measure on $\mathbb{R}$, $\mu_\theta(A) = \mu(A + \theta)$, $\theta \in \mathbb{R}$, $A \subset \mathbb{R}$.

Theorem (Feldman 1961, Shepp 1965).
$$\big(\forall\,\theta \in \ell^2:\ \mu_\theta \ll \mu\big) \iff I(\mu) < +\infty \ \text{and}\ \frac{d\mu(x)}{dx} > 0 \ \text{a.e.}$$

Information Theory: de Bruijn's identity. Differential entropy:
$$h(X) = -\int_{-\infty}^{+\infty} p(x) \log p(x)\,dx.$$

Theorem. If a random variable X has finite variance, then for all $\tau > 0$,
$$\frac{d}{d\tau}\, h\big(X + \sqrt{\tau}\,Z\big) = \frac{1}{2}\, I\big(X + \sqrt{\tau}\,Z\big),$$
where $Z \sim N(0,1)$ is independent of X.
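De Bruijn's identity can be checked numerically; the sketch below (the two-component Gaussian mixture is an illustrative choice) compares a finite-difference derivative of $h(X + \sqrt{\tau} Z)$ with $\frac{1}{2} I(X + \sqrt{\tau} Z)$, using the fact that adding an independent Gaussian to a Gaussian mixture gives another explicit mixture density.

```python
# Sketch: numerical check of de Bruijn's identity for X ~ (1/2)N(-1,1) + (1/2)N(1,1).
import numpy as np

x = np.linspace(-20, 20, 400_001)
dx = x[1] - x[0]

def density(tau):
    # density of X + sqrt(tau) Z: each mixture component gains variance tau
    v = 1.0 + tau
    g = lambda m: np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return 0.5 * g(-1.0) + 0.5 * g(1.0)

def entropy(p):
    m = p > 1e-300
    return -np.sum(p[m] * np.log(p[m])) * dx

def fisher(p):
    m = p > 1e-300
    return np.sum(np.gradient(p, dx)[m]**2 / p[m]) * dx

tau, eps = 1.0, 1e-4
lhs = (entropy(density(tau + eps)) - entropy(density(tau - eps))) / (2 * eps)
rhs = 0.5 * fisher(density(tau))
print(lhs, rhs)   # the two values agree to several digits
```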
3 Distances to normality

X a r.v. with density $p(x)$, and $a = \mathbb{E}X$, $\sigma^2 = \mathrm{Var}(X) < +\infty$; $Z \sim N(a, \sigma^2)$ with density $q(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-a)^2/2\sigma^2}$.

Relative entropy of X with respect to Z (informational divergence, Kullback-Leibler distance):
$$D(X) = D(X\|Z) = h(Z) - h(X) = \int p \log \frac{p}{q}\,dx.$$

Relative Fisher information of X with respect to Z:
$$I(X\|Z) = I(X) - I(Z) = \int_{-\infty}^{+\infty} \Big(\frac{p'}{p} - \frac{q'}{q}\Big)^2 p\,dx.$$

Properties: $I(X\|Z) \geq 0$, $D(X) \geq 0$, and $D(a + bX) = D(X)$; the same holds for the standardized Fisher information $\sigma^2 I(X\|Z) = \sigma^2 I(X) - 1$. Moreover,
$$D(X) = 0 \iff I(X\|Z) = 0 \iff X \ \text{is normal}.$$
4 Relations between distances

Csiszár-Kullback-Pinsker inequality for total variation (1967): for all random variables X and Z,
$$\frac{1}{2}\,\|P_X - P_Z\|_{\mathrm{TV}}^2 \leq D(X\|Z).$$

Stam's inequality (1959) $\iff$ logarithmic Sobolev inequality: if $Z \sim N(0,1)$,
$$D(X\|Z) \leq \frac{1}{2}\, I(X\|Z).$$

Sharpening (still equivalent): if $Z \sim N(a, \sigma^2)$, $\mathbb{E}X = \mathbb{E}Z = a$, $\mathrm{Var}(X) = \mathrm{Var}(Z) = \sigma^2$, then
$$D(X) \leq \frac{1}{2} \log\big[1 + \sigma^2 I(X\|Z)\big] = \frac{1}{2} \log\big[\sigma^2 I(X)\big].$$

Let $\mathbb{E}X = 0$, $\mathrm{Var}(X) = 1$, $X \sim p$, $Z \sim N(0,1)$:
$$\|P_X - P_Z\|_{\mathrm{TV}} = \int_{-\infty}^{+\infty} |p(x) - \varphi(x)|\,dx \leq \sqrt{2\,D(X)} \leq \sqrt{I(X\|Z)}.$$

Shimizu (1975): $\sup_x |p(x) - \varphi(x)| \leq C \sqrt{I(X\|Z)}$.

Sharpening: one can show that
$$\|p - \varphi\|_{\mathrm{TV}} = \int_{-\infty}^{+\infty} |p'(x) - \varphi'(x)|\,dx \leq C \sqrt{I(X\|Z)}.$$
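The chain of inequalities above can be tested numerically; the sketch below (with an illustrative standardized Gaussian-mixture X, not from the slides) evaluates total variation, relative entropy, and relative Fisher information on a grid and prints the chain $\frac12\|P_X - P_Z\|_{\mathrm{TV}}^2 \le D(X) \le \frac12\log(1 + I(X\|Z)) \le \frac12 I(X\|Z)$.

```python
# Sketch: the distance inequalities for a standardized Gaussian mixture X,
# EX = 0, Var(X) = 1 (mu^2 + v = 1), against Z ~ N(0,1).
import numpy as np

x = np.linspace(-15, 15, 300_001)
dx = x[1] - x[0]
mu = 0.6
v = 1 - mu**2
g = lambda m, s2: np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
p = 0.5 * g(-mu, v) + 0.5 * g(mu, v)      # density of X
phi = g(0.0, 1.0)                         # density of Z

m = p > 1e-300
tv = np.sum(np.abs(p - phi)) * dx                        # ||P_X - P_Z||_TV
D = np.sum(p[m] * np.log(p[m] / phi[m])) * dx            # D(X)
score_diff = np.gradient(p, dx)[m] / p[m] + x[m]         # p'/p - phi'/phi
I_rel = np.sum(score_diff**2 * p[m]) * dx                # I(X||Z)

print(0.5 * tv**2, "<=", D, "<=", 0.5 * np.log1p(I_rel), "<=", 0.5 * I_rel)
```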
5 Central limit theorem

$(X_n)_{n \geq 1}$ independent identically distributed random variables, $\mathbb{E}X_1 = 0$, $\mathrm{Var}(X_1) = 1$.

CLT: weakly in distribution,
$$Z_n = \frac{X_1 + \dots + X_n}{\sqrt{n}} \Rightarrow Z \sim N(0,1) \quad (n \to \infty).$$

Theorem (Barron-Johnson 2004). $I(Z_n\|Z) \to 0$ as $n \to \infty$ $\iff$ $I(Z_{n_0}\|Z) < +\infty$ for some $n_0$. Equivalently: $I(Z_{n_0}) < +\infty$ for some $n_0$.

Sufficient: $I(X_1) < +\infty$. Necessary: for all $n \geq n_1$, $Z_n$ have bounded densities $p_n$ and $\sup_x |p_n(x) - \varphi(x)| \to 0$ ($n \to \infty$).

Problems. 1. How can this property be determined in terms of $X_1$? (range of applicability) 2. What is the rate for $I(Z_n\|Z)$, and under what conditions?
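The convergence $I(Z_n\|Z) \to 0$ can be watched numerically. In the sketch below (an illustrative choice, not from the slides), $X_1$ is a standardized two-component Gaussian mixture, so $Z_n$ is an explicit $(n+1)$-component Gaussian mixture and the relative Fisher information can be evaluated on a grid.

```python
# Sketch: I(Z_n || Z) -> 0 for X_1 ~ (1/2)N(-mu, v) + (1/2)N(mu, v), mu^2 + v = 1.
import numpy as np
from math import comb

x = np.linspace(-15, 15, 300_001)
dx = x[1] - x[0]
mu = 0.6
v = 1 - mu**2

def density_Zn(n):
    # Z_n = (X_1 + ... + X_n)/sqrt(n): mixture over the n sign choices,
    # component means (2k - n) mu / sqrt(n), component variance v.
    p = np.zeros_like(x)
    for k in range(n + 1):
        m = (2 * k - n) * mu / np.sqrt(n)
        p += comb(n, k) * 2.0**(-n) \
             * np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return p

def I_rel(p):
    m = p > 1e-300
    return np.sum((np.gradient(p, dx)[m] / p[m] + x[m])**2 * p[m]) * dx

for n in (1, 2, 4, 8, 16, 32):
    print(n, I_rel(density_Zn(n)))   # decreases towards 0
```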
6 Uniform local limit theorem

Theorem (Gnedenko, 1950's). The following properties are equivalent:

a) For all sufficiently large n, $Z_n$ have (continuous) bounded densities $p_n$ satisfying
$$\sup_x |p_n(x) - \varphi(x)| \to 0 \quad (n \to \infty);$$

b) For some n, $Z_n$ has a (continuous) bounded density $p_n$;

c) The characteristic function $f(t) = \mathbb{E}\,e^{itX_1}$ of $X_1$ satisfies the smoothness condition
$$\int_{-\infty}^{+\infty} |f(t)|^\nu\,dt < +\infty, \quad \text{for some } \nu > 0.$$
7 CLT for Fisher information distance

$(X_n)_{n \geq 1}$ independent identically distributed random variables, $\mathbb{E}X_1 = 0$, $\mathrm{Var}(X_1) = 1$.

Theorem 1. The following assertions are equivalent:

a) For some n, $Z_n$ has finite Fisher information;
b) For some n, $Z_n$ has density of bounded total variation;
c) For some n, $Z_n$ has a continuously differentiable density $p_n$ such that $\int_{-\infty}^{+\infty} |p_n'(x)|\,dx < +\infty$;
d) For some $\varepsilon > 0$, the characteristic function $f(t) = \mathbb{E}\,e^{itX_1}$ satisfies $f(t) = O(t^{-\varepsilon})$, as $t \to +\infty$;
e) For some $\nu > 0$, $\int_{-\infty}^{+\infty} |f(t)|^\nu\, |t|\,dt < +\infty$.

In this and only in this case, $I(Z_n\|Z) \to 0$ ($n \to \infty$).
8 1/n bounds

Barron, Johnson (2004); Artstein, Ball, Barthe, Naor (2004).

Theorem. Assume that $\mathbb{E}X_1 = 0$, $\mathrm{Var}(X_1) = 1$, and that $X_1$ satisfies a Poincaré-type inequality
$$\lambda\, \mathrm{Var}(u(X_1)) \leq \mathbb{E}\, u'(X_1)^2 \quad (0 < \lambda \leq 1).$$
Then
$$I(Z_n\|Z) \leq \frac{1}{1 + \frac{\lambda}{2}\,(n-1)}\; I(X_1\|Z).$$
Thus, $I(Z_n\|Z) = O(1/n)$.

Extension to weighted sums $Z_n = a_1 X_1 + \dots + a_n X_n$ ($a_1^2 + \dots + a_n^2 = 1$), A-B-B-N (2004):
$$I(Z_n\|Z) \leq \frac{L_4}{\frac{\lambda}{2} + \big(1 - \frac{\lambda}{2}\big) L_4}\; I(X_1\|Z), \quad \text{where } L_4 = a_1^4 + \dots + a_n^4.$$
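As a consistency check of the two bounds (a sketch that only verifies the algebra, with the constants as displayed above): for equal weights $a_i = 1/\sqrt{n}$ one has $L_4 = 1/n$, and the weighted bound should reduce to the i.i.d. bound.

```python
# Sketch: for a_i = 1/sqrt(n), L_4 = n * (1/sqrt(n))^4 = 1/n, and the weighted
# A-B-B-N bound reduces to the i.i.d. bound 1/(1 + (lambda/2)(n-1)).
import sympy as sp

n, lam = sp.symbols('n lambda', positive=True)
L4 = 1 / n
weighted = L4 / (lam / 2 + (1 - lam / 2) * L4)
iid = 1 / (1 + lam * (n - 1) / 2)
print(sp.simplify(weighted - iid))   # prints 0
```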
9 Rate of convergence under moment conditions

$(X_n)_{n \geq 1}$ independent identically distributed random variables. Let $\mathbb{E}X_1 = 0$, $\mathrm{Var}(X_1) = 1$, and $I(Z_{n_0}) < +\infty$ for some $n_0$.

Theorem 2. If $\mathbb{E}|X_1|^s < +\infty$ for some $s > 2$, then
$$I(Z_n\|Z) = \sum_{j=1}^{[(s-2)/2]} \frac{c_j}{n^j} + o\Big(n^{-(s-2)/2}\, (\log n)^{-(s-3)/2}\Big),$$
where each $c_j$ is a certain polynomial in the cumulants $\gamma_3, \dots, \gamma_{2j+1}$ of $X_1$, or in the moments $\mathbb{E}X_1^3, \dots, \mathbb{E}X_1^{2j+1}$.

$s = 4$: $\mathbb{E}X_1^4 < +\infty$ implies
$$I(Z_n\|Z) = \frac{c_1}{n} + o\big(n^{-1} (\log n)^{-1/2}\big), \quad c_1 = \frac{1}{2!}\,\gamma_3^2 = \frac{1}{2}\,\big(\mathbb{E}X_1^3\big)^2.$$

$s = 6$: $\mathbb{E}X_1^6 < +\infty$, $\mathbb{E}X_1^3 = 0$ imply
$$I(Z_n\|Z) = \frac{c_2}{n^2} + o\big(n^{-2} (\log n)^{-3/2}\big), \quad c_2 = \frac{1}{3!}\,\gamma_4^2 = \frac{1}{6}\,\big(\mathbb{E}X_1^4 - 3\big)^2.$$
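For a concrete law, the coefficients can be computed from the cumulant generating function; the sketch below does this for the standardized exponential $X_1 = E - 1$, $E \sim \mathrm{Exp}(1)$ (an illustrative example, not from the slides).

```python
# Sketch: cumulants and the coefficient c_1 for X_1 = E - 1, E ~ Exp(1).
import sympy as sp

t = sp.symbols('t')
K = sp.log(sp.exp(-t) / (1 - t))          # log E e^{t X_1} = -t - log(1 - t)
gamma3 = sp.diff(K, t, 3).subs(t, 0)      # third cumulant: 2
gamma4 = sp.diff(K, t, 4).subs(t, 0)      # fourth cumulant: 6
c1 = gamma3**2 / 2                        # leading coefficient for s = 4: 2
print(gamma3, gamma4, c1)
# The c_2 = gamma_4^2 / 6 formula would apply only when E X_1^3 = 0,
# which fails here since gamma_3 = 2.
```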
10 Case 2 < s < 4. Lower bounds

In case $\mathbb{E}|X_1|^s < +\infty$ with $2 < s < 4$, Theorem 2 only yields
$$I(Z_n\|Z) = o\Big(n^{-(s-2)/2}\, (\log n)^{-(s-3)/2}\Big).$$
This is worse than the $1/n$ rate.

Let $\eta > s/2$, $2 < s < 4$.

Theorem 3. There exists a sequence $(X_n)_{n \geq 1}$ of i.i.d. random variables with a symmetric distribution, with $\mathbb{E}X_1^2 = 1$, $\mathbb{E}|X_1|^s < +\infty$, $I(X_1) < +\infty$, and such that, with some constant $c = c(\eta, s) > 0$,
$$I(Z_n\|Z) \geq \frac{c}{n^{(s-2)/2}\, (\log n)^{\eta}}, \quad n \geq n_1(X_1).$$

Remark. The distribution of $X_1$ may be a mixture of mean-zero normal laws.
11 When is Fisher information finite?

Question: What should one assume about X with density p to ensure that
$$I(X) = \int_{-\infty}^{+\infty} \frac{p'(x)^2}{p(x)}\,dx < +\infty\,?$$
And if so, how can one bound I(X) from above?

Stam's inequality: if $X_1$ and $X_2$ are independent, then
$$\frac{1}{I(X_1 + X_2)} \geq \frac{1}{I(X_1)} + \frac{1}{I(X_2)}.$$

Monotonicity: $I(X_1 + X_2) \leq I(X_1)$.

Example: $X_j \sim$ Uniform on intervals of length $a_j$:
$I(X_1) = +\infty$ (uniform distribution), $I(X_1 + X_2) = +\infty$ (triangular distribution), $I(X_1 + X_2 + X_3) < +\infty$ (like a beta density with $\alpha = \beta = 2$).
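A quick numerical check of Stam's inequality (a sketch with illustrative densities, not from the slides): for $X_1$ a Gaussian mixture and $X_2$ Gaussian, $X_1 + X_2$ is again an explicit mixture, and the reciprocals behave as claimed.

```python
# Sketch: 1/I(X1 + X2) >= 1/I(X1) + 1/I(X2) for a Gaussian mixture X1 and X2 ~ N(0,1).
import numpy as np

x = np.linspace(-20, 20, 400_001)
dx = x[1] - x[0]
g = lambda m, s2: np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def fisher(p):
    m = p > 1e-300
    return np.sum(np.gradient(p, dx)[m]**2 / p[m]) * dx

p1 = 0.5 * g(-1.0, 0.5) + 0.5 * g(1.0, 0.5)     # X1: mixture of N(-1,0.5), N(1,0.5)
p2 = g(0.0, 1.0)                                # X2 ~ N(0,1)
p12 = 0.5 * g(-1.0, 1.5) + 0.5 * g(1.0, 1.5)    # X1 + X2: variances add per component
print(1 / fisher(p12), ">=", 1 / fisher(p1) + 1 / fisher(p2))
```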
12 Necessary conditions

From the definition,
$$I(X) = \mathbb{E}\,\Big(\frac{p'(X)}{p(X)}\Big)^2 \geq \Big(\mathbb{E}\,\Big|\frac{p'(X)}{p(X)}\Big|\Big)^2 = \Big[\int_{-\infty}^{+\infty} |p'(x)|\,dx\Big]^2.$$
Hence, p is a function of bounded variation with
$$\|p\|_{\mathrm{TV}} \leq \sqrt{I(X)}.$$

In general, the characteristic function $f(t) = \mathbb{E}\,e^{itX}$ satisfies
$$|f(t)| \leq \frac{\|p\|_{\mathrm{TV}}}{|t|} \quad (t \neq 0).$$

Conclusion:
$$|f(t)| \leq \frac{\sqrt{I(X)}}{|t|} \quad (t \neq 0).$$
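For the uniform density the bound $|f(t)| \le \|p\|_{\mathrm{TV}}/|t|$ can be seen explicitly; a small sketch (the parameter $a$ is an illustrative choice):

```python
# Sketch: for X ~ Uniform[0, a], |f(t)| = |2 sin(at/2)/(at)| and ||p||_TV = 2/a,
# so |f(t)| <= (2/a)/|t| holds, with equality along the envelope |sin| = 1.
import numpy as np

a = 1.5
t = np.linspace(0.1, 50.0, 5000)
f_abs = np.abs(2 * np.sin(a * t / 2) / (a * t))
print(bool(np.all(f_abs <= (2 / a) / t + 1e-12)))   # True
```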
13 Convolution of densities of bounded variation

Let $S = X_1 + X_2 + X_3$ be the sum of three independent random variables with densities $p_1, p_2, p_3$ of bounded total variation.

Proposition. One has
$$2\,I(S) \leq \|p_1\|_{\mathrm{TV}}\,\|p_2\|_{\mathrm{TV}} + \|p_1\|_{\mathrm{TV}}\,\|p_3\|_{\mathrm{TV}} + \|p_2\|_{\mathrm{TV}}\,\|p_3\|_{\mathrm{TV}}.$$
In particular, if $p_1 = p_2 = p_3 = p$,
$$I(X_1 + X_2 + X_3) \leq \frac{3}{2}\, \|p\|_{\mathrm{TV}}^2.$$

Definition. $\|p\|_{\mathrm{TV}} = \sup \sum_{k=1}^n |p(x_k) - p(x_{k-1})|$, where the sup is over all $x_0 < x_1 < \dots < x_n$, and where we may assume that $p(x)$ lies between $p(x-)$ and $p(x+)$ for all x.

Particular case: if $X_j \sim$ Uniform on intervals of length $a_j$, then $\|p_j\|_{\mathrm{TV}} = 2/a_j$, so
$$I(X_1 + X_2 + X_3) \leq \frac{2}{a_1 a_2} + \frac{2}{a_1 a_3} + \frac{2}{a_2 a_3}.$$
However, $I(X_1 + X_2) = +\infty$.
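A numerical check of the Proposition in the uniform case (a sketch; $a_1 = a_2 = a_3 = 1$ is an illustrative choice): the Irwin-Hall density of $X_1 + X_2 + X_3$ is piecewise quadratic and explicit, and its Fisher information stays below the bound $\frac{3}{2}\|p\|_{\mathrm{TV}}^2 = 6$.

```python
# Sketch: Fisher information of the sum of three Uniform[0,1] variables
# (Irwin-Hall, n = 3) versus the bound (3/2) * ||p||_TV^2 = (3/2) * 4 = 6.
import numpy as np

x = np.linspace(1e-6, 3 - 1e-6, 3_000_001)
p = np.where(x < 1, x**2 / 2,
    np.where(x < 2, (-2 * x**2 + 6 * x - 3) / 2, (3 - x)**2 / 2))
dp = np.where(x < 1, x, np.where(x < 2, 3 - 2 * x, x - 3))   # exact p'
I = np.sum(dp**2 / p) * (x[1] - x[0])
print(I, "<= 6")   # approximately 4.56
```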
14 Proof of the Proposition

Let $\mathcal{P}$ denote the collection of all densities of bounded variation, and let $\mathcal{U}$ denote the collection of all uniform densities $q(x) = \frac{1}{b-a}$, for $a < x < b$. Note that $\|q\|_{\mathrm{TV}} = \frac{2}{b-a}$.

The Proposition follows from the case of uniform densities and the following:

Lemma. Any density $p \in \mathcal{P}$ can be represented as a convex mixture of uniform densities,
$$p(x) = \int_{\mathcal{U}} q(x)\,d\pi(q) \quad \text{a.e.},$$
with the property that
$$\|p\|_{\mathrm{TV}} = \int_{\mathcal{U}} \|q\|_{\mathrm{TV}}\,d\pi(q).$$

Remark. The mixing probability measure $\pi$ on $\mathcal{U}$ seems to be unique, but no explicit construction is available. When p is piecewise constant, the lemma can be proved by induction on the number of supporting intervals.
15 Proof of Theorem 1

Let $S_n = X_1 + \dots + X_n$ with i.i.d. summands and characteristic function $f_n(t) = \mathbb{E}\,e^{itS_n} = f(t)^n$. If $I_n = I(S_n) < +\infty$, then, as noted,
$$|f(t)|^n = |f_n(t)| \leq \frac{\sqrt{I_n}}{|t|} \quad\Longrightarrow\quad f(t) = O\big(t^{-1/n}\big).$$

Now, assume that, for some (fixed) n,
$$\int_{-\infty}^{+\infty} |f(t)|^n\, |t|\,dt < +\infty.$$
Then $S_n$ has density
$$p_n(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-itx} f(t)^n\,dt,$$
which has a continuous derivative satisfying
$$(1 + x^2)\, p_n'(x) = \frac{i}{2\pi} \int_{-\infty}^{+\infty} e^{-itx} \big(t f_n''(t) + 2 f_n'(t) - t f_n(t)\big)\,dt.$$
Hence $|p_n'(x)| \leq \frac{C}{1+x^2}$ and $\|p_n\|_{\mathrm{TV}} < +\infty$. By the Proposition, $I_{3n} < +\infty$.
16 Towards Theorem 2

Let $(X_n)_{n \geq 1}$ be i.i.d., $\mathbb{E}X_1 = 0$, $\mathrm{Var}(X_1) = 1$, $Z_n = \frac{X_1 + \dots + X_n}{\sqrt{n}}$, $I(Z_n) < +\infty$ ($n \geq n_0$), with densities $p_n$, so that
$$I(Z_n\|Z) = \int_{-\infty}^{+\infty} \frac{\big(p_n'(x) + x p_n(x)\big)^2}{p_n(x)}\,dx = I_0 + I_1,$$
$$I_0 = \int_{-T_n}^{T_n} \frac{\big(p_n'(x) + x p_n(x)\big)^2}{p_n(x)}\,dx, \qquad I_1 = \int_{|x| \geq T_n} \dots$$

Good choice:
$$T_n^2 = (s-2) \log n + s \log\log n + \rho_n \quad (s > 2),$$
where $\rho_n \to +\infty$ sufficiently slowly to guarantee that $\sup_{|x| \leq T_n} \big|\frac{p_n(x)}{\varphi(x)} - 1\big| \to 0$.

Case $s = 4$: $T_n^2 = 2 \log n + 4 \log\log n + \rho_n$.
17 Edgeworth-type expansion for densities

Let $\mathbb{E}|X_1|^s < +\infty$ ($s \geq 3$ integer). For $|x| \leq T_n$, one may use a suitable approximation of $p_n$. Not enough:
$$(1 + |x|^s)\,\big(p_n(x) - \varphi(x)\big) = O\Big(\frac{1}{\sqrt{n}}\Big).$$

Edgeworth approximation of $p_n$:
$$\varphi_s(x) = \varphi(x) + \sum_{k=1}^{s-2} q_k(x)\, n^{-k/2},$$
with
$$q_k(x) = \varphi(x) \sum H_{k+2j}(x)\, \frac{1}{r_1! \cdots r_k!}\, \Big(\frac{\gamma_3}{3!}\Big)^{r_1} \cdots \Big(\frac{\gamma_{k+2}}{(k+2)!}\Big)^{r_k}.$$
Here the sum runs over all nonnegative integers $r_1, \dots, r_k$ with $r_1 + 2r_2 + \dots + k r_k = k$, $j = r_1 + \dots + r_k$, and
$$\gamma_r = i^{-r}\, \frac{d^r}{dt^r} \log \mathbb{E}\,e^{itX_1}\Big|_{t=0} \quad (3 \leq r \leq s).$$

Lemma 1. Let $I(Z_{n_0}) < +\infty$ for some $n_0$. Fix $l = 0, 1, \dots$ Then, for all sufficiently large n and all x,
$$\big|p_n^{(l)}(x) - \varphi_s^{(l)}(x)\big| \leq \frac{\psi_{l,n}(x)}{1 + |x|^s}\, \frac{\varepsilon_n}{n^{(s-2)/2}},$$
where $\varepsilon_n \to 0$ as $n \to \infty$, and
$$\sup_x \psi_{l,n}(x) \leq 1, \qquad \int_{-\infty}^{+\infty} \psi_{l,n}(x)^2\,dx \leq 1.$$
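The sketch below illustrates the expansion for $s = 4$ on a case where the exact density is available: $Z_n$ built from standardized exponentials is a rescaled Gamma density, with $\gamma_3 = 2$, $\gamma_4 = 6$ (an illustrative example, not from the slides).

```python
# Sketch: Edgeworth approximation phi_4 versus the exact density of
# Z_n = (E_1 + ... + E_n - n)/sqrt(n), E_i ~ Exp(1) (a rescaled Gamma(n,1) law).
import numpy as np
from math import lgamma

n = 50
x = np.linspace(-4.0, 4.0, 2001)
g3, g4 = 2.0, 6.0                       # cumulants of the standardized Exp(1)

t = n + np.sqrt(n) * x                  # t > 0 on this grid since n = 50
p_exact = np.exp((n - 1) * np.log(t) - t - lgamma(n)) * np.sqrt(n)

phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
H3 = x**3 - 3 * x
H4 = x**4 - 6 * x**2 + 3
H6 = x**6 - 15 * x**4 + 45 * x**2 - 15  # Chebyshev-Hermite polynomials
q1 = phi * g3 * H3 / 6                  # k = 1 term
q2 = phi * (g4 * H4 / 24 + g3**2 * H6 / 72)   # k = 2 terms
phi4 = phi + q1 / np.sqrt(n) + q2 / n

print(np.max(np.abs(p_exact - phi)))    # error of the plain normal approximation
print(np.max(np.abs(p_exact - phi4)))   # markedly smaller with phi_4
```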
18 Moderate deviations

Second step:
$$I_1 = \int_{|x| \geq T_n} \frac{\big(p_n'(x) + x p_n(x)\big)^2}{p_n(x)}\,dx = o\Big(n^{-(s-2)/2}\, (\log n)^{-(s-3)/2}\Big).$$

We have $I_1 \leq 2 I_{1,1} + 2 I_{1,2}$, where
$$I_{1,1} = \int_{|x| \geq T_n} \frac{p_n'(x)^2}{p_n(x)}\,dx, \qquad I_{1,2} = \int_{|x| \geq T_n} x^2\, p_n(x)\,dx \quad \text{(easy)}.$$

Integration by parts:
$$I_{1,1}^+ = \int_{T_n}^{+\infty} \frac{p_n'(x)^2}{p_n(x)}\,dx = -p_n'(T_n) \log p_n(T_n) - \int_{T_n}^{+\infty} p_n''(x) \log p_n(x)\,dx.$$

Lemma 2. Assume p is representable as a convolution of three densities, each with Fisher information at most I. Then, for all x,
$$|p'(x)| \leq I^{3/4} \sqrt{p(x)}, \qquad |p''(x)| \leq I^{5/4} \sqrt{p(x)}.$$
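A quick numerical sanity check of Lemma 2 (a sketch; the standard normal, written as a convolution of three $N(0,\frac13)$ densities with Fisher information $I = 3$ each, is an illustrative choice, and the exponents $3/4$, $5/4$ are as displayed above):

```python
# Sketch: Lemma 2 for p = density of N(0,1) = N(0,1/3) * N(0,1/3) * N(0,1/3),
# each factor having Fisher information I = 3.
import numpy as np

x = np.linspace(-10, 10, 200_001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dp = -x * p                 # p'(x), exact
d2p = (x**2 - 1) * p        # p''(x), exact
I = 3.0
print(bool(np.all(np.abs(dp) <= I**0.75 * np.sqrt(p))))    # True
print(bool(np.all(np.abs(d2p) <= I**1.25 * np.sqrt(p))))   # True
```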