Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012


BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

Sergey G. Bobkov
University of Minnesota, Minneapolis, USA

Joint work with Gennadiy P. Chistyakov and Friedrich Götze
Bielefeld University, Bielefeld, Germany

1. Fisher's quantity of information

X a random variable with values in R.

Definition. If X has an absolutely continuous density p, its Fisher information is defined by

  I(X) = I(p) = ∫_{−∞}^{+∞} p'(x)²/p(x) dx,

where p' is a Radon-Nikodym derivative of p. In all other cases, I(X) = +∞.

Equivalently, I(X) = E (p'(X)/p(X))².

Remarks.
1) P{p(X) > 0} = 1, so the definition makes sense. Integration is over {x : p(x) > 0}.
2) Assume I(X) < +∞. Then p(x) = 0 ⇒ p'(x) = 0.
3) Translation invariance and homogeneity: I(a + bX) = b⁻² I(X) (a ∈ R, b ≠ 0).
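As a quick numerical sanity check of the definition and of the scaling property (a sketch added for illustration, not part of the slides): for the normal law N(0, σ²) one has I = 1/σ², which a direct discretization of ∫ p'²/p reproduces.

```python
import math

def fisher_information(p, dp, a, b, n=200_000):
    # I(X) = ∫ p'(x)²/p(x) dx, approximated by a midpoint Riemann sum on [a, b]
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        px = p(x)
        if px > 0:
            total += dp(x) ** 2 / px * h
    return total

sigma = 2.0
p  = lambda x: math.exp(-x * x / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
dp = lambda x: -x / sigma ** 2 * p(x)

I = fisher_information(p, dp, -40.0, 40.0)
print(I, 1 / sigma ** 2)  # both ≈ 0.25, since I(N(0, σ²)) = 1/σ²
```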

2. When the Fisher information appears naturally

Statistics: estimation of the shift parameter in p(x − θ).

Probability: shifts of product measures; distinguishing a sequence of i.i.d. random variables from a translate of itself.

μ a probability measure on R, μ_θ(A) = μ(A + θ), θ ∈ R, A ⊂ R.

Theorem (Feldman 1961, Shepp 1965).

  (∀θ ∈ ℓ²: μ_θ ≪ μ)  ⇔  I(μ) < +∞ and dμ(x)/dx > 0 a.e.

Information theory: de Bruijn's identity. Differential entropy

  h(X) = −∫_{−∞}^{+∞} p(x) log p(x) dx.

Theorem. If a random variable X has finite variance, then for all τ > 0,

  d/dτ h(X + √τ Z) = ½ I(X + √τ Z),

where Z ∼ N(0, 1) is independent of X.
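For Gaussian X de Bruijn's identity can be checked in closed form: if X ∼ N(0, 1), then X + √τ Z ∼ N(0, 1 + τ), with h = ½ log(2πe(1 + τ)) and I = 1/(1 + τ). A small numerical sketch (my illustration, not from the slides), differentiating in τ by a central difference:

```python
import math

def h_gauss(v):  # differential entropy of N(0, v)
    return 0.5 * math.log(2 * math.pi * math.e * v)

def I_gauss(v):  # Fisher information of N(0, v)
    return 1.0 / v

tau, eps = 0.7, 1e-5
# d/dτ h(X + √τ Z) for X ~ N(0, 1), using X + √τ Z ~ N(0, 1 + τ)
lhs = (h_gauss(1 + tau + eps) - h_gauss(1 + tau - eps)) / (2 * eps)
rhs = 0.5 * I_gauss(1 + tau)  # ½ I(X + √τ Z)
print(lhs, rhs)  # both ≈ 0.2941
```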

3. Distances to normality

X a r.v. with density p(x), and a = EX, σ² = Var(X) < +∞.
Z ∼ N(a, σ²) with density q(x) = (1/√(2πσ²)) e^{−(x−a)²/(2σ²)}.

Relative entropy of X with respect to Z (informational divergence, Kullback-Leibler distance):

  D(X) = D(X‖Z) = h(Z) − h(X) = ∫ p log(p/q) dx.

Relative Fisher information of X with respect to Z:

  I(X‖Z) = I(X) − I(Z) = ∫ (p'/p − q'/q)² p dx.

Properties:
  0 ≤ D(X) ≤ +∞, D(a + bX) = D(X).
  Same for the standardized Fisher information σ² I(X‖Z) = σ² I(X) − 1.
  D(X) = 0 ⇔ I(X‖Z) = 0 ⇔ X is normal.
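A numerical sketch of the identity D(X) = h(Z) − h(X) (my test case, chosen for convenience: X ∼ Uniform(0, 1), so h(X) = 0 and both sides reduce to ½ log(2πe·σ²) with σ² = 1/12):

```python
import math

a, var = 0.5, 1 / 12   # mean and variance of X ~ Uniform(0, 1)
q = lambda x: math.exp(-(x - a) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# D(X) = ∫ p log(p/q) dx with p ≡ 1 on (0, 1), by a midpoint sum
n = 100_000
D = sum(math.log(1.0 / q((k + 0.5) / n)) for k in range(n)) / n

# h(X) = 0 for Uniform(0, 1), so D(X) should equal h(Z) − h(X) = ½ log(2πe·var)
gap = 0.5 * math.log(2 * math.pi * math.e * var)
print(D, gap)  # both ≈ 0.1765
```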

4. Relations between distances

Csiszár-Kullback-Pinsker inequality for total variation (1967): for all random variables X and Z,

  ½ ‖P_X − P_Z‖²_TV ≤ D(X‖Z).

Stam's inequality (1959) ⇔ logarithmic Sobolev inequality: if Z ∼ N(0, 1),

  D(X‖Z) ≤ ½ I(X‖Z).

Sharpening (still equivalent): if Z ∼ N(a, σ²), EX = EZ = a, Var(X) = Var(Z) = σ², then

  D(X) ≤ ½ log[1 + σ² I(X‖Z)] = ½ log[σ² I(X)].

Let EX = 0, Var(X) = 1, X ∼ p, Z ∼ N(0, 1):

  ‖P_X − P_Z‖_TV = ∫_{−∞}^{+∞} |p(x) − φ(x)| dx ≤ √(2 I(X‖Z)).

Shimizu (1975):

  sup_x |p(x) − φ(x)| ≤ C √(I(X‖Z)).

Sharpening: one can show that

  ‖p − φ‖_TV = ∫_{−∞}^{+∞} |p'(x) − φ'(x)| dx ≤ C √(I(X‖Z)).
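The chain "Pinsker, then log-Sobolev" can be seen numerically. A sketch (my choice of test case, not from the slides): X ∼ N(0, v), Z ∼ N(0, 1), for which D(X‖Z) = ½(v − 1 − log v) and I(X‖Z) = (v − 1)²/v are available in closed form.

```python
import math

v = 2.0  # X ~ N(0, v), Z ~ N(0, 1)
p = lambda x: math.exp(-x * x / (2 * v)) / math.sqrt(2 * math.pi * v)
q = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# ‖P_X − P_Z‖_TV = ∫ |p − q| dx, by a midpoint sum
n, lo, hi = 200_000, -25.0, 25.0
h = (hi - lo) / n
tv = sum(abs(p(lo + (k + 0.5) * h) - q(lo + (k + 0.5) * h)) for k in range(n)) * h

D = 0.5 * (v - 1 - math.log(v))  # D(X‖Z) for centered normals
I = (v - 1) ** 2 / v             # I(X‖Z) = ∫ (p'/p − q'/q)² p dx
print(0.5 * tv ** 2, D, 0.5 * I)  # ½‖·‖²_TV ≤ D(X‖Z) ≤ ½ I(X‖Z)
```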

5. Central limit theorem

(X_n)_{n≥1} independent identically distributed random variables, EX_1 = 0, Var(X_1) = 1.

CLT: weakly in distribution,

  Z_n = (X_1 + ... + X_n)/√n ⇒ Z ∼ N(0, 1)  (n → ∞).

Theorem (Barron-Johnson 2004). I(Z_n‖Z) → 0 as n → ∞ ⇔ I(Z_{n_0}‖Z) < +∞ for some n_0. Equivalently: I(Z_{n_0}) < +∞ for some n_0.

Sufficient: I(X_1) < +∞.

Necessary: for all n ≥ n_1, Z_n have bounded densities p_n and sup_x |p_n(x) − φ(x)| → 0 (n → ∞).

Problems.
1. How to determine this property in terms of X_1? (range of applicability)
2. What is the rate for I(Z_n‖Z), and under what conditions?

6. Uniform local limit theorem

Theorem (Gnedenko, 1950's). The following properties are equivalent:

a) For all sufficiently large n, Z_n have (continuous) bounded densities p_n satisfying sup_x |p_n(x) − φ(x)| → 0 (n → ∞);

b) For some n, Z_n has a (continuous) bounded density p_n;

c) The characteristic function f_1(t) = E e^{itX_1} of X_1 satisfies a smoothness condition

  ∫_{−∞}^{+∞} |f_1(t)|^ν dt < +∞, for some ν > 0.
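To illustrate condition (c), a numerical sketch with X_1 ∼ Uniform(−1, 1) (my choice of example): here f_1(t) = sin(t)/t, and ∫ |f_1(t)|^ν dt is already finite for ν = 2, with the classical value ∫ (sin t / t)² dt = π.

```python
import math

# characteristic function of X ~ Uniform(−1, 1)
f = lambda t: math.sin(t) / t if t != 0 else 1.0

# ∫ |f(t)|² dt over a long window; the exact value of ∫ (sin t / t)² dt is π
n, lo, hi = 400_000, -4000.0, 4000.0
h = (hi - lo) / n
total = sum(f(lo + (k + 0.5) * h) ** 2 for k in range(n)) * h
print(total, math.pi)
```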

7. CLT for Fisher information distance

(X_n)_{n≥1} independent identically distributed random variables, EX_1 = 0, Var(X_1) = 1.

Theorem 1. The following assertions are equivalent:

a) For some n, Z_n has finite Fisher information;
b) For some n, Z_n has density of bounded total variation;
c) For some n, Z_n has a continuously differentiable density p_n such that ∫_{−∞}^{+∞} |p_n'(x)| dx < +∞;
d) For some ε > 0, the characteristic function f_1(t) = E e^{itX_1} satisfies |f_1(t)| = O(t^{−ε}), as t → +∞;
e) For some ν > 0, ∫_{−∞}^{+∞} |f_1(t)|^ν |t| dt < +∞.

In this and only in this case, I(Z_n‖Z) → 0 (n → ∞).

8. 1/n bounds

Barron, Johnson (2004); Artstein, Ball, Barthe, Naor (2004).

Theorem. Assume that EX_1 = 0, Var(X_1) = 1, and that X_1 satisfies a Poincaré-type inequality

  λ Var(u(X_1)) ≤ E u'(X_1)²  (0 < λ ≤ 1).

Then

  I(Z_n‖Z) ≤ 1/(1 + (λ/2)(n − 1)) · I(X_1‖Z).

Thus, I(Z_n‖Z) = O(1/n).

Extension to Z_n = a_1 X_1 + ... + a_n X_n (a_1² + ... + a_n² = 1), A-B-B-N (2004):

  I(Z_n‖Z) ≤ L_4/(λ/2 + (1 − λ/2) L_4) · I(X_1‖Z),

where L_4 = a_1⁴ + ... + a_n⁴.
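As a consistency check between the two bounds (pure arithmetic, a sketch): for equal coefficients a_i = 1/√n one has L_4 = 1/n, and the weighted A-B-B-N bound collapses to the i.i.d. bound 1/(1 + (λ/2)(n − 1)).

```python
# Equal weights a_i = 1/√n give L4 = Σ a_i⁴ = 1/n, and then
# L4/(λ/2 + (1 − λ/2)·L4) = 1/(1 + (λ/2)(n − 1))
lam, n = 0.3, 50
a = 1 / n ** 0.5
L4 = n * a ** 4
weighted = L4 / (lam / 2 + (1 - lam / 2) * L4)
iid = 1 / (1 + (lam / 2) * (n - 1))
print(weighted, iid)  # identical up to rounding
```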

9. Rate of convergence under moment conditions

(X_n)_{n≥1} independent identically distributed random variables. Let EX_1 = 0, Var(X_1) = 1, and I(Z_{n_0}) < +∞ for some n_0.

Theorem 2. If E|X_1|^s < +∞ for some s > 2, then

  I(Z_n‖Z) = Σ_{j=1}^{[(s−2)/2]} c_j/n^j + o(n^{−(s−2)/2} (log n)^{−(s−3)/2}),

where each c_j is a certain polynomial in the cumulants γ_3, ..., γ_{2j+1} of X_1, or in the moments EX_1³, ..., EX_1^{2j+1}.

s = 4: EX_1⁴ < +∞ ⇒

  I(Z_n‖Z) = c_1/n + o(n⁻¹ (log n)^{−1/2}),  c_1 = (1/2!) γ_3² = ½ (EX_1³)².

s = 6: EX_1⁶ < +∞, EX_1³ = 0 ⇒

  I(Z_n‖Z) = c_2/n² + o(n⁻² (log n)^{−3/2}),  c_2 = (1/3!) γ_4² = (1/6) (EX_1⁴ − 3)².
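For a concrete distribution (a sketch; the centered exponential X = E − 1, E ∼ Exp(1), is my choice of example: it has mean 0 and variance 1): γ_3 = EX³ = 2, so Theorem 2 with s = 4 predicts the leading term c_1/n with c_1 = ½·2² = 2.

```python
import math

# third moment EX³ of X = E − 1, E ~ Exp(1), by a midpoint sum over [0, 40]
n, hi = 200_000, 40.0
h = hi / n
m3 = 0.0
for k in range(n):
    x = (k + 0.5) * h
    m3 += (x - 1) ** 3 * math.exp(-x) * h

c1 = 0.5 * m3 ** 2  # c₁ = ½ (EX³)² from Theorem 2 with s = 4
print(m3, c1)       # ≈ 2 and ≈ 2
```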

10. Case 2 < s < 4. Lower bounds

In case E|X_1|^s < +∞ with 2 < s < 4, Theorem 2 only yields

  I(Z_n‖Z) = o(n^{−(s−2)/2} (log n)^{−(s−3)/2}).

This is worse than the 1/n rate.

Let η > s − 2, 2 < s < 4.

Theorem 3. There exists a sequence (X_n)_{n≥1} of i.i.d. random variables with symmetric distributions, with EX_1² = 1, E|X_1|^s < +∞, I(X_1) < +∞, and such that, with some constant c = c(η, s),

  I(Z_n‖Z) ≥ c n^{−(s−2)/2} (log n)^{−η},  n ≥ n_1(X_1).

Remark. The distribution of X_1 may be a mixture of mean zero normal laws.

11. When is Fisher information finite?

Question: What should one assume about X with density p to ensure that

  I(X) = ∫_{−∞}^{+∞} p'(x)²/p(x) dx < +∞?

And if so, how to bound I(X) from above?

Stam's inequality: if X_1 and X_2 are independent, then

  1/I(X_1 + X_2) ≥ 1/I(X_1) + 1/I(X_2).

Monotonicity: I(X_1 + X_2) ≤ I(X_1).

Example: X_j ∼ Uniform on intervals of length a_j.
  I(X_1) = +∞ (uniform distribution);
  I(X_1 + X_2) = +∞ (triangle distribution);
  I(X_1 + X_2 + X_3) < +∞ (like beta with α = β = 2).
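Stam's inequality can also be tested numerically. A sketch (my example, not from the slides): for X_1, X_2 i.i.d. standard Laplace, I(X_i) = 1, while the sum has the explicit density p(x) = ¼(1 + |x|)e^{−|x|}, so Stam predicts 1/I(X_1 + X_2) ≥ 1 + 1, i.e. I(X_1 + X_2) ≤ ½.

```python
import math

# density of X₁ + X₂ for i.i.d. Laplace summands (each ½e^{−|x|}, with I = 1)
p  = lambda x: 0.25 * (1 + abs(x)) * math.exp(-abs(x))
dp = lambda x: -0.25 * x * math.exp(-abs(x))  # p'(x) on both half-lines

# I(X₁ + X₂) = ∫ p'(x)²/p(x) dx by a midpoint sum
n, lo, hi = 200_000, -40.0, 40.0
h = (hi - lo) / n
I = 0.0
for k in range(n):
    x = lo + (k + 0.5) * h
    I += dp(x) ** 2 / p(x) * h

print(I)  # ≈ 0.298, consistent with Stam's bound I ≤ 1/2
```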

12. Necessary conditions

From the definition,

  I(X) = E (p'(X)/p(X))² ≥ (E |p'(X)/p(X)|)² = (∫_{−∞}^{+∞} |p'(x)| dx)².

Hence, p is a function of bounded variation with

  ‖p‖_TV ≤ √(I(X)).

In general, the characteristic function f(t) = E e^{itX} satisfies

  |f(t)| ≤ ‖p‖_TV/|t|  (t ∈ R).

Conclusion. |f(t)| ≤ √(I(X))/|t|  (t ∈ R).
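A sketch of the conclusion for X ∼ N(0, 1) (my test case): here f(t) = e^{−t²/2} and I(X) = 1, and indeed sup_t |t|·|f(t)| = e^{−1/2} ≈ 0.607 ≤ 1 = √(I(X)).

```python
import math

# check |f(t)| ≤ √I(X)/|t| for X ~ N(0, 1): f(t) = e^{−t²/2}, I(X) = 1
worst = max(t * math.exp(-t * t / 2) for t in (k / 1000 for k in range(1, 100_000)))
print(worst, math.exp(-0.5))  # sup |t|·|f(t)| is attained at t = 1
```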

13. Convolution of densities of bounded variation

Let S = X_1 + X_2 + X_3 be the sum of three independent random variables with densities p_1, p_2, p_3 having bounded total variation.

Proposition. One has

  2 I(S) ≤ ‖p_1‖_TV ‖p_2‖_TV + ‖p_1‖_TV ‖p_3‖_TV + ‖p_2‖_TV ‖p_3‖_TV.

In particular, if p_1 = p_2 = p_3 = p,

  I(X_1 + X_2 + X_3) ≤ (3/2) ‖p‖²_TV.

Definition. ‖p‖_TV = sup Σ_{k=1}^n |p(x_k) − p(x_{k−1})|, where the sup is over all x_0 < x_1 < ... < x_n, and where we may assume that p(x) is in between p(x−) and p(x+), for all x.

Particular case: if X_j ∼ Uniform on intervals of length a_j, then

  ½ I(X_1 + X_2 + X_3) ≤ 1/(a_1 a_2) + 1/(a_1 a_3) + 1/(a_2 a_3).

However, I(X_1 + X_2) = +∞.
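For three Uniform(0, 1) summands this bound can be tested directly (a numerical sketch; the closed form 2√3·log(2 + √3) for I(S) is my own side computation, not from the slides): ‖p‖_TV = 2, so the Proposition gives I(S) ≤ (3/2)·2² = 6, while direct integration gives I(S) ≈ 4.56.

```python
import math

def p(x):  # piecewise-quadratic density of S = X₁ + X₂ + X₃, X_i ~ Uniform(0, 1)
    if 0 < x <= 1: return x * x / 2
    if 1 < x <= 2: return (-2 * x * x + 6 * x - 3) / 2
    if 2 < x < 3:  return (3 - x) ** 2 / 2
    return 0.0

m, eps = 60_000, 1e-6
h, I = 3 / m, 0.0
for k in range(m):
    x = (k + 0.5) * h
    if p(x) > 1e-12:
        d = (p(x + eps) - p(x - eps)) / (2 * eps)  # p'(x) by central difference
        I += d * d / p(x) * h

bound = 1.5 * 2 ** 2  # (3/2)·‖p‖²_TV with ‖p‖_TV = 2
print(I, 2 * math.sqrt(3) * math.log(2 + math.sqrt(3)), bound)  # ≈ 4.562 ≤ 6
```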

14. Proof of the Proposition

Let P denote the collection of all densities of bounded variation. Let U denote the collection of all uniform densities q(x) = 1/(b − a), for a < x < b. Note that ‖q‖_TV = 2/(b − a).

The Proposition follows from the case of uniform densities and the following:

Lemma. Any density p ∈ P can be represented as a convex mixture of uniform densities

  p(x) = ∫_U q(x) dπ(q)  a.e.,

with the property that

  ‖p‖_TV = ∫_U ‖q‖_TV dπ(q).

Remark. The mixing probability measure π on U seems to be unique, but no explicit construction is available. When p is piecewise constant, the lemma can be proved by induction on the number of supporting intervals.

15. Proof of Theorem 1

Let S_n = X_1 + ... + X_n with i.i.d. summands and characteristic function f_n(t) = E e^{itS_n} = f_1(t)ⁿ.

If I_n = I(S_n) < +∞, then, as noted,

  |f_1(t)|ⁿ = |f_n(t)| ≤ √(I_n)/|t|  ⇒  |f_1(t)| = O(t^{−1/n}).

Now assume that, for some (fixed) n,

  ∫_{−∞}^{+∞} |f_1(t)|ⁿ |t| dt < +∞.

Then S_n has density

  p_n(x) = (1/2π) ∫_{−∞}^{+∞} e^{−itx} f_1(t)ⁿ dt,

which has a continuous derivative satisfying

  (1 + x²) p_n'(x) = (i/2π) ∫_{−∞}^{+∞} e^{−itx} (−t f_n(t) + 2 f_n'(t) + t f_n''(t)) dt.

Hence |p_n'(x)| ≤ C/(1 + x²), and ‖p_n‖_TV < +∞. By the Proposition, I_{3n} < +∞.

16. Towards Theorem 2

Let (X_n)_{n≥1} be i.i.d., EX_1 = 0, Var(X_1) = 1,

  Z_n = (X_1 + ... + X_n)/√n,  I(Z_n) < +∞ (n ≥ n_0),

with densities p_n, so that

  I(Z_n‖Z) = ∫_{−∞}^{+∞} (p_n'(x) + x p_n(x))²/p_n(x) dx = I_0 + I_1,

  I_0 = ∫_{−T_n}^{T_n} (p_n'(x) + x p_n(x))²/p_n(x) dx,  I_1 = ∫_{|x| ≥ T_n} ...

Good choice:

  T_n² = (s − 2) log n + s log log n + ρ_n  (s > 2),

where ρ_n → +∞ sufficiently slowly to guarantee that sup_{|x| ≤ T_n} |p_n(x) − φ(x)| → 0.

Case s = 4: T_n² = 2 log n + 4 log log n + ρ_n.

17. Edgeworth-type expansion for densities

Let E|X_1|^s < +∞ (s ≥ 3 integer). For |x| ≤ T_n, one may use a suitable approximation of p_n.

Not enough:

  (1 + |x|^s) (p_n(x) − φ(x)) = O(1/√n).

Edgeworth approximation of p_n:

  φ_s(x) = φ(x) + Σ_{k=1}^{s−2} q_k(x) n^{−k/2},

with

  q_k(x) = φ(x) Σ H_{k+2j}(x) (1/(r_1! ... r_k!)) (γ_3/3!)^{r_1} ... (γ_{k+2}/(k + 2)!)^{r_k}.

Here the sum runs over all non-negative integers with r_1 + 2r_2 + ... + k r_k = k, j = r_1 + ... + r_k, and

  γ_r = i^{−r} (d^r/dt^r) log E e^{itX_1} |_{t=0}  (3 ≤ r ≤ s).

Lemma 1. Let I(Z_{n_0}) < +∞, for some n_0. Fix l = 0, 1, ... Then, for all sufficiently large n and all x,

  |p_n^{(l)}(x) − φ_s^{(l)}(x)| ≤ ψ_{l,n}(x)/(1 + |x|^s) · ε_n n^{−(s−2)/2},

where ε_n → 0, as n → ∞, and

  sup_x ψ_{l,n}(x) ≤ 1,  ∫_{−∞}^{+∞} ψ_{l,n}(x)² dx ≤ 1.
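A numerical sketch of the first-order expansion (my test case: X = E − 1 with E ∼ Exp(1), so γ_3 = 2, and Z_n has an exact Gamma-based density): the corrected approximation φ(x)(1 + (γ_3/3!)H_3(x)/√n), H_3(x) = x³ − 3x, matches p_n to order 1/n.

```python
import math

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def p_n(x, n):  # exact density of Z_n for X_i = E_i − 1, E_i ~ Exp(1)
    y = n + math.sqrt(n) * x  # since X₁ + ... + X_n + n ~ Gamma(n, 1)
    if y <= 0:
        return 0.0
    return math.sqrt(n) * math.exp((n - 1) * math.log(y) - y - math.lgamma(n))

n, g3 = 100, 2.0  # γ₃ = EX³ = 2 for the centered exponential
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    edgeworth = phi(x) * (1 + (g3 / 6) * (x ** 3 - 3 * x) / math.sqrt(n))
    print(x, p_n(x, n), edgeworth)  # agree to O(1/n)
```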

18. Moderate deviations

Second step:

  I_1 = ∫_{|x| ≥ T_n} (p_n'(x) + x p_n(x))²/p_n(x) dx = o(n^{−(s−2)/2} (log n)^{−(s−3)/2}).

We have I_1 ≤ 2 I_{1,1} + 2 I_{1,2}, where

  I_{1,1} = ∫_{|x| ≥ T_n} p_n'(x)²/p_n(x) dx,  I_{1,2} = ∫_{|x| ≥ T_n} x² p_n(x) dx (easy).

Integration by parts:

  I_{1,1}⁺ = ∫_{T_n}^{+∞} p_n'(x)²/p_n(x) dx = −p_n'(T_n) log p_n(T_n) − ∫_{T_n}^{+∞} p_n''(x) log p_n(x) dx.

Lemma 2. Assume p is representable as a convolution of three densities with Fisher information ≤ I. Then, for all x,

  |p'(x)| ≤ I^{3/4} √(p(x)),  |p''(x)| ≤ I^{5/4} √(p(x)).