Concentration Inequalities

I. Moment generating functions, the Chernoff method, and sub-Gaussian and sub-exponential random variables

a. Goal for this section: given a random variable X, how does X concentrate around its mean? That is, assuming w.l.o.g. that E[X] = 0, how well can we bound P(X ≥ t)?

b. Chernoff bounds:
   1. First, recall Markov's inequality: if X ≥ 0, then P(X ≥ t) ≤ E[X]/t.
   2. Extension: use moment generating functions to get exponential tails (the Chernoff bound). For any λ > 0,
      P(X ≥ t) = P(e^{λX} ≥ e^{λt}) ≤ E[e^{λX}] e^{-λt} = ϕ_X(λ) e^{-λt},
      where ϕ_X(λ) := E[e^{λX}] is the moment generating function of X.
   3. In particular, we have P(X ≥ t) ≤ inf_{λ ≥ 0} exp(log ϕ_X(λ) - λt).

c. Sub-Gaussian and sub-exponential random variables
   1. A mean-zero random variable X is sub-Gaussian with parameter σ² if for all λ ∈ R, E[e^{λX}] ≤ exp(λ²σ²/2). If X ~ N(0, σ²), then this holds with equality.
   2. A mean-zero random variable X is sub-exponential with parameters (τ², b) if for all λ such that |λ| ≤ 1/b, E[e^{λX}] ≤ exp(λ²τ²/2). Any sub-Gaussian random variable is sub-exponential.
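As a quick numerical illustration of the Chernoff recipe in item b.3 (this sketch is mine, not part of the original notes, and assumes NumPy and SciPy are available; the helper name chernoff_bound is just for illustration), the code below minimizes log ϕ_X(λ) - λt over λ ≥ 0 for a standard normal, for which the optimized bound exp(-t²/2) is available in closed form, and compares it with the exact tail probability.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def chernoff_bound(log_mgf, t, lam_max=50.0):
    # Numerically evaluate inf over 0 <= lam <= lam_max of exp(log_mgf(lam) - lam * t).
    res = minimize_scalar(lambda lam: log_mgf(lam) - lam * t,
                          bounds=(0.0, lam_max), method="bounded")
    return np.exp(res.fun)

# Standard normal: log MGF is lam^2 / 2, so the optimized bound is exp(-t^2 / 2).
log_mgf_gaussian = lambda lam: 0.5 * lam ** 2

for t in [1.0, 2.0, 3.0]:
    bound = chernoff_bound(log_mgf_gaussian, t)
    print(f"t = {t}: Chernoff bound {bound:.4f}, "
          f"closed form {np.exp(-t ** 2 / 2):.4f}, exact tail {norm.sf(t):.4f}")

The point of the exercise is that the generic numerical optimization recovers the closed-form bound, while both remain (necessarily) looser than the exact Gaussian tail.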

d. Examples and consequences

   1. Example 1 (Bounded random variables): Suppose that X ∈ [a, b], where -∞ < a ≤ b < +∞, and E[X] = 0. Then Hoeffding's lemma states that
      E[e^{λX}] ≤ exp(λ²(b - a)²/8),
      so that X is (b - a)²/4-sub-Gaussian.

   2. Chernoff bounds extend naturally to sums of independent random variables. For example, Hoeffding's inequality is the following.

Proposition 2. Let X_1, ..., X_n be independent, mean-zero, σ_i²-sub-Gaussian random variables. Then
   P( Σ_{i=1}^n X_i ≥ t ) ≤ exp( -t² / (2 Σ_{i=1}^n σ_i²) ).

Proof. We simply apply the Chernoff technique repeatedly, then optimize over λ ≥ 0. Indeed, for any λ ≥ 0 we have
   P( Σ_{i=1}^n X_i ≥ t ) ≤ E[ exp( λ Σ_{i=1}^n X_i ) ] e^{-λt}
                          = E[ exp( λ Σ_{i=1}^{n-1} X_i ) ] E[ exp(λ X_n) ] e^{-λt}
                          ≤ E[ exp( λ Σ_{i=1}^{n-1} X_i ) ] exp(λ²σ_n²/2) e^{-λt}
                          ≤ exp( (λ²/2) Σ_{i=1}^n σ_i² - λt ),
where we have used independence and then induction. Taking derivatives of the terms inside the exponent with respect to λ, we see that λ = t / Σ_{i=1}^n σ_i² ≥ 0 minimizes the expression, whence we obtain the desired result.

As an immediate corollary of this proposition and Example 1, we obtain the usual Hoeffding bound: if X_i ∈ [a_i, b_i] and E[X_i] = 0, then
   P( Σ_{i=1}^n X_i ≥ t ) ≤ exp( -2t² / Σ_{i=1}^n (b_i - a_i)² ).
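To see how the Hoeffding bound compares with empirical tails, here is a small simulation (my own illustration, not from the notes): sums of n independent Rademacher (±1) variables are mean zero and supported on [-1, 1], so the corollary gives P(Σ X_i ≥ t) ≤ exp(-t²/(2n)).

import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 50_000

# Rademacher variables: mean zero, X_i in [-1, 1], so (b_i - a_i)^2 = 4 for each i.
sums = rng.choice([-1.0, 1.0], size=(reps, n)).sum(axis=1)

for t in [10.0, 20.0, 30.0]:
    empirical = np.mean(sums >= t)
    hoeffding = np.exp(-2 * t ** 2 / (4 * n))  # = exp(-t^2 / (2n))
    print(f"t = {t}: empirical tail {empirical:.5f}, Hoeffding bound {hoeffding:.5f}")

The bound holds at every t but is conservative by roughly an order of magnitude in this regime, which is typical for Hoeffding-type inequalities.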

II. Entropy and concentration

a. We would like to develop techniques that give control over more complicated functions than simply sums of the X_i; suppose we have Z = f(X_1, ..., X_n) and we would like to know whether Z is concentrated around its mean.

b. Let φ : R → R be a convex function. The φ-entropy of a random variable X is
   H_φ(X) := E[φ(X)] - φ(E[X]),    (1)
assuming the relevant expectations exist.
   1. Example: if φ(t) = t², then H_φ(X) = E[X²] - E[X]² = Var(X). Note that H_φ(X) ≥ 0 always by Jensen's inequality, and strictly so for non-constant X with strictly convex φ.

c. Idea: if X is concentrated around its mean, then H_φ(X) should be small as well, at least for nice φ.
   1. The entropy we focus on: use φ(t) = t log t, which gives us the entropy
      H(X) = E[X log X] - E[X] log E[X]
   as long as X ≥ 0.
   2. In particular, consider the transformation e^{λX}. Then assuming E[e^{λX}] < ∞, we study H(e^{λX}).

d. The Herbst argument (making rigorous the idea that H(X) being small should imply concentration of X).

Proposition 3. Let X be a random variable and assume that there exists a constant σ² < ∞ such that
   H(e^{λX}) ≤ (λ²σ²/2) ϕ_X(λ)    (2)
for all λ ∈ R (or λ ∈ R_+), where ϕ_X(λ) = E[e^{λX}] denotes the moment generating function of X. Then X - E[X] is σ²-sub-Gaussian.

Proof. Let ϕ = ϕ_X for shorthand. The proof proceeds by an integration argument, in which we show that log ϕ(λ) - λ E[X] ≤ λ²σ²/2. First, note that
   ϕ'(λ) = E[X e^{λX}],
so that inequality (2) is equivalent to
   λ ϕ'(λ) - ϕ(λ) log ϕ(λ) = H(e^{λX}) ≤ (λ²σ²/2) ϕ(λ),
and dividing both sides by λ²ϕ(λ) yields the equivalent statement
   ϕ'(λ) / (λ ϕ(λ)) - log ϕ(λ) / λ² ≤ σ²/2.
But by inspection, we have
   (d/dλ)[ (1/λ) log ϕ(λ) ] = ϕ'(λ) / (λ ϕ(λ)) - log ϕ(λ) / λ².
Moreover, we have that
   lim_{λ → 0} (1/λ) log ϕ(λ) = lim_{λ → 0} (log ϕ(λ) - log ϕ(0)) / λ = ϕ'(0) / ϕ(0) = E[X].
Integrating from 0 to any λ_0 > 0, we thus obtain
   (1/λ_0) log ϕ(λ_0) - E[X] = ∫_0^{λ_0} (d/dλ)[ (1/λ) log ϕ(λ) ] dλ ≤ ∫_0^{λ_0} (σ²/2) dλ = σ²λ_0/2.
Multiplying each side by λ_0 gives
   log E[e^{λ_0 (X - E[X])}] = log E[e^{λ_0 X}] - λ_0 E[X] ≤ σ²λ_0²/2,
as desired.

Note: the argument can be extended to sub-exponential random variables.
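Here is a quick sanity check of the entropy condition (2) (my own illustration, not part of the notes): for a Rademacher variable, σ² = 1 is a natural candidate constant, and the code below evaluates both sides of (2) on a grid of λ.

import numpy as np

# X Rademacher: P(X = 1) = P(X = -1) = 1/2, so E[X] = 0; try sigma^2 = 1 in the
# Herbst condition H(e^{lam X}) <= (lam^2 sigma^2 / 2) * phi(lam).
values = np.array([-1.0, 1.0])
probs = np.array([0.5, 0.5])
sigma2 = 1.0

def entropy_of_exp(lam):
    # H(e^{lam X}) = E[e^{lam X} * lam X] - E[e^{lam X}] * log E[e^{lam X}].
    w = np.exp(lam * values)
    mgf = np.sum(probs * w)
    return np.sum(probs * w * lam * values) - mgf * np.log(mgf)

for lam in np.linspace(-5.0, 5.0, 11):
    lhs = entropy_of_exp(lam)
    rhs = 0.5 * lam ** 2 * sigma2 * np.sum(probs * np.exp(lam * values))
    print(f"lam = {lam:+.1f}:  H(e^(lam X)) = {lhs:8.4f}  <=  {rhs:8.4f}")

A grid check of course proves nothing, but it shows the shape of the inequality: the entropy grows much more slowly than λ²ϕ(λ)/2 as |λ| increases, which is exactly the slack the Herbst integration argument exploits.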

III. Information theoretic inequalities

a. Idea: let us relate divergences to entropy quantities. For this part, let X_{\i} = (X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) denote the collection of all variables except X_i.

b. Intermediate step: Han's inequality.

Proposition 4. Let X_1, ..., X_n be discrete random variables. Then
   H(X_1^n) ≤ (1/(n-1)) Σ_{i=1}^n H(X_{\i}).

Proof. The proof is a consequence of the chain rule for entropy and the fact that conditioning reduces entropy. We have
   H(X_1^n) = H(X_i | X_{\i}) + H(X_{\i}) ≤ H(X_i | X_1^{i-1}) + H(X_{\i}).
Writing this inequality for each i = 1, ..., n and summing, we obtain
   n H(X_1^n) ≤ Σ_{i=1}^n H(X_{\i}) + Σ_{i=1}^n H(X_i | X_1^{i-1}) = Σ_{i=1}^n H(X_{\i}) + H(X_1^n),
and subtracting H(X_1^n) from both sides gives the result.
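Han's inequality is easy to check on a small example. The sketch below (an illustration of mine, not from the notes) draws a random joint pmf on {0,1}^3 and verifies that H(X_1, X_2, X_3) ≤ (1/2)[H(X_2, X_3) + H(X_1, X_3) + H(X_1, X_2)].

import numpy as np

rng = np.random.default_rng(1)

def entropy(pmf):
    # Shannon entropy (in nats) of a pmf given as an array of probabilities.
    p = pmf[pmf > 0]
    return -np.sum(p * np.log(p))

# Random joint pmf over {0, 1}^3.
joint = rng.random((2, 2, 2))
joint /= joint.sum()
n = joint.ndim

H_full = entropy(joint.ravel())
# Marginalizing out coordinate i (summing over axis i) gives the law of X_{\i}.
H_leave_one_out = [entropy(joint.sum(axis=i).ravel()) for i in range(n)]

lhs, rhs = H_full, sum(H_leave_one_out) / (n - 1)
print(f"H(X_1^n) = {lhs:.4f}  <=  {rhs:.4f} = (1/(n-1)) * sum of leave-one-out entropies")
assert lhs <= rhs + 1e-12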

c. Intermediate step: a divergence version of Han's inequality. Let Q be an arbitrary distribution over X^n and let P = P_1 × ··· × P_n be a product distribution. For A ⊂ X^{n-1}, define the marginals Q^{(i)}(A) := Q(X_{\i} ∈ A) and P^{(i)}(A) := P(X_{\i} ∈ A).

Proposition 5. With the above definitions,
   D_kl(Q || P) ≤ Σ_{i=1}^n [ D_kl(Q || P) - D_kl(Q^{(i)} || P^{(i)}) ].

Proof. We have seen earlier in the notes (recall the definition of the KL divergence as a supremum over all quantizers and the surrounding discussion) that it is no loss of generality to assume that X is discrete. Thus, writing the probability mass functions
   q^{(i)}(x_{\i}) = Σ_x q(x_1, ..., x_{i-1}, x, x_{i+1}, ..., x_n)   and   p^{(i)}(x_{\i}) = Π_{j≠i} p_j(x_j),
we have that Han's inequality (Proposition 4) is equivalent to
   (n-1) Σ_{x^n} q(x^n) log q(x^n) ≥ Σ_{i=1}^n Σ_{x_{\i}} q^{(i)}(x_{\i}) log q^{(i)}(x_{\i}).
Now, subtracting (n-1) Σ_{x^n} q(x^n) log p(x^n) from both sides of the preceding display, we obtain
   (n-1) D_kl(Q || P) = (n-1) Σ_{x^n} q(x^n) log q(x^n) - (n-1) Σ_{x^n} q(x^n) log p(x^n)
                      ≥ Σ_{i=1}^n Σ_{x_{\i}} q^{(i)}(x_{\i}) log q^{(i)}(x_{\i}) - (n-1) Σ_{x^n} q(x^n) log p(x^n).
We expand the final term. Indeed, by the product nature of the distribution p, we have
   (n-1) Σ_{x^n} q(x^n) log p(x^n) = Σ_{x^n} q(x^n) Σ_{i=1}^n Σ_{j≠i} log p_j(x_j)
                                   = Σ_{i=1}^n Σ_{x^n} q(x^n) log p^{(i)}(x_{\i})
                                   = Σ_{i=1}^n Σ_{x_{\i}} q^{(i)}(x_{\i}) log p^{(i)}(x_{\i}),
since Σ_{j≠i} log p_j(x_j) = log p^{(i)}(x_{\i}). Noting that
   Σ_{x_{\i}} q^{(i)}(x_{\i}) log q^{(i)}(x_{\i}) - Σ_{x_{\i}} q^{(i)}(x_{\i}) log p^{(i)}(x_{\i}) = D_kl(Q^{(i)} || P^{(i)}),
we conclude that (n-1) D_kl(Q || P) ≥ Σ_{i=1}^n D_kl(Q^{(i)} || P^{(i)}), and rearranging gives the desired result.
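The divergence form can be verified numerically in the same way. The sketch below (mine, not from the notes) builds a product distribution P on {0,1}^3 and an arbitrary joint Q, and checks Proposition 5 in the equivalent form (n-1) D_kl(Q || P) ≥ Σ_i D_kl(Q^{(i)} || P^{(i)}).

import numpy as np

rng = np.random.default_rng(2)
n = 3

def kl(q, p):
    # D_kl(q || p) in nats for pmfs on a common finite set (p assumed strictly positive).
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

# Product distribution P = P_1 x P_2 x P_3 on {0, 1}^3.
P = np.ones((2,) * n)
for i, a in enumerate(rng.uniform(0.2, 0.8, size=n)):
    shape = [1] * n
    shape[i] = 2
    P = P * np.array([a, 1 - a]).reshape(shape)

# Arbitrary joint distribution Q on {0, 1}^3.
Q = rng.random((2,) * n)
Q /= Q.sum()

D_full = kl(Q.ravel(), P.ravel())
D_marg = [kl(Q.sum(axis=i).ravel(), P.sum(axis=i).ravel()) for i in range(n)]

print(f"(n-1) * D(Q||P) = {(n - 1) * D_full:.4f}  >=  {sum(D_marg):.4f} = sum_i D(Q^(i)||P^(i))")
assert (n - 1) * D_full >= sum(D_marg) - 1e-12

Note that the product structure of P matters: for a non-product P the inequality can fail, which is consistent with where the proof uses log p(x^n) = Σ_j log p_j(x_j).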

d. Tilting a distribution.
   1. A frequent idea (large deviations, statistics, reliability, heavy-tailed data).
   2. Intuition: let Y = f(X_1, ..., X_n) ≥ 0. If Y is concentrated around its mean under the distribution P, we would expect f to be nearly constant under P, that is, f(x^n) p(x^n) ≈ c p(x^n); thus the tilted distribution
      q(x^n) := f(x^n) p(x^n) / E_P[f(X^n)]
   should have D_kl(Q || P) small.

These insights allow us to tensorize the entropy:

Theorem 6. Let X_1, ..., X_n be independent random variables and Y = f(X_1^n), where f is a non-negative function. Define the conditional entropy
   H(Y | X_{\i}) = E[Y log Y | X_{\i}] - E[Y | X_{\i}] log E[Y | X_{\i}].
Then
   H(Y) ≤ E[ Σ_{i=1}^n H(Y | X_{\i}) ].    (3)

Proof. It is clear that if inequality (3) holds for Y, it also holds identically for cY, so we assume without loss of generality that E_P[Y] = 1. Thus, defining the tilted distribution q(x^n) = f(x^n) p(x^n), we have Q(X^n) = 1, so that Q is a probability distribution; moreover, we have
   D_kl(Q || P) = ∫ q(x^n) log( q(x^n) / p(x^n) ) dx^n = ∫ f(x^n) p(x^n) log f(x^n) dx^n = H(Y),
and similarly, if φ(t) = t log t, then
   D_kl(Q || P) - D_kl(Q^{(i)} || P^{(i)})
      = E[φ(Y)] - ∫ ( ∫ f(x_1^{i-1}, x, x_{i+1}^n) p_i(x) dx ) log( ∫ f(x_1^{i-1}, x, x_{i+1}^n) p_i(x) dx ) p^{(i)}(x_{\i}) dx_{\i}
      = E[φ(Y)] - ∫ E[Y | x_{\i}] log E[Y | x_{\i}] p^{(i)}(x_{\i}) dx_{\i}
      = E[φ(Y)] - E[φ(E[Y | X_{\i}])].
Noting by the tower property of expectations that
   E[φ(Y)] - E[φ(E[Y | X_{\i}])] = E[ E[φ(Y) | X_{\i}] - φ(E[Y | X_{\i}]) ] = E[H(Y | X_{\i})]
and using Han's inequality for relative entropies (Proposition 5) gives
   H(Y) = D_kl(Q || P) ≤ Σ_{i=1}^n [ D_kl(Q || P) - D_kl(Q^{(i)} || P^{(i)}) ] = Σ_{i=1}^n E[H(Y | X_{\i})],
which is our desired result.

   3. Some intuition: if we can show that individually H(Y | X_{\i}) is not too big, then the Herbst argument (Proposition 3), coupled with the Hoeffding-type bounds above, will give strong sub-Gaussian tails. (A small numerical check of the tensorization inequality (3) appears just below.)

IV. Convex functions and concentration
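As promised above, here is a small numerical check of Theorem 6 (an illustration of mine, not from the notes; the choice of f and the fair-coin design are arbitrary): it computes both sides of inequality (3) exactly for independent fair coin flips X_1, X_2, X_3.

import numpy as np
from itertools import product

n = 3
outcomes = list(product([0, 1], repeat=n))   # all values of (X_1, X_2, X_3)
p_full = 0.5 ** n                            # independent fair coins

def f(x):
    # An arbitrary non-negative function of (X_1, X_2, X_3).
    return np.exp(0.7 * (x[0] + x[1]) + 0.3 * x[2])

def phi(t):
    return t * np.log(t) if t > 0 else 0.0

def phi_entropy(weights, values):
    # H(Y) = E[phi(Y)] - phi(E[Y]) for Y taking values[k] with probability weights[k].
    ey = sum(w * v for w, v in zip(weights, values))
    return sum(w * phi(v) for w, v in zip(weights, values)) - phi(ey)

# Left-hand side of (3): the entropy of Y = f(X_1, X_2, X_3).
lhs = phi_entropy([p_full] * len(outcomes), [f(x) for x in outcomes])

# Right-hand side of (3): sum over i of E[ H(Y | X_{\i}) ].
rhs = 0.0
for i in range(n):
    for rest in product([0, 1], repeat=n - 1):   # a realization of X_{\i}
        # Given X_{\i} = rest, X_i is still a fair coin, so Y takes two values w.p. 1/2 each.
        vals = [f(rest[:i] + (xi,) + rest[i:]) for xi in (0, 1)]
        rhs += 0.5 ** (n - 1) * phi_entropy([0.5, 0.5], vals)

print(f"H(Y) = {lhs:.6f}  <=  {rhs:.6f} = sum_i E[H(Y | X_(-i))]")
assert lhs <= rhs + 1e-12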
