Hoeffding, Chernoff, Bennet, and Bernstein Bounds

Size: px

Start display at page:

Download "Hoeffding, Chernoff, Bennet, and Bernstein Bounds"

Charlene Nicholson
6 years ago
Views:

1 Stat 928: Statistical Learning Theory Lecture: 6 Hoeffding, Chernoff, Bennet, Bernstein Bounds Instructor: Sham Kakade 1 Hoeffding s Bound We say X is a sub-gaussian rom variable if it has quadratically bounded logarithmic moment generating function,e.g. For a sub-gaussian rom variable, we have ln Ee λ(x µ) λ2 2 b. P ( X n µ + ɛ) e nɛ2 /2b. Similarly, P ( X n µ ɛ) e nɛ2 /2b. 2 Chernoff Bound For a binary rom variable, recall the KullbackLeibler divergence is KL(p q) = p ln(p/q) + (1 p) ln((1 p)/(1 q)). Theorem 2.1. (Relative Entropy Chernoff Bound) Assume that X [0, 1] EX = µ. We have the following inequality P ( X n µ + ɛ) e nkl(µ+ɛ µ) First, let us underst the worst case MGF for X. P ( X n µ ɛ) e nkl(µ ɛ µ), Lemma 2.2. Assume that X [0, 1] EX = µ. We have the following inequality Ee λx (1 µ)e 0 + µe λ This shows that the maximum logarithmic moment generating function is achieved with a {0, 1} valued rom variable, i.e. Ee λx E X µ[e λx ] where X is a {0, 1} valued rom variable which takes the value 1 with probability µ. Proof. Let M X (λ) = Ee λx M X (λ) = (1 µ)e 0 + µe λ. Then M X (0) = M X (0). Moreover, which completes the proof. M X(λ) = EXe λx EXe λ 1 = µe λ = M X (λ) 1

2 Now we are ready to provide the proof. Proof. By the previous lemma, we only need to prove the result for binary X {0, 1}, with mean 1. Recall from Lemma 1.4 in the previous lecture that, I(µ + ɛ) = KL(P µ+ɛ P ) where P µ+ɛ was the variational distribution P λ where λ was is set such that E X Pλ [X] = µ + ɛ. Since X is binary, it must be that P µ+ɛ is just distribution which is 1 with probability µ + ɛ. Hence KL(P µ+ɛ P ) is just the KL between two binary distributions with means µ + ɛ µ, which completes the proof. 2.1 Useful Forms of the Chernoff Bound Note that by Hoeffding s lemma (as X is sub-gaussian), we have (from Lecture 5) that KL(µ + ɛ µ) = inf λ>0 [ λ(µ + ɛ) + ln((1 µ)e0 + µe λ )] 2ɛ 2 Define V ar p be the variance of a X which is 1 with probability p 0 with probability 1 p. It is straightforward to show that the second derivative with respect to δ is: Define KL (µ + δ µ) = 1/V ar δ MaxVar[µ, µ + ɛ] = max p [µ,µ+ɛ] V ar p which provides a lower bound on the second derivative for δ between 0 ɛ. Hence, we have that: which leads to a nicer version of the Chernoff bound. KL(µ + ɛ µ) 1 2 ɛ2 /MaxVar[µ, µ + ɛ] Theorem 2.3. (Nicer Form of the Chernoff Bound) Assume that X [0, 1] EX = µ. Fix ɛ. Define: MaxVar[µ, µ + ɛ] = max p [µ,µ+ɛ] V ar p as before (i.e. it is the maximal variance (of {0, 1} variable) between µ µ + ɛ). We have the following inequality P ( X n µ + ɛ) e n ɛ 2 2 MaxVar[µ,µ+ɛ] P ( X n µ ɛ) e n ɛ 2 2 MaxVar[µ ɛ,µ] The following corollary (while always true) is much sharper bound than Hoeffding s bound when µ 0. Corollary 2.4. We have the following bound: P ( X n µ + ɛ) exp[ nɛ 2 /2(µ + ɛ)] thus P ( X n µ ɛ) exp[ nɛ 2 /2µ]. 2

3 This implies a multiplicative form of the Chernoff bound since: δ 2 P ( X n (1 + δ)µ) exp[ nµ 2(1 + δ) ] P ( X n (1 δ)µ) exp[ nµδ 2 /2] Similar results for Bernstein Bennet inequalities are available. 3 Bennet Inequality In Bennet inequality, we assume that the variable is upper bounded, want to estimate its moment generating function using variance information. Lemma 3.1. If X EX 1, then λ 0: where µ = EX ln Ee λ(x µ) (e λ λ 1)V ar(x). Proof. It suffices to prove the lemma when µ = 0. Using ln z z 1, we have ln Ee λx = ln Ee λx Ee λx 1 = λ 2 E eλx λx 1 (λx) 2 (X) 2 λ 2 E eλ λ 1 λ 2 (X) 2, where the second inequality follows from the fact that the function (e z z 1)/z 2 is non-decreasing λx λ. Lemma 3.2. We have inf [ λɛ + λ>0 (eλ λ 1)V ar(x)] = V ar(x)φ(ɛ/v ar(x)) 2(V ar(x) + ɛ/3). where φ(z) = (1 + z) ln(1 + z) z. Proof. Take derivative with respect to λ, we obtain ɛ + (e λ 1)V ar(x) = 0. Therefore λ = ln(1 + ɛ/v ar(x)). Plug in, we obtain the equality. It is easy to verify using Taylor expansion of the exponential function that for λ (0, 3): e λ λ 1 λ2 2 Now by picking λ = ɛ/(v ar(x) + ɛ/3), we have λɛ + (λ/3) m λ 2 = 2(1 λ/3). m=0 λ 2 2(1 λ/3) = ɛ2 /[2V ar(x) + 2ɛ/3]. 3 ɛ 2

4 This proves the desired bound. The above bound implies the following bound: If X EX b, for some b > 0, then P [X EX + ɛ] exp[ nɛ 2 /(2V ar(x) + 2ɛb/3)]. This is similar to the Gaussian result, except for the term 2ɛb/3. Behaves similar to Gaussian tail bound when ɛb V ar(x). 4 Bernstein Inequality In Bernstein inequality, we obtain a result similar to the simplified Bennet bound but with a moment condition. There are different forms. We consider one form. Lemma 4.1. If X satisfies the moment condition with b > 0 for integers m 2: EX m m!b m 2 V/2, then when λ (0, 1/b): thus ln Ee λx λex + 0.5λ 2 V (1 λb) 1, P [ X n EX + ɛ] exp[ nɛ 2 /(2V + 2ɛb)]. Proof. We have the following estimation of logarithmic moment generating function: ln Ee λx Ee λx 1 λex + 0.5V λ 2 m=2 b m 2 λ m 2 = λex + 0.5λ 2 V (1 λb) 1. The last inequality is similar to the proof of Bennet inequality. Exercise: finish the proof. 5 Independent but non-iid rom variables If X 1,..., X n are independent but not iid. Let X n = n 1 n i=1 X i, µ = E X n, then we have In particular, we have the following results: P ( X n µ + ɛ) inf [ λn(µ + ɛ) + n ln Ee λxi ]. λ>0 Lemma 5.1. If X i are sub-gaussians with Ee λxi λex i + 0.5λ 2 V i, then P ( X n µ + ɛ) exp [ n2 ɛ 2 ] 2 n i=1 V. i An example is Radamecher average: let σ i = {±1} be independent rom Bernoulli variables, a i be fixed numbers, then n [ P (n 1 nɛ 2 ] σ i a i ɛ) exp 2n 1 n. i=1 a2 i Similarly one can derive bounds for Bennet Bernstein inequalities. i=1 i=1 4

5 Lemma 5.2. If X i EX i b for all i, then [ P ( X n µ + ɛ) exp n 2 ɛ 2 2 n i=1 V ar(x i) + 2nbɛ/3 ]. 6 Alternative Expression Tail inequality: P (deviation ɛ) δ(ɛ). Equivalent expression: with probability 1 δ: deviation ɛ(δ), where ɛ(δ) is the inverse function of δ(ɛ). For example the Chernoff bound P ( X n µ ɛ) exp( 2nɛ 2 ) = δ, means with probability 1 δ: Xn EX ln(1/δ)/(2n). For Bennet inequality, we set thus using x + y x + y: P [ X n EX + ɛ] exp[ nɛ 2 /(2V ar(x) + 2ɛb/3)], δ = exp[ nɛ 2 /(2V ar(x) + 2ɛb/3)], ɛ = 2V ar(x) ln(1/δ)/n + b 2 ln(1/δ) 2 /(9n 2 ) + b ln(1/δ) That is, with probability at least 1 δ, we have X n EX 2V ar(x) ln(1/δ)/n + 2V ar(x) ln(1/δ)/n + 2b ln(1/δ). 2b ln(1/δ) 5

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential