The Moment Method; Convex Duality; and Large/Medium/Small Deviations

Size: px

Start display at page:

Download "The Moment Method; Convex Duality; and Large/Medium/Small Deviations"

Grace Ray
5 years ago
Views:

1 Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential inequality for sum of independent random variables is very easy to apply because independence allows us to change the problem of estimating the exponential moment of the sum of independent random variables into the estimating of the exponential moment of a single random variable. Theorem.. For any n and ɛ > 0: n ln P ( X n µ + ɛ) inf λ>0 [ λɛ + ln Eeλ(X µ) ]. Similarly n ln P ( X n µ ɛ) inf λ<0 [λɛ + ln Eeλ(X µ) ]. Another way to write tail bound is Corollary.2. We have that where I(z) defined as is the rate function. P ( X n µ + ɛ) exp[ ni(µ + ɛ)], I(z) = inf λ>0 [ λz + ln EeλX ] Example: Gaussian random variable X i N(µ, σ 2 ), then Ee λ(x µ) = Therefore (with optimal λ = ɛ/σ 2 below) 2πσ e λx e x2 /2σ 2 dx = 2π e λ2 σ 2 /2 e (x/σ λσ)2 /2 dx/σ = e λ2 σ 2 /2. inf [ λɛ + ln λ>0 Ee λ(x µ) ] = inf [ λɛ + λ>0 λ2 σ 2 /2] = ɛ 2 /2σ 2. Exactly the same (and tight) estimate of Gaussian tail inequality derived by integration.. The Fenchel Conjugate Assume we have a vector space X, equipped with an inner product <, > (and a dual space X for all practical purposes, think of X = X ). Let f be a function: f : X R {+ }

2 i.e. f takes values on the extended real number line. The convex conjugate is f, where f : X R {+ }, is defined as: Equivalently, f (λ) = sup ( < λ, x > f(x) ) x X f (λ) = inf ( < λ, x > +f(x) ) x X.2 A Variational interpretation of the Rate Function The function Γ(λ) = ln Ee λx is called the cumulant generating function (i.e. it is the logarithmic moment generating function) of a random variable X. A slightly modified function is to constrain λ to bigger than 0. So a modified function is: { Γ(λ) if λ > 0 Γ + (λ) = else This modification is so that when we take an inf over λ, we effectively are only considering the positive λ. Corollary.3. We have that and that P ( X n µ + ɛ) exp[ nγ +(µ + ɛ)], Γ +(z) = I(z) Let us now understand Γ + Let P be the original measure on X. Let us now define a different measure, P λ on X as follows: Note that it is normalized. Given some ɛ, suppose we find the λ such that: dp λ (X x) = e λx Γ(λ) dp (X x) E X Pλ [X] = µ + ɛ Slightly abusing notation let P µ+ɛ be this distribution (e.g. P µ+ɛ is P λ for the λ which solves the above). Intuitively, P µ+ɛ is the perturbed distribution which shifts the mean by ɛ. Lemma.4. Assume P µ+ɛ exists. Then: Proof. Left as a homework problem. I(µ + ɛ) = KL(P µ+ɛ P ) 2

3 2 Small, Medium, and Large Deviations The tail probability we are interested in is: P ( X n µ + ɛ) For large n, we seek to understand the behavior of: where ɛ n is a function of n. P ( X n µ + ɛ n ) We can think of three natural asymptotics. If ɛ n behaves as / n, e.g. ) ɛ n = Θ (z σ2 n then the CLT capture the tail probability (under the tail probability with respect z, under the standard normal). If ɛ is a constant, then this large deviation tail probability tends to 0. Here, we can characterize As we will see, the rate function governs this asymptote. lim n n P ( X n µ + ɛ) In practice, we are often interested in medium deviations. By this, we often use ɛ n 0 and nɛ n e.g. we typically are interested in ɛ n going to 0 at a rate slower than n (this is because as n increases, we also increase the model complexity). As we now see, the the rate function accurately captures both the large and small regime, so we expect it to naturally capture the medium deviation regime. 2. The Small Deviation Limit Note that Theorem 2.. We have that: P ( X n µ + z σ2 n ) ni(µ + z σ2 n ) σ2 lim ni(µ + z ) = z2 n n 2 Proof. HW exercise Note this is upper bound is sharp approximation to the true Gaussian tail probability (as seen in the last lecture). Hence, in the small deviation regime, we have not lost much. 2.2 The Large Deviation Limit For large deviation, the exponential Markov inequality is asymptotically tight in the following sense: 3

4 Theorem 2.2. For all ɛ > ɛ > 0: lim n n ln P ( X n µ + ɛ) I(µ + ɛ ). Proof. We only prove the first inequality. Let X i have CDF at x as P (X i x), and define distribution of X i with density at x as dp λ (X i x) = e λx Γ(λ) dp (X i x) (e.g. under P λ ). Let X n = n n i= X i. Then e nλ P i xi+nγ(λ) i dp (X i x i) = i d(p (X i x i ). We use I to denote indicator function: P ( X n µ + ɛ) P ( X n µ [ɛ, ɛ ]) =E X,...,Xn I( X n µ [ɛ, ɛ ]) =E X,...,X n e λn X n +nγ(λ) I( X n µ [ɛ, ɛ ]) e λn(µ+ɛ )+nγ(λ) E X n I( X n µ [ɛ, ɛ ]) e λn(µ+ɛ )+nγ(λ) ( o()) e inf λ>0( λn(µ+ɛ )+nγ(λ) ( o()). Note that Γ (λ) = E X n X n. Therefore with λ = arg min inf λ [ λ(µ + (ɛ + ɛ)/2) + Γ(λ)], we have E X n X n = Γ (λ) = µ + (ɛ + ɛ)/2, and thus by the law of large numbers E X n I( X n µ [ɛ, ɛ ]) = o() as n. This proves the lower bound. Compare with Gaussian for similarity. Generally, with a more careful estimate, one can obtain a tight lower bound for finite n at ɛ = ɛ + O( V ar(x)/n). That is, for exponential inequality, the looseness in terms of deviation ɛ is only O( V ar(x)/n). 3 Sub-Gaussian Random Variables Recall that for a Gaussian random variable: ln Ee λ(x µ) = λ2 2 σ2. We say X is a sub-gaussian random variable if it has quadratically bounded logarithmic moment generating function,e.g. ln Ee λ(x µ) λ2 2 b. For this case, we have for z > µ: Taking derivative at optimal λ : which implies that λ = (µ z)/b and That is, we have Similarly, λ2 I(z) = inf ( λz + λµ + λ>0 2 b). z + µ = λ b, I(z) = (z µ)2. 2b P ( X n µ + ɛ) e nɛ2 /2b. P ( X n µ ɛ) e nɛ2 /2b. 4

5 Bounded Random Variables Clearly, a Gaussian variable N(µ, σ 2 ) is sub-gaussian with b = σ 2. Now let us consider the case of a bounded random variable. Lemma 3.. (Hoeffding s Lemma) Suppose that X [b, b + ] with probability, then X is sub-gaussian with b = (b + b ) 2 /4. To prove this, first let us observe the following: Lemma 3.2. We have the following equality for the second derivative of Γ which is the variance of X under P λ. Proof. Left as a HW exercise. Now we can prove Hoeffding s lemma: Proof. First observe that: Hence: Since Γ (0) = µ, we have that: Γ (λ) = Var Pλ (X) X b + b b + b 2 2 Var Pλ (X) = Var Pλ (X b + b ) (b + b ) Γ (λ) λµ + λ 2 (b + b ) 2 8 (by integration and the fact the upper bound has the same first derivative as Γ at λ = 0). Corollary 3.3. Suppose that X [b, b + ] with probability, then P ( X n µ + ɛ) e 2nɛ2 /(b + b ) 2. Similarly, P ( X n µ ɛ) e n2ɛ2 /(b + b ) 2. 5

Hoeffding, Chernoff, Bennet, and Bernstein Bounds

Hoeffding, Chernoff, Bennet, and Bernstein Bounds Stat 928: Statistical Learning Theory Lecture: 6 Hoeffding, Chernoff, Bennet, Bernstein Bounds Instructor: Sham Kakade 1 Hoeffding s Bound We say X is a sub-gaussian rom variable if it has quadratically