March 1, Florida State University. Concentration Inequalities: Martingale Approach and Entropy Method. Lizhe Sun and Boning Yang.

1 Florida State University March 1, 2018

2 Framework
1. (Lizhe) Basic inequalities; Chernoff bounding; review for STA 6448
2. (Lizhe) Discrete-time martingales; concentration inequalities via martingale approach
3. (Boning)

3 Part I:

4 Why concentration inequalities?
Concentration inequalities quantify how a random variable $X$ deviates around its mean $\mu$. They usually take the form of two-sided bounds for the tails of $X - \mu$, such as
$$P(|X - \mu| \ge t) \le \text{something very small}, \quad t > 0.$$
Based on the CLT, we can get asymptotic results as $n \to \infty$. But in the machine learning community and in high-dimensional data analysis, we prefer to exploit non-asymptotic properties of random variables.

5 Basic inequalities I
First of all, we recall some basic tools and inequalities. For any nonnegative random variable $X$,
$$E(X) = \int_0^\infty P(X \ge t)\,dt.$$
This implies Markov's inequality: for any nonnegative random variable $X$ and $t > 0$,
$$P(X \ge t) \le \frac{E(X)}{t}.$$

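6 Basic inequalities II
In general, if $\varphi$ is a strictly monotonically increasing nonnegative-valued function, then for any random variable $X$ and real number $t$ with $\varphi(t) > 0$,
$$P(X \ge t) = P\{\varphi(X) \ge \varphi(t)\} \le \frac{E(\varphi(X))}{\varphi(t)}.$$
For example, $\varphi(x) = x^2$, applied to the nonnegative variable $|X - EX|$, induces Chebyshev's inequality: if $X$ is an arbitrary random variable and $t > 0$, then
$$P\{|X - EX| \ge t\} = P\{|X - EX|^2 \ge t^2\} \le \frac{E|X - EX|^2}{t^2} = \frac{\mathrm{var}\{X\}}{t^2}.$$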
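To get a feel for how loose these two bounds can be, the following sketch compares them with an exact tail probability. The distribution (Binomial(100, 1/2)) and the threshold 70 are illustrative assumptions, not taken from the slides, and `binomial_tail` is a helper name introduced here.

```python
from math import comb


def binomial_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Illustrative choice (not from the slides): X ~ Binomial(100, 1/2), threshold t = 70.
n, p, t = 100, 0.5, 70
mean = n * p
variance = n * p * (1 - p)

exact = binomial_tail(n, p, t)

# Markov: P(X >= t) <= E(X) / t for nonnegative X.
markov = mean / t

# Chebyshev: {X >= t} is contained in {|X - EX| >= t - EX},
# so P(X >= t) <= var(X) / (t - EX)^2.
chebyshev = variance / (t - mean) ** 2

print(f"exact tail : {exact:.3e}")
print(f"Markov     : {markov:.3e}")
print(f"Chebyshev  : {chebyshev:.3e}")
```

Both bounds hold, but neither captures the exponential decay of the true tail, which is what motivates the exponential choice of $\varphi$ on the next slide.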

7 Chernoff bounding technique
Taking $\varphi(x) = \exp(\lambda x)$, where $\lambda$ is an arbitrary positive number, for any random variable $X$ and any $t > 0$ we have
$$P(X \ge t) = P\{\exp(\lambda X) \ge \exp(\lambda t)\} \le \exp(-\lambda t)\,E[\exp(\lambda X)] = \exp(-\lambda t + \log E[\exp(\lambda X)]), \quad \lambda > 0.$$
Please note that if we want to bound the probability of the lower tail, $P(X \le t)$, we follow the same steps, but with $-X$ rather than $X$.
Now we need to obtain a tight upper bound for $\exp(-\lambda t + \log E[\exp(\lambda X)])$, and then minimize it over $\lambda > 0$.
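The Chernoff recipe can be tried numerically whenever the log moment generating function is available. The sketch below is a hedged illustration: the Binomial(100, 1/2) example, the threshold 70, the grid of $\lambda$ values, and the names `chernoff_bound` and `binomial_log_mgf` are all assumptions made here, and a grid search is used in place of an exact minimization.

```python
from math import exp, log


def chernoff_bound(log_mgf, t, lambdas):
    """Smallest value of exp(-lam * t + log E[exp(lam * X)]) over a grid of lam > 0."""
    return min(exp(-lam * t + log_mgf(lam)) for lam in lambdas)


# Illustrative choice (not from the slides): X ~ Binomial(100, 1/2).
n, p = 100, 0.5


def binomial_log_mgf(lam):
    """log E[exp(lam * X)] = n * log(1 - p + p * e^lam) for X ~ Binomial(n, p)."""
    return n * log(1 - p + p * exp(lam))


# Grid of positive lam values; any lam > 0 yields a valid bound,
# so a coarse grid can only make the result looser, never invalid.
lambda_grid = [k / 1000 for k in range(1, 3001)]

bound = chernoff_bound(binomial_log_mgf, 70, lambda_grid)
print(f"Chernoff bound on P(X >= 70): {bound:.3e}")
```

For this example the Chernoff bound is orders of magnitude smaller than the Markov and Chebyshev bounds at the same threshold.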

8 Chernoff bounding technique: Hoeffding inequality I
Theorem. Let $X$ be a random variable such that $X \in [a, b]$ a.s. for some finite $a < b$. Then, for any $\lambda \ge 0$,
$$E[\exp(\lambda(X - EX))] \le \exp\left(\lambda^2 (b - a)^2 / 8\right).$$
Proof. For any $p \in [0, 1]$ and $s \in \mathbb{R}$, let us define the function
$$H_p(s) = \log[p \exp(s(1 - p)) + (1 - p)\exp(-sp)].$$
Let $\xi = X - EX$, where $\xi \in [a - EX, b - EX]$. Using the convexity of the exponential function, we can write

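9 Chernoff bounding technique: Hoeffding inequality II
Proof.
$$\exp(\lambda \xi) = \exp\left(\frac{X - a}{b - a}\,\lambda(b - EX) + \frac{b - X}{b - a}\,\lambda(a - EX)\right) \le \frac{X - a}{b - a}\exp(\lambda(b - EX)) + \frac{b - X}{b - a}\exp(\lambda(a - EX)).$$
Taking expectations of both sides, we get
$$E[\exp(\lambda \xi)] \le \frac{EX - a}{b - a}\exp(\lambda(b - EX)) + \frac{b - EX}{b - a}\exp(\lambda(a - EX)).$$
Letting $p = \frac{EX - a}{b - a}$ and $s = \lambda(b - a)$, we have
$$\exp(H_p(s)) = \frac{EX - a}{b - a}\exp(\lambda(b - EX)) + \frac{b - EX}{b - a}\exp(\lambda(a - EX)).$$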

10 Chernoff bounding technique: Hoeffding inequality III
Proof. Using a Taylor expansion, we get the bound $H_p(s) \le \frac{s^2}{8}$ for all $p \in [0, 1]$ and all $s \in \mathbb{R}$. With the above definitions of $p$ and $s$, this gives Hoeffding's inequality.
A disadvantage of Hoeffding's inequality is that it ignores information about the variance of $X$. Bernstein's inequality provides an improvement in this respect.
Question: Why do we use the Chernoff bounding technique?
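The Taylor expansion step can be spelled out as follows. This is a sketch of the standard argument and may differ from the one given in the talk; the auxiliary symbols $u(s)$ and $\theta$ are introduced here for illustration.

```latex
\begin{align*}
H_p(s) &= -sp + \log\bigl(1 - p + p e^{s}\bigr), \\
H_p'(s) &= -p + \frac{p e^{s}}{1 - p + p e^{s}},
  \qquad H_p(0) = H_p'(0) = 0, \\
H_p''(s) &= u(s)\bigl(1 - u(s)\bigr) \le \tfrac{1}{4},
  \qquad u(s) = \frac{p e^{s}}{1 - p + p e^{s}} \in [0, 1], \\
H_p(s) &= H_p(0) + s\,H_p'(0) + \tfrac{s^2}{2}\,H_p''(\theta) \le \frac{s^2}{8}
  \quad \text{for some } \theta \text{ between } 0 \text{ and } s.
\end{align*}
```

Substituting $s = \lambda(b - a)$ turns $s^2/8$ into $\lambda^2 (b - a)^2 / 8$, the exponent in the theorem.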

11 Example: bounding the random walk
Symmetric Bernoulli distribution. A random variable $X$ has the symmetric Bernoulli distribution (also called the Rademacher distribution) if
$$P(X = -1) = P(X = 1) = \frac{1}{2}.$$
Let $X_1, X_2, \dots, X_n$ be independent symmetric Bernoulli random variables. Then for any $t \ge 0$, we have
$$P\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(-\frac{t^2}{2n}\right).$$
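Because a sum of $n$ independent Rademacher variables equals $2B - n$ with $B \sim \text{Binomial}(n, 1/2)$, the tail can be computed exactly and compared with the bound. The sketch below does this for integer thresholds; the function names and the choice $n = 100$ are assumptions made for illustration.

```python
from math import comb, exp


def rademacher_sum_tail(n, t):
    """Exact P(X_1 + ... + X_n >= t) for integer t.

    Uses S = 2B - n with B ~ Binomial(n, 1/2), so S >= t iff B >= ceil((n + t) / 2).
    """
    k_min = max(0, (n + t + 1) // 2)  # integer ceiling of (n + t) / 2
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2**n


def random_walk_bound(n, t):
    """Upper bound exp(-t^2 / (2n)) on the same tail."""
    return exp(-(t**2) / (2 * n))


n = 100  # illustrative walk length
for t in range(0, 101, 20):
    print(f"t={t:3d}  exact={rademacher_sum_tail(n, t):.3e}  bound={random_walk_bound(n, t):.3e}")
```

The bound is valid at every threshold, and it tracks the Gaussian-like decay of the exact tail far better than a polynomial bound would.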

12 Chernoff bounding technique: beyond Hoeffding inequality
Bernstein's condition: Given a random variable $X$ with $EX = \mu$ and $\mathrm{var}(X) = \sigma^2$, we say that Bernstein's condition with parameter $b$ holds if
$$|E[(X - \mu)^k]| \le \frac{1}{2}\,k!\,\sigma^2 b^{k-2}, \quad k = 3, 4, \dots$$
Theorem. For any random variable $X$ satisfying Bernstein's condition, we have
$$E[\exp(\lambda(X - \mu))] \le \exp\left(\frac{\lambda^2 \sigma^2 / 2}{1 - b|\lambda|}\right) \quad \text{for all } |\lambda| < \frac{1}{b},$$
and moreover, we have the concentration inequality
$$P(|X - \mu| \ge t) \le 2\exp\left(-\frac{t^2}{2(\sigma^2 + bt)}\right) \quad \text{for all } t \ge 0.$$

13 Review for STA 6448

14 Part II:

15 Discrete-time martingales
Definition. Let $(\Omega, \mathcal{F}, P)$ be a probability space. A sequence $\{X_i, \mathcal{F}_i\}_{i=0}^n$, $n \in \mathbb{N}$, where the $X_i$ are random variables and the $\mathcal{F}_i$ are $\sigma$-algebras, is a martingale if the following conditions are satisfied:
1. The $\mathcal{F}_i$ form a filtration, i.e., $\{\emptyset, \Omega\} = \mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \dots \subseteq \mathcal{F}_n = \mathcal{F}$.
2. $X_i \in L^1(\Omega, \mathcal{F}_i, P)$ for every $i \in \{0, 1, \dots, n\}$.
3. For all $i \in \{1, 2, \dots, n\}$, the equality $E[X_i \mid \mathcal{F}_{i-1}] = X_{i-1}$ holds almost surely (a.s.).
A martingale can be generated by the following procedure: given an integrable r.v. $X$ and a filtration $\{\mathcal{F}_i\}_{i=0}^n$, let $X_i = E[X \mid \mathcal{F}_i]$, $i \in \{0, 1, \dots, n\}$. Then the sequence $X_0, X_1, X_2, \dots, X_n$ forms a martingale.

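16 Decomposition
Consider a r.v. $X$ associated with a filtration $\{\mathcal{F}_i\}_{i=0}^n$, where $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \mathcal{F}$. Then $E[X \mid \mathcal{F}_0] = EX$ and $E[X \mid \mathcal{F}_n] = X$, so
$$X - EX = E[X \mid \mathcal{F}_n] - E[X \mid \mathcal{F}_0] = \sum_{i=1}^n \left(E[X \mid \mathcal{F}_i] - E[X \mid \mathcal{F}_{i-1}]\right) = \sum_{i=1}^n \xi_i,$$
in which $\xi_i = E[X \mid \mathcal{F}_i] - E[X \mid \mathcal{F}_{i-1}]$. Here, we call $\{\xi_i\}_{i=1}^n$ a martingale difference sequence.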
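The construction $X_i = E[X \mid \mathcal{F}_i]$ and the telescoping sum can be checked by brute force on a small example. In the sketch below the setting is an assumption made for illustration: $X = f(\varepsilon_1, \dots, \varepsilon_n)$ with independent fair $\pm 1$ signs, $\mathcal{F}_i$ generated by the first $i$ signs, and $f(x) = |x_1 + \dots + x_n|$. The names `doob_martingale` and `f` are introduced here.

```python
from itertools import product


def doob_martingale(f, outcome):
    """Values E[f | F_0], ..., E[f | F_n] along one outcome.

    Assumes the coordinates are independent and uniform on {-1, +1}, and that
    F_i is generated by the first i coordinates, so conditioning on F_i means
    fixing outcome[:i] and averaging f over all completions.
    """
    n = len(outcome)
    values = []
    for i in range(n + 1):
        completions = list(product((-1, 1), repeat=n - i))
        total = sum(f(outcome[:i] + tail) for tail in completions)
        values.append(total / len(completions))
    return values


def f(x):
    """Illustrative function: absolute value of the walk's endpoint."""
    return abs(sum(x))


outcome = (1, -1, 1, 1)
m = doob_martingale(f, outcome)
xi = [m[i] - m[i - 1] for i in range(1, len(m))]

print("E[X | F_i] along the path:", m)
print("martingale differences   :", xi)
print("sum of differences       :", sum(xi), "= X - EX =", f(outcome) - m[0])
```

The first value is $EX$, the last is $X$ itself, and the differences add up to $X - EX$, as the decomposition states.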

17 Review: Chernoff bounding technique
Here, if we consider the logarithmic moment generating function again, we have the following equality:
$$\log E[\exp(\lambda(X - EX))] = \log E\left[\exp\left(\lambda \sum_{i=1}^n \xi_i\right)\right] = \log E\left[\prod_{i=1}^n \exp(\lambda \xi_i)\right].$$
Here, an intuitive idea is to bound each factor $\exp(\lambda \xi_i)$, $i = 1, 2, \dots, n$.

18 Azuma inequality I
Theorem. Let $\{X_i, \mathcal{F}_i\}_{i=0}^n$ be a real-valued martingale sequence. Suppose that there exist nonnegative real numbers $d_1, d_2, \dots, d_n$ such that $|X_i - X_{i-1}| \le d_i$ a.s. for all $i \in \{1, 2, \dots, n\}$. Then, for every $t > 0$,
$$P(|X_n - X_0| \ge t) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n d_i^2}\right).$$
Proof. Here, we just consider $P(X_n - X_0 \ge t)$, $t > 0$. Let $\xi_i = X_i - X_{i-1}$ for $i = 1, 2, \dots, n$ denote the martingale differences. According to the assumption, we have $|\xi_i| \le d_i$ and $E[\xi_i \mid \mathcal{F}_{i-1}] = 0$ a.s. for every $i \in \{1, 2, \dots, n\}$.

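19 Azuma inequality II
Proof. Now we use the Chernoff technique:
$$P(X_n - X_0 \ge t) = P\left(\sum_{i=1}^n \xi_i \ge t\right) \le \exp(-\lambda t)\,E\left[\exp\left(\lambda \sum_{i=1}^n \xi_i\right)\right], \quad \lambda \ge 0.$$
Furthermore,
$$E\left[\exp\left(\lambda \sum_{i=1}^n \xi_i\right)\right] = E\left[E\left[\exp\left(\lambda \sum_{i=1}^n \xi_i\right) \,\Big|\, \mathcal{F}_{n-1}\right]\right] = E\left[\exp\left(\lambda \sum_{i=1}^{n-1} \xi_i\right) E[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}]\right].$$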

20 Azuma inequality III
Proof. The last equality holds since $\exp\left(\lambda \sum_{i=1}^{n-1} \xi_i\right)$ is $\mathcal{F}_{n-1}$-measurable. We can apply Hoeffding's inequality to $\xi_n$ conditioned on $\mathcal{F}_{n-1}$. Because $E[\xi_n \mid \mathcal{F}_{n-1}] = 0$ and $\xi_n \in [-d_n, d_n]$ a.s., Hoeffding's inequality gives
$$E[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}] \le \exp\left(\frac{\lambda^2 d_n^2}{2}\right).$$
Applying this inequality recursively, we obtain
$$E\left[\exp\left(\lambda \sum_{i=1}^n \xi_i\right)\right] \le \prod_{i=1}^n \exp\left(\frac{\lambda^2 d_i^2}{2}\right) = \exp\left(\frac{\lambda^2}{2} \sum_{i=1}^n d_i^2\right).$$

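21 Azuma inequality IV
Proof. Plugging the above bound into the Chernoff inequality, we have
$$P(X_n - X_0 \ge t) \le \exp\left(-\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^n d_i^2\right), \quad \lambda \ge 0.$$
Minimizing the right side over $\lambda$ (the minimum is attained at $\lambda = t / \sum_{i=1}^n d_i^2$), we get
$$P(X_n - X_0 \ge t) \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n d_i^2}\right).$$
Applying the same argument to $-(X_n - X_0)$ and combining the two tails, we have
$$P(|X_n - X_0| \ge t) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n d_i^2}\right).$$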
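A small simulation can illustrate the two-sided bound on a martingale whose increments are not independent. Everything specific in this sketch is an assumption chosen for illustration: the increment rule (the step size is halved whenever the current value is negative), the values $n = 50$, $d_i = 1$, $t = 15$, the fixed seed, and the names `azuma_bound` and `simulate_endpoint`.

```python
import random
from math import exp


def azuma_bound(d, t):
    """Two-sided bound 2 * exp(-t^2 / (2 * sum d_i^2)), capped at 1."""
    return min(1.0, 2.0 * exp(-(t**2) / (2.0 * sum(di**2 for di in d))))


def simulate_endpoint(d, rng):
    """Return X_n - X_0 for one path of a toy martingale.

    The step size depends on the past (full size d_i when the current value is
    nonnegative, half size otherwise), but each increment is a symmetric sign
    times that size, so E[xi_i | F_{i-1}] = 0 and |xi_i| <= d_i.
    """
    x = 0.0
    for di in d:
        size = di if x >= 0 else 0.5 * di
        x += size * rng.choice((-1, 1))
    return x


rng = random.Random(0)  # fixed seed so the run is reproducible
d = [1.0] * 50
t = 15.0
trials = 20000

hits = sum(abs(simulate_endpoint(d, rng)) >= t for _ in range(trials))
empirical = hits / trials
bound = azuma_bound(d, t)

print(f"empirical P(|X_n - X_0| >= {t}) ~ {empirical:.4f}")
print(f"Azuma bound                     = {bound:.4f}")
```

The empirical frequency should fall below the bound; the gap reflects that Azuma's inequality uses only the worst-case increment sizes $d_i$.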

22 McDiarmid inequality
Bounded difference assumption. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function that satisfies the bounded difference assumption
$$\sup_{x_1, x_2, \dots, x_n,\, x_i' \in \mathbb{R}} |f(x_1, \dots, x_i, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le d_i$$
for every $1 \le i \le n$, where $d_1, \dots, d_n$ are some nonnegative real constants.
Theorem. Suppose that a measurable function $f$ satisfies the bounded difference assumption with parameters $(d_1, d_2, \dots, d_n)$, and let $\{X_i\}_{i=1}^n$ be independent (not necessarily i.i.d.) random variables taking values in some measurable space. Then
$$P(|f(X_1, X_2, \dots, X_n) - E[f(X_1, X_2, \dots, X_n)]| \ge t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n d_i^2}\right).$$
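As a concrete instance, the sample mean of $n$ variables taking values in $[0, 1]$ satisfies the bounded difference assumption with $d_i = 1/n$. The sketch below evaluates the bound for that case and checks the constants by brute force on a small finite grid. The grid check is only a sanity test on finitely many points, not a proof of the supremum, and all names and numbers here are illustrative assumptions.

```python
from itertools import product
from math import exp


def mcdiarmid_bound(d, t):
    """Two-sided bound 2 * exp(-2 t^2 / sum d_i^2), capped at 1."""
    return min(1.0, 2.0 * exp(-2.0 * t**2 / sum(di**2 for di in d)))


def bounded_difference_constants(f, grid, n):
    """Largest |f(x) - f(x')| over grid points x, x' that differ only in coordinate i.

    Restricted to a finite grid, so this is a lower estimate of the true
    supremum in general (it is exact for the sample mean on [0, 1]).
    """
    constants = []
    for i in range(n):
        worst = 0.0
        for x in product(grid, repeat=n):
            for v in grid:
                y = x[:i] + (v,) + x[i + 1:]
                worst = max(worst, abs(f(x) - f(y)))
        constants.append(worst)
    return constants


def sample_mean(x):
    return sum(x) / len(x)


# Brute-force check on a tiny example: three coordinates in {0, 0.5, 1}.
d_small = bounded_difference_constants(sample_mean, (0.0, 0.5, 1.0), 3)
print("estimated d_i for n = 3:", d_small)

# Bound for the mean of n = 1000 variables in [0, 1] deviating by at least 0.05.
n = 1000
bound = mcdiarmid_bound([1.0 / n] * n, 0.05)
print(f"McDiarmid bound: {bound:.4e}")
```

With $d_i = 1/n$ the exponent becomes $-2 n t^2$, so the bound for the sample mean tightens exponentially in $n$.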

23 McDiarmid inequality: the outline of the proof
Construct the martingale differences
$$\xi_i = E[f(X_1, X_2, \dots, X_n) \mid \mathcal{F}_i] - E[f(X_1, X_2, \dots, X_n) \mid \mathcal{F}_{i-1}].$$
Thus we get
$$f(X_1, X_2, \dots, X_n) - E[f(X_1, X_2, \dots, X_n)] = \sum_{i=1}^n \xi_i.$$
Bound $\xi_i$: construct r.v.s $A_i$ and $B_i$, and prove $A_i \le \xi_i \le B_i$ and $B_i - A_i \le d_i$.
Use Hoeffding's inequality, as in the proof of Azuma's inequality.

24 A summary

25 Reference
[1] Lecture notes and related materials in STA 6448.
[2] Raginsky, Maxim, and Igal Sason. Concentration of measure inequalities in information theory, communications, and coding. Foundations and Trends in Communications and Information Theory (2013).

26 Thank you
