STAT 200C: High-dimensional Statistics


1 STAT 200C: High-dimensional Statistics. Arash A. Amini. May 30, 2018. 1 / 59

2 Classical case: n ≫ d. Asymptotic assumption: d is fixed and n → ∞. Basic tools: LLN and CLT.
High-dimensional setting: n ≍ d, e.g. n/d → γ, or even d ≫ n, e.g. gene-expression data with only 50 samples. Classical methods fail.
E.g., linear regression y = Xβ + ε, where ε ∼ N(0, σ² I_n):
β̂_OLS = argmin_{β ∈ R^d} ‖y − Xβ‖²₂.
We have MSE(β̂_OLS) = O(σ²d/n).
Solution: assume some underlying low-dimensional structure (e.g. sparsity). 2 / 59
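To see the σ²d/n scaling of the OLS error concretely, here is a minimal numpy sketch (not from the slides; the dimensions, noise level, and replicate counts are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20, 1.0
beta = rng.normal(size=d)

for n in [50, 200, 800, 3200]:
    errs = []
    for _ in range(200):
        X = rng.normal(size=(n, d))
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate
        errs.append(np.sum((beta_hat - beta) ** 2))
    # empirical MSE vs. the sigma^2 d / n rate from the slide
    print(n, np.mean(errs), sigma**2 * d / n)
```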

3 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 3 / 59

4 Concentration inequalities
Main tools for dealing with high-dimensional randomness. Non-asymptotic versions of the CLT.
General form: P(|X − EX| > t) ≤ something small.
Classical examples: Markov and Chebyshev inequalities.
Markov: assume X ≥ 0; then P(X ≥ t) ≤ EX / t.
Chebyshev: assume EX² < ∞, and let µ = EX. Then P(|X − µ| ≥ t) ≤ var(X) / t².
Stronger assumption: E|X|^k < ∞. Then P(|X − µ| ≥ t) ≤ E|X − µ|^k / t^k. 4 / 59
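As a quick numerical illustration (my addition, using an arbitrary Exponential(1) variable with EX = var(X) = 1), the Markov and Chebyshev bounds can be compared against empirical tail probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=10**6)      # EX = 1, var(X) = 1
for t in [2.0, 4.0, 8.0]:
    emp = np.mean(x >= t)                        # empirical P(X >= t)
    markov = 1.0 / t                             # Markov: EX / t
    cheby = 1.0 / (t - 1.0) ** 2                 # Chebyshev: var(X) / (t - EX)^2
    print(t, emp, markov, cheby)
```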

5 Concentration inequalities
Example 1. X_1, ..., X_n ∼ Ber(1/2) i.i.d. and S_n = Σ_{i=1}^n X_i. Then, by the CLT,
Z_n := (S_n − n/2) / √(n/4) →d N(0, 1).
Letting g ∼ N(0, 1),
P(S_n ≥ n/2 + √(n/4) t) ≈ P(g ≥ t) ≤ (1/2) e^{−t²/2}.
Letting t = α√n,
P(S_n ≥ (n/2)(1 + α)) ⪅ (1/2) e^{−nα²/2}.
Problem: the approximation is not tight in general. 5 / 59
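A small simulation (a sketch with arbitrary n, t values, and replicate count) comparing the empirical tail of S_n with the Gaussian-approximation bound (1/2)e^{−t²/2} used above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 10**5
S = rng.binomial(n, 0.5, size=reps)              # S_n for Ber(1/2) variables
for t in [1.0, 2.0, 3.0]:
    emp = np.mean(S >= n / 2 + np.sqrt(n / 4) * t)
    gauss = 0.5 * np.exp(-t**2 / 2)              # bound on P(g >= t)
    print(t, emp, gauss)
```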

6 Theorem 1 (Berry-Esseen CLT)
Under the assumptions of the CLT, with ρ = E|X_1 − µ|³ / σ³,
|P(Z_n ≤ t) − P(g ≤ t)| ≤ ρ / √n, for all t.
The bound is tight, since P(S_n = n/2) = binom(n, n/2) 2^{−n} ≳ n^{−1/2} for the Bernoulli example.
Conclusion: the approximation error is O(n^{−1/2}), which is a lot larger than the exponential bound O(e^{−nα²/2}) that we want to establish.
Solution: directly obtain concentration inequalities, often using the Chernoff bounding technique: for any λ > 0,
P(Z_n ≥ t) = P(e^{λZ_n} ≥ e^{λt}) ≤ E e^{λZ_n} / e^{λt}, t ∈ R.
Leads to the study of the MGF of random variables. 6 / 59

7 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 7 / 59

8 Sub-Gaussian concentration
Definition 1. A zero-mean random variable X is sub-Gaussian if, for some σ > 0,
E e^{λX} ≤ e^{σ²λ²/2}, for all λ ∈ R. (1)
A general random variable is sub-Gaussian if X − EX is sub-Gaussian.
X ∼ N(0, σ²) satisfies (1) with equality.
A Rademacher variable (also called symmetric Bernoulli), P(X = ±1) = 1/2, is sub-Gaussian:
E e^{λX} = cosh(λ) ≤ e^{λ²/2}.
Any bounded RV is sub-Gaussian: X ∈ [a, b] a.s. implies (1) with σ = (b − a)/2. 8 / 59
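A one-line numerical check (my addition) of the Rademacher MGF bound cosh(λ) ≤ e^{λ²/2} over a grid of λ values:

```python
import numpy as np

lam = np.linspace(-10, 10, 2001)
# cosh(lambda) = E exp(lambda X) for Rademacher X; compare with exp(lambda^2 / 2)
print(np.all(np.cosh(lam) <= np.exp(lam**2 / 2)))   # prints True
```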

9 Proposition 1
Assume that X is zero-mean sub-Gaussian satisfying (1). Then,
P(X ≥ t) ≤ exp(−t² / (2σ²)), for all t ≥ 0.
The same bound holds with X replaced by −X.
Proof: Chernoff bound:
P(X ≥ t) ≤ inf_{λ>0} [e^{−λt} E e^{λX}] = inf_{λ>0} exp(−λt + λ²σ²/2).
A union bound gives the two-sided bound: P(|X| ≥ t) ≤ 2 exp(−t² / (2σ²)).
What if µ := EX ≠ 0? Apply to X − µ:
P(|X − µ| ≥ t) ≤ 2 exp(−t² / (2σ²)). 9 / 59

10 Proposition 2
Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then S_n = Σ_i X_i is sub-Gaussian with parameter σ := (Σ_i σ_i²)^{1/2}.
The sub-Gaussian parameter squared behaves like the variance.
Proof: E e^{λS_n} = Π_i E e^{λX_i}. 10 / 59

11 Theorem 2 (Hoeffding)
Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then, letting σ² := Σ_i σ_i²,
P(Σ_i X_i ≥ t) ≤ exp(−t² / (2σ²)), t ≥ 0.
The same bound holds with Σ_i X_i replaced by −Σ_i X_i.
Alternative form: assume there are n variables, and let σ̄² := (1/n) Σ_{i=1}^n σ_i² and X̄_n := (1/n) Σ_{i=1}^n X_i. Then,
P(X̄_n ≥ t) ≤ exp(−n t² / (2σ̄²)), t ≥ 0.
Example: X_i i.i.d. Rademacher, so that σ̄ = σ_i = 1. 11 / 59
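For the Rademacher example, the following sketch (arbitrary n, t values, and replicate count) compares the empirical tail of X̄_n with Hoeffding's bound e^{−nt²/2}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 10**5
X = rng.choice([-1.0, 1.0], size=(reps, n))      # i.i.d. Rademacher variables
xbar = X.mean(axis=1)
for t in [0.1, 0.2, 0.3]:
    print(t, np.mean(xbar >= t), np.exp(-n * t**2 / 2))
```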

12 Equivalent characterizations of sub-Gaussianity
For a RV X, the following are equivalent (HDP, Prop. 2.5.2):
1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t² / K_1²), for all t ≥ 0.
2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K_2 √p, for all p ≥ 1.
3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K_3² λ²), for all |λ| ≤ 1/K_3.
4. The MGF of X² is bounded at some point: E exp(X² / K_4²) ≤ 2.
Assuming EX = 0, the above are equivalent to:
5. The MGF of X satisfies E exp(λX) ≤ exp(K_5² λ²), for all λ ∈ R. 12 / 59

13 Sub-Gaussian norm
The sub-Gaussian norm is the smallest K_4 in property 4, i.e.,
‖X‖_{ψ2} = inf{ t > 0 : E exp(X²/t²) ≤ 2 }.
X is sub-Gaussian iff ‖X‖_{ψ2} < ∞. ‖·‖_{ψ2} is a proper norm on the space of sub-Gaussian RVs.
Every sub-Gaussian variable satisfies the following bounds:
P(|X| ≥ t) ≤ 2 exp(−c t² / ‖X‖²_{ψ2}), for all t ≥ 0.
‖X‖_p ≤ C ‖X‖_{ψ2} √p, for all p ≥ 1.
E exp(X² / ‖X‖²_{ψ2}) ≤ 2.
When EX = 0, E exp(λX) ≤ exp(C λ² ‖X‖²_{ψ2}) for all λ ∈ R.
Here C, c > 0 are universal constants. 13 / 59

14 Some consequences
Recall what a universal/numerical/absolute constant means.
The sub-Gaussian norm is within a constant factor of the sub-Gaussian parameter σ: for numerical constants c_1, c_2 > 0,
c_1 ‖X‖_{ψ2} ≤ σ(X) ≤ c_2 ‖X‖_{ψ2}.
Easy to see that ‖X‖_{ψ2} ≲ ‖X‖_∞. (Bounded variables are sub-Gaussian.)
a ≲ b means a ≤ Cb for some universal constant C.
Lemma 1 (Centering). If X is sub-Gaussian, then X − EX is sub-Gaussian too and ‖X − EX‖_{ψ2} ≤ C ‖X‖_{ψ2}, where C is a universal constant.
Proof: ‖EX‖_{ψ2} ≲ |EX| ≤ E|X| = ‖X‖_1 ≲ ‖X‖_{ψ2}.
Note: ‖X − EX‖_{ψ2} could be much smaller than ‖X‖_{ψ2}. 14 / 59

15 Alternative forms
Alternative form of Proposition 2:
Proposition 3 (HDP 2.6.1). Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then Σ_i X_i is also sub-Gaussian and
‖Σ_i X_i‖²_{ψ2} ≤ C Σ_i ‖X_i‖²_{ψ2},
where C is an absolute constant. 15 / 59

16 Alternative form of Theorem 2:
Theorem 3 (Hoeffding). Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then,
P(|Σ_i X_i| ≥ t) ≤ 2 exp(−c t² / Σ_i ‖X_i‖²_{ψ2}), t ≥ 0,
where c > 0 is some universal constant. 16 / 59

17 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 17 / 59

18 Sub-exponential concentration
Definition 2. A zero-mean random variable X is sub-exponential if, for some ν, α > 0,
E e^{λX} ≤ e^{ν²λ²/2}, for all |λ| < 1/α. (2)
A general random variable is sub-exponential if X − EX is sub-exponential.
If Z ∼ N(0, 1), then Z² is sub-exponential:
E e^{λ(Z²−1)} = e^{−λ} / √(1 − 2λ) for λ < 1/2, and = ∞ for λ ≥ 1/2.
We have E e^{λ(Z²−1)} ≤ e^{4λ²/2} for |λ| < 1/4, hence sub-exponential with parameters (2, 4).
The tails of Z² − 1 are heavier than Gaussian. 18 / 59
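The closed form E e^{λ(Z²−1)} = e^{−λ}/√(1 − 2λ) and the sub-exponential bound e^{2λ²} for |λ| ≤ 1/4 can be checked numerically; this is a sketch using Monte Carlo for the left-hand side:

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=10**6)
for lam in [-0.2, 0.1, 0.24]:
    mc = np.mean(np.exp(lam * (z**2 - 1)))        # Monte Carlo estimate of the MGF
    exact = np.exp(-lam) / np.sqrt(1 - 2 * lam)   # closed form, valid for lam < 1/2
    bound = np.exp(2 * lam**2)                    # sub-exponential bound for |lam| <= 1/4
    print(lam, mc, exact, bound)
```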

19 Proposition 4
Assume that X is zero-mean sub-exponential satisfying (2). Then,
P(X ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}), for all t ≥ 0.
The same bound holds with X replaced by −X.
Proof: Chernoff bound:
P(X ≥ t) ≤ inf_{λ ≥ 0} [e^{−λt} E e^{λX}] ≤ inf_{0 ≤ λ < 1/α} exp(−λt + λ²ν²/2).
Let f(λ) = −λt + λ²ν²/2. The minimizer of f over R is λ = t/ν². 19 / 59

20 Hence the minimizer of f over [0, 1/α) is
λ* = t/ν² if t/ν² < 1/α, and λ* = 1/α if t/ν² ≥ 1/α,
and the minimum is
f(λ*) = −t²/(2ν²) if t < ν²/α, and f(λ*) = −t/α + ν²/(2α²) ≤ −t/(2α) if t ≥ ν²/α.
Thus, f(λ*) ≤ −min{t²/(2ν²), t/(2α)} = −(1/2) min{t²/ν², t/α}. 20 / 59

21 Bernstein inequality for sub-exponential RVs
Theorem 4 (Bernstein). Assume that {X_i} are independent, zero-mean sub-exponential RVs with parameters (ν_i, α_i). Let ν := (Σ_i ν_i²)^{1/2} and α := max_i α_i. Then Σ_i X_i is sub-exponential with parameters (ν, α), and
P(Σ_i X_i ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}).
Proof: We have E e^{λX_i} ≤ e^{λ²ν_i²/2}, for all |λ| < 1/max_i α_i.
Let S_n = Σ_i X_i. By independence,
E e^{λS_n} = Π_i E e^{λX_i} ≤ e^{λ² Σ_i ν_i² / 2}, for all |λ| < 1/max_i α_i.
The tail bound follows from Proposition 4. 21 / 59

22 Equivalent characterizations of sub-exponential RVs
For a RV X, the following are equivalent (HDP, Prop. 2.7.1):
1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t/K_1), for all t ≥ 0.
2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K_2 p, for all p ≥ 1.
3. The MGF of |X| satisfies E exp(λ|X|) ≤ exp(K_3 λ), for all 0 ≤ λ ≤ 1/K_3.
4. The MGF of |X| is bounded at some point: E exp(|X|/K_4) ≤ 2.
Assuming EX = 0, the above are equivalent to:
5. The MGF of X satisfies E exp(λX) ≤ exp(K_5² λ²), for all |λ| ≤ 1/K_5. 22 / 59

23 Equivalent characterizations of sub-Gaussianity
For a RV X, the following are equivalent (HDP, Prop. 2.5.2):
1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t²/K_1²), for all t ≥ 0.
2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K_2 √p, for all p ≥ 1.
3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K_3² λ²), for all |λ| ≤ 1/K_3.
4. The MGF of X² is bounded at some point: E exp(X²/K_4²) ≤ 2.
Assuming EX = 0, the above are equivalent to:
5. The MGF of X satisfies E exp(λX) ≤ exp(K_5² λ²), for all λ ∈ R. 23 / 59

24 Sub-exponential norm
The sub-exponential norm is the smallest K_4 in property 4, i.e.,
‖X‖_{ψ1} = inf{ t > 0 : E exp(|X|/t) ≤ 2 }.
X is sub-exponential iff ‖X‖_{ψ1} < ∞. ‖·‖_{ψ1} is a proper norm on the space of sub-exponential RVs.
Every sub-exponential variable satisfies the following bounds:
P(|X| ≥ t) ≤ 2 exp(−c t / ‖X‖_{ψ1}), for all t ≥ 0.
‖X‖_p ≤ C ‖X‖_{ψ1} p, for all p ≥ 1.
E exp(|X| / ‖X‖_{ψ1}) ≤ 2.
When EX = 0, E exp(λX) ≤ exp(C λ² ‖X‖²_{ψ1}) for all |λ| ≤ 1/‖X‖_{ψ1}.
Here C, c > 0 are universal constants. 24 / 59

25 Lemma 2
A random variable X is sub-Gaussian if and only if X² is sub-exponential; in fact, ‖X²‖_{ψ1} = ‖X‖²_{ψ2}.
Proof: Immediate from the definitions.
Lemma 3
If X and Y are sub-Gaussian, then XY is sub-exponential, and ‖XY‖_{ψ1} ≤ ‖X‖_{ψ2} ‖Y‖_{ψ2}.
Proof: Assume ‖X‖_{ψ2} = ‖Y‖_{ψ2} = 1, WLOG. Apply Young's inequality ab ≤ (a² + b²)/2 (for all a, b ∈ R) twice:
E e^{|XY|} ≤ E e^{(X² + Y²)/2} = E[e^{X²/2} e^{Y²/2}] ≤ (1/2) E[e^{X²} + e^{Y²}] ≤ 2. 25 / 59

26 Alternative form of Theorem 4:
Theorem 5 (Bernstein). Assume that {X_i} are independent, zero-mean sub-exponential RVs. Then,
P(|Σ_i X_i| ≥ t) ≤ 2 exp(−c min{ t² / Σ_i ‖X_i‖²_{ψ1}, t / max_i ‖X_i‖_{ψ1} }), t ≥ 0,
where c > 0 is some universal constant.
Corollary 1 (Bernstein). Assume that {X_i} are independent, zero-mean sub-exponential RVs with ‖X_i‖_{ψ1} ≤ K for all i. Then,
P(|(1/n) Σ_{i=1}^n X_i| ≥ t) ≤ 2 exp(−c n min{ t²/K², t/K }), t ≥ 0. 26 / 59

27 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 27 / 59

28 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 28 / 59

29 Concentration of χ² RVs I
Example 2. Let Y ∼ χ²_n, i.e., Y = Σ_{i=1}^n Z_i² where the Z_i are i.i.d. N(0, 1).
The Z_i² are sub-exponential with parameters (2, 4). Then Y is sub-exponential with parameters (2√n, 4), and we obtain
P(|Y − EY| ≥ t) ≤ 2 exp(−(1/2) min{ t²/(4n), t/4 }),
or, replacing t with nt,
P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp(−(n/8) min{t², t}), t ≥ 0. 29 / 59

30 Concentration of χ² RVs II
In particular,
P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 e^{−nt²/8}, t ∈ [0, 1].
Second approach, ignoring constants: We have
‖Z_i² − 1‖_{ψ1} ≤ C ‖Z_i²‖_{ψ1} = C ‖Z_i‖²_{ψ2} = C.
Applying Corollary 1 with K = C,
P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp(−c n min{ t²/C², t/C }) ≤ 2 exp(−c_2 n min{t², t}), t ≥ 0,
where c_2 = c min{1/C², 1/C}. 30 / 59
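A short simulation (my own sketch, with an arbitrary n) of the χ² concentration bound P(|(1/n)ΣZ_i² − 1| ≥ t) ≤ 2e^{−nt²/8} for t ∈ [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 10**5
Z = rng.normal(size=(reps, n))
avg = (Z**2).mean(axis=1)                         # (1/n) sum of Z_i^2
for t in [0.2, 0.4, 0.6]:
    print(t, np.mean(np.abs(avg - 1) >= t), 2 * np.exp(-n * t**2 / 8))
```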

31 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 31 / 59

32 Random projection for dimension reduction
Suppose that we have data points {u_1, ..., u_N} ⊂ R^d. We want to project them down to a lower-dimensional space R^m (m ≪ d) such that the pairwise distances ‖u_i − u_j‖ are approximately preserved.
This can be done by a linear random projection X : R^d → R^m, which can be viewed as a random matrix X ∈ R^{m×d}.
Lemma 4 (Johnson-Lindenstrauss embedding). Let X := (1/√m) Z ∈ R^{m×d}, where Z has i.i.d. N(0, 1) entries. Consider any collection of points {u_1, ..., u_N} ⊂ R^d. Take ε, δ ∈ (0, 1) and assume that
m ≥ (16/δ²) log(N/√ε).
Then, with probability at least 1 − ε,
(1 − δ) ‖u_i − u_j‖²₂ ≤ ‖Xu_i − Xu_j‖²₂ ≤ (1 + δ) ‖u_i − u_j‖²₂, for all i ≠ j. 32 / 59

33 Proof
Fix u ∈ R^d and let
Y := ‖Zu‖²₂ / ‖u‖²₂ = Σ_{i=1}^m ⟨z_i, u/‖u‖₂⟩²,
where z_i^T is the ith row of Z. Then Y ∼ χ²_m.
Recalling X = Z/√m, for all δ ∈ (0, 1),
P(|‖Xu‖²₂ / ‖u‖²₂ − 1| ≥ δ) = P(|Y/m − 1| ≥ δ) ≤ 2 e^{−mδ²/8}.
Applying this to u = u_i − u_j, for any fixed pair (i, j), we have
P(|‖X(u_i − u_j)‖²₂ / ‖u_i − u_j‖²₂ − 1| ≥ δ) ≤ 2 e^{−mδ²/8}. 33 / 59

34 Apply a further union bound over all pairs i ≠ j:
P(|‖X(u_i − u_j)‖²₂ / ‖u_i − u_j‖²₂ − 1| ≥ δ, for some i ≠ j) ≤ 2 binom(N, 2) e^{−mδ²/8}.
Since 2 binom(N, 2) ≤ N², the result follows by solving the following for m:
N² e^{−mδ²/8} ≤ ε. 34 / 59
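A minimal numpy sketch of the Johnson-Lindenstrauss projection X = Z/√m applied to a random point set; the values of N, d, m, and δ below are arbitrary illustrative choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
N, d, m, delta = 50, 1000, 400, 0.5
U = rng.normal(size=(N, d))                       # data points u_1, ..., u_N
Z = rng.normal(size=(m, d))
X = Z / np.sqrt(m)                                # JL projection matrix
V = U @ X.T                                       # projected points in R^m

ratios = []
for i, j in combinations(range(N), 2):
    ratios.append(np.sum((V[i] - V[j])**2) / np.sum((U[i] - U[j])**2))
ratios = np.array(ratios)
# all pairwise squared-distance ratios should lie in [1 - delta, 1 + delta]
print(ratios.min(), ratios.max(),
      np.all((ratios >= 1 - delta) & (ratios <= 1 + delta)))
```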

35 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 35 / 59

36 ℓ2 norm of sub-Gaussian vectors
Here ‖X‖₂ = (Σ_{i=1}^n X_i²)^{1/2}.
Proposition 5 (Concentration of norm, HDP 3.1.1). Let X = (X_1, ..., X_n) ∈ R^n be a random vector with independent, sub-Gaussian coordinates X_i that satisfy EX_i² = 1. Then,
‖ ‖X‖₂ − √n ‖_{ψ2} ≤ CK²,
where K = max_i ‖X_i‖_{ψ2} and C is an absolute constant.
The result says that the norm is highly concentrated around √n: ‖X‖₂ ≈ √n in high dimensions (n large).
Assuming K = O(1), it shows that w.h.p. ‖X‖₂ = √n + O(1). More precisely, w.p. at least 1 − 2e^{−c_1 v²}, we have
√n − K²v ≤ ‖X‖₂ ≤ √n + K²v. 36 / 59

37 Simple argument: Assuming sd(X_1²) = O(1),
E‖X‖²₂ = n, var(‖X‖²₂) = n var(X_1²), sd(‖X‖²₂) = √n sd(X_1²).
So ‖X‖²₂ ≈ n ± O(√n), and hence ‖X‖₂ ≈ √(n ± O(√n)) = √n ± O(1), the latter shown by a Taylor expansion. 37 / 59

38 Proof of Proposition 5: Argue that we can take K ≥ 1.
Since X_i is sub-Gaussian, X_i² is sub-exponential, and
‖X_i² − 1‖_{ψ1} ≤ C ‖X_i²‖_{ψ1} = C ‖X_i‖²_{ψ2} ≤ CK².
Applying Bernstein's inequality (Corollary 1), for any u ≥ 0,
P(| ‖X‖²₂/n − 1 | ≥ u) ≤ 2 exp(−(c_1 n / K⁴) min{u², u}),
where we used K⁴ ≥ K² and absorbed C into c_1.
Using the inequality |z − 1| ≥ δ ⟹ |z² − 1| ≥ max{δ, δ²} (valid for z ≥ 0),
P(| ‖X‖₂/√n − 1 | ≥ δ) ≤ P(| ‖X‖²₂/n − 1 | ≥ max{δ, δ²}) ≤ 2 exp(−(c_1 n / K⁴) δ²),
since with f(u) = min{u², u} and g(δ) = max{δ, δ²} we have f(g(δ)) = δ² for all δ ≥ 0.
The change of variable δ = t/√n gives the result. 38 / 59
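Simulation (a sketch with an arbitrary n) illustrating that ‖X‖₂ concentrates around √n with O(1) fluctuations when the coordinates are standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 10**4, 1000
X = rng.normal(size=(reps, n))                    # coordinates with E X_i^2 = 1
dev = np.linalg.norm(X, axis=1) - np.sqrt(n)
# the norm is of order sqrt(n) = 100, but its fluctuations are O(1)
print(np.sqrt(n), dev.mean(), dev.std(), np.abs(dev).max())
```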

39 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 39 / 59

40 ℓ∞ norm of sub-Gaussian vectors
For any vector x ∈ R^n, the ℓ∞ norm is ‖x‖_∞ = max_{i=1,...,n} |x_i|.
Lemma 5. Let X = (X_1, ..., X_n) ∈ R^n be a random vector with zero-mean, independent, sub-Gaussian coordinates X_i with parameters σ_i. Then, for any γ ≥ 0,
P(‖X‖_∞ ≥ σ √(2(1 + γ) log n)) ≤ 2 n^{−γ},
where σ = max_i σ_i.
Proof: We have P(|X_i| ≥ t) ≤ 2 exp(−t²/(2σ²)), hence
P(max_i |X_i| ≥ t) ≤ 2n exp(−t²/(2σ²)) = 2 n^{−γ},
taking t = √(2σ²(1 + γ) log n). 40 / 59

41 Theorem 6
Assume {X_i}_{i=1}^n are zero-mean RVs, sub-Gaussian with parameter σ. Then,
E[max_{i=1,...,n} X_i] ≤ σ √(2 log n), n ≥ 1.
Proof of Theorem 6: Jensen's inequality applied to e^{λZ}, where Z = max_i X_i. 41 / 59
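A quick Monte Carlo comparison (my addition, σ = 1) of E[max_i X_i] for standard Gaussians against the bound √(2 log n):

```python
import numpy as np

rng = np.random.default_rng(8)
for n in [10, 100, 1000, 10000]:
    X = rng.normal(size=(2000, n))
    emp = X.max(axis=1).mean()                    # Monte Carlo estimate of E[max_i X_i]
    print(n, emp, np.sqrt(2 * np.log(n)))
```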

42 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 42 / 59

43 Theorem 7 (Azuma-Hoeffding)
Assume that X = (X_1, ..., X_n) has independent coordinates, and let Z = f(X). Let us write E_i[Z] = E[Z | X_1, ..., X_i] and let
Δ_i := E_i[Z] − E_{i−1}[Z].
Assume that
E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, for all λ ∈ R, (3)
almost surely, for all i = 1, ..., n. Then Z − EZ is sub-Gaussian with parameter σ = (Σ_{i=1}^n σ_i²)^{1/2}. In particular, we have the tail bound
P(|Z − EZ| ≥ t) ≤ 2 exp(−t²/(2σ²)).
{Δ_i} is called Doob's martingale difference sequence. It is a martingale difference sequence since E_{i−1}[Δ_i] = 0. 43 / 59

44 Proof
Let S_j := Σ_{i=1}^j Δ_i, which is a function of X_i, i ≤ j, only. Noting that E_n[Z] = Z and E_0[Z] = EZ, we have
S_n = Σ_{i=1}^n Δ_i = Z − EZ.
By properties of conditional expectation, and assumption (3),
E_{n−1}[e^{λS_n}] = e^{λS_{n−1}} E_{n−1}[e^{λΔ_n}] ≤ e^{λS_{n−1}} e^{σ_n²λ²/2}.
Taking E_{n−2} of both sides:
E_{n−2}[e^{λS_n}] ≤ e^{σ_n²λ²/2} E_{n−2}[e^{λS_{n−1}}] ≤ e^{λS_{n−2}} e^{(σ_n² + σ_{n−1}²)λ²/2}.
Repeating the process, we get E_0[e^{λS_n}] ≤ exp((Σ_{i=1}^n σ_i²) λ²/2). 44 / 59

45 Bounded difference inequality
The conditional sub-Gaussian assumption holds under the bounded difference property:
|f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) − f(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_n)| ≤ L_i (4)
for all x_1, ..., x_n, x_i' ∈ X and all i ∈ [n], for some constants (L_1, ..., L_n).
Theorem 8 (Bounded difference). Assume that X = (X_1, ..., X_n) has independent coordinates, and assume that f : X^n → R satisfies the bounded difference property (4). Then,
P(|f(X) − Ef(X)| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^n L_i²), t ≥ 0. 45 / 59

46 Proof (naive bound)
We have
Δ_i = E_i[Z] − E_{i−1}[E_i[Z]] = g_i(X_1, ..., X_i) − E_{i−1}[g_i(X_1, ..., X_i)].
Let X_i' be an independent copy of X_i. Conditioned on X_1, ..., X_{i−1}, we are effectively looking at
g_i(x_1, ..., x_{i−1}, X_i) − E[g_i(x_1, ..., x_{i−1}, X_i')],
due to the independence of {X_1, ..., X_i, X_i'}. Thus |Δ_i| ≤ L_i conditioned on X_1, ..., X_{i−1}. That is,
E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, where σ_i² = (2L_i)²/4 = L_i². 46 / 59

47 Proof (better bound)
One can show that Δ_i ∈ I_i where |I_i| ≤ L_i, improving the constant by a factor of 4.
Conditioned on X_1, ..., X_{i−1}, we are effectively looking at Δ_i = g_i(x_1, ..., x_{i−1}, X_i) − µ_i, where µ_i is a constant (only a function of x_1, ..., x_{i−1}). Then Δ_i + µ_i ∈ [a_i, b_i], where
a_i = inf_x g_i(x_1, ..., x_{i−1}, x), b_i = sup_x g_i(x_1, ..., x_{i−1}, x).
We have (need to argue that g_i satisfies bounded differences)
b_i − a_i = sup_{x,y} [g_i(x_1, ..., x_{i−1}, x) − g_i(x_1, ..., x_{i−1}, y)] ≤ L_i.
Thus E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2}, where σ_i² = (b_i − a_i)²/4 ≤ L_i²/4. 47 / 59

48 The role of independence in the second argument is subtle. The only place we used independence is to argue that E_i[Z] satisfies the bounded difference property for all i: we argue that E_i[Z] = g_i(X_1, ..., X_i), which is where independence is used. Then g_i, by its definition and Jensen's inequality, satisfies the bounded difference property. 48 / 59

49 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 49 / 59

50 Example: (bounded) U-statistics
Let g : R² → R be a symmetric function and X_1, ..., X_n an i.i.d. sequence. Then
U := (1 / binom(n, 2)) Σ_{i<j} g(X_i, X_j)
is called a U-statistic (of order 2).
U is not a sum of independent variables; e.g., n = 3 gives
U = (1/3) (g(X_1, X_2) + g(X_1, X_3) + g(X_2, X_3)),
but the dependence between the terms is relatively weak (made precise shortly).
For example, g(x, y) = (x − y)²/2 gives an unbiased estimator of the variance. (Exercise; see the sketch below.) 50 / 59
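A small check (my addition, with arbitrary data) that the kernel g(x, y) = (x − y)²/2 yields a U-statistic equal to the usual unbiased sample variance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
x = rng.normal(loc=3.0, scale=2.0, size=30)
n = len(x)
# U-statistic of order 2 with kernel g(x, y) = (x - y)^2 / 2
U = sum(0.5 * (x[i] - x[j])**2 for i, j in combinations(range(n), 2)) / (n * (n - 1) / 2)
print(U, x.var(ddof=1))                           # the two values agree exactly
```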

51 Assume that g is bounded, i.e.,
‖g‖_∞ := sup_{x,y} |g(x, y)| ≤ b,
i.e., |g(x, y)| ≤ b for all x, y ∈ R.
Writing U = f(X_1, ..., X_n), we observe that if x and x' differ only in coordinate k (for fixed k),
|f(x) − f(x')| ≤ (1 / binom(n, 2)) Σ_{i ≠ k} |g(x_i, x_k) − g(x_i, x_k')| ≤ (n − 1) 2b / (n(n − 1)/2) = 4b/n,
thus f has bounded differences with parameters L_k = 4b/n. Applying Theorem 8,
P(|U − EU| ≥ t) ≤ 2 e^{−nt²/(8b²)}, t ≥ 0. 51 / 59

52 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 52 / 59

53 Clique number of Erdős-Rényi graphs
Let G be an undirected graph on n nodes. A clique in G is a complete (induced) subgraph. The clique number of G, denoted ω(G), is the size of the largest clique(s).
For two graphs G and G' that differ in at most one edge, |ω(G) − ω(G')| ≤ 1. Thus E(G) ↦ ω(G) has the bounded difference property with L = 1.
Let G be an Erdős-Rényi random graph: edges are drawn independently with probability p. Then, with m = binom(n, 2),
P(|ω(G) − E ω(G)| ≥ δ) ≤ 2 e^{−2δ²/m},
or, setting ω̄(G) = ω(G)/m,
P(|ω̄(G) − E ω̄(G)| ≥ δ) ≤ 2 e^{−2mδ²}. 53 / 59
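A quick empirical look at clique-number concentration (a sketch; it assumes the networkx package is available, and n, p, and the number of replicates are arbitrary small values so the clique enumeration stays fast):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(10)
n, p, reps = 40, 0.5, 100
omegas = []
for _ in range(reps):
    G = nx.gnp_random_graph(n, p, seed=int(rng.integers(10**9)))
    omegas.append(max(len(c) for c in nx.find_cliques(G)))   # clique number omega(G)
omegas = np.array(omegas)
# omega(G) varies over a very small range, far tighter than the O(sqrt(m)) fluctuation
# guaranteed by the bounded difference bound with m = n(n-1)/2
print(omegas.min(), omegas.max(), omegas.mean(), omegas.std())
```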

54 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 54 / 59

55 Lipschitz functions of a standard Gaussian vector
A function f : R^n → R is L-Lipschitz w.r.t. ‖·‖₂ if
|f(x) − f(y)| ≤ L ‖x − y‖₂, for all x, y ∈ R^n.
Theorem 9 (Gaussian concentration). Let X ∼ N(0, I_n) be a standard Gaussian vector and assume that f : R^n → R is L-Lipschitz w.r.t. the Euclidean norm. Then,
P(|f(X) − E[f(X)]| ≥ t) ≤ 2 exp(−t²/(2L²)), t ≥ 0. (5)
In other words, f(X) is sub-Gaussian with parameter L.
Deep result, no easy proof! Has far-reaching consequences.
One-sided bounds hold with the prefactor 2 removed. 55 / 59

56 Example: χ² and norm concentrations revisited
Let X ∼ N(0, I_n) and consider the function f(x) = ‖x‖₂/√n. f is L-Lipschitz with L = 1/√n. Hence,
P(‖X‖₂/√n ≥ E‖X‖₂/√n + t) ≤ e^{−nt²/2}, t ≥ 0.
Since E‖X‖₂ ≤ √n (why?), we have
P(‖X‖₂/√n ≥ 1 + t) ≤ e^{−nt²/2}, t ≥ 0.
For t ∈ [0, 1], (1 + t)² ≤ 1 + 3t, hence
P(‖X‖₂²/n ≥ 1 + 3t) ≤ e^{−nt²/2}, t ∈ [0, 1],
or, setting 3t = δ,
P(‖X‖₂²/n ≥ 1 + δ) ≤ e^{−nδ²/18}, δ ∈ [0, 3]. 56 / 59

57 Example: order statistics
Let X ∼ N(0, I_n), and let f(x) = x_{(k)} be the kth order statistic: for x ∈ R^n,
x_{(1)} ≥ x_{(2)} ≥ ... ≥ x_{(n)}.
For any x, y ∈ R^n, we have
|x_{(k)} − y_{(k)}| ≤ ‖x − y‖₂,
hence f is 1-Lipschitz. (Exercise) It follows that
P(|X_{(k)} − EX_{(k)}| ≥ t) ≤ 2 e^{−t²/2}, t ≥ 0.
In particular, if X_i are i.i.d. N(0, 1), i = 1, ..., n, then
P(|max_{i=1,...,n} X_i − E[max_{i=1,...,n} X_i]| ≥ t) ≤ 2 e^{−t²/2}, t ≥ 0. 57 / 59

58 Example: singular values
Consider a matrix X ∈ R^{n×d} where n > d. Let σ_1(X) ≥ σ_2(X) ≥ ... ≥ σ_d(X) be the (ordered) singular values of X. By Weyl's theorem, for any X, Y ∈ R^{n×d},
|σ_k(X) − σ_k(Y)| ≤ ‖X − Y‖_op ≤ ‖X − Y‖_F.
(Note that this is a generalization of the order-statistics inequality.) Thus X ↦ σ_k(X) is 1-Lipschitz:
Proposition 6. Let X ∈ R^{n×d} be a random matrix with i.i.d. N(0, 1) entries. Then,
P(|σ_k(X) − E[σ_k(X)]| ≥ δ) ≤ 2 e^{−δ²/2}, δ ≥ 0.
It remains to characterize E[σ_k(X)]. For an overview of matrix norms, see matrix norms.pdf. 58 / 59
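Simulation (a sketch with illustrative n, d) of the concentration of the largest singular value of an i.i.d. N(0, 1) matrix; Proposition 6 says its fluctuations are O(1) regardless of the matrix size, and its mean is close to √n + √d (a standard fact, not derived on this slide):

```python
import numpy as np

rng = np.random.default_rng(11)
n, d, reps = 500, 100, 200
s1 = []
for _ in range(reps):
    X = rng.normal(size=(n, d))
    s1.append(np.linalg.svd(X, compute_uv=False)[0])   # largest singular value
s1 = np.array(s1)
print(s1.mean(), np.sqrt(n) + np.sqrt(d), s1.std())
```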

59 Table of Contents
1 Concentration inequalities: Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality: χ² concentration, Johnson-Lindenstrauss embedding, ℓ2 norm concentration, ℓ∞ norm; Bounded difference inequality (Azuma-Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson-Wright inequality). 59 / 59
