STAT 200C: High-dimensional Statistics

1 STAT 200C: High-dimensional Statistics. Arash A. Amini. April 27, 2018. 1 / 80

2 Classical case: n ≫ d. Asymptotic assumption: d is fixed and n → ∞. Basic tools: LLN and CLT. High-dimensional setting: n ≍ d, e.g. n/d → γ, or even d ≫ n, e.g. genes with only 50 samples. Classical methods fail. E.g., linear regression y = Xβ + ε, where ε ~ N(0, σ²I_n): β̂_OLS = argmin_{β ∈ R^d} ‖y − Xβ‖₂². We have MSE(β̂_OLS) = O(σ²d/n). Solution: assume some underlying low-dimensional structure (e.g. sparsity). 2 / 80

3 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 3 / 80

4 Concentration inequalities. Main tools in dealing with high-dimensional randomness. Non-asymptotic versions of the CLT. General form: P(|X − EX| > t) ≤ something small. Classical examples: Markov and Chebyshev inequalities. Markov: assume X ≥ 0; then P(X ≥ t) ≤ EX/t. Chebyshev: assume EX² < ∞, and let µ = EX. Then P(|X − µ| ≥ t) ≤ var(X)/t². Stronger assumption: E|X|^k < ∞. Then P(|X − µ| ≥ t) ≤ E|X − µ|^k / t^k. 4 / 80

5 Concentration inequalities. Example 1: X₁, ..., X_n ~ iid Ber(1/2) and S_n = Σ_{i=1}^n X_i. Then, by the CLT, Z_n := (S_n − n/2)/√(n/4) →d N(0, 1). Letting g ~ N(0, 1), P(S_n ≥ n/2 + √(n/4)·t) ≈ P(g ≥ t) ≤ (1/2) exp(−t²/2). Letting t = α√n, P(S_n ≥ (n/2)(1 + α)) ⪅ (1/2) exp(−nα²/2). Problem: the approximation is not tight in general. 5 / 80

6 Theorem 1 (Berry–Esseen CLT): Under the assumptions of the CLT, with ρ = E|X₁ − µ|³/σ³, sup_t |P(Z_n ≤ t) − P(g ≤ t)| ≤ ρ/√n. The bound is tight since P(S_n = n/2) = 2^{−n} (n choose n/2) ≍ n^{−1/2} for the Bernoulli example. Conclusion: the approximation error is O(n^{−1/2}), which is a lot larger than the exponential bound O(exp(−nα²/2)) that we want to establish. Solution: directly obtain the concentration inequalities, often using the Chernoff bounding technique: for any λ > 0, P(Z_n ≥ t) = P(e^{λZ_n} ≥ e^{λt}) ≤ E e^{λZ_n} / e^{λt}, for all t ∈ R. This leads to the study of the MGF of random variables. 6 / 80

7 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 7 / 80

8 Sub-Gaussian concentration. Definition 1: A zero-mean random variable X is sub-Gaussian if, for some σ > 0, E e^{λX} ≤ e^{σ²λ²/2} for all λ ∈ R. (1) A general random variable is sub-Gaussian if X − EX is sub-Gaussian. X ~ N(0, σ²) satisfies (1) with equality. A Rademacher variable (also called symmetric Bernoulli), P(X = ±1) = 1/2, is sub-Gaussian: E e^{λX} = cosh(λ) ≤ e^{λ²/2}. Any bounded RV is sub-Gaussian: if X ∈ [a, b] a.s., then (1) holds with σ = (b − a)/2. 8 / 80

9 Proposition 1: Assume that X is zero-mean sub-Gaussian satisfying (1). Then P(X ≥ t) ≤ exp(−t²/(2σ²)) for all t ≥ 0. The same bound holds with X replaced with −X. Proof: Chernoff bound: P(X ≥ t) ≤ inf_{λ>0} [e^{−λt} E e^{λX}] ≤ inf_{λ>0} exp(−λt + λ²σ²/2). Union bound gives the two-sided bound: P(|X| ≥ t) ≤ 2 exp(−t²/(2σ²)). What if µ := EX ≠ 0? Apply to X − µ: P(|X − µ| ≥ t) ≤ 2 exp(−t²/(2σ²)). 9 / 80

10 Proposition 2: Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then S_n = Σ_i X_i is sub-Gaussian with parameter σ := √(Σ_i σ_i²). The sub-Gaussian parameter squared behaves like the variance. Proof: E e^{λS_n} = Π_i E e^{λX_i}. 10 / 80

11 Theorem 2 (Hoeffding): Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then, letting σ² := Σ_i σ_i², P(Σ_i X_i ≥ t) ≤ exp(−t²/(2σ²)), t ≥ 0. The same bound holds with X_i replaced with −X_i. Alternative form: assume there are n variables, and let σ̄² := (1/n) Σ_{i=1}^n σ_i² and X̄_n := (1/n) Σ_{i=1}^n X_i. Then P(X̄_n ≥ t) ≤ exp(−nt²/(2σ̄²)), t ≥ 0. Example: X_i iid Rademacher, so that σ̄ = σ_i = 1. 11 / 80
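
A quick Monte Carlo illustration of Theorem 2 (a minimal sketch; the values of n, t, and the number of trials are arbitrary choices): for iid Rademacher X_i (so σ̄ = 1), the empirical tail of X̄_n sits below the Hoeffding bound exp(−nt²/2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 100_000, 0.2

# iid Rademacher variables: sub-Gaussian with parameter sigma_i = 1
X = rng.choice([-1.0, 1.0], size=(trials, n))
Xbar = X.mean(axis=1)

empirical = np.mean(Xbar >= t)       # Monte Carlo estimate of P(Xbar_n >= t)
hoeffding = np.exp(-n * t**2 / 2)    # exp(-n t^2 / (2 sigma-bar^2)) with sigma-bar = 1

print(f"empirical tail  : {empirical:.2e}")
print(f"Hoeffding bound : {hoeffding:.2e}")
```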

12 Equivalent characterizations of sub-Gaussianity. For a RV X, the following are equivalent (HDP, Prop. 2.5.2): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t²/K₁²) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂√p for all p ≥ 1. 3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K₃²λ²) for all |λ| ≤ 1/K₃. 4. The MGF of X² is bounded at some point: E exp(X²/K₄²) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all λ ∈ R. 12 / 80

13 Sub-Gaussian norm. The sub-Gaussian norm is the smallest K₄ in property 4, i.e., ‖X‖_ψ₂ = inf{ t > 0 : E exp(X²/t²) ≤ 2 }. X is sub-Gaussian iff ‖X‖_ψ₂ < ∞. ‖·‖_ψ₂ is a proper norm on the space of sub-Gaussian RVs. Every sub-Gaussian variable satisfies the following bounds, for some universal constants C, c > 0: P(|X| ≥ t) ≤ 2 exp(−ct²/‖X‖²_ψ₂) for all t ≥ 0; ‖X‖_p ≤ C‖X‖_ψ₂ √p for all p ≥ 1; E exp(X²/‖X‖²_ψ₂) ≤ 2; when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖²_ψ₂) for all λ ∈ R. 13 / 80

14 Some consequences. Recall what a universal/numerical/absolute constant means. The sub-Gaussian norm is within a constant factor of the sub-Gaussian parameter σ: for numerical constants c₁, c₂ > 0, c₁‖X‖_ψ₂ ≤ σ(X) ≤ c₂‖X‖_ψ₂. Easy to see that ‖X‖_ψ₂ ≲ ‖X‖_∞ (bounded variables are sub-Gaussian); here a ≲ b means a ≤ Cb for some universal constant C. Lemma 1 (Centering): If X is sub-Gaussian, then X − EX is sub-Gaussian too and ‖X − EX‖_ψ₂ ≤ C‖X‖_ψ₂ where C is a universal constant. Proof: ‖EX‖_ψ₂ ≲ |EX| ≤ E|X| = ‖X‖₁ ≲ ‖X‖_ψ₂. Note: ‖X − EX‖_ψ₂ could be much smaller than ‖X‖_ψ₂. 14 / 80

15 Alternative forms. Alternative form of Proposition 2: Proposition 3 (HDP 2.6.1): Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then Σ_i X_i is also sub-Gaussian and ‖Σ_i X_i‖²_ψ₂ ≤ C Σ_i ‖X_i‖²_ψ₂, where C is an absolute constant. 15 / 80

16 Alternative form of Theorem 2: Theorem 3 (Hoeffding): Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then P(|Σ_i X_i| ≥ t) ≤ 2 exp(−c t² / Σ_i ‖X_i‖²_ψ₂), t ≥ 0, where c > 0 is some universal constant. 16 / 80

17 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 17 / 80

18 Sub-exponential concentration. Definition 2: A zero-mean random variable X is sub-exponential if, for some ν, α > 0, E e^{λX} ≤ e^{ν²λ²/2} for all |λ| < 1/α. (2) A general random variable is sub-exponential if X − EX is sub-exponential. If Z ~ N(0, 1), then Z² is sub-exponential: E e^{λ(Z²−1)} = e^{−λ}/√(1 − 2λ) for λ < 1/2, and = ∞ for λ ≥ 1/2. We have E e^{λ(Z²−1)} ≤ e^{4λ²/2} for |λ| < 1/4, hence Z² − 1 is sub-exponential with parameters (2, 4). Tails of Z² − 1 are heavier than Gaussian. 18 / 80

19 Proposition 4: Assume that X is zero-mean sub-exponential satisfying (2). Then P(X ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}) for all t ≥ 0. The same bound holds with X replaced with −X. Proof: Chernoff bound: P(X ≥ t) ≤ inf_{λ ≥ 0} [e^{−λt} E e^{λX}] ≤ inf_{0 ≤ λ < 1/α} exp(−λt + λ²ν²/2). Let f(λ) = −λt + λ²ν²/2. The minimizer of f over R is λ* = t/ν². 19 / 80

20 Hence the minimizer of f over [0, 1/α] is λ* = t/ν² if t/ν² < 1/α, and λ* = 1/α if t/ν² ≥ 1/α, and the minimum is f(λ*) = −t²/(2ν²) if t < ν²/α, and f(λ*) = −t/α + ν²/(2α²) ≤ −t/(2α) if t ≥ ν²/α. Thus, f(λ*) ≤ −min{t²/(2ν²), t/(2α)} = −(1/2) min{t²/ν², t/α}. 20 / 80

21 Bernstein inequality for sub-exponential RVs. Theorem 4 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs with parameters (ν_i, α_i). Let ν := (Σ_i ν_i²)^{1/2} and α := max_i α_i. Then Σ_i X_i is sub-exponential with parameters (ν, α), and P(Σ_i X_i ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}). Proof: We have E e^{λX_i} ≤ e^{λ²ν_i²/2} for all |λ| < 1/max_i α_i. Let S_n = Σ_i X_i. By independence, E e^{λS_n} = Π_i E e^{λX_i} ≤ e^{λ² Σ_i ν_i² / 2} for all |λ| < 1/max_i α_i. The tail bound follows from Proposition 4. 21 / 80

22 Equivalent characterizations of sub-exponential RVs. For a RV X, the following are equivalent (HDP, Prop. 2.7.1): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t/K₁) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂ p for all p ≥ 1. 3. The MGF of |X| satisfies E exp(λ|X|) ≤ exp(K₃λ) for all 0 ≤ λ ≤ 1/K₃. 4. The MGF of |X| is bounded at some point: E exp(|X|/K₄) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all |λ| ≤ 1/K₅. 22 / 80

23 Equivalent characterizations of sub-Gaussianity. For a RV X, the following are equivalent (HDP, Prop. 2.5.2): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t²/K₁²) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂√p for all p ≥ 1. 3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K₃²λ²) for all |λ| ≤ 1/K₃. 4. The MGF of X² is bounded at some point: E exp(X²/K₄²) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all λ ∈ R. 23 / 80

24 Sub-exponential norm. The sub-exponential norm is the smallest K₄ in property 4, i.e., ‖X‖_ψ₁ = inf{ t > 0 : E exp(|X|/t) ≤ 2 }. X is sub-exponential iff ‖X‖_ψ₁ < ∞. ‖·‖_ψ₁ is a proper norm on the space of sub-exponential RVs. Every sub-exponential variable satisfies the following bounds, for some universal constants C, c > 0: P(|X| ≥ t) ≤ 2 exp(−ct/‖X‖_ψ₁) for all t ≥ 0; ‖X‖_p ≤ C‖X‖_ψ₁ p for all p ≥ 1; E exp(|X|/‖X‖_ψ₁) ≤ 2; when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖²_ψ₁) for all |λ| ≤ 1/‖X‖_ψ₁. 24 / 80

25 Lemma 2: A random variable X is sub-Gaussian if and only if X² is sub-exponential; in fact, ‖X²‖_ψ₁ = ‖X‖²_ψ₂. Proof: Immediate from the definitions. Lemma 3: If X and Y are sub-Gaussian, then XY is sub-exponential, and ‖XY‖_ψ₁ ≤ ‖X‖_ψ₂ ‖Y‖_ψ₂. Proof: Assume ‖X‖_ψ₂ = ‖Y‖_ψ₂ = 1, WLOG. Apply Young's inequality ab ≤ (a² + b²)/2 for all a, b ∈ R, twice: E e^{|XY|} ≤ E e^{(X² + Y²)/2} = E[e^{X²/2} e^{Y²/2}] ≤ (1/2) E[e^{X²} + e^{Y²}] ≤ 2. 25 / 80

26 Alternative form of Proposition 4: Theorem 5 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs. Then P(|Σ_i X_i| ≥ t) ≤ 2 exp[−c min( t² / Σ_i ‖X_i‖²_ψ₁ , t / max_i ‖X_i‖_ψ₁ )], t ≥ 0, where c > 0 is some universal constant. Corollary 1 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs with ‖X_i‖_ψ₁ ≤ K for all i. Then P(|(1/n) Σ_{i=1}^n X_i| ≥ t) ≤ 2 exp[−c n min(t²/K², t/K)], t ≥ 0. 26 / 80

27 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 27 / 80

28 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 28 / 80

29 Concentration of χ² RVs I. Example 2: Let Y ~ χ²_n, i.e., Y = Σ_{i=1}^n Z_i² where Z_i are iid N(0, 1). The Z_i² are sub-exponential with parameters (2, 4). Then Y is sub-exponential with parameters (2√n, 4), and we obtain P(|Y − EY| ≥ t) ≤ 2 exp[−(1/2) min(t²/(4n), t/4)], or, replacing t with nt, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−(n/8) min(t², t)], t ≥ 0. 29 / 80

30 Concentration of χ² RVs II. In particular, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 e^{−nt²/8}, t ∈ [0, 1]. Second approach, ignoring constants: we have ‖Z_i² − 1‖_ψ₁ ≤ C‖Z_i²‖_ψ₁ = C‖Z_i‖²_ψ₂ = C. Applying Corollary 1 with K = C, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−c n min(t²/C², t/C)] ≤ 2 exp[−c₂ n min(t², t)], t ≥ 0, where c₂ = c min(1/C², 1/C). 30 / 80
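
A small simulation of the χ² bound above (a sketch; n, t, and the number of replications are arbitrary choices): compare the empirical tail of |(1/n) Σ Z_i² − 1| with 2 exp(−nt²/8) for t ∈ [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 100, 200_000, 0.3

Z = rng.standard_normal((reps, n))
dev = np.abs((Z**2).mean(axis=1) - 1.0)   # |(1/n) sum_i Z_i^2 - 1|

empirical = np.mean(dev >= t)
bound = 2 * np.exp(-n * t**2 / 8)         # valid for t in [0, 1]

print(f"empirical: {empirical:.3e}   bound: {bound:.3e}")
```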

31 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 31 / 80

32 Random projection for dimension reduction. Suppose that we have data points {u₁, ..., u_N} ⊂ R^d. We want to project them down to a lower-dimensional space R^m (m ≪ d) such that pairwise distances ‖u_i − u_j‖ are approximately preserved. This can be done by a linear random projection X : R^d → R^m, which can be viewed as a random matrix X ∈ R^{m×d}. Lemma 4 (Johnson–Lindenstrauss embedding): Let X := (1/√m) Z ∈ R^{m×d} where Z has iid N(0, 1) entries. Consider any collection of points {u₁, ..., u_N} ⊂ R^d. Take ε, δ ∈ (0, 1) and assume that m ≥ (16/δ²) log(N/√ε). Then, with probability at least 1 − ε, (1 − δ)‖u_i − u_j‖₂² ≤ ‖Xu_i − Xu_j‖₂² ≤ (1 + δ)‖u_i − u_j‖₂², for all i ≠ j. 32 / 80

33 Proof: Fix u ∈ R^d and let Y := ‖Zu‖₂²/‖u‖₂² = Σ_{i=1}^m ⟨z_i, u⟩²/‖u‖₂², where z_iᵀ is the ith row of Z. Then Y ~ χ²_m. Recalling X = Z/√m, for all δ ∈ (0, 1), P(|‖Xu‖₂²/‖u‖₂² − 1| ≥ δ) = P(|Y/m − 1| ≥ δ) ≤ 2e^{−mδ²/8}. Applying this to u = u_i − u_j, for any fixed pair (i, j), we have P(|‖X(u_i − u_j)‖₂²/‖u_i − u_j‖₂² − 1| ≥ δ) ≤ 2e^{−mδ²/8}. 33 / 80

34 Apply a further union bound over all pairs i ≠ j: P(|‖X(u_i − u_j)‖₂²/‖u_i − u_j‖₂² − 1| ≥ δ, for some i ≠ j) ≤ 2 (N choose 2) e^{−mδ²/8}. Since 2(N choose 2) ≤ N², the result follows by solving the following for m: N² e^{−mδ²/8} ≤ ε. 34 / 80
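
A minimal sketch of the random projection in Lemma 4 (the dimensions, N, δ, and ε below are arbitrary choices): draw X = Z/√m, project N points from R^d to R^m, and check the worst pairwise distortion.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, N, delta, eps = 1000, 50, 0.5, 0.05
m = int(np.ceil(16 / delta**2 * np.log(N / np.sqrt(eps))))  # sample-size requirement of Lemma 4

U = rng.standard_normal((N, d))               # arbitrary data points u_1, ..., u_N
X = rng.standard_normal((m, d)) / np.sqrt(m)  # JL map X = Z / sqrt(m)
V = U @ X.T                                   # projected points X u_i

ratios = [np.sum((V[i] - V[j])**2) / np.sum((U[i] - U[j])**2)
          for i, j in combinations(range(N), 2)]
print(f"m = {m}, squared-distance distortion range: [{min(ratios):.3f}, {max(ratios):.3f}]")
```

With probability at least 1 − ε, all the printed ratios fall in [1 − δ, 1 + δ].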

35 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 35 / 80

36 ℓ₂ norm of sub-Gaussian vectors. Here ‖X‖₂ = √(Σ_{i=1}^n X_i²). Proposition 5 (Concentration of norm, HDP 3.1.1): Let X = (X₁, ..., X_n) ∈ R^n be a random vector with independent, sub-Gaussian coordinates X_i that satisfy EX_i² = 1. Then ‖ ‖X‖₂ − √n ‖_ψ₂ ≤ CK², where K = max_i ‖X_i‖_ψ₂ and C is an absolute constant. The result says that the norm is highly concentrated around √n: ‖X‖₂ ≈ √n in high dimensions (n large). Assuming K = O(1), it shows that w.h.p. ‖X‖₂ = √n + O(1). More precisely, w.p. ≥ 1 − e^{−c₁v²}, we have √n − K²v ≤ ‖X‖₂ ≤ √n + K²v. 36 / 80

37 Simple argument: assuming sd(X₁²) = O(1), E‖X‖₂² = n, var(‖X‖₂²) = n var(X₁²), sd(‖X‖₂²) = √n sd(X₁²), so ‖X‖₂ ≈ √(n ± O(√n)) = √n ± O(1); the latter can be shown by Taylor expansion. 37 / 80

38 Proof of Proposition 5: Argue that we can take K ≥ 1. Since X_i is sub-Gaussian, X_i² is sub-exponential and ‖X_i² − 1‖_ψ₁ ≤ C‖X_i²‖_ψ₁ = C‖X_i‖²_ψ₂ ≤ CK². Applying Bernstein's inequality (Corollary 1), for any u ≥ 0, P(|‖X‖₂²/n − 1| ≥ u) ≤ 2 exp(−(c₁n/K⁴) min(u², u)), where we used K⁴ ≥ K² and absorbed C into c₁. Using the implication |z − 1| ≥ δ ⟹ |z² − 1| ≥ max(δ, δ²) (for z ≥ 0), P(|‖X‖₂/√n − 1| ≥ δ) ≤ P(|‖X‖₂²/n − 1| ≥ max(δ, δ²)) ≤ 2 exp(−(c₁n/K⁴) δ²), since with f(u) = min(u², u) and g(δ) = max(δ, δ²) we have f(g(δ)) = δ² for all δ ≥ 0. The change of variable δ = t/√n gives the result. 38 / 80
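
A quick check of Proposition 5 (a sketch; here the coordinates are standard normal, so EX_i² = 1 and K is an absolute constant): the deviation ‖X‖₂ − √n stays O(1) as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 100
for n in (100, 10_000, 100_000):
    X = rng.standard_normal((reps, n))            # reps independent vectors in R^n
    dev = np.linalg.norm(X, axis=1) - np.sqrt(n)  # should remain O(1), not grow with n
    print(f"n = {n:>7}: max |norm - sqrt(n)| = {np.abs(dev).max():.3f}")
```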

39 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 39 / 80

40 ℓ∞ norm of sub-Gaussian vectors. For any vector X ∈ R^n, the ℓ∞ norm is ‖X‖_∞ = max_{i=1,...,n} |X_i|. Lemma 5: Let X = (X₁, ..., X_n) ∈ R^n be a random vector with zero-mean, independent, sub-Gaussian coordinates X_i with parameters σ_i. Then, for any γ ≥ 0, P(‖X‖_∞ ≥ σ√(2(1 + γ) log n)) ≤ 2n^{−γ}, where σ = max_i σ_i. Proof: We have P(|X_i| ≥ t) ≤ 2 exp(−t²/(2σ²)), hence P(max_i |X_i| ≥ t) ≤ 2n exp(−t²/(2σ²)) = 2n^{−γ}, taking t = √(2σ²(1 + γ) log n). 40 / 80

41 Theorem 6: Assume {X_i}_{i=1}^n are zero-mean RVs, sub-Gaussian with parameter σ. Then E[max_{i=1,...,n} X_i] ≤ √(2σ² log n), for all n ≥ 1. Proof of Theorem 6: Jensen's inequality applied to e^{λZ} where Z = max_i X_i. 41 / 80
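
An illustration of Theorem 6 for standard Gaussians (a sketch; the values of n and the number of replications are arbitrary): the Monte Carlo estimate of E[max_i X_i] tracks √(2 log n).

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 2000
for n in (10, 100, 1000, 10_000):
    X = rng.standard_normal((reps, n))
    emp = X.max(axis=1).mean()   # Monte Carlo estimate of E[max_i X_i]
    print(f"n = {n:>6}: E[max] ~ {emp:.3f},  sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```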

42 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 42 / 80

43 Theorem 7 (Azuma–Hoeffding): Assume that X = (X₁, ..., X_n) has independent coordinates, and let Z = f(X). Write E_i[Z] = E[Z | X₁, ..., X_i] and let Δ_i := E_i[Z] − E_{i−1}[Z]. Assume that E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} for all λ ∈ R, (3) almost surely, for all i = 1, ..., n. Then Z − EZ is sub-Gaussian with parameter σ = √(Σ_{i=1}^n σ_i²). In particular, we have the tail bound P(|Z − EZ| ≥ t) ≤ 2 exp(−t²/(2σ²)). {Δ_i} is called Doob's martingale difference sequence. It is a martingale difference sequence since E_{i−1}[Δ_i] = 0. 43 / 80

44 Proof: Let S_j := Σ_{i=1}^j Δ_i, which is only a function of X_i, i ≤ j. Noting that E_n[Z] = Z and E_0[Z] = EZ, we have S_n = Σ_{i=1}^n Δ_i = Z − EZ. By properties of conditional expectation and assumption (3), E_{n−1}[e^{λS_n}] = e^{λS_{n−1}} E_{n−1}[e^{λΔ_n}] ≤ e^{λS_{n−1}} e^{σ_n²λ²/2}. Taking E_{n−2} of both sides: E_{n−2}[e^{λS_n}] ≤ e^{σ_n²λ²/2} E_{n−2}[e^{λS_{n−1}}] ≤ e^{λS_{n−2}} e^{(σ_n² + σ_{n−1}²)λ²/2}. Repeating the process, we get E_0[e^{λS_n}] ≤ exp((Σ_{i=1}^n σ_i²)λ²/2). 44 / 80

45 Bounded difference inequality. The conditional sub-Gaussian assumption holds under the bounded difference property: |f(x₁, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) − f(x₁, ..., x_{i−1}, x_i′, x_{i+1}, ..., x_n)| ≤ L_i (4) for all x₁, ..., x_n, x_i′ ∈ X, and all i ∈ [n], for some constants (L₁, ..., L_n). Theorem 8 (Bounded difference): Assume that X = (X₁, ..., X_n) has independent coordinates, and assume that f : X^n → R satisfies the bounded difference property (4). Then P(|f(X) − Ef(X)| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^n L_i²), t ≥ 0. 45 / 80

46 Proof (naive bound): We have Δ_i = E_i[Z] − E_{i−1}[E_i[Z]] = g_i(X₁, ..., X_i) − E_{i−1}[g_i(X₁, ..., X_i)]. Let X_i′ be an independent copy of X_i. Conditioned on X₁, ..., X_{i−1}, we are effectively looking at g_i(x₁, ..., x_{i−1}, X_i) − E[g_i(x₁, ..., x_{i−1}, X_i′)], due to the independence of {X₁, ..., X_i, X_i′}. Thus |Δ_i| ≤ L_i conditional on X₁, ..., X_{i−1}. That is, E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} where σ_i² = (2L_i)²/4 = L_i². 46 / 80

47 Proof (better bound): Can show that Δ_i ∈ I_i where |I_i| ≤ L_i, improving the constant by a factor of 4. Conditioned on X₁, ..., X_{i−1}, we are effectively looking at Δ_i = g_i(x₁, ..., x_{i−1}, X_i) − µ_i where µ_i is a constant (only a function of x₁, ..., x_{i−1}). Then Δ_i + µ_i ∈ [a_i, b_i] where a_i = inf_x g_i(x₁, ..., x_{i−1}, x) and b_i = sup_x g_i(x₁, ..., x_{i−1}, x). We have (need to argue that g_i satisfies bounded difference) b_i − a_i = sup_{x,y} [g_i(x₁, ..., x_{i−1}, x) − g_i(x₁, ..., x_{i−1}, y)] ≤ L_i. Thus E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} where σ_i² = (b_i − a_i)²/4 ≤ L_i²/4. 47 / 80

48 The role of independence in the second argument is subtle. The only place we used independence is to argue that E_i[Z] satisfies the bounded difference property for all i. We argue that E_i[Z] = g_i(X₁, ..., X_i), which is where we use independence. Then g_i, by definition and Jensen's inequality, satisfies the bounded difference property. 48 / 80

49 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 49 / 80

50 Example: (bounded) U-statistics. Let g : R² → R be a symmetric function and X₁, ..., X_n an iid sequence. Then U := (n choose 2)^{−1} Σ_{i<j} g(X_i, X_j) is called a U-statistic (of order 2). U is not a sum of independent variables; e.g. n = 3 gives U = (1/3)(g(X₁, X₂) + g(X₁, X₃) + g(X₂, X₃)), but the dependence between terms is relatively weak (made precise shortly). For example, g(x, y) = (1/2)(x − y)² gives an unbiased estimator of the variance. (Exercise) 50 / 80

51 Assume that g is bounded, i.e. ‖g‖_∞ := sup_{x,y} |g(x, y)| ≤ b, meaning |g(x, y)| ≤ b for all x, y ∈ R. Writing U = f(X₁, ..., X_n), we observe that (for fixed k, comparing x with x′ that differs from x only in coordinate k) |f(x) − f(x′)| ≤ (n choose 2)^{−1} Σ_{i ≠ k} |g(x_i, x_k) − g(x_i, x_k′)| ≤ (n − 1)·2b / (n(n − 1)/2) = 4b/n, thus f has bounded differences with parameters L_k = 4b/n. Applying Theorem 8, P(|U − EU| ≥ t) ≤ 2 e^{−nt²/(8b²)}, t ≥ 0. 51 / 80
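
A small sketch of the bounded U-statistic example, using g(x, y) = (1/2)(x − y)² with data in [0, 1] so that b = 1/2 (the distribution, sample sizes, and number of replications are arbitrary choices); the observed spread of U is compared with the sub-Gaussian scale 2b/√n implied by the bounded-difference bound.

```python
import numpy as np

rng = np.random.default_rng(5)
reps, b = 1000, 0.5   # |g(x, y)| <= b for x, y in [0, 1]

def u_stat(x):
    # U = (n choose 2)^{-1} * sum_{i<j} 0.5 * (x_i - x_j)^2  (unbiased variance estimator)
    n = len(x)
    diffs = x[:, None] - x[None, :]
    return 0.5 * np.sum(np.triu(diffs**2, k=1)) / (n * (n - 1) / 2)

for n in (50, 200, 800):
    U = np.array([u_stat(rng.uniform(0, 1, n)) for _ in range(reps)])
    print(f"n = {n:>4}: sd(U) = {U.std():.4f}   sub-Gaussian scale 2b/sqrt(n) = {2 * b / np.sqrt(n):.4f}")
```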

52 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 52 / 80

53 Clique number of Erdős–Rényi graphs. Let G be an undirected graph on n nodes. A clique in G is a complete (induced) subgraph. The clique number of G, denoted ω(G), is the size of the largest clique(s). For two graphs G and G′ that differ in at most one edge, |ω(G) − ω(G′)| ≤ 1. Thus, as a function of the edge indicators E(G), ω(G) has the bounded difference property with L = 1. Let G be an Erdős–Rényi random graph: edges are independently drawn with probability p. Then, with m = (n choose 2), P(|ω(G) − E ω(G)| ≥ δ) ≤ 2e^{−2δ²/m}, or, setting ω̄(G) = ω(G)/m, P(|ω̄(G) − E ω̄(G)| ≥ δ) ≤ 2e^{−2mδ²}. 53 / 80

54 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 54 / 80

55 Lipschitz functions of a standard Gaussian vector. A function f : R^n → R is L-Lipschitz w.r.t. ‖·‖₂ if |f(x) − f(y)| ≤ L‖x − y‖₂ for all x, y ∈ R^n. Theorem 9 (Gaussian concentration): Let X ~ N(0, I_n) be a standard Gaussian vector and assume that f : R^n → R is L-Lipschitz w.r.t. the Euclidean norm. Then P(|f(X) − E[f(X)]| ≥ t) ≤ 2 exp(−t²/(2L²)), t ≥ 0. (5) In other words, f(X) is sub-Gaussian with parameter L. A deep result with no easy proof! It has far-reaching consequences. One-sided bounds hold with the prefactor 2 removed. 55 / 80

56 Example: χ² and norm concentrations revisited. Let X ~ N(0, I_n) and consider the function f(x) = ‖x‖₂/√n. f is L-Lipschitz with L = 1/√n. Hence P(‖X‖₂/√n − E‖X‖₂/√n ≥ t) ≤ e^{−nt²/2}, t ≥ 0. Since E‖X‖₂ ≤ √n (why?), we have P(‖X‖₂/√n ≥ 1 + t) ≤ e^{−nt²/2}, t ≥ 0. For t ∈ [0, 1], (1 + t)² ≤ 1 + 3t, hence P(‖X‖₂²/n ≥ 1 + 3t) ≤ e^{−nt²/2}, t ∈ [0, 1], or, setting 3t = δ, P(‖X‖₂²/n ≥ 1 + δ) ≤ e^{−nδ²/18}, δ ∈ [0, 3]. 56 / 80

57 Example: order statistics. Let X ~ N(0, I_n), and let f(x) = x_(k) be the kth order statistic: for x ∈ R^n, x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n). For any x, y ∈ R^n we have |x_(k) − y_(k)| ≤ ‖x − y‖₂, hence f is 1-Lipschitz. (Exercise) It follows that P(|X_(k) − EX_(k)| ≥ t) ≤ 2e^{−t²/2}, t ≥ 0. In particular, if X_i ~ iid N(0, 1), i = 1, ..., n, then P(|max_{i=1,...,n} X_i − E[max_{i=1,...,n} X_i]| ≥ t) ≤ 2e^{−t²/2}, t ≥ 0. 57 / 80

58 Example: singular values. Consider a matrix X ∈ R^{n×d} where n > d. Let σ₁(X) ≥ σ₂(X) ≥ ⋯ ≥ σ_d(X) be the (ordered) singular values of X. By Weyl's theorem, for any X, Y ∈ R^{n×d}: |σ_k(X) − σ_k(Y)| ≤ ‖X − Y‖_op ≤ ‖X − Y‖_F. (Note that this is a generalization of the order-statistics inequality.) Thus X ↦ σ_k(X) is 1-Lipschitz. Proposition 6: Let X ∈ R^{n×d} be a random matrix with iid N(0, 1) entries. Then P(|σ_k(X) − E[σ_k(X)]| ≥ δ) ≤ 2e^{−δ²/2}, δ ≥ 0. It remains to characterize E[σ_k(X)]. For an overview of matrix norms, see matrix norms.pdf. 58 / 80
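
A quick simulation of Proposition 6 (a sketch; n, d, and the number of draws are arbitrary choices): the largest singular value of an n × d standard Gaussian matrix fluctuates on an O(1) scale around its mean, which is known to be roughly √n + √d.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, reps = 500, 100, 1000

# largest singular value of each draw (np.linalg.svd returns singular values in decreasing order)
sigma1 = np.array([np.linalg.svd(rng.standard_normal((n, d)), compute_uv=False)[0]
                   for _ in range(reps)])

print(f"mean of sigma_1: {sigma1.mean():.2f}   (sqrt(n) + sqrt(d) = {np.sqrt(n) + np.sqrt(d):.2f})")
print(f"sd of sigma_1  : {sigma1.std():.3f}   (Proposition 6 guarantees O(1) fluctuations)")
```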

59 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 59 / 80

60 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 60 / 80

61 Linear regression setup. The data is (y, X) where y ∈ R^n and X ∈ R^{n×d}, and the model is y = Xθ* + w, where θ* ∈ R^d is an unknown parameter and w ∈ R^n is the vector of noise variables. Equivalently, y_i = ⟨θ*, x_i⟩ + w_i, i = 1, ..., n, where x_i ∈ R^d is the ith row of X, i.e., X is the n × d matrix with rows x₁ᵀ, x₂ᵀ, ..., x_nᵀ. Recall ⟨θ*, x_i⟩ = Σ_{j=1}^d θ*_j x_ij. 61 / 80

62 Sparsity models. When n < d, there is no hope of estimating θ*, unless we impose some sort of low-dimensional model on θ*. Support of θ* (recall [d] = {1, ..., d}): supp(θ*) := S(θ*) = { j ∈ [d] : θ*_j ≠ 0 }. Hard sparsity assumption: s = |S(θ*)| ≪ d. Weaker sparsity assumption via ℓ_q balls for q ∈ [0, 1]: B_q(R_q) = { θ ∈ R^d : Σ_{j=1}^d |θ_j|^q ≤ R_q }. q = 1 gives the ℓ₁ ball; q = 0 gives the ℓ₀ ball, same as hard sparsity: ‖θ‖₀ := |S(θ)| = #{ j : θ_j ≠ 0 }. 62 / 80

63 (from HDS book) 63 / 80

64 Basis pursuit. Consider the noiseless case y = Xθ*. We assume that ‖θ*‖₀ is small. Ideal program to solve: min_{θ ∈ R^d} ‖θ‖₀ subject to y = Xθ. ‖·‖₀ is highly non-convex; relax to ‖·‖₁: min_{θ ∈ R^d} ‖θ‖₁ subject to y = Xθ. (6) This is called basis pursuit (regression). (6) is a convex program; in fact, it can be written as a linear program.¹ Global solutions can be obtained very efficiently. ¹Exercise: introduce auxiliary variables s_j ∈ R and note that minimizing Σ_j s_j subject to |θ_j| ≤ s_j gives the ℓ₁ norm of θ. 64 / 80
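
Following the footnote, a minimal sketch of basis pursuit (6) as a linear program, solved here with scipy.optimize.linprog (the problem sizes and the way the sparse θ* is generated are arbitrary choices).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, d, s = 40, 100, 3

X = rng.standard_normal((n, d)) / np.sqrt(n)
theta_star = np.zeros(d)
theta_star[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
y = X @ theta_star                          # noiseless measurements

# Variables v = (theta, u) in R^{2d}: minimize sum(u) s.t. -u <= theta <= u and X theta = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
I = np.eye(d)
A_ub = np.block([[I, -I], [-I, -I]])        # theta - u <= 0 and -theta - u <= 0
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * d), method="highs")

theta_hat = res.x[:d]
print("recovery error:", np.linalg.norm(theta_hat - theta_star))
```

For a Gaussian design with s small relative to n, the recovery error should be at numerical-tolerance level, in line with the restricted null space discussion that follows.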

65 Define C(S) = { Δ ∈ R^d : ‖Δ_{S^c}‖₁ ≤ ‖Δ_S‖₁ }. (7) Theorem 10: The following two are equivalent: (i) For any θ* ∈ R^d with support S, the basis pursuit program (6) applied to the data (y = Xθ*, X) has the unique solution θ̂ = θ*. (ii) The restricted null space property holds, i.e., C(S) ∩ ker(X) = {0}. (8) 65 / 80

66 Proof: Consider the tangent cone to the ℓ₁ ball (of radius ‖θ*‖₁) at θ*: T(θ*) = { Δ ∈ R^d : ‖θ* + tΔ‖₁ ≤ ‖θ*‖₁ for some t > 0 }, i.e., the set of descent directions for the ℓ₁ norm at the point θ*. The feasible set is θ* + ker(X), i.e. ker(X) is the set of feasible directions Δ = θ − θ*. Hence, there is a minimizer other than θ* if and only if T(θ*) ∩ ker(X) ≠ {0}. (9) It is enough to show that C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). 66 / 80

67 [Figure: the ℓ₁ ball B₁ with ker(X), the cone C(S), and the tangent cones T(θ_(1)) and T(θ_(2)).] Here d = 2, [d] = {1, 2}, S = {2}, θ_(1) = (0, 1), θ_(2) = (0, −1), and C(S) = {(Δ₁, Δ₂) : |Δ₁| ≤ |Δ₂|}. 67 / 80

68 It is enough to show that C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). (10) (Let T₁(θ*) be the subset of T(θ*) where t = 1.) We have Δ ∈ T₁(θ*) iff ‖Δ_{S^c}‖₁ ≤ ‖θ*_S‖₁ − ‖θ*_S + Δ_S‖₁. We have Δ ∈ T₁(θ*) for some θ* ∈ R^d s.t. supp(θ*) ⊆ S iff ‖Δ_{S^c}‖₁ ≤ sup_{θ*_S} [‖θ*_S‖₁ − ‖θ*_S + Δ_S‖₁] = ‖Δ_S‖₁. 68 / 80

69 Sufficient conditions for restricted nullspace. [d] := {1, ..., d}. For a matrix X ∈ R^{n×d}, let X_j be its jth column (for j ∈ [d]). The pairwise incoherence of X is defined as δ_PW(X) := max_{i,j ∈ [d]} |⟨X_i, X_j⟩/n − 1{i = j}|. Alternative form: XᵀX is the Gram matrix of X, (XᵀX)_ij = ⟨X_i, X_j⟩, so δ_PW(X) = ‖XᵀX/n − I_d‖_∞, where ‖·‖_∞ is the elementwise (vector) ℓ∞ norm of the matrix. 69 / 80

70 Proposition 7: The (uniform) restricted nullspace property holds for all S with |S| ≤ s if δ_PW(X) ≤ 1/(3s). Proof: Exercise. 70 / 80

71 A more relaxed condition: Definition 3 (RIP): X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if ‖X_SᵀX_S/n − I_s‖_op ≤ δ_s(X), for all S with |S| ≤ s. PW incoherence is close to RIP with s = 2; for example, when ‖X_j/√n‖₂ = 1 for all j, we have δ₂(X) = δ_PW(X). In general, for any s ≥ 2, δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X). 71 / 80

72 RIP gives sufficient conditions: Proposition 8: The (uniform) restricted null space property holds for all S with |S| ≤ s if δ_2s(X) ≤ 1/3. Consider a sub-Gaussian matrix X with i.i.d. entries (Exercise 7.7): we have δ_PW(X) < 1/(3s) w.h.p. whenever n ≳ s² log d. By contrast, for certain classes of random matrices we have δ_2s < 1/3 whenever n ≳ s log(ed/s). 72 / 80

73 Noisy sparse regression. A very popular estimator is the ℓ₁-regularized least-squares (Lasso): θ̂ ∈ argmin_{θ ∈ R^d} [ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₁ ]. (11) The idea: minimizing the ℓ₁ norm leads to sparse solutions. (11) is a convex program; a global solution can be obtained efficiently. Other options: the constrained form of the Lasso and relaxed basis pursuit: min_{‖θ‖₁ ≤ R} (1/2n)‖y − Xθ‖₂², (12) min_{θ ∈ R^d} ‖θ‖₁ s.t. (1/2n)‖y − Xθ‖₂² ≤ b². (13) 73 / 80
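
A minimal sketch of the Lagrangian Lasso (11) using scikit-learn, whose objective (1/(2·n_samples))‖y − Xθ‖₂² + α‖θ‖₁ matches (11) with α playing the role of λ; the data-generating choices and the value of λ below (of the order used in the fixed-design example later) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, d, s, sigma = 200, 500, 5, 0.5

X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)          # lambda of order sigma * sqrt(log d / n)
fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

print("estimation error :", np.linalg.norm(fit.coef_ - theta_star))
print("estimated support:", np.flatnonzero(fit.coef_))
```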

74 For a constant α ≥ 1, C_α(S) := { Δ ∈ R^d : ‖Δ_{S^c}‖₁ ≤ α‖Δ_S‖₁ }. Definition 4 (RE condition): A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if (1/n)‖XΔ‖₂² ≥ κ‖Δ‖₂² for all Δ ∈ C_α(S). Intuition: θ̂ minimizes L(θ) := (1/2n)‖Xθ − y‖₂². Ideally, δL := L(θ̂) − L(θ*) is small. We want to translate deviations in the loss into deviations in the parameter, θ̂ − θ*. This is controlled by the curvature of the loss, captured by the Hessian ∇²L(θ) = (1/n)XᵀX. 74 / 80

75 Ideally we would like strong convexity (in all directions): ⟨Δ, ∇²L(θ)Δ⟩ ≥ κ‖Δ‖₂² for all Δ ∈ R^d \ {0}, or, in the context of regression, (1/n)‖XΔ‖₂² ≥ κ‖Δ‖₂² for all Δ ∈ R^d \ {0}. In high dimensions we cannot guarantee this in all directions: the loss is flat over ker X. 75 / 80

76 Theorem 11: Assume that y = Xθ* + w, where X ∈ R^{n×d} and θ* ∈ R^d, θ* is supported on S ⊆ [d] with |S| ≤ s, and X satisfies RE(κ, 3) over S. Let us define z = Xᵀw/n and γ² := ‖w‖₂²/(2n). Then we have the following: (a) Any solution of the Lasso (11) with λ ≥ 2‖z‖_∞ satisfies ‖θ̂ − θ*‖₂ ≤ (3/κ)√s λ. (b) Any solution of the constrained Lasso (12) with R = ‖θ*‖₁ satisfies ‖θ̂ − θ*‖₂ ≤ (4/κ)√s ‖z‖_∞. (c) Any solution of relaxed basis pursuit (13) with b² ≥ γ² satisfies ‖θ̂ − θ*‖₂ ≤ (4/κ)√s ‖z‖_∞ + (2/√κ)√(b² − γ²). 76 / 80

77 Example (fixed design regression). Assume y = Xθ* + w where w ~ N(0, σ²I_n), and X ∈ R^{n×d} is fixed, satisfying the RE condition and the column normalization max_{j=1,...,d} ‖X_j‖₂/√n ≤ C, where X_j is the jth column of X. Recall z = Xᵀw/n. It is easy to show that w.p. ≥ 1 − 2e^{−nδ²/2}, ‖z‖_∞ ≤ Cσ(√(2 log d / n) + δ). Thus, setting λ = 2Cσ(√(2 log d / n) + δ), the Lasso solution satisfies, w.p. at least 1 − 2e^{−nδ²/2}, ‖θ̂ − θ*‖₂ ≤ (6Cσ/κ)√s (√(2 log d / n) + δ). 77 / 80
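
A quick numerical check of the ‖z‖_∞ bound on this slide (a sketch; the design, dimensions, and noise level are arbitrary choices): for a fixed X with normalized columns (C = 1) and w ~ N(0, σ²I_n), compare the 95th percentile of ‖Xᵀw/n‖_∞ over independent noise draws with Cσ(√(2 log d / n) + δ), where δ is chosen so that 2 exp(−nδ²/2) = 0.05.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, sigma, reps = 200, 1000, 1.0, 2000

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)    # normalize columns so ||X_j||_2 / sqrt(n) = 1

W = sigma * rng.standard_normal((reps, n))     # independent draws of the noise vector w
z_inf = np.abs(W @ X / n).max(axis=1)          # ||X^T w / n||_inf for each draw

delta = np.sqrt(2 * np.log(2 / 0.05) / n)      # so that 2 exp(-n delta^2 / 2) = 0.05
bound = sigma * (np.sqrt(2 * np.log(d) / n) + delta)

print(f"95th percentile of ||z||_inf        : {np.quantile(z_inf, 0.95):.4f}")
print(f"C sigma (sqrt(2 log d / n) + delta) : {bound:.4f}")
```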

78 Proof: Let us simplify the loss L(θ) := (1/2n)‖Xθ − y‖₂². Setting Δ = θ − θ*, L(θ) = (1/2n)‖X(θ − θ*) − w‖₂² = (1/2n)‖XΔ − w‖₂² = (1/2n)‖XΔ‖₂² − (1/n)⟨XΔ, w⟩ + const. = (1/2n)‖XΔ‖₂² − (1/n)⟨Δ, Xᵀw⟩ + const. = (1/2n)‖XΔ‖₂² − ⟨Δ, z⟩ + const., where z = Xᵀw/n. Hence, L(θ) − L(θ*) = (1/2n)‖XΔ‖₂² − ⟨Δ, z⟩. (14) Exercise: Show that (14) is the Taylor expansion of L around θ*. 78 / 80

79 Proof (constrained version): By optimality of θ̂ and feasibility of θ*: L(θ̂) ≤ L(θ*). The error vector Δ̂ := θ̂ − θ* satisfies the basic inequality (1/2n)‖XΔ̂‖₂² ≤ ⟨z, Δ̂⟩. Using Hölder's inequality, (1/2n)‖XΔ̂‖₂² ≤ ‖z‖_∞ ‖Δ̂‖₁. Since ‖θ̂‖₁ ≤ ‖θ*‖₁, we have Δ̂ = θ̂ − θ* ∈ C₁(S), hence ‖Δ̂‖₁ = ‖Δ̂_S‖₁ + ‖Δ̂_{S^c}‖₁ ≤ 2‖Δ̂_S‖₁ ≤ 2√s ‖Δ̂‖₂. Combined with the RE condition (Δ̂ ∈ C₃(S) as well), (1/2)κ‖Δ̂‖₂² ≤ 2√s ‖z‖_∞ ‖Δ̂‖₂, which gives the desired result. 79 / 80

80 Proof (Lagrangian version): Let L̃(θ) := L(θ) + λ‖θ‖₁ be the regularized loss. The basic inequality is L(θ̂) + λ‖θ̂‖₁ ≤ L(θ*) + λ‖θ*‖₁. Rearranging, (1/2n)‖XΔ̂‖₂² ≤ ⟨z, Δ̂⟩ + λ(‖θ*‖₁ − ‖θ̂‖₁). We have ‖θ*‖₁ − ‖θ̂‖₁ = ‖θ*_S‖₁ − ‖θ*_S + Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁ ≤ ‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁. Since λ ≥ 2‖z‖_∞, (1/n)‖XΔ̂‖₂² ≤ λ‖Δ̂‖₁ + 2λ(‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁) = λ(3‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁). It follows that Δ̂ ∈ C₃(S), and the rest of the proof follows. 80 / 80


More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example continued : Coin tossing Math 425 Intro to Probability Lecture 37 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan April 8, 2009 Consider a Bernoulli trials process with

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

11.1 Set Cover ILP formulation of set cover Deterministic rounding

11.1 Set Cover ILP formulation of set cover Deterministic rounding CS787: Advanced Algorithms Lecture 11: Randomized Rounding, Concentration Bounds In this lecture we will see some more examples of approximation algorithms based on LP relaxations. This time we will use

More information

EE514A Information Theory I Fall 2013

EE514A Information Theory I Fall 2013 EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

More information

High dimensional ising model selection using l 1 -regularized logistic regression

High dimensional ising model selection using l 1 -regularized logistic regression High dimensional ising model selection using l 1 -regularized logistic regression 1 Department of Statistics Pennsylvania State University 597 Presentation 2016 1/29 Outline Introduction 1 Introduction

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Invertibility of random matrices

Invertibility of random matrices University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]

More information

Hoeffding, Chernoff, Bennet, and Bernstein Bounds

Hoeffding, Chernoff, Bennet, and Bernstein Bounds Stat 928: Statistical Learning Theory Lecture: 6 Hoeffding, Chernoff, Bennet, Bernstein Bounds Instructor: Sham Kakade 1 Hoeffding s Bound We say X is a sub-gaussian rom variable if it has quadratically

More information

Sparse PCA in High Dimensions

Sparse PCA in High Dimensions Sparse PCA in High Dimensions Jing Lei, Department of Statistics, Carnegie Mellon Workshop on Big Data and Differential Privacy Simons Institute, Dec, 2013 (Based on joint work with V. Q. Vu, J. Cho, and

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

OXPORD UNIVERSITY PRESS

OXPORD UNIVERSITY PRESS Concentration Inequalities A Nonasymptotic Theory of Independence STEPHANE BOUCHERON GABOR LUGOSI PASCAL MASS ART OXPORD UNIVERSITY PRESS CONTENTS 1 Introduction 1 1.1 Sums of Independent Random Variables

More information

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2 Order statistics Ex. 4. (*. Let independent variables X,..., X n have U(0, distribution. Show that for every x (0,, we have P ( X ( < x and P ( X (n > x as n. Ex. 4.2 (**. By using induction or otherwise,

More information