STAT 200C: High-dimensional Statistics

1 STAT 200C: High-dimensional Statistics. Arash A. Amini. April 27, 2018. 1 / 80

2 Classical case: n ≫ d. Asymptotic assumption: d is fixed and n → ∞. Basic tools: LLN and CLT. High-dimensional setting: n ≍ d, e.g. n/d → γ, or even d ≫ n, e.g. genes with only 50 samples. Classical methods fail. E.g., linear regression y = Xβ + ε, where ε ~ N(0, σ²I_n): β̂_OLS = argmin_{β ∈ R^d} ‖y − Xβ‖₂². We have MSE(β̂_OLS) = O(σ²d/n). Solution: assume some underlying low-dimensional structure (e.g. sparsity). 2 / 80

3 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 3 / 80

4 Concentration inequalities. Main tools in dealing with high-dimensional randomness. Non-asymptotic versions of the CLT. General form: P(|X − EX| > t) ≤ something small. Classical examples: Markov and Chebyshev inequalities. Markov: assume X ≥ 0; then P(X ≥ t) ≤ EX/t. Chebyshev: assume EX² < ∞, and let µ = EX. Then P(|X − µ| ≥ t) ≤ var(X)/t². Stronger assumption: E|X|^k < ∞. Then P(|X − µ| ≥ t) ≤ E|X − µ|^k / t^k. 4 / 80

5 Concentration inequalities. Example 1: X₁, ..., X_n ~ iid Ber(1/2) and S_n = Σ_{i=1}^n X_i. Then, by the CLT, Z_n := (S_n − n/2)/√(n/4) →d N(0, 1). Letting g ~ N(0, 1), P(S_n ≥ n/2 + √(n/4)·t) ≈ P(g ≥ t) ≤ (1/2) exp(−t²/2). Letting t = α√n, P(S_n ≥ (n/2)(1 + α)) ⪅ (1/2) exp(−nα²/2). Problem: the approximation is not tight in general. 5 / 80

6 Theorem 1 (Berry–Esseen CLT): Under the assumptions of the CLT, with ρ = E|X₁ − µ|³/σ³, sup_t |P(Z_n ≤ t) − P(g ≤ t)| ≤ ρ/√n. The bound is tight since P(S_n = n/2) = 2^{−n} (n choose n/2) ≍ n^{−1/2} for the Bernoulli example. Conclusion: the approximation error is O(n^{−1/2}), which is a lot larger than the exponential bound O(exp(−nα²/2)) that we want to establish. Solution: directly obtain the concentration inequalities, often using the Chernoff bounding technique: for any λ > 0, P(Z_n ≥ t) = P(e^{λZ_n} ≥ e^{λt}) ≤ E e^{λZ_n} / e^{λt}, for all t ∈ R. This leads to the study of the MGF of random variables. 6 / 80

7 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 7 / 80

8 Sub-Gaussian concentration. Definition 1: A zero-mean random variable X is sub-Gaussian if, for some σ > 0, E e^{λX} ≤ e^{σ²λ²/2} for all λ ∈ R. (1) A general random variable is sub-Gaussian if X − EX is sub-Gaussian. X ~ N(0, σ²) satisfies (1) with equality. A Rademacher variable (also called symmetric Bernoulli), P(X = ±1) = 1/2, is sub-Gaussian: E e^{λX} = cosh(λ) ≤ e^{λ²/2}. Any bounded RV is sub-Gaussian: if X ∈ [a, b] a.s., then (1) holds with σ = (b − a)/2. 8 / 80

9 Proposition 1: Assume that X is zero-mean sub-Gaussian satisfying (1). Then P(X ≥ t) ≤ exp(−t²/(2σ²)) for all t ≥ 0. The same bound holds with X replaced with −X. Proof: Chernoff bound: P(X ≥ t) ≤ inf_{λ>0} [e^{−λt} E e^{λX}] ≤ inf_{λ>0} exp(−λt + λ²σ²/2). Union bound gives the two-sided bound: P(|X| ≥ t) ≤ 2 exp(−t²/(2σ²)). What if µ := EX ≠ 0? Apply to X − µ: P(|X − µ| ≥ t) ≤ 2 exp(−t²/(2σ²)). 9 / 80

10 Proposition 2: Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then S_n = Σ_i X_i is sub-Gaussian with parameter σ := √(Σ_i σ_i²). The sub-Gaussian parameter squared behaves like the variance. Proof: E e^{λS_n} = Π_i E e^{λX_i}. 10 / 80

11 Theorem 2 (Hoeffding): Assume that {X_i} are independent, zero-mean sub-Gaussian with parameters {σ_i}. Then, letting σ² := Σ_i σ_i², P(Σ_i X_i ≥ t) ≤ exp(−t²/(2σ²)), t ≥ 0. The same bound holds with X_i replaced with −X_i. Alternative form: assume there are n variables, and let σ̄² := (1/n) Σ_{i=1}^n σ_i² and X̄_n := (1/n) Σ_{i=1}^n X_i. Then P(X̄_n ≥ t) ≤ exp(−nt²/(2σ̄²)), t ≥ 0. Example: X_i iid Rademacher, so that σ̄ = σ_i = 1. 11 / 80
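
A quick Monte Carlo illustration of Theorem 2 (a minimal sketch; the values of n, t, and the number of trials are arbitrary choices): for iid Rademacher X_i (so σ̄ = 1), the empirical tail of X̄_n sits below the Hoeffding bound exp(−nt²/2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 100_000, 0.2

# iid Rademacher variables: sub-Gaussian with parameter sigma_i = 1
X = rng.choice([-1.0, 1.0], size=(trials, n))
Xbar = X.mean(axis=1)

empirical = np.mean(Xbar >= t)       # Monte Carlo estimate of P(Xbar_n >= t)
hoeffding = np.exp(-n * t**2 / 2)    # exp(-n t^2 / (2 sigma-bar^2)) with sigma-bar = 1

print(f"empirical tail  : {empirical:.2e}")
print(f"Hoeffding bound : {hoeffding:.2e}")
```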

12 Equivalent characterizations of sub-Gaussianity. For a RV X, the following are equivalent (HDP, Prop. 2.5.2): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t²/K₁²) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂√p for all p ≥ 1. 3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K₃²λ²) for all |λ| ≤ 1/K₃. 4. The MGF of X² is bounded at some point: E exp(X²/K₄²) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all λ ∈ R. 12 / 80

13 Sub-Gaussian norm. The sub-Gaussian norm is the smallest K₄ in property 4, i.e., ‖X‖_ψ₂ = inf{ t > 0 : E exp(X²/t²) ≤ 2 }. X is sub-Gaussian iff ‖X‖_ψ₂ < ∞. ‖·‖_ψ₂ is a proper norm on the space of sub-Gaussian RVs. Every sub-Gaussian variable satisfies the following bounds, for some universal constants C, c > 0: P(|X| ≥ t) ≤ 2 exp(−ct²/‖X‖²_ψ₂) for all t ≥ 0; ‖X‖_p ≤ C‖X‖_ψ₂ √p for all p ≥ 1; E exp(X²/‖X‖²_ψ₂) ≤ 2; when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖²_ψ₂) for all λ ∈ R. 13 / 80

14 Some consequences. Recall what a universal/numerical/absolute constant means. The sub-Gaussian norm is within a constant factor of the sub-Gaussian parameter σ: for numerical constants c₁, c₂ > 0, c₁‖X‖_ψ₂ ≤ σ(X) ≤ c₂‖X‖_ψ₂. Easy to see that ‖X‖_ψ₂ ≲ ‖X‖_∞ (bounded variables are sub-Gaussian); here a ≲ b means a ≤ Cb for some universal constant C. Lemma 1 (Centering): If X is sub-Gaussian, then X − EX is sub-Gaussian too and ‖X − EX‖_ψ₂ ≤ C‖X‖_ψ₂ where C is a universal constant. Proof: ‖EX‖_ψ₂ ≲ |EX| ≤ E|X| = ‖X‖₁ ≲ ‖X‖_ψ₂. Note: ‖X − EX‖_ψ₂ could be much smaller than ‖X‖_ψ₂. 14 / 80

15 Alternative forms. Alternative form of Proposition 2: Proposition 3 (HDP 2.6.1): Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then Σ_i X_i is also sub-Gaussian and ‖Σ_i X_i‖²_ψ₂ ≤ C Σ_i ‖X_i‖²_ψ₂, where C is an absolute constant. 15 / 80

16 Alternative form of Theorem 2: Theorem 3 (Hoeffding): Assume that {X_i} are independent, zero-mean sub-Gaussian RVs. Then P(|Σ_i X_i| ≥ t) ≤ 2 exp(−c t² / Σ_i ‖X_i‖²_ψ₂), t ≥ 0, where c > 0 is some universal constant. 16 / 80

17 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 17 / 80

18 Sub-exponential concentration. Definition 2: A zero-mean random variable X is sub-exponential if, for some ν, α > 0, E e^{λX} ≤ e^{ν²λ²/2} for all |λ| < 1/α. (2) A general random variable is sub-exponential if X − EX is sub-exponential. If Z ~ N(0, 1), then Z² is sub-exponential: E e^{λ(Z²−1)} = e^{−λ}/√(1 − 2λ) for λ < 1/2, and = ∞ for λ ≥ 1/2. We have E e^{λ(Z²−1)} ≤ e^{4λ²/2} for |λ| < 1/4, hence Z² − 1 is sub-exponential with parameters (2, 4). Tails of Z² − 1 are heavier than Gaussian. 18 / 80

19 Proposition 4: Assume that X is zero-mean sub-exponential satisfying (2). Then P(X ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}) for all t ≥ 0. The same bound holds with X replaced with −X. Proof: Chernoff bound: P(X ≥ t) ≤ inf_{λ ≥ 0} [e^{−λt} E e^{λX}] ≤ inf_{0 ≤ λ < 1/α} exp(−λt + λ²ν²/2). Let f(λ) = −λt + λ²ν²/2. The minimizer of f over R is λ* = t/ν². 19 / 80

20 Hence the minimizer of f over [0, 1/α] is λ* = t/ν² if t/ν² < 1/α, and λ* = 1/α if t/ν² ≥ 1/α, and the minimum is f(λ*) = −t²/(2ν²) if t < ν²/α, and f(λ*) = −t/α + ν²/(2α²) ≤ −t/(2α) if t ≥ ν²/α. Thus, f(λ*) ≤ −min{t²/(2ν²), t/(2α)} = −(1/2) min{t²/ν², t/α}. 20 / 80

21 Bernstein inequality for sub-exponential RVs. Theorem 4 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs with parameters (ν_i, α_i). Let ν := (Σ_i ν_i²)^{1/2} and α := max_i α_i. Then Σ_i X_i is sub-exponential with parameters (ν, α), and P(Σ_i X_i ≥ t) ≤ exp(−(1/2) min{t²/ν², t/α}). Proof: We have E e^{λX_i} ≤ e^{λ²ν_i²/2} for all |λ| < 1/max_i α_i. Let S_n = Σ_i X_i. By independence, E e^{λS_n} = Π_i E e^{λX_i} ≤ e^{λ² Σ_i ν_i² / 2} for all |λ| < 1/max_i α_i. The tail bound follows from Proposition 4. 21 / 80

22 Equivalent characterizations of sub-exponential RVs. For a RV X, the following are equivalent (HDP, Prop. 2.7.1): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t/K₁) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂ p for all p ≥ 1. 3. The MGF of |X| satisfies E exp(λ|X|) ≤ exp(K₃λ) for all 0 ≤ λ ≤ 1/K₃. 4. The MGF of |X| is bounded at some point: E exp(|X|/K₄) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all |λ| ≤ 1/K₅. 22 / 80

23 Equivalent characterizations of sub-Gaussianity. For a RV X, the following are equivalent (HDP, Prop. 2.5.2): 1. The tails of X satisfy P(|X| ≥ t) ≤ 2 exp(−t²/K₁²) for all t ≥ 0. 2. The moments of X satisfy ‖X‖_p = (E|X|^p)^{1/p} ≤ K₂√p for all p ≥ 1. 3. The MGF of X² satisfies E exp(λ²X²) ≤ exp(K₃²λ²) for all |λ| ≤ 1/K₃. 4. The MGF of X² is bounded at some point: E exp(X²/K₄²) ≤ 2. Assuming EX = 0, the above are equivalent to: 5. The MGF of X satisfies E exp(λX) ≤ exp(K₅²λ²) for all λ ∈ R. 23 / 80

24 Sub-exponential norm. The sub-exponential norm is the smallest K₄ in property 4, i.e., ‖X‖_ψ₁ = inf{ t > 0 : E exp(|X|/t) ≤ 2 }. X is sub-exponential iff ‖X‖_ψ₁ < ∞. ‖·‖_ψ₁ is a proper norm on the space of sub-exponential RVs. Every sub-exponential variable satisfies the following bounds, for some universal constants C, c > 0: P(|X| ≥ t) ≤ 2 exp(−ct/‖X‖_ψ₁) for all t ≥ 0; ‖X‖_p ≤ C‖X‖_ψ₁ p for all p ≥ 1; E exp(|X|/‖X‖_ψ₁) ≤ 2; when EX = 0, E exp(λX) ≤ exp(Cλ²‖X‖²_ψ₁) for all |λ| ≤ 1/‖X‖_ψ₁. 24 / 80

25 Lemma 2: A random variable X is sub-Gaussian if and only if X² is sub-exponential; in fact, ‖X²‖_ψ₁ = ‖X‖²_ψ₂. Proof: Immediate from the definitions. Lemma 3: If X and Y are sub-Gaussian, then XY is sub-exponential, and ‖XY‖_ψ₁ ≤ ‖X‖_ψ₂ ‖Y‖_ψ₂. Proof: Assume ‖X‖_ψ₂ = ‖Y‖_ψ₂ = 1, WLOG. Apply Young's inequality ab ≤ (a² + b²)/2 for all a, b ∈ R, twice: E e^{|XY|} ≤ E e^{(X² + Y²)/2} = E[e^{X²/2} e^{Y²/2}] ≤ (1/2) E[e^{X²} + e^{Y²}] ≤ 2. 25 / 80

26 Alternative form of Proposition 4: Theorem 5 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs. Then P(|Σ_i X_i| ≥ t) ≤ 2 exp[−c min( t² / Σ_i ‖X_i‖²_ψ₁ , t / max_i ‖X_i‖_ψ₁ )], t ≥ 0, where c > 0 is some universal constant. Corollary 1 (Bernstein): Assume that {X_i} are independent, zero-mean sub-exponential RVs with ‖X_i‖_ψ₁ ≤ K for all i. Then P(|(1/n) Σ_{i=1}^n X_i| ≥ t) ≤ 2 exp[−c n min(t²/K², t/K)], t ≥ 0. 26 / 80

27 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 27 / 80

28 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 28 / 80

29 Concentration of χ² RVs I. Example 2: Let Y ~ χ²_n, i.e., Y = Σ_{i=1}^n Z_i² where Z_i are iid N(0, 1). The Z_i² are sub-exponential with parameters (2, 4). Then Y is sub-exponential with parameters (2√n, 4), and we obtain P(|Y − EY| ≥ t) ≤ 2 exp[−(1/2) min(t²/(4n), t/4)], or, replacing t with nt, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−(n/8) min(t², t)], t ≥ 0. 29 / 80

30 Concentration of χ² RVs II. In particular, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 e^{−nt²/8}, t ∈ [0, 1]. Second approach, ignoring constants: we have ‖Z_i² − 1‖_ψ₁ ≤ C‖Z_i²‖_ψ₁ = C‖Z_i‖²_ψ₂ = C. Applying Corollary 1 with K = C, P(|(1/n) Σ_{i=1}^n Z_i² − 1| ≥ t) ≤ 2 exp[−c n min(t²/C², t/C)] ≤ 2 exp[−c₂ n min(t², t)], t ≥ 0, where c₂ = c min(1/C², 1/C). 30 / 80
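
A small simulation of the χ² bound above (a sketch; n, t, and the number of replications are arbitrary choices): compare the empirical tail of |(1/n) Σ Z_i² − 1| with 2 exp(−nt²/8) for t ∈ [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 100, 200_000, 0.3

Z = rng.standard_normal((reps, n))
dev = np.abs((Z**2).mean(axis=1) - 1.0)   # |(1/n) sum_i Z_i^2 - 1|

empirical = np.mean(dev >= t)
bound = 2 * np.exp(-n * t**2 / 8)         # valid for t in [0, 1]

print(f"empirical: {empirical:.3e}   bound: {bound:.3e}")
```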

31 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 31 / 80

32 Random projection for dimension reduction. Suppose that we have data points {u₁, ..., u_N} ⊂ R^d. We want to project them down to a lower-dimensional space R^m (m ≪ d) such that pairwise distances ‖u_i − u_j‖ are approximately preserved. This can be done by a linear random projection X : R^d → R^m, which can be viewed as a random matrix X ∈ R^{m×d}. Lemma 4 (Johnson–Lindenstrauss embedding): Let X := (1/√m) Z ∈ R^{m×d} where Z has iid N(0, 1) entries. Consider any collection of points {u₁, ..., u_N} ⊂ R^d. Take ε, δ ∈ (0, 1) and assume that m ≥ (16/δ²) log(N/√ε). Then, with probability at least 1 − ε, (1 − δ)‖u_i − u_j‖₂² ≤ ‖Xu_i − Xu_j‖₂² ≤ (1 + δ)‖u_i − u_j‖₂², for all i ≠ j. 32 / 80

33 Proof: Fix u ∈ R^d and let Y := ‖Zu‖₂²/‖u‖₂² = Σ_{i=1}^m ⟨z_i, u⟩²/‖u‖₂², where z_iᵀ is the ith row of Z. Then Y ~ χ²_m. Recalling X = Z/√m, for all δ ∈ (0, 1), P(|‖Xu‖₂²/‖u‖₂² − 1| ≥ δ) = P(|Y/m − 1| ≥ δ) ≤ 2e^{−mδ²/8}. Applying this to u = u_i − u_j, for any fixed pair (i, j), we have P(|‖X(u_i − u_j)‖₂²/‖u_i − u_j‖₂² − 1| ≥ δ) ≤ 2e^{−mδ²/8}. 33 / 80

34 Apply a further union bound over all pairs i ≠ j: P(|‖X(u_i − u_j)‖₂²/‖u_i − u_j‖₂² − 1| ≥ δ, for some i ≠ j) ≤ 2 (N choose 2) e^{−mδ²/8}. Since 2(N choose 2) ≤ N², the result follows by solving the following for m: N² e^{−mδ²/8} ≤ ε. 34 / 80
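
A minimal sketch of the random projection in Lemma 4 (the dimensions, N, δ, and ε below are arbitrary choices): draw X = Z/√m, project N points from R^d to R^m, and check the worst pairwise distortion.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, N, delta, eps = 1000, 50, 0.5, 0.05
m = int(np.ceil(16 / delta**2 * np.log(N / np.sqrt(eps))))  # sample-size requirement of Lemma 4

U = rng.standard_normal((N, d))               # arbitrary data points u_1, ..., u_N
X = rng.standard_normal((m, d)) / np.sqrt(m)  # JL map X = Z / sqrt(m)
V = U @ X.T                                   # projected points X u_i

ratios = [np.sum((V[i] - V[j])**2) / np.sum((U[i] - U[j])**2)
          for i, j in combinations(range(N), 2)]
print(f"m = {m}, squared-distance distortion range: [{min(ratios):.3f}, {max(ratios):.3f}]")
```

With probability at least 1 − ε, all the printed ratios fall in [1 − δ, 1 + δ].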

35 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 35 / 80

36 ℓ₂ norm of sub-Gaussian vectors. Here ‖X‖₂ = √(Σ_{i=1}^n X_i²). Proposition 5 (Concentration of norm, HDP 3.1.1): Let X = (X₁, ..., X_n) ∈ R^n be a random vector with independent, sub-Gaussian coordinates X_i that satisfy EX_i² = 1. Then ‖ ‖X‖₂ − √n ‖_ψ₂ ≤ CK², where K = max_i ‖X_i‖_ψ₂ and C is an absolute constant. The result says that the norm is highly concentrated around √n: ‖X‖₂ ≈ √n in high dimensions (n large). Assuming K = O(1), it shows that w.h.p. ‖X‖₂ = √n + O(1). More precisely, w.p. ≥ 1 − e^{−c₁v²}, we have √n − K²v ≤ ‖X‖₂ ≤ √n + K²v. 36 / 80

37 Simple argument: assuming sd(X₁²) = O(1), E‖X‖₂² = n, var(‖X‖₂²) = n var(X₁²), sd(‖X‖₂²) = √n sd(X₁²), so ‖X‖₂ ≈ √(n ± O(√n)) = √n ± O(1); the latter can be shown by Taylor expansion. 37 / 80

38 Proof of Proposition 5: Argue that we can take K ≥ 1. Since X_i is sub-Gaussian, X_i² is sub-exponential and ‖X_i² − 1‖_ψ₁ ≤ C‖X_i²‖_ψ₁ = C‖X_i‖²_ψ₂ ≤ CK². Applying Bernstein's inequality (Corollary 1), for any u ≥ 0, P(|‖X‖₂²/n − 1| ≥ u) ≤ 2 exp(−(c₁n/K⁴) min(u², u)), where we used K⁴ ≥ K² and absorbed C into c₁. Using the implication |z − 1| ≥ δ ⟹ |z² − 1| ≥ max(δ, δ²) (for z ≥ 0), P(|‖X‖₂/√n − 1| ≥ δ) ≤ P(|‖X‖₂²/n − 1| ≥ max(δ, δ²)) ≤ 2 exp(−(c₁n/K⁴) δ²), since with f(u) = min(u², u) and g(δ) = max(δ, δ²) we have f(g(δ)) = δ² for all δ ≥ 0. The change of variable δ = t/√n gives the result. 38 / 80
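
A quick check of Proposition 5 (a sketch; here the coordinates are standard normal, so EX_i² = 1 and K is an absolute constant): the deviation ‖X‖₂ − √n stays O(1) as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 100
for n in (100, 10_000, 100_000):
    X = rng.standard_normal((reps, n))            # reps independent vectors in R^n
    dev = np.linalg.norm(X, axis=1) - np.sqrt(n)  # should remain O(1), not grow with n
    print(f"n = {n:>7}: max |norm - sqrt(n)| = {np.abs(dev).max():.3f}")
```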

39 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 39 / 80

40 ℓ∞ norm of sub-Gaussian vectors. For any vector X ∈ R^n, the ℓ∞ norm is ‖X‖_∞ = max_{i=1,...,n} |X_i|. Lemma 5: Let X = (X₁, ..., X_n) ∈ R^n be a random vector with zero-mean, independent, sub-Gaussian coordinates X_i with parameters σ_i. Then, for any γ ≥ 0, P(‖X‖_∞ ≥ σ√(2(1 + γ) log n)) ≤ 2n^{−γ}, where σ = max_i σ_i. Proof: We have P(|X_i| ≥ t) ≤ 2 exp(−t²/(2σ²)), hence P(max_i |X_i| ≥ t) ≤ 2n exp(−t²/(2σ²)) = 2n^{−γ}, taking t = √(2σ²(1 + γ) log n). 40 / 80

41 Theorem 6: Assume {X_i}_{i=1}^n are zero-mean RVs, sub-Gaussian with parameter σ. Then E[max_{i=1,...,n} X_i] ≤ √(2σ² log n), for all n ≥ 1. Proof of Theorem 6: Jensen's inequality applied to e^{λZ} where Z = max_i X_i. 41 / 80
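
An illustration of Theorem 6 for standard Gaussians (a sketch; the values of n and the number of replications are arbitrary): the Monte Carlo estimate of E[max_i X_i] tracks √(2 log n).

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 2000
for n in (10, 100, 1000, 10_000):
    X = rng.standard_normal((reps, n))
    emp = X.max(axis=1).mean()   # Monte Carlo estimate of E[max_i X_i]
    print(f"n = {n:>6}: E[max] ~ {emp:.3f},  sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```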

42 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 42 / 80

43 Theorem 7 (Azuma–Hoeffding): Assume that X = (X₁, ..., X_n) has independent coordinates, and let Z = f(X). Write E_i[Z] = E[Z | X₁, ..., X_i] and let Δ_i := E_i[Z] − E_{i−1}[Z]. Assume that E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} for all λ ∈ R, (3) almost surely, for all i = 1, ..., n. Then Z − EZ is sub-Gaussian with parameter σ = √(Σ_{i=1}^n σ_i²). In particular, we have the tail bound P(|Z − EZ| ≥ t) ≤ 2 exp(−t²/(2σ²)). {Δ_i} is called Doob's martingale difference sequence. It is a martingale difference sequence since E_{i−1}[Δ_i] = 0. 43 / 80

44 Proof: Let S_j := Σ_{i=1}^j Δ_i, which is only a function of X_i, i ≤ j. Noting that E_n[Z] = Z and E_0[Z] = EZ, we have S_n = Σ_{i=1}^n Δ_i = Z − EZ. By properties of conditional expectation and assumption (3), E_{n−1}[e^{λS_n}] = e^{λS_{n−1}} E_{n−1}[e^{λΔ_n}] ≤ e^{λS_{n−1}} e^{σ_n²λ²/2}. Taking E_{n−2} of both sides: E_{n−2}[e^{λS_n}] ≤ e^{σ_n²λ²/2} E_{n−2}[e^{λS_{n−1}}] ≤ e^{λS_{n−2}} e^{(σ_n² + σ_{n−1}²)λ²/2}. Repeating the process, we get E_0[e^{λS_n}] ≤ exp((Σ_{i=1}^n σ_i²)λ²/2). 44 / 80

45 Bounded difference inequality. The conditional sub-Gaussian assumption holds under the bounded difference property: |f(x₁, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) − f(x₁, ..., x_{i−1}, x_i′, x_{i+1}, ..., x_n)| ≤ L_i (4) for all x₁, ..., x_n, x_i′ ∈ X, and all i ∈ [n], for some constants (L₁, ..., L_n). Theorem 8 (Bounded difference): Assume that X = (X₁, ..., X_n) has independent coordinates, and assume that f : X^n → R satisfies the bounded difference property (4). Then P(|f(X) − Ef(X)| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^n L_i²), t ≥ 0. 45 / 80

46 Proof (naive bound): We have Δ_i = E_i[Z] − E_{i−1}[E_i[Z]] = g_i(X₁, ..., X_i) − E_{i−1}[g_i(X₁, ..., X_i)]. Let X_i′ be an independent copy of X_i. Conditioned on X₁, ..., X_{i−1}, we are effectively looking at g_i(x₁, ..., x_{i−1}, X_i) − E[g_i(x₁, ..., x_{i−1}, X_i′)], due to the independence of {X₁, ..., X_i, X_i′}. Thus |Δ_i| ≤ L_i conditional on X₁, ..., X_{i−1}. That is, E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} where σ_i² = (2L_i)²/4 = L_i². 46 / 80

47 Proof (better bound): Can show that Δ_i ∈ I_i where |I_i| ≤ L_i, improving the constant by a factor of 4. Conditioned on X₁, ..., X_{i−1}, we are effectively looking at Δ_i = g_i(x₁, ..., x_{i−1}, X_i) − µ_i where µ_i is a constant (only a function of x₁, ..., x_{i−1}). Then Δ_i + µ_i ∈ [a_i, b_i] where a_i = inf_x g_i(x₁, ..., x_{i−1}, x) and b_i = sup_x g_i(x₁, ..., x_{i−1}, x). We have (need to argue that g_i satisfies bounded difference) b_i − a_i = sup_{x,y} [g_i(x₁, ..., x_{i−1}, x) − g_i(x₁, ..., x_{i−1}, y)] ≤ L_i. Thus E_{i−1}[e^{λΔ_i}] ≤ e^{σ_i²λ²/2} where σ_i² = (b_i − a_i)²/4 ≤ L_i²/4. 47 / 80

48 The role of independence in the second argument is subtle. The only place we used independence is to argue that E_i[Z] satisfies the bounded difference property for all i. We argue that E_i[Z] = g_i(X₁, ..., X_i), which is where we use independence. Then g_i, by definition and Jensen's inequality, satisfies the bounded difference property. 48 / 80

49 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 49 / 80

50 Example: (bounded) U-statistics. Let g : R² → R be a symmetric function and X₁, ..., X_n an iid sequence. Then U := (n choose 2)^{−1} Σ_{i<j} g(X_i, X_j) is called a U-statistic (of order 2). U is not a sum of independent variables; e.g. n = 3 gives U = (1/3)(g(X₁, X₂) + g(X₁, X₃) + g(X₂, X₃)), but the dependence between terms is relatively weak (made precise shortly). For example, g(x, y) = (1/2)(x − y)² gives an unbiased estimator of the variance. (Exercise) 50 / 80

51 Assume that g is bounded, i.e. ‖g‖_∞ := sup_{x,y} |g(x, y)| ≤ b, meaning |g(x, y)| ≤ b for all x, y ∈ R. Writing U = f(X₁, ..., X_n), we observe that (for fixed k, comparing x with x′ that differs from x only in coordinate k) |f(x) − f(x′)| ≤ (n choose 2)^{−1} Σ_{i ≠ k} |g(x_i, x_k) − g(x_i, x_k′)| ≤ (n − 1)·2b / (n(n − 1)/2) = 4b/n, thus f has bounded differences with parameters L_k = 4b/n. Applying Theorem 8, P(|U − EU| ≥ t) ≤ 2 e^{−nt²/(8b²)}, t ≥ 0. 51 / 80
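
A small sketch of the bounded U-statistic example, using g(x, y) = (1/2)(x − y)² with data in [0, 1] so that b = 1/2 (the distribution, sample sizes, and number of replications are arbitrary choices); the observed spread of U is compared with the sub-Gaussian scale 2b/√n implied by the bounded-difference bound.

```python
import numpy as np

rng = np.random.default_rng(5)
reps, b = 1000, 0.5   # |g(x, y)| <= b for x, y in [0, 1]

def u_stat(x):
    # U = (n choose 2)^{-1} * sum_{i<j} 0.5 * (x_i - x_j)^2  (unbiased variance estimator)
    n = len(x)
    diffs = x[:, None] - x[None, :]
    return 0.5 * np.sum(np.triu(diffs**2, k=1)) / (n * (n - 1) / 2)

for n in (50, 200, 800):
    U = np.array([u_stat(rng.uniform(0, 1, n)) for _ in range(reps)])
    print(f"n = {n:>4}: sd(U) = {U.std():.4f}   sub-Gaussian scale 2b/sqrt(n) = {2 * b / np.sqrt(n):.4f}")
```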

52 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 52 / 80

53 Clique number of Erdős–Rényi graphs. Let G be an undirected graph on n nodes. A clique in G is a complete (induced) subgraph. The clique number of G, denoted ω(G), is the size of the largest clique(s). For two graphs G and G′ that differ in at most one edge, |ω(G) − ω(G′)| ≤ 1. Thus, as a function of the edge indicators E(G), ω(G) has the bounded difference property with L = 1. Let G be an Erdős–Rényi random graph: edges are independently drawn with probability p. Then, with m = (n choose 2), P(|ω(G) − E ω(G)| ≥ δ) ≤ 2e^{−2δ²/m}, or, setting ω̄(G) = ω(G)/m, P(|ω̄(G) − E ω̄(G)| ≥ δ) ≤ 2e^{−2mδ²}. 53 / 80

54 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 54 / 80

55 Lipschitz functions of a standard Gaussian vector. A function f : R^n → R is L-Lipschitz w.r.t. ‖·‖₂ if |f(x) − f(y)| ≤ L‖x − y‖₂ for all x, y ∈ R^n. Theorem 9 (Gaussian concentration): Let X ~ N(0, I_n) be a standard Gaussian vector and assume that f : R^n → R is L-Lipschitz w.r.t. the Euclidean norm. Then P(|f(X) − E[f(X)]| ≥ t) ≤ 2 exp(−t²/(2L²)), t ≥ 0. (5) In other words, f(X) is sub-Gaussian with parameter L. A deep result with no easy proof! It has far-reaching consequences. One-sided bounds hold with the prefactor 2 removed. 55 / 80

56 Example: χ² and norm concentrations revisited. Let X ~ N(0, I_n) and consider the function f(x) = ‖x‖₂/√n. f is L-Lipschitz with L = 1/√n. Hence P(‖X‖₂/√n − E‖X‖₂/√n ≥ t) ≤ e^{−nt²/2}, t ≥ 0. Since E‖X‖₂ ≤ √n (why?), we have P(‖X‖₂/√n ≥ 1 + t) ≤ e^{−nt²/2}, t ≥ 0. For t ∈ [0, 1], (1 + t)² ≤ 1 + 3t, hence P(‖X‖₂²/n ≥ 1 + 3t) ≤ e^{−nt²/2}, t ∈ [0, 1], or, setting 3t = δ, P(‖X‖₂²/n ≥ 1 + δ) ≤ e^{−nδ²/18}, δ ∈ [0, 3]. 56 / 80

57 Example: order statistics. Let X ~ N(0, I_n), and let f(x) = x_(k) be the kth order statistic: for x ∈ R^n, x_(1) ≤ x_(2) ≤ ⋯ ≤ x_(n). For any x, y ∈ R^n we have |x_(k) − y_(k)| ≤ ‖x − y‖₂, hence f is 1-Lipschitz. (Exercise) It follows that P(|X_(k) − EX_(k)| ≥ t) ≤ 2e^{−t²/2}, t ≥ 0. In particular, if X_i ~ iid N(0, 1), i = 1, ..., n, then P(|max_{i=1,...,n} X_i − E[max_{i=1,...,n} X_i]| ≥ t) ≤ 2e^{−t²/2}, t ≥ 0. 57 / 80

58 Example: singular values. Consider a matrix X ∈ R^{n×d} where n > d. Let σ₁(X) ≥ σ₂(X) ≥ ⋯ ≥ σ_d(X) be the (ordered) singular values of X. By Weyl's theorem, for any X, Y ∈ R^{n×d}: |σ_k(X) − σ_k(Y)| ≤ ‖X − Y‖_op ≤ ‖X − Y‖_F. (Note that this is a generalization of the order-statistics inequality.) Thus X ↦ σ_k(X) is 1-Lipschitz. Proposition 6: Let X ∈ R^{n×d} be a random matrix with iid N(0, 1) entries. Then P(|σ_k(X) − E[σ_k(X)]| ≥ δ) ≤ 2e^{−δ²/2}, δ ≥ 0. It remains to characterize E[σ_k(X)]. For an overview of matrix norms, see matrix norms.pdf. 58 / 80
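
A quick simulation of Proposition 6 (a sketch; n, d, and the number of draws are arbitrary choices): the largest singular value of an n × d standard Gaussian matrix fluctuates on an O(1) scale around its mean, which is known to be roughly √n + √d.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, reps = 500, 100, 1000

# largest singular value of each draw (np.linalg.svd returns singular values in decreasing order)
sigma1 = np.array([np.linalg.svd(rng.standard_normal((n, d)), compute_uv=False)[0]
                   for _ in range(reps)])

print(f"mean of sigma_1: {sigma1.mean():.2f}   (sqrt(n) + sqrt(d) = {np.sqrt(n) + np.sqrt(d):.2f})")
print(f"sd of sigma_1  : {sigma1.std():.3f}   (Proposition 6 guarantees O(1) fluctuations)")
```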

59 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 59 / 80

60 Table of Contents: 1 Concentration inequalities — Sub-Gaussian concentration (Hoeffding inequality); Sub-exponential concentration (Bernstein inequality); Applications of Bernstein inequality; χ² concentration; Johnson–Lindenstrauss embedding; ℓ₂ norm concentration; ℓ∞ norm; Bounded difference inequality (Azuma–Hoeffding); Concentration of (bounded) U-statistics; Concentration of clique numbers; Gaussian concentration; Gaussian chaos (Hanson–Wright inequality). 2 Sparse linear models. 60 / 80

61 Linear regression setup. The data is (y, X) where y ∈ R^n and X ∈ R^{n×d}, and the model is y = Xθ* + w, where θ* ∈ R^d is an unknown parameter and w ∈ R^n is the vector of noise variables. Equivalently, y_i = ⟨θ*, x_i⟩ + w_i, i = 1, ..., n, where x_i ∈ R^d is the ith row of X, i.e., X is the n × d matrix with rows x₁ᵀ, x₂ᵀ, ..., x_nᵀ. Recall ⟨θ*, x_i⟩ = Σ_{j=1}^d θ*_j x_ij. 61 / 80

62 Sparsity models. When n < d, there is no hope of estimating θ*, unless we impose some sort of low-dimensional model on θ*. Support of θ* (recall [d] = {1, ..., d}): supp(θ*) := S(θ*) = { j ∈ [d] : θ*_j ≠ 0 }. Hard sparsity assumption: s = |S(θ*)| ≪ d. Weaker sparsity assumption via ℓ_q balls for q ∈ [0, 1]: B_q(R_q) = { θ ∈ R^d : Σ_{j=1}^d |θ_j|^q ≤ R_q }. q = 1 gives the ℓ₁ ball; q = 0 gives the ℓ₀ ball, same as hard sparsity: ‖θ‖₀ := |S(θ)| = #{ j : θ_j ≠ 0 }. 62 / 80

63 (from HDS book) 63 / 80

64 Basis pursuit. Consider the noiseless case y = Xθ*. We assume that ‖θ*‖₀ is small. Ideal program to solve: min_{θ ∈ R^d} ‖θ‖₀ subject to y = Xθ. ‖·‖₀ is highly non-convex; relax to ‖·‖₁: min_{θ ∈ R^d} ‖θ‖₁ subject to y = Xθ. (6) This is called basis pursuit (regression). (6) is a convex program; in fact, it can be written as a linear program.¹ Global solutions can be obtained very efficiently. ¹Exercise: introduce auxiliary variables s_j ∈ R and note that minimizing Σ_j s_j subject to |θ_j| ≤ s_j gives the ℓ₁ norm of θ. 64 / 80
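
Following the footnote, a minimal sketch of basis pursuit (6) as a linear program, solved here with scipy.optimize.linprog (the problem sizes and the way the sparse θ* is generated are arbitrary choices).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, d, s = 40, 100, 3

X = rng.standard_normal((n, d)) / np.sqrt(n)
theta_star = np.zeros(d)
theta_star[rng.choice(d, s, replace=False)] = rng.standard_normal(s)
y = X @ theta_star                          # noiseless measurements

# Variables v = (theta, u) in R^{2d}: minimize sum(u) s.t. -u <= theta <= u and X theta = y.
c = np.concatenate([np.zeros(d), np.ones(d)])
I = np.eye(d)
A_ub = np.block([[I, -I], [-I, -I]])        # theta - u <= 0 and -theta - u <= 0
b_ub = np.zeros(2 * d)
A_eq = np.hstack([X, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * d), method="highs")

theta_hat = res.x[:d]
print("recovery error:", np.linalg.norm(theta_hat - theta_star))
```

For a Gaussian design with s small relative to n, the recovery error should be at numerical-tolerance level, in line with the restricted null space discussion that follows.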

65 Define C(S) = { Δ ∈ R^d : ‖Δ_{S^c}‖₁ ≤ ‖Δ_S‖₁ }. (7) Theorem 10: The following two are equivalent: (i) For any θ* ∈ R^d with support S, the basis pursuit program (6) applied to the data (y = Xθ*, X) has the unique solution θ̂ = θ*. (ii) The restricted null space property holds, i.e., C(S) ∩ ker(X) = {0}. (8) 65 / 80

66 Proof: Consider the tangent cone to the ℓ₁ ball (of radius ‖θ*‖₁) at θ*: T(θ*) = { Δ ∈ R^d : ‖θ* + tΔ‖₁ ≤ ‖θ*‖₁ for some t > 0 }, i.e., the set of descent directions for the ℓ₁ norm at the point θ*. The feasible set is θ* + ker(X), i.e. ker(X) is the set of feasible directions Δ = θ − θ*. Hence, there is a minimizer other than θ* if and only if T(θ*) ∩ ker(X) ≠ {0}. (9) It is enough to show that C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). 66 / 80

67 [Figure: the ℓ₁ ball B₁ with ker(X), the cone C(S), and the tangent cones T(θ_(1)) and T(θ_(2)).] Here d = 2, [d] = {1, 2}, S = {2}, θ_(1) = (0, 1), θ_(2) = (0, −1), and C(S) = {(Δ₁, Δ₂) : |Δ₁| ≤ |Δ₂|}. 67 / 80

68 It is enough to show that C(S) = ∪_{θ* ∈ R^d : supp(θ*) ⊆ S} T(θ*). (10) (Let T₁(θ*) be the subset of T(θ*) where t = 1.) We have Δ ∈ T₁(θ*) iff ‖Δ_{S^c}‖₁ ≤ ‖θ*_S‖₁ − ‖θ*_S + Δ_S‖₁. We have Δ ∈ T₁(θ*) for some θ* ∈ R^d s.t. supp(θ*) ⊆ S iff ‖Δ_{S^c}‖₁ ≤ sup_{θ*_S} [‖θ*_S‖₁ − ‖θ*_S + Δ_S‖₁] = ‖Δ_S‖₁. 68 / 80

69 Sufficient conditions for restricted nullspace. [d] := {1, ..., d}. For a matrix X ∈ R^{n×d}, let X_j be its jth column (for j ∈ [d]). The pairwise incoherence of X is defined as δ_PW(X) := max_{i,j ∈ [d]} |⟨X_i, X_j⟩/n − 1{i = j}|. Alternative form: XᵀX is the Gram matrix of X, (XᵀX)_ij = ⟨X_i, X_j⟩, so δ_PW(X) = ‖XᵀX/n − I_d‖_∞, where ‖·‖_∞ is the elementwise (vector) ℓ∞ norm of the matrix. 69 / 80

70 Proposition 7: The (uniform) restricted nullspace property holds for all S with |S| ≤ s if δ_PW(X) ≤ 1/(3s). Proof: Exercise. 70 / 80

71 A more relaxed condition: Definition 3 (RIP): X ∈ R^{n×d} satisfies a restricted isometry property (RIP) of order s with constant δ_s(X) > 0 if ‖X_SᵀX_S/n − I_s‖_op ≤ δ_s(X), for all S with |S| ≤ s. PW incoherence is close to RIP with s = 2; for example, when ‖X_j/√n‖₂ = 1 for all j, we have δ₂(X) = δ_PW(X). In general, for any s ≥ 2, δ_PW(X) ≤ δ_s(X) ≤ s δ_PW(X). 71 / 80

72 RIP gives sufficient conditions: Proposition 8: The (uniform) restricted null space property holds for all S with |S| ≤ s if δ_2s(X) ≤ 1/3. Consider a sub-Gaussian matrix X with i.i.d. entries (Exercise 7.7): we have δ_PW(X) < 1/(3s) w.h.p. whenever n ≳ s² log d. By contrast, for certain classes of random matrices we have δ_2s < 1/3 whenever n ≳ s log(ed/s). 72 / 80

73 Noisy sparse regression. A very popular estimator is the ℓ₁-regularized least-squares (Lasso): θ̂ ∈ argmin_{θ ∈ R^d} [ (1/2n)‖y − Xθ‖₂² + λ‖θ‖₁ ]. (11) The idea: minimizing the ℓ₁ norm leads to sparse solutions. (11) is a convex program; a global solution can be obtained efficiently. Other options: the constrained form of the Lasso and relaxed basis pursuit: min_{‖θ‖₁ ≤ R} (1/2n)‖y − Xθ‖₂², (12) min_{θ ∈ R^d} ‖θ‖₁ s.t. (1/2n)‖y − Xθ‖₂² ≤ b². (13) 73 / 80
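
A minimal sketch of the Lagrangian Lasso (11) using scikit-learn, whose objective (1/(2·n_samples))‖y − Xθ‖₂² + α‖θ‖₁ matches (11) with α playing the role of λ; the data-generating choices and the value of λ below (of the order used in the fixed-design example later) are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, d, s, sigma = 200, 500, 5, 0.5

X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(2 * np.log(d) / n)          # lambda of order sigma * sqrt(log d / n)
fit = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

print("estimation error :", np.linalg.norm(fit.coef_ - theta_star))
print("estimated support:", np.flatnonzero(fit.coef_))
```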

74 For a constant α ≥ 1, C_α(S) := { Δ ∈ R^d : ‖Δ_{S^c}‖₁ ≤ α‖Δ_S‖₁ }. Definition 4 (RE condition): A matrix X satisfies the restricted eigenvalue (RE) condition over S with parameters (κ, α) if (1/n)‖XΔ‖₂² ≥ κ‖Δ‖₂² for all Δ ∈ C_α(S). Intuition: θ̂ minimizes L(θ) := (1/2n)‖Xθ − y‖₂². Ideally, δL := L(θ̂) − L(θ*) is small. We want to translate deviations in the loss into deviations in the parameter, θ̂ − θ*. This is controlled by the curvature of the loss, captured by the Hessian ∇²L(θ) = (1/n)XᵀX. 74 / 80

75 Ideally we would like strong convexity (in all directions): ⟨Δ, ∇²L(θ)Δ⟩ ≥ κ‖Δ‖₂² for all Δ ∈ R^d \ {0}, or, in the context of regression, (1/n)‖XΔ‖₂² ≥ κ‖Δ‖₂² for all Δ ∈ R^d \ {0}. In high dimensions we cannot guarantee this in all directions: the loss is flat over ker X. 75 / 80

76 Theorem 11: Assume that y = Xθ* + w, where X ∈ R^{n×d} and θ* ∈ R^d, θ* is supported on S ⊆ [d] with |S| ≤ s, and X satisfies RE(κ, 3) over S. Let us define z = Xᵀw/n and γ² := ‖w‖₂²/(2n). Then we have the following: (a) Any solution of the Lasso (11) with λ ≥ 2‖z‖_∞ satisfies ‖θ̂ − θ*‖₂ ≤ (3/κ)√s λ. (b) Any solution of the constrained Lasso (12) with R = ‖θ*‖₁ satisfies ‖θ̂ − θ*‖₂ ≤ (4/κ)√s ‖z‖_∞. (c) Any solution of relaxed basis pursuit (13) with b² ≥ γ² satisfies ‖θ̂ − θ*‖₂ ≤ (4/κ)√s ‖z‖_∞ + (2/√κ)√(b² − γ²). 76 / 80

77 Example (fixed design regression). Assume y = Xθ* + w where w ~ N(0, σ²I_n), and X ∈ R^{n×d} is fixed, satisfying the RE condition and the column normalization max_{j=1,...,d} ‖X_j‖₂/√n ≤ C, where X_j is the jth column of X. Recall z = Xᵀw/n. It is easy to show that w.p. ≥ 1 − 2e^{−nδ²/2}, ‖z‖_∞ ≤ Cσ(√(2 log d / n) + δ). Thus, setting λ = 2Cσ(√(2 log d / n) + δ), the Lasso solution satisfies, w.p. at least 1 − 2e^{−nδ²/2}, ‖θ̂ − θ*‖₂ ≤ (6Cσ/κ)√s (√(2 log d / n) + δ). 77 / 80
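
A quick numerical check of the ‖z‖_∞ bound on this slide (a sketch; the design, dimensions, and noise level are arbitrary choices): for a fixed X with normalized columns (C = 1) and w ~ N(0, σ²I_n), compare the 95th percentile of ‖Xᵀw/n‖_∞ over independent noise draws with Cσ(√(2 log d / n) + δ), where δ is chosen so that 2 exp(−nδ²/2) = 0.05.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, sigma, reps = 200, 1000, 1.0, 2000

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=0) / np.sqrt(n)    # normalize columns so ||X_j||_2 / sqrt(n) = 1

W = sigma * rng.standard_normal((reps, n))     # independent draws of the noise vector w
z_inf = np.abs(W @ X / n).max(axis=1)          # ||X^T w / n||_inf for each draw

delta = np.sqrt(2 * np.log(2 / 0.05) / n)      # so that 2 exp(-n delta^2 / 2) = 0.05
bound = sigma * (np.sqrt(2 * np.log(d) / n) + delta)

print(f"95th percentile of ||z||_inf        : {np.quantile(z_inf, 0.95):.4f}")
print(f"C sigma (sqrt(2 log d / n) + delta) : {bound:.4f}")
```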

78 Proof: Let us simplify the loss L(θ) := (1/2n)‖Xθ − y‖₂². Setting Δ = θ − θ*, L(θ) = (1/2n)‖X(θ − θ*) − w‖₂² = (1/2n)‖XΔ − w‖₂² = (1/2n)‖XΔ‖₂² − (1/n)⟨XΔ, w⟩ + const. = (1/2n)‖XΔ‖₂² − (1/n)⟨Δ, Xᵀw⟩ + const. = (1/2n)‖XΔ‖₂² − ⟨Δ, z⟩ + const., where z = Xᵀw/n. Hence, L(θ) − L(θ*) = (1/2n)‖XΔ‖₂² − ⟨Δ, z⟩. (14) Exercise: Show that (14) is the Taylor expansion of L around θ*. 78 / 80

79 Proof (constrained version): By optimality of θ̂ and feasibility of θ*: L(θ̂) ≤ L(θ*). The error vector Δ̂ := θ̂ − θ* satisfies the basic inequality (1/2n)‖XΔ̂‖₂² ≤ ⟨z, Δ̂⟩. Using Hölder's inequality, (1/2n)‖XΔ̂‖₂² ≤ ‖z‖_∞ ‖Δ̂‖₁. Since ‖θ̂‖₁ ≤ ‖θ*‖₁, we have Δ̂ = θ̂ − θ* ∈ C₁(S), hence ‖Δ̂‖₁ = ‖Δ̂_S‖₁ + ‖Δ̂_{S^c}‖₁ ≤ 2‖Δ̂_S‖₁ ≤ 2√s ‖Δ̂‖₂. Combined with the RE condition (Δ̂ ∈ C₃(S) as well), (1/2)κ‖Δ̂‖₂² ≤ 2√s ‖z‖_∞ ‖Δ̂‖₂, which gives the desired result. 79 / 80

80 Proof (Lagrangian version): Let L̃(θ) := L(θ) + λ‖θ‖₁ be the regularized loss. The basic inequality is L(θ̂) + λ‖θ̂‖₁ ≤ L(θ*) + λ‖θ*‖₁. Rearranging, (1/2n)‖XΔ̂‖₂² ≤ ⟨z, Δ̂⟩ + λ(‖θ*‖₁ − ‖θ̂‖₁). We have ‖θ*‖₁ − ‖θ̂‖₁ = ‖θ*_S‖₁ − ‖θ*_S + Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁ ≤ ‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁. Since λ ≥ 2‖z‖_∞, (1/n)‖XΔ̂‖₂² ≤ λ‖Δ̂‖₁ + 2λ(‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁) = λ(3‖Δ̂_S‖₁ − ‖Δ̂_{S^c}‖₁). It follows that Δ̂ ∈ C₃(S), and the rest of the proof follows. 80 / 80


More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example

Example continued. Math 425 Intro to Probability Lecture 37. Example continued. Example continued : Coin tossing Math 425 Intro to Probability Lecture 37 Kenneth Harris kaharri@umich.edu Department of Mathematics University of Michigan April 8, 2009 Consider a Bernoulli trials process with

More information

Robust Principal Component Analysis

Robust Principal Component Analysis ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M

More information

11.1 Set Cover ILP formulation of set cover Deterministic rounding

11.1 Set Cover ILP formulation of set cover Deterministic rounding CS787: Advanced Algorithms Lecture 11: Randomized Rounding, Concentration Bounds In this lecture we will see some more examples of approximation algorithms based on LP relaxations. This time we will use

More information

EE514A Information Theory I Fall 2013

EE514A Information Theory I Fall 2013 EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/

More information

High dimensional ising model selection using l 1 -regularized logistic regression

High dimensional ising model selection using l 1 -regularized logistic regression High dimensional ising model selection using l 1 -regularized logistic regression 1 Department of Statistics Pennsylvania State University 597 Presentation 2016 1/29 Outline Introduction 1 Introduction

More information

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Invertibility of random matrices

Invertibility of random matrices University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]

More information

Hoeffding, Chernoff, Bennet, and Bernstein Bounds

Hoeffding, Chernoff, Bennet, and Bernstein Bounds Stat 928: Statistical Learning Theory Lecture: 6 Hoeffding, Chernoff, Bennet, Bernstein Bounds Instructor: Sham Kakade 1 Hoeffding s Bound We say X is a sub-gaussian rom variable if it has quadratically

More information

Sparse PCA in High Dimensions

Sparse PCA in High Dimensions Sparse PCA in High Dimensions Jing Lei, Department of Statistics, Carnegie Mellon Workshop on Big Data and Differential Privacy Simons Institute, Dec, 2013 (Based on joint work with V. Q. Vu, J. Cho, and

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

OXPORD UNIVERSITY PRESS

OXPORD UNIVERSITY PRESS Concentration Inequalities A Nonasymptotic Theory of Independence STEPHANE BOUCHERON GABOR LUGOSI PASCAL MASS ART OXPORD UNIVERSITY PRESS CONTENTS 1 Introduction 1 1.1 Sums of Independent Random Variables

More information

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2

n! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2 Order statistics Ex. 4. (*. Let independent variables X,..., X n have U(0, distribution. Show that for every x (0,, we have P ( X ( < x and P ( X (n > x as n. Ex. 4.2 (**. By using induction or otherwise,

More information