Concentration inequalities


1 Concentration inequalities
Jean-Yves Audibert (1,2)
1. Imagine - ENPC/CSTB - Université Paris Est
2. Willow (INRIA/ENS/CNRS)
ThRaSH 2010

2 Problem
Tight upper and lower bounds on $f(X_1,\dots,X_n)$, with $X_1,\dots,X_n$ i.i.d. random variables taking their values in some (measurable) space $\mathcal{X}$, and $f : \mathcal{X}^n \to \mathbb{R}$ a function whose value depends on all the variables but not too much on any of them.
For example:
$$f(X_1,\dots,X_n) = \frac{X_1+\dots+X_n}{n} \quad\text{or}\quad f(X_1,\dots,X_n) = \sup_{g\in\mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{n}$$

3 Outline
Asymptotic viewpoint
Nonasymptotic Gaussian approximation
Gaussian processes
Sum of i.i.d. r.v.
Functions with bounded differences
Self-bounding functions

4 The asymptotic viewpoint
What is the limit of $f(X_1,\dots,X_n)$?
What is the limit of its centered and scaled version:
$$\frac{f(X_1,\dots,X_n) - \mathbb{E} f(X_1,\dots,X_n)}{\sqrt{\mathrm{Var}\, f(X_1,\dots,X_n)}}\;?$$

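5 Convergence of random variables
Convergence in distribution: $W_n \xrightarrow[n\to+\infty]{d} W$
$\Leftrightarrow$ for all $t \in \mathbb{R}$ s.t. $F_W$ is continuous at $t$, $F_{W_n}(t) \to F_W(t)$ as $n \to +\infty$
$\Leftrightarrow$ for all $f : \mathbb{R} \to \mathbb{R}$ continuous and bounded, $\mathbb{E} f(W_n) \to \mathbb{E} f(W)$ as $n \to +\infty$
$\Leftrightarrow$ for all $t \in \mathbb{R}$, $\mathbb{E} e^{itW_n} \to \mathbb{E} e^{itW}$ as $n \to +\infty$ (with $i^2 = -1$)
Convergence in probability: $W_n \xrightarrow[n\to+\infty]{\mathbb{P}} W$ $\Leftrightarrow$ for all $\varepsilon > 0$, $\mathbb{P}(|W_n - W| \ge \varepsilon) \to 0$ as $n \to +\infty$
Almost sure convergence: $W_n \xrightarrow[n\to+\infty]{a.s.} W$ $\Leftrightarrow$ $\mathbb{P}(W_n \to W) = 1$
Almost sure cvg $\Rightarrow$ cvg in probability $\Rightarrow$ cvg in distribution
If for all $\varepsilon > 0$, $\sum_{n \ge 1} \mathbb{P}(|W_n - W| > \varepsilon) < +\infty$, then $W_n \xrightarrow[n\to+\infty]{a.s.} W$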

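6 Convergence of the empirical mean
$$f(X_1,\dots,X_n) = \frac{X_1+\dots+X_n}{n}$$
LLN (1713): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E}|X| < +\infty$, then
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow[n\to+\infty]{a.s.} \mathbb{E}X$$
CLT (1733): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E}X^2 < +\infty$, then
$$\sqrt{n}\,(\bar X - \mathbb{E}X) \xrightarrow[n\to+\infty]{d} \mathcal{N}(0, \mathrm{Var}\,X),$$
or equivalently: for any $t$,
$$\mathbb{P}\Big\{ \sqrt{\tfrac{n}{\mathrm{Var}\,X}}\,(\bar X - \mathbb{E}X) > t \Big\} \xrightarrow[n\to+\infty]{} \int_t^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du.$$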

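7 Slutsky's lemma (1925)
Let $(V_n)$ and $(W_n)$ be two sequences of random vectors or variables.
If $V_n \xrightarrow[n\to+\infty]{\mathbb{P}} v$ and $W_n \xrightarrow[n\to+\infty]{d} W$, then
1. $V_n + W_n \xrightarrow{d} v + W$
2. $V_n W_n \xrightarrow{d} vW$
3. $V_n^{-1} W_n \xrightarrow{d} v^{-1} W$ if $v$ is invertible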

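8 An example of a complicated functional: the t-statistic
Let
$$f(X_1,\dots,X_n) = \frac{\sqrt{n}\,(\bar X - \mathbb{E}X)}{S} \quad\text{with}\quad S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$
Since $S^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}X)^2 - (\mathbb{E}X - \bar X)^2$, from the LLN, we have $S^2 \xrightarrow[n\to+\infty]{a.s.} \mathrm{Var}\,X$.
From the CLT, $\sqrt{n}\,(\bar X - \mathbb{E}X) \xrightarrow[n\to+\infty]{d} \mathcal{N}(0, \mathrm{Var}\,X)$.
Thus, from Slutsky's lemma, $f(X_1,\dots,X_n) \xrightarrow[n\to+\infty]{d} \mathcal{N}(0, 1)$.
Appropriate decompositions of complicated functionals allow one to compute their asymptotic distribution.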
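As a rough numerical illustration of the t-statistic limit, the sketch below simulates uniform samples on [0, 1] and counts how often the absolute t-statistic exceeds 1.96, which should happen about 5% of the time if the N(0, 1) limit is a good approximation. The function names, the uniform distribution, the sample sizes and the seed are all choices made for this example, not part of the slides.

```python
import math
import random


def t_statistic(sample, true_mean):
    """sqrt(n) * (mean - true_mean) / S, with S^2 normalized by 1/n."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n
    return math.sqrt(n) * (mean - true_mean) / math.sqrt(s2)


def rejection_rate(n, replications, threshold=1.96, seed=0):
    """Fraction of simulated uniform(0, 1) samples with |t| > threshold."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        if abs(t_statistic(sample, 0.5)) > threshold:
            hits += 1
    return hits / replications


if __name__ == "__main__":
    print(rejection_rate(n=100, replications=2000))
```

The printed rate is only an estimate: it fluctuates with the seed and carries a small finite-sample bias, which is exactly what the nonasymptotic bounds later in the deck are meant to control.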

9 Nonasymptotic bounds
Motivations:
When the nonasymptotic regime plays a crucial role (for instance, multi-armed bandit problems, racing algorithms, stopping time problems)
When asymptotic analysis is not achievable through standard arguments
To derive asymptotic results!

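10 The Berry (1941)-Esseen (1942) theorem
$X, X_1, \dots, X_n$ i.i.d. with $\mathbb{E}|X|^3 < +\infty$ and $\sigma^2 = \mathrm{Var}\,X$
$\bar X = \frac{X_1+\dots+X_n}{n}$
$Z \sim \mathcal{N}(\mathbb{E}\bar X, \mathrm{Var}\,\bar X)$
$$\sup_{x \in \mathbb{R}} \big| \mathbb{P}(\bar X > x) - \mathbb{P}(Z > x) \big| \le \frac{\mathbb{E}|X - \mathbb{E}X|^3}{2\sigma^3\sqrt{n}}$$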
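To make the Berry-Esseen statement concrete, here is a small sketch that computes the exact Kolmogorov-type gap between the law of the empirical mean of Bernoulli(p) variables and its Gaussian approximation, and compares it with the bound $\mathbb{E}|X-\mathbb{E}X|^3 / (2\sigma^3\sqrt{n})$. The Bernoulli setting and all function names are assumptions chosen for illustration.

```python
import math


def normal_sf(x, mean, std):
    """P(Z > x) for Z ~ N(mean, std^2)."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))


def berry_esseen_gap(n, p):
    """Exact sup_x |P(Xbar > x) - P(Z > x)| for n i.i.d. Bernoulli(p)."""
    pmf = [math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    std = math.sqrt(p * (1.0 - p) / n)
    gap = 0.0
    tail = 1.0  # P(B >= k), starting from k = 0
    for k in range(n + 1):
        z = normal_sf(k / n, p, std)
        # The step function jumps at x = k/n: check the left limit P(B >= k)
        # and the value P(B > k) against the Gaussian tail at that point.
        gap = max(gap, abs(tail - z))
        tail -= pmf[k]
        gap = max(gap, abs(tail - z))
    return gap


def berry_esseen_bound(n, p):
    """E|X - EX|^3 / (2 sigma^3 sqrt(n)) for X ~ Bernoulli(p)."""
    third_moment = p * (1.0 - p) * ((1.0 - p) ** 2 + p**2)
    sigma = math.sqrt(p * (1.0 - p))
    return third_moment / (2.0 * sigma**3 * math.sqrt(n))


if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, berry_esseen_gap(n, 0.3), berry_esseen_bound(n, 0.3))
```

Because the tail of the empirical mean is a step function and the Gaussian tail is monotone, it is enough to check the gap at the jump points, which is what the loop does.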

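11 Slud's theorem (1977)
$X_1, \dots, X_n$ i.i.d. $\mathcal{B}(p)$ with $p \le \frac{1}{2}$
$Z \sim \mathcal{N}(\mathbb{E}\bar X, \mathrm{Var}\,\bar X)$
For any $x \in [p, 1-p]$,
$$\mathbb{P}(\bar X > x) \ge \mathbb{P}(Z > x)$$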

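12 The Paley-Zygmund inequality (1932)
$X_1, \dots, X_n$ i.i.d.
For any $0 \le \lambda < 1$,
$$\mathbb{P}\Big( \frac{\sqrt{n}\,|\bar X - \mathbb{E}X|}{\sqrt{\mathrm{Var}\,X}} > \lambda \Big) \ge (1-\lambda^2)^2 \min\Big( \frac{1}{3}, \frac{(\mathrm{Var}\,X)^2}{\mathbb{E}(X - \mathbb{E}X)^4} \Big).$$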
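One way to see where the constants $1/3$ and the fourth moment come from is the following sketch of a derivation, assuming the general Paley-Zygmund form $\mathbb{P}(V > \theta\,\mathbb{E}V) \ge (1-\theta)^2 (\mathbb{E}V)^2 / \mathbb{E}V^2$ for a nonnegative r.v. $V$ and $0 \le \theta < 1$. The symbols $Y$, $V$, $\sigma^2$ and $\mu_4$ are introduced here for illustration only.

```latex
\begin{align*}
Y &= \sum_{i=1}^n (X_i - \mathbb{E}X), \qquad
\sigma^2 = \mathrm{Var}\,X, \qquad
\mu_4 = \mathbb{E}(X - \mathbb{E}X)^4, \\
\mathbb{E}Y^2 &= n\sigma^2, \qquad
\mathbb{E}Y^4 = n\mu_4 + 3n(n-1)\sigma^4, \\
\mathbb{P}\big(Y^2 > \lambda^2 n\sigma^2\big)
 &\ge (1-\lambda^2)^2 \frac{(\mathbb{E}Y^2)^2}{\mathbb{E}Y^4}
 = \frac{(1-\lambda^2)^2}{\frac{1}{n}\cdot\frac{\mu_4}{\sigma^4} + \big(1-\frac{1}{n}\big)\cdot 3} \\
 &\ge (1-\lambda^2)^2 \min\Big(\frac{1}{3}, \frac{\sigma^4}{\mu_4}\Big).
\end{align*}
```

The last step uses that the denominator is a convex combination of $\mu_4/\sigma^4$ and $3$, hence at most their maximum, and the event $\{Y^2 > \lambda^2 n \sigma^2\}$ is the event $\{\sqrt{n}\,|\bar X - \mathbb{E}X| / \sigma > \lambda\}$.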

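13 Supremum of Gaussian processes (GP)
Gaussian process $(W(g))_{g \in \mathcal{G}}$: for any $g_1, \dots, g_d \in \mathcal{G}$, $\big(W(g_1), \dots, W(g_d)\big)$ is a Gaussian random vector
GP: a powerful flexible probabilistic model parametrized by $\mu(g) = \mathbb{E}W(g)$ and $K(g, g') = \mathrm{Cov}\big(W(g), W(g')\big)$
Good intuition on GP $\Rightarrow$ good intuition on $\sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{n}$:
$$\sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{n} \approx \sup_{g \in \mathcal{G}} W(g)$$
with $\mu(g) = \mathbb{E}g(X)$ and $K(g, g') = \frac{1}{n}\mathrm{Cov}\big(g(X), g'(X)\big)$.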

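14 The Borell (1975) - Cirel'son et al. (1976) inequality
$Z = \sup_{g \in \mathcal{G}} \{ W(g) - \mathbb{E}W(g) \}$
$\sigma^2 = \sup_{g \in \mathcal{G}} \mathrm{Var}\,W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
For any $\lambda \in \mathbb{R}$,
$$\log \mathbb{E} e^{\lambda(Z - \mathbb{E}Z)} \le \frac{\lambda^2\sigma^2}{2}$$
For any $t > 0$,
$$\mathbb{P}(Z - \mathbb{E}Z \ge t) \le e^{-\frac{t^2}{2\sigma^2}}$$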

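15 Dudley's integral (1967)
$d(g, g') = \sqrt{\mathbb{E}[W(g) - W(g')]^2}$
$N(\varepsilon)$ = $\varepsilon$-packing number of $(\mathcal{G}, d)$
$\sigma^2 = \sup_{g \in \mathcal{G}} \mathrm{Var}\,W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
$$\mathbb{E} \sup_{g \in \mathcal{G}} \{ W(g) - \mathbb{E}W(g) \} \le 12 \int_0^{\sigma} \sqrt{\log N(\varepsilon)}\,d\varepsilon$$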

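16 Another Borell (1975) - Cirel'son et al. (1976) inequality
$X_1, \dots, X_n$ i.i.d. $\mathcal{N}(0, 1)$
$f : \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz for the Euclidean distance: for any $x, x'$ in $\mathbb{R}^n$, $|f(x) - f(x')| \le L \|x - x'\|$
For any $t > 0$,
$$\mathbb{P}\big( f(X_1,\dots,X_n) - \mathbb{E}f(X_1,\dots,X_n) \ge t \big) \le e^{-\frac{t^2}{2L^2}}.$$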

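17 Some useful probabilistic inequalities
Markov's inequality: for any r.v. $X$ and $a > 0$, since $|X| \ge a\,\mathbf{1}_{|X| \ge a}$,
$$\mathbb{P}(|X| \ge a) \le \frac{1}{a}\mathbb{E}|X|.$$
Jensen's ineq.: for any integrable r.v. $X$ and $\varphi : \mathbb{R}^d \to \mathbb{R}$ convex, $\varphi(\mathbb{E}X) \le \mathbb{E}\varphi(X)$.
For any r.v. $X$, $\mathbb{E}X \le \int_0^{+\infty} \mathbb{P}(X \ge t)\,dt$ (with equality if $X \ge 0$)
Markov's inequality is at the basis of Chernoff's argument: for all $s > 0$,
$$\mathbb{P}(X \ge t) = \mathbb{P}\big( e^{sX} \ge e^{st} \big) \le e^{-st}\,\mathbb{E}e^{sX}.$$
Control of the Laplace transform $\Rightarrow$ control of the large deviations.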
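As a worked instance of Chernoff's argument, consider the case (chosen here purely for illustration) of a centered Gaussian variable with variance $\sigma^2$, whose Laplace transform is known in closed form. Optimizing the free parameter $s$ gives the familiar Gaussian tail bound.

```latex
\begin{align*}
X \sim \mathcal{N}(0, \sigma^2)
 &\;\Rightarrow\; \mathbb{E}e^{sX} = e^{s^2\sigma^2/2}, \\
\mathbb{P}(X \ge t)
 &\le \inf_{s > 0} e^{-st + s^2\sigma^2/2}
 = e^{-\frac{t^2}{2\sigma^2}}
 \qquad \text{(minimum at } s = t/\sigma^2,\ t > 0\text{)}.
\end{align*}
```

The same two steps, bounding the Laplace transform and then optimizing over $s$, are the template for the Hoeffding and Bernstein bounds.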

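18 Hoeffding's inequality (1963)
If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then
1. For any $s \in \mathbb{R}$,
$$\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$$
2. For any $t \ge 0$,
$$\mathbb{P}\big( \bar X - \mathbb{E}X \ge t \big) \le e^{-\frac{2nt^2}{(b-a)^2}},$$
or equivalently, for any $\varepsilon > 0$,
$$\mathbb{P}\Big( \bar X - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}} \Big) \ge 1 - \varepsilon,$$
i.e., w.h.p. $\bar X - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}}$.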
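The two equivalent forms of Hoeffding's bound (tail probability for a given deviation, deviation for a given confidence level) can be sketched as a pair of small helpers. The function names are hypothetical, and the comparison against an exact Bernoulli tail is just one convenient sanity check.

```python
import math


def hoeffding_tail(n, t, a=0.0, b=1.0):
    """Upper bound exp(-2 n t^2 / (b - a)^2) on P(Xbar - EX >= t)."""
    return math.exp(-2.0 * n * t * t / (b - a) ** 2)


def hoeffding_radius(n, eps, a=0.0, b=1.0):
    """Deviation (b - a) * sqrt(log(1/eps) / (2 n)) valid with proba >= 1 - eps."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))


def bernoulli_mean_upper_tail(n, p, t):
    """Exact P(Xbar - p >= t) for n i.i.d. Bernoulli(p) variables."""
    # Small tolerance so that floating-point noise does not shift the threshold.
    k_min = max(math.ceil(n * (p + t) - 1e-9), 0)
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(k_min, n + 1)
    )


if __name__ == "__main__":
    n, p, t = 100, 0.5, 0.1
    print("exact tail     :", bernoulli_mean_upper_tail(n, p, t))
    print("Hoeffding bound:", hoeffding_tail(n, t))
    print("radius at 95%  :", hoeffding_radius(n, 0.05))
```

Plugging the radius back into the tail bound returns exactly the chosen level, which is the sense in which the two statements are equivalent.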

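19 Log-Laplace upper bound
1. For any $s \in \mathbb{R}$, $\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$
$\varphi(s) = \log \mathbb{E}e^{sX}$
$\varphi'(s) = \mathbb{E}_{P_s} X$ with $P_s(d\omega) = \frac{e^{sX(\omega)}}{\mathbb{E}e^{sX}}\,\mathbb{P}(d\omega)$
$\varphi''(s) = \mathrm{Var}_{P_s} X$
$$\mathrm{Var}_{P_s} X = \inf_{r \in \mathbb{R}} \mathbb{E}_{P_s}(X - r)^2 \le \mathbb{E}_{P_s}\Big(X - \frac{a+b}{2}\Big)^2 \le \frac{(b-a)^2}{4}.$$
$$\varphi(s) = \varphi(0) + s\varphi'(0) + \int_0^s (s-t)\varphi''(t)\,dt$$
$$\log \mathbb{E}e^{sX} \le s\mathbb{E}X + \int_0^s (s-t)\frac{(b-a)^2}{4}\,dt = s\mathbb{E}X + \frac{(b-a)^2 s^2}{8}$$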

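20 Chernoff's argument
2. For any $t \ge 0$, $\mathbb{P}\big( \bar X - \mathbb{E}X > t \big) \le e^{-\frac{2nt^2}{(b-a)^2}}$.
For any $s > 0$,
$$\mathbb{P}(\bar X - \mathbb{E}X \ge t) = \mathbb{P}\big( e^{s(\bar X - \mathbb{E}X)} \ge e^{st} \big) \le e^{-st}\,\mathbb{E}\big[e^{s(\bar X - \mathbb{E}X)}\big]$$
$$= e^{-st}\,\mathbb{E}\Big( e^{\frac{s}{n}\sum_{i=1}^n (X_i - \mathbb{E}X)} \Big) = e^{-st} \Big( \mathbb{E}e^{\frac{s}{n}(X - \mathbb{E}X)} \Big)^n \le e^{-st + \frac{s^2(b-a)^2}{8n}} = e^{-\frac{2nt^2}{(b-a)^2}}$$
by choosing $s = \frac{4nt}{(b-a)^2}$.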

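21 Union bound
$\mathbb{P}(A) \ge 1 - \varepsilon$ and $\mathbb{P}(B) \ge 1 - \varepsilon$ $\Rightarrow$ $\mathbb{P}(A \cap B) \ge 1 - 2\varepsilon$ (since $\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(A^c) + \mathbb{P}(B^c)$)
For instance: Hoeffding to $X$ + Hoeffding to $-X$ + union bound $\Rightarrow$ with proba at least $1 - \varepsilon$,
$$|\bar X - \mathbb{E}X| < (b-a)\sqrt{\frac{\log(2\varepsilon^{-1})}{2n}}$$
(leads to pessimistic but correct confidence intervals unlike the CLT)
If $\mathbb{P}(A_1) \ge 1 - \varepsilon, \dots, \mathbb{P}(A_m) \ge 1 - \varepsilon$, then $\mathbb{P}(A_1 \cap \dots \cap A_m) \ge 1 - m\varepsilon$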
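The two-sided interval obtained from Hoeffding plus the union bound is easy to compute from data. The helper below is a minimal sketch; its name and the toy sample are made up for this example, and the caller is assumed to know the range [a, b] of the variables.

```python
import math


def two_sided_hoeffding_interval(sample, eps, a, b):
    """Interval containing EX with proba >= 1 - eps for i.i.d. [a, b]-valued data.

    The level eps is split into eps/2 per side (union bound), which is why
    log(2 / eps) appears instead of log(1 / eps).
    """
    n = len(sample)
    mean = sum(sample) / n
    radius = (b - a) * math.sqrt(math.log(2.0 / eps) / (2.0 * n))
    return mean - radius, mean + radius


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 0, 1, 1]  # toy sample of {0, 1}-valued observations
    print(two_sided_hoeffding_interval(data, eps=0.05, a=0.0, b=1.0))
```

With only eight observations the interval is very wide, which illustrates the "pessimistic but correct" remark: the guarantee holds for every n, at the price of conservative widths.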

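22 Bernstein's (1946) inequality
Hoeffding's inequality vs CLT: with $Z \sim \mathcal{N}(0, 1)$,
$$\mathbb{P}\Big[ \sqrt{\tfrac{n}{\mathrm{Var}\,X}}\,(\bar X - \mathbb{E}X) > \alpha \Big] \le e^{-\frac{2\alpha^2\,\mathrm{Var}\,X}{(b-a)^2}} \quad\text{(Hoeffding)}$$
$$\mathbb{P}\Big[ \sqrt{\tfrac{n}{\mathrm{Var}\,X}}\,(\bar X - \mathbb{E}X) > \alpha \Big] \xrightarrow[n\to+\infty]{} \mathbb{P}(Z > \alpha) \le \frac{e^{-\alpha^2/2}}{\alpha\sqrt{2\pi}} \quad\text{(CLT)}$$
$\Rightarrow$ Hoeffding's inequality is imprecise for r.v. having low variance
Bernstein's inequality: If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $X - \mathbb{E}X \le c$, then
for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\bar X \le \mathbb{E}X + \sqrt{\frac{2\log(\varepsilon^{-1})\,\mathrm{Var}\,X}{n}} + \frac{c\,\log(\varepsilon^{-1})}{3n}$$
for any $t \ge 0$,
$$\mathbb{P}\big( \bar X - \mathbb{E}X > t \big) \le e^{-\frac{nt^2}{2\,\mathrm{Var}\,X + 2ct/3}}$$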
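To see how much Bernstein's inequality gains over Hoeffding's for low-variance variables, the sketch below evaluates both deviation terms for a Bernoulli(0.01) variable, for which Var X = 0.0099 and X - EX <= 0.99. The function names and the numerical setting are assumptions made for this example.

```python
import math


def hoeffding_radius(n, eps, a, b):
    """(b - a) * sqrt(log(1/eps) / (2 n))."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))


def bernstein_radius(n, eps, var, c):
    """sqrt(2 log(1/eps) var / n) + c log(1/eps) / (3 n)."""
    log_term = math.log(1.0 / eps)
    return math.sqrt(2.0 * log_term * var / n) + c * log_term / (3.0 * n)


if __name__ == "__main__":
    n, eps, p = 1000, 0.05, 0.01
    var = p * (1.0 - p)  # variance of a Bernoulli(p) variable
    c = 1.0 - p          # X - EX <= 1 - p for a Bernoulli(p) variable
    print("Hoeffding:", hoeffding_radius(n, eps, 0.0, 1.0))
    print("Bernstein:", bernstein_radius(n, eps, var, c))
```

In this regime the variance term dominates Bernstein's radius and is several times smaller than Hoeffding's, whose width only depends on the range b - a.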

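23 Empirical Bernstein's inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)
If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\mathbb{E}X \le \bar X + \sqrt{\frac{2\log(\varepsilon^{-1})\,\hat\sigma^2}{n}} + \frac{7(b-a)\log(\varepsilon^{-1})}{3n}$$
with $\hat\sigma^2$ the empirical variance of the sample, computed from $\sum_{i=1}^n (X_i - \bar X)^2$
(to be compared with $\mathbb{E}X \le \bar X + \sqrt{\frac{2\log(\varepsilon^{-1})\,\mathrm{Var}\,X}{n}} + \frac{(b-a)\log(\varepsilon^{-1})}{3n}$)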
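The point of the empirical version is that every quantity on the right-hand side can be computed from the sample. The sketch below takes the displayed formula at face value; the $1/n$ normalization of the empirical variance is an assumption made here (the published statements differ in such constants), and the function name is hypothetical.

```python
import math


def empirical_bernstein_upper(sample, eps, a, b):
    """Data-dependent upper bound on EX in the style of empirical Bernstein.

    ASSUMPTION: the empirical variance is normalized by 1/n; the exact
    normalization and constants depend on which published version is used.
    """
    n = len(sample)
    mean = sum(sample) / n
    var_hat = sum((x - mean) ** 2 for x in sample) / n
    log_term = math.log(1.0 / eps)
    variance_part = math.sqrt(2.0 * log_term * var_hat / n)
    range_part = 7.0 * (b - a) * log_term / (3.0 * n)
    return mean + variance_part + range_part


if __name__ == "__main__":
    data = [0.48, 0.52, 0.50, 0.49, 0.51] * 40  # low-variance toy sample in [0, 1]
    print(empirical_bernstein_upper(data, eps=0.05, a=0.0, b=1.0))
```

For a low-variance sample the bound is driven by the $1/n$ range term rather than by a $1/\sqrt{n}$ term scaled by the full range, which is the practical advantage over Hoeffding's interval.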

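24 Hoeffding-Azuma inequalities (McDiarmid's version, 1989)
If for some $c \ge 0$,
$$\sup_{\substack{i \in \{1,\dots,n\} \\ (x_1,\dots,x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \big[ f(x_1,\dots,x_n) - f(x_1,\dots,x_{i-1}, x, x_{i+1},\dots,x_n) \big] \le c,$$
then, for any $\lambda \in \mathbb{R}$, $W = f(X_1,\dots,X_n)$ satisfies
$$\mathbb{E}e^{\lambda(W - \mathbb{E}W)} \le e^{\frac{n\lambda^2 c^2}{8}}$$
and for any $t \ge 0$,
$$\mathbb{P}\big( W - \mathbb{E}W > t \big) \le e^{-\frac{2t^2}{nc^2}}$$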
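The bounded-differences bound depends only on the number of variables n and the constant c, so it can be wrapped in two small helpers. This is a sketch with hypothetical function names; the check that c = (b - a)/n recovers Hoeffding's bound for the empirical mean is an illustration added here.

```python
import math


def mcdiarmid_tail(n, c, t):
    """Upper bound exp(-2 t^2 / (n c^2)) on P(W - EW > t)."""
    return math.exp(-2.0 * t * t / (n * c * c))


def mcdiarmid_radius(n, c, eps):
    """Deviation c * sqrt(n log(1/eps) / 2) not exceeded with proba >= 1 - eps."""
    return c * math.sqrt(n * math.log(1.0 / eps) / 2.0)


if __name__ == "__main__":
    n, a, b, t = 50, 0.0, 1.0, 0.2
    # Changing one of n points in [a, b] moves the empirical mean by at most (b - a)/n.
    c = (b - a) / n
    print("McDiarmid:", mcdiarmid_tail(n, c, t))
    print("Hoeffding:", math.exp(-2.0 * n * t * t / (b - a) ** 2))
```

Both printed values agree, since the empirical mean is the simplest function with bounded differences.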

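25 First example: Hoeffding's inequality in Hilbert space
$X_1, \dots, X_n$ i.i.d. r.v. taking values in a separable Hilbert space
$\mathbb{E}X = 0$ and $\|X\| \le 1$
For any $t \ge 4$,
$$\mathbb{P}\Big( \Big\| \frac{X_1+\dots+X_n}{\sqrt{n}} \Big\| \ge t \Big) \le e^{-\frac{t^2}{8}}.$$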

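26 Second example: supremum of empirical process
$$W = f(X_1,\dots,X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{\sqrt{n}}, \qquad \mathcal{G} \text{ finite}$$
Assumptions: for all $g \in \mathcal{G}$, $g$ takes its values in $[-1, 1]$ and $\mathbb{E}g(X_1) = 0$
$$\sup_{\substack{i \in \{1,\dots,n\} \\ (x_1,\dots,x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \big[ f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x, x_{i+1}^n) \big] \le \frac{2}{\sqrt{n}},$$
McDiarmid's inequality $\Rightarrow$ $\mathbb{P}\big( W - \mathbb{E}W > t \big) \le e^{-t^2/2}$
$\Rightarrow$ with proba at least $1 - \varepsilon$,
$$\sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{\sqrt{n}} \le \mathbb{E}\sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{\sqrt{n}} + \sqrt{2\log(\varepsilon^{-1})}$$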

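27 Third example: kernel density estimation
$X_1, \dots, X_n$ i.i.d. r.v. from a distribution with density $p$ on $\mathbb{R}$.
$h > 0$ and $K : \mathbb{R} \to \mathbb{R}_+$ with $\int_{\mathbb{R}} K = 1$
$$\hat p(x) = \frac{1}{nh}\sum_{i=1}^n K\Big( \frac{x - X_i}{h} \Big)$$
$$W = f(X_1,\dots,X_n) = \int |\hat p(x) - p(x)|\,dx$$
$$\big| f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n) \big| \le \frac{1}{nh}\int \Big| K\Big( \frac{x - x_i}{h} \Big) - K\Big( \frac{x - x_i'}{h} \Big) \Big|\,dx \le \frac{2}{n},$$
so, by McDiarmid's inequality, with proba at least $1 - \varepsilon$,
$$W - \mathbb{E}W \le \sqrt{\frac{2\log(\varepsilon^{-1})}{n}}$$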

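28 Self-bounded functions (Boucheron, Lugosi, Massart, 2003, 2009; Maurer, 2005)
$$f_i(x_1,\dots,x_n) = \inf_{x_i \in \mathcal{X}} f(x_1,\dots,x_n)$$
If for some $a, b \ge 0$, for any $(x_1,\dots,x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^n \big[ f(x_1,\dots,x_n) - f_i(x_1,\dots,x_n) \big]^2 \le a f(x_1,\dots,x_n) + b,$$
then, for any $t \ge 0$, $W = f(X_1,\dots,X_n)$ satisfies
$$\mathbb{P}\big( W - \mathbb{E}W > t \big) \le e^{-\frac{t^2}{2(a\mathbb{E}W + b + at/2)}}$$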

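29 Talagrand's inequality (Talagrand, 1996; Rio, 2002; Bousquet, 2003)
$$W = \sup_{g \in \mathcal{G}} \frac{g(X_1)+\dots+g(X_n)}{n}$$
$\mathbb{E}g(X) = 0$ and $g(x) \le c$
$v = \sup_{g \in \mathcal{G}} \mathrm{Var}\,g(X) + 2c\,\mathbb{E}W$
For any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$W \le \mathbb{E}W + \sqrt{\frac{2v\log(\varepsilon^{-1})}{n}} + \frac{c\,\log(\varepsilon^{-1})}{3n}$$
For any $t \ge 0$,
$$\mathbb{P}\big( W - \mathbb{E}W > t \big) \le e^{-\frac{nt^2}{2v + 2ct/3}}$$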

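30 Expected maximal deviations
Let $\sigma > 0$, $m \ge 2$, and $W_1, \dots, W_m$ be r.v. s.t. for all $s > 0$ and any $1 \le i \le m$, $\mathbb{E}e^{sW_i} \le e^{\frac{s^2\sigma^2}{2}}$. Then
$$\mathbb{E}\Big\{ \max_{1 \le i \le m} W_i \Big\} \le \sigma\sqrt{2\log m}.$$
If for any $s > 0$, we also have $\mathbb{E}e^{-sW_i} \le e^{\frac{s^2\sigma^2}{2}}$, then
$$\mathbb{E}\Big\{ \max_{1 \le i \le m} |W_i| \Big\} \le \sigma\sqrt{2\log(2m)}.$$
Proof: for any $s > 0$, by Jensen's inequality,
$$\mathbb{E}\max_{1 \le i \le m} W_i \le \mathbb{E}\,\frac{1}{s}\log\sum_{i=1}^m e^{sW_i} \le \frac{1}{s}\log\sum_{i=1}^m \mathbb{E}e^{sW_i} \le \frac{1}{s}\log\big( m e^{s^2\sigma^2/2} \big),$$
and the first claim follows by choosing $s = \sqrt{2\log m}/\sigma$. The second claim follows by applying the first to the $2m$ variables $W_1, -W_1, \dots, W_m, -W_m$.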
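The choice of $s$ at the end of the proof can be spelled out as a one-line optimization. The function $\psi$ below is notation introduced only for this worked step.

```latex
\begin{align*}
\psi(s) &= \frac{1}{s}\log\big( m\, e^{s^2\sigma^2/2} \big)
        = \frac{\log m}{s} + \frac{s\sigma^2}{2}, \\
\psi'(s) &= -\frac{\log m}{s^2} + \frac{\sigma^2}{2} = 0
        \iff s = \frac{\sqrt{2\log m}}{\sigma}, \\
\psi\Big( \frac{\sqrt{2\log m}}{\sigma} \Big)
        &= \frac{\sigma\log m}{\sqrt{2\log m}} + \frac{\sigma\sqrt{2\log m}}{2}
        = \sigma\sqrt{2\log m}.
\end{align*}
```

Replacing $m$ by $2m$ in the same computation gives the bound $\sigma\sqrt{2\log(2m)}$ for the maximum of the absolute values.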

31 Extension to martingale difference sequences
Let $X_1, X_2, \dots$ and $U_1, U_2, \dots$ be r.v. such that $\mathbb{E}[X_i \mid U_1, \dots, U_{i-1}] = 0$ for all $i \ge 1$
Assume that, for some r.v. $A_i$ measurable w.r.t. $U_1, \dots, U_{i-1}$, $X_i$ takes its values in $[A_i, A_i + 1]$. Then, for any $t \ge 0$,
$$\mathbb{P}(\bar X > t) \le e^{-2nt^2}$$
$\Rightarrow$ same r.h.s. as if we had i.i.d. r.v. taking values in $[0, 1]$

32 Other extensions
All upper bounds easily extend to independent, non-identically distributed r.v.
Some upper bounds on the empirical mean can be extended to random vectors
All upper bounds on the empirical mean are valid if the $X_i$ are samples without replacement

33 Some nice references:
Appendix of N. Cesa-Bianchi and G. Lugosi's book: Prediction, Learning, and Games
G. Lugosi's lecture notes on concentration inequalities
Boucheron, Lugosi, Massart (2003, 2009)
P. Massart, Saint Flour lecture notes
