CONCENTRATION INEQUALITIES


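MAXIM RAGINSKY

Date: January 24,

In the previous lecture, the following result was stated without proof. If $X_1, \ldots, X_n$ are independent Bernoulli($\theta$) random variables representing the outcomes of a sequence of $n$ tosses of a coin with bias (probability of heads) $\theta$, then for any $\varepsilon \in (0,1)$,

\[ \mathbb{P}\big( |\hat{\theta}_n - \theta| \ge \varepsilon \big) \le 2e^{-2n\varepsilon^2}, \tag{1} \]

where $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n X_i$ is the fraction of heads in $X^n = (X_1, \ldots, X_n)$. Since $\theta = \mathbb{E}\hat{\theta}_n$, (1) says that the sample (or empirical) average of the $X_i$'s concentrates sharply around the statistical average $\theta = \mathbb{E}X_1$. Bounds like these are fundamental in statistical learning theory. In the next few lectures, we will learn the techniques needed to derive such bounds for settings much more complicated than coin tossing. This is not meant to be a complete picture; more details and additional results can be found in the excellent survey by Boucheron et al. [BBL04].

1. The basic tools

We start with Markov's inequality: Let $X \in \mathbb{R}$ be a nonnegative random variable. Then for any $t > 0$ we have

\[ \mathbb{P}(X \ge t) \le \frac{\mathbb{E}X}{t}. \tag{2} \]

The proof is simple:

\[ \mathbb{P}(X \ge t) = \mathbb{E}\big[\mathbf{1}_{\{X \ge t\}}\big] \tag{3} \]
\[ \le \frac{\mathbb{E}\big[X \mathbf{1}_{\{X \ge t\}}\big]}{t} \tag{4} \]
\[ \le \frac{\mathbb{E}X}{t}, \tag{5} \]

where:

(3) uses the fact that the probability of an event can be expressed as the expectation of its indicator function:
\[ \mathbb{P}(X \in A) = \int_A P_X(dx) = \int_{\mathsf{X}} \mathbf{1}_{\{x \in A\}} P_X(dx) = \mathbb{E}\big[\mathbf{1}_{\{X \in A\}}\big]; \]

(4) uses the fact that
\[ X \ge t > 0 \implies \frac{X}{t} \ge 1; \]

(5) uses the fact that
\[ X \ge 0 \implies X \mathbf{1}_{\{X \ge t\}} \le X, \]
so consequently $\mathbb{E}\big[X \mathbf{1}_{\{X \ge t\}}\big] \le \mathbb{E}X$.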
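As a quick sanity check of Markov's inequality (2), the following sketch compares the exact upper tail of a binomial random variable with the bound $\mathbb{E}X/t$. The choice of a Binomial(20, 0.3) variable and the helper names `binomial_upper_tail` and `markov_bound` are illustrative assumptions, not part of the lecture.

```python
from math import comb

def binomial_pmf(n, theta, k):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def binomial_upper_tail(n, theta, t):
    """Exact P(X >= t) for X ~ Binomial(n, theta)."""
    return sum(binomial_pmf(n, theta, k) for k in range(n + 1) if k >= t)

def markov_bound(mean, t):
    """Right-hand side of (2): EX / t for a nonnegative X and t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return mean / t

# Illustrative parameters (assumed): 20 tosses with bias 0.3, so EX = 6.
n, theta = 20, 0.3
for t in (8, 12, 16):
    tail = binomial_upper_tail(n, theta, t)
    bound = markov_bound(n * theta, t)
    print(f"t={t:2d}  P(X>=t)={tail:.6f}  EX/t={bound:.4f}")
```

The bound is valid for every $t > 0$ but is typically loose, which is what motivates the sharper tools developed next.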

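Markov's inequality leads to our first bound on the probability that a random variable deviates from its expectation by more than a given amount: Chebyshev's inequality. Let $X$ be an arbitrary real random variable. Then for any $t > 0$

\[ \mathbb{P}\big( |X - \mathbb{E}X| \ge t \big) \le \frac{\operatorname{Var} X}{t^2}, \tag{6} \]

where $\operatorname{Var} X \triangleq \mathbb{E}\big[ |X - \mathbb{E}X|^2 \big] = \mathbb{E}X^2 - (\mathbb{E}X)^2$ is the variance of $X$. To prove (6), we apply Markov's inequality (2) to the nonnegative random variable $|X - \mathbb{E}X|^2$:

\[ \mathbb{P}\big( |X - \mathbb{E}X| \ge t \big) = \mathbb{P}\big( |X - \mathbb{E}X|^2 \ge t^2 \big) \tag{7} \]
\[ \le \frac{\mathbb{E}|X - \mathbb{E}X|^2}{t^2}, \tag{8} \]

where the first step uses the fact that the function $\varphi(x) = x^2$ is monotonically increasing on $[0, \infty)$, so that $a \ge b \ge 0$ if and only if $a^2 \ge b^2$.

Now let's apply these tools to the problem of bounding the probability that, for a coin with bias $\theta$, the fraction of heads in $n$ trials differs from $\theta$ by more than some $\varepsilon > 0$. To that end, let us represent the outcomes of the $n$ tosses by $n$ independent Bernoulli($\theta$) random variables $X_1, \ldots, X_n \in \{0, 1\}$, where $\mathbb{P}(X_i = 1) = \theta$ for all $i$. Let $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Then

\[ \mathbb{E}\hat{\theta}_n = \mathbb{E}\left[ \frac{1}{n}\sum_{i=1}^n X_i \right] = \frac{1}{n}\sum_{i=1}^n \underbrace{\mathbb{E}X_i}_{= \mathbb{P}(X_i = 1)} = \theta \]

and

\[ \operatorname{Var} \hat{\theta}_n = \operatorname{Var}\left( \frac{1}{n}\sum_{i=1}^n X_i \right) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var} X_i = \frac{\theta(1-\theta)}{n}, \]

where we have used the fact that the $X_i$'s are i.i.d., so $\operatorname{Var}(X_1 + \ldots + X_n) = \sum_{i=1}^n \operatorname{Var} X_i = n \operatorname{Var} X_1$. Now we are in a position to apply Chebyshev's inequality:

\[ \mathbb{P}\big( |\hat{\theta}_n - \theta| \ge \varepsilon \big) \le \frac{\operatorname{Var} \hat{\theta}_n}{\varepsilon^2} = \frac{\theta(1-\theta)}{n\varepsilon^2}. \tag{9} \]

At the very least, (9) shows that the probability of getting a bad sample decreases with sample size. Unfortunately, it does not decrease fast enough. To see why, we can appeal to the Central Limit Theorem, which roughly states that

\[ \mathbb{P}\left( \hat{\theta}_n - \theta \ge t\sqrt{\frac{\theta(1-\theta)}{n}} \right) \xrightarrow{n \to \infty} 1 - \Phi(t) \le \frac{1}{\sqrt{2\pi}} \frac{e^{-t^2/2}}{t}, \]

where $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx$ is the standard Gaussian CDF. This would suggest something like

\[ \mathbb{P}\big( \hat{\theta}_n - \theta \ge \varepsilon \big) \approx \exp\left( -\frac{n\varepsilon^2}{2\theta(1-\theta)} \right), \]

which decays with $n$ much faster than the right-hand side of (9).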
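To see how loose the Chebyshev bound (9) is, one can compare it with the exact deviation probability for a fair coin. The sketch below uses exact rational arithmetic so that boundary cases such as $|k/n - \theta| = \varepsilon$ are classified correctly; the parameter values and the function names are assumptions chosen for illustration only.

```python
from fractions import Fraction
from math import comb

def exact_deviation_prob(n, theta, eps):
    """Exact P(|S/n - theta| >= eps) for S ~ Binomial(n, theta), as a Fraction."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - theta) >= eps:
            total += comb(n, k) * theta**k * (1 - theta)**(n - k)
    return total

def chebyshev_bound(n, theta, eps):
    """Right-hand side of (9): theta (1 - theta) / (n eps^2)."""
    return theta * (1 - theta) / (n * eps**2)

# Assumed illustrative values: fair coin, deviation 1/10.
theta, eps = Fraction(1, 2), Fraction(1, 10)
for n in (50, 200, 800):
    exact = exact_deviation_prob(n, theta, eps)
    bound = chebyshev_bound(n, theta, eps)
    print(f"n={n:4d}  exact={float(exact):.3e}  Chebyshev={float(bound):.3e}")
```

The Chebyshev column shrinks like $1/n$, while the exact probability shrinks far faster, in line with the Gaussian heuristic above.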

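2. The Chernoff bounding trick and Hoeffding's inequality

To fix (9), we will use a very powerful technique, known as the Chernoff bounding trick [Che52]. Let $X$ be a nonnegative random variable. Suppose we are interested in bounding the probability $\mathbb{P}(X \ge t)$ for some particular $t > 0$. Observe that for any $s > 0$ we have

\[ \mathbb{P}(X \ge t) = \mathbb{P}\big( e^{sX} \ge e^{st} \big) \le e^{-st}\, \mathbb{E}\big[ e^{sX} \big], \tag{10} \]

where the first step is by monotonicity of the function $\varphi(x) = e^{sx}$ and the second step is by Markov's inequality (2). The Chernoff trick is to choose an $s > 0$ that would make the right-hand side of (10) suitably small. In fact, since (10) holds simultaneously for all $s > 0$, the optimal thing to do is to take

\[ \mathbb{P}(X \ge t) \le \inf_{s > 0} e^{-st}\, \mathbb{E}\big[ e^{sX} \big]. \]

However, often a good upper bound on the moment-generating function $\mathbb{E}\big[ e^{sX} \big]$ is enough. One such bound was developed by Hoeffding [Hoe63] for the case when $X$ is bounded with probability one:

Lemma 1 (Hoeffding). Let $X$ be a random variable with $\mathbb{E}X = 0$ and $\mathbb{P}(a \le X \le b) = 1$ for some $-\infty < a \le b < \infty$. Then for all $s > 0$

\[ \mathbb{E}\big[ e^{sX} \big] \le e^{s^2 (b-a)^2 / 8}. \tag{11} \]

Proof. The proof uses elementary calculus and convexity. First we note that the function $\varphi(x) = e^{sx}$ is convex on $\mathbb{R}$. Any $x \in [a, b]$ can be written as

\[ x = \frac{x - a}{b - a}\, b + \frac{b - x}{b - a}\, a. \]

Hence

\[ e^{sx} \le \frac{x - a}{b - a}\, e^{sb} + \frac{b - x}{b - a}\, e^{sa}. \]

Since $\mathbb{E}X = 0$, we have

\[ \mathbb{E}\big[ e^{sX} \big] \le \frac{b}{b - a}\, e^{sa} - \frac{a}{b - a}\, e^{sb} = \left( \frac{b}{b - a} - \frac{a}{b - a}\, e^{s(b-a)} \right) e^{sa}. \]

We have $s(b-a)$ in the exponent in the parentheses. To get the same thing in the $e^{sa}$ term multiplying the parentheses, we (with a bit of foresight) seek $\lambda$ such that $sa = -\lambda s(b-a)$, which gives us $\lambda = -a/(b-a)$. Then

\[ \left( \frac{b}{b - a} - \frac{a}{b - a}\, e^{s(b-a)} \right) e^{sa} = \big( 1 - \lambda + \lambda e^{s(b-a)} \big) e^{-\lambda s(b-a)}. \]

Now let $u = s(b-a)$, so we can write

\[ \mathbb{E}\big[ e^{sX} \big] \le \big( 1 - \lambda + \lambda e^{u} \big) e^{-\lambda u}. \tag{12} \]

Again with a bit of foresight, let us express the right-hand side of (12) as an exponential of a function of $u$:

\[ \big( 1 - \lambda + \lambda e^{u} \big) e^{-\lambda u} = e^{\varphi(u)}, \qquad \text{where } \varphi(u) = \log\big( 1 - \lambda + \lambda e^{u} \big) - \lambda u. \]

Now the whole affair hinges on us being able to show that $\varphi(u) \le u^2/8$ for any $u \ge 0$. To that end, we first note that $\varphi(0) = \varphi'(0) = 0$, and that $\varphi''(u) \le 1/4$ for all $u \ge 0$. (Indeed, $\varphi''(u) = p(1-p)$ with $p = \lambda e^{u} / (1 - \lambda + \lambda e^{u})$, and $p(1-p) \le 1/4$ for every real $p$.) Therefore, by Taylor's theorem we have

\[ \varphi(u) = \varphi(0) + \varphi'(0)\,u + \frac{1}{2}\varphi''(\alpha)\,u^2 \]

for some $\alpha \in [0, u]$, and we can upper-bound the right-hand side of the above expression by $u^2/8$. Thus,

\[ \mathbb{E}\big[ e^{sX} \big] \le e^{\varphi(u)} \le e^{u^2/8} = e^{s^2 (b-a)^2 / 8}, \]

which gives us (11).

We will now use the Chernoff method and the above lemma to prove the following

Theorem 1 (Hoeffding's inequality). Let $X_1, \ldots, X_n$ be independent random variables, such that $X_i \in [a_i, b_i]$ with probability one. Let $S_n \triangleq \sum_{i=1}^n X_i$. Then for any $t > 0$

\[ \mathbb{P}\big( S_n - \mathbb{E}S_n \ge t \big) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right); \tag{13} \]
\[ \mathbb{P}\big( S_n - \mathbb{E}S_n \le -t \big) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right). \tag{14} \]

Consequently,

\[ \mathbb{P}\big( |S_n - \mathbb{E}S_n| \ge t \big) \le 2\exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right). \tag{15} \]

Proof. By replacing each $X_i$ with $X_i - \mathbb{E}X_i$ (which does not change the widths $b_i - a_i$), we may as well assume that $\mathbb{E}X_i = 0$. Then $S_n = \sum_{i=1}^n X_i$. Using Chernoff's trick, we write

\[ \mathbb{P}(S_n \ge t) = \mathbb{P}\big( e^{sS_n} \ge e^{st} \big) \le e^{-st}\, \mathbb{E}\big[ e^{sS_n} \big]. \tag{16} \]

Since the $X_i$'s are independent,

\[ \mathbb{E}\big[ e^{sS_n} \big] = \mathbb{E}\big[ e^{s(X_1 + \ldots + X_n)} \big] = \mathbb{E}\left[ \prod_{i=1}^n e^{sX_i} \right] = \prod_{i=1}^n \mathbb{E}\big[ e^{sX_i} \big]. \tag{17} \]

Since $X_i \in [a_i, b_i]$, we can apply Lemma 1 to write $\mathbb{E}\big[ e^{sX_i} \big] \le e^{s^2 (b_i - a_i)^2 / 8}$. Substituting this into (17) and (16), we obtain

\[ \mathbb{P}(S_n \ge t) \le e^{-st} \prod_{i=1}^n e^{s^2 (b_i - a_i)^2 / 8} = \exp\left( -st + \frac{s^2}{8} \sum_{i=1}^n (b_i - a_i)^2 \right). \]

If we choose $s = \dfrac{4t}{\sum_{i=1}^n (b_i - a_i)^2}$, then we obtain (13). The proof of (14) is similar.

Now we will apply Hoeffding's inequality to improve our crude concentration bound (9) for the sum of $n$ independent Bernoulli($\theta$) random variables, $X_1, \ldots, X_n$. Since each $X_i \in \{0, 1\}$, we can apply Theorem 1 to get, for any $t > 0$,

\[ \mathbb{P}\left( \left| \sum_{i=1}^n X_i - n\theta \right| \ge t \right) \le 2e^{-2t^2/n}. \]

Therefore,

\[ \mathbb{P}\big( |\hat{\theta}_n - \theta| \ge \varepsilon \big) = \mathbb{P}\left( \left| \sum_{i=1}^n X_i - n\theta \right| \ge n\varepsilon \right) \le 2e^{-2n\varepsilon^2}, \]

which gives us the claimed bound (1).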
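The quantities in Theorem 1 are easy to evaluate numerically. The sketch below, under assumed helper names, computes the Chernoff exponent from the proof, the two-sided bound (15), the coin-tossing bound (1), and the sample size at which (1) drops below a target level $\delta$. The last function simply inverts $2e^{-2n\varepsilon^2} \le \delta$; it is an illustration, not a statement from the lecture.

```python
from math import ceil, exp, log

def chernoff_exponent(s, t, widths):
    """Exponent -s t + (s^2 / 8) * sum (b_i - a_i)^2 from the proof of Theorem 1."""
    return -s * t + s**2 / 8.0 * sum(w**2 for w in widths)

def hoeffding_two_sided(t, widths):
    """Right-hand side of (15); widths[i] stands for b_i - a_i."""
    return 2.0 * exp(-2.0 * t**2 / sum(w**2 for w in widths))

def coin_bound(n, eps):
    """Right-hand side of (1): 2 exp(-2 n eps^2)."""
    return 2.0 * exp(-2.0 * n * eps**2)

def sample_size(eps, delta):
    """Smallest n with coin_bound(n, eps) <= delta, assuming 0 < delta < 2."""
    return ceil(log(2.0 / delta) / (2.0 * eps**2))

# Assumed illustrative values: n = 100 tosses, deviation t = n * eps with eps = 0.1.
widths = [1.0] * 100
t = 10.0
s_star = 4.0 * t / sum(w**2 for w in widths)  # the choice of s made in the proof
print(chernoff_exponent(s_star, t, widths), -2.0 * t**2 / 100.0)
print(hoeffding_two_sided(t, widths), coin_bound(100, 0.1))
print(sample_size(0.1, 0.05))
```

Evaluating `chernoff_exponent` at values of `s` slightly above or below `s_star` gives a larger exponent, which is one way to see why the proof picks $s = 4t/\sum_i (b_i - a_i)^2$.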

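3. From bounded variables to bounded differences: McDiarmid's inequality

Hoeffding's inequality applies to sums of independent random variables. We will now develop its generalization, due to McDiarmid [McD89], to arbitrary real-valued functions of independent random variables that satisfy a certain condition. Let $\mathsf{X}$ be some set, and consider a function $g : \mathsf{X}^n \to \mathbb{R}$. We say that $g$ has bounded differences if there exist nonnegative numbers $c_1, \ldots, c_n$, such that

\[ \sup_{x_1, \ldots, x_n, x_i' \in \mathsf{X}} \big| g(x_1, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \big| \le c_i \tag{18} \]

for all $i = 1, \ldots, n$. In words, if we change the $i$th variable while keeping all the others fixed, the value of $g$ will not change by more than $c_i$.

Theorem 2 (McDiarmid's inequality [McD89]). Let $X^n = (X_1, \ldots, X_n) \in \mathsf{X}^n$ be an $n$-tuple of independent $\mathsf{X}$-valued random variables. If a function $g : \mathsf{X}^n \to \mathbb{R}$ has bounded differences, as in (18), then, for all $t > 0$,

\[ \mathbb{P}\big( g(X^n) - \mathbb{E}g(X^n) \ge t \big) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right); \tag{19} \]
\[ \mathbb{P}\big( \mathbb{E}g(X^n) - g(X^n) \ge t \big) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right). \tag{20} \]

Proof. Let me first sketch the general idea behind the proof. Let $Z = g(X^n)$ and $V = Z - \mathbb{E}Z = g(X^n) - \mathbb{E}g(X^n)$. The first step will be to write $V$ as a sum $\sum_{i=1}^n V_i$, where the terms $V_i$ are constructed so that:

(1) $V_i$ is a function only of $X^i = (X_1, \ldots, X_i)$.

(2) There exists a function $\Psi_i : \mathsf{X}^{i-1} \to \mathbb{R}$ such that, conditionally on $X^{i-1}$,
\[ \Psi_i(X^{i-1}) \le V_i \le \Psi_i(X^{i-1}) + c_i. \]

Provided we can arrange things in this way, we can apply Lemma 1 to $V_i$ conditionally on $X^{i-1}$:

\[ \mathbb{E}\big[ e^{sV_i} \,\big|\, X^{i-1} \big] \le e^{s^2 c_i^2 / 8}. \tag{21} \]

Then, using Chernoff's method, we have

\[ \mathbb{P}(Z - \mathbb{E}Z \ge t) = \mathbb{P}(V \ge t) \le e^{-st}\, \mathbb{E}\big[ e^{sV} \big] = e^{-st}\, \mathbb{E}\Big[ e^{s \sum_{i=1}^{n} V_i} \Big] \]
\[ = e^{-st}\, \mathbb{E}\Big[ e^{s \sum_{i=1}^{n-1} V_i}\, e^{sV_n} \Big] \]
\[ = e^{-st}\, \mathbb{E}\Big[ e^{s \sum_{i=1}^{n-1} V_i}\, \mathbb{E}\big[ e^{sV_n} \,\big|\, X^{n-1} \big] \Big] \]
\[ \le e^{-st}\, e^{s^2 c_n^2 / 8}\, \mathbb{E}\Big[ e^{s \sum_{i=1}^{n-1} V_i} \Big], \]

where in the next-to-last step we used the fact that $V_1, \ldots, V_{n-1}$ depend only on $X^{n-1}$, and in the last step we used (21) with $i = n$. If we continue peeling off the terms involving $V_{n-1}, V_{n-2}, \ldots, V_1$, we will get

\[ \mathbb{P}(Z - \mathbb{E}Z \ge t) \le \exp\left( -st + \frac{s^2}{8} \sum_{i=1}^n c_i^2 \right). \]

Taking $s = 4t / \sum_{i=1}^n c_i^2$, we end up with (19).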

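It remains to construct the $V_i$'s with the desired properties. To that end, let

\[ H_i(X^i) = \mathbb{E}[Z \mid X^i] \qquad \text{and} \qquad V_i = H_i(X^i) - H_{i-1}(X^{i-1}), \]

with the convention $H_0 \equiv \mathbb{E}Z$. Then

\[ \sum_{i=1}^n V_i = \sum_{i=1}^n \big\{ \mathbb{E}[Z \mid X^i] - \mathbb{E}[Z \mid X^{i-1}] \big\} = \mathbb{E}[Z \mid X^n] - \mathbb{E}Z = Z - \mathbb{E}Z = V. \]

Note that $V_i$ depends only on $X^i$ by construction, and that $\mathbb{E}[V_i \mid X^{i-1}] = 0$ by the tower property of conditional expectation, as the zero-mean hypothesis of Lemma 1 requires. Moreover, let

\[ \Psi_i(X^{i-1}) = \inf_{x \in \mathsf{X}} \big\{ H_i(X^{i-1}, x) - H_{i-1}(X^{i-1}) \big\}, \]
\[ \Psi_i'(X^{i-1}) = \sup_{x \in \mathsf{X}} \big\{ H_i(X^{i-1}, x) - H_{i-1}(X^{i-1}) \big\}, \]

where, owing to the fact that the $X_i$'s are independent, we have

\[ H_i(X^{i-1}, x) = \mathbb{E}[Z \mid X^{i-1}, X_i = x] = \int g(X^{i-1}, x, x_{i+1}^n)\, P_{X_{i+1}^n}(dx_{i+1}^n), \]

with $x_{i+1}^n$ denoting the tuple $(x_{i+1}, \ldots, x_n)$. Then

\[ \Psi_i'(X^{i-1}) - \Psi_i(X^{i-1}) = \sup_{x \in \mathsf{X}} \big\{ H_i(X^{i-1}, x) - H_{i-1}(X^{i-1}) \big\} - \inf_{x \in \mathsf{X}} \big\{ H_i(X^{i-1}, x) - H_{i-1}(X^{i-1}) \big\} \]
\[ = \sup_{x \in \mathsf{X}} \sup_{x' \in \mathsf{X}} \big| H_i(X^{i-1}, x) - H_i(X^{i-1}, x') \big| \]
\[ = \sup_{x \in \mathsf{X}} \sup_{x' \in \mathsf{X}} \big| \mathbb{E}[Z \mid X^{i-1}, X_i = x] - \mathbb{E}[Z \mid X^{i-1}, X_i = x'] \big| \]
\[ = \sup_{x \in \mathsf{X}} \sup_{x' \in \mathsf{X}} \left| \int \big[ g(X^{i-1}, x, x_{i+1}^n) - g(X^{i-1}, x', x_{i+1}^n) \big]\, P_{X_{i+1}^n}(dx_{i+1}^n) \right| \]
\[ \le \sup_{x \in \mathsf{X}} \sup_{x' \in \mathsf{X}} \int \big| g(X^{i-1}, x, x_{i+1}^n) - g(X^{i-1}, x', x_{i+1}^n) \big|\, P_{X_{i+1}^n}(dx_{i+1}^n) \]
\[ \le c_i, \]

where the last step follows from the bounded difference property. Thus, we can write $\Psi_i'(X^{i-1}) \le \Psi_i(X^{i-1}) + c_i$, which implies that, indeed,

\[ \Psi_i(X^{i-1}) \le V_i \le \Psi_i(X^{i-1}) + c_i \]

conditionally on $X^{i-1}$.

4. McDiarmid's inequality in action

McDiarmid's inequality is an extremely powerful and often used tool in statistical learning theory. We will now discuss several examples of its use. To that end, we will first introduce some notation and definitions. Let $\mathsf{X}$ be some (measurable) space. If $Q$ is a probability distribution of an $\mathsf{X}$-valued random variable $X$, then we can compute the expectation of any (measurable) function $f : \mathsf{X} \to \mathbb{R}$ w.r.t. $Q$. So far, we have denoted this expectation by $\mathbb{E}f(X)$ or by $\mathbb{E}_Q f(X)$. We will often find it convenient to use an alternative notation, $Q(f)$.

Let $X^n = (X_1, \ldots, X_n)$ be $n$ independent identically distributed (i.i.d.) $\mathsf{X}$-valued random variables with common distribution $P$. The main object of interest to us is the empirical distribution induced by $X^n$, which we will denote by $P_{X^n}$. The empirical distribution assigns the probability $1/n$ to each $X_i$, i.e.,

\[ P_{X^n} = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}. \]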

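Here, $\delta_x$ denotes a unit mass concentrated at a point $x \in \mathsf{X}$, i.e., the probability distribution on $\mathsf{X}$ defined by

\[ \delta_x(A) = \mathbf{1}_{\{x \in A\}}, \qquad \forall \text{ measurable } A \subseteq \mathsf{X}. \]

We note the following important facts about $P_{X^n}$:

(1) Being a function of the sample $X^n$, $P_{X^n}$ is a random variable taking values in the space of probability distributions over $\mathsf{X}$.

(2) The probability of a set $A \subseteq \mathsf{X}$ under $P_{X^n}$,
\[ P_{X^n}(A) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i \in A\}}, \]
is the empirical frequency of the set $A$ on the sample $X^n$. The expectation of $P_{X^n}(A)$ is equal to $P(A)$, the $P$-probability of $A$. Indeed,
\[ \mathbb{E}\big[ P_{X^n}(A) \big] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i \in A\}} \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big[ \mathbf{1}_{\{X_i \in A\}} \big] = \frac{1}{n} \sum_{i=1}^n \mathbb{P}(X_i \in A) = P(A). \]

(3) Given a function $f : \mathsf{X} \to \mathbb{R}$, we can compute its expectation w.r.t. $P_{X^n}$:
\[ P_{X^n}(f) = \frac{1}{n} \sum_{i=1}^n f(X_i), \]
which is just the sample mean of $f$ on $X^n$. It is also referred to as the empirical expectation of $f$ on $X^n$. We have
\[ \mathbb{E}\big[ P_{X^n}(f) \big] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n f(X_i) \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}f(X_i) = \mathbb{E}f(X) = P(f). \]

We can now proceed to our examples.

4.1. Sums of bounded random variables. In the special case when $\mathsf{X} = \mathbb{R}$, $P$ is a probability distribution supported on a finite interval $[a, b]$, and $g(X^n)$ is the sum

\[ g(X^n) = \sum_{i=1}^n X_i, \]

McDiarmid's inequality simply reduces to Hoeffding's. Indeed, for any $x^n \in [a, b]^n$ and $x_i' \in [a, b]$ we have

\[ g(x^{i-1}, x_i, x_{i+1}^n) - g(x^{i-1}, x_i', x_{i+1}^n) = x_i - x_i' \le b - a. \]

Interchanging the roles of $x_i'$ and $x_i$, we get

\[ g(x^{i-1}, x_i', x_{i+1}^n) - g(x^{i-1}, x_i, x_{i+1}^n) = x_i' - x_i \le b - a. \]

Hence, we may apply Theorem 2 with $c_i = b - a$ for all $i$ to get

\[ \mathbb{P}\big( |g(X^n) - \mathbb{E}g(X^n)| \ge t \big) \le 2\exp\left( -\frac{2t^2}{n(b-a)^2} \right). \]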
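For a function on a small finite domain, the bounded-difference constants in (18) can be found by exhaustive search, which gives a concrete check of the claim $c_i = b - a$ for the sum. The sketch below is only an illustration: the finite grid standing in for $[a, b]$ and the function names are assumptions, and the brute-force search is feasible only for very small $n$.

```python
from itertools import product
from math import exp

def bounded_difference_constants(g, domain, n):
    """Brute-force the smallest c_i satisfying (18) over a finite domain."""
    c = [0.0] * n
    for x in product(domain, repeat=n):
        gx = g(x)
        for i in range(n):
            for alt in domain:
                # Replace only coordinate i, keep all the others fixed.
                y = x[:i] + (alt,) + x[i + 1:]
                c[i] = max(c[i], abs(gx - g(y)))
    return c

def mcdiarmid_bound(t, c):
    """One-sided bound (19): exp(-2 t^2 / sum c_i^2)."""
    return exp(-2.0 * t**2 / sum(ci**2 for ci in c))

# Assumed example: g is the sum, and {0, 0.5, 1} is a grid inside [a, b] = [0, 1].
domain = (0.0, 0.5, 1.0)
c = bounded_difference_constants(sum, domain, 3)
print(c)                            # each entry equals b - a
print(2.0 * mcdiarmid_bound(1.5, c))  # two-sided bound, as in Section 4.1
```

Passing a different `g`, such as the sample mean or the maximum, shows how the constants change with the function while the form of the bound stays the same.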

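4.2. Uniform deviations. Let $X_1, \ldots, X_n$ be $n$ i.i.d. $\mathsf{X}$-valued random variables with common distribution $P$. By the Law of Large Numbers, for any $A \subseteq \mathsf{X}$ and any $\varepsilon > 0$

\[ \lim_{n \to \infty} \mathbb{P}\big( |P_{X^n}(A) - P(A)| \ge \varepsilon \big) = 0. \]

In fact, we can use Hoeffding's inequality to show that

\[ \mathbb{P}\big( |P_{X^n}(A) - P(A)| \ge \varepsilon \big) \le 2e^{-2n\varepsilon^2}. \]

This probability bound holds for each $A$ separately. However, in learning theory we are often interested in the deviation of empirical frequencies from true probabilities simultaneously over some collection of the subsets of $\mathsf{X}$. To that end, let $\mathcal{A}$ be such a collection and consider the function

\[ g(X^n) \triangleq \sup_{A \in \mathcal{A}} \big| P_{X^n}(A) - P(A) \big|. \tag{22} \]

Later in the course we will see that, for certain choices of $\mathcal{A}$, $\mathbb{E}g(X^n) = O(1/\sqrt{n})$. However, regardless of what $\mathcal{A}$ is, it is easy to see that, by changing only one $X_i$, the value of $g(X^n)$ can change at most by $1/n$. Let $x^n = (x_1, \ldots, x_n)$, choose some other $x_i' \in \mathsf{X}$, and let $x^n_{(i)}$ denote $x^n$ with $x_i$ replaced by $x_i'$:

\[ x^n = (x^{i-1}, x_i, x_{i+1}^n), \qquad x^n_{(i)} = (x^{i-1}, x_i', x_{i+1}^n). \]

Then

\[ g(x^n) - g(x^n_{(i)}) = \sup_{A \in \mathcal{A}} \big| P_{x^n}(A) - P(A) \big| - \sup_{A' \in \mathcal{A}} \big| P_{x^n_{(i)}}(A') - P(A') \big| \]
\[ = \sup_{A \in \mathcal{A}} \inf_{A' \in \mathcal{A}} \Big\{ \big| P_{x^n}(A) - P(A) \big| - \big| P_{x^n_{(i)}}(A') - P(A') \big| \Big\} \]
\[ \le \sup_{A \in \mathcal{A}} \Big\{ \big| P_{x^n}(A) - P(A) \big| - \big| P_{x^n_{(i)}}(A) - P(A) \big| \Big\} \]
\[ \le \sup_{A \in \mathcal{A}} \big| P_{x^n}(A) - P_{x^n_{(i)}}(A) \big| \]
\[ = \frac{1}{n} \sup_{A \in \mathcal{A}} \big| \mathbf{1}_{\{x_i \in A\}} - \mathbf{1}_{\{x_i' \in A\}} \big| \le \frac{1}{n}. \]

Interchanging the roles of $x^n$ and $x^n_{(i)}$, we obtain

\[ g(x^n_{(i)}) - g(x^n) \le \frac{1}{n}. \]

Thus,

\[ \big| g(x^n) - g(x^n_{(i)}) \big| \le \frac{1}{n}. \]

Note that this bound holds for all $i$ and all choices of $x^n$ and $x^n_{(i)}$. This means that the function $g$ defined in (22) has bounded differences with $c_1 = \ldots = c_n = 1/n$. Consequently, we can use Theorem 2 to get

\[ \mathbb{P}\big( |g(X^n) - \mathbb{E}g(X^n)| \ge \varepsilon \big) \le 2e^{-2n\varepsilon^2}. \]

This shows that the uniform deviation $g(X^n)$ concentrates sharply around its mean $\mathbb{E}g(X^n)$.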
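The bounded-difference property of the uniform deviation (22) can be checked numerically for a concrete class of sets. The sketch below makes several assumptions that are not in the lecture: $P$ is taken to be Uniform[0, 1], the class $\mathcal{A}$ is a finite family of half-lines $(-\infty, a]$ on a grid of thresholds, and the function names are hypothetical. For this choice $P(A) = a$, so the supremum is a maximum over the grid.

```python
import random

def empirical_frequency(sample, a):
    """P_{x^n}(A) for the half-line A = (-inf, a]."""
    return sum(x <= a for x in sample) / len(sample)

def uniform_deviation(sample, thresholds):
    """g(x^n) from (22) for half-lines with thresholds in [0, 1], P = Uniform[0, 1]."""
    return max(abs(empirical_frequency(sample, a) - a) for a in thresholds)

random.seed(0)  # fixed seed so the run is reproducible
n = 50
thresholds = [k / 20 for k in range(21)]
sample = [random.random() for _ in range(n)]
g = uniform_deviation(sample, thresholds)

# Replace one coordinate at a time and record the largest change in g.
worst_change = 0.0
for i in range(n):
    perturbed = list(sample)
    perturbed[i] = random.random()
    change = abs(uniform_deviation(perturbed, thresholds) - g)
    worst_change = max(worst_change, change)

print(g, worst_change, 1 / n)
```

Whatever the sample and the replacement point, the recorded change cannot exceed $1/n$ (up to floating-point rounding), which is exactly the constant $c_i = 1/n$ used in the application of Theorem 2.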

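4.3. Uniform deviations continued. The same idea applies to arbitrary real-valued functions over $\mathsf{X}$. Let $X^n = (X_1, \ldots, X_n)$ be as in the previous example. Given any function $f : \mathsf{X} \to [0, 1]$, Hoeffding's inequality tells us that

\[ \mathbb{P}\big( |P_{X^n}(f) - \mathbb{E}f(X)| \ge \varepsilon \big) \le 2e^{-2n\varepsilon^2}. \]

However, just as in the previous example, in learning theory we are primarily interested in controlling the deviations of empirical means from true means simultaneously over whole classes of functions. To that end, let $\mathcal{F}$ be such a class consisting of functions $f : \mathsf{X} \to [0, 1]$ and consider the uniform deviation

\[ g(X^n) \triangleq \sup_{f \in \mathcal{F}} \big| P_{X^n}(f) - P(f) \big|. \]

An argument entirely similar to the one in the previous example (exercise: verify this!) shows that this $g$ has bounded differences with $c_1 = \ldots = c_n = 1/n$. Therefore, applying McDiarmid's inequality, we obtain

\[ \mathbb{P}\big( |g(X^n) - \mathbb{E}g(X^n)| \ge \varepsilon \big) \le 2e^{-2n\varepsilon^2}. \]

We will see later that, for certain function classes $\mathcal{F}$, we will have $\mathbb{E}g(X^n) = O(1/\sqrt{n})$.

4.4. Kernel density estimation. For our final example, let $X^n = (X_1, \ldots, X_n)$ be an $n$-tuple of i.i.d. real-valued random variables whose common distribution $P$ has a probability density function (pdf) $f$, i.e.,

\[ P(A) = \int_A f(x)\,dx \]

for any measurable set $A \subseteq \mathbb{R}$. We wish to estimate $f$ from the sample $X^n$. A popular method is to use a kernel estimate (the book by Devroye and Lugosi [DL01] has plenty of material on density estimation, including kernel methods, from the viewpoint of statistical learning theory). To that end, we pick a nonnegative function $K : \mathbb{R} \to \mathbb{R}$ that integrates to one, $\int K(x)\,dx = 1$ (such a function is called a kernel), as well as a positive bandwidth (or smoothing constant) $h > 0$ and form the estimate

\[ \hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^n K\left( \frac{x - X_i}{h} \right). \]

It is not hard to verify (another exercise!) that $\hat{f}_n$ is a valid pdf, i.e., that it is nonnegative and integrates to one. A common way of quantifying the performance of a density estimator is to use the $L_1$ distance to the true density $f$:

\[ \| \hat{f}_n - f \|_{L_1} = \int_{\mathbb{R}} \big| \hat{f}_n(x) - f(x) \big|\,dx. \]

Note that $\| \hat{f}_n - f \|_{L_1}$ is a random variable since it depends on the random sample $X^n$. Thus, we can write it as a function $g(X^n)$ of the sample $X^n$. Leaving aside the problem of actually bounding $\mathbb{E}g(X^n)$, we can easily establish a concentration bound for it using McDiarmid's inequality. To do that, we need to check that $g$ has bounded differences. Choosing $x^n$ and $x^n_{(i)}$ as before, we have

\[ g(x^n) - g(x^n_{(i)}) = \int_{\mathbb{R}} \left| \frac{1}{nh} \sum_{j=1}^{i-1} K\left( \frac{x - x_j}{h} \right) + \frac{1}{nh} K\left( \frac{x - x_i}{h} \right) + \frac{1}{nh} \sum_{j=i+1}^{n} K\left( \frac{x - x_j}{h} \right) - f(x) \right| dx \]
\[ \qquad - \int_{\mathbb{R}} \left| \frac{1}{nh} \sum_{j=1}^{i-1} K\left( \frac{x - x_j}{h} \right) + \frac{1}{nh} K\left( \frac{x - x_i'}{h} \right) + \frac{1}{nh} \sum_{j=i+1}^{n} K\left( \frac{x - x_j}{h} \right) - f(x) \right| dx \]
\[ \le \frac{1}{nh} \int_{\mathbb{R}} \left| K\left( \frac{x - x_i}{h} \right) - K\left( \frac{x - x_i'}{h} \right) \right| dx \]
\[ \le \frac{2}{nh} \int_{\mathbb{R}} K\left( \frac{x}{h} \right) dx = \frac{2}{n}. \]

Thus, we see that $g(X^n)$ has the bounded differences property with $c_1 = \ldots = c_n = 2/n$, so that

\[ \mathbb{P}\big( |g(X^n) - \mathbb{E}g(X^n)| \ge \varepsilon \big) \le 2e^{-n\varepsilon^2/2}. \]

References

[BBL04] S. Boucheron, O. Bousquet, and G. Lugosi. Concentration inequalities. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures in Machine Learning. Springer.

[Che52] H. Chernoff. A measure of asymptotic efficiency of tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23.

[DL01] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer.

[Hoe63] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30.

[McD89] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics. Cambridge University Press.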
