36-705: Intermediate Statistics, Fall 2018

Lecturer: Siva Balakrishnan

Lecture 3: August 31

This lecture will be mostly a summary of other useful exponential tail bounds. We will not prove any of these in lecture; some of them follow from similar lines of using Chernoff's method in clever ways, and I can provide references if you are curious. In particular, we will go through:

1. Bernstein's inequality: sharper concentration for bounded random variables.
2. McDiarmid's inequality: concentration of Lipschitz functions of bounded random variables.
3. Levy's inequality/Tsirelson's inequality: concentration of Lipschitz functions of Gaussian random variables.
4. The $\chi^2$ tail bound.

Finally, we will see an application of the $\chi^2$ tail bound in proving the Johnson-Lindenstrauss lemma.

3.1 Bernstein's inequality

One nice thing about the Gaussian tail inequality was that it explicitly depended on the variance of the random variable $X$: roughly, the inequality guaranteed us that the deviation from the mean was at most $\sigma \sqrt{2\log(2/\delta)/n}$ with probability at least $1 - \delta$. On the other hand, Hoeffding's bound depended only on the bounds of the random variables but not explicitly on their variance. The bound $b - a$ provides a (possibly loose) upper bound on the standard deviation. One might at least hope that if the random variables were bounded, and additionally had small variance, we might be able to improve on Hoeffding's bound. This is indeed the case. Such inequalities are typically known as Bernstein inequalities.

As a concrete example, suppose we have $X_1, \ldots, X_n$ which are i.i.d. from a distribution with mean $\mu$, bounded support $[a, b]$, and variance $\mathbb{E}[(X - \mu)^2] = \sigma^2$. Then,
$$\mathbb{P}(|\widehat{\mu} - \mu| \geq t) \leq 2 \exp\left( - \frac{n t^2}{2(\sigma^2 + (b - a)t)} \right).$$
Roughly, this inequality says that with probability at least $1 - \delta$,
$$|\widehat{\mu} - \mu| \leq \sqrt{\frac{4\sigma^2 \ln(2/\delta)}{n}} + \frac{4(b - a)\ln(2/\delta)}{n}.$$
As an exercise, work through the above algebra. Up to some small constants this is never worse than Hoeffding's bound, which just comes from using the worst-case upper bound $\sigma \leq b - a$. When the RVs have small variance, i.e. when $\sigma^2$ is small, this bound can be much sharper than Hoeffding's bound. These are cases where one has a random variable that occasionally takes large values (so the bounds $[a, b]$ are not great) but has much smaller variance. Intuitively, it captures more of the Chebyshev effect, i.e. that random variables with small variance should be tightly concentrated around their mean.

3.2 McDiarmid's inequality

So far we have mostly been focused on the concentration of averages. A natural question is whether other functions of i.i.d. random variables also show exponential concentration. It turns out that many other functions do concentrate sharply, and roughly the main property of the function that we need is that if we change the value of one random variable, the function does not change dramatically. Formally, we have i.i.d. RVs $X_1, \ldots, X_n$, where each $X_i \in \mathbb{R}$. We have a function $f: \mathbb{R}^n \to \mathbb{R}$ that satisfies the property that
$$|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{k-1}, x_k', x_{k+1}, \ldots, x_n)| \leq L_k,$$
for every $x_1, \ldots, x_n, x_k' \in \mathbb{R}$, i.e. the function changes by at most $L_k$ if its $k$-th co-ordinate is changed. This is known as the bounded difference condition. If the random variables $X_1, \ldots, X_n$ are i.i.d., then for all $t \geq 0$,
$$\mathbb{P}(|f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)]| \geq t) \leq 2 \exp\left( - \frac{2t^2}{\sum_{k=1}^n L_k^2} \right).$$

Example 1: A simple example of this inequality in action is to see that it directly implies Hoeffding's bound. In this case the function of interest is the average,
$$f(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i,$$
and since the random variables are bounded we have that each $L_k \leq (b - a)/n$. This in turn directly yields Hoeffding's bound (with slightly better constants).
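To make the Bernstein-versus-Hoeffding comparison above concrete, here is a small numerical sketch (not part of the lecture). It evaluates both deviation widths, using the Bernstein corollary stated in Section 3.1, for a low-variance Bernoulli example; the function names and the parameter values are illustrative choices.

```python
import math

# Deviation widths at confidence 1 - delta for the mean of n i.i.d.
# RVs bounded in [a, b] with variance sigma2. Constants follow the
# lecture's statements; the Bernoulli example is illustrative.

def hoeffding_width(n, a, b, delta):
    # Invert Hoeffding: P(|mu_hat - mu| >= t) <= 2 exp(-2 n t^2 / (b-a)^2)
    return (b - a) * math.sqrt(math.log(2 / delta) / (2 * n))

def bernstein_width(n, a, b, sigma2, delta):
    # The corollary of Bernstein's inequality from the notes
    log_term = math.log(2 / delta)
    return math.sqrt(4 * sigma2 * log_term / n) + 4 * (b - a) * log_term / n

# Bernoulli(p) with small p: bounded in [0, 1], variance p(1 - p) is small.
n, delta, p = 10_000, 0.01, 0.01
sigma2 = p * (1 - p)
h = hoeffding_width(n, 0, 1, delta)
bst = bernstein_width(n, 0, 1, sigma2, delta)
print(f"Hoeffding width: {h:.5f}, Bernstein width: {bst:.5f}")
```

Because $\sigma^2 = p(1-p)$ is far smaller than the worst case $(b-a)^2$, the Bernstein width is noticeably smaller here, exactly the regime the text describes.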
Example 2: A perhaps more interesting example is that of U-statistics. A U-statistic is defined by a kernel, which is just a function of two random variables, i.e. $g: \mathbb{R}^2 \to \mathbb{R}$. The U-statistic is then given as:
$$U(X_1, \ldots, X_n) := \binom{n}{2}^{-1} \sum_{j < k} g(X_j, X_k).$$
There are many examples of U-statistics, for instance:

1. Variance: The usual estimator of the sample variance,
$$\widehat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^n (X_i - \widehat{\mu})^2,$$
is the U-statistic that arises from taking $g(X_j, X_k) = \frac{1}{2}(X_j - X_k)^2$.

2. Mean absolute deviation: If we take $g(X_j, X_k) = |X_j - X_k|$, this leads to a U-statistic that is an unbiased estimator of the mean absolute deviation $\mathbb{E}|X_1 - X_2|$.

For bounded U-statistics, i.e. if $|g(X_i, X_j)| \leq b$, we can apply McDiarmid's inequality to obtain a concentration bound. Note that since each random variable $X_i$ participates in $(n - 1)$ terms, we have that
$$|U(X_1, \ldots, X_n) - U(X_1, \ldots, X_i', \ldots, X_n)| \leq \binom{n}{2}^{-1} (n - 1)(2b) = \frac{4b}{n}.$$
So McDiarmid's inequality tells us that
$$\mathbb{P}(|U(X_1, \ldots, X_n) - \mathbb{E}[U(X_1, \ldots, X_n)]| \geq t) \leq 2\exp\left(-\frac{n t^2}{8b^2}\right).$$

3.3 Levy's inequality

There is a similar concentration inequality that applies to functions of Gaussian random variables that are sufficiently smooth. In this case the assumption is quite different. We assume that
$$|f(X_1, \ldots, X_n) - f(Y_1, \ldots, Y_n)| \leq L \sqrt{\sum_{i=1}^n (X_i - Y_i)^2},$$
for all $X_1, \ldots, X_n, Y_1, \ldots, Y_n \in \mathbb{R}$. For such functions we have that if $X_1, \ldots, X_n \sim N(0, 1)$, then
$$\mathbb{P}(|f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)]| \geq t) \leq 2 \exp\left( - \frac{t^2}{2L^2} \right).$$
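As an illustration of the Lipschitz condition, the maximum $f(x) = \max_i x_i$ is $1$-Lipschitz with respect to the Euclidean norm, since $|\max_i x_i - \max_i y_i| \leq \|x - y\|_\infty \leq \|x - y\|_2$. The following simulation sketch (not from the lecture; sample sizes and $t$ are illustrative) compares its empirical deviations from the mean against the stated bound with $L = 1$:

```python
import math
import random

# Empirical check of Levy/Tsirelson-type concentration for a 1-Lipschitz
# function of i.i.d. standard Gaussians: f(x) = max_i x_i.

random.seed(0)
n, trials, t = 50, 2000, 1.0

def f(xs):
    # 1-Lipschitz: |max(x) - max(y)| <= ||x - y||_2
    return max(xs)

samples = [f([random.gauss(0, 1) for _ in range(n)]) for _ in range(trials)]
mean_f = sum(samples) / trials  # stands in for E[f]
emp_tail = sum(abs(s - mean_f) >= t for s in samples) / trials
bound = 2 * math.exp(-t * t / 2)  # the stated bound with L = 1
print(f"empirical P(|f - Ef| >= {t}) = {emp_tail:.4f}, bound = {bound:.4f}")
```

At $t = 1$ the bound is vacuous (larger than 1), which reflects the remark that these exponential bounds only bite for suitably chosen deviations; shrinking $t$ while growing the deviation scale of $f$ makes the comparison sharper.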
3.4 $\chi^2$ tail bounds

A $\chi^2$ random variable with $n$ degrees of freedom, denoted by $Y \sim \chi^2_n$, is a RV that is a sum of $n$ i.i.d. squared standard Gaussian RVs, i.e. $Y = \sum_{i=1}^n X_i^2$ where each $X_i \sim N(0, 1)$. The expected value $\mathbb{E}[X_i^2] = 1$, and we have the $\chi^2$ tail bound:
$$\mathbb{P}\left( \left| \frac{1}{n} \sum_{k=1}^n Z_k^2 - 1 \right| \geq t \right) \leq 2\exp(-n t^2 / 8) \quad \text{for all } t \in (0, 1).$$
You will derive this in your HW using the Chernoff method. Analogous to the class of sub-Gaussian RVs, $\chi^2$ random variables belong to a class of what are known as sub-exponential random variables. The main noteworthy difference is that the Gaussian-type behaviour of the tail only holds for small values of the deviation $t$.

Detour: The union bound. This is also known as Boole's inequality. It says that if we have events $A_1, \ldots, A_n$ then
$$\mathbb{P}\left( \bigcup_{i=1}^n A_i \right) \leq \sum_{i=1}^n \mathbb{P}(A_i).$$
In particular, if we consider a case when each event $A_i$ is a failure of some type, then the above inequality says that the probability that even a single failure occurs is at most the sum of the probabilities of each failure.

Example: The Johnson-Lindenstrauss Lemma. One very nice application of $\chi^2$ tail bounds is in the analysis of what are known as random projections. Suppose we have a data set $X_1, \ldots, X_n \in \mathbb{R}^d$ where $d$ is quite large. Storing such a dataset might be expensive, and as a result we often resort to sketching or random projection, where the goal is to create a map $F: \mathbb{R}^d \to \mathbb{R}^m$, with $m \ll d$. We then instead store the mapped dataset $\{F(X_1), \ldots, F(X_n)\}$. The challenge is to design this map $F$ in a way that preserves essential features of the original dataset. In particular, we would like that for every pair $(X_i, X_j)$ we have
$$(1 - \epsilon)\|X_i - X_j\|_2^2 \leq \|F(X_i) - F(X_j)\|_2^2 \leq (1 + \epsilon)\|X_i - X_j\|_2^2,$$
i.e. the map preserves all the pairwise distances up to a $(1 \pm \epsilon)$ factor. Of course, if $m$ is large we might expect this is not too difficult. The Johnson-Lindenstrauss lemma is quite stunning: it says that a simple randomized construction will produce such a map with probability at least $1 - \delta$ provided that
$$m \geq \frac{16 \log(n/\delta)}{\epsilon^2}.$$
Notice that this is completely independent of the original dimension $d$ and depends only logarithmically on the number of points $n$. This map can result in huge savings in
storage cost while still essentially preserving all the pairwise distances.
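Before turning to the construction, the $\chi^2$ tail bound of Section 3.4, which drives the whole analysis, can be sanity-checked by simulation. This is a sketch (not from the lecture); $n$, the number of trials, and $t$ are illustrative.

```python
import math
import random

# Empirical check of the chi-squared tail bound:
#   P(|(1/n) sum_k Z_k^2 - 1| >= t) <= 2 exp(-n t^2 / 8),  t in (0, 1).

random.seed(1)
n, trials, t = 100, 5000, 0.5

def avg_sq(n):
    # (1/n) * chi-squared with n degrees of freedom
    return sum(random.gauss(0, 1) ** 2 for _ in range(n)) / n

emp = sum(abs(avg_sq(n) - 1) >= t for _ in range(trials)) / trials
bound = 2 * math.exp(-n * t * t / 8)
print(f"empirical tail {emp:.4f} <= bound {bound:.4f}")
```

The empirical tail probability comes out far below the bound, as expected for a conservative Chernoff-style estimate.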
The map itself is quite simple: we construct a matrix $Z \in \mathbb{R}^{m \times d}$, where each entry of $Z$ is i.i.d. $N(0, 1)$. We then define the map as:
$$F(X_i) = \frac{Z X_i}{\sqrt{m}}.$$
Now let us fix a pair $(X_j, X_k)$ and consider
$$\frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2} = \frac{\|Z(X_j - X_k)\|_2^2}{m \|X_j - X_k\|_2^2} = \frac{1}{m} \sum_{i=1}^m \underbrace{\left\langle Z_i, \frac{X_j - X_k}{\|X_j - X_k\|_2} \right\rangle^2}_{T_i}.$$
Now, for some fixed numbers $a_1, \ldots, a_d$, the distribution of $\sum_{j=1}^d a_j Z_{ij}$ is Gaussian with mean 0 and variance $\sum_{j=1}^d a_j^2$. Since the vector $(X_j - X_k)/\|X_j - X_k\|_2$ has unit norm, each term $T_i$ is an independent $\chi^2_1$ random variable. Now applying the $\chi^2$ tail bound, we obtain
$$\mathbb{P}\left( \left| \frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2} - 1 \right| \geq \epsilon \right) \leq 2\exp(-m\epsilon^2 / 8).$$
Thus for the fixed pair $(X_j, X_k)$ the probability that our map fails to preserve the distance is exponentially small, i.e. at most $2\exp(-m\epsilon^2 / 8)$. Now, to find the probability that our map fails to preserve any of our $\binom{n}{2}$ pairwise distances, we simply apply the union bound to conclude that the probability of any failure is at most
$$\mathbb{P}(\text{failure}) \leq 2\binom{n}{2} \exp(-m\epsilon^2 / 8).$$
Now, it is straightforward to verify that if
$$m \geq \frac{16 \log(n/\delta)}{\epsilon^2},$$
then this probability is at most $\delta$, as desired. An important point to note is that the exponential concentration is what leads to such a small value for $m$ (i.e. it only needs to grow logarithmically with the sample size $n$).
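The construction above can be sketched in a few lines. This is an illustrative implementation, not part of the lecture: the dimensions, $\epsilon$, and $\delta$ are arbitrary choices, and plain lists are used instead of a linear-algebra library to keep it self-contained.

```python
import math
import random

# Johnson-Lindenstrauss sketch: project with a Gaussian matrix Z scaled
# by 1/sqrt(m), then check all pairwise squared distances are preserved
# up to a (1 +/- eps) factor.

random.seed(2)
d, n_pts, eps, delta = 500, 15, 0.5, 0.1
m = math.ceil(16 * math.log(n_pts / delta) / eps ** 2)  # the lemma's m

points = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_pts)]
Z = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # i.i.d. N(0,1)

def project(x):
    # F(x) = Z x / sqrt(m)
    return [sum(row[j] * x[j] for j in range(d)) / math.sqrt(m) for row in Z]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

proj = [project(x) for x in points]
ok = all(
    (1 - eps) * sqdist(points[i], points[j])
    <= sqdist(proj[i], proj[j])
    <= (1 + eps) * sqdist(points[i], points[j])
    for i in range(n_pts) for j in range(i + 1, n_pts)
)
print(f"m = {m} (vs d = {d}); all pairwise distances preserved: {ok}")
```

Note that $m$ here depends only on $n$, $\epsilon$, and $\delta$, not on $d$, mirroring the dimension-independence emphasized in the analysis.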