ECE 6980 An Algorithmic and Information-Theoretic Toolbox for Massive Data


Instructor: Jayadev Acharya
Lecture #2
Scribe: Huayu Zhang
28th August, 2017

1 Recap

$|\mathcal{X}| = k$; $\varepsilon$ is an accuracy parameter, and $\delta$ is an error parameter.

2 Learning discrete distributions

TV-Estimation Problem: Given $X_1, X_2, \ldots, X_n$ independent samples drawn from an unknown distribution $p$ over $[k]$, we need to output $\hat{p}$ such that, with probability at least $1 - \delta$, $d_{TV}(p, \hat{p}) < \varepsilon$. Here we assume $\delta = 0.1$ (for now).

Suppose we observe $X_1^n \stackrel{\mathrm{def}}{=} X_1, X_2, \ldots, X_n$ from a distribution $p$ over $\mathcal{X}$. Let $N_x \stackrel{\mathrm{def}}{=} \#\{\text{times symbol } x \text{ appears in } X_1^n\}$. We define the empirical estimator $\hat{p}(x) = N_x / n$.

Theorem 1. The empirical estimator satisfies
$$\mathbb{E}_{X_1^n}\big[\ell_1(p, \hat{p})\big] \le \sqrt{\frac{k}{n}}.$$

Lemma 2 (Cauchy-Schwarz Inequality). Let $a_1, \ldots, a_m, b_1, \ldots, b_m \in \mathbb{R}$. Then
$$\Big(\sum_{i=1}^m a_i b_i\Big)^2 \le \Big(\sum_{i=1}^m a_i^2\Big)\Big(\sum_{i=1}^m b_i^2\Big).$$
The two sides are equal if and only if $a_i / b_i = c$ for all $i$.

Proof (of Theorem 1). Using CSI with $a_x = |p(x) - \hat{p}(x)|$ and $b_x = 1$,
$$\ell_1(p, \hat{p})^2 \le k \sum_x \big(p(x) - \hat{p}(x)\big)^2.$$
If we take expectations of both sides, we have
$$\mathbb{E}\big[\ell_1(p, \hat{p})^2\big] \le k \sum_x \mathbb{E}\Big[\Big(\frac{N_x}{n} - p(x)\Big)^2\Big] \quad (1)$$
$$= \frac{k}{n^2} \sum_x \mathbb{E}\big[(N_x - n\,p(x))^2\big] \quad (2)$$
$$= \frac{k}{n^2} \sum_x n\,p(x)(1 - p(x)) \quad (3)$$
$$\le \frac{k}{n}. \quad (4)$$
The last two lines come from the fact that $N_x \sim \mathrm{Bin}(n, p(x))$, so $\mathbb{E}[N_x] = n\,p(x)$ and $\mathrm{Var}(N_x) = n\,p(x)(1 - p(x))$. Because $f(x) = x^2$ is a convex function, according to Jensen's inequality we get
$$\mathbb{E}\big[\ell_1(p, \hat{p})\big] \le \sqrt{\mathbb{E}\big[\ell_1(p, \hat{p})^2\big]} \le \sqrt{\frac{k}{n}}.$$
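To see the bound concretely, here is a minimal simulation sketch (not part of the original notes), assuming numpy is available; the helper names empirical_estimator and l1_distance are our own. It draws $n$ samples from a distribution over $[k]$, forms $\hat{p}(x) = N_x/n$, and compares the average $\ell_1$ error to the $\sqrt{k/n}$ bound of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_estimator(samples, k):
    """The empirical distribution: p_hat(x) = N_x / n."""
    return np.bincount(samples, minlength=k) / len(samples)

def l1_distance(p, q):
    """l1 distance between two distributions; note d_TV(p, q) = l1(p, q) / 2."""
    return np.abs(p - q).sum()

k, n, trials = 100, 10_000, 200
p = rng.dirichlet(np.ones(k))                # an arbitrary distribution over [k]
errs = [l1_distance(p, empirical_estimator(rng.choice(k, size=n, p=p), k))
        for _ in range(trials)]
# Theorem 1: the mean l1 error is at most sqrt(k/n) = 0.1 for these parameters.
print(f"mean l1 error: {np.mean(errs):.4f}, bound sqrt(k/n): {np.sqrt(k / n):.4f}")
```

The observed mean error typically sits well below the bound, which is worst-case over $p$ (roughly speaking, it is tightest for near-uniform distributions).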

Lemma 3 (Markov's Inequality). If $X$ is a nonnegative random variable and $a > 0$, then
$$\Pr(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

Using Markov's inequality,
$$\Pr\big(\ell_1(p, \hat{p}) > \varepsilon\big) \le \frac{1}{\varepsilon}\sqrt{\frac{k}{n}}.$$
Setting $\frac{1}{\varepsilon}\sqrt{\frac{k}{n}} \le 0.1$, we get $n \ge \frac{100 k}{\varepsilon^2}$. So if we use the empirical estimator, we get an upper bound of $O(k/\varepsilon^2)$ samples.

3 Poisson Sampling

Poisson sampling is a sampling method that produces independent $N_x$'s without too much loss.

3.1 Properties of the Poisson Distribution

If $X \sim \mathrm{Poi}(\lambda_1)$ and $Y \sim \mathrm{Poi}(\lambda_2)$:

1. PMF: $\Pr(X = i) = e^{-\lambda_1} \frac{\lambda_1^i}{i!}$.
2. Mean and variance: $\mathbb{E}[X] = \mathrm{Var}(X) = \lambda_1$.
3. When $np$ is fixed and $p \to 0$, $\mathrm{Bin}(n, p)$ goes to $\mathrm{Poi}(np)$. To be specific, when $np = \lambda$,
$$\lim_{p \to 0} \binom{n}{i} p^i (1 - p)^{n-i} = e^{-\lambda} \frac{\lambda^i}{i!}.$$
4. $X + Y \sim \mathrm{Poi}(\lambda_1 + \lambda_2)$.

3.2 Procedure for Poisson Sampling

Fixed-length sampling: We have a fixed sample size $n$ and we draw $X_1, X_2, \ldots, X_n$ i.i.d. samples from the distribution $p$; here $N_x \sim \mathrm{Bin}(n, p(x))$.

Poisson-length sampling: (1) Draw $n' \sim \mathrm{Poi}(n)$. (2) Generate $n'$ independent samples from $p$. (A simulation sketch of both procedures follows.)
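The following sketch (ours, not the notes', assuming numpy) implements both procedures and checks the property proved in Section 3.2.1 below: under Poisson-length sampling the count of a fixed symbol has mean and variance $n\,p(x)$, whereas under fixed-length sampling the variance is the smaller binomial value $n\,p(x)(1 - p(x))$.

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_length_counts(p, n):
    """Fixed-length sampling: exactly n i.i.d. draws, so N_x ~ Bin(n, p(x))."""
    return np.bincount(rng.choice(len(p), size=n, p=p), minlength=len(p))

def poisson_length_counts(p, n):
    """Poisson-length sampling: n' ~ Poi(n) draws, so N_x ~ Poi(n * p(x))."""
    n_prime = rng.poisson(n)
    return np.bincount(rng.choice(len(p), size=n_prime, p=p), minlength=len(p))

k, n, trials = 5, 100, 20_000
p = np.full(k, 1.0 / k)                      # uniform, so n * p(x) = 20
poi = np.array([poisson_length_counts(p, n)[0] for _ in range(trials)])
fix = np.array([fixed_length_counts(p, n)[0] for _ in range(trials)])
# Expected: mean 20 and variance 20 for Poisson-length sampling,
# mean 20 and variance n*p(x)*(1-p(x)) = 16 for fixed-length sampling.
print(f"Poisson-length: mean {poi.mean():.1f}, var {poi.var():.1f}")
print(f"fixed-length:   mean {fix.mean():.1f}, var {fix.var():.1f}")
```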

3.2.1 Properties of Poisson Sampling

1. $N_x \sim \mathrm{Poi}(n\,p(x))$.

Proof.
$$\Pr(N_x = j) = \sum_{n' \ge j} \Pr(n' \text{ samples})\,\Pr(N_x = j \mid n')$$
$$= \sum_{n' \ge j} e^{-n} \frac{n^{n'}}{n'!} \binom{n'}{j} p(x)^j (1 - p(x))^{n'-j}$$
$$= e^{-n} \frac{(n\,p(x))^j}{j!} \sum_{n' \ge j} \frac{\big(n(1 - p(x))\big)^{n'-j}}{(n'-j)!}$$
$$= e^{-n} \frac{(n\,p(x))^j}{j!}\, e^{n(1 - p(x))} = e^{-n\,p(x)} \frac{(n\,p(x))^j}{j!}.$$

2. Conditioned on $n'$, the distribution becomes fixed-length sampling with respect to the parameter $n'$.

3. $\Pr(N_x = n_x, N_y = n_y) = \Pr(N_x = n_x)\,\Pr(N_y = n_y)$; that is, the counts are mutually independent.

4 Testing Problem

Given the description of a probability distribution $q$ over $[k]$, a parameter $\varepsilon$, and $n$ independent samples from an unknown distribution $p$, we want to know whether $p = q$ or $d_{TV}(p, q) > \varepsilon$. The picture drawn in lecture illustrates the case when $q = u[k]$: we need to distinguish between $p$ being at the origin and $p$ lying outside the square.

Now we consider the special case when $q$ is uniform. Given $\varepsilon > 0$ and $n$ independent samples from $p$, we want to figure out, with probability at least $0.9$, whether $p = q$ or $d_{TV}(p, q) > \varepsilon$. (A concrete $\varepsilon$-far instance is sketched below.)
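As a concrete example of the second case, the sketch below (our construction, not from the notes; it is a standard hard instance) builds a distribution at TV distance exactly $\varepsilon$ from $u[k]$ by moving $2\varepsilon/k$ of probability mass between the two halves of the support. A tester must tell samples from such a $p$ apart from samples from $u[k]$.

```python
import numpy as np

def far_from_uniform(k, eps):
    """A distribution p over [k] (k even, eps <= 1/2) with d_TV(p, u[k]) = eps."""
    p = np.full(k, 1.0 / k)
    p[: k // 2] += 2 * eps / k    # raise half of the symbols...
    p[k // 2 :] -= 2 * eps / k    # ...and lower the other half by the same amount
    return p

k, eps = 100, 0.1
p = far_from_uniform(k, eps)
u = np.full(k, 1.0 / k)
print(f"d_TV(p, u[k]) = {0.5 * np.abs(p - u).sum():.3f}")   # prints 0.100
```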

Theorem 4. Testing uniformity requires $\Omega(\sqrt{k})$ samples for any fixed $\varepsilon$.

Before we look at the argument for this theorem, let us see the following lemma first.

Lemma 5 (Birthday Paradox). At least $\Omega(\sqrt{k})$ samples from $u[k]$ are needed before you can find a repeated symbol with some constant probability.

You can prove this lemma by showing $\mathbb{E}[\#\text{symbols that appear more than once}] < n^2/k$. Don't forget that under Poisson sampling, for every $x$, $N_x \sim \mathrm{Poi}(n/k)$. You can also try to prove the following result: at least $\Omega(k^{1 - 1/\alpha})$ samples from $u[k]$ are needed before you can find a symbol that appears $\alpha$ times with some constant probability.

Now let us go back to the theorem. Recall that $P = u[k]$ is the uniform distribution on $[k]$. Let $u[k/2]$ be the collection of all distributions that are uniform over a subset of $k/2$ elements of $[k]$; there are $\binom{k}{k/2}$ such distributions. Then note that:

- For any $q \in u[k/2]$, $d_{TV}(q, u[k]) = 0.5$.
- Let $Q$ be a distribution drawn uniformly from $u[k/2]$. If we take $\sqrt{k}/10$ samples from $P = u[k]$, then with constant probability all symbols are distinct. The same is true for $Q$, and a sequence of distinct symbols looks the same under both. Hence we can't distinguish between $P$ and $Q$ with constant probability.

4.1 Goldreich-Ron Algorithm

The algorithm is as follows. Let
$$T \stackrel{\mathrm{def}}{=} \sum_{i < j} \mathbb{I}\{X_i = X_j\}.$$
If $T \ge \binom{n}{2} \frac{1 + 2\varepsilon^2}{k}$, we output $d_{TV}(p, q) > \varepsilon$; else we output $p = q$. (A sketch of this test in code appears at the end of the section.)

Theorem 6. The coincidence-based test solves the uniformity testing problem with $O(\sqrt{k}/\varepsilon^4)$ samples.

Proof. When $p$ is the uniform distribution, the expectation of the statistic $T$ is
$$\mathbb{E}[T \mid p = u] = \binom{n}{2} \sum_x p(x)^2 \quad (5)$$
$$= \binom{n}{2} \frac{1}{k}. \quad (6)$$
When $d_{TV}(p, u) > \varepsilon$, by using Jensen's inequality and the Cauchy-Schwarz inequality,
$$\sum_x \Big(p(x) - \frac{1}{k}\Big)^2 \ge \frac{1}{k} \Big(\sum_x \Big|p(x) - \frac{1}{k}\Big|\Big)^2 \ge \frac{(2\varepsilon)^2}{k}. \quad (7)$$
Besides,
$$\sum_x \Big(p(x) - \frac{1}{k}\Big)^2 = \sum_x p(x)^2 - \frac{2}{k} \sum_x p(x) + \frac{1}{k} = \sum_x p(x)^2 - \frac{1}{k}. \quad (8)$$
Then we have
$$\sum_x p(x)^2 \ge \frac{1 + 4\varepsilon^2}{k}.$$
So the expectation of the statistic is
$$\mathbb{E}[T \mid d_{TV}(p, u) > \varepsilon] = \binom{n}{2} \sum_x p(x)^2 \quad (9)$$
$$\ge \binom{n}{2} \frac{1 + 4\varepsilon^2}{k}. \quad (10)$$
The remaining part of the proof, bounding the variance of $T$ and applying Chebyshev's inequality, will be covered in the next lecture. In the next lecture we will also look at a statistic that gives an upper bound of $O(\sqrt{k}/\varepsilon^2)$ samples.
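Here is the promised sketch of the coincidence-based test (our code; the notes contain none). It uses the identity $T = \sum_x \binom{N_x}{2}$ to compute the statistic from counts rather than a quadratic pairwise loop, and the threshold $\binom{n}{2}(1 + 2\varepsilon^2)/k$ as reconstructed above; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

def coincidence_test(samples, k, eps):
    """Return True to declare d_TV(p, u[k]) > eps, False to declare p = u[k]."""
    n = len(samples)
    counts = np.bincount(samples, minlength=k)
    t = (counts * (counts - 1) // 2).sum()          # T = sum_x C(N_x, 2)
    threshold = (n * (n - 1) / 2) * (1 + 2 * eps**2) / k
    return t >= threshold

k, eps, n = 1_000, 0.25, 4_000
far = np.full(k, 1.0 / k)              # the eps-far instance from Section 4
far[: k // 2] += 2 * eps / k
far[k // 2 :] -= 2 * eps / k
print(coincidence_test(rng.choice(k, size=n), k, eps))         # u[k]: False
print(coincidence_test(rng.choice(k, size=n, p=far), k, eps))  # far p: True
```

With these parameters, $\mathbb{E}[T] \approx 8.0 \times 10^3$ under uniform and $\approx 1.0 \times 10^4$ under the far instance, so the threshold of roughly $9.0 \times 10^3$ separates the two cases comfortably.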

5 References

Mitzenmacher, Michael, and Eli Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis.

Paninski 08: http://www.stat.columbia.edu/~liam/research/pubs/sparse-unif-test.pdf