Mining Data Streams-Estimating Frequency Moment

1 Mining Data Streams-Estimating Frequency Moment. Barna Saha, October 26, 2017

3 Frequency Moment
Computing moments involves the distribution of frequencies of different elements in the stream.
Let $f_i$ be the number of occurrences of the $i$th element for any $i \in [1, n]$; then the $k$th frequency moment is $F_k = \sum_{i=1}^{n} f_i^k$.

6 Frequency Moment
The 0th moment is the sum of 1 for each $f_i > 0$. Hence it counts the number of distinct items.
The 1st moment is the sum of the $f_i$'s, which must be the length of the stream. This is easy to calculate.
The 2nd moment is the sum of the squares of the $f_i$'s. It is sometimes called the surprise number, as it measures the unevenness of the distribution of elements.
Suppose we have a stream of length 100.
Scenario 1: There are 10 elements, each with frequency 10. $F_2 = 10 \times 10^2 = 1000$.
Scenario 2: There are 10 elements; the 1st item has frequency 91 and the rest each have frequency 1. $F_2 = 91^2 + 9 \times 1^2 = 8290$.
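The two scenarios can be checked by computing the moments exactly with a full frequency table. This is only a baseline sketch that stores every count, which is what the small-space methods in the following slides avoid. The helper name `frequency_moment` and the integer item labels are illustrative assumptions.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of f_i**k over items with f_i > 0."""
    counts = Counter(stream)  # stores one counter per distinct item
    return sum(f ** k for f in counts.values())

# Scenario 1: 10 distinct items, each appearing 10 times.
uniform = [item for item in range(10) for _ in range(10)]
# Scenario 2: item 0 appears 91 times, items 1..9 appear once each.
skewed = [0] * 91 + list(range(1, 10))

for name, stream in (("uniform", uniform), ("skewed", skewed)):
    print(name, [frequency_moment(stream, k) for k in (0, 1, 2)])
# uniform [10, 100, 1000]
# skewed [10, 100, 8290]
```

Both streams have the same $F_0$ and $F_1$; only $F_2$ separates the even distribution from the skewed one.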

7 Computing $F_2$ in Small Space
Alon-Matias-Szegedy: Linear Sketching

8 Linear Sketch for $F_2$
Problem: Given a stream $A_1, A_2, \dots, A_m$ where elements come from the universe $[1, n]$, estimate $F_2 = \sum_{i=1}^{n} f_i^2$ in small space.
Output: Return an estimate $\hat{F}_2$ such that $\Pr\bigl((1-\epsilon)F_2 \le \hat{F}_2 \le (1+\epsilon)F_2\bigr) \ge 1-\delta$, where $\epsilon > 0$ and $\delta > 0$ are respectively the error and confidence parameters.

9 Linear Sketch for $F_2$
[Figure: the sign matrix $z$ (dimension $k \times n$, entries $+1$ or $-1$) multiplied by the frequency vector (dimension $n \times 1$) gives the sketch (dimension $k \times 1$).]
To construct each row, pick a hash function $h: \{1, \dots, n\} \to \{+1, -1\}$ uniformly at random from a 4-wise independent hash family, and set $z(l, i) = h_l(i)$.
Pick $k$ such hash functions independently, $h_1, h_2, \dots, h_k$, to construct the $k$ rows.

11 Linear Sketch for $F_2$
When the $i$th element appears, simply
update $Z_1 = Z_1 + z_1(i)$
update $Z_2 = Z_2 + z_2(i)$
update $Z_3 = Z_3 + z_3(i)$
...
update $Z_k = Z_k + z_k(i)$
Space requirement = $k \log(n)$
Estimate = $(Z_1^2 + Z_2^2 + \dots + Z_k^2)/k$
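The update rule and estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code. It assumes items are non-negative integers smaller than the prime `P`, and it builds each sign function from a random cubic polynomial modulo `P`, which gives a 4-wise independent hash value. Taking the low bit as the sign is a common shortcut that is only approximately unbiased (bias on the order of $1/P$). The class name `AMSSketch` and the `seed` and `count` parameters are illustrative additions.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime used as the field size (assumption)

class AMSSketch:
    """k counters Z_1..Z_k, each a random +1/-1 combination of frequencies."""

    def __init__(self, k, seed=None):
        rng = random.Random(seed)
        # One random cubic polynomial per row; evaluating a random
        # degree-3 polynomial over a prime field is 4-wise independent.
        self.coeffs = [[rng.randrange(P) for _ in range(4)] for _ in range(k)]
        self.Z = [0] * k

    def _sign(self, row, item):
        a, b, c, d = self.coeffs[row]
        h = (((a * item + b) * item + c) * item + d) % P
        # Map the low bit to +1/-1 (slightly biased because P is odd).
        return 1 if h & 1 else -1

    def update(self, item, count=1):
        # Z_l = Z_l + z_l(item), repeated `count` times.
        for row in range(len(self.Z)):
            self.Z[row] += count * self._sign(row, item)

    def estimate(self):
        # (Z_1^2 + ... + Z_k^2) / k
        return sum(z * z for z in self.Z) / len(self.Z)

sketch = AMSSketch(k=400, seed=7)
for item in [0] * 91 + list(range(1, 10)):
    sketch.update(item)
print(sketch.estimate())  # an estimate of F_2 = 8290 for this stream
```

Because every counter is a linear function of the frequency vector, two sketches built with the same hash functions can be added counter by counter to obtain the sketch of the combined stream.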

17 Estimate: $\hat{F}_2 = \frac{1}{k} \sum_{i=1}^{k} Z_i^2$
Why is this a good estimate?
Show $E[\hat{F}_2] = F_2$.
Show $\mathrm{Var}[\hat{F}_2] \le \frac{2F_2^2}{k}$.
Apply Chebyshev: $\mathrm{Prob}\bigl(|\hat{F}_2 - F_2| > \epsilon F_2\bigr) \le \frac{\mathrm{Var}(\hat{F}_2)}{\epsilon^2 F_2^2}$.
Take $k = \frac{16}{\epsilon^2}$. Then $\mathrm{Prob}\bigl(|\hat{F}_2 - F_2| > \epsilon F_2\bigr) \le \frac{1}{8}$, so $\mathrm{Prob}\bigl((1-\epsilon)F_2 \le \hat{F}_2 \le (1+\epsilon)F_2\bigr) \ge \frac{7}{8}$.
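For reference, the substitution behind the $\frac{1}{8}$ bound can be written out in one line, using only the variance bound $\mathrm{Var}(\hat{F}_2) \le 2F_2^2/k$ and the choice $k = 16/\epsilon^2$ stated on the slide:

```latex
\[
\mathrm{Prob}\bigl(|\hat{F}_2 - F_2| > \epsilon F_2\bigr)
\;\le\; \frac{\mathrm{Var}(\hat{F}_2)}{\epsilon^2 F_2^2}
\;\le\; \frac{2F_2^2/k}{\epsilon^2 F_2^2}
\;=\; \frac{2}{k\,\epsilon^2}
\;=\; \frac{2}{16}
\;=\; \frac{1}{8}.
\]
```

As an illustration, $\epsilon = 0.1$ would call for $k = 16/0.01 = 1600$ rows.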

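20 Expectation of $Z_s^2$
Write $Z$ for $Z_s$, $s = 1, 2, \dots, k$.
$Z = \sum_{i=1}^{n} f_i z(i)$, $Z^2 = \sum_{i,j \in [1,n]} f_i f_j z_i z_j$
$E[Z^2] = \sum_{i,j \in [1,n]} E[f_i f_j z_i z_j] = \sum_i E[f_i^2 z_i^2] = \sum_i f_i^2 = F_2$, since $E[z_i z_j] = 0$ if $i \ne j$ and $E[z_i^2] = 1$.
$E[\hat{F}_2] = \frac{1}{k} \sum_{s=1}^{k} E[Z_s^2] = F_2$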

24 Variance of $Z_s^2$
$\mathrm{Var}(Z^2) = E[Z^4] - (E[Z^2])^2$
$E[Z^4] = \sum_i f_i^4 E[z_i^4] + \binom{4}{2} \sum_{i,j: i<j} f_i^2 f_j^2 E[z_i^2 z_j^2] = \sum_i f_i^4 + 6 \sum_{i,j: i<j} f_i^2 f_j^2$, since $E[z_i z_j z_k z_l] = 0$ if $i < j < k < l$ or 3 of the terms are equal.
$(E[Z^2])^2 = \bigl(\sum_i f_i^2\bigr)^2 = \sum_i f_i^4 + 2 \sum_{i,j: i<j} f_i^2 f_j^2$
$\mathrm{Var}(Z^2) = 4 \sum_{i,j: i<j} f_i^2 f_j^2 \le 2F_2^2$
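The expectation and variance formulas can be checked exactly on a tiny example by enumerating every sign vector. This sketch assumes fully independent uniform signs, which is a special case of 4-wise independence. The frequency vector `freqs` and the function name are made up for illustration.

```python
from itertools import product

def exact_moments_of_z2(freqs):
    """Enumerate all 2^n sign vectors; return (E[Z^2], Var(Z^2))."""
    values = []
    for signs in product((1, -1), repeat=len(freqs)):
        z = sum(f * s for f, s in zip(freqs, signs))
        values.append(z * z)
    mean = sum(values) / len(values)                   # E[Z^2]
    second = sum(v * v for v in values) / len(values)  # E[Z^4]
    return mean, second - mean ** 2

freqs = [3, 1, 4, 1, 5]
f2 = sum(f * f for f in freqs)
# Sum over i < j of f_i^2 * f_j^2.
cross = sum(freqs[i] ** 2 * freqs[j] ** 2
            for i in range(len(freqs)) for j in range(i + 1, len(freqs)))

mean, var = exact_moments_of_z2(freqs)
print(mean, f2)                     # 52.0 52
print(var, 4 * cross, 2 * f2 ** 2)  # 3480.0 3480 5408
```

The first line confirms $E[Z^2] = F_2$; the second confirms $\mathrm{Var}(Z^2) = 4 \sum_{i<j} f_i^2 f_j^2 \le 2F_2^2$ for this vector.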

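25 Variance of $\hat{F}_2$
$\mathrm{Var}(\hat{F}_2) = \mathrm{Var}\bigl(\frac{1}{k} \sum_{s=1}^{k} Z_s^2\bigr)$
$= \frac{1}{k^2} \mathrm{Var}\bigl(\sum_{s=1}^{k} Z_s^2\bigr)$, since $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$ for any constant $a$
$= \frac{1}{k^2} \sum_{s=1}^{k} \mathrm{Var}(Z_s^2)$, since the $k$ rows are chosen independently
$\le \frac{1}{k^2} \cdot 2kF_2^2 = \frac{2F_2^2}{k}$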

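28 Boosting Confidence by Median
We have $\mathrm{Prob}\bigl((1-\epsilon)F_2 \le \hat{F}_2 \le (1+\epsilon)F_2\bigr) \ge \frac{7}{8}$.
We want $\mathrm{Prob}\bigl((1-\epsilon)F_2 \le \hat{F}_2 \le (1+\epsilon)F_2\bigr) \ge 1-\delta$.
Take $t$ independent estimates $H_1 = \hat{F}_2^{(1)}, H_2 = \hat{F}_2^{(2)}, \dots, H_t = \hat{F}_2^{(t)}$.
Return the median of $H_1, H_2, \dots, H_t$.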

30 Boosting by Median
Suppose there is an algorithm that returns an estimate $\hat{F}$ of a true value $F$ such that $|\hat{F} - F|$ is small with probability $\frac{7}{8}$.
How can we design an algorithm that will return an estimate $G$ of $F$ such that $|G - F|$ is small with probability 99/100? (In general, $1-\delta$.)
Run $s = 2 \log\frac{2}{\delta} + 1$ independent copies of the algorithm to obtain estimates $\hat{F}_1, \hat{F}_2, \dots, \hat{F}_s$.
Set $G = \mathrm{median}_{i=1}^{s} \hat{F}_i$.
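The median step is independent of what is being estimated, so it can be sketched as a small wrapper around any randomized estimator. The code below assumes the logarithm is base 2 and rounds it up so that the number of copies is an odd integer. `boost_by_median` and `toy_estimator` are hypothetical names, and the toy estimator simply returns the true value with probability 7/8 and a wildly wrong value otherwise.

```python
import math
import random
from statistics import median

def boost_by_median(run_once, delta):
    """Run s = 2*ceil(log2(2/delta)) + 1 independent copies, return the median."""
    copies = 2 * math.ceil(math.log2(2 / delta)) + 1
    return median([run_once() for _ in range(copies)])

def toy_estimator(rng, truth=100.0):
    # Good with probability 7/8, off by a factor of 10 otherwise.
    if rng.random() < 7 / 8:
        return truth
    return truth * rng.choice((0.1, 10.0))

rng = random.Random(0)
print(boost_by_median(lambda: toy_estimator(rng), delta=0.01))
```

With $\delta = 0.01$ this runs 17 copies. The median is wrong only if at least half of the copies are wrong, which is far less likely than a single copy being wrong.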

33 Boosting by Median
What is the probability that the median is a bad estimate?
Either all $\frac{s}{2}$ copies with estimate below $G$ are bad, or all $\frac{s}{2}$ copies with estimate above $G$ are bad.
That is, at least $\log\frac{2}{\delta}$ copies must be bad for $G$ to be a bad estimate.
Show that the probability of the median being bad is $\le \delta$.
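One possible way to carry out this exercise is a union bound over the sets of copies that could all be bad. The sketch below assumes the logarithm is base 2, that $s = 2\log_2\frac{2}{\delta} + 1$ is an odd integer, and that each copy is bad independently with probability at most $\frac{1}{8}$. It is not necessarily the argument intended in the lecture.

```latex
\[
\mathrm{Prob}(G \text{ is bad})
\;\le\; \mathrm{Prob}\bigl(\text{at least } \lceil s/2 \rceil \text{ copies are bad}\bigr)
\;\le\; \binom{s}{\lceil s/2 \rceil} \Bigl(\frac{1}{8}\Bigr)^{\lceil s/2 \rceil}
\;\le\; 2^{s} \cdot 2^{-3s/2}
\;=\; 2^{-s/2}
\;<\; 2^{-\log_2 (2/\delta)}
\;=\; \frac{\delta}{2}
\;\le\; \delta .
\]
```

The second inequality counts the at most $\binom{s}{\lceil s/2 \rceil} \le 2^s$ ways to choose which copies are bad, and the strict inequality uses $\frac{s}{2} = \log_2\frac{2}{\delta} + \frac{1}{2}$.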

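34 Frequency Moment
For $k > 2$, the best bound known is $\tilde{O}\bigl(n^{1-\frac{2}{k}} \log\frac{1}{\delta}\bigr)$, barring a $\mathrm{poly}(\frac{1}{\epsilon})$ factor. There is an almost matching lower bound of $\Omega\bigl(n^{1-\frac{2}{k}}\bigr)$.
For $k < 2$, the best bound known is $\tilde{O}\bigl(\frac{1}{\epsilon^2} \log\frac{1}{\delta}\bigr)$.
The algorithms use a clever combination of sketching and hashing.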

35 Sketching as a Versatile Tool
Estimating entropy, quantiles, heavy hitters, fitting histograms, etc.
Applications beyond streaming: dimensionality reduction, nearest neighbors, anomaly detection, statistics over social networks.
Not only useful for small-space algorithm design, but also for fast running time, distributed processing, etc.

36 Sketching as a Versatile Tool
Slide from Piotr Indyk's course on Streaming, Sketching and Compressed Sensing

39 Sliding Window Model
Only the last $W$ items matter, where $W$ is the window size.
Can you extend the Bloom filter and the FM sketch to this setting?
Can you extend the Count-Min sketch or linear sketching techniques to this setting?

42 Decaying Window Model
No fixed window size, but older items have less importance.
Can you extend the Bloom filter and the FM sketch to this setting?
Can you extend the Count-Min sketch or linear sketching techniques to this setting?
