B669 Sublinear Algorithms for Big Data

Size: px

Start display at page:

Download "B669 Sublinear Algorithms for Big Data"

Clinton Burke
5 years ago
Views:

1 B669 Sublinear Algorithms for Big Data Qin Zhang 1-1

2 2-1 Part 1: Sublinear in Space

3 The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store everything. Applications: Internet router, stock data, ad auction, flight logs on tape, etc. (next 4 slides, in courtesy of Jeff Phillps) 3-1

4 Network routers Packets limited space Router Internet Router data per day: at least 1 Terabyte packet takes 8 nanoseconds to pass through router few million packets per second What statistics can we keep on data? For example, want to detect anomalies for security. 4-1

5 Telephone Switch txt, msg limited space Switch Cell phones connect through switches each message 1000 Bytes 500 million calls / day 1 Terabyte per month second Search for characteristics for dropped calls? 5-1

6 Ad Auction page view ad click keyword search limited space Server ad served delivery model Serving Ads on web Google, Yahoo!, Microsoft Yahoo.com viewed 77 trillion times 2 million / hour Each page serves ads; which ones? How to update ad delivery model? 6-1

50 years, total about 9 million flights Each flight has trajectory,

7 Flight Logs on Tape CPU statistics All airplane logs over Washington, DC About flights per day. 50 years, total about 9 million flights Each flight has trajectory, passenger count, control dialog. Stored on Tape. Can only make 1 (or O(1)) pass! What statistics can be gathered? 7-1

8 1.1 Distinct Elements RAM How many distinct elements? Approximation needed. CPU 8-1

9 The FM sketch Denote the stream by A = a 1, a 2,..., a m, where m is the length of the stream, which is unknown at the beginning. Let n be the item universe. Let f i be the frequency of item i in the steam. a i = (i, ) denote f i f i +. This algorithm only works for = 1 (insertion only). 9-1

10 The FM sketch Denote the stream by A = a 1, a 2,..., a m, where m is the length of the stream, which is unknown at the beginning. Let n be the item universe. Let f i be the frequency of item i in the steam. a i = (i, ) denote f i f i +. This algorithm only works for = 1 (insertion only). The algorithm (Flajoet and Martin 83) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0. Let zeros(h(e)) be the # tailing zeros of the binary representation of h(e). 2. For each new coming item e, if zeros(h(e)) > z, then set z = zeros(h(e)); 3. Output 2 z

11 The FM sketch Denote the stream by A = a 1, a 2,..., a m, where m is the length of the stream, which is unknown at the beginning. Let n be the item universe. Let f i be the frequency of item i in the steam. a i = (i, ) denote f i f i +. This algorithm only works for = 1 (insertion only). The algorithm (Flajoet and Martin 83) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0. Let zeros(h(e)) be the # tailing zeros of the binary representation of h(e). 2. For each new coming item e, if zeros(h(e)) > z, then set z = zeros(h(e)); 3. Output 2 z+0.5. Analysis (on board) 9-3

12 The FM sketch 9-4 Denote the stream by A = a 1, a 2,..., a m, where m is the length of the stream, which is unknown at the beginning. Let n be the item universe. Let f i be the frequency of item i in the steam. a i = (i, ) denote f i f i +. This algorithm only works for = 1 (insertion only). The algorithm (Flajoet and Martin 83) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0. Let zeros(h(e)) be the # tailing zeros of the binary representation of h(e). 2. For each new coming item e, if zeros(h(e)) > z, then set z = zeros(h(e)); 3. Output 2 z+0.5. Analysis (on board) Theorem The number of distinct elements can be O(1)-approximated with probability 2/3 using O(log n) bits.

13 Probability amplification Can we boost the success probability to 1 δ? The idea is to run k = Θ(log(1/δ)) copies of this algorithm in parallel, using mutually independent random hash functions, and output the median of the k answers. 10-1

14 An improved algorithm Idea: two-level hashing. The algorithm (Bar-Yossef et al. 02) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0, B =. Choose a secondary 2-universal hash function g : [n] [(log n/ɛ) O(1) ]. 2. For each new coming item e, if zeros(h(e)) z, then (a) set B = B {g(e)}; (b) if B > c/ɛ 2 then set z = z + 1 and rehash B. 3. Output B 2 z. 11-1

15 An improved algorithm Idea: two-level hashing. The algorithm (Bar-Yossef et al. 02) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0, B =. Choose a secondary 2-universal hash function g : [n] [(log n/ɛ) O(1) ]. 2. For each new coming item e, if zeros(h(e)) z, then (a) set B = B {g(e)}; (b) if B > c/ɛ 2 then set z = z + 1 and rehash B. 3. Output B 2 z. Analysis (on board) 11-2

16 An improved algorithm Idea: two-level hashing. The algorithm (Bar-Yossef et al. 02) 1. Choose a random hash function h : [n] [n] from a 2-universal family. Set z = 0, B =. Choose a secondary 2-universal hash function g : [n] [(log n/ɛ) O(1) ]. 2. For each new coming item e, if zeros(h(e)) z, then (a) set B = B {g(e)}; (b) if B > c/ɛ 2 then set z = z + 1 and rehash B. 3. Output B 2 z. Analysis (on board) Theorem The number of distinct elements can be (1 + ɛ)-approximated with probability 2/3 using O(log n + 1/ɛ 2 (log(1/ɛ) + log log n)) bits. 11-3

17 Nice algorithms, but only work for insertion-only sequences

18 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M = Mv answer v 13-1

19 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M = Mv answer v Simple and useful Perfect for streaming and distributed computations. Work for insertion+deletion sequences (that is, can be either positive or negative). 13-2

20 Distinct elements using linear sketches Search version Decision version Let D be # distinct elements: If D T (1 + ɛ), then answer YES. If D T /(1 + ɛ), then answer NO. Try T = 1, (1 + ɛ), (1 + ɛ) 2,

21 Now, the decision problem The algorithm 1. Select a random set S {1, 2,..., n}, s.t. for each i, independently, we have Pr[i S] = 1/T 2. Make a pass over the stream, maintaining Sum S (x) = i S x i Note: this is a linear sketch. 3. If Sum S (x) > 0, return YES, otherwise return NO. 15-1

22 Now, the decision problem 15-2 The algorithm 1. Select a random set S {1, 2,..., n}, s.t. for each i, independently, we have Pr[i S] = 1/T 2. Make a pass over the stream, maintaining Sum S (x) = i S x i Note: this is a linear sketch. 3. If Sum S (x) > 0, return YES, otherwise return NO. Lemma Let P = Pr[Sum S (x) = 0]. If T is large enough, and ɛ is small enough, then If D T (1 + ɛ), then P < 1/e ɛ/3. If D T /(1 + ɛ), then P > 1/e + ɛ/3. Proof (on board)

23 Amplify the success probability Repeat to amplify the success probability 1. Select k sets S 1,..., S k as in previous algorithm, for k = C log(1/δ)/ɛ 2, C > 0 2. Let Z be the number of values of Sum Sj (x) that are equal to 0, j = 1,..., k. 3. If Z < k/e then report YES, otherwise report NO. 16-1

24 Amplify the success probability Repeat to amplify the success probability 1. Select k sets S 1,..., S k as in previous algorithm, for k = C log(1/δ)/ɛ 2, C > 0 2. Let Z be the number of values of Sum Sj (x) that are equal to 0, j = 1,..., k. 3. If Z < k/e then report YES, otherwise report NO. Lemma If the constant C is large enough, then this algorithm reports a correct answer with probability 1 δ. 16-2

25 Amplify the success probability Repeat to amplify the success probability 1. Select k sets S 1,..., S k as in previous algorithm, for k = C log(1/δ)/ɛ 2, C > 0 2. Let Z be the number of values of Sum Sj (x) that are equal to 0, j = 1,..., k. 3. If Z < k/e then report YES, otherwise report NO. Lemma If the constant C is large enough, then this algorithm reports a correct answer with probability 1 δ. Proof (on board) 16-3

26 Amplify the success probability Repeat to amplify the success probability 1. Select k sets S 1,..., S k as in previous algorithm, for k = C log(1/δ)/ɛ 2, C > 0 2. Let Z be the number of values of Sum Sj (x) that are equal to 0, j = 1,..., k. 3. If Z < k/e then report YES, otherwise report NO. Lemma If the constant C is large enough, then this algorithm reports a correct answer with probability 1 δ. Proof (on board) 16-4 Theorem The number of distinct elements can be (1 ± ɛ)-approximated with probability 1 δ using O(log 2 n log(1/δ)/ɛ 3 ) bits.

27 Question: can we make FM sketch linear? 17-1

28 Heavy Hitters

29 Heavy hitters L p heavy hitter set: HH p φ (x) = {i : x i φ x p } 19-1

30 Heavy hitters L p heavy hitter set: HH p φ (x) = {i : x i φ x p } L p Heavy Hitter Problem: Given φ, φ, (often φ = φ ɛ), return a set S such that HH p φ (x) S HHp φ (x) 19-2

31 Heavy hitters L p heavy hitter set: HH p φ (x) = {i : x i φ x p } L p Heavy Hitter Problem: Given φ, φ, (often φ = φ ɛ), return a set S such that HH p φ (x) S HHp φ (x) L p Point Query Problem: Given ɛ, after reading the whole stream, given i, report x i = x i ± ɛ x p 19-3

32 Space-saving: an algorithm for insertion only Algorithm Space-saving [Metwally et al. 05] When a new item e comes, we have two cases. 1. If e is already in the array. We just increment f (e) by 1 and reinsert the (e, f (e)) into the array. 2. If e is not in the array, we create a new tuple (e, MIN + 1) where MIN = min{f (e i ) : e i is in the array}. We always keep the array sorted according to f (e i ), and then MIN is just the estimated frequency of the last item. If the length array is larger than 1/ɛ, we delete the last tuple. Cost: O(1/ɛ log n) bits. 20-1

33 Space-saving: an algorithm for insertion only Algorithm Space-saving [Metwally et al. 05] When a new item e comes, we have two cases. 1. If e is already in the array. We just increment f (e) by 1 and reinsert the (e, f (e)) into the array. 2. If e is not in the array, we create a new tuple (e, MIN + 1) where MIN = min{f (e i ) : e i is in the array}. We always keep the array sorted according to f (e i ), and then MIN is just the estimated frequency of the last item. If the length array is larger than 1/ɛ, we delete the last tuple. Cost: O(1/ɛ log n) bits. Proof (on board) 20-2

34 L 1 point query Algorithm Count-Min [Cormode and Muthu 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 2/ɛ) from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that = i:h t (i)=j x i Z t j Estimator: x i = min t Z t h t (i) 21-1

35 L 1 point query Algorithm Count-Min [Cormode and Muthu 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 2/ɛ) from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that = i:h t (i)=j x i Z t j Estimator: x i = min t Z t h t (i) Analysis (on board) 21-2

36 L 1 point query Algorithm Count-Min [Cormode and Muthu 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 2/ɛ) from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that = i:h t (i)=j x i Z t j Estimator: x i = min t Z t h t (i) Analysis (on board) Theorem We can solve L 1 point query, with approximation ɛ, and failure probability δ by storing O(1/ɛ log(1/δ) log n) bits. 21-3

37 L 2 point query Algorithm Count-Sketch [Charikar et. al. 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 3/ɛ 2 ) from a 2-universal family. Pick d (d = log(1/δ)) independent hash functions g 1,..., g d where g i : {1,..., n} { 1, 1} from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that Zj t = i:h t (i)=j g t(i)x i Estimator: x i = median{z 1 h 1 (i),..., Z d h d (i)} 22-1

38 L 2 point query Algorithm Count-Sketch [Charikar et. al. 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 3/ɛ 2 ) from a 2-universal family. Pick d (d = log(1/δ)) independent hash functions g 1,..., g d where g i : {1,..., n} { 1, 1} from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that Zj t = i:h t (i)=j g t(i)x i Estimator: x i Analysis (on board) = median{z 1 h 1 (i),..., Z d h d (i)} 22-2

39 L 2 point query Algorithm Count-Sketch [Charikar et. al. 05] Pick d (d = log(1/δ)) independent hash functions h 1,..., h d where h i : {1,..., n} {1,..., w} (w = 3/ɛ 2 ) from a 2-universal family. Pick d (d = log(1/δ)) independent hash functions g 1,..., g d where g i : {1,..., n} { 1, 1} from a 2-universal family. Maintain d vectors Z 1,..., Z d where Z t = {Z1, t..., Zw t } such that Zj t = i:h t (i)=j g t(i)x i Estimator: x i Analysis (on board) Theorem = median{z 1 h 1 (i),..., Z d h d (i)} We can solve L 2 point query, with approximation ɛ, and failure probability δ by storing O(1/ɛ 2 log(1/δ) log n) bits. 22-3

40 L 2 point query (an alternative approach) The algorithm: [Gilbert, Kotidis, Muthukrishnan and Strauss 01] Maintain a sketch Rx such that s = Rx 2 = (1 ± ɛ) x 2 (R is a O(1/ɛ 2 log(1/δ)) n matrix, which can be constructed, e.g., by taking each cell to be N (0, 1)) Estimator: x i = (1 Rx/s Re i 2 2 /2)s 23-1

41 L 2 point query (an alternative approach) The algorithm: [Gilbert, Kotidis, Muthukrishnan and Strauss 01] Maintain a sketch Rx such that s = Rx 2 = (1 ± ɛ) x 2 (R is a O(1/ɛ 2 log(1/δ)) n matrix, which can be constructed, e.g., by taking each cell to be N (0, 1)) Estimator: x i = (1 Rx/s Re i 2 2 /2)s Theorem Johnson-Linderstrauss Lemma x, we have (1 ɛ) x 2 Rx 2 (1 + ɛ) x 2 w.p. 1 δ. 23-2

42 L 2 point query (an alternative approach) The algorithm: [Gilbert, Kotidis, Muthukrishnan and Strauss 01] Maintain a sketch Rx such that s = Rx 2 = (1 ± ɛ) x 2 (R is a O(1/ɛ 2 log(1/δ)) n matrix, which can be constructed, e.g., by taking each cell to be N (0, 1)) Estimator: x i = (1 Rx/s Re i 2 2 /2)s Theorem Johnson-Linderstrauss Lemma x, we have (1 ɛ) x 2 Rx 2 (1 + ɛ) x 2 w.p. 1 δ. Theorem We can solve L 2 point query, with approximation ɛ, and failure probability δ by storing O(1/ɛ 2 log(1/δ) log n) bits. 23-3

43 24-1 Which one is better?

44 Attribution Some of the contents are borrowed from Amit Chakrabarti s course Pitor Indyk s course and Andrew McGregor s course CS711S12/index.html 25-1

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Linear Sketches A Useful Tool in Streaming and Compressive Sensing Qin Zhang 1-1 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M =