Algorithms for Streaming Data

Size: px

Start display at page:

Download "Algorithms for Streaming Data"

Robyn Booker
6 years ago
Views:

1 Algorithms for Piotr Indyk

2 Problems defined over elements P={p 1,,p n } The algorithm sees p 1, then p 2, then p 3, Key fact: it has limited storage Can store only s<<n points Can store only s<<n bits (need to assume finite precision) p 1...p 2...p 3.p 4...p 5.p 6.p 7.

3 Example - diameter

4 Problems Diameter l 2 norm of a high-dimensional vector Histograms

5 Diameter in l d Assume we measure distances according to the l norm What can we do?

6 Diameter in l, ctd. From previous lecture we know that Diam (P)=max i=1 d [max p P p i -min p P p i ] Can maintain max/min in constant space Total space = O(d ) What about l 1?

7 Diameter in l 1 Let f:l 1d l 2^d be an isometric embedding We will maintain Diam (f(p)) For each point p, we compute f(p) and feed it to the previous algorithm Return the pair p,q that maximizes f(p)-f(q) This gives O(2 d ) space for l 1 d What about l 2?

8 Diameter in l 2 Let f:l 2d l d, d =O(1/ε) (d-1)/2, be a (1+ε)- distortion embedding Apply the same algorithm as before Parameters: Space: O(1/ε) (d-1)/2

9 Maintaining l 2 norm of a vector Implicit vector x=(x 1,,x n ) Start with x=0 Stream: sequence of pairs (i,b), meaning x i =x i +b Goal: maintain (approximately) x 2

10 Motivation Consider a set of web pages, stored in some order Two pages are similar if they link to the same page Note that each page is similar to itself Want to know the number of triplets <a,b,c> using small space Web pages streamed sequentially a c b

11 Connection to l 2 norm Let In(i)be the # in-links to page i Out(i)be the # out-links of page i Out(i)is easy to compute, In(i) is not We want to compute ½* i In(i) (In(i)+1) = ½ [ i In(i) 2 + i In(i)] Every time we see link to i: In(i):=In(i)+1

12 Approximate Algorithm Algorithm: Computes a (1+ε)-approximation to x 2 with probability 1-P Stores O(log(1/P)/ε 2 ) numbers

13 Algorithm From JL lemma, it suffices to maintain Ax for random A, since Ax x Assume we have Ax Need to compute Ay, where y=x except for y i =x i +b Use linearity: Ay = A(y-x)+Ax = A(be i )+Ax = b a i + Ax

14 Pseudo-randomness In practice: use A[i,j]=Normal( RND(i,j) ) In theory: one can use bounded space random generators to generate A using only O( log n * log(1/p)/ε 2 ) random numbers

15 Histograms View x as a function x:[1 n] -> [1 M] Approximate it using piecewise constant function h, with B pieces (buckets)

16 Our goal Want to maintain the best B-bucket representation of x, under changes of x Measure the error using 2-norm (1-norm also OK)

17 Our Approach Maintain sketches Ax of x Using Ax, construct B-histogram h which approximately minimizes x-h

18 Our result Can maintain a B-histogram h such that x-h (1+ε) x-h* up to a factor of using poly(log n, B, 1/ε) time/space, with probability 1-1/poly(n)

19 Proof: by iterated improvement Bbuckets, >n B construction time O(B log n) buckets, n 3 construction time B log 2 n buckets, n 2 construction time B log 2 n buckets, n poly(b+log n) time B log O(1) n buckets, poly(b+log n) time Bbuckets, poly(b+log n) time

20 Exponential time approach There are at most (2Mn) B+1 functions h By JL lemma, can reduce dimension to O(B log (Mn)), and approximately preserve x-h for all h To reconstruct h, minimize Ax-Ah Can be trivially done by enumerating all h s

21 Greedy approach Start from h=0, i=0 Repeat Let h(i,c) be the histogram h with the values in the interval I set to c Choose h =h(i,c) to minimize Ax Ah h=h Increment i Until i> const*b*log M

22 Details The square of Ax-A h(i,c) is a quadratic function of c Once we compute the parameters of this function, e.g. E(c)=Ac 2 +Bc+D, the minimum is achieved for c=b/(2a)

23 How does it help O(n 2 ) intervals O(n) time to find best c minimizing Ax-A h(i,c) 2 Overall: O(n 3 ) time

24 Approximation factor Assume ε=0, for simplicity Let h* be the optimal B-histogram If we replaced the current histogram h by all B intervals of h* (with proper values c), we would reduce the squared error from x-h 2 to x-h* 2 Thus, there is an interval I of h* (and c) such that x h(i,c) 2 - x-h 2 - (1-1/B) ( x-h 2 - x-h* 2 ) or Err i+1 (1-1/B) Err i, where Err i = x-h 2 - x-h* in stage i Since Err 0 n M 2, we get that i=o(b ln (nm 2 )) is enough to reduce Err i to 1

25 For the curious

26 Minimum Enclosing Ball Problem: given P={p 1 p n }, find center o and radius r>0 such that P B(o,r) ris as small as possible Solve the problem in l Generalize to l 1 and l 2 via embeddings

27 MEB in l Let C be the hyper-rectangle defined by max/min in every dimension Easy to see that min radius ball B(o,r) is a min size hypercube that contains C Min radius = min side length/2 How to solve it in l 2?

28 MEB in l 2 Firstly, assume (1+ε) 1 Let f:l 2d l d be an almost isometric embedding Algorithm: For each point p, compute f(p) Maintain MEB B (o,r) of f(p 1 ) f(p n ) Compute o such that f(o)=o Report B(o,r)

29 Problem There might be NO o such that f(o)=o If it was the case, then we would always have MEB radius=diameter/2, which is not true: The problem is that f is into, not onto

30 The Correct Version Algorithm: Maintain the min/max points f(p 1 ) f(p 2d ), two points per dimension Compute MEB B(o,r) of p 1 p 2d Report B(o,r(1+ε))

31 Correctness MEB radius for P = Min r s.t. o P B(o,r) Min r s.t. o f(p) B(f(o),r) = Min r s.t. o {f(p 1 ) f(p 2d )} B(f(o),r) Min r s.t. o {p 1 p 2d } B(o,r) = MEB radius for {p 1 p 2d } Total error at most (1+ε) 2 In reality, at most (1+ε)

32 Digression: Core Sets In the previous slide we use the fact that in l, for any set P of points, there is a subset P of P, P =2d, such that MEB(P )=MEB(P) P is called a core-set for the MEB of P in l For more on core-sets, see the web page by Sariel Har-Peled

CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms

CS 372: Computational Geometry Lecture 14 Geometric Approximation Algorithms Antoine Vigneron King Abdullah University of Science and Technology December 5, 2012 Antoine Vigneron (KAUST) CS 372 Lecture