Interval Selection in the Streaming Model


Pascal Bemmann

Abstract. In the interval selection problem we are given a set of intervals via a stream and want to find a maximum set of pairwise independent intervals. Let $\alpha(I)$ denote the size of an optimal solution. We present the results of S. Cabello and P. Pérez-Lantero for estimating $\alpha(I)$ in the streaming model, where only one pass over the data is allowed, the endpoints of intervals lie within the range $\{1, \dots, n\}$, and the memory is constrained. For intervals of potentially different sizes we provide an algorithm that computes an estimate $\hat\alpha$ of $\alpha(I)$ such that $\frac{1}{2}(1-\varepsilon)\alpha(I) \le \hat\alpha \le \alpha(I)$ holds with probability at least 2/3. For same-size intervals we explain an algorithm that computes an estimate $\hat\alpha$ of $\alpha(I)$ for which $\frac{2}{3}(1-\varepsilon)\alpha(I) \le \hat\alpha \le \alpha(I)$ holds with probability at least 2/3. The required space is polynomial in $\varepsilon^{-1}$ and $\log n$. We also present approximation algorithms for the interval selection problem which use $O(\alpha(I))$ space and which are used in the mentioned estimates.

Contents: 1 Introduction; 2 Basics and Definitions (Intervals, Sampling); 3 H-random samples; 4 A 2-approximation algorithm; 5 Estimating the size of an optimal solution (Segments, Algorithms in the Streaming Model); 6 Same-size intervals (Largest independent set of same-size intervals, Size of largest independent set of same-size intervals); 7 Conclusion and other results.

1 Introduction

In this work we present results developed by Cabello and Pérez-Lantero [1]. We consider problems in the streaming model, in which typically huge data sets arrive sequentially and the task is to solve problems with limited memory. We assume that we are not able to look at input items again unless we saved them in our memory; in other words, the algorithm makes only one pass over the input. Furthermore we assume that the input stream is too big to be stored as a whole in memory.

Pascal Bemmann, Universität Paderborn, Warburger Str. 100, Paderborn, pbemmann@mail.upb.de

Within this model we analyze the interval selection problem. As input we receive intervals within a predetermined range. The task is to find the biggest set of intervals that are pairwise disjoint while using memory as efficiently as possible. A related problem is to estimate only the size of an optimal solution without providing an actual set of intervals. Both of these problems are analyzed in the following chapters. Note that the interval selection problem is a generalization of the distinct elements problem. This fundamental problem deals with the task of computing the number of pairwise different elements of a data stream: if for all intervals of the interval selection problem both endpoints of each interval are equal, the task becomes to count the number of different points (elements).

We start by providing general definitions and tools that we use to approach the interval selection problem. After that we give the idea of how to design a 2-approximation algorithm in the mentioned setting. This algorithm is used together with other general results to construct an algorithm that estimates the size of an optimal solution for the interval selection problem. At the end we show how the presented results can be improved if we assume that all input intervals are of the same size.

2 Basics and Definitions

In this section we provide definitions and useful tools which we will use in later proofs. To shorten notation we write $[n]$ for the set of integers $\{1, \dots, n\}$. We also assume for all later constructions that $0 < \varepsilon < 1/2$ holds.

Definition 1 (Interval selection problem). Given a set $I$ of (input) intervals, the task is to find a largest subset of intervals that are pairwise disjoint. Such intervals are also called independent. $\alpha(I)$ denotes the size of an optimal solution for this problem.

Another problem that arises from this definition is to estimate $\alpha(I)$ without outputting an independent subset of the input intervals. We consider both of these problems in this work.
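For reference, the offline optimum $\alpha(I)$ is easy to compute when the whole input fits in memory: the classical earliest-right-endpoint greedy is optimal. The following sketch is only a baseline for comparison with the streaming algorithms below; it is not part of the construction of [1].

```python
def alpha(intervals):
    """Offline optimum for interval selection: repeatedly pick the
    interval with the smallest right endpoint among those disjoint
    from the previously picked one."""
    chosen_right = None
    size = 0
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if chosen_right is None or left > chosen_right:  # closed intervals: strict >
            chosen_right = right
            size += 1
    return size

print(alpha([(1, 3), (2, 5), (4, 7), (6, 9)]))  # 2, e.g. {[1,3], [4,7]}
```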

2.1 Intervals

We consider the input intervals of the interval selection problem to be closed. Intervals constructed during the algorithm in Section 4 will be called windows to distinguish them from the input intervals. For the same reason we use the term segment for an interval used in the segment tree in Section 5.1. We say that an interval $I = [x, y]$ is contained in another interval $I'$ if both endpoints $x, y$ are elements of $I'$.

Definition 2 (Leftmost and Rightmost intervals). Given a window $W$ and a set of input intervals $I$, we define: $\mathrm{Leftmost}(W)$ is the interval with the smallest right endpoint among the intervals of $I$ contained in $W$; we use the left endpoint as a tiebreaker, choosing the interval with the largest left endpoint. $\mathrm{Rightmost}(W)$ is the interval with the largest left endpoint among the intervals of $I$ contained in $W$; we use the right endpoint as a tiebreaker, choosing the interval with the smallest right endpoint. In case $W$ contains no input interval, both $\mathrm{Leftmost}(W)$ and $\mathrm{Rightmost}(W)$ are undefined.

If $W$ contains just a single input interval $I \in I$, then $\mathrm{Leftmost}(W) = \mathrm{Rightmost}(W) = I$. Also, the intersection of all input intervals contained in $W$ is equal to $\mathrm{Leftmost}(W) \cap \mathrm{Rightmost}(W)$; otherwise the definition of $\mathrm{Leftmost}(W)$ or $\mathrm{Rightmost}(W)$ would be contradicted. It will be clear from context to which set of input intervals $\mathrm{Leftmost}(W)$ and $\mathrm{Rightmost}(W)$ refer.

2.2 Sampling

Definition 3 (ε-min-wise independence). A family of permutations $H \subseteq \{h : [n] \to [n]\}$ is ε-min-wise independent if for all $X \subseteq [n]$ and all $y \in X$:
$$\frac{1-\varepsilon}{|X|} \le \Pr_{h \in H}[h(y) = \min h(X)] \le \frac{1+\varepsilon}{|X|}.$$
Note that we only consider proper subfamilies of the set of all permutations on $[n]$: for the family of all permutations, $\min h(X)$ is attained by each $y \in X$ with probability exactly $1/|X|$, so $\varepsilon = 0$ would suffice.

Based on this definition we can use the results of Indyk [2] to obtain a family of permutations with properties that we will need later.

Lemma 1. For every $\varepsilon \in (0, 1/2)$ and $n > 0$ there exists a family of permutations $H(n, \varepsilon) = \{h : [n] \to [n]\}$ with the following properties:
(i) $H(n, \varepsilon)$ has $n^{O(\log(1/\varepsilon))}$ permutations,
(ii) $H(n, \varepsilon)$ is ε-min-wise independent,
(iii) an element of $H(n, \varepsilon)$ can be chosen uniformly at random in $O(\log(1/\varepsilon))$ time,
(iv) for $h \in H(n, \varepsilon)$ and $x, y \in [n]$, we can decide with $O(\log(1/\varepsilon))$ arithmetic operations whether $h(x) < h(y)$.

Proof. We only give a rough idea of how to prove the above properties. The results of Indyk [2] grant a family $H'$ of $\varepsilon'$-wise independent hash functions, with $\varepsilon'$ depending on $\varepsilon$ and some constant factors. It can be shown that each hash function $h' \in H'$ can be used to create an ε-min-wise independent permutation using the lexicographic order of the pairs $(h'(i), i)$ over all $i \in [n]$. Standard constructions over finite fields grant a family of hash functions satisfying conditions (i), (iii) and (iv); transforming these hash functions into permutations using the above approach grants property (ii).

3 H-random samples

We use the result of the previous lemma to obtain H-random samples. These are elements that are chosen nearly uniformly at random, while we still maintain some characteristic information about the samples. The general idea is based on the work of Datar and Muthukrishnan [4]. We consider a fixed subset $X \subseteq [n]$ and a family of permutations $H = H(n, \varepsilon)$ as stated in Lemma 1. To obtain an H-random element $s$ of $X$ we choose a permutation $h \in H$ uniformly at random and set $s = \arg\min\{h(x) \mid x \in X\}$. It is important to note that $s$ is not chosen completely uniformly at random. From the definition of ε-min-wise independence we obtain for all $x \in X$:
$$\frac{1-\varepsilon}{|X|} \le \Pr[s = x] \le \frac{1+\varepsilon}{|X|}.$$
This follows from the observation that for fixed $h$ we have $h(x) = h(y) \iff x = y$. Moreover, with $\Pr[s \in Y] = \sum_{y \in Y} \Pr[s = y]$ we can conclude that for all $Y \subseteq X$:
$$(1-\varepsilon)\frac{|Y|}{|X|} \le \Pr[s \in Y] \le (1+\varepsilon)\frac{|Y|}{|X|}. \qquad (1)$$
This gives us the opportunity to estimate the ratio $|Y|/|X|$ for a fixed $Y$: we keep computing H-random samples from $X$ and count how many of the samples are elements of $Y$. The probability that an H-random sample is an element of $Y$ is proportional to the ratio between $|Y|$ and $|X|$.

Furthermore, H-random samples can be maintained during the stream. After choosing $h \in H$ uniformly at random, we check for each new element $a$ of the stream whether $h(a) < h(s)$ holds. If so, $a$ is the new minimum of $X$ and we update $s = a$.
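The following sketch shows how such a sample is maintained over a stream. A salted built-in hash stands in for the ε-min-wise independent family $H(n, \varepsilon)$ of Lemma 1, which would require Indyk's finite-field construction; the update rule is the same either way.

```python
import random

class HRandomSample:
    """Maintains s = argmin h(x) over the stream elements seen so far.
    The random-salt hash below is only a stand-in for the
    epsilon-min-wise independent family H(n, eps) of Lemma 1."""
    def __init__(self):
        self.salt = random.getrandbits(64)   # plays the role of choosing h in H
        self.sample = None
        self.key = None

    def h(self, x):
        return hash((self.salt, x))

    def update(self, a):
        if self.key is None or self.h(a) < self.key:
            self.sample, self.key = a, self.h(a)

s = HRandomSample()
for a in [5, 12, 3, 5, 9]:   # stream over X; duplicates do not bias the sample
    s.update(a)
print(s.sample)  # a near-uniform element of {3, 5, 9, 12}
```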

We will also use H-random samples for conditional sampling, where we sample elements until we obtain an element satisfying certain properties. To analyze later results we need the following observation.

Lemma 2. Let $Y \subseteq X \subseteq [n]$ and $\varepsilon \in (0, 1/2)$. Consider a family of permutations $H = H(n, \varepsilon)$ with the properties of Lemma 1 and an H-random sample $s$ from $X$. Then for all $y \in Y$:
$$\frac{1-4\varepsilon}{|Y|} \le \Pr[s = y \mid s \in Y] \le \frac{1+4\varepsilon}{|Y|}.$$

Proof. Fix an arbitrary $y \in Y$. Since $s = y$ implies $s \in Y$, with the considerations above we observe
$$\Pr[s = y \mid s \in Y] = \frac{\Pr[s = y \text{ and } s \in Y]}{\Pr[s \in Y]} = \frac{\Pr[s = y]}{\Pr[s \in Y]} \le \frac{(1+\varepsilon)/|X|}{(1-\varepsilon)|Y|/|X|} = \frac{1+\varepsilon}{1-\varepsilon}\cdot\frac{1}{|Y|} \stackrel{(*)}{\le} (1+4\varepsilon)\frac{1}{|Y|}.$$
The inequality marked with $(*)$ follows since
$$\frac{1+\varepsilon}{1-\varepsilon} \le 1+4\varepsilon \iff 1+\varepsilon \le (1+4\varepsilon)(1-\varepsilon) = 1 + 3\varepsilon - 4\varepsilon^2 \iff 0 \le 2\varepsilon - 4\varepsilon^2$$
and $2\varepsilon - 4\varepsilon^2 = \varepsilon(2-4\varepsilon) \ge \varepsilon(2 - 4\cdot\tfrac12) = 0$. Similarly we conclude that $\Pr[s = y \mid s \in Y] \ge (1-4\varepsilon)\frac{1}{|Y|}$.

4 A 2-approximation algorithm

The goal of this section is to construct a 2-approximation algorithm for the interval selection problem using $O(\alpha(I))$ space. The algorithm maintains a set $\mathcal{W}$ which is a partition of the real line. We call the elements of $\mathcal{W}$ windows; these are intervals for which both the inclusion and the exclusion of the endpoints are allowed. More specifically, all elements of $\mathcal{W}$ are pairwise disjoint and the union of all elements of $\mathcal{W}$ is the whole of $\mathbb{R}$. We formalize this desired set and its consequential properties in the next lemma.

Lemma 3. Let $I$ be a set of intervals and let $\mathcal{W}$ be a partition of the real line with the following properties: each window of $\mathcal{W}$ contains at least one interval from $I$, and for each window $W \in \mathcal{W}$ the intervals of $I$ contained in $W$ pairwise intersect. Let $J$ be any set of intervals constructed by selecting, for each window $W$ of $\mathcal{W}$, one interval of $I$ contained in $W$. Then $|J| > \alpha(I)/2$.

Fig. 1 At the bottom is a partition of the real line; filled circles represent included endpoints, empty circles excluded endpoints. At the top we split an optimal solution $J^*$ (marked in blue) into $J'$ and $J''$.

Proof. Consider a partition $\mathcal{W}$ of the real line with the above properties. To shorten notation we set $k = |\mathcal{W}|$. Let $J^* \subseteq I$ be an optimal solution of the interval selection problem; by definition $|J^*| = \alpha(I)$. We split $J^*$ into two disjoint sets $J'$ and $J''$ (see Figure 1 for an example). $J'$ is the set of intervals of $J^*$ which are fully contained in some window of $\mathcal{W}$. By assumption all intervals contained in a window of $\mathcal{W}$ pairwise intersect, so at most one interval of $J'$ is contained in each window; since $\mathcal{W}$ has $k$ elements we obtain $|J'| \le k$. Every interval of $J^*$ which intersects at least two successive windows of $\mathcal{W}$ belongs to $J''$. Since the $k$ windows of $\mathcal{W}$ are separated by $k-1$ boundaries and the intervals of $J''$ are pairwise disjoint, $|J''| \le k-1$ holds. Since each element of $J^*$ is contained in either $J'$ or $J''$ we combine the above results to obtain
$$\alpha(I) = |J^*| = |J'| + |J''| \le k + k - 1 = 2k - 1.$$
Since $J$ is constructed by choosing one interval from each window, $|J| = k$ and we conclude that $2|J| = 2k > 2k-1 \ge \alpha(I)$, which completes the proof.

We now present the general idea of an algorithm that maintains such a partition throughout the stream; for an example of such a partition see Figure 2. The overall goal is to partition the real line while storing $\mathrm{Leftmost}(W)$ and $\mathrm{Rightmost}(W)$ for all windows $W \in \mathcal{W}$. When receiving the first input interval $I_0$ of the stream we set $\mathcal{W} = \{\mathbb{R}\}$ and $\mathrm{Leftmost}(\mathbb{R}) = \mathrm{Rightmost}(\mathbb{R}) = I_0$; at this point Lemma 3 holds. We now show that we can insert new intervals while keeping the above conditions. Let $I$ be a new interval of the stream. If $I$ is not contained in any window of $\mathcal{W}$, no update is needed: such an interval can be disregarded, because the algorithm chooses its final intervals from the set of intervals contained in a window of $\mathcal{W}$. Otherwise $I$ is contained in some window $W$ and we distinguish two cases. If $I$ intersects all intervals contained in $W$, we check whether we have to update $\mathrm{Leftmost}(W)$ or $\mathrm{Rightmost}(W)$. Otherwise we have to split the window $W$ into two windows $W_1$ and $W_2$. If both endpoints of $I$ lie to the right of $\mathrm{Leftmost}(W) \cap \mathrm{Rightmost}(W)$, we use the right endpoint of $\mathrm{Leftmost}(W)$ as the splitting value and set $W_1$ to the part containing $\mathrm{Leftmost}(W)$ and $W_2$ to the part containing $I$. If both endpoints of $I$ lie to the left of $\mathrm{Leftmost}(W) \cap \mathrm{Rightmost}(W)$, we use the same approach with $\mathrm{Rightmost}(W)$ instead of $\mathrm{Leftmost}(W)$. With these operations we ensure that our partition satisfies Lemma 3; the formal proof that these instructions maintain such a partition is a simple case distinction using inductive arguments.
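A minimal sketch of this maintenance procedure follows. It represents the partition by its sorted list of split coordinates and stores the pair $(\mathrm{Leftmost}, \mathrm{Rightmost})$ per window; as a simplification it ignores the open/closed endpoint bookkeeping and the tiebreaking rules of Definition 2.

```python
import bisect

class TwoApprox:
    """Sketch of the 2-approximation: maintain a partition of the real
    line into windows, storing Leftmost/Rightmost per window.  Windows
    are the gaps between consecutive cut points; endpoint open/closed
    bookkeeping is simplified by assuming generic interval endpoints."""
    def __init__(self):
        self.cuts = []        # sorted split coordinates
        self.info = {}        # window index -> (leftmost, rightmost)

    def _shift(self, w):      # renumber windows after inserting a cut at index w
        self.info = {(k if k <= w else k + 1): v for k, v in self.info.items()}

    def insert(self, iv):
        x, y = iv
        w = bisect.bisect_left(self.cuts, x)
        if w != bisect.bisect_left(self.cuts, y):
            return                      # iv not contained in a single window
        if w not in self.info:
            self.info[w] = (iv, iv)     # first interval seen in this window
            return
        lm, rm = self.info[w]
        if x > lm[1]:                   # iv lies right of Leftmost(W): split
            bisect.insort(self.cuts, lm[1])
            self._shift(w)
            self.info[w], self.info[w + 1] = (lm, lm), (iv, iv)
        elif y < rm[0]:                 # iv lies left of Rightmost(W): split
            bisect.insort(self.cuts, rm[0])
            self._shift(w)
            self.info[w], self.info[w + 1] = (iv, iv), (rm, rm)
        else:                           # iv intersects all intervals in W
            if iv[1] < lm[1]: lm = iv   # smaller right endpoint (tiebreakers omitted)
            if iv[0] > rm[0]: rm = iv   # larger left endpoint
            self.info[w] = (lm, rm)

    def solution(self):
        return [lm for lm, _ in self.info.values()]  # one interval per window

t = TwoApprox()
for iv in [(1, 4), (2, 6), (5, 8), (7, 9)]:
    t.insert(iv)
print(len(t.solution()))  # at least alpha(I)/2 intervals, here 2
```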

By storing $\mathcal{W}$ in a dynamic binary search tree the algorithm needs $O(|\mathcal{W}|) = O(\alpha(I))$ space, and all operations on this tree can be handled in $O(\log |\mathcal{W}|) = O(\log \alpha(I))$ time. By choosing one arbitrary input interval from each window we end up with a 2-approximate solution for the interval selection problem.

Fig. 2 Maintenance of a partition of the real line by the 2-approximation algorithm.

5 Estimating the size of an optimal solution

In this section we estimate the size of an optimal solution of the interval selection problem. To this end we split the interval $[1, n]$ into segments and apply the 2-approximation algorithm from Section 4 to each of these segments. We construct the segments in such a way that each segment contains neither too many nor too few input intervals. We start with results independent of the streaming model; after that we explain how to use them in algorithms in the streaming model.

5.1 Segments

For our overall approach we use a segment tree $T$. This is a balanced binary tree over the segments $[i, i+1)$ with $i \in [n]$. Each leaf of $T$ corresponds to a segment $[i, i+1)$ for some $i$, including the left endpoint and excluding the right endpoint. Note that the order of the leaves in the tree is the same as the order of the corresponding segments on the real line. For any inner node $v$ of $T$ the corresponding segment $S(v)$ is the (disjoint) union of the segments of $v$'s children; for the root node $r$ the corresponding segment $S(r)$ is therefore equal to $[1, n+1)$. With $\mathcal{S}$ we denote the set of segments corresponding to the nodes of $T$; since $T$ is a balanced binary tree with $n$ leaves, the size of $\mathcal{S}$ is $2n-1$. For an example see Figure 3. To refer to the parent of a segment $S \in \mathcal{S}$ with $S \ne S(r)$ we write $\pi(S)$; this is the segment corresponding to the parent node of the node $v$ for which $S(v) = S$ holds.

For the upcoming constructions we denote by $\beta(S)$ the size of the largest independent subset if we only consider input intervals which are contained in $S \in \mathcal{S}$. Analogously we use $\hat\beta(S)$ for the size of a solution computed by the 2-approximation algorithm of Section 4 applied only to input intervals which are contained in $S$. From these definitions we directly conclude that
$$\forall S \in \mathcal{S}: \quad \beta(S) \ge \hat\beta(S) \ge \beta(S)/2. \qquad (2)$$

The next lemma tells us that if we restrict the 2-approximation algorithm to segments of $\mathcal{S}$ with certain properties and apply it to the input intervals contained in them, we obtain a $(1/2 - \varepsilon)$-approximation of the size of an optimal solution.

Fig. 3 Segment tree for $n = 8$.

Lemma 4. Let $\mathcal{S}' \subseteq \mathcal{S}$ be such that (i) $S(r)$ is the disjoint union of the segments in $\mathcal{S}'$ and (ii) for each $S \in \mathcal{S}'$ it holds that $\beta(\pi(S)) \ge 2\varepsilon^{-1}\log n$. Then
$$\alpha(I) \ge \sum_{S \in \mathcal{S}'} \hat\beta(S) \ge \left(\tfrac12 - \varepsilon\right)\alpha(I).$$

Proof. Consider a set $\mathcal{S}'$ with the above properties. We can merge the solutions produced by the 2-approximation algorithm applied independently to each $S \in \mathcal{S}'$; no input interval is chosen multiple times because the segments in $\mathcal{S}'$ are disjoint. With inequality (2) the first inequality follows:
$$\alpha(I) \ge \sum_{S \in \mathcal{S}'} \beta(S) \ge \sum_{S \in \mathcal{S}'} \hat\beta(S).$$
To obtain the second inequality we look at the set $\bar{\mathcal{S}}$ of leafmost elements in the set of parents $\{\pi(S) \mid S \in \mathcal{S}'\}$: each element of $\bar{\mathcal{S}}$ is a parent of some segment in $\mathcal{S}'$ and has no proper descendant with this property. By definition, for each segment $S \in \mathcal{S}'$ there exists an $\bar S \in \bar{\mathcal{S}}$ such that the parent of $S$ is on the path $\Pi_T(\bar S)$ in $T$ from the root to $\bar S$; otherwise we would have found a leafmost parent node which is not an element of $\bar{\mathcal{S}}$. For each $\bar S \in \bar{\mathcal{S}}$, using (ii), it holds that $\beta(\bar S) \ge 2\varepsilon^{-1}\log n$.

Now we link these considerations to an optimal solution $J^* \subseteq I$ of the interval selection problem. For each segment $S \in \mathcal{S}'$, $J^*$ contains at most two intervals that intersect $S$ but are not completely contained in $S$ (at most one at each endpoint of $S$; otherwise two intervals of the optimal solution would intersect, a contradiction). Therefore we can conclude for all $S \in \mathcal{S}'$:
$$|\{J \in J^* \mid J \cap S \ne \emptyset\}| \le |\{J \in J^* \mid J \subseteq S\}| + 2 \le \beta(S) + 2. \qquad (3)$$
By definition the segments in $\bar{\mathcal{S}}$ are pairwise disjoint, as otherwise one would be a descendant of another and thus not leafmost. Hence we can join the single solutions restricted to segments of $\bar{\mathcal{S}}$ to a solution for the whole input; together with (ii) we obtain
$$|J^*| \ge \sum_{\bar S \in \bar{\mathcal{S}}} \beta(\bar S) \ge |\bar{\mathcal{S}}| \cdot 2\varepsilon^{-1}\log n. \qquad (4)$$

The maximum path length in $T$ is $\log n + 1$ since $T$ is a balanced tree. Then for all $\bar S \in \bar{\mathcal{S}}$ the path from the root to $\bar S$ has at most $\log n$ vertices, because $\bar S$ is a parent node. Each $S \in \mathcal{S}'$ has its parent on the path from the root to some $\bar S \in \bar{\mathcal{S}}$, and each node on such a path has at most two children, so at most $2\log n$ segments of $\mathcal{S}'$ can be charged to each $\bar S \in \bar{\mathcal{S}}$. Rearranging (4) to bound $|\bar{\mathcal{S}}|$ we obtain
$$|\mathcal{S}'| \le 2\log n \cdot |\bar{\mathcal{S}}| \le 2\log n \cdot \frac{|J^*|}{2\varepsilon^{-1}\log n} = \varepsilon|J^*|.$$
Since the segments of $\mathcal{S}'$ form a disjoint union of $S(r)$, we can conclude with (3) that
$$|J^*| \le \sum_{S \in \mathcal{S}'} |\{J \in J^* \mid J \cap S \ne \emptyset\}| \le \sum_{S \in \mathcal{S}'} (\beta(S) + 2) \le 2|\mathcal{S}'| + \sum_{S \in \mathcal{S}'} \beta(S) \le 2\varepsilon|J^*| + \sum_{S \in \mathcal{S}'} \beta(S).$$
Because $\hat\beta(S)$ is a 2-approximation of $\beta(S)$ and $|J^*| = \alpha(I)$, the next chain proves the second inequality of the lemma:
$$|J^*| \le 2\varepsilon|J^*| + \sum_{S \in \mathcal{S}'} \beta(S) \le 2\varepsilon|J^*| + 2\sum_{S \in \mathcal{S}'} \hat\beta(S) \iff (1-2\varepsilon)|J^*| \le 2\sum_{S \in \mathcal{S}'} \hat\beta(S) \iff \left(\tfrac12 - \varepsilon\right)|J^*| \le \sum_{S \in \mathcal{S}'} \hat\beta(S).$$

The next goal is to find a set which satisfies the properties of Lemma 4. To determine whether a segment $S$ belongs to this set $\mathcal{S}'$ we want to use only local information which does not require knowledge about other segments, in order to minimize our space requirements. The output $\hat\beta(S)$ of the 2-approximation algorithm on segments is not suitable for this task because it is possible that $\hat\beta(\pi(S)) < \hat\beta(S)$ holds for some segment $S \in \mathcal{S}\setminus\{S(r)\}$, which would cause problems in our overall construction. Instead we define another estimate which is monotone nondecreasing along paths to the root. In particular, we define for each segment $S \in \mathcal{S}$
$$\gamma(S) = |\{S' \in \mathcal{S} \mid S' \subseteq S \text{ and } \exists I \in I \text{ s.t. } I \subseteq S'\}|,$$
the number of segments of $\mathcal{S}$ that are contained in $S$ and contain at least one input interval. These segments correspond to nodes in the segment tree that are descendants of $S$ (or $S$ itself) and contain some input interval. For this estimate we can prove the following properties.

Lemma 5. For all $S \in \mathcal{S}$ we have the following properties:
(i) $\gamma(S) \le \gamma(\pi(S))$ if $S \ne S(r)$,
(ii) $\gamma(S) \le \beta(S)\log n$,
(iii) $\gamma(S) \ge \beta(S)$, and
(iv) $\gamma(S)$ can be computed in $O(\gamma(S))$ space using the portion of the stream after the first interval contained in $S$.

Proof. The first statement follows immediately from the definition of the segment tree, because each parent node contains all input intervals which are contained in its children. To prove the remaining properties we fix some $S \in \mathcal{S}$ and define $\mathcal{S}_S := \{S' \in \mathcal{S} \mid S' \subseteq S \text{ and } \exists I \in I: I \subseteq S'\}$, the set of segments contained in $S$ which themselves contain at least one input interval; these segments are associated with descendants of $S$ in the segment tree.

Let $T_S$ be the subtree with root $S$. Since $T$ is a balanced tree, $T_S$ has at most $\log n$ levels. Because $\gamma(S)$ is exactly the size of $\mathcal{S}_S$, the pigeonhole principle yields a level $L$ of $T_S$ which contains at least $\gamma(S)/\log n$ pairwise distinct elements of $\mathcal{S}_S$. All these segments are disjoint because they are on the same level of the segment tree. This means we can pick an arbitrary input interval from each of these segments to obtain an independent subset of the input intervals, resulting in $\beta(S) \ge \gamma(S)/\log n$; rearranging grants the second property.

To prove (iii) we consider an optimal solution $J^*$ constrained to $S$. Each interval $J$ of $J^*$ is contained in some segment of $\mathcal{S}_S$. Let $S(J)$ be the smallest of the segments containing $J$. Then $J$ contains the middle point of $S(J)$: otherwise $J$ would lie entirely in one half of $S(J)$, and that half would be a smaller segment containing $J$. Therefore, for distinct $J \in J^*$ the segments $S(J)$ are pairwise distinct, or else two intervals of the optimal solution would both contain the same middle point and hence intersect. Note that these segments do not have to be disjoint, as their associated nodes might be on different levels of the segment tree. Since these minimal segments are elements of $\mathcal{S}_S$ we can conclude that
$$\gamma(S) = |\mathcal{S}_S| \ge |\{S(J) \mid J \in J^*\}| = |J^*| = \beta(S).$$

For the fourth property we can use a binary search tree. For a new input interval $I$ we check whether our tree already contains the segments which are contained in $S$ and contain $I$; if not, we add those segments to the structure. Then $\gamma(S)$ corresponds to the number of stored segments and can be computed by traversing the tree. The space needed for such a tree is proportional to the number of elements stored, i.e. $O(\gamma(S))$.
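A sketch of the computation in property (iv) follows, under the assumption that segments are the dyadic half-open ranges of Section 5.1 with $n$ a power of two. For each arriving interval contained in $S$ we walk down from $S$ and record every segment that contains the interval; the `seen` set holds exactly the segments of $\mathcal{S}_S$ observed so far, i.e. $O(\gamma(S))$ space.

```python
def gamma_stream(S, intervals):
    """Sketch of Lemma 5 (iv): count the segments contained in S that
    contain at least one input interval, storing only those segments.
    Segments are half-open dyadic ranges (lo, hi); S itself is assumed
    to be one of them, e.g. the root (1, n+1) with n a power of two."""
    seen = set()                      # segments of S_S observed so far
    for x, y in intervals:
        lo, hi = S
        if not (lo <= x and y < hi):  # interval not contained in S
            continue
        while True:
            seen.add((lo, hi))        # [x, y] is contained in this segment
            if hi - lo == 1:
                break
            mid = (lo + hi) // 2      # descend into the half still containing [x, y]
            if y < mid:
                hi = mid
            elif x >= mid:
                lo = mid
            else:
                break                 # [x, y] straddles mid: no smaller segment contains it
    return len(seen)

# gamma of the root segment [1, 9) for three intervals, n = 8
print(gamma_stream((1, 9), [(1, 2), (3, 3), (6, 8)]))  # 6
```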
Equipped with this estimate we can define a certain type of segments which will help us to find a set satisfying Lemma 4.

Definition 4. A segment $S \in \mathcal{S}$ with $S \ne S(r)$ is relevant if (i) $\gamma(\pi(S)) \ge 2\varepsilon^{-1}\log^2 n$ and (ii) $1 \le \gamma(S) < 2\varepsilon^{-1}\log^2 n$. Then $\mathcal{S}_{rel} \subseteq \mathcal{S}$ denotes the set of relevant segments of $\mathcal{S}$. In case $\mathcal{S}_{rel}$ would be empty, we set $\mathcal{S}_{rel} = \{S(r)\}$.

This formalizes the idea that relevant segments contain at least one input interval but not too many, and that the parents of relevant segments contain a certain amount of input intervals. We now analyze the result of applying the 2-approximation algorithm to relevant segments; recall that $\hat\beta(S)$ denotes the size of the solution produced by the 2-approximation algorithm.

Lemma 6. It holds that
$$\alpha(I) \ge \sum_{S \in \mathcal{S}_{rel}} \hat\beta(S) \ge \left(\tfrac12 - \varepsilon\right)\alpha(I).$$

Proof. If $\gamma(S(r)) < 2\varepsilon^{-1}\log^2 n$, then since $\gamma(\cdot)$ is by Lemma 5 nondecreasing along paths from the leaves to the root in $T$, no node $S$ can have a parent for which $\gamma(\pi(S)) \ge 2\varepsilon^{-1}\log^2 n$ holds, and we set $\mathcal{S}_{rel} = \{S(r)\}$. Then the above inequality follows directly from the fact that $\hat\beta(S(r))$ is the output of a 2-approximation algorithm applied to the whole input.

Therefore we can assume that $\gamma(S(r)) \ge 2\varepsilon^{-1}\log^2 n$; then the root node of $T$ is not an element of $\mathcal{S}_{rel}$. Define
$$\mathcal{S}_0 = \{S \in \mathcal{S}\setminus\{S(r)\} \mid \gamma(S) = 0 \text{ and } \gamma(\pi(S)) \ge 2\varepsilon^{-1}\log^2 n\}.$$
Let $S$ be the first node on a path from the root to a leaf for which $\gamma(\pi(S)) \ge 2\varepsilon^{-1}\log^2 n$ and $\gamma(S) < 2\varepsilon^{-1}\log^2 n$ holds. In case it contains any input interval it follows that $S \in \mathcal{S}_{rel}$, and $S \in \mathcal{S}_0$ otherwise. Therefore $\mathcal{S}_{rel} \cup \mathcal{S}_0$ forms a disjoint union of $S(r)$. By rearranging Lemma 5 (ii) to bound $\beta$, using the definition of relevant segments and the assumption $\gamma(S(r)) \ge 2\varepsilon^{-1}\log^2 n$, we obtain $\beta(\pi(S)) \ge \gamma(\pi(S))/\log n \ge 2\varepsilon^{-1}\log n$ for all $S \in \mathcal{S}_{rel} \cup \mathcal{S}_0$, implying that $\mathcal{S}' = \mathcal{S}_{rel} \cup \mathcal{S}_0$ satisfies Lemma 4. Because no $S \in \mathcal{S}_0$ contains any input interval, it holds that $\gamma(S) = \hat\beta(S) = 0$ for those segments, so they contribute nothing to the sum, which grants the above statement.

Fig. 4 Active segments (dotted, marked in red) in a segment tree, caused by an interval $I$.

Another important type of segments we need for our overall estimate are active segments.

Definition 5. A segment $S \in \mathcal{S}$ is active if $S = S(r)$ or its parent $\pi(S)$ contains some input interval.

For an example of active segments see Figure 4. Note that every relevant segment is also an active segment, because relevant segments contain at least one input interval. Later we will use H-random samples to estimate the ratio between relevant and active segments.

5.2 Algorithms in the Streaming Model

In this section we use the findings of the previous section to construct algorithms estimating the number of active segments, the ratio between active and relevant segments, and the average value of $\hat\beta(S)$ over the relevant segments. Putting these estimates together we obtain an estimate for the size of an optimal solution.

To provide an estimate for the number of active segments we denote by $\sigma_{\mathcal{S}}(I)$ the sequence of segments that become active because of the input interval $I$, ordered nonincreasingly by segment size. The sequence thus starts with the root node, followed by the nodes whose parents contain $I$; since the segment of a parent node is bigger than those of its children, parent nodes appear before their children in $\sigma_{\mathcal{S}}(I)$. The length of the sequence is bounded by $2\log n + 1$: the nodes whose segments contain $I$ form a path in the tree, all children of these nodes are active by definition, and since $T$ is balanced this gives, together with the root node, at most $2\log n + 1$ active segments per interval. Note that we do not need to store the whole segment tree to compute $\sigma_{\mathcal{S}}(I)$, because the segments associated to nodes are independent of the input intervals. We can therefore calculate the segments associated to nodes of the segment tree on the fly and output the at most $2\log n + 1$ active segments using only $O(\log n)$ space, as in the sketch below. With this we can obtain an estimate for the number of active segments $N_{act}$.
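The following sketch computes $\sigma_{\mathcal{S}}(I)$ on the fly for the dyadic segment tree over $[1, n+1)$, with $n$ assumed to be a power of two: it outputs the root and the children of every node whose segment contains $I$, using $O(\log n)$ space.

```python
def active_segments(n, interval):
    """Sketch: the sequence sigma_S(I) of segments activated by I,
    computed on the fly.  A segment is active if it is the root or its
    parent contains I; the nodes containing I form the descent path
    from the root towards I."""
    x, y = interval
    out = [(1, n + 1)]                # the root is always active
    lo, hi = 1, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        out.append((lo, mid))         # children of a node containing I
        out.append((mid, hi))         # are active by definition
        if y < mid:
            lo, hi = lo, mid          # I lies in the left half
        elif x >= mid:
            lo, hi = mid, hi          # I lies in the right half
        else:
            break                     # I straddles mid: descent stops
    return out

print(active_segments(8, (3, 3)))
# [(1, 9), (1, 5), (5, 9), (1, 3), (3, 5), (3, 4), (4, 5)] -- 2 log n + 1 segments,
# ordered nonincreasingly by size, as required for sigma_S(I)
```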

Lemma 7. There is an algorithm in the data stream model that in $O(\varepsilon^{-2} + \log n)$ space computes a value $\hat N_{act}$ such that $\Pr[|N_{act} - \hat N_{act}| \le \varepsilon N_{act}] \ge 11/12$.

Proof. The stream of input intervals $I_1, I_2, \dots$ defines the stream of active segments $\sigma = \sigma_{\mathcal{S}}(I_1), \sigma_{\mathcal{S}}(I_2), \dots$, where $\mathcal{S}$ is the set of segments of the balanced segment tree over $[n]$. Recall that this stream is $O(\log n)$ times longer than the input stream. If we count the distinct elements in the stream $\sigma$ we obtain the number of active segments; we have therefore reduced our problem to counting the distinct elements of the stream $\sigma$. For this we can use the results of Kane, Nelson and Woodruff [3]: their algorithm grants a $(1\pm\varepsilon)$-approximation for the distinct elements problem using $O(\varepsilon^{-2} + \log|\mathcal{S}|) = O(\varepsilon^{-2} + \log n)$ space with success probability 2/3, and the space needed is proven to match the optimal bound up to constant factors. The success probability can be amplified to $1 - 1/12 = 11/12$ by a standard technique: run a constant number of instances of the algorithm in parallel and return the median of their estimates. The general idea of their algorithm is to run several constant-factor approximations in parallel and to keep several counters updated to compensate for the constant factors. With the help of their algorithm we obtain a value $\hat N_{act}$ with the stated property.

Given the estimate for the number of active segments, we can use it to compute an estimate for the number of relevant segments, which form a subset of the active segments.

Lemma 8. There is an algorithm in the data stream model that uses $O(\varepsilon^{-4}\log^4 n)$ space and computes a value $\hat N_{rel}$ such that $\Pr[|N_{rel} - \hat N_{rel}| \le \varepsilon N_{rel}] \ge 10/12$.

Proof. First we estimate the number of active segments using the algorithm from Lemma 7. Using H-random samples we will then be able to estimate the ratio between active and relevant segments, which can be turned into an estimate for $N_{rel} = (N_{rel}/N_{act})\cdot N_{act}$.

To ensure that the samples we take later are representative, we need a lower bound on $N_{rel}/N_{act}$. In our segment tree $T$, each relevant segment $S' \in \mathcal{S}_{rel}$ has at most $2\gamma(S') < 4\varepsilon^{-1}\log^2 n$ active segments below it: every active segment below $S'$ is a child of a segment containing an input interval, each such segment contributes at most two children, and by the definition of $\gamma(\cdot)$ and of relevant segments there are fewer than $2\varepsilon^{-1}\log^2 n$ such segments below $S'$. Also there are at most $2\log n$ active segments whose parent is an ancestor of $S'$; this upper bound is attained exactly when $S'$ is a leaf of $T$. Charging every active segment to a relevant segment in this way, each relevant segment is charged at most
$$4\varepsilon^{-1}\log^2 n + 2\log n \le 6\varepsilon^{-1}\log^2 n$$
active segments. Using this upper bound we conclude that
$$\frac{N_{rel}}{N_{act}} \ge \frac{1}{6\varepsilon^{-1}\log^2 n} = \frac{\varepsilon}{6\log^2 n}, \qquad (5)$$
which we will use later.

Now consider an arbitrary injective mapping $b$ between $\mathcal{S}$ and $[n^2]$ that can be easily computed; we use $n^2$ because $\mathcal{S}$ contains at most $2n-1$ elements, so an injective function exists. Lemma 1 gives us a family $H = H(n^2, \varepsilon)$ of ε-min-wise independent permutations. For a fixed $h \in H$, the concatenation $h \circ b$ defines an order among the elements of $\mathcal{S}$; this ordering is used to compute H-random samples. We choose $k = 72\log^2 n/(\varepsilon^3(1-\varepsilon)) \in \Theta(\varepsilon^{-3}\log^2 n)$ and pick $k$ permutations $h_1, \dots, h_k \in H$ uniformly and independently at random; we will see shortly why $k$ was chosen in this specific way. For each permutation $h_j$ with $j = 1, \dots, k$ our H-random sample $S_j$ is defined as
$$S_j = \arg\min\{h_j(b(S)) \mid S \in \mathcal{S} \text{ is active}\},$$
i.e. $S_j$ is the active segment of $\mathcal{S}$ which minimizes $(h_j \circ b)(\cdot)$. Then $S_j$ is approximately a random active segment of $\mathcal{S}$; it is not completely uniformly random, because we keep just enough structure to apply the properties of ε-min-wise independent permutations.

By defining the random variable
$$X = |\{j \in \{1, \dots, k\} \mid S_j \text{ is relevant}\}|$$
we have formalized the number of H-random samples which are relevant (we defer the discussion of how $X$ can be computed to the end of the proof). Then $N_{rel}/N_{act}$ is of the order of $X/k$. To bound $X$ we define
$$p = \Pr_{h_j \in H}[S_j \text{ is relevant}]$$
and recall that every relevant segment is also an active segment. We can then apply inequality (1) from Section 3 with $Y = \mathcal{S}_{rel}$ and $X = \mathcal{S}_{act}$ to obtain
$$(1-\varepsilon)\frac{N_{rel}}{N_{act}} \le p \le (1+\varepsilon)\frac{N_{rel}}{N_{act}}. \qquad (6)$$
By the definition of $k$, the lower bound on $p$ and the lower bound in (5) it holds that
$$kp \ge \frac{72\log^2 n}{\varepsilon^3(1-\varepsilon)}\cdot(1-\varepsilon)\frac{N_{rel}}{N_{act}} \ge \frac{72\log^2 n}{\varepsilon^3}\cdot\frac{\varepsilon}{6\log^2 n} = \frac{12}{\varepsilon^2}. \qquad (7)$$
Each indicator of the event that $S_j$ is relevant is a binary random variable, so $X$ is the sum of $k$ independent random variables and $E[X] = kp$. Using Chebyshev's inequality, the lower bound for $kp$ in (7) and the fact that $\mathrm{Var}[X] = kp(1-p)$ for sums of $k$ independent indicators, it holds that
$$\Pr\left[\left|\frac{X}{k} - p\right| \ge \varepsilon p\right] = \Pr[|X - kp| \ge \varepsilon kp] \le \frac{\mathrm{Var}[X]}{(\varepsilon kp)^2} = \frac{kp(1-p)}{(\varepsilon kp)^2} \le \frac{1}{kp\varepsilon^2} \le \frac{1}{12}. \qquad (8)$$

This allows us to formalize our final estimator. We use the estimator $\hat N_{act}$ for the number of active segments as stated in Lemma 7 and define $\hat N_{rel} = \hat N_{act}\cdot(X/k)$. For now assume that the estimator of $N_{act}$ is successful and that $X/k$ lies within an ε-range around $p$; formally, that both
$$\left[|N_{act} - \hat N_{act}| \le \varepsilon N_{act}\right] \quad\text{and}\quad \left[\left|\frac{X}{k} - p\right| \le \varepsilon p\right]$$
hold. We use this together with the upper bound from (6) to bound our estimate $\hat N_{rel}$:
$$\hat N_{rel} = \hat N_{act}\cdot\frac{X}{k} \le (1+\varepsilon)N_{act}\cdot(1+\varepsilon)p \le (1+\varepsilon)^2 N_{act}\cdot(1+\varepsilon)\frac{N_{rel}}{N_{act}} = (1+\varepsilon)^3 N_{rel}.$$
With $\varepsilon < 1/2$ it holds that $(1+\varepsilon)^3 \le 1+7\varepsilon$, which grants $\hat N_{rel} \le (1+7\varepsilon)N_{rel}$. Analogously, using the lower bounds, we conclude that $\hat N_{rel} \ge (1-7\varepsilon)N_{rel}$.

It remains to analyze the overall success probability, which we do via the probability that the estimates for $\hat N_{act}$ and $p$ fail:
$$\Pr[\hat N_{rel} = (1\pm 7\varepsilon)N_{rel}] \ge 1 - \Pr[|N_{act} - \hat N_{act}| > \varepsilon N_{act}] - \Pr\left[\left|\frac{X}{k} - p\right| > \varepsilon p\right] \ge 1 - \frac{1}{12} - \frac{1}{12} = \frac{10}{12}.$$
We can rescale $\varepsilon$ by a factor $1/7$ to obtain the claimed bound; this is a valid operation because $\varepsilon$ stays within the range $(0, 1/2)$.

It is left to analyze how $X$ can be computed and the space needed for the estimate. For each $j \in [k]$ we store the H-random element $S_j$ over all active segments seen so far using $h_j$. Moreover, we store information about the choice of $h_j$ and about $\gamma(S_j)$ and $\gamma(\pi(S_j))$, so that we can decide whether $S_j$ is relevant. Let $I_1, I_2, \dots$ be the stream of input intervals and $\sigma = \sigma_{\mathcal{S}}(I_1), \sigma_{\mathcal{S}}(I_2), \dots$ the stream of active segments described earlier.

When given a segment $S$ of $\sigma$ we have to update $S_j$ if $h_j(b(S)) < h_j(b(S_j))$. Since the segments of $\sigma_{\mathcal{S}}(I)$ are ordered nonincreasingly in size, we can keep $\gamma(\pi(S_j))$ updated: $S_j$ becomes active for the first time when its parent contains some input interval, so from then on $\gamma(\pi(S_j)) > 0$ and the following parts of $\sigma$ can be used to compute $\gamma(S_j)$ and $\gamma(\pi(S_j))$ using Lemma 5 (iv). To stay within the desired space bounds we use the following trick. If at some point $\gamma(S_j) \ge 2\varepsilon^{-1}\log^2 n$ violates the condition for a relevant segment, we only store the fact that $S_j$ is not relevant and nothing more; no later arriving input interval could possibly change this. Similarly, if $\gamma(\pi(S_j)) \ge 2\varepsilon^{-1}\log^2 n$ we only store that this condition for relevance of $S_j$ is satisfied; maintaining a counter beyond this threshold is not necessary. Then for each $j$ we need $O(\varepsilon^{-1}\log^2 n)$ space, and with the definition of $k$ we obtain a space bound of at most $O(k\varepsilon^{-1}\log^2 n) = O(\varepsilon^{-4}\log^4 n)$.
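The sampling core of this proof — estimating the fraction of relevant segments by $k$ independent min-hash samples — can be simulated as follows. The salted hash again stands in for the family $H(n^2, \varepsilon)$, and an arbitrary predicate stands in for the relevance test, which the real algorithm decides from the lazily maintained counters $\gamma(S_j)$ and $\gamma(\pi(S_j))$. Note that taking the minimum over the stream automatically samples over distinct elements.

```python
import random

def estimate_ratio(stream, is_relevant, k=1000):
    """Sketch of the sampling step in Lemma 8: k independent min-hash
    samples over the stream elements; X/k estimates the fraction of
    distinct elements satisfying the predicate."""
    salts = [random.getrandbits(64) for _ in range(k)]   # k choices of h_j
    best = [None] * k                                    # current S_j per j
    for s in stream:
        for j, salt in enumerate(salts):
            if best[j] is None or hash((salt, s)) < hash((salt, best[j])):
                best[j] = s
    X = sum(1 for s in best if is_relevant(s))
    return X / k

segments = list(range(600))                  # stand-in for the active segments
print(estimate_ratio(segments, lambda s: s < 100))  # close to N_rel/N_act = 1/6
```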

The last ingredient for our final result concerns the average value of the 2-approximation over the relevant segments. Specifically, we define
$$\rho = \Big(\sum_{S \in \mathcal{S}_{rel}} \hat\beta(S)\Big)\Big/\,|\mathcal{S}_{rel}|.$$

Lemma 9. There is an algorithm in the data stream model that uses $O(\varepsilon^{-5}\log^6 n)$ space and computes a value $\hat\rho$ such that $\Pr[|\rho - \hat\rho| \le \varepsilon\rho] \ge 10/12$.

Proof. To estimate $\rho$ we use conditional sampling: we compute H-random samples until we get a sample satisfying a certain condition. As in the previous lemma let $b$ be an arbitrary injective mapping between $\mathcal{S}$ and $[n^2]$, and consider a family $H = H(n^2, \varepsilon)$ from Lemma 1; then $h \circ b$ defines an order among the elements of $\mathcal{S}$. In the following, $\mathcal{S}_{act}$ denotes the set of active segments. We repeatedly sample $h \in H$ uniformly at random until $S_1 = \arg\min_{S \in \mathcal{S}_{act}} h(b(S))$ is a relevant segment, and then set the random variable $Y_1 = \hat\beta(S_1)$. Using Lemma 2 with $X = \mathcal{S}_{act}$ and $Y = \mathcal{S}_{rel}$ we obtain
$$\forall S \in \mathcal{S}_{rel}: \quad \frac{1-4\varepsilon}{|\mathcal{S}_{rel}|} \le \Pr[S_1 = S] \le \frac{1+4\varepsilon}{|\mathcal{S}_{rel}|}.$$
Using the upper bound and the definition of $\rho$, it follows from the definition of the expected value that $E[Y_1] \le (1+4\varepsilon)\rho$; similarly, the lower bound gives $E[Y_1] \ge (1-4\varepsilon)\rho$. With Lemma 5 (iii) we have $\hat\beta(S) \le \beta(S) \le \gamma(S)$. Using the definition of relevant segments and of $\rho$ we obtain an upper bound for the variance of $Y_1$:
$$\mathrm{Var}[Y_1] = E[Y_1^2] - (E[Y_1])^2 \le E[Y_1^2] = \sum_{S \in \mathcal{S}_{rel}} \Pr[S_1 = S]\,\hat\beta(S)^2 \le \frac{1+4\varepsilon}{|\mathcal{S}_{rel}|}\cdot 2\varepsilon^{-1}\log^2 n \sum_{S \in \mathcal{S}_{rel}} \hat\beta(S) \le \frac{6\log^2 n}{\varepsilon}\,\rho.$$
Since $\gamma(S) \ge 1$ means that $S$ contains at least one input interval, the 2-approximation algorithm constrained to $S$ also chooses at least one input interval; therefore $\hat\beta(S) \ge 1$ for all relevant segments, which leads to $\rho \ge 1$.

Consider some integer $\bar k$, to be chosen later, which is the number of relevant samples we use to estimate $\rho$. Define $Y_2, \dots, Y_{\bar k}$ as independent random variables with the same distribution as $Y_1$ and our estimate $\hat\rho$ as the average over those random variables: $\hat\rho = (\sum_{i=1}^{\bar k} Y_i)/\bar k$. With Chebyshev's inequality and $\rho \ge 1$ it holds that
$$\Pr[|\hat\rho - E[Y_1]| \ge \varepsilon\rho] = \Pr[|\hat\rho\bar k - E[Y_1]\bar k| \ge \varepsilon\bar k\rho] \le \frac{\mathrm{Var}[\hat\rho\bar k]}{(\varepsilon\bar k\rho)^2} \le \frac{6\log^2 n}{\bar k\varepsilon^3}.$$
By setting $\bar k = 72\log^2 n/\varepsilon^3$ we get $\Pr[|\hat\rho - E[Y_1]| \ge \varepsilon\rho] \le 1/12$.

With this as a basis we apply the same approach as in Lemma 8. First we set $k_0 = 12\log^2 n\,\bar k/(\varepsilon(1-\varepsilon)) \in \Theta(\varepsilon^{-4}\log^4 n)$. Then for each $j \in [k_0]$ we compute an H-random sample by choosing $h_j \in H$ uniformly at random and setting $S_j = \arg\min\{h_j(b(S)) \mid S \in \mathcal{S} \text{ is active}\}$. For the further analysis let $X$ denote the number of relevant segments among $S_1, \dots, S_{k_0}$ and $p = \Pr[S_1 \in \mathcal{S}_{rel}]$. Using the lower bound of inequality (6) and the bound (5) on the ratio between relevant and active segments from Lemma 8, we get
$$k_0 p \ge \frac{12\log^2 n\,\bar k}{\varepsilon(1-\varepsilon)}\cdot(1-\varepsilon)\frac{N_{rel}}{N_{act}} \ge \frac{12\log^2 n\,\bar k}{\varepsilon}\cdot\frac{\varepsilon}{6\log^2 n} = 2\bar k.$$
This result, Chebyshev's inequality and the fact that $\mathrm{Var}[X] = k_0 p(1-p)$ (as $X$ is a sum of independent binary variables) lead to
$$\Pr[|X - k_0 p| \ge k_0 p/2] \le \frac{\mathrm{Var}[X]}{(k_0 p/2)^2} = \frac{4k_0 p(1-p)}{k_0^2 p^2} < \frac{4}{k_0 p} \le \frac{4}{2\bar k} \le \frac{1}{12},$$
where we also used that $p > 0$ and $k_0 p \ge 2\bar k$. This implies that with probability at least $11/12$ the samples $S_1, \dots, S_{k_0}$ contain at least $(1/2)k_0 p \ge \bar k$ relevant segments, and with the first $\bar k$ relevant segments among them we are able to compute the estimate $\hat\rho$ defined above. As before we can use the failure probabilities for $X$ and $\hat\rho$ to obtain probability $1 - 1/12 - 1/12 = 10/12$ that both $[|X - k_0 p| \le k_0 p/2]$ and $[|\hat\rho - E[Y_1]| \le \varepsilon\rho]$ hold. In case of success we get, using the upper bounds and our results for the expected value of $Y_1$,
$$\hat\rho \le \varepsilon\rho + E[Y_1] \le \varepsilon\rho + (1+4\varepsilon)\rho = (1+5\varepsilon)\rho,$$
and analogously, using the lower bounds, $\hat\rho \ge (1-5\varepsilon)\rho$. Combining these two inequalities and the above success probability grants
$$\Pr[|\rho - \hat\rho| \le 5\varepsilon\rho] \ge 10/12.$$
As in the previous lemma we can rescale $\varepsilon$, this time by the factor $1/5$. The argument for the space bound also stays essentially the same: for each $j \in [k_0]$ we keep information about $h_j$, $\gamma(S_j)$, $\gamma(\pi(S_j))$ and $\hat\beta(S_j)$. As discussed before, $\hat\beta(S_j) \le \beta(S_j) \le \gamma(S_j)$ holds, so the space bound per index $j$ is the same as in Lemma 8, namely $O(\varepsilon^{-1}\log^2 n)$. Because we have $k_0$ indices we need $O(k_0\varepsilon^{-1}\log^2 n) = O(\varepsilon^{-5}\log^6 n)$ space in total.

Now we can combine all the algorithms presented so far into one algorithm that computes an estimate for the size of an optimal solution of the interval selection problem.

Theorem 1. Let $\varepsilon \in (0, 1/2)$ and let $I$ be a set of intervals with endpoints in $[n]$ that arrive in a data stream. There is an algorithm that uses $O(\varepsilon^{-5}\log^6 n)$ space and computes a value $\hat\alpha$ such that
$$\Pr[(1/2 - \varepsilon)\,\alpha(I) \le \hat\alpha \le \alpha(I)] \ge 2/3.$$

Proof. We start by estimating $N_{rel}$ and $\rho$ using Lemma 8 and Lemma 9, respectively, obtaining the estimates $\hat N_{rel}$ and $\hat\rho$. We combine them into the single estimate $\hat\alpha_0 = \hat N_{rel}\cdot\hat\rho$, which we now investigate further. The success probability of $\hat\alpha_0$ is at least $1 - 2/12 - 2/12 = 2/3$, using that the failure probabilities of $\hat N_{rel}$ and $\hat\rho$ are each at most $2/12$. In case of success both $[|N_{rel} - \hat N_{rel}| \le \varepsilon N_{rel}]$ and $[|\rho - \hat\rho| \le \varepsilon\rho]$ hold.

Using the definitions of $N_{rel}$ and $\rho$, the fact that $N_{rel} = |\mathcal{S}_{rel}|$, and Lemma 6, we can show that
$$\hat\alpha_0 \le (1+\varepsilon)N_{rel}\cdot(1+\varepsilon)\rho = (1+\varepsilon)^2 N_{rel}\Big(\sum_{S \in \mathcal{S}_{rel}} \hat\beta(S)\Big)\Big/\,|\mathcal{S}_{rel}| = (1+\varepsilon)^2 \sum_{S \in \mathcal{S}_{rel}} \hat\beta(S) \le (1+\varepsilon)^2\,\alpha(I).$$
Similarly one can show that $\hat\alpha_0 \ge (1-\varepsilon)^2(\tfrac12 - \varepsilon)\,\alpha(I)$ holds, using the lower bounds of $\hat N_{rel}$ and $\hat\rho$. Combining these two results leads to
$$\Pr\left[(1-\varepsilon)^2\left(\tfrac12 - \varepsilon\right)\alpha(I) \le \hat\alpha_0 \le (1+\varepsilon)^2\,\alpha(I)\right] \ge \tfrac23.$$
To obtain the final result we use that
$$\frac{(1-\varepsilon)^2}{(1+\varepsilon)^2}\left(\tfrac12 - \varepsilon\right) \ge \tfrac12 - 3\varepsilon \qquad \text{for all } \varepsilon \in (0, 1/2).$$
This holds because
$$\frac{(1-\varepsilon)^2}{(1+\varepsilon)^2}\left(\tfrac12 - \varepsilon\right) = \left(1 - \frac{4\varepsilon}{(1+\varepsilon)^2}\right)\left(\tfrac12 - \varepsilon\right) = \tfrac12 - \varepsilon - \frac{4\varepsilon}{(1+\varepsilon)^2}\left(\tfrac12 - \varepsilon\right) \stackrel{(*)}{\ge} \tfrac12 - \varepsilon - 2\varepsilon = \tfrac12 - 3\varepsilon,$$
with $(*)$ using
$$\frac{4\varepsilon}{(1+\varepsilon)^2}\left(\tfrac12 - \varepsilon\right) = \frac{2\varepsilon(1-2\varepsilon)}{(1+\varepsilon)^2} \le 2\varepsilon(1-2\varepsilon) \le 2\varepsilon.$$
If we now set $\hat\alpha = \hat\alpha_0/(1+\varepsilon)^2$ to avoid overestimation, then in case of success $\hat\alpha \le \alpha(I)$ and $\hat\alpha \ge (\tfrac12 - 3\varepsilon)\,\alpha(I)$; rescaling $\varepsilon$ by the constant factor $1/3$ gives the claimed bounds. The space needed for this algorithm is precisely the space needed for the two estimates $\hat N_{rel}$ and $\hat\rho$ from Lemma 8 and Lemma 9, which completes the proof.

6 Same-size intervals

In this section we show how the results presented so far can be improved if we assume that all input intervals have the same length $\lambda > 0$.

6.1 Largest independent set of same-size intervals

The first approach is again to compute an approximation of a largest independent set. Using the shifting technique of Hochbaum and Maass [6] we obtain a $(3/2)$-approximation using $O(\alpha(I))$ space. We maintain partitions of the real line into windows of length $3\lambda$. For $l \in \mathbb{R}$ we define the window $W_l = [l, l+3\lambda)$, including the left endpoint and excluding the right endpoint. For $a \in \{0, 1, 2\}$ we define
$$\mathcal{W}_a = \{W_{(a+3j)\lambda} \mid j \in \mathbb{Z}\};$$
note that each $\mathcal{W}_a$ is a partition of the real line. Furthermore we define $I_a$, for $a \in \{0, 1, 2\}$, as the set of input intervals that are contained in some window of $\mathcal{W}_a$; formally,
$$I_a = \{I \in I \mid \exists j \in \mathbb{Z}: I \subseteq W_{(a+3j)\lambda}\}.$$
Each interval of length $\lambda$ is contained in windows of exactly two of the three partitions $\mathcal{W}_0, \mathcal{W}_1, \mathcal{W}_2$. It can then be shown that
$$\max\{\alpha(I_0), \alpha(I_1), \alpha(I_2)\} \ge \tfrac23\,\alpha(I),$$
where $\alpha(I_a)$ denotes the size of an optimal solution restricted to the intervals contained in $I_a$; indeed, each interval of an optimal solution is counted in two of the three restricted problems, so the three restricted optima sum to at least $2\alpha(I)$. Because at most two disjoint intervals of length $\lambda$ fit in a window of length $3\lambda$, we are able to compute and store an optimal solution $J_a$ restricted to the input intervals of $I_a$, for each $a \in \{0, 1, 2\}$. By returning the largest of $J_0$, $J_1$ and $J_2$ we obtain a $(3/2)$-approximation.

We now describe how an algorithm can maintain these solutions $J_a$ throughout the stream; a sketch follows below. We use the same approach as the algorithm in Section 4: we store $\mathrm{Leftmost}(W)$ and $\mathrm{Rightmost}(W)$ for each window $W \in \mathcal{W}_a$ with $a \in \{0, 1, 2\}$. In addition we store a boolean value $\mathrm{active}(W)$ indicating whether some interval earlier in the stream is contained in $W$. If $\mathrm{active}(W) = \text{false}$, $W$ does not contain any input interval and we declare $\mathrm{Leftmost}(W)$ and $\mathrm{Rightmost}(W)$ undefined. When receiving a new interval $I$ of the stream we look at the windows $W \in \mathcal{W}_a$ containing $I$, for each $a \in \{0, 1, 2\}$. If $W$ is not active, we set $\mathrm{active}(W) = \text{true}$ and add the input interval to $J_a$. If $W$ is already active and its intervals so far pairwise intersect, we check whether there is an interval of $W$ that is disjoint from $I$; if so, these two intervals are added to $J_a$ in place of the single interval chosen for $W$ before. In the other cases there is nothing to do. Following these instructions we indeed maintain an optimal solution $J_a$ restricted to the intervals of $I_a$. By using a binary search tree to store the at most $O(\alpha(I))$ active windows we can execute all necessary operations in $O(\log \alpha(I))$ time and $O(\alpha(I))$ space. This grants a $(3/2)$-approximation to the largest independent set.
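A sketch of this procedure follows; the window length $3\lambda$ and the three shifted partitions are as above, while the constant LAM and the interval values are hypothetical test data. Tiebreaking and the full Leftmost/Rightmost bookkeeping are simplified.

```python
import math

LAM = 2.0   # common interval length lambda (hypothetical value)

def window_index(x, a):
    """Index j of the window [(a+3j)*LAM, (a+3j+3)*LAM) of the
    partition W_a that contains the point x."""
    return math.floor((x / LAM - a) / 3)

def solve_same_size(stream):
    """Sketch of the (3/2)-approximation for same-size intervals:
    per shift a and window, keep Leftmost/Rightmost and, once found,
    a pair of disjoint intervals contained in the window."""
    state = [dict() for _ in range(3)]   # per shift: window j -> [lm, rm, pair]
    for x, y in stream:                  # y = x + LAM
        for a in range(3):
            j = window_index(x, a)
            if j != window_index(y, a):  # not contained in one window of W_a
                continue
            if j not in state[a]:
                state[a][j] = [(x, y), (x, y), None]   # window becomes active
                continue
            lm, rm, pair = state[a][j]
            if pair is None and x > lm[1]:   # disjoint from Leftmost: type 2
                state[a][j][2] = (lm, (x, y))
            elif pair is None and y < rm[0]:
                state[a][j][2] = ((x, y), rm)
            else:                            # update Leftmost / Rightmost
                if y < lm[1]: state[a][j][0] = (x, y)
                if x > rm[0]: state[a][j][1] = (x, y)
    best = []
    for a in range(3):
        sol = []
        for lm, rm, pair in state[a].values():
            sol.extend(pair if pair else [lm])   # two intervals or one per window
        if len(sol) > len(best):
            best = sol
    return best

print(solve_same_size([(0.5, 2.5), (3.0, 5.0), (6.5, 8.5), (9.0, 11.0)]))
```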

6.2 Size of largest independent set of same-size intervals

To estimate the size of an optimal solution when all intervals are of the same size, we use H-random samples again. First we show how to estimate a solution constrained to $I_a$ for $a \in \{0, 1, 2\}$; then we use this result to get a final estimate.

Lemma 10. Let $a \in \{0, 1, 2\}$ and $\varepsilon \in (0, 1)$. There is an algorithm in the data stream model that in $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space computes a value $\hat\alpha_a$ such that $\Pr[|\alpha(I_a) - \hat\alpha_a| \le \varepsilon\,\alpha(I_a)] \ge 8/9$.

Proof. Fix some $a \in \{0, 1, 2\}$. We define the type $i$ of a window $W$ of $\mathcal{W}_a$ as the maximum number of disjoint input intervals contained in $W$. Since each window can contain at most two disjoint intervals, a window is of type 0, 1 or 2. By $\gamma_i$, for $i = 0, 1, 2$, we denote the number of windows of type $i$ in $\mathcal{W}_a$. Then $\alpha(I_a) = \gamma_1 + 2\gamma_2$, since windows counted by $\gamma_0$ contain no input interval, type-1 windows contribute one interval each and type-2 windows contribute two. As in Section 5 we use a distinct-elements estimator for the number $\gamma_1 + \gamma_2$ of nonempty windows and H-random samples for the ratio $\gamma_2/(\gamma_1 + \gamma_2)$.

First we describe how to estimate $\gamma_1 + \gamma_2$. Given the stream of input intervals $I = I_1, I_2, \dots$ we can compute the sequence of windows $W(I) = W(I_1), W(I_2), \dots$, with $W(I_i)$ denoting the window of $\mathcal{W}_a$ that contains $I_i$; if such a window does not exist we skip $I_i$. Then $\gamma_1 + \gamma_2$ is the number of distinct elements in $W(I)$. Again using the results of Kane, Nelson and Woodruff [3] we can compute an estimate $\hat\gamma$ for which $\Pr[(1-\varepsilon)(\gamma_1+\gamma_2) \le \hat\gamma \le (1+\varepsilon)(\gamma_1+\gamma_2)] \ge 17/18$ holds, using $O(\varepsilon^{-2} + \log n)$ space.

Next we estimate the ratio $\gamma_2/(\gamma_1+\gamma_2)$. For this we use H-random samples, very similarly to the approach in Lemma 8. Take a family $H = H(n, \varepsilon)$ of permutations $[n] \to [n]$ as stated in Lemma 1. We choose $k = 18\varepsilon^{-2} \in \Theta(\varepsilon^{-2})$ and pick $h_1, \dots, h_k \in H$ uniformly and independently at random. For each $h_j$ with $j \in [k]$ let $W_j$ be the window $[l, l+3\lambda)$ of $\mathcal{W}_a$ that contains at least one input interval and minimizes $h_j(l)$:
$$W_j = \arg\min\{h_j(l) \mid [l, l+3\lambda) \in \mathcal{W}_a \text{ and } \exists I \in I: I \subseteq [l, l+3\lambda)\}.$$
Then $W_j$ is a nearly uniformly random window among the windows of $\mathcal{W}_a$ which contain at least one input interval. We define the random variable
$$M = |\{j \in [k] \mid W_j \text{ is of type } 2\}|.$$
By the considerations above, $M\hat\gamma/k$ is roughly $\gamma_2$. Applying Chebyshev's inequality to $M$ and using the choice of $k$, it is possible to show that $M\hat\gamma/k = \gamma_2 \pm \varepsilon(\gamma_1+\gamma_2)$ with probability at least $17/18$. Combining both results we output the desired estimate $\hat\alpha_a = \hat\gamma(1 + M/k)$. Note that $M$ can be computed in $O(\varepsilon^{-2}\log(1/\varepsilon))$ space by keeping information about $h_j$ and the current window $W_j$ for each index $j$; we also store $\mathrm{Leftmost}(W_j)$ and $\mathrm{Rightmost}(W_j)$ to decide whether $W_j$ is of type 1 or 2.

Theorem 2. Let $\varepsilon \in (0, 1/2)$ and let $I$ be a set of intervals of length $\lambda$ with endpoints in $[n]$ that arrive in a data stream. There is an algorithm that uses $O(\varepsilon^{-2}\log(1/\varepsilon) + \log n)$ space and computes a value $\hat\alpha$ such that
$$\Pr[(2/3 - \varepsilon)\,\alpha(I) \le \hat\alpha \le \alpha(I)] \ge 2/3.$$

Proof. For each $a \in \{0, 1, 2\}$ we use Lemma 10 to obtain the estimate $\hat\alpha_a$ for $\alpha(I_a)$. The probability that all three estimates are successful is at least $1 - 1/9 - 1/9 - 1/9 = 2/3$. With the properties of these estimates and $\max_a \alpha(I_a) \ge \tfrac23\,\alpha(I)$ we conclude that
$$\tfrac23(1-\varepsilon)\,\alpha(I) \le \max\{\hat\alpha_0, \hat\alpha_1, \hat\alpha_2\} \le (1+\varepsilon)\,\alpha(I).$$
Rescaling the maximum by $1/(1+\varepsilon)$ to avoid overestimation and replacing $\varepsilon$ by $\varepsilon/2$ completes the proof.

7 Conclusion and other results

We have shown how to get a 2-approximation for the interval selection problem and how to use it to obtain an estimate for the size of an optimal solution. It is also possible to show lower bounds for both problems considered. Emek, Halldórsson and Rosén [5] showed that no streaming algorithm for the interval selection problem can achieve an approximation ratio of $r$ for any constant $r < 2$; for same-size intervals no ratio below $3/2$ is possible. In [1], Cabello and Pérez-Lantero showed similar results for the problem of estimating $\alpha(I)$. For this they reduce the INDEX problem, in which the task is to decide whether a subset of $[n]$ contains some element $i \in [n]$, to the interval selection problem. The communication complexity of INDEX is well studied [7][8]: to achieve a nontrivial success probability, $\Omega(n)$ bits of memory are required. The reduction shows that any algorithm that uses $o(n)$ bits of memory cannot compute an estimate $\hat\alpha$ for which
$$\Pr\left[\left(\tfrac12 + c\right)\alpha(I) \le \hat\alpha \le \alpha(I)\right] \ge \tfrac23$$
holds, for any constant $c > 0$.

This means that the results presented in this work match the lower bounds up to constant factors among algorithms using $o(n)$ bits of space.

References

1. S. Cabello, P. Pérez-Lantero. Interval selection in the streaming model. WADS 2015. Lecture Notes in Computer Science, vol. 9214, Springer, 2015.
2. P. Indyk. A small approximately min-wise independent family of hash functions. Journal of Algorithms 38(1):84-90, 2001.
3. D. M. Kane, J. Nelson, D. P. Woodruff. An optimal algorithm for the distinct elements problem. PODS 2010, pp. 41-52, 2010.
4. M. Datar, S. Muthukrishnan. Estimating rarity and similarity over data stream windows. ESA 2002. Lecture Notes in Computer Science, vol. 2461, Springer, 2002.
5. Y. Emek, M. M. Halldórsson, A. Rosén. Space-constrained interval selection. ICALP 2012. Lecture Notes in Computer Science, vol. 7391, Springer, 2012.
6. D. S. Hochbaum, W. Maass. Approximation schemes for covering and packing problems in image processing and VLSI. Journal of the ACM 32(1):130-136, 1985.
7. T. S. Jayram, R. Kumar, D. Sivakumar. The one-way communication complexity of Hamming distance. Theory of Computing 4(6):129-135, 2008.
8. E. Kushilevitz, N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997.


2 THE COMPUTABLY ENUMERABLE SUPERSETS OF AN R-MAXIMAL SET The structure of E has been the subject of much investigation over the past fty- ve years, s ON THE FILTER OF COMPUTABLY ENUMERABLE SUPERSETS OF AN R-MAXIMAL SET Steffen Lempp Andre Nies D. Reed Solomon Department of Mathematics University of Wisconsin Madison, WI 53706-1388 USA Department of

More information

ACO Comprehensive Exam October 14 and 15, 2013

ACO Comprehensive Exam October 14 and 15, 2013 1. Computability, Complexity and Algorithms (a) Let G be the complete graph on n vertices, and let c : V (G) V (G) [0, ) be a symmetric cost function. Consider the following closest point heuristic for

More information

Dominating Set Counting in Graph Classes

Dominating Set Counting in Graph Classes Dominating Set Counting in Graph Classes Shuji Kijima 1, Yoshio Okamoto 2, and Takeaki Uno 3 1 Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan kijima@inf.kyushu-u.ac.jp

More information

Sorting Algorithms. We have already seen: Selection-sort Insertion-sort Heap-sort. We will see: Bubble-sort Merge-sort Quick-sort

Sorting Algorithms. We have already seen: Selection-sort Insertion-sort Heap-sort. We will see: Bubble-sort Merge-sort Quick-sort Sorting Algorithms We have already seen: Selection-sort Insertion-sort Heap-sort We will see: Bubble-sort Merge-sort Quick-sort We will show that: O(n log n) is optimal for comparison based sorting. Bubble-Sort

More information

2 RODNEY G. DOWNEY STEFFEN LEMPP Theorem. For any incomplete r.e. degree w, there is an incomplete r.e. degree a > w such that there is no r.e. degree

2 RODNEY G. DOWNEY STEFFEN LEMPP Theorem. For any incomplete r.e. degree w, there is an incomplete r.e. degree a > w such that there is no r.e. degree THERE IS NO PLUS-CAPPING DEGREE Rodney G. Downey Steffen Lempp Department of Mathematics, Victoria University of Wellington, Wellington, New Zealand downey@math.vuw.ac.nz Department of Mathematics, University

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

Gap Embedding for Well-Quasi-Orderings 1

Gap Embedding for Well-Quasi-Orderings 1 WoLLIC 2003 Preliminary Version Gap Embedding for Well-Quasi-Orderings 1 Nachum Dershowitz 2 and Iddo Tzameret 3 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel Abstract Given a

More information

Lecture 15 - NP Completeness 1

Lecture 15 - NP Completeness 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) February 29, 2018 Lecture 15 - NP Completeness 1 In the last lecture we discussed how to provide

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School Basic counting techniques Periklis A. Papakonstantinou Rutgers Business School i LECTURE NOTES IN Elementary counting methods Periklis A. Papakonstantinou MSIS, Rutgers Business School ALL RIGHTS RESERVED

More information

Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs

Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Optimal Tree-decomposition Balancing and Reachability on Low Treewidth Graphs Krishnendu Chatterjee Rasmus Ibsen-Jensen Andreas Pavlogiannis IST Austria Abstract. We consider graphs with n nodes together

More information

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute

More information

Lecture 4 February 2nd, 2017

Lecture 4 February 2nd, 2017 CS 224: Advanced Algorithms Spring 2017 Prof. Jelani Nelson Lecture 4 February 2nd, 2017 Scribe: Rohil Prasad 1 Overview In the last lecture we covered topics in hashing, including load balancing, k-wise

More information

Lecture Lecture 9 October 1, 2015

Lecture Lecture 9 October 1, 2015 CS 229r: Algorithms for Big Data Fall 2015 Lecture Lecture 9 October 1, 2015 Prof. Jelani Nelson Scribe: Rachit Singh 1 Overview In the last lecture we covered the distance to monotonicity (DTM) and longest

More information

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have

EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,

More information

8 Priority Queues. 8 Priority Queues. Prim s Minimum Spanning Tree Algorithm. Dijkstra s Shortest Path Algorithm

8 Priority Queues. 8 Priority Queues. Prim s Minimum Spanning Tree Algorithm. Dijkstra s Shortest Path Algorithm 8 Priority Queues 8 Priority Queues A Priority Queue S is a dynamic set data structure that supports the following operations: S. build(x 1,..., x n ): Creates a data-structure that contains just the elements

More information

2. This exam consists of 15 questions. The rst nine questions are multiple choice Q10 requires two

2. This exam consists of 15 questions. The rst nine questions are multiple choice Q10 requires two CS{74 Combinatorics & Discrete Probability, Fall 96 Final Examination 2:30{3:30pm, 7 December Read these instructions carefully. This is a closed book exam. Calculators are permitted. 2. This exam consists

More information

Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018

Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 CS17 Integrated Introduction to Computer Science Klein Contents Lecture 17: Trees and Merge Sort 10:00 AM, Oct 15, 2018 1 Tree definitions 1 2 Analysis of mergesort using a binary tree 1 3 Analysis of

More information

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms A General Lower ound on the I/O-Complexity of Comparison-based Algorithms Lars Arge Mikael Knudsen Kirsten Larsent Aarhus University, Computer Science Department Ny Munkegade, DK-8000 Aarhus C. August

More information

Lecture 23: Alternation vs. Counting

Lecture 23: Alternation vs. Counting CS 710: Complexity Theory 4/13/010 Lecture 3: Alternation vs. Counting Instructor: Dieter van Melkebeek Scribe: Jeff Kinne & Mushfeq Khan We introduced counting complexity classes in the previous lecture

More information

CSE 202 Homework 4 Matthias Springer, A

CSE 202 Homework 4 Matthias Springer, A CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and

More information

Randomness and Computation March 13, Lecture 3

Randomness and Computation March 13, Lecture 3 0368.4163 Randomness and Computation March 13, 2009 Lecture 3 Lecturer: Ronitt Rubinfeld Scribe: Roza Pogalnikova and Yaron Orenstein Announcements Homework 1 is released, due 25/03. Lecture Plan 1. Do

More information

Optimal Color Range Reporting in One Dimension

Optimal Color Range Reporting in One Dimension Optimal Color Range Reporting in One Dimension Yakov Nekrich 1 and Jeffrey Scott Vitter 1 The University of Kansas. yakov.nekrich@googlemail.com, jsv@ku.edu Abstract. Color (or categorical) range reporting

More information

Streaming and communication complexity of Hamming distance

Streaming and communication complexity of Hamming distance Streaming and communication complexity of Hamming distance Tatiana Starikovskaya IRIF, Université Paris-Diderot (Joint work with Raphaël Clifford, ICALP 16) Approximate pattern matching Problem Pattern

More information

Lecture 21: Algebraic Computation Models

Lecture 21: Algebraic Computation Models princeton university cos 522: computational complexity Lecture 21: Algebraic Computation Models Lecturer: Sanjeev Arora Scribe:Loukas Georgiadis We think of numerical algorithms root-finding, gaussian

More information

Notes on induction proofs and recursive definitions

Notes on induction proofs and recursive definitions Notes on induction proofs and recursive definitions James Aspnes December 13, 2010 1 Simple induction Most of the proof techniques we ve talked about so far are only really useful for proving a property

More information

COMPLETION OF PARTIAL LATIN SQUARES

COMPLETION OF PARTIAL LATIN SQUARES COMPLETION OF PARTIAL LATIN SQUARES Benjamin Andrew Burton Honours Thesis Department of Mathematics The University of Queensland Supervisor: Dr Diane Donovan Submitted in 1996 Author s archive version

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

Big Data. Big data arises in many forms: Common themes:

Big Data. Big data arises in many forms: Common themes: Big Data Big data arises in many forms: Physical Measurements: from science (physics, astronomy) Medical data: genetic sequences, detailed time series Activity data: GPS location, social network activity

More information

Problem 5. Use mathematical induction to show that when n is an exact power of two, the solution of the recurrence

Problem 5. Use mathematical induction to show that when n is an exact power of two, the solution of the recurrence A. V. Gerbessiotis CS 610-102 Spring 2014 PS 1 Jan 27, 2014 No points For the remainder of the course Give an algorithm means: describe an algorithm, show that it works as claimed, analyze its worst-case

More information

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins.

On-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins. On-line Bin-Stretching Yossi Azar y Oded Regev z Abstract We are given a sequence of items that can be packed into m unit size bins. In the classical bin packing problem we x the size of the bins and try

More information

Dictionary: an abstract data type

Dictionary: an abstract data type 2-3 Trees 1 Dictionary: an abstract data type A container that maps keys to values Dictionary operations Insert Search Delete Several possible implementations Balanced search trees Hash tables 2 2-3 trees

More information

RMT 2013 Power Round Solutions February 2, 2013

RMT 2013 Power Round Solutions February 2, 2013 RMT 013 Power Round Solutions February, 013 1. (a) (i) {0, 5, 7, 10, 11, 1, 14} {n N 0 : n 15}. (ii) Yes, 5, 7, 11, 16 can be generated by a set of fewer than 4 elements. Specifically, it is generated

More information

Self-improving Algorithms for Coordinate-Wise Maxima and Convex Hulls

Self-improving Algorithms for Coordinate-Wise Maxima and Convex Hulls Self-improving Algorithms for Coordinate-Wise Maxima and Convex Hulls Kenneth L. Clarkson Wolfgang Mulzer C. Seshadhri November 1, 2012 Abstract Computing the coordinate-wise maxima and convex hull of

More information

Limitations of Algorithm Power

Limitations of Algorithm Power Limitations of Algorithm Power Objectives We now move into the third and final major theme for this course. 1. Tools for analyzing algorithms. 2. Design strategies for designing algorithms. 3. Identifying

More information

Online Interval Coloring and Variants

Online Interval Coloring and Variants Online Interval Coloring and Variants Leah Epstein 1, and Meital Levy 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. Email: lea@math.haifa.ac.il School of Computer Science, Tel-Aviv

More information

DD2446 Complexity Theory: Problem Set 4

DD2446 Complexity Theory: Problem Set 4 DD2446 Complexity Theory: Problem Set 4 Due: Friday November 8, 2013, at 23:59. Submit your solutions as a PDF le by e-mail to jakobn at kth dot se with the subject line Problem set 4: your full name.

More information

1 Some loose ends from last time

1 Some loose ends from last time Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Kruskal s and Borůvka s MST algorithms September 20, 2010 1 Some loose ends from last time 1.1 A lemma concerning greedy algorithms and

More information

Range-efficient computation of F 0 over massive data streams

Range-efficient computation of F 0 over massive data streams Range-efficient computation of F 0 over massive data streams A. Pavan Dept. of Computer Science Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Dept. of Elec. and Computer Engg. Iowa State

More information

CS 161: Design and Analysis of Algorithms

CS 161: Design and Analysis of Algorithms CS 161: Design and Analysis of Algorithms Greedy Algorithms 3: Minimum Spanning Trees/Scheduling Disjoint Sets, continued Analysis of Kruskal s Algorithm Interval Scheduling Disjoint Sets, Continued Each

More information

The Complexity of Constructing Evolutionary Trees Using Experiments

The Complexity of Constructing Evolutionary Trees Using Experiments The Complexity of Constructing Evolutionary Trees Using Experiments Gerth Stlting Brodal 1,, Rolf Fagerberg 1,, Christian N. S. Pedersen 1,, and Anna Östlin2, 1 BRICS, Department of Computer Science, University

More information

Randomized Algorithms III Min Cut

Randomized Algorithms III Min Cut Chapter 11 Randomized Algorithms III Min Cut CS 57: Algorithms, Fall 01 October 1, 01 11.1 Min Cut 11.1.1 Problem Definition 11. Min cut 11..0.1 Min cut G = V, E): undirected graph, n vertices, m edges.

More information

SMT 2013 Power Round Solutions February 2, 2013

SMT 2013 Power Round Solutions February 2, 2013 Introduction This Power Round is an exploration of numerical semigroups, mathematical structures which appear very naturally out of answers to simple questions. For example, suppose McDonald s sells Chicken

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine) Mihai Pǎtraşcu, MIT, web.mit.edu/ mip/www/ Index terms: partial-sums problem, prefix sums, dynamic lower bounds Synonyms: dynamic trees 1

More information

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 1, 2001 Mary Ann Liebert, Inc. Pp. 69 78 Perfect Phylogenetic Networks with Recombination LUSHENG WANG, 1 KAIZHONG ZHANG, 2 and LOUXIN ZHANG 3 ABSTRACT

More information

Partitions and Covers

Partitions and Covers University of California, Los Angeles CS 289A Communication Complexity Instructor: Alexander Sherstov Scribe: Dong Wang Date: January 2, 2012 LECTURE 4 Partitions and Covers In previous lectures, we saw

More information

Handout 5. α a1 a n. }, where. xi if a i = 1 1 if a i = 0.

Handout 5. α a1 a n. }, where. xi if a i = 1 1 if a i = 0. Notes on Complexity Theory Last updated: October, 2005 Jonathan Katz Handout 5 1 An Improved Upper-Bound on Circuit Size Here we show the result promised in the previous lecture regarding an upper-bound

More information

Homework 6: Solutions Sid Banerjee Problem 1: (The Flajolet-Martin Counter) ORIE 4520: Stochastics at Scale Fall 2015

Homework 6: Solutions Sid Banerjee Problem 1: (The Flajolet-Martin Counter) ORIE 4520: Stochastics at Scale Fall 2015 Problem 1: (The Flajolet-Martin Counter) In class (and in the prelim!), we looked at an idealized algorithm for finding the number of distinct elements in a stream, where we sampled uniform random variables

More information

Sliding Windows with Limited Storage

Sliding Windows with Limited Storage Electronic Colloquium on Computational Complexity, Report No. 178 (2012) Sliding Windows with Limited Storage Paul Beame Computer Science and Engineering University of Washington Seattle, WA 98195-2350

More information

Randomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010

Randomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010 Randomized Algorithms Lecture 4 Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010 1 Pairwise independent hash functions In the previous lecture we encountered two families

More information

Efficient Reassembling of Graphs, Part 1: The Linear Case

Efficient Reassembling of Graphs, Part 1: The Linear Case Efficient Reassembling of Graphs, Part 1: The Linear Case Assaf Kfoury Boston University Saber Mirzaei Boston University Abstract The reassembling of a simple connected graph G = (V, E) is an abstraction

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

The Inclusion Exclusion Principle and Its More General Version

The Inclusion Exclusion Principle and Its More General Version The Inclusion Exclusion Principle and Its More General Version Stewart Weiss June 28, 2009 1 Introduction The Inclusion-Exclusion Principle is typically seen in the context of combinatorics or probability

More information

On the Complexity of Budgeted Maximum Path Coverage on Trees

On the Complexity of Budgeted Maximum Path Coverage on Trees On the Complexity of Budgeted Maximum Path Coverage on Trees H.-C. Wirth An instance of the budgeted maximum coverage problem is given by a set of weighted ground elements and a cost weighted family of

More information

Advanced Analysis of Algorithms - Midterm (Solutions)

Advanced Analysis of Algorithms - Midterm (Solutions) Advanced Analysis of Algorithms - Midterm (Solutions) K. Subramani LCSEE, West Virginia University, Morgantown, WV {ksmani@csee.wvu.edu} 1 Problems 1. Solve the following recurrence using substitution:

More information

Tree sets. Reinhard Diestel

Tree sets. Reinhard Diestel 1 Tree sets Reinhard Diestel Abstract We study an abstract notion of tree structure which generalizes treedecompositions of graphs and matroids. Unlike tree-decompositions, which are too closely linked

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Lecture 4. 1 Circuit Complexity. Notes on Complexity Theory: Fall 2005 Last updated: September, Jonathan Katz

Lecture 4. 1 Circuit Complexity. Notes on Complexity Theory: Fall 2005 Last updated: September, Jonathan Katz Notes on Complexity Theory: Fall 2005 Last updated: September, 2005 Jonathan Katz Lecture 4 1 Circuit Complexity Circuits are directed, acyclic graphs where nodes are called gates and edges are called

More information

Counting and Constructing Minimal Spanning Trees. Perrin Wright. Department of Mathematics. Florida State University. Tallahassee, FL

Counting and Constructing Minimal Spanning Trees. Perrin Wright. Department of Mathematics. Florida State University. Tallahassee, FL Counting and Constructing Minimal Spanning Trees Perrin Wright Department of Mathematics Florida State University Tallahassee, FL 32306-3027 Abstract. We revisit the minimal spanning tree problem in order

More information