
Grouping Time-varying Data for Interactive Exploration

Arthur van Goethem, Marc van Kreveld, Maarten Löffler, Bettina Speckmann, Frank Staals

arXiv:1603.06252v1 [cs.CG] 20 Mar 2016

Abstract

We present algorithms and data structures that support the interactive analysis of the grouping structure of one-, two-, or higher-dimensional time-varying data while varying all defining parameters. Grouping structures characterise important patterns in the temporal evolution of sets of time-varying data. We follow Buchin et al. [9], who define groups using three parameters: group-size, group-duration, and inter-entity distance. We give upper and lower bounds on the number of maximal groups over all parameter values, and show how to compute them efficiently. Furthermore, we describe data structures that can report changes in the set of maximal groups in an output-sensitive manner. Our results hold in $\mathbb{R}^d$ for fixed $d$.

1 Introduction

Time-varying phenomena are ubiquitous, and the rapid increase in available tracking, recording, and storing technologies has led to an explosive growth in time-varying data. Such data comes in various forms: time-series (tracking a one-dimensional variable such as stock prices), two- or higher-dimensional trajectories (tracking moving objects such as animals, cars, or sport players), or ensembles (sets of model runs under varying initial conditions for one-dimensional variables such as temperature or rainfall), to name a few. Efficient tools to extract information from time-varying data are needed in a variety of applications, such as predicting traffic flow [22], understanding animal movement [7], coaching sports teams [17], or forecasting the weather [25].
Consequently, recent years have seen a flurry of algorithmic methods to analyse time-varying data which can, for example, identify important geographical locations from a set of trajectories [6, 19], determine good average representations [8], or find patterns, such as groups traveling together [9, 18, 21]. Most, if not all, of these algorithms use several parameters to model the applied problem at hand. The assumption is that the domain scientists, who are the users of the algorithm, know from years of experience which parameter values to use in their analysis. However, in many cases this assumption is not valid. Domain scientists do not always know the correct parameter settings and in fact need algorithmic support to interactively explore their data in, for example, a visual analytics system [3, 20].

Affiliations: Department of Mathematics and Computer Science, Eindhoven University of Technology, {a.i.v.goethem,b.speckmann}@tue.nl; Department of Information and Computing Sciences, Utrecht University, {m.j.vankreveld,m.loffler}@uu.nl; MADALGO, Aarhus University, f.staals@cs.au.dk

We present algorithms and data structures that support the interactive analysis of the grouping structure of one-, two-, or higher-dimensional time-varying data while varying all

defining parameters. Grouping structures (which track the formation and dissolution of groups) characterise important patterns in the temporal evolution of sets of time-varying data. Classic examples are herds of animals or groups of people. Grouping is also meaningful for one-dimensional ensembles, for example, when detecting trends in weather models [24]. Buchin et al. [9] proposed a grouping structure for sets of moving entities. Their definition was later extended by Kostitsyna et al. [21] to geodesic distances. In this paper we use the same trajectory grouping structure. Our contributions are data structures and query algorithms that allow the parameters of the grouping structure to vary interactively and hence make it suitable for explorative analysis of sets of time-varying data. Below we first briefly review the definitions of Buchin et al. [9] and then state our contributions in detail.

Trajectory grouping structure [9]. Let $X$ be a set of $n$ entities moving in $\mathbb{R}^d$ and let $\mathbb{T}$ denote time. The entities trace trajectories in $\mathbb{T} \times \mathbb{R}^d$. We assume that each individual trajectory is piecewise linear and consists of at most $\tau$ vertices. Two entities $a$ and $b$ are $\varepsilon$-connected if there is a chain of entities $a = c_1, .., c_k = b$ such that for any pair of consecutive entities $c_i$ and $c_{i+1}$ the distance is at most $\varepsilon$. A set $G$ is $\varepsilon$-connected if, for any pair $a, b \in G$, the entities are $\varepsilon$-connected (possibly using entities not in $G$). Given parameters $m$, $\varepsilon$, and $\delta$, a set of entities $G$ is an $(m, \varepsilon, \delta)$-group during time interval $I$ if (and only if) (i) $G$ has size at least $m$, (ii) $\mathrm{duration}(I) \ge \delta$, and (iii) $G$ is $\varepsilon$-connected at any time $t \in I$. An $(m, \varepsilon, \delta)$-group $(G, I)$ is maximal if $G$ is maximal in size or $I$ is maximal in duration, that is, if there is no group $H \supset G$ that is also $\varepsilon$-connected during $I$, and no interval $J \supset I$ such that $G$ is $\varepsilon$-connected during $J$.

Results and Organization.
We want to create a data structure $D$ that represents the grouping structure, that is, its maximal groups, while allowing us to efficiently change the parameters. As we show below, the complexity of the problem is already fully apparent for one-dimensional time-varying data. Hence we restrict our description to $\mathbb{R}^1$ in Sections 2 to 4 and then explain in Section 5 how to extend our results to higher dimensions.

If all three parameters $m$, $\varepsilon$, and $\delta$ can vary independently, the question arises what constitutes a meaningful maximal group. Consider a maximal $(m, \varepsilon, \delta)$-group $(G, I)$. If we slightly increase $\varepsilon$ to $\varepsilon'$, and consider a slightly longer time interval $I' \supseteq I$, then $(G, I')$ is a maximal $(m, \varepsilon', \delta)$-group. Intuitively, these groups $(G, I)$ and $(G, I')$ are the same. Thus, we are interested only in (maximal) groups that are combinatorially different. Note that the set of entities $G$ may also be a maximal $(m, \varepsilon, \delta)$-group during a time interval $J$ completely different from $I$; in that case we wish to consider $(G, I)$ and $(G, J)$ to be combinatorially different groups.

In Section 2 we formally define when two (maximal) $(m, \varepsilon, \delta)$-groups are (combinatorially) different. We prove that there are at most $O(|\mathcal{A}| n^2)$ such groups, where $\mathcal{A}$ is the arrangement of the trajectories in $\mathbb{T} \times \mathbb{R}^1$, and $|\mathcal{A}|$ is its complexity. We also argue that the number of maximal groups may be as large as $\Omega(\tau n^3)$, even for fixed parameters $m$, $\varepsilon$, and $\delta$ and in $\mathbb{R}^1$. This significantly strengthens the lower bound of Buchin et al. [9]. In Section 3 we present an $O(|\mathcal{A}| n^2 \log^2 n)$ time algorithm to compute all combinatorially different maximal groups. In Section 4 we describe a data structure that allows us to efficiently obtain all groups for a given set of parameter values. Furthermore, we describe data structures for the interactive exploration of the data.
Specifically, given the set of maximal $(m, \varepsilon, \delta)$-groups we want to change one or more of the parameters and efficiently report only those maximal groups which either ceased to be a maximal group or became a maximal group. That is, our data structures can answer so-called symmetric-difference queries, which are gaining in importance as part of interactive analysis systems [16]. As mentioned above, in Section 5 we extend our data structures and algorithms to $\mathbb{R}^d$, for fixed $d$.

Fig. 1: (a) A set of trajectories for a set of entities moving in $\mathbb{R}^1$. (b) The region $A_{\{r,v\}}$ during which $\{r, v\}$ is alive, and its decomposition into polygons, each corresponding to a distinct instance. In all such regions except the top one, $\{r, v\}$ is a maximal group: in the top region $\{r, v\}$ is dominated by $\{r, v, o\}$ (darker region).

2 Combinatorially Different Maximal Groups

We consider entities moving in $\mathbb{R}^1$, hence the trajectories form an arrangement $\mathcal{A}$ in $\mathbb{T} \times \mathbb{R}^1$. We assume that no three pairs of entities have equal distance at the same time. Consider the four-dimensional parameter space $\mathbb{P}$ with axes time, size, distance, and duration. A set of entities $G$ defines a region $A_G$ in this space in which it is alive: a point $p = (p_t, p_m, p_\varepsilon, p_\delta) = (t, m, \varepsilon, \delta)$ lies in $A_G$ if and only if $G$ is an $(m, \varepsilon, \delta)$-group at time $t$. We use these regions to define when groups are combinatorially different. First (Section 2.1) we fix $m = 1$ and $\delta = 0$ and define and count the number of combinatorially different maximal $(1, \varepsilon, 0)$-groups, over all choices of parameter $\varepsilon$. We then extend our results to include other values of $\delta$ and $m$ in Section 2.2.

2.1 The Number of Distinct Maximal $(1, \varepsilon, 0)$-Groups, over all $\varepsilon$

Consider the $(t, \varepsilon)$-plane in $\mathbb{P}$ through $\delta = 0$ and $m = 1$. The intersection of a region $A_G$ with this plane gives us the points $(t, \varepsilon)$ for which $G$ is a $(1, \varepsilon, 0)$-group. Note that $G$ is a $(1, \varepsilon, 0)$-group at time $t$ if and only if the set $G$ is $\varepsilon$-connected at time $t$. Hence the region $A_G$, restricted to this plane, corresponds to the set of points $(t, \varepsilon)$ for which $G$ is $\varepsilon$-connected. $A_G$, restricted to this plane, is simply connected. Furthermore, as the distance between any pair of entities moving in $\mathbb{R}^1$ varies linearly, $A_G$ is bounded from below by a $t$-monotone polyline $f_G$. The region is unbounded from above: if $G$ is $\varepsilon$-connected (at time $t$) for some value $\varepsilon$, then it is also $\varepsilon'$-connected for any $\varepsilon' \ge \varepsilon$ (see Fig. 1).
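The lower boundary $f_G$ can be evaluated pointwise. In $\mathbb{R}^1$, a chain connecting two members of $G$ may step through any entity lying between them, so $G$ is $\varepsilon$-connected at time $t$ exactly when every gap between vertically consecutive entities between the lowest and highest member of $G$ is at most $\varepsilon$. The following Python sketch illustrates this; the function names and the representation of a trajectory as a list of `(time, value)` vertices with strictly increasing times are assumptions made for the example, not part of the paper.

```python
from bisect import bisect_right

def position_at(trajectory, t):
    """Linearly interpolate a piecewise-linear trajectory [(time, value), ...] at time t.

    Assumes the vertex times are strictly increasing (illustrative representation).
    """
    times = [p[0] for p in trajectory]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t outside the trajectory's time range")
    i = bisect_right(times, t) - 1
    if i == len(trajectory) - 1:          # t coincides with the final vertex
        return trajectory[-1][1]
    (t0, y0), (t1, y1) = trajectory[i], trajectory[i + 1]
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

def min_connecting_eps(trajectories, group, t):
    """Smallest eps for which `group` is eps-connected at time t, i.e. f_G(t)."""
    pos = {name: position_at(tr, t) for name, tr in trajectories.items()}
    lo = min(pos[g] for g in group)
    hi = max(pos[g] for g in group)
    # Entities outside the group may act as stepping stones, so consider every
    # entity located between the lowest and highest group member.
    between = sorted(y for y in pos.values() if lo <= y <= hi)
    return max((b - a for a, b in zip(between, between[1:])), default=0.0)

def is_eps_connected(trajectories, group, t, eps):
    """True if `group` is eps-connected at time t."""
    return min_connecting_eps(trajectories, group, t) <= eps
```

Sampling `min_connecting_eps` at the vertices of the arrangement would trace out the polyline $f_G$; the paper's algorithm avoids such per-set evaluation, so this is only a reference definition.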
Every maximal length segment in the intersection between (the restricted) $A_G$ and the horizontal line $l_\varepsilon$ at height $\varepsilon$ corresponds to a (maximal) time interval $I$ during which $(G, I)$ is a $(1, \varepsilon, 0)$-group, or an $\varepsilon$-group for short. Every such segment corresponds to an instance of $\varepsilon$-group $G$.

Observation 1. Set $G$ is a maximal $\varepsilon$-group on $I$ if and only if the line segment $s_{\varepsilon,I} = \{(t, \varepsilon) \mid t \in I\}$ is a maximal length segment in $A_G$, and is not contained in $A_H$ for a supergroup $H \supset G$.

Two instances of $\varepsilon$-group $G$ may merge. Let $v$ be a local maximum of $f_G$ and let $I_1 = [t_1, v_t]$ and $I_2 = [v_t, t_2]$ be two instances of group $G$ meeting at $v$. At $v_\varepsilon$, the two instances of $G$ that are alive during $[t_1, v_t]$ and $[v_t, t_2]$ merge and we now have a single time interval $I = [t_1, t_2]$ on which $G$ is a group. We say that $I$ is a new instance of $G$, different from $I_1$ and $I_2$. We can thus decompose $A_G$ into maximally-connected regions, each corresponding to a distinct instance of group $G$, using horizontal segments through the local maxima of $f_G$. We further split each region at the values $\varepsilon$ where $G$ changes between being maximal and being dominated. Let $\mathcal{P}_G$ denote the obtained set of regions in which $G$ is maximal. Each such region $P$ corresponds to a combinatorially distinct instance on which $G$ is a maximal group (with at least one member

and duration at least zero). The region $P$ is bounded by at most two horizontal line segments and two $\varepsilon$-monotone chains (see Fig. 1(b)).

Fig. 2: The arrangement $\mathcal{H}$ and the regions $A_{\{r,v\}}$ (purple) and $A_{\{p,o\}}$ (orange) for the trajectories shown in Fig. 1(a); axes are time (horizontal) and distance (vertical). The arrangement $\mathcal{H}$ corresponds to the arrangement of functions $h_a(t)$ that represent the distance from $a$ to the entity directly above $a$ at time $t$.

Counting maximal $\varepsilon$-groups. To bound the number of distinct maximal $\varepsilon$-groups, over all values of $\varepsilon$, we have to count the number of polygons in $\mathcal{P}_G$ over all sets $G$. While there are possibly exponentially many sets, there is structure in the regions $A_G$ which we can exploit. Consider a set of entities $G$ and a region $P \in \mathcal{P}_G$ corresponding to a distinct instance of the maximal $\varepsilon$-group $G$. We observe that all vertices of $P$ lie on the polyline $f_G$: they are either directly vertices of $f_G$, or they are points $(t, \varepsilon)$ on the edges of $f_G$ where $G$ starts or stops being maximal. For the latter case there must be a polyline $f_H$, for some subgroup or supergroup $H$ of $G$, that intersects $f_G$ at such a point. Furthermore, observe that any vertex (of either type) is used by at most a constant number of regions from $\mathcal{P}_G$. Below we show that the complexity of the arrangement $\mathcal{H}$, of all polylines $f_G$ over all $G$, is bounded by $O(|\mathcal{A}| n)$. Furthermore, we show that each vertex of $\mathcal{H}$ can be incident to at most $O(n)$ regions. It follows that the complexity of all polygons $P \in \mathcal{P}_G$, over all groups (sets) $G$, and thus also the number of such sets, is at most $O(|\mathcal{A}| n^2)$.

The complexity of $\mathcal{H}$. The span $S_G(t) = \{a \mid a \in X \wedge a(t) \in [\min_{b \in G} b(t), \max_{b \in G} b(t))\}$ of a set of entities $G$ at time $t$ is the set of entities between the lowest and highest entity of $G$ at time $t$ (for technical reasons, we include the lowest entity of $G$ in the span, but not the highest).
Let $h_a(t)$ denote the distance from entity $a$ to the entity directly above $a$ at time $t$, that is, $h_a(t)$ is the height of the face in $\mathcal{A}$ that has $a$ on its lower boundary at time $t$.

Observation 2. A set $G$ is $\varepsilon$-connected at time $t$ if and only if the largest distance among consecutive entities in $S_G(t)$ is at most $\varepsilon$. That is, $f_G(t) = \max \{ h_a(t) \mid a \in S_G(t) \}$.

It follows that $\mathcal{H}$ is a subset of the arrangement of the $n$ functions $h_a$, for $a \in X$ (see Fig. 2). We use this fact to show that $\mathcal{H}$ has complexity at most $O(|\mathcal{A}| n)$:

Lemma 3. Let $\mathcal{A}$ be an arrangement of $n$ line segments, and let $k$ be the maximum number of line segments intersected by a vertical line. The number of triplets $(F, F', x)$ such that the faces $F \in \mathcal{A}$ and $F' \in \mathcal{A}$ have equal height $h$ at $x$-coordinate $x$ is at most $O(|\mathcal{A}| k) \subseteq O(|\mathcal{A}| n) \subseteq O(n^3)$.

Proof. Let $l_x$ be the vertical line through point $(x, 0)$. Now consider a triplet $(F, F', x)$, and let $e_F$ and $f_F$ ($e_{F'}$ and $f_{F'}$) be the two edges of $F$ ($F'$) intersected by $l_x$. We charge $(F, F', x)$ to edge $e \in \{e_F, f_F, e_{F'}, f_{F'}\}$ if (and only if) its left endpoint, say $u$, is the rightmost endpoint that lies to the left of $l_x$ (i.e. $u$ is the rightmost among the left endpoints). We now show that each edge can be charged at most $2k$ times. Consider an edge $e = uv$ of $\mathcal{A}$, with $u_x \le v_x$. Edge $e$ is charged by a triplet $(F, F', x)$ only if one of the faces, say $F$, is incident to $e$, and the left endpoints of the three other edges that are

intersected by $l_x$ and bounding $F$ or $F'$ lie to the left of $u$. It now follows that there are only $k$ choices for face $F'$, as both the edges bounding $F'$ are intersected (consecutively) by the vertical line through $u_x$, and any vertical line intersects at most $k$ edges. Clearly, $e$ is incident to at most two faces, and thus there are also only two choices for $F$. Finally, observe that for each such pair of faces $F$ and $F'$ there is at most one value $x \in [u_x, r]$, where $r$ is the $x$-coordinate of the leftmost right endpoint among $e_F$, $f_F$, $e_{F'}$ and $f_{F'}$, at which $F$ and $F'$ have equal height (as the height of faces $F$ and $F'$ varies linearly in such an interval). It follows that every edge $e \in \mathcal{A}$ is charged at most $2k$ times.

Remark 1. Interestingly, this bound is tight in the worst case. In Appendix A we give a construction where there are $\Omega(n^3)$ triplets $(F, F', x)$ such that $F$ and $F'$ have equal height at $x$, even if we use lines instead of line segments.

Lemma 4. The arrangement $\mathcal{H}$ has complexity $O(|\mathcal{A}| n)$.

Proof. Vertices in $\mathcal{H}$ are either (i) vertices of individual functions $h_a$, or (ii) intersections between two such functions, say $h_a$ and $h_c$. The total complexity of the individual functions is $O(|\mathcal{A}|)$, hence there are also only $O(|\mathcal{A}|)$ vertices of type (i). Vertices of type (ii) correspond to a triplet $(F, F', x)$ in which $F$ and $F'$ are faces of $\mathcal{A}$ that have equal height at $x$-coordinate $x$. By Lemma 3 there are at most $O(|\mathcal{A}| n)$ such triplets. Thus, the number of vertices of type (ii) is also at most $O(|\mathcal{A}| n)$.

What remains to show is that each vertex $v$ of $\mathcal{H}$ can be incident to at most $O(n)$ polygons from different sets. We use Lemma 5, which follows from Buchin et al. [9]:

Lemma 5. Let $\mathcal{R}$ be the Reeb graph for a fixed value $\varepsilon$ capturing the movement of a set of $n$ entities moving along piecewise-linear trajectories in $\mathbb{R}^d$ (for some constant $d$), and let $v$ be a vertex of $\mathcal{R}$. There are at most $O(n)$ maximal groups that start or end at $v$.

Lemma 6. Let $v$ be a vertex of $\mathcal{H}$, and let $\mathcal{P} = \bigcup_{G \subseteq X} \mathcal{P}_G$.
Vertex $v$ is incident to at most $O(n)$ polygons from $\mathcal{P}$.

Proof. Let $P \in \mathcal{P}_G$ be a region that uses $v$. Thus, $G$ either starts or ends as a maximal $v_\varepsilon$-group at time $v_t$. This means that $v$ corresponds to a single vertex $u$ in the Reeb graph, built with parameter $v_\varepsilon$. By Lemma 5, there are at most $O(n)$ maximal $v_\varepsilon$-groups that start or end at $u$. Hence, $v$ can occur in regions of at most $O(n)$ different sets $G$. For a fixed set $G$, the regions in $\mathcal{P}_G$ are disjoint, so there are only $O(1)$ regions from $\mathcal{P}_G$ that contain $v$.

Lemma 7. The number of distinct $\varepsilon$-groups, over all values $\varepsilon$, and the total complexity of all regions $\mathcal{P} = \bigcup_{G \subseteq X} \mathcal{P}_G$, are both at most $O(|\mathcal{H}| n) = O(|\mathcal{A}| n^2)$.

2.2 The Number of Distinct Maximal Groups, over all Parameters

Maximal groups are monotonic in $m$ and $\delta$ (see Buchin et al. [9]); hence a maximal $(m, \varepsilon, \delta)$-group is also a maximal $(m', \varepsilon, \delta')$-group for any parameters $m' \le m$ and $\delta' \le \delta$. It follows that the number of combinatorially different maximal groups is still at most $O(|\mathcal{A}| n^2)$.

For the complexity of the regions in $\mathcal{P}_G$: fix $m = 0$, and consider the remaining subspace of $\mathbb{P}$ with axes time, distance, and duration, and the restriction of $A_G$, for any set $G$, to this space. In the $\delta = 0$ plane we simply have the regions $A_G$, which are bounded from below by a $t$-monotone polyline $f_G$, as described in Section 2.1. As we increase $\delta$ we observe that the local minima in the boundary $f_G$ get replaced by a horizontal line segment of width $\delta$ (see Fig. 3). For arbitrarily small values of $\delta > 0$, the total complexity of this boundary is still $O(|\mathcal{A}| n^2)$.

Fig. 3: A cross section of the region $A_{\{r,v\}}$ with the plane through a fixed value of $\delta > 0$; axes are time (horizontal) and distance (vertical). The boundary of the original region (i.e. the cross section with the plane through $\delta = 0$) is dashed.

Further increasing $\delta$ monotonically decreases the number of vertices on the functions $f_G$. It follows that the regions $A_G$, restricted to the time, distance, duration space, also have total complexity $O(|\mathcal{A}| n^2)$. Finally, consider the regions $A_G$ in the full four-dimensional space. Clearly, $A_G \cap \{p \mid p \in \mathbb{P} \wedge p_m > |G|\} = \emptyset$, since a group of size $|G|$ cannot satisfy a larger size threshold. For values $m \le |G|$, the boundary of $A_G$ is constant in $m$. We conclude:

Theorem 8. Let $X$ be a set of $n$ entities, in which each entity travels along a piecewise-linear trajectory of $\tau$ edges in $\mathbb{R}^1$, and let $\mathcal{A}$ be the resulting trajectory arrangement. The number of distinct maximal groups is at most $O(|\mathcal{A}| n^2) = O(\tau n^4)$, and the total complexity of all regions in the parameter space corresponding to these groups is also $O(|\mathcal{A}| n^2) = O(\tau n^4)$.

In Section B in the appendix we prove Lemma 9: even for fixed parameters $\varepsilon$, $m$, and $\delta$, the number of maximal $(m, \varepsilon, \delta)$-groups, for entities moving in $\mathbb{R}^1$, may be as large as $\Omega(\tau n^3)$. This strengthens the result of Buchin et al. [9], who established this bound for entities in $\mathbb{R}^2$.

Lemma 9. For a set $X$ of $n$ entities, in which each entity travels along a piecewise-linear trajectory of $\tau$ edges in $\mathbb{R}^1$, there can be $\Omega(\tau n^3)$ maximal $\varepsilon$-groups.

3 Algorithm

In the following we refer to combinatorially different maximal groups simply as groups. Our algorithm computes a representation (of size $O(|\mathcal{A}| n^2)$) of all groups, which we can use to list all groups and, given a pointer to a group $G$, list all its members and the polygon $Q_G \in \mathcal{P}_G$. We assume $\delta = 0$ and $m = 1$, since the sets of maximal groups for $\delta > 0$ and $m > 1$ are subsets of the set for $\delta = 0$ and $m = 1$.

3.1 Overview

Our algorithm uses the arrangement $\mathcal{H}$ located in the $(t, \varepsilon)$-plane. Line segments in $\mathcal{H}$ correspond to the height functions of the faces in $\mathcal{A}$.
Let $a, b \in S_G(t)$ be the pair of consecutive entities in the span of a group $G$ with maximum vertical distance at time $t$. We refer to $(a, b)$ as the critical pair of $G$ at time $t$. The pair $(a, b)$ determines the minimal value of $\varepsilon$ that is required for the group $G$ to be $\varepsilon$-connected at time $t$. The distance between a critical pair $(a, b)$ defines an edge of the polygon bounding $G$ in $\mathcal{H}$.

Our representation will consist of the arrangement $\mathcal{H}$ in which each edge $e$ is annotated with a data structure $T_e$, a list $L$ (or array) with the top edge in each group polygon $Q_G \in \mathcal{P}_G$, and an additional data structure $S$ to support reconstructing the grouping polygons. We start by computing the arrangement $\mathcal{H}$. This takes $O(|\mathcal{H}|) = O(\tau n^3)$ time [2]. The arrangement is built from the set of height functions of the faces of $\mathcal{A}$. With each edge we store the pair of edges in $\mathcal{A}$ responsible for it.
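To make the notion of a critical pair concrete, the sketch below finds it by brute force from the positions of all entities at a single time $t$. It is only an illustration: `critical_pair` and the dictionary-of-positions input are hypothetical names chosen here, and the sketch keeps both the lowest and highest member of the group in the sorted list, which yields the same consecutive gaps as the half-open span used in the text.

```python
def critical_pair(positions, group):
    """Return (a, b, distance) for the widest gap between consecutive entities
    lying between the lowest and highest member of `group`, or None if the
    group occupies a single position.

    `positions` maps every entity name to its coordinate at the time of interest.
    """
    lo = min(positions[g] for g in group)
    hi = max(positions[g] for g in group)
    # All entities between the extremes count, including non-members.
    span = sorted((y, name) for name, y in positions.items() if lo <= y <= hi)
    if len(span) < 2:
        return None
    (ya, a), (yb, b) = max(zip(span, span[1:]),
                           key=lambda pair: pair[1][0] - pair[0][0])
    return a, b, yb - ya
```

The returned distance is the smallest $\varepsilon$ for which the group is $\varepsilon$-connected at that time, which matches the role of the critical pair described above.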

Given arrangement $\mathcal{H}$ we use a sweep line algorithm to construct the rest of the representation. A horizontal line $l(\varepsilon)$ is swept upwards at height $\varepsilon$, and all groups $G$ whose group polygon $Q_G$ currently intersects $l$ are maintained. To achieve this we maintain a two-part status structure. First, a set $S$ storing for each group $G$ the time interval $I(G, \varepsilon) = Q_G \cap l(\varepsilon)$. Second, for each edge $e \in \mathcal{H}$ intersected by $l(\varepsilon)$, a data structure $T_e$ with the sets of entities whose time interval starts or ends at $e$, that is, $G \in T_e$ if and only if $I(G, \varepsilon) = [s, t]$ with $s = e \cap l(\varepsilon)$ or $t = e \cap l(\varepsilon)$. We postpone the implementation of $T$ to Section 3.3. The data structures support the following operations:

Filter($T_e$, $X$)
  Input: a data structure $T_e$ and a set of entities $X$.
  Action: create a data structure $T' = \{G \cap X \mid G \in T_e\}$.

Insert($T_e$, $G$)
  Input: a data structure $T_e$ and a pointer to a representation of $G$.
  Action: create a data structure $T' = T_e \cup \{G\}$.

Delete($T_e$, $G$)
  Input: a data structure $T_e$ and a pointer to a representation of $G$.
  Action: create a data structure $T' = T_e \setminus \{G\}$.

Merge($T_e$, $T_f$)
  Input: two data structures $T_e$, $T_f$, belonging to two edges $e$, $f$ having the same starting or ending vertex.
  Action: create a data structure $T' = T_e \cup T_f$.

Contains($T_e$, $G$)
  Input: a data structure $T_e$ and a pointer to a representation of $G$ ending or starting on edge $e$.
  Action: test if $T_e$ contains set $G$.

HasSuperSet($T_e$, $G$)
  Input: a data structure $T_e$ and a pointer to a representation of $G$ ending or starting on edge $e$.
  Action: test if $T_e$ contains a set $H \supseteq G$, and return the smallest such set if so.

The end points of the time interval $I(G, \varepsilon) = [\mathrm{start}(G, \varepsilon), \mathrm{end}(G, \varepsilon)]$ vary continuously along the sweep. Therefore, for each group $G$, the set $S$ instead stores the edges $e$ and $f$ of $\mathcal{H}$ that contain the starting time $\mathrm{start}(G, \varepsilon)$ and ending time $\mathrm{end}(G, \varepsilon)$, respectively, and pointers to the representation of $G$ in $T_e$ and $T_f$. We refer to $e$ and $f$ as the starting edge and ending edge of $G$.
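The intended semantics of the six operations can be pinned down with a deliberately naive reference model, in which a structure $T_e$ is a plain Python set of frozensets. This is an illustrative sketch only: it takes linear time per query rather than the bounds obtained in Section 3.3, the function names are chosen here, dropping empty intersections in `filter_groups` is an assumption, and so is treating the superset test as non-strict.

```python
def filter_groups(T_e, X):
    """Filter: restrict every stored group to the entities in X.
    Assumption: groups that become empty are dropped."""
    X = frozenset(X)
    return {G & X for G in T_e if G & X}

def insert(T_e, G):
    """Insert: return a structure that additionally stores G."""
    return T_e | {frozenset(G)}

def delete(T_e, G):
    """Delete: return a structure without G."""
    return T_e - {frozenset(G)}

def merge(T_e, T_f):
    """Merge: return a structure storing the groups of both inputs."""
    return T_e | T_f

def contains(T_e, G):
    """Contains: test whether the exact set G is stored."""
    return frozenset(G) in T_e

def has_superset(T_e, G):
    """HasSuperSet: return the smallest stored set H with G <= H, or None.
    Assumption: the superset relation is non-strict."""
    G = frozenset(G)
    supersets = [H for H in T_e if G <= H]
    return min(supersets, key=len) if supersets else None
```

A model like this can serve as a test oracle when checking a faster implementation of $T_e$ against small inputs.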
In addition, we store with each interval $I(G, \varepsilon)$ a pointer to the previous version of the interval $I(G, \varepsilon')$ if (and only if) the starting time (ending time) of $G$ changed to edge $e$ (edge $f$) at $\varepsilon$. Note that updates for both $S$ and $T$ occur only when a vertex is hit by the sweep line $l(\varepsilon)$. For all unbounded groups we add $I(G, \infty)$ to $L$ after the sweep line algorithm.

3.2 Sweepline Events

The sweep line algorithm has four different types of vertex events (see Fig. 4). The Extend event has a symmetrical version in which $uu'$ and $ww'$ both have a negative incline. We describe how to update our data structures in all cases.

Case I - Birth. Vertex $v$ is a local minimum of one of the functions $h_a$, with $a \in X$ (see Fig. 4(a)). When the sweep line intersects $v$ a new maximal group $G$ is born. We can find the maximal group spawned in $O(|G|)$ time by checking which trajectories are $\varepsilon$-connected for this value of $t$ and $\varepsilon$. To this end we traverse the (vertical decomposition of) $\mathcal{A}$ starting at the entities defining $v$.

Case II - Extend. Vertex $v$ is the intersection of two line segments $s_{ab} = uu'$ and $s_{cd} = ww'$, both with a positive incline (see Fig. 4(b)). The case in which $s_{ab}$ and $s_{cd}$ have negative incline can be handled symmetrically. Assume without loss of generality that $s_{cd}$ is steeper than $s_{ab}$. We start with the following observation:

Observation 10. None of the groups arriving on edge $(w, v)$ continue on edge $(v, u')$.

Proof. Let $G$ be a group that arrives at $v$ using edge $(w, v)$. As $G$ uses $(w, v)$, it must contain entities both above and below the face $F$ defined by critical pair $(c, d)$. We know that $u_\varepsilon$ and $w_\varepsilon$ are strictly smaller than $v_\varepsilon$ and that $\varepsilon$ is never smaller than zero. Thus, $v_\varepsilon$ is strictly positive and $F$ has a strictly positive height at $t$. Therefore, $G$ still contains entities above and below $F$ after time $t$. But then the critical pair $(c, d)$ is still part of $G$ and $s_{cd}$ is a lower bound for the group. It follows that $G$ must use edge $(v, w')$.

Fig. 4: The different types of vertex events, (a) Birth, (b) Extend, (c) Join, and (d) Union, shown both in the arrangement $\mathcal{A}$ and in $\mathcal{H}$. The Extend event has a horizontally symmetric case.

We first compute the groups on outgoing edge $(v, u')$. By Observation 10 all these groups arrive on edge $(u, v)$. In particular, they are the maximal size subsets from $T_{(u,v)}$ for which all entities lie below entity $d$ at time $t$, that is, $T_{(v,u')} = \mathrm{Filter}(T_{(u,v)}, \mathrm{below}(d, t))$, where $\mathrm{below}(y, t) = \{x \mid x \in X \wedge x(t) < y(t)\}$. For each group $G$ in $T_{(v,u')}$ we update the time interval in $S$. If $G$ was dominated by a maximal group $H \supset G$ on incoming edge $(u, v)$, we insert a new time interval with starting edge $f = \mathrm{start}(H, \varepsilon)$ and ending edge $(v, u')$ into $S$, and insert $G$ into $T_f$. Note that $G$ and $H$ indeed have the same starting time: $G$ is a subset of $H$, and is thus $\varepsilon$-connected at any time where $H$ is $\varepsilon$-connected. Since $G$ was not maximal before, it did not start earlier than $H$ either.

The groups from $T_{(u,v)}$ that contain entities on both sides of critical pair $(c, d)$ continue onto edge $(v, w')$. Let $T'$ denote these groups. We update the interval $I(G)$ in $S$ for each group $G \in T'$ by setting the ending edge to $(v, w')$. Next, we determine which groups from $T_{(w,v)}$ die at $v$.
A maximal group $G \in T_{(w,v)}$ dies at $v$ if there is a group $H$ on $(v, w')$ that dominates $G$. Any such group $H$ must arrive at $v$ by edge $(u, v)$. Hence, for each group $G \in T_{(w,v)}$ we check if there is a group $H \in T'$ with $H \supset G$ and $I(H) \supseteq I(G)$. For each of these groups we remove the interval $I(G, \varepsilon)$ from $S$, add $I(G, \varepsilon)$ to $L$, and delete the set $G$ from the data structure $T_f$, where $f$ is the starting edge of $G$ (at height $\varepsilon$). The remaining (not dominated) groups from $T_{(w,v)}$ continue onto edge $(v, w')$. Let $T''$ denote this set. We obtain $T_{(v,w')}$ by merging $T'$ and $T''$, that is, $T_{(v,w')} = \mathrm{Merge}(T', T'')$. Since we now have the data structures $T_{(v,u')}$ and $T_{(v,w')}$, and we updated $S$ accordingly, our status structure again reflects the maximal groups currently intersected by the sweep line.

Case III - Join. Vertex $v$ is a local maximum of one of the functions $h_a$, with $a \in X$ (see Fig. 4(c)). Two combinatorially different maximal groups $G_u$ and $G_w$ with the same set of entities

die at $v$ (and are replaced by a single new maximal group) if and only if $G_u$ is a maximal group in $T_{(u,v)}$ and $G_w$ is a maximal group in $T_{(w,v)}$. We test this with a call to $\mathrm{Contains}(T_{(w,v)}, G_u)$ for each group $G_u \in T_{(u,v)}$. Let $G$ be a group in $T_{(u,v)}$, and let $H \in T_{(w,v)}$ be the smallest supergroup of $G$, if such a group exists. At $v$ the group $G$ will immediately extend to the ending edge of $H$. We can find $H$ by using a $\mathrm{HasSuperSet}(T_{(w,v)}, G)$ call. If $H$ exists we insert $G$ into $T_e$, where $e$ is the ending edge of $H$, and update $I(G, \varepsilon)$ in $S$ accordingly. We process the groups $G$ in $T_{(w,v)}$ that have a supergroup $H \in T_{(u,v)}$, and whose starting time therefore jumps at $v$, analogously.

Case IV - Union. Vertex $v$ is the intersection of a line segment $s_{ab} = uu'$ with positive incline and a line segment $s_{cd} = ww'$ with negative incline (see Fig. 4(d)). The Union event is a special case of the Birth event. Incoming groups on edge $(u, v)$ are below the line segment $s_{cd}$ and, hence, cannot contain any elements that are above $c$. As a consequence the line segment $s_{cd}$ does not limit these groups, and for a group $G \in T_{(u,v)}$ we can safely add it to $T_{(v,u')}$. We also update the interval $I(G)$ in $S$ by setting the ending edge to $(v, u')$. An analogous argument can be made for groups arriving on edge $(w, v)$. Furthermore, a new maximal group is formed. Let $H$ be the set of all entities $\varepsilon$-connected to entity $a$ at time $t$. We insert $H$ into $T_{(v,u')}$ and $T_{(v,w')}$ and we insert a time interval $I(H)$ into $S$ with starting edge $(v, w')$ and ending edge $(v, u')$.

3.3 Data Structure

We can implement $S$ using any standard balanced binary search tree; the only requirement is that, given a (representation of) set $G$ in a data structure $T_e$, we can efficiently find its corresponding interval in $S$.

The data structure $T_e$. We need a data structure $T = T_e$ that supports Filter, Insert, Delete, Merge, Contains, and HasSuperSet efficiently.
We describe a structure of size $O(n)$ that supports Contains and HasSuperSet in $O(\log n)$ time, Filter in $O(n)$ time, and Insert and Delete in amortized $O(\log^2 n)$ time. In general, answering Contains and HasSuperSet queries in a dynamic setting is hard and may require $O(n^2)$ space [27].

Lemma 11. Let $G$ and $H$ be two non-empty $\varepsilon$-groups that both end at time $t$. We have: $(G \cap H \neq \emptyset \wedge |G| \le |H|) \iff (G \subseteq H \wedge G \neq \emptyset)$.

Proof. The if-direction is easy: $G \subseteq H$ immediately implies that $|G| \le |H|$, and since $G$ is non-empty we then also have $G \cap H = G \neq \emptyset$. We prove the only-if direction by contradiction: assume that $G \not\subseteq H$, and thus there is an element $b \in G \setminus H$. Furthermore, let $a \in G \cap H$, and let $s_G$ and $s_H$ denote the starting times of group $G$ and $H$, respectively. We distinguish two cases: $s_G \le s_H$ and $s_H < s_G$.

Case $s_G \le s_H$. Since $a \in G$ and $s_G \le s_H$, we have that at any time in $[s_H, t]$, the entities in $G$ are $\varepsilon$-connected to $a$. So, in particular, entity $b$ is $\varepsilon$-connected to $a$. However, during $[s_H, t]$, the entities in $H$ are also $\varepsilon$-connected to $a$. Thus, it follows that during $[s_H, t]$, entity $b$ is also $\varepsilon$-connected to $H$, and thus $b \in H$. Contradiction.

Case $s_H < s_G$. Analogous to the previous case we get that both $H$ and $G$ are $\varepsilon$-connected to entity $a$ during $[s_G, t]$. It then follows that $H \subseteq G$. However, as $b \in G \setminus H$, this relation is strict, that is, $H \subset G$. This contradicts that $|G| \le |H|$.

We implement $T$ with a tree similar to the grouping-tree used by Buchin et al. [9]. Let $\{G_1, .., G_k\}$ denote the groups stored in $T$, and let $X' = \bigcup_{i \in [1,..,k]} G_i$ denote the entities in these groups. Our tree $T$ has a leaf for every entity in $X'$. Each group $G_i$ is represented by an internal

node $v_i$. For each internal node $v_i$ the set of leaves in the subtree rooted at $v_i$ corresponds exactly to the entities in $G_i$. By Lemma 11 these sets indeed form a tree. With each node $v_i$, we store the size of the group $G_i$, and (a pointer to) an arbitrary entity in $G_i$. Next to the tree we store an array containing for each entity a pointer to the leaf in the tree that represents it (or Nil if the entity does not occur in any group). We preprocess $T$ in $O(n)$ time to support level-ancestor (LA) queries as well as lowest common ancestor (LCA) queries, using the methods of Bender and Farach-Colton [4, 5]. Both methods work only for static trees, whereas we need to allow updates to $T$ as well. However, as we need to query $T_e$ only when processing the upper end vertex of $e$, we can be lazy in updating $T_e$. More specifically, we delay all updates, and simply rebuild $T_e$ when we handle its upper end vertex.

HasSuperSet and Contains queries. Using LA queries we can do a binary search on the ancestors of a given node. This allows us to implement both $\mathrm{HasSuperSet}(T_e, G)$ and $\mathrm{Contains}(T_e, G)$ queries in $O(\log n)$ time for a group $G$ ending or starting on edge $e$. Let $a$ be an arbitrary element from group $G$. If the data structure $T_e$ contains a node matching the elements in $G$ then it must be an ancestor of the leaf containing $a$ in $T$. That is, it is the ancestor that has exactly $|G|$ elements. By Lemma 11 there is at most one such node. As ancestors only get more elements as we move up the tree, we find this node in $O(\log n)$ time by binary search. Similarly, we can implement the HasSuperSet function in $O(\log n)$ time.

Insert, Delete, and Merge queries. The Insert, Delete, and Merge operations on $T_e$ are performed lazily; we execute them only when we get to the upper vertex of edge $e$. At such a time we may have to process a batch of $O(n)$ such operations. We now show that we can handle such a batch in $O(n \log^2 n)$ time.

Lemma 12.
Let $G_1, .., G_m$ be maximal $\varepsilon$-groups, ordered by decreasing size, such that: (i) all groups end at time $t$, (ii) $G_1 \supseteq G_i$, for all $i$, (iii) the largest group $G_1$ has size $s$, and (iv) the smallest group has size $|G_m| > s/2$. We then have that $G_i \supseteq G_{i+1}$ for all $i \in [1, .., m-1]$.

Proof. All groups $G_i$ are subsets of $G_1$ and have size more than $s/2$. Thus, any two subsets $G_i$ and $G_{i+j}$ have a non-empty intersection, i.e. $G_i \cap G_{i+j} \neq \emptyset$. The result then follows directly from Lemma 11.

Lemma 13. Given two nodes $v_G \in T$ and $v_H \in T'$, representing the sets $G$ and $H$ respectively, both ending at time $t$, we can test if $G \subset H$ in $O(1)$ time.

Proof. Let $a$ be the entity from $G$ stored with $v_G$. We use the array of $T'$ to find the leaf $l$ in $T'$ that represents $a$, and perform an LCA query on $l$ and $v_H$ in $T'$. If the result is $v_H$ then $a \in H$ and Lemma 11 states that $G \subset H$ if and only if $|G| < |H|$. If the result is not $v_H$ then $a \notin H$, and trivially $G \not\subset H$. Finding $l$ and performing the LCA query takes $O(1)$ time. As we store the group size with each node, we can also test if $|G| < |H|$ in constant time.

Lemma 14. Given $m = O(n)$ nodes representing maximal $\varepsilon$-groups $G_1, .., G_m$, possibly in different data structures $T_1, .., T_m$, that all share ending time $t$, we can construct a new data structure $T$ representing $G_1, .., G_m$ in $O(n \log^2 n)$ time.

Proof. Sort the groups $G_1, .., G_m$ on decreasing group size. Let $G_1 \in T_1$ denote the largest group and let it have size $s$. We assume for now that $G_1$ is a superset of all other groups. If this is not the case we add a dummy group $G_0$ containing all elements. We process the groups in order of decreasing size. By Lemma 12 it follows that all groups $G_1, .., G_k$ that are larger than $s/2$ form a path $P$ in $T$, rooted at $G_1$.

For all remaining (small) groups $G_i$ we then find the smallest group in $P$ that is a superset of $G_i$. By Lemma 13, we can test in $O(1)$ time if a group $H \in P$ is a supergroup of $G_i$ by performing an LCA query in the tree $H$ originated from. We can then find the smallest superset of $G_i$ in $O(\log n)$ time using a binary search. Once all groups are partitioned into clusters with the same ancestor $G_i$, we process the clusters recursively. When the largest group in a cluster has size one we are done (see Fig. 5).

The algorithm goes through a series of rounds. In each round the remaining clusters are handled recursively. Because all (unhandled) clusters jointly contain no more than $O(n)$ groups, each round takes only $O(n \log n)$ time in total. As in each round the size of the largest group left is reduced by half, it follows that after $O(\log n)$ rounds the algorithm must have constructed the complete tree. Updating the array with pointers to the leaves takes $O(n)$ time, as does rebuilding the tree for future LA and LCA queries.

Fig. 5: $T$ is built top-down in several rounds. Edges and nodes are colored by round.

The final function Filter can easily be implemented in linear time by pruning the tree from the bottom up. We thus conclude:

Lemma 15. We can handle each event in $O(n \log^2 n)$ time.

3.4 Maximal Groups

Reconstructing the grouping polygons. Given a group $G$, represented by a pointer to the top edge of $Q_G$ in $L$, we can construct the complete group polygon $Q_G$ in $O(|Q_G|)$ time, and list all group members of $G$ in $O(|G|)$ time. We have access to the top edge of $Q_G$. This is an interval $I(G, \hat\varepsilon)$ in $S$, specifically, the version corresponding to $\hat\varepsilon$, where $\hat\varepsilon$ is the value at which $G$ dies as a maximal group. We then follow the pointers to the previous versions of $I(G, \cdot)$ to construct the left and right chains of $Q_G$. When we encounter the value $\check\varepsilon$ at which $G$ is born, these chains either meet at the same vertex, or we add the final bottom edge of $Q_G$ connecting them.
To report the group members of $G$, we follow the pointer to $I(G, \hat\varepsilon)$ in $S$. This interval stores a pointer to its starting edge $e$, and to a subtree in $T_e$ of which the leaves represent the entities in $G$.

Analysis. The list $L$ contains $O(g) = O(|A| n^2)$ entries (Theorem 8), each of constant size. The total size of all $S$'s is $O(|H| n)$: at each vertex of $H$, there are only a linear number of changes in the intervals in $S$. Each edge $e$ of $H$ stores a data structure $T_e$ of size $O(n)$. It follows that our representation uses a total of $O(|H| n) = O(|A| n^2)$ space. Handling each of the $O(|H|)$ nodes requires $O(n \log^2 n)$ time, so the total running time is $O(|A| n^2 \log^2 n)$.

Theorem 16. Given a set $X$ of $n$ entities, in which each entity travels along a trajectory of $\tau$ edges, we can compute a representation of all $g = O(|A| n^2)$ combinatorial maximal groups $\mathcal{G}$ such that for each group in $\mathcal{G}$ we can report its grouping polygon and its members in time linear in its complexity and size, respectively. The representation has size $O(|A| n^2)$ and takes $O(|A| n^2 \log^2 n)$ time to compute, where $|A| = O(\tau n^2)$ is the complexity of the trajectory arrangement.

4 Data Structures for Maximal Group Queries

In this section we present data structures that allow us to efficiently obtain all groups for a given set of parameter values (Section 4.1), and for the interactive exploration of the data (Section 4.2).

Throughout this section, $n$ denotes the number of entities considered, $\tau$ the number of vertices in any trajectory, $k$ the output complexity, i.e. the number of groups reported, $g$ the number of maximal groups, $g'$ the maximum number of maximal groups for a given (fixed) value of $\varepsilon$, and $\Pi$ the total complexity of the regions corresponding to the $g$ combinatorially different maximal groups. So we have $g' = O(\tau n^3)$ and $g \le \Pi = O(\tau n^4)$. When $g$, $g'$, or $\Pi$ appear as the argument of a logarithm, we write $O(\log n\tau)$.

4.1 Querying the maximal groups

We show that we can store all groups in a data structure of size $O(\Pi \log n\tau \log n)$ that can be built in $O(\Pi \log^2 n\tau \log n)$ time, and allows reporting all $(m, \varepsilon, \delta)$-groups in $O(\log^2 n\tau \log n + k)$ time. We use the following three-level tree to achieve this.

On the first level we have a balanced binary tree whose leaves store the group sizes $1, \dots, n$. Each internal node $v$ corresponds to a range $R_v$ of group sizes and stores all groups whose size lies in the range $R_v$. Let $\mathcal{G}_v$ denote this set of groups, and for each such group let $D_G$ denote the duration of group $G$ as a function of $\varepsilon$. The functions $D_G$ are piecewise-linear, $\delta$-monotone, and may intersect (see Fig. 6). By Theorem 8 the total complexity of these functions is $O(\Pi)$. We store all functions $D_G$, with $G \in \mathcal{G}_v$, in a data structure that can answer the following polyline stabbing queries in $O(\log^2 n\tau + k)$ time: given a query point $q = (\varepsilon, \delta)$, report all polylines that pass above point $q$, that is, for which $D_G(\varepsilon) \ge \delta$. Thus, given parameters $m$, $\varepsilon$, and $\delta$, finding all $(m, \varepsilon, \delta)$-groups takes $O(\log^2 n\tau \log n + k)$ time. We build a segment tree storing the ($\varepsilon$-extent of the) individual edges of all polylines stored at $v$.

Fig. 6: The functions $D_G$ expressing the duration of group $G$ as a function of $\varepsilon$ (horizontal axis: distance $\varepsilon$; vertical axis: duration $\delta$). Assuming all groups have size at least $m$, all $(m, \varepsilon, \delta)$-groups intersect the upward vertical half-ray starting in point $(\varepsilon, \delta)$.
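To make the query semantics concrete, the following Python sketch answers an $(m, \varepsilon, \delta)$ query by brute force: it evaluates every duration function $D_G$ at $\varepsilon$ and keeps the groups with size at least $m$ and $D_G(\varepsilon) \ge \delta$. It is a reference for what the three-level tree must report, not the data structure itself. The tuple layout, the function names, and the sample groups are assumptions made for illustration, and each polyline is assumed to have strictly increasing $\varepsilon$-coordinates.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]                # (eps, duration) vertex of D_G
Group = Tuple[str, int, Sequence[Point]]   # (name, size, polyline of D_G)


def duration_at(poly: Sequence[Point], eps: float) -> Optional[float]:
    """Evaluate the partial piecewise-linear function D_G at eps."""
    xs = [x for x, _ in poly]
    if eps < xs[0] or eps > xs[-1]:
        return None                        # D_G is not defined at eps
    i = bisect_right(xs, eps) - 1          # edge whose eps-extent contains eps
    if i == len(poly) - 1:
        return poly[-1][1]                 # eps coincides with the last vertex
    (x0, y0), (x1, y1) = poly[i], poly[i + 1]
    return y0 + (y1 - y0) * (eps - x0) / (x1 - x0)


def report_groups(groups: Sequence[Group], m: int,
                  eps: float, delta: float) -> List[str]:
    """Names of all groups with size >= m whose polyline passes above (eps, delta)."""
    out = []
    for name, size, poly in groups:
        d = duration_at(poly, eps)
        if size >= m and d is not None and d >= delta:
            out.append(name)
    return out


# Hypothetical sample data: three groups with their duration polylines.
groups = [
    ("A", 3, [(1.0, 0.0), (3.0, 4.0)]),
    ("B", 2, [(0.0, 1.0), (2.0, 1.0), (4.0, 5.0)]),
    ("C", 5, [(2.5, 0.5), (5.0, 3.0)]),
]
print(report_groups(groups, 3, 2.0, 1.5))
```

A brute-force routine like this takes time linear in the total polyline complexity per query, which is exactly the cost the segment tree and its associated structures are designed to avoid; it is still handy for checking an implementation of the real structure on small inputs.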
An internal node $u$ of the segment tree corresponds to an interval $I(u)$, and stores the set of edges $\mathit{Ints}(u)$ that completely span $I(u)$. Hence, with respect to $u$, we can consider these segments as lines. For a query with a point $q$, we have to be able to report all (possibly intersecting) lines from $\mathit{Ints}(u)$ that pass above $q$. We use a duality transform to map each line $\ell$ to a point $\ell^*$ and query point $q$ to a line $q^*$. The problem is then to report all points $\ell^*$ in the half-plane below $q^*$. Such queries can be answered in $O(\log h + k)$ time, using $O(h)$ space and $O(h \log h)$ preprocessing time, where $h$ is the number of points stored [13]. It follows that we can find all $k$ polylines that pass above $q$ in $O(\log^2 n\tau + k)$ time, using $O(\Pi \log n\tau)$ space, and $O(\Pi \log^2 n\tau)$ preprocessing time. We thus obtain the following result:

Theorem 17. We can build a data structure of size $O(\Pi \log n\tau \log n)$, using $O(\Pi \log^2 n\tau \log n)$ preprocessing time, which, given parameters $m$, $\varepsilon$, and $\delta$, can report all $(m, \varepsilon, \delta)$-groups in $O(\log^2 n\tau \log n + k)$ time, where $k$ is the output complexity.

4.2 Symmetric Difference Queries

Here we describe data structures for the interactive exploration of the data. We often have all $(m, \varepsilon, \delta)$-groups, for some parameters $m$, $\varepsilon$, and $\delta$, and we want to change (some of the) parameters, say to $m'$, $\varepsilon'$, and $\delta'$, respectively. Thus, we need a way to efficiently report all changes in the maximal groups. This requires us to solve symmetric difference queries, in

which we want to efficiently report all maximal $(m, \varepsilon, \delta)$-groups that are no longer maximal for parameters $m'$, $\varepsilon'$, and $\delta'$, and all maximal $(m', \varepsilon', \delta')$-groups that were not maximal for parameters $m$, $\varepsilon$, and $\delta$. That is, we wish to report $\mathcal{G}(m, \varepsilon, \delta) \,\triangle\, \mathcal{G}(m', \varepsilon', \delta')$.

Changing only δ. Consider the case in which we vary only $\delta$, and keep $m$ and $\varepsilon$ fixed, that is, $m' = m$ and $\varepsilon' = \varepsilon$. With fixed $\varepsilon$, it suffices to use the algorithm from Buchin et al. [9] to compute all maximal $\varepsilon$-groups with size at least $m$. There are at most $g'$ such groups. Each group $G$ corresponds to an interval $I_G = (-\infty, \mathit{duration}(G)]$ such that $G$ is maximal for a choice $\delta$ of the duration parameter if and only if $\delta \in (-\infty, \mathit{duration}(G)]$. We now have two values $\delta$ and $\delta'$, and we should report all intervals in $S(\delta) \,\triangle\, S(\delta')$, where $S(x)$ denotes the intervals that contain $x$. Note that we can assume without loss of generality that $\delta \le \delta'$. Then we observe that we should report group $G$ if and only if $\mathit{duration}(G) \in [\delta, \delta')$. Hence, our data structure is simply a balanced binary search tree on at most $g'$ values $\mathit{duration}(G)$ and a symmetric difference query is a 1-dimensional range query.

Changing only m. The case in which we vary only $m$ can be solved analogously to the previous case. A maximal group $G$ has a size $|G|$, and $G$ should be reported if and only if $|G| \in [m, m')$, assuming that the group size changes from $m$ to $m'$ or vice versa, with $m < m'$.

Changing only ε. The minimum duration $\delta$ is fixed, so consider the $\delta$-truncated grouping polygons (i.e. the regions $A_G$ where each local minimum has been replaced by a horizontal line segment of width $\delta$). Compute all combinatorially distinct maximal groups for this parameter $\delta$ and remove all groups that have size less than $m$. A group $G$ is now maximal during some interval $I_G = [\check\varepsilon_G, \hat\varepsilon_G]$, and we have to report $G$ if (and only if) $I_G$ occurs in the set $S(\varepsilon) \,\triangle\, S(\varepsilon')$. We now observe that this is the case exactly when $I_G$ contains $\varepsilon$ or $\varepsilon'$, but not both (see Fig. 7).
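The single-parameter cases reduce to very simple predicates, sketched below in Python. For a change in the duration parameter, a group is maximal exactly when the parameter is at most its duration, so the groups that change status are those whose duration lies in the half-open range from the smaller to the larger parameter value; a sorted array with binary search stands in for the balanced search tree. For a change in the distance parameter, a group is reported when its lifetime interval contains exactly one of the two values; this is done by brute force here. All names and the sample data are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def changed_by_delta(durations: List[float], names: List[str],
                     d_old: float, d_new: float) -> List[str]:
    """Groups whose duration lies in [min(d_old, d_new), max(d_old, d_new)).

    `durations` must be sorted ascending, with `names[i]` the group
    that has duration `durations[i]`.
    """
    lo, hi = min(d_old, d_new), max(d_old, d_new)
    i = bisect_left(durations, lo)     # first duration >= lo
    j = bisect_left(durations, hi)     # first duration >= hi
    return names[i:j]


def changed_by_eps(lifetimes: Dict[str, Tuple[float, float]],
                   e_old: float, e_new: float) -> List[str]:
    """Groups whose interval [born, died] contains exactly one of the two values."""
    return [name for name, (born, died) in lifetimes.items()
            if (born <= e_old <= died) != (born <= e_new <= died)]


# Hypothetical sample data.
durations = [2.0, 3.5, 5.0]
names = ["A", "C", "B"]
lifetimes = {"A": (1.0, 4.0), "B": (2.0, 6.0), "C": (5.0, 7.0)}

print(changed_by_delta(durations, names, 3.0, 5.0))
print(changed_by_eps(lifetimes, 3.0, 5.5))
```

In the first query, group B has duration 5.0 and is therefore maximal for both duration parameters, so it is correctly left out. The same range routine applies to a change in the size parameter, with group sizes in place of durations.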
Using an interval tree we can thus report the symmetric difference for $\varepsilon$ in $O(\log n\tau + k)$ time, using $O(\Pi)$ space and $O(\Pi \log n\tau)$ preprocessing time. (Note that we now have $O(\Pi)$ groups (intervals) rather than $O(g')$.)

Fig. 7: The symmetric difference for parameters $\varepsilon$ and $\varepsilon'$ (red and blue intervals) is exactly the set of intervals that contain either $\varepsilon$ or $\varepsilon'$, but not both. Hence, the green intervals should not be reported.

Changing δ and m simultaneously. Consider the space $\delta \times m$. A group $G$ is maximal in the quadrant $(-\infty, \mathit{duration}(G)] \times (-\infty, |G|]$ with top-right corner $p_G = (\mathit{duration}(G), |G|)$. So, for parameters $\delta$ and $m$, the set of maximal $(m, \varepsilon, \delta)$-groups corresponds to the set of corner points that lie to the top-right of $(\delta, m)$. It now follows that when we change the parameters to $(\delta', m')$, the maximal groups that we have to report lie in $Q_{(\delta,m)} \,\triangle\, Q_{(\delta',m')}$ (see Fig. 8). We can report those points (groups) by two three-sided (orthogonal) range queries. Therefore, we store the corner points in a priority search tree [15], and thus solve symmetric difference queries in $O(\log n\tau + k)$ time, and $O(g')$ space. Building a priority search tree takes $O(g' \log n\tau)$ time.

Changing ε and m simultaneously. Consider the space $\varepsilon \times m$. A group $G$ is now a maximal group in a bottomless rectangle $R_G = [\check\varepsilon_G, \hat\varepsilon_G] \times (-\infty, |G|]$. See Fig. 9. Thus, for parameters $\varepsilon$ and $m$ the maximal groups all contain the point $(\varepsilon, m)$. We find the groups that we have to report by combining the approaches for varying only $\varepsilon$ and varying only $m$. Observe that $G$ should be reported if (and only if) $(\varepsilon, m)$ is in the rectangle $R_G$ and $(\varepsilon', m')$ is not, or vice versa. Assume we test for the former. We can solve this query problem with three very similar

two-level data structures. The first is a binary search tree on all groups $G$ sorted on $\check\varepsilon_G$. An internal node $v$ is associated to a subset $\mathcal{G}_v$ of groups that appear in the subtree rooted at $v$. We store $\mathcal{G}_v$ by storing the horizontal line segments $[\check\varepsilon_G, \hat\varepsilon_G] \times \{|G|\}$ in a hive graph [11], preprocessed for planar point location queries. If $h = |\mathcal{G}_v|$, then this structure uses $O(h)$ storage and allows us to report all line segments of $\mathcal{G}_v$ that lie vertically above a query point in $O(\log h + k)$ time. We query the main tree with $\varepsilon'$ and select a subset of $O(\log n\tau)$ nodes whose associated subsets contain exactly the groups $G$ with $\varepsilon' < \check\varepsilon_G$. This implies that $(\varepsilon', m')$ is not inside $R_G$. The second-level structure allows us to find those groups whose rectangle $R_G$ contains $(\varepsilon, m)$. The second data structure is different only in its main tree, which is sorted on $\hat\varepsilon_G$, and we will select the nodes whose associated subsets contain exactly the groups with $\varepsilon' > \hat\varepsilon_G$. The third data structure is again different in the main tree only, and is sorted on $|G|$. Here we select nodes whose associated subsets have $m' > |G|$. Together, the three main trees capture that $(\varepsilon', m')$ is not in $R_G$ and the associated structures capture that $(\varepsilon, m)$ is in $R_G$. We report any group in the symmetric difference at most twice. The data structure uses $O(\Pi \log n\tau)$ space and takes $O(\Pi \log^2 n\tau)$ time to build. The query time is $O(\log^2 n\tau + k)$.

Fig. 8: Symmetric difference queries when we allow varying $\delta$ and $m$ (horizontal axis: duration $\delta$; vertical axis: size $m$). Each combinatorial maximal group $G$ corresponds to a lower-left quadrant in the space $\delta \times m$ (a). For given parameters $m$, $\varepsilon$, and $\delta$, all $(m, \varepsilon, \delta)$-groups lie to the top-right of the point $(\delta, m)$. Therefore, the groups that have to be reported in a symmetric-difference query (shown in red and blue) can be reported by two three-sided range queries.

Changing ε and δ, one by one. Consider the space $\varepsilon \times \delta$.
A group $G$ is maximal in the region below the partial, piecewise-linear, and monotonically increasing function $D_G$ that expresses the duration of $G$ as a function of $\varepsilon$. Each such partial function is defined for a single interval of $\varepsilon$-values. See Fig. 6. Note that the polylines representing $D_G$ and $D_H$, for two groups $G$ and $H$, may intersect. The combination of non-orthogonal boundaries and intersections makes changing $\varepsilon$ and $\delta$ much harder than changing $\varepsilon$ and $m$.

Fig. 9: Symmetric difference queries when we allow varying $\varepsilon$ and $m$ (horizontal axis: distance $\varepsilon$; vertical axis: size $m$). Each combinatorial maximal group $G$ now corresponds to a bottomless rectangle. We find the groups that are maximal for only one pair of parameters (i.e. the red and blue groups) by a query in a two-level tree for symmetric-difference queries.

Consider changing parameter $\delta$ to $\delta'$, while keeping $\varepsilon$ unchanged. For such a query we thus have to report all groups $G$

whose polyline $D_G$ intersects the vertical query segment $Q = \overline{(\varepsilon, \delta)(\varepsilon, \delta')}$. We use the following data structure to answer such queries. We build a segment tree storing the individual edges of the polylines. Like in Section 4.1, each node $v$ in this tree now corresponds to a set $\mathit{Ints}(v)$ of polyline edges (one per polyline) that completely cross the interval $I_v$ associated with $v$. We again treat these edges as lines. We store the $h = |\mathit{Ints}(v)|$ lines in a data structure by Cheng and Janardan [14] that has size $O(h \log h)$, can be built in $O(h \log^2 h)$ time, and allows reporting all (possibly intersecting) lines that intersect $Q$ in $O(\sqrt{h}\, 2^{\log^* h} \log h + k)$ time. Since for any value $\varepsilon$ there are at most $O(g')$ groups, we also have that for any node $v$, $|\mathit{Ints}(v)| = O(g')$. It follows that we can answer symmetric difference queries in $\delta$ in $O(\sqrt{g'}\, 2^{\log^* n\tau} \log^2 n\tau + k)$ time, after $O(\Pi \log^3 n\tau)$ preprocessing time, and using $O(\Pi \log^2 n\tau)$ space.

Consider changing parameter $\varepsilon$ to $\varepsilon'$, while keeping $\delta$ unchanged. For such a query we have to report all groups $G$ whose polyline $D_G$ is above exactly one end point of the horizontal query segment $Q = \overline{(\varepsilon, \delta)(\varepsilon', \delta)}$. Since all polylines $D_G$ are $\varepsilon$- and $\delta$-monotone we could use the same approach as before, reversing the roles of $\varepsilon$ and $\delta$. However, a horizontal line may intersect $O(g)$ polylines rather than $O(g')$, causing $g$ to appear in the query time rather than $g'$. This may be significantly worse. Instead, observe that there are three ways in which $Q$ can have exactly one end point below $D_G$. The two cases where one end point of $Q$ is outside the $\varepsilon$-range of $D_G$ are easily handled with a two-level tree. The first level is a binary search tree on the $\varepsilon$-range of $D_G$, and allows us to find the groups for which $D_G$ is either defined completely before, or completely after $\varepsilon'$. All these groups are not maximal for parameter $\varepsilon'$, so among them we have to select the ones that are maximal for parameters $\varepsilon$ and $\delta$. Our second level is thus the data structure from Section 4.1.
This leads to a data structure of size $O(\Pi \log^2 n\tau)$ and query time $O(\log^3 n\tau + k)$. The third case concerns the situation where the $\varepsilon$-range of $Q$ is contained in the $\varepsilon$-range of $D_G$. In that case we need to test whether $Q$ intersects $D_G$. We use a hereditary segment tree [12] on the $\varepsilon$-ranges of all segments $S$ of all $D_G$, and at each node $v$, we use associated structures for the cases where segments of $S$ are long and $Q$ is short, and vice versa. For the segments $S_v$ of $S$ that are long at $v$, we observe that there are only $O(g')$ of them, because they have a common $\varepsilon$-value. Furthermore, there can be at most one long segment in $S_v$ for each group $G$. Hence, we can use the data structure by Cheng and Janardan [14] to report the ones intersecting $Q$. For the segments of $S$ that are short at $v$, we know that the query segment is long and horizontal, so we can just consider the $\delta$-span of each short segment. However, we must still ensure that the polyline $D_G$ that a short segment is part of extends to the right beyond $Q$. Both conditions together lead again to a hive graph, preprocessed for planar point location. The data structure has size $O(\Pi \log^2 n\tau)$, query time $O(\sqrt{g'}\, 2^{\log^* n\tau} \log^2 n\tau + k)$, and can be built in $O(\Pi \log^3 n\tau)$ time.

Changing all three parameters, one by one. To support changing all three parameters, we combine some of the previous approaches. We build two separate data structures; one to change $m$, the other to change $\varepsilon$ or $\delta$. The data structure to change $m$ is simply the data structure from Section 4.1. The first level of this tree allows us to find $O(\log n)$ subtrees containing the groups whose size is in the range $[\min\{m, m'\}, \max\{m, m'\})$. We then use the associated data structures to report the groups that are long enough (with respect to parameters $\varepsilon$ and $\delta$). Thus, we can answer such queries in $O(\log n \log^2 n\tau + k)$ time.
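The first-level step for a change in the size parameter can be illustrated with a short Python sketch. It assumes an implicit balanced binary tree over the group sizes from 1 to the number of entities, where every node covers a contiguous size range, and it returns the maximal node ranges that tile the sizes whose status changes. Since a group counts for a size parameter exactly when its size is at least that parameter, those are the sizes from the smaller parameter up to, but not including, the larger one. The function names and the implicit tree layout are assumptions made for illustration; in the actual structure each returned range would correspond to a node whose associated structure is then queried with the distance and duration parameters.

```python
from typing import List, Tuple


def canonical_ranges(a: int, b: int, lo: int, hi: int) -> List[Tuple[int, int]]:
    """Maximal node ranges of an implicit balanced tree over [a, b] that tile [lo, hi]."""
    if hi < a or b < lo:
        return []                      # node range is disjoint from the query
    if lo <= a and b <= hi:
        return [(a, b)]                # node range is fully covered: canonical node
    mid = (a + b) // 2                 # otherwise recurse into both children
    return canonical_ranges(a, mid, lo, hi) + canonical_ranges(mid + 1, b, lo, hi)


def size_change_ranges(n: int, m_old: int, m_new: int) -> List[Tuple[int, int]]:
    """Canonical size ranges for groups with min(m_old, m_new) <= size < max(m_old, m_new)."""
    lo, hi = min(m_old, m_new), max(m_old, m_new) - 1
    if lo > hi:
        return []                      # the size parameter did not change
    return canonical_ranges(1, n, lo, hi)


print(size_change_ranges(8, 3, 7))
```

For eight entities and a change of the size parameter from 3 to 7, the affected sizes 3 to 6 are covered by two canonical nodes; in general at most two nodes per level of the tree are selected, which gives the logarithmic number of subtrees mentioned above.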
To support changing $\varepsilon$ or $\delta$ we extend the solution from the previous case: we simply add another level to the structure, which allows us to filter the groups that intersect a query segment $Q$ in the $(\varepsilon, \delta)$-plane by size. This yields a query time of $O(\sqrt{g'}\, 2^{\log^* n\tau} \log^2 n\tau \log n + k)$. The size and preprocessing time remain unaffected, when compared to the previous situation.

Changing ε and δ simultaneously. We build a data structure that allows us to report the maximal groups for parameters $\varepsilon$ and $m$ as a small number of canonical subsets. For each of
