A simple sub-quadratic algorithm for computing the subset partial order

Paul Pritchard
P.Pritchard@cit.gu.edu.au
Technical Report CIT-95-04
School of Computing and Information Technology
Griffith University
Queensland, Australia 4111
March 1995

Abstract

A given collection of sets has a natural partial order induced by the subset relation. Let the size N of the collection be defined as the sum of the cardinalities of the sets that comprise it. Algorithms have recently been presented that compute the partial order (and thereby the minimal and maximal sets, i.e., extremal sets) in worst-case time O(N^2 / log N). This paper develops a simple algorithm that uses only simple data structures, and gives a simple analysis that establishes the above worst-case bound on its running time. The algorithm exploits a variation on lexicographic order that may be of independent interest.

1 Introduction

Given is a collection F = {S_1, ..., S_k}, where each S_i is a set over the same domain D. Define the size of the collection to be N = Σ_i |S_i|. Pritchard [4] presented algorithms for finding those sets in F that have no subset in F. Starting from a naive O(N^2) algorithm (our model of computation is a uniform-cost RAM with Θ(log N)-bit words), an algorithm was obtained that had worst-case complexity O(N log N) when there was a constant that bounded the cardinality of each set in F. It was later shown [5] that there is an implementation with worst-case running time O(N^2 / log N). Yellin and Jutla [8] generalized the problem to that of computing the minimal (as in [4]) or maximal sets in the partial order on the distinct sets in F that is induced by the subset relation. They presented an algorithm that built the
partial order in O(N^2 / log N) operations over a dictionary ADT, and observed that dynamic perfect hashing [2] gave a randomized amortized expected running time of O(1) per ADT operation. Pritchard [6] presented a variant algorithm permitting an implementation taking worst-case time O(N^2 / log N).

These known sub-quadratic algorithms are variously complex ([5]), rely on complex data structures ([8]), or require substantial housekeeping ([6]). We proceed to develop an alternative approach that has a simple structure, that uses simple and efficient algorithmic techniques, and that has a straightforward analysis that establishes a worst-case running time of O(N^2 / log N).

2 Initial development and analysis

For simplicity of presentation, we shall assume that F is a set rather than a multiset. Equal sets can be gathered in an inexpensive preliminary computation, and thereafter treated as one set in a collection of distinct sets. Following [8], we shall compute the complete subset graph, where the vertices are the sets of F, and there is an edge from S to S' iff S' ⊂ S. We shall use a transformational style of development, starting with a high-level algorithm.

    V := F; E := {};
    forall x ∈ F do
        forall y ∈ F such that y ⊂ x do
            E := E ∪ {(x, y)}
        end forall y
    end forall x
    { (V, E) is the complete subset graph for F }

Consider the naive implementation of the inner loop:

    forall y ∈ F do
        if y ⊂ x then E := E ∪ {(x, y)} end if
    end forall y

With each set implemented as an ordered linked list or array of its elements, the cost of a straightforward implementation of a subset test y ⊂ x is O(|x| + |y|). Hence, recalling that there are k sets in F, the cost of this inner loop is order

    Σ_y (|x| + |y|) = Σ_y |x| + Σ_y |y| = k|x| + N.

The overall cost is therefore order

    Σ_x (k|x| + N) = k Σ_x |x| + Σ_x N = kN + kN = 2kN.
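To make the naive baseline concrete, the inner loop can be sketched in Python as follows. This is our illustrative code, not the paper's: each set is stored as a sorted list, and the subset test walks both lists in merge fashion at cost O(|x| + |y|). The names is_subset and naive_subset_graph are ours.

```python
# A sketch (not from the paper) of the naive subset-graph computation.
# Each set is a sorted list of domain elements; the subset test walks
# both lists in merge fashion, costing O(|x| + |y|).

def is_subset(y, x):
    """Return True iff sorted list y is a subset of sorted list x."""
    i = 0
    for e in y:
        # Advance in x until we reach or pass e.
        while i < len(x) and x[i] < e:
            i += 1
        if i == len(x) or x[i] != e:
            return False
        i += 1
    return True

def naive_subset_graph(family):
    """Record an edge (x, y) for every pair with y a subset of x.

    family is a list of distinct sorted lists; the overall cost is O(N^2)
    in the size N (sum of cardinalities), as derived in the text.
    """
    edges = []
    for xi, x in enumerate(family):
        for yi, y in enumerate(family):
            if yi != xi and is_subset(y, x):
                edges.append((xi, yi))
    return edges
```

For example, on the family [1,2,3], [1,2], [2], the recorded edges (by index) are (0,1), (0,2), and (1,2).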
Since k = O(N), the worst-case cost is O(N^2), and it is easy to see this bound is tight. An obvious optimization is to only consider y such that |y| < |x|, but the worst case is still quadratic.

In search of another optimization, fix x, and consider y such that |y| < |x|. Instead of each element of x contributing Θ(1) to the cost of the subset test y ⊂ x in the worst case, we can arrange for just the elements of y ∩ x to do so by precomputing, for each element d in the domain D, a list of the sets in F that contain d. Then to find all subsets of x, the list for each member d of x is traversed, an initially zero count is incremented for each set y in the list, and a set is recognized as a subset of x when its count reaches its cardinality. The cost of the inner loop is O(Σ_{d∈x} weight(d)), where weight(d) denotes the number of sets in F that contain d. Therefore the total cost (after the precomputation) is

    O(Σ_x Σ_{d∈x} weight(d)) = O(Σ_d weight(d)^2).

It can be seen that in the worst case, when there is an element with weight Θ(N), the complexity is still quadratic.

3 Handling common elements

Repeated scanning of the list of sets for an element d can be avoided whenever successive sets x contain d. To arrange for common elements to occur in successive sets, the domain elements are ordered by non-increasing weight, each set is ordered by the domain ordering, and the sets (now sequences) are ordered lexicographically using the ordered domain as the ordered alphabet (with the most common element first). The sets x are then taken in this order by the outer loop. The result is that all sets containing the most common element are processed successively at the start, so that only one pass over its list of sets is needed. Moreover, at most two passes need be made over the list for the next most common element, and in general, at most 2^{i-1} passes over the list for the i'th most common element.

We are now ready to present the complete algorithm and its analysis.

1.
Compute the weight of each element;
2. Order the domain elements by non-increasing weight;
3. Order the elements in each set to respect the domain order;
4. Sort the ordered sets lexicographically, using the ordered domain as the ordered alphabet;
5. Compute the list of sets for each domain element;
6. For each set x, record edge (x, y) for each set y such that y ⊂ x.
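The preprocessing steps 1-5 can be sketched in Python as follows. This is an illustrative reconstruction under our own naming (preprocess, occurs are not from the paper). Note that Python's comparison sorts give O(N log N) behavior here, whereas the paper uses the linear-time radix sorting of [1] for step 4 to meet the stated bounds.

```python
from collections import Counter, defaultdict

def preprocess(family):
    """Steps 1-5: weights, domain order, per-set order, lex sort, element lists.

    family: a collection of distinct sets over a hashable domain.
    Returns (sets, occurs), where sets lists the family with each set
    ordered by non-increasing element weight and the sets sorted
    lexicographically over the ranked alphabet, and occurs maps each
    element d to the indices of the sets containing d, in that same order.
    """
    # Step 1: weight(d) = number of sets in F that contain d.
    weight = Counter(d for s in family for d in s)
    # Step 2: rank the domain by non-increasing weight (rank 0 = most common).
    rank = {d: r for r, (d, _) in
            enumerate(sorted(weight.items(), key=lambda p: -p[1]))}
    # Step 3: order the elements of each set to respect the domain order.
    sets = [sorted(s, key=rank.get) for s in family]
    # Step 4: sort the ordered sets lexicographically over the ranked alphabet.
    sets.sort(key=lambda s: [rank[d] for d in s])
    # Step 5: the list of sets for each domain element; appending while
    # scanning the sorted family makes each list respect the lex order.
    occurs = defaultdict(list)
    for i, s in enumerate(sets):
        for d in s:
            occurs[d].append(i)
    return sets, occurs
```

For example, preprocess([{1,2,3}, {1,2}, {2}]) ranks element 2 first (weight 3) and yields the sets [2], [2,1], [2,1,3] in lexicographic order.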
Let the size of the domain D be m. We assume each element has non-zero weight, so that m = O(N).

Step 1 is done by scanning each set in F, in total time Θ(m) + Θ(N) = Θ(N).

Step 2 is done with an asymptotically optimal sorting algorithm such as Heapsort in time O(m log m) = O(N log N). Alternatively, bin sorting can be used, taking time O(N).

Step 3 is done by applying an asymptotically optimal sorting algorithm such as Heapsort to each set. The total time required is O(N log N).

Step 4 is done using the repeated radix (i.e., bin) sorting method in [1], in time Θ(m) + Θ(N) = Θ(N). Alternatively, the more sophisticated algorithm of Paige and Tarjan [3] can be used. It has the same worst-case complexity.

Step 5 is done by scanning each set x, and appending a reference to it to the list for each element d ∈ x. The total cost is Θ(m) + Θ(N) = Θ(N). The resulting lists respect the lexicographic order on the sets.

The crucial step 6 is implemented as follows. Set Y is used to gather all subsets of the current set x. With each set y a counter y.count is associated that equals the number of elements of x that have been processed so far that are in y. By "prev x" (resp. "next x") is denoted the set immediately preceding (resp. following) x in the lexicographic order; it should be taken as empty if x is first (resp. last).

    { Step 6: }
    Y := {};
    forall y ∈ F do y.count := 0 end forall y;
    forall x ∈ F, in lexicographic order, do
        forall d ∈ x - prev x, in domain order, do
            forall y such that d ∈ y do
                Increment y.count;
                if y.count = |y| then Y := Y ∪ {y} end if
            end forall y
        end forall d;
        forall y ∈ Y do Record edge (x, y) end forall y;
        forall d ∈ x - next x do
            forall y such that d ∈ y do
                if y.count = |y| then Y := Y - {y} end if;
                Decrement y.count
            end forall y
        end forall d
    end forall x

The key to the efficient implementation of step 6 is that each set y such that d ∈ y may be found in O(1) time by following the corresponding list of sets.
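A direct transcription of this step-6 pseudocode into Python might look as follows (our sketch, not the paper's code). It assumes the representation produced by steps 1-5: sets holds the family in lexicographic order, and occurs[d] lists the indices of the sets containing d, in that same order. One simplification relative to the paper: the trivial self-edge (x, x), which the counter scheme would otherwise report, is filtered out when recording.

```python
def subset_graph(sets, occurs):
    """Step 6: record an edge (x, y) for every pair with y a proper subset of x.

    sets  : the family in lexicographic order; each set is a list of elements
    occurs: element -> indices (into sets) of the sets containing it,
            in lexicographic order
    """
    count = [0] * len(sets)   # count[y] = |(processed part of x) ∩ y|
    complete = set()          # Y: sets whose count has reached their size
    edges = []
    for xi, x in enumerate(sets):
        prev = set(sets[xi - 1]) if xi > 0 else set()
        nxt = set(sets[xi + 1]) if xi + 1 < len(sets) else set()
        for d in x:           # increment phase, over x - prev x
            if d in prev:
                continue      # counts for shared elements were retained
            for yi in occurs[d]:
                count[yi] += 1
                if count[yi] == len(sets[yi]):
                    complete.add(yi)
        for yi in complete:   # every complete y satisfies y ⊆ x
            if yi != xi:      # skip the trivial edge (x, x)
                edges.append((xi, yi))
        for d in x:           # decrement phase, over x - next x
            if d in nxt:
                continue
            for yi in occurs[d]:
                if count[yi] == len(sets[yi]):
                    complete.discard(yi)
                count[yi] -= 1
    return edges
```

On the running example sets = [[2], [2,1], [2,1,3]] (with occurs built as in step 5), the recorded edges are (1,0), (2,0), and (2,1), i.e., [2] ⊂ [2,1], [2] ⊂ [2,1,3], and [2,1] ⊂ [2,1,3].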
Let the domain D as ordered by non-increasing weights be D = {d_1, ..., d_m}. As has already been observed, the list for d_i is processed at most 2^{i-1} times. Therefore,
the total cost of step 6 over all executions is order

    Σ_{i=1}^{m} weight(d_i) · min{2^{i-1}, weight(d_i)}.

Let I = ⌊log_2(N / log N)⌋ = Θ(log N). Then for i ≤ I, 2^{i-1} < N / log N, so

    Σ_{i≤I} weight(d_i) · min{2^{i-1}, weight(d_i)} < Σ_{i≤I} weight(d_i) · N / log N ≤ N^2 / log N.

Now

    Σ_{i>I} weight(d_i) · min{2^{i-1}, weight(d_i)} ≤ Σ_{i>I} weight(d_i)^2.

Since Σ_{i>I} weight(d_i) < N, the sum of squares of the weights is maximized by maximizing the weights. But for i > I, since the weights are non-increasing, weight(d_i) ≤ N/i < N/I = O(N / log N), and at most O(log N) weights of this maximal order can occur before their sum exceeds N. Therefore

    Σ_{i>I} weight(d_i)^2 = O(N^2 / log^2 N) · O(log N) = O(N^2 / log N).

It is now apparent that the worst-case complexity of our algorithm is O(N^2 / log N).

4 Further improvements

The supersets of x may be detected just as easily as the subsets: y ⊃ x precisely when y.count = |x| once all of x has been processed. Step 6 can easily be modified to record these edges as well. By removing x from each of the lists it belongs to (one for each member) after x is processed, each edge is found just once, and redundant computation is avoided.

Another improvement can be obtained by choosing a better ordering of the sets, to reduce the number of runs, i.e., maximal sequences of consecutive occurrences of elements. Although lexicographic ordering does the job, in that it is fast to compute and leads to an O(N^2 / log N) complexity, it is far from optimal in the worst case. An unusual ordering that we call lezigzagraphic roughly halves the maximum number of runs, and is just as easy to compute. Table 1 shows the two orderings applied to the family of all non-empty subsets of {1, ..., 4}. Each element of D is assigned a column so as to highlight runs.
    lexicographic      lezigzagraphic
    1                  1
    1 2                1     4
    1 2 3              1   3 4
    1 2 3 4            1   3
    1 2   4            1 2 3
    1   3              1 2 3 4
    1   3 4            1 2   4
    1     4            1 2
      2                  2
      2 3                2   4
      2 3 4              2 3 4
      2   4              2 3
        3                  3
        3 4                3 4
          4                  4

Table 1: Two orderings of non-empty subsets of {1, ..., 4}.

Let length(s) denote the number of characters in a string s, and s.i denote the i'th character of s, 1 ≤ i ≤ length(s). Then lezigzagraphic order is defined as follows. Let s, s' be unequal strings, and let e be maximal such that their first e characters agree. Then

    s ≤ s'  ≡  e = length(s)  ∨ s.(e+1) > s'.(e+1),  if e is odd;
               e = length(s') ∨ s.(e+1) < s'.(e+1),  if e is even,     (1)

where reference to a character implies its existence. In contrast, the definition of lexicographic order does not distinguish between odd and even e:

    s ≤ s'  ≡  e = length(s) ∨ s.(e+1) < s'.(e+1).     (2)

The name "lezigzagraphic" derives from the relationship to lexicographic ordering, and the zigzag graphical pattern of elements sharing the same prefix (as revealed by Table 1). This is not the place to expound on this interesting order; the interested reader is referred to [7] for proofs of the properties we use below. But it may be worth noting here that a slightly less cumbersome definition is obtained by "padding" (to the right) with a "top" alphabetic character +∞. Then the two left disjuncts involving lengths in the r.h.s. of equation (1) may be dropped. In contrast, for lexicographic ordering, defining missing characters to be -∞ allows the left disjunct in the r.h.s. of (2) to be dropped.

The mathematical property that we need in our application here is that for any collection of distinct sets in lezigzagraphic order, the number of runs for the
i'th element in the alphabet is at most ⌈2^{i-2}⌉. This may be proven by induction on i, exploiting the structure evident in Table 1.

There remains the problem of computing the lezigzagraphic order in step 4 of the algorithm. The reader should not be surprised to hear that only minor modifications are needed to the standard lexicographic sort, Algorithm 3.2 of [1]. They leave the order of its complexity unchanged. See [7] for the details.

5 Conclusion

Significant recent attention has been given to the problems of finding, for a given family of sets, the minimal and/or maximal sets in the partial order induced by the subset relation, and of computing the partial order itself. Heretofore, complex or at least messy techniques were used in the known sub-quadratic solutions, where the problem size N is the sum of the cardinalities of the sets in the family. This paper has developed a comparatively simple algorithm whose worst-case complexity of O(N^2 / log N) matches the best known results. It is expected to be very competitive in practical applications. In doing so, we have exploited an interesting variant on lexicographic order (though this is inessential to the main result).

It remains to be seen whether there is an algorithm with worst-case complexity o(N^2 / log N). The best known worst-case lower bound is Ω(N^2 / log^2 N) for any algorithm that explicitly computes the partial order; see [8, 6].

References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.

[2] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan, Dynamic perfect hashing: upper and lower bounds, SIAM J. Comput. 23 (1994), 738-761.

[3] R. Paige and R. Tarjan, Three partition refinement algorithms, SIAM J. Comput. 16 (1987), 973-989.

[4] P. Pritchard, Opportunistic algorithms for eliminating supersets, Acta Inform. 28 (1991), 733-754.

[5] P.
Pritchard, An old sub-quadratic algorithm for finding extremal sets, Technical Report ECS-LFCS-94-301, Dept. of Computer Science, University of Edinburgh, September 1994.
[6] P. Pritchard, On computing the subset graph of a collection of sets, Technical Report CIT-95-03, School of Computing and Information Technology, Griffith University, March 1995.

[7] P. Pritchard, Lezigzagraphic order (avec une application), forthcoming Technical Report, School of Computing and Information Technology, Griffith University, 1995.

[8] D. M. Yellin and C. S. Jutla, Finding extremal sets in less than quadratic time, Inform. Process. Lett. 48 (1993), 29-34.