A simple sub-quadratic algorithm for computing the subset partial order

Paul Pritchard
P.Pritchard@cit.gu.edu.au

Technical Report CIT-95-04
School of Computing and Information Technology
Griffith University, Queensland, Australia 4111

March 1995

Abstract

A given collection of sets has a natural partial order induced by the subset relation. Let the size N of the collection be defined as the sum of the cardinalities of the sets that comprise it. Algorithms have recently been presented that compute the partial order (and thereby the minimal and maximal sets, i.e., extremal sets) in worst-case time O(N^2 / log N). This paper develops a simple algorithm that uses only simple data structures, and gives a simple analysis that establishes the above worst-case bound on its running time. The algorithm exploits a variation on lexicographic order that may be of independent interest.

1 Introduction

Given is a collection F = {S_1, ..., S_k}, where each S_i is a set over the same domain D. Define the size of the collection to be N = Σ_i |S_i|. Pritchard [4] presented algorithms for finding those sets in F that have no subset in F. Starting from a naive O(N^2) algorithm (our model is a uniform-cost RAM with Θ(log N)-bit words), an algorithm was obtained that had worst-case complexity O(N log N) when there was a constant that bounded the cardinality of each set in F. It was later shown [5] that there is an implementation with worst-case running time O(N^2 / log N).

Yellin and Jutla [8] generalized the problem to that of computing the minimal (as in [4]) or maximal sets in the partial order on the distinct sets in F that is induced by the subset relation. They presented an algorithm that built the

partial order in O(N^2 / log N) operations over a dictionary ADT, and observed that dynamic perfect hashing [2] gave a randomized amortized expected running time of O(1) per ADT operation. Pritchard [6] presented a variant algorithm permitting an implementation taking worst-case time O(N^2 / log N).

These known sub-quadratic algorithms are variously complex ([5]), rely on complex data structures ([8]), or require substantial housekeeping ([6]). We proceed to develop an alternative approach that has a simple structure, that uses simple and efficient algorithmic techniques, and that has a straightforward analysis establishing a worst-case running time of O(N^2 / log N).

2 Initial development and analysis

For simplicity of presentation, we shall assume that F is a set rather than a multiset. Equal sets can be gathered in an inexpensive preliminary computation, and thereafter treated as one set in a collection of distinct sets.

Following [8], we shall compute the complete subset graph, where the vertices are the sets of F, and there is an edge from S to S' iff S' ⊂ S. We shall use a transformational style of development, starting with a high-level algorithm.

    V := F; E := ∅;
    forall x ∈ F do
        forall y ∈ F such that y ⊂ x do
            E := E ∪ {(x, y)}
        end forall y
    end forall x
    { (V, E) is the complete subset graph for F }

Consider the naive implementation of the inner loop:

    forall y ∈ F do
        if y ⊂ x then E := E ∪ {(x, y)} end if
    end forall y

With each set implemented as an ordered linked list or array of its elements, the cost of a straightforward implementation of a subset test y ⊂ x is O(|x| + |y|). Hence, recalling that there are k sets in F, the cost of this inner loop is order

    Σ_y (|x| + |y|) = Σ_y |x| + Σ_y |y| = k|x| + N.

The overall cost is therefore order

    Σ_x (k|x| + N) = k Σ_x |x| + Σ_x N = kN + kN = 2kN.
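For concreteness, the naive implementation can be sketched in Python (the representation and names are ours, not the report's): each set is a sorted sequence, and a merge-style scan realises the O(|x| + |y|) subset test.

```python
def is_subset(y, x):
    # Merge-style test of y ⊆ x for sorted sequences: O(|x| + |y|).
    i = 0
    for d in y:
        while i < len(x) and x[i] < d:
            i += 1
        if i == len(x) or x[i] != d:
            return False
        i += 1
    return True

def naive_subset_graph(F):
    # The high-level algorithm with the naive inner loop:
    # an edge (x, y) for every proper subset y of x in F.
    return [(x, y) for x in F for y in F if y != x and is_subset(y, x)]

# Example: among {1}, {1,2}, {1,2,3}, {2,3} there are four subset edges.
F = [(1,), (1, 2), (1, 2, 3), (2, 3)]
```

With k sets of total size N, the k^2 pairwise tests cost O(kN) = O(N^2), matching the count above.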

Since k = O(N), the worst-case cost is O(N^2), and it is easy to see this bound is tight. An obvious optimization is to consider only y such that |y| < |x|, but the worst case is still quadratic.

In search of another optimization, fix x, and consider y such that |y| < |x|. Instead of each element of x contributing Θ(1) to the cost of the subset test y ⊂ x in the worst case, we can arrange for just the elements of y ∩ x to do so by precomputing, for each element d in the domain D, a list of the sets in F that contain d. Then to find all subsets of x, the list for each member d of x is traversed, an initially zero count is incremented for each set y in the list, and a set is recognized as a subset of x when its count reaches its cardinality. The cost of the inner loop is O(Σ_{d∈x} weight(d)), where weight(d) denotes the number of sets in F that contain d. Therefore the total cost (after the precomputation) is

    O(Σ_x Σ_{d∈x} weight(d)) = O(Σ_d weight(d)^2).

It can be seen that in the worst case, when there is an element with weight Θ(N), the complexity is still quadratic.

3 Handling common elements

Repeated scanning of the list of sets for an element d can be avoided whenever successive sets x contain d. To arrange for common elements to occur in successive sets, the domain elements are ordered by non-increasing weight, each set is ordered by the domain ordering, and the sets (now sequences) are ordered lexicographically using the ordered domain as the ordered alphabet (with the most common element first). The sets x are then taken in order by the outer loop. The result is that all sets containing the most common element are processed successively at the start, so that only one pass over its list of sets is needed. Moreover, at most two passes need be made over the list for the next most common element, and in general, at most 2^{i-1} passes over the list for the i'th most common element.

We are now ready to present the complete algorithm and its analysis.

1. Compute the weight of each element;
2. Order the domain elements by non-increasing weight;
3. Order the elements in each set to respect the domain order;
4. Sort the ordered sets lexicographically, using the ordered domain as the ordered alphabet;
5. Compute the list of sets for each domain element;
6. For each set x, record edge (x, y) for each set y such that y ⊂ x.
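Steps 1-5 admit a compact Python prototype (our own code and naming, not the report's; for simplicity it uses the library sort rather than linear-time bin sorting, which affects only the preprocessing cost, not the bound on step 6):

```python
from collections import Counter

def preprocess(F):
    # Step 1: weight(d) = number of sets in F containing d.
    weight = Counter(d for s in F for d in s)
    # Step 2: rank the domain by non-increasing weight (most common first).
    rank = {d: r for r, (d, _) in
            enumerate(sorted(weight.items(), key=lambda p: -p[1]))}
    # Step 3: order each set to respect the domain order.
    sets = [sorted(s, key=rank.__getitem__) for s in F]
    # Step 4: sort the ordered sets lexicographically over the ranked alphabet.
    sets.sort(key=lambda s: [rank[d] for d in s])
    # Step 5: for each element, the (indices of the) sets containing it,
    # listed in the same order as the lexicographic sort.
    occ = {d: [] for d in weight}
    for i, s in enumerate(sets):
        for d in s:
            occ[d].append(i)
    return sets, occ
```

The occurrence lists produced in step 5 automatically respect the lexicographic order on the sets, which is what step 6 relies on.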

Let the size of the domain D be m. We assume each element has non-zero weight, so that m = O(N).

Step 1 is done by scanning each set in F, in total time Θ(m) + Θ(N) = Θ(N). Step 2 is done with an asymptotically optimal sorting algorithm such as Heapsort in time O(m log m) = O(N log N). Alternatively, bin sorting can be used, taking time O(N). Step 3 is done by applying an asymptotically optimal sorting algorithm such as Heapsort to each set. The total time required is O(N log N). Step 4 is done using the repeated radix (i.e., bin) sorting method in [1], in time Θ(m) + Θ(N) = Θ(N). Alternatively, the more sophisticated algorithm of Paige and Tarjan [3] can be used. It has the same worst-case complexity. Step 5 is done by scanning each set x, and appending a reference to it to the list for each element d ∈ x. The total cost is Θ(m) + Θ(N) = Θ(N). The resulting lists respect the lexicographic order on the sets.

The crucial step 6 is implemented as follows. Set Y is used to gather all subsets of the current set x. With each set y a counter y.count is associated that equals the number of elements of x processed so far that are in y. By "prev x" (resp. "next x") is denoted the set immediately preceding (resp. following) x in the lexicographic order; it should be taken as empty if x is first (resp. last).

    { Step 6: }
    Y := ∅;
    forall y ∈ F do y.count := 0 end forall y;
    forall x ∈ F, in lexicographic order, do
        forall d ∈ x − prev x, in domain order, do
            forall y such that d ∈ y do
                Increment y.count;
                if y.count = |y| then Y := Y ∪ {y} end if
            end forall y
        end forall d;
        forall y ∈ Y do Record edge (x, y) end forall y;
        forall d ∈ x − next x do
            forall y such that d ∈ y do
                if y.count = |y| then Y := Y − {y} end if;
                Decrement y.count
            end forall y
        end forall d
    end forall x

The key to the efficient implementation of step 6 is that each set y such that d ∈ y may be found in O(1) time by following the corresponding list of sets.
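Step 6 might be transcribed into Python as follows (our own sketch, consuming an assumed preprocessing output: `sets` in lexicographic order and `occ[d]`, the ordered list of indices of the sets containing d; `count[i]` plays the role of y.count, and the trivial edge (x, x) is suppressed since the sets are distinct):

```python
def subset_edges(sets, occ):
    n = len(sets)
    count = [0] * n   # count[i] = y.count for y = sets[i]
    Y = set()         # gathers (indices of) all subsets of the current x
    edges = []
    for xi, x in enumerate(sets):
        prev = set(sets[xi - 1]) if xi > 0 else set()
        nxt = set(sets[xi + 1]) if xi + 1 < n else set()
        for d in x:                      # forall d in x - prev x
            if d in prev:
                continue                 # counted while processing prev x
            for yi in occ[d]:
                count[yi] += 1
                if count[yi] == len(sets[yi]):
                    Y.add(yi)            # all of y's elements occur in x
        edges.extend((xi, yi) for yi in Y if yi != xi)
        for d in x:                      # forall d in x - next x: undo,
            if d in nxt:                 # keeping the overlap with next x
                continue
            for yi in occ[d]:
                if count[yi] == len(sets[yi]):
                    Y.discard(yi)
                count[yi] -= 1
    return edges
```

Because counters for elements shared with the lexicographic neighbour are neither undone nor recomputed, each occurrence list is traversed once per run of its element, which is exactly what the analysis charges.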
Let the domain D, as ordered by non-increasing weights, be D = {d_1, ..., d_m}. As has already been observed, the list for d_i is processed at most 2^{i-1} times. Therefore,

the total cost of step 6 over all executions is order

    Σ_{i=1}^{m} weight(d_i) · min{2^{i-1}, weight(d_i)}.

Now let I = ⌊log_2(N / log N)⌋ = Θ(log N). Then for i ≤ I, 2^{i-1} < N / log N, so

    Σ_{i≤I} weight(d_i) · min{2^{i-1}, weight(d_i)} < Σ_{i≤I} weight(d_i) · N / log N ≤ N^2 / log N.

For the remaining terms,

    Σ_{i>I} weight(d_i) · min{2^{i-1}, weight(d_i)} ≤ Σ_{i>I} weight(d_i)^2.

Since Σ_{i>I} weight(d_i) < N, the sum of squares of the weights is maximized by maximizing the weights. But for i > I, since the weights are non-increasing,

    weight(d_i) ≤ N/i < N/I = O(N / log N),

so the sum of squares is maximized when Θ(log N) of the weights each have order N / log N, giving

    Σ_{i>I} weight(d_i)^2 = O(N^2 / log^2 N) · O(log N) = O(N^2 / log N).

It is now apparent that the worst-case complexity of our algorithm is O(N^2 / log N).

4 Further improvements

The supersets of x may be detected just as easily as the subsets: y ⊇ x exactly when y.count reaches |x|. Step 6 can easily be modified to record these edges as well. By removing x from each of the lists it belongs to (one for each member) after x is processed, each edge is found just once, and redundant computation is avoided.

Another improvement can be obtained by choosing a better ordering of the sets, to reduce the number of runs, i.e., maximal sequences of consecutive occurrences of elements. Although lexicographic ordering does the job, in that it is fast to compute and leads to an O(N^2 / log N) complexity, it is far from optimal in the worst case. An unusual ordering that we call lezigzagraphic roughly halves the maximum number of runs, and is just as easy to compute. Table 1 shows the two orderings applied to the family of all non-empty subsets of {1, ..., 4}. Each element of D is assigned a column so as to highlight runs.
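The lezigzagraphic comparison (defined formally in (1)) alternates the sense of the comparison with the parity of the common-prefix length. The Python sketch below (our own code, not the report's) implements it, reproduces the lezigzagraphic column of Table 1 for the non-empty subsets of {1, ..., 4}, and counts runs per element:

```python
from functools import cmp_to_key
from itertools import combinations

def lzz_cmp(s, t):
    # Lezigzagraphic comparison: after a common prefix of length e, the
    # larger character comes first when e is odd, the smaller when e is
    # even; the string that runs out precedes at odd e, follows at even e.
    e = 0
    while e < len(s) and e < len(t) and s[e] == t[e]:
        e += 1
    if e % 2 == 1:                       # e odd
        if e == len(s):
            return -1
        if e == len(t):
            return 1
        return -1 if s[e] > t[e] else 1
    else:                                # e even
        if e == len(t):
            return -1
        if e == len(s):
            return 1
        return -1 if s[e] < t[e] else 1

def runs(order, d):
    # Number of maximal blocks of consecutive sets containing d.
    flags = [d in s for s in order]
    return sum(1 for i, f in enumerate(flags)
               if f and (i == 0 or not flags[i - 1]))

subsets = [c for r in range(1, 5) for c in combinations(range(1, 5), r)]
order = sorted(subsets, key=cmp_to_key(lzz_cmp))
```

On this family the run counts per element come out as 1, 1, 2, 4, consistent with the bound of roughly half the lexicographic worst case discussed below.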

    lexicographic    lezigzagraphic
    1                1
    1 2              1     4
    1 2 3            1   3 4
    1 2 3 4          1   3
    1 2   4          1 2 3
    1   3            1 2 3 4
    1   3 4          1 2   4
    1     4          1 2
      2                2
      2 3              2   4
      2 3 4            2 3 4
      2   4            2 3
        3                3
        3 4              3 4
          4                4

Table 1: Two orderings of the non-empty subsets of {1, ..., 4}.

Let length(s) denote the number of characters in a string s, and s.i denote the i'th character of s, 1 ≤ i ≤ length(s). Then lezigzagraphic order is defined as follows. Let s, s' be unequal strings, and let e be maximal such that their first e characters agree. Then

    s ≺ s'  iff  e = length(s)  ∨ s.(e+1) > s'.(e+1),  if e is odd;
                 e = length(s') ∨ s.(e+1) < s'.(e+1),  if e is even,    (1)

where reference to a character implies its existence. In contrast, the definition of lexicographic order does not distinguish between odd and even e:

    s ≺ s'  iff  e = length(s) ∨ s.(e+1) < s'.(e+1).    (2)

The name "lezigzagraphic" derives from the relationship to lexicographic ordering, and from the zigzag graphical pattern of elements sharing the same prefix (as revealed by Table 1). This is not the place to expound on this interesting order; the interested reader is referred to [7] for proofs of the properties we use below. But it may be worth noting here that a slightly less cumbersome definition is obtained by "padding" (to the right) with a "top" alphabetic character +∞. Then the two left disjuncts involving lengths in the r.h.s. of equation (1) may be dropped. In contrast, for lexicographic ordering, defining missing characters to be −∞ allows the left disjunct in the r.h.s. of (2) to be dropped.

The mathematical property that we need in our application here is that for any collection of distinct sets in lezigzagraphic order, the number of runs for the

i'th element in the alphabet is at most ⌈2^{i-2}⌉. This may be proven by induction on i, exploiting the structure evident in Table 1.

There remains the problem of computing the lezigzagraphic order in step 4 of the algorithm. The reader should not be surprised to hear that only minor modifications are needed to the standard lexicographic sort, Algorithm 3.2 of [1]. They leave the order of its complexity unchanged. See [7] for the details.

5 Conclusion

Significant recent attention has been given to the problems of finding, for a given family of sets, the minimal and/or maximal sets in the partial order induced by the subset relation, and of computing the partial order itself. Heretofore, complex or at least messy techniques were used in the known sub-quadratic solutions, where the problem size N is the sum of the cardinalities of the sets in the family. This paper has developed a comparatively simple algorithm whose worst-case complexity of O(N^2 / log N) matches the best known results. It is expected to be very competitive in practical applications. In doing so, we have exploited an interesting variant on lexicographic order (though this is inessential to the main result).

It remains to be seen whether there is an algorithm with worst-case complexity o(N^2 / log N). The best known worst-case lower bound is Ω(N^2 / log^2 N) for any algorithm that explicitly computes the partial order; see [8, 6].

References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974.

[2] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan, Dynamic perfect hashing: upper and lower bounds, SIAM J. Comput. 23 (1994), 738-761.

[3] R. Paige and R. Tarjan, Three partition refinement algorithms, SIAM J. Comput. 16 (1987), 973-989.

[4] P. Pritchard, Opportunistic algorithms for eliminating supersets, Acta Inform. 28 (1991), 733-754.

[5] P. Pritchard, An old sub-quadratic algorithm for finding extremal sets, Technical Report ECS-LFCS-94-301, Dept. of Computer Science, University of Edinburgh, September 1994.

[6] P. Pritchard, On computing the subset graph of a collection of sets, Technical Report CIT-95-03, School of Computing and Information Technology, Griffith University, March 1995.

[7] P. Pritchard, Lezigzagraphic order (avec un application), forthcoming Technical Report, School of Computing and Information Technology, Griffith University, 1995.

[8] D. M. Yellin and C. S. Jutla, Finding extremal sets in less than quadratic time, Inform. Process. Lett. 48 (1993), 29-34.