Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1



Recap

Sublinear time algorithms:
- Deterministic + exact: binary search
- Deterministic + inexact: estimating the diameter in a metric space
- Randomized + exact: searching in a sorted list
  o Lower bound (and thus optimality) using Yao's principle
- Randomized + inexact:
  o Estimating the average degree in a graph
  o Estimating the size of a maximal matching in a graph
  o Property testing: testing linearity of a Boolean function

Today

- Continue sublinear time property testing
  o Testing if an array is sorted
  o Testing if a graph is bipartite
- Some comments about sublinear space algorithms
- Begin streaming algorithms
  o Find the missing element(s)
  o Finding very frequent or very rare elements
  o Counting the number of distinct elements

Testing Monotonicity of an Array

- Input: array A of length n with O(1) access to A[i]
- Check: A[i] < A[i+1] for every i ∈ {1, …, n−1}
- Definition of "at least ε-far": you need to change at least εn entries to make the array monotonic.
  o Equivalently, there are at least εn entries that are not between their adjacent values.
- Goal: a 1-sided-error algorithm with O(log n / ε) queries

Testing Monotonicity of an Array

- Proposal: pick t random indices i, and return "no" if A[i] > A[i+1] for even one of them.
  o No! For 1 1 1 1 0 0 0 0 (n/2 each), we'd need t = Ω(n).
- Proposal: pick t random pairs (i, j) with i < j, and return "no" if A[i] > A[j] for even one of them.
  o No! 1 0 2 1 3 2 4 3 5 4 6 5 (two interleaved sorted lists) is ½-far (WHY?), but we need t = Ω(n).
  o (By the birthday paradox, we must also access Ω(√n) elements.) (WHY?)

Testing Monotonicity of an Array

- Algorithm: choose 2/ε random indices i. For each i, do a binary search for A[i]. Return "yes" iff all binary searches succeed.
- Assume all elements are distinct w.l.o.g.
  o Can replace A[i] by (A[i], i) and use lexicographic comparison.
- Important observation: the searchable elements form an increasing subsequence! (WHY?)
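A minimal Python sketch of this tester (not from the lecture; function names are ours). Each sampled index is checked by binary-searching for the pair (A[i], i) under lexicographic comparison, exactly as described above:

```python
import random

def is_searchable(A, i):
    """Binary search for the pair (A[i], i) using lexicographic
    comparison; succeeds iff the search lands exactly on index i."""
    lo, hi = 0, len(A) - 1
    target = (A[i], i)
    while lo <= hi:
        mid = (lo + hi) // 2
        if (A[mid], mid) == target:
            return True
        if (A[mid], mid) < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False

def monotonicity_tester(A, eps, rng=random):
    """1-sided tester: always accepts a sorted array; rejects an array
    that is eps-far from sorted with probability at least 2/3."""
    for _ in range(max(1, int(2 / eps))):
        i = rng.randrange(len(A))
        if not is_searchable(A, i):
            return False  # witness of non-monotonicity found
    return True
```

On a sorted array every binary search succeeds, so the tester never errs on "yes" instances; the 1-sided guarantee comes for free from the structure of the algorithm.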

Testing Monotonicity of an Array

- Algorithm: choose 2/ε random indices i. For each i, do a binary search for A[i]. Return "yes" iff all binary searches succeed.
- Thus: if αn elements are searchable, the array is at most (1−α)-far from monotonic.
- If the array is at least ε-far from monotonic, at least εn elements must not be searchable.
  o Each iteration fails to detect a violation w.p. at most 1 − ε.
  o All 2/ε iterations fail to detect w.p. at most (1 − ε)^{2/ε} ≤ e^{−2} < 1/3.

Graph Property Testing

- An active area of research by itself.
- Let G = (V, E) with n = |V| and m = |E|.
- Input models:
- Dense: represented by an adjacency matrix
  o Query (i, j): is (i, j) ∈ E? Answered in O(1) time.
  o ε-far from satisfying P: at least εn² matrix entries must be changed to satisfy P.
  o Change required = ε-fraction of the input.

Graph Property Testing

- An active area of research by itself.
- Let G = (V, E) with n = |V| and m = |E|.
- Input models:
- Sparse: represented by adjacency lists
  o Query (v, i): get the i-th neighbor of v in O(1) time.
  o We only use this model for graphs with degrees bounded by d.
  o ε-far from satisfying P: at least ε(dn) adjacency-list entries must be changed to satisfy P.
  o Change required = ε-fraction of the input.
- Generally, the dense model is easier than the sparse model.

Testing Bipartiteness

- Dense model:
  o Upper bound: O(1/ε²) queries (independent of n)
  o Lower bound: Ω(1/ε^{1.5})
- Sparse model (for constant d):
  o Upper bound: O(√n · poly(log n / ε))
  o Lower bound: Ω(√n)

Testing Bipartiteness

In the dense model:
- Algorithm [Goldreich, Goldwasser, Ron]:
  o Pick a random subset of vertices S with |S| = Θ(log(1/ε) / ε²).
  o Output "bipartite" iff the induced subgraph is bipartite.
- Analysis:
  o Easy: if the graph is bipartite, the algorithm always accepts.
  o Claim: if the graph is ε-far from bipartite, it rejects w.p. at least 2/3.
- Running time: trivially constant (i.e., independent of n).
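A Python sketch of the GGR-style tester (our own illustration, with the graph given as an adjacency matrix as in the dense model; function names are ours). Bipartiteness of the induced subgraph is checked by BFS 2-coloring:

```python
import math
import random
from collections import deque

def induced_subgraph_bipartite(adj, S):
    """BFS 2-coloring of the subgraph induced on vertex set S;
    returns False iff an odd cycle is found inside S."""
    color = {}
    for s in S:
        if s in color:
            continue
        color[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in S:
                if adj[u][v]:
                    if v not in color:
                        color[v] = 1 - color[u]
                        queue.append(v)
                    elif color[v] == color[u]:
                        return False  # odd cycle witness
    return True

def bipartiteness_tester(adj, eps, rng=random):
    """Sample Theta(log(1/eps)/eps^2) vertices and test the induced
    subgraph. Always accepts bipartite graphs (1-sided error)."""
    n = len(adj)
    s = min(n, int(math.log(1 / eps) / eps ** 2) + 1)
    S = rng.sample(range(n), s)
    return induced_subgraph_bipartite(adj, S)
```

Since any induced subgraph of a bipartite graph is bipartite, the tester never rejects a "yes" instance, matching the 1-sided guarantee above.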

Testing Bipartiteness

- Q: Why doesn't this work for the sparse model?
- Take a line graph of n nodes, and throw in εn additional edges.
  o In the dense model, we don't care about this instance because it's not ε-far (only (ε/n)-far).
  o In the sparse model, we do care about it, and the previous algorithm will not work.

Testing Bipartiteness

In the sparse model:
- Algorithm [Goldreich, Ron]:
  o Repeat O(1/ε) times: pick a random vertex v, run OddCycle(v), and if it finds an odd cycle, REJECT.
  o If no trial rejected, ACCEPT.
- OddCycle(v): performs √n · poly(log n / ε) random walks from v, each of length poly(log n / ε). If some vertex is reachable from v by both an even-length walk and an odd-length walk, an odd cycle is detected.

Limitations of Sublinear Time

- The problems we saw are rather exceptions. For most problems, there is not much you can do in sublinear time.
- For instance, these problems require Ω(n²) time:
  o Estimating min_{i,j} d(i,j) in a metric space d.
    Contrast this with the sublinear algorithm we saw for estimating the diameter max_{i,j} d(i,j).
  o Estimating the cost of the minimum-cost matching.
  o Estimating the cost of k-median for k = Ω(n).

Sublinear Space Algorithms

- An important topic in complexity theory.
- Fundamental unsolved questions:
  o Is NSPACE(S) = DSPACE(S) for S ≥ log n?
  o Is P = L? (L = DSPACE(log n), and we know L ⊆ P.)
  o What is the relation between P and polyL = DSPACE((log n)^{O(1)})?
    We know P ≠ polyL, but don't know whether P ⊊ polyL, polyL ⊊ P, or neither is contained in the other.
- Savitch's theorem: DSPACE(S) ⊆ NSPACE(S) ⊆ DSPACE(S²).

USTCON vs STCON

- USTCON (resp. STCON) is the problem of checking whether a given source node has a path to a given target node in an undirected (resp. directed) graph.
- USTCON ∈ RSPACE(log n) was shown in 1979 through a random-walk based algorithm.
- After much effort, Reingold [2008] finally showed that USTCON ∈ DSPACE(log n).
- Open questions:
  o Is STCON in RSPACE(log n), or maybe even in DSPACE(log n)? What about o(log² n) space, improving on Savitch's log² n?
  o Is RSPACE(S) = DSPACE(S)?

Streaming Algorithms

- Input data comes as a stream a_1, …, a_m, where, say, each a_i ∈ {1, …, n}.
- The stream is typically too large to fit in memory.
  o We want to use only S(m, n) memory for sublinear S.
  o We can measure this in terms of the number of integers stored, or the number of actual bits stored (which might be larger by a factor of log n).
- It is also desired that we do not take too much processing time per element of the stream.
  o O(1) is ideal, but O(log(m + n)) might be okay!
- If we don't know m in advance, this can often act as an online algorithm.

Streaming Algorithms

- Input data comes as a stream a_1, …, a_m, where, say, each a_i ∈ {1, …, n}.
- Most questions are about some statistic of the stream.
  o E.g., "how many distinct elements does it have?", or "count the number of times the most frequent element appears".
- Once again, we will often approximate the answer.
- Most algorithms process the stream in one pass, but sometimes you can achieve more if you can do two or more passes.

Missing Element Problem

- Problem: given a stream a_1, …, a_{n−1}, where each element is a distinct integer from {1, …, n}, find the unique missing element.
- An n-bit algorithm is obvious:
  o Keep a bit for each integer. At the end, spend O(n) time to search for the 0 bit.
- We can do O(log n) bits by maintaining the sum:
  o Missing element = n(n+1)/2 − SUM.
- Deterministic + exact.
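The sum-based algorithm above is a one-liner in Python (our illustration; the function name is ours):

```python
def find_missing(stream, n):
    """Stream contains the integers 1..n except one; the missing element
    is n(n+1)/2 minus the running sum. Only the sum (O(log n) bits,
    since it fits in O(log n)-bit arithmetic) is stored."""
    total = 0
    for a in stream:
        total += a
    return n * (n + 1) // 2 - total
```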

Missing Elements Problem

- Problem: given a stream a_1, …, a_{n−k}, where each element is a distinct integer from {1, …, n}, find all k missing elements.
- The previous algorithm can be generalized:
  o Instead of just computing the sum, compute the power sums S_j = Σ_{i=1}^{n−k} (a_i)^j for 1 ≤ j ≤ k.
  o At the end, we have k equations and k unknowns.
  o This uses O(k² log n) space.
  o Computationally expensive to solve the equations, using Newton's identities followed by finding the roots of a polynomial.
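For k = 2 the equations can be solved by hand, which makes a nice worked example (our illustration, not from the lecture): the power sums give x + y and x² + y², from which xy follows, and the missing pair are the roots of t² − (x+y)t + xy:

```python
import math

def find_two_missing(stream, n):
    """Worked k = 2 case: maintain S1 and S2 over the stream, then solve
    for the two missing values x, y."""
    s1 = s2 = 0
    for a in stream:
        s1 += a
        s2 += a * a
    p = n * (n + 1) // 2 - s1                 # p = x + y
    q = n * (n + 1) * (2 * n + 1) // 6 - s2   # q = x^2 + y^2
    prod = (p * p - q) // 2                   # xy = ((x+y)^2 - (x^2+y^2)) / 2
    disc = math.isqrt(p * p - 4 * prod)       # discriminant of t^2 - pt + xy
    return (p - disc) // 2, (p + disc) // 2
```

For general k this becomes exactly the Newton's-identities-plus-root-finding procedure mentioned above.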

Missing Elements Problem

- We can design much more efficient algorithms if we use randomization.
- There is a streaming algorithm with space and time/item that is O(k log k log n).
- It can also be shown that Ω(k log(n/k)) space is necessary.

Frequency Moments

- Another classic problem is that of computing frequency moments.
- Let A = a_1, …, a_m be a data stream with a_i ∈ {1, …, n}.
- Let m_i denote the number of occurrences of value i. Then for k ≥ 0, the k-th frequency moment is defined as F_k = Σ_{i=1}^{n} (m_i)^k.
  o F_0 = # distinct elements
  o F_1 = m
  o F_2 = Gini's homogeneity index
    The greater the value of F_2, the greater the homogeneity in A.
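The definition is easy to state in code. This baseline (our illustration) computes F_k exactly, but it stores one counter per distinct value, i.e., Θ(F_0) space, which is exactly what streaming algorithms try to avoid:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of (count of value i)^k over
    the distinct values i. Note F_0 counts distinct elements because
    c**0 == 1 for every nonzero count c."""
    counts = Counter(stream)
    return sum(c ** k for c in counts.values())
```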

Frequency Moments

- Goal: given ε, δ, find F̃_k s.t. Pr[|F̃_k − F_k| > εF_k] ≤ δ.
- Seminal paper by Alon, Matias, Szegedy [AMS 99]:
  o k = 0: for every c > 2, an O(log n) space algorithm s.t. Pr[F_0/c ≤ F̃_0 ≤ cF_0] ≥ 1 − 2/c.
  o k = 2: O((log n + log m) · log(1/δ) / ε²) space, i.e., O(1) words for constant ε, δ.
  o k ≥ 3: O(m^{1−1/k} · poly(1/ε) · polylog(m, n, 1/δ)) space.
  o k > 5: lower bound of Ω(m^{1−5/k}).

Frequency Moments

- Goal: given ε, δ, find F̃_k s.t. Pr[|F̃_k − F_k| > εF_k] ≤ δ.
- Seminal paper by Alon, Matias, Szegedy [AMS 99]:
  o k = 0: for every c > 2, an O(log n) space algorithm s.t. Pr[F_0/c ≤ F̃_0 ≤ cF_0] ≥ 1 − 2/c.
- Exactly counting F_0 requires Ω(n) space:
  o Once the stream is processed, the algorithm acts as a membership tester: on a new element x, the count increases by 1 iff x was not part of the stream.
  o So the algorithm must have enough memory to distinguish between all 2^n possible states (one per subset of {1, …, n}).

Frequency Moments

- Goal: given ε, δ, find F̃_k s.t. Pr[|F̃_k − F_k| > εF_k] ≤ δ.
- Seminal paper by Alon, Matias, Szegedy [AMS 99]:
  o k = 0: for every c > 2, an O(log n) space algorithm s.t. Pr[F_0/c ≤ F̃_0 ≤ cF_0] ≥ 1 − 2/c.
- State of the art is the HyperLogLog algorithm:
  o Uses hash functions.
  o Widely used, theoretically near-optimal, practically quite fast.
  o Uses O(ε^{−2} log log n + log n) space.
  o It can estimate > 10^9 distinct elements with 98% accuracy using only 1.5 kB of memory!
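To convey the flavor of hash-based distinct counting without the full HyperLogLog machinery, here is a K-minimum-values (KMV) sketch, a simpler relative of HyperLogLog (our illustration, not from the lecture): hash each element into [0, 1) and keep only the k smallest distinct hash values; if the k-th smallest is h, there are roughly (k−1)/h distinct elements.

```python
import hashlib

def _hash01(x):
    """Deterministic hash of x into [0, 1)."""
    h = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2 ** 64

def kmv_distinct_estimate(stream, k=64):
    """K-minimum-values sketch: O(k) space regardless of stream length.
    Duplicates hash to the same value, so they never inflate the sketch."""
    mins = set()
    for x in stream:
        mins.add(_hash01(x))
        if len(mins) > k:
            mins.remove(max(mins))
    if len(mins) < k:
        return len(mins)  # fewer than k distinct elements: exact count
    return int((k - 1) / max(mins))
```

The relative error is about 1/√k, so k trades space for accuracy; HyperLogLog achieves a similar trade-off with far fewer bits per bucket.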

Frequency Moments

- Goal: given ε, δ, find F̃_k s.t. Pr[|F̃_k − F_k| > εF_k] ≤ δ.
- Seminal paper by Alon, Matias, Szegedy [AMS 99]:
  o k > 2: the Ω(m^{1−5/k}) bound was improved to Ω(m^{1−2/k}) by Bar-Yossef et al.
    Their bound also works for real-valued k.
  o Indyk and Woodruff [2005] gave an algorithm that works for real-valued k > 2 with a matching upper bound of O(m^{1−2/k}).

AMS F_k Algorithm

- The basic idea is to define a random variable Y whose expected value is close to F_k and whose variance is sufficiently small, such that it can be calculated under the space constraint.
- We will present the AMS algorithm for computing F_k, and sketch the proof for k ≥ 3 as well as the improved proof for k = 2.

AMS F_k Algorithm

- Algorithm: let s_1 = (8/ε²) · k · m^{1−1/k} and s_2 = 2 log(1/δ).
- Let Y = median(Y_1, …, Y_{s_2}), where Y_i = mean(X_{i,1}, …, X_{i,s_1}), where:
  o The X_{i,j} are i.i.d. random variables, calculated as follows.
  o For each X_{i,j}, choose a random position p ∈ {1, …, m} in advance.
  o When a_p arrives, note down its value.
  o In the remaining stream, maintain r = |{q : q ≥ p and a_q = a_p}|.
  o X_{i,j} = m (r^k − (r−1)^k).
- Space: for each of the s_1 s_2 variables X, log n bits to store a_p and log m bits to store r.
- Note: this assumes we know m. But m can be estimated as the stream unfolds.

AMS F_k Algorithm

- We want to show: E[X] = F_k, and Var[X] is small.
- E[X] = E[m (r^k − (r−1)^k)]:
  o The m different choices of p ∈ {1, …, m} each have probability 1/m. Thus E[X] is just the sum of r^k − (r−1)^k across all choices of p.
  o For each distinct value i, there will be m_i terms, which telescope:
    (m_i^k − (m_i−1)^k) + ((m_i−1)^k − (m_i−2)^k) + … + (1^k − 0^k) = m_i^k.
  o Thus the overall sum is Σ_i m_i^k = F_k.
- Thus E[Y] = E[X] = F_k.
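The telescoping argument can be checked numerically (our illustration): averaging X over all m equally likely start positions p gives F_k exactly, not just approximately.

```python
from collections import Counter

def ams_X(stream, p, k):
    """AMS estimator X for one fixed start position p (0-indexed):
    r counts occurrences of stream[p] from position p onward."""
    m = len(stream)
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (r ** k - (r - 1) ** k)

def exact_expectation(stream, k):
    """Average of X over all m positions; equals F_k by telescoping."""
    m = len(stream)
    avg = sum(ams_X(stream, p, k) for p in range(m)) / m
    fk = sum(c ** k for c in Counter(stream).values())
    return avg, fk
```

For the stream 1 1 2 3 1 with k = 2, the five values of X are 25, 15, 5, 5, 5, whose average is 11 = 3² + 1² + 1² = F_2.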

AMS F_k Algorithm

- To show: Pr[|Y_i − F_k| > εF_k] ≤ 1/8.
  o Taking the median over s_2 = 2 log(1/δ) many Y_i will do the rest.
- Chebyshev's inequality: Pr[|Y_i − E[Y_i]| > ε E[Y_i]] ≤ Var[Y_i] / (ε² E[Y_i]²).
  o Var[Y_i] = Var[X]/s_1 ≤ E[X²]/s_1, and E[Y_i] = E[X] = F_k.
  o Thus the probability bound is E[X²] / (s_1 ε² F_k²) = E[X²] / (8 k m^{1−1/k} F_k²).
- To show that this is at most 1/8, we want to show: E[X²] ≤ k m^{1−1/k} F_k².
  o Show that E[X²] ≤ k F_1 F_{2k−1}, and that F_1 F_{2k−1} ≤ m^{1−1/k} F_k².
  o Just more algebra!

Sketch of the F_2 Improvement

- They retain s_2 = 2 log(1/δ), but decrease s_1 to just a constant, 16/ε².
- The idea is that X will not maintain a count for each value separately, but rather an aggregate:
  o Z = Σ_{t=1}^{n} b_t m_t, and X = Z².
- The vector (b_1, …, b_n) ∈ {−1, 1}^n is chosen at random as follows:
  o Let V = {v_1, …, v_h} be a set of h = O(n²) four-wise independent vectors, each v_p = (v_{p,1}, …, v_{p,n}) ∈ {−1, 1}^n.
  o Choose p ∈ {1, …, h} at random, and set (b_1, …, b_n) = v_p.
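A toy Python sketch of this estimator (our illustration: fully independent random signs stand in for the four-wise independent family V, which suffices for the expectation and keeps the code short). Note that Z can be maintained in one pass, adding b_{a_t} as each stream element arrives:

```python
import random

def ams_f2_estimate(stream, trials=20000, rng=None):
    """Each trial draws a random +/-1 sign b_i per distinct value i,
    maintains Z = sum over the stream of b_{a_t} (a single counter),
    and outputs Z^2. E[Z^2] = sum_i m_i^2 = F_2 because the cross
    terms b_i * b_j (i != j) vanish in expectation."""
    rng = rng or random.Random()
    values = set(stream)
    total = 0.0
    for _ in range(trials):
        b = {i: rng.choice((-1, 1)) for i in values}
        z = sum(b[a] for a in stream)  # maintainable in one streaming pass
        total += z * z
    return total / trials
```

The point of the four-wise independent family in AMS is that a vector of signs can be specified with O(log n) random bits while still controlling Var[Z²]; the toy version above spends n random bits per trial instead.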

Majority Element

- Input: stream A = a_1, …, a_m, where a_i ∈ [n].
- Q: Is there a value i that appears more than m/2 times?
- Algorithm: store a candidate a and a counter c (initially c = 0).
  o For i = 1, …, m:
    If c = 0: set a = a_i and c = 1.
    Else: if a = a_i, c ← c + 1; if a ≠ a_i, c ← c − 1.
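This is the Boyer-Moore majority vote; a direct Python transcription of the steps above (the function name is ours):

```python
def majority_candidate(stream):
    """Boyer-Moore majority vote, using O(log m + log n) bits of state.
    If some value occurs more than m/2 times it is returned; otherwise
    the returned value is arbitrary and a second pass must verify it."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```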

Majority Element

- Space: clearly O(log m + log n) bits.
- Claim: if there exists a value v that appears more than m/2 times, then a = v at the end.
- Proof: take an occurrence of v (say a_i), and let's pair it up:
  o If it decreases the counter, pair it with the unique earlier element a_j (j < i) that contributed the 1 we just decreased.
  o If it increases the counter:
    If the added 1 is never taken back, QED!
    If it is decreased by some a_j (j > i), pair it with that.
- Because at least one occurrence of v is not paired, the "never taken back" case happens at least once.

Majority Element

- Space: clearly O(log m + log n) bits.
- Claim: if there exists a value v that appears more than m/2 times, then a = v at the end.
- A simpler proof:
  o At any step, let c′ = c if a = v, and c′ = −c otherwise.
  o Every occurrence of v must increase c′ by 1.
  o Every occurrence of a value other than v either increases or decreases c′ by 1.
  o Majority ⇒ more increments than decrements in c′ ⇒ c′ is positive at the end, so a = v!

Majority Element

- Note 1: when a majority element does not exist, the algorithm doesn't necessarily find the mode.
- Note 2: if a majority element exists, the algorithm correctly finds that element. However, if there is no majority element, the algorithm does not detect that and still returns a value.
  o It can be trivially checked whether the returned value is indeed a majority element if a second pass over the stream is allowed.
  o Surprisingly, we can prove that this cannot be done in one pass. (Next lecture!)