Lecture 10 September 27, 2016


CS 395T: Sublinear Algorithms (Fall 2016)
Prof. Eric Price
Lecture 10: September 27, 2016
Scribes: Quinten McNamara & William Hoza

1 Overview

In this lecture, we focus on constructing coresets, which are concise representations of a larger dataset. We will be particularly interested in coresets for datasets in R^2; as an application, we will see a streaming algorithm for the k-median problem in R^2 in the insertion-only model.

2 The k-median problem

The k-median problem can be thought of in terms of the facility location problem: given a set of consumers that need to be provided service, open a set of k stores that minimizes the cost (the total distance traveled by the consumers).

Problem definition. Let Δ be a positive integer. Given a set of points P = (p_1, ..., p_n) ∈ ([Δ]^2)^n, find centers C = (c_1, ..., c_k) ∈ ([Δ]^2)^k that minimize

    d(P, C) := Σ_{i ∈ [n]} min_{j ∈ [k]} ‖p_i − c_j‖.

Here, ‖·‖ could denote the 1-norm or the 2-norm.

Offline complexity of the k-median problem. Let C* be the minimizer, and let OPT(P, k) = d(P, C*). The bad news is that it is NP-hard to compute C* exactly. This means we should aim for a (1 + ε)-approximation algorithm, i.e. we should try to construct C such that d(P, C) ≤ (1 + ε) · OPT(P, k). The good news is that in 1999, Kolliopoulos and Rao [KR99] provided a (1 + ε)-approximation algorithm for the k-median problem in R^2 with runtime O((1/ε)^{O(1/ε)} · n log n log k). The runtime of the [KR99] algorithm depends badly on ε, but for moderate values (say ε = 1/2) the runtime is reasonable. There's nothing particularly sublinear about the [KR99] algorithm, so we won't go through it in class. But we will use it as a black box.

2.1 Comparison to k-means

The better-known k-means problem is very similar to the k-median problem; the difference is that we minimize

    Σ_{i ∈ [n]} min_{j ∈ [k]} ‖p_i − c_j‖_2^2.
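To make the two objectives concrete, here is a small Python sketch (ours, not from the lecture; the function names are invented) that evaluates the k-median cost d(P, C) and the corresponding k-means cost of a candidate set of centers, using the 2-norm.

```python
import math

def kmedian_cost(points, centers):
    """d(P, C): sum over all points of the 2-norm distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def kmeans_cost(points, centers):
    """Same, but with squared 2-norm distances (the k-means objective)."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

# Tiny example: three grid points, two candidate centers.
P = [(0, 0), (4, 0), (10, 10)]
C = [(2, 0), (10, 10)]
print(kmedian_cost(P, C))  # 2 + 2 + 0 = 4.0
print(kmeans_cost(P, C))   # 4 + 4 + 0 = 8.0
```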

To see why the problems are called k-median and k-means, observe that we can break each problem into two subproblems:

1. Choose a partition P = P^(1) ∪ ... ∪ P^(k).
2. Choose k centers C = (c_1, ..., c_k).

The goal is to minimize Σ_{j ∈ [k]} Σ_{i ∈ P^(j)} ‖p_i − c_j‖ in the k-median problem, or Σ_{j ∈ [k]} Σ_{i ∈ P^(j)} ‖p_i − c_j‖_2^2 in the k-means problem.

Let's suppose that we've already chosen a partition. Then for each part, we need to choose a center. One can show that for the k-means problem, the best choice of c_j is the mean of the points in P^(j). For the k-median problem, if ‖·‖ is the 1-norm, then the best choice of c_j is the coordinate-wise median of the points in P^(j). This is easiest to see in one dimension: given p_1, ..., p_n ∈ R, the c minimizing Σ_i (p_i − c)^2 is the mean of the p_i's, and the c minimizing Σ_i |p_i − c| is the median of the p_i's.

3 Coresets

If S = (s_1, ..., s_m) ∈ ([Δ]^2)^m and W = (w_1, ..., w_m) ∈ N^m, we will think of (S, W) as a weighted list of points, i.e. w_j is the weight of s_j.

Definition 1 (Coreset). A (k, ε)-coreset for P = (p_1, ..., p_n) is a weighted list of points (S, W) such that there exists π : [n] → [m] with w_i = |π^{−1}(i)| for every i ∈ [m], and

    d(P, S) ≤ ε · OPT(P, k).

Observe that without loss of generality, π(i) gives the index of the coreset point closest to p_i, in which case d(P, S) = Σ_{i ∈ [n]} ‖p_i − s_{π(i)}‖. Observe also that for ε < 1, we certainly will need m > k, by the definition of OPT(P, k).

3.1 Reducing the k-median problem to coreset construction

Given a (k, ε)-coreset (S, W) for P, we can construct an approximate solution to the k-median problem for P via the following algorithm: choose the C̃ which minimizes

    d((S, W), C) := Σ_{i ∈ [m]} w_i · min_{j ∈ [k]} ‖s_i − c_j‖.
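To illustrate the reduction, here is a toy sketch (ours, with invented names) of the weighted objective d((S, W), C). The brute-force minimizer restricts candidate centers to coreset points only to keep the example self-contained; in the lecture this step would instead be the (1 + ε)-approximation algorithm of [KR99] run on the weighted coreset.

```python
import math
from itertools import combinations

def weighted_kmedian_cost(S, W, centers):
    """d((S, W), C) = sum_i w_i * min_j ||s_i - c_j|| (2-norm)."""
    return sum(w * min(math.dist(s, c) for c in centers) for s, w in zip(S, W))

def toy_centers_from_coreset(S, W, k):
    """Toy stand-in for solving k-median on the coreset: try every k-subset
    of the coreset points as centers and keep the cheapest one."""
    return min(combinations(S, k), key=lambda C: weighted_kmedian_cost(S, W, C))

# Example: a 4-point weighted coreset, k = 2.
S = [(0, 0), (1, 0), (9, 9), (10, 9)]
W = [3, 2, 4, 1]
C_tilde = toy_centers_from_coreset(S, W, 2)
print(C_tilde, weighted_kmedian_cost(S, W, C_tilde))  # ((0, 0), (9, 9)) 3.0
```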

Claim 2. C̃ is a (1 + 2ε)-approximate solution to the k-median problem for P.

Proof. For any C = (c_1, ..., c_k),

    d(P, C) = Σ_i min_{1 ≤ j ≤ k} ‖p_i − c_j‖
            = Σ_i min_{1 ≤ j ≤ k} (‖s_{π(i)} − c_j‖ ± ‖p_i − s_{π(i)}‖)    (by the triangle inequality)
            = d((S, W), C) ± d(P, S)
            = d((S, W), C) ± ε · OPT(P, k).

By applying this fact first with C = C̃ and then with C = C*, we find that

    d(P, C̃) ≤ d((S, W), C̃) + ε · OPT(P, k)
            ≤ d((S, W), C*) + ε · OPT(P, k)    (by the definition of C̃)
            ≤ d(P, C*) + 2ε · OPT(P, k)
            = (1 + 2ε) · OPT(P, k).

Of course, this reduction is not efficient yet, because finding C̃ is just another instance of the k-median problem. But a similar analysis shows that if C̃ merely (1 + ε)-approximately minimizes d((S, W), C), it will still be a (1 + O(ε))-approximate solution to the k-median problem for P. And as discussed, such a C̃ can be constructed reasonably efficiently using the algorithm from [KR99].

4 Constructing coresets

4.1 The offline setting

Let's suppose all the data is available to us.

(1, ε)-coreset in one dimension. As a warm-up, how can we construct a (1, ε)-coreset in one dimension? In other words, we have numbers p_1, ..., p_n ∈ R with median c*, and we would like to find a small number of points s_1, ..., s_m and a map π : [n] → [m] so that if we move each p_i to s_{π(i)}, the total distance traveled by the points is small compared to the total distance they would travel if we moved them all to c*.

Construction: S = {c*} ∪ {c* ± (1 + ε)^i : i ∈ {0, 1, ..., log_{1+ε} Δ}}.

Proof of correctness: Without loss of generality, consider a point p_i lying between c* + (1 + ε)^t and c* + (1 + ε)^{t+1}. The distance moved by p_i is at most (1 + ε)^{t+1} − (1 + ε)^t = ε · (1 + ε)^t, which is at most ε · |p_i − c*|. It follows that d(P, (S, W)) ≤ ε · OPT(P, 1). The number of points is m = O((1/ε) · log Δ).
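A minimal Python sketch of this one-dimensional warm-up, assuming the input lies in {1, ..., Δ}. The names and the snap-to-nearest step are ours; snapping each point to its nearest coreset point moves it no farther than the ring argument above allows.

```python
import math
import statistics
from collections import Counter

def one_dim_coreset(points, eps, Delta):
    """Build S = {c*} ∪ {c* ± (1+eps)^i : 0 <= i <= log_{1+eps} Delta}, where c*
    is the median, snap each point to its nearest element of S, and return the
    weighted list (S, W) of coreset points actually used."""
    c = statistics.median(points)
    levels = int(math.log(Delta, 1 + eps)) + 1
    S = sorted({c} | {c + sign * (1 + eps) ** i
                      for sign in (-1, 1) for i in range(levels)})
    counts = Counter(min(S, key=lambda s: abs(s - p)) for p in points)
    used = sorted(counts)
    return used, [counts[s] for s in used]

# Example on a grid of size Delta = 1024 with eps = 1/2.
P = [1, 2, 3, 50, 200, 900, 1000]
S, W = one_dim_coreset(P, eps=0.5, Delta=1024)
print(list(zip(S, W)))
```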

(k, ε)-coreset in one dimension. We can generalize the construction to a (k, ε)-coreset easily enough: let C* = (c*_1, ..., c*_k) be the optimal solution to the k-median problem for P, repeat the k = 1 construction for each c*_j, and overlay the coresets. The total number of points is m = O((k/ε) · log Δ).

(1, ε)-coreset in two dimensions. We can generalize the construction to a (1, ε)-coreset for a two-dimensional dataset as follows. Let c* be the solution to the 1-median problem for P. Consider concentric disks D_1, D_2, ..., D_{log Δ} centered at c*, where disk D_j has radius 2^j. Cover each D_j with disks of radius 2^{j−1} · ε. Our coreset consists of the centers of these covering disks. To prove correctness, observe that if j is the smallest value such that p_i ∈ D_j, then ‖p_i − c*‖ ≥ 2^{j−1}, and hence the distance traveled by p_i is at most 2^{j−1} · ε ≤ ε · ‖p_i − c*‖. How many disks does it take to cover each D_j? Area considerations make it clear that we will need at least Ω(1/ε^2) disks, and it turns out that O(1/ε^2) disks suffice. (This can be shown by considering a simple lattice of disks.) Therefore, the number of points in our coreset is m = O((1/ε^2) · log Δ).

Two dimensions, arbitrary k. Once again, generalizing to arbitrary k is straightforward (just repeat the construction for each center and overlay the coresets). The number of points is O((k/ε^2) · log Δ).

Computational efficiency. Unfortunately, our constructions have all relied on knowing the optimal k-median centers. What happens if we only have a (1 + ε)-approximation, e.g. from [KR99]? Given any centers C, we have in fact shown how to efficiently construct a weighted list (S, W) of m ≤ O((k/ε^2) · log Δ) points such that d(P, S) ≤ ε · d(P, C). So as long as C is a 2-approximation to the k-median of P, (S, W) will be a (k, 2ε)-coreset for P.

4.2 The streaming setting: merge and reduce

We now present a streaming algorithm for constructing a (k, ε)-coreset. The general approach is to construct coresets of small batches of the data as they come through the stream, and then aggregate those coresets into a coreset for the entire dataset. To be more precise, we imagine a binary forest of coresets. The leaves of the forest are batches of a few input points. Each internal node is the coreset of its children. Whenever two trees of the same height appear, a new root node is added to merge them into one tree. The algorithm only has to store a single coreset at each level of this forest. This paradigm is called the merge-and-reduce paradigm.

Here is an explicit description of the algorithm. The algorithm maintains a list of coresets (S_j, W_j), all initially undefined.

1. Read the input stream P = (p_1, ..., p_n) a few points at a time. For each batch of a few points:
   (a) Condense the points into a (k, ε/log n)-coreset (S, W).
   (b) Initialize j = 1. While (S_j, W_j) is defined:
        i. Set (S, W) to be a (k, ε/log n)-coreset of (S, W) ∪ (S_j, W_j).
       ii. Delete (S_j, W_j) so it is no longer defined.
      iii. Increment j.
   (c) Define (S_j, W_j) := (S, W).

2. Output a (k, ε/log n)-coreset of the union of all currently defined (S_j, W_j).
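The bookkeeping in the merge-and-reduce paradigm can be written quite compactly. Below is a sketch (ours; the names are invented) in which the coreset construction itself is passed in as a black box make_coreset, standing in for the offline construction of Section 4.1 with accuracy ε/log n. Weighted lists are represented as lists of (point, weight) pairs.

```python
def merge_and_reduce(stream, batch_size, make_coreset):
    """Stream the points through, keeping one coreset per level of the binary
    forest.  `make_coreset` maps a weighted point list to a smaller weighted
    point list (a (k, eps/log n)-coreset of its input)."""
    levels = []  # levels[j] is the coreset at height j, or None if undefined

    def add(coreset):
        j = 0
        while j < len(levels) and levels[j] is not None:
            # Merge with the coreset already stored at height j, then reduce.
            coreset = make_coreset(coreset + levels[j])
            levels[j] = None
            j += 1
        if j == len(levels):
            levels.append(None)
        levels[j] = coreset

    batch = []
    for p in stream:
        batch.append((p, 1))
        if len(batch) == batch_size:
            add(make_coreset(batch))
            batch = []
    if batch:
        add(make_coreset(batch))

    # Output: a coreset of the union of all currently defined level coresets.
    return make_coreset([pw for c in levels if c is not None for pw in c])

# Toy run: this "coreset" just keeps at most 4 pairs, which is of course not a
# real coreset -- it only exercises the merge-and-reduce plumbing.
toy = lambda pairs: pairs[:4]
print(merge_and_reduce(range(20), batch_size=2, make_coreset=toy))
```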

Efficiency. For simplicity, we can take the batch size to be, say, 1 point. Then the space used by the algorithm is only O((k/ε^2) · log Δ · log^3 n), since each coreset requires O(k · log Δ / (ε/log n)^2) = O((k/ε^2) · log Δ · log^2 n) space and there are O(log n) levels. (In practice, it would be beneficial to use a larger batch size; after all, we might as well take (k/ε^2) · log Δ · log^3 n points at a time.)

Correctness. We need to bound the error accumulation:

Lemma 3. Assume ε ≤ 1/2. Suppose (S_1, W_1) is a (k, ε)-coreset for P_1 and (S_2, W_2) is a (k, ε)-coreset for P_2. Suppose (S̃, W̃) is a (k, ε̃)-coreset for (S_1, W_1) ∪ (S_2, W_2). Then (S̃, W̃) is a (k, 2ε̃ + ε)-coreset for P_1 ∪ P_2.

Proof. We need to bound the distance traveled when we move the points of P_1 ∪ P_2 to (S̃, W̃). By the triangle inequality, this is at most the distance traveled when we move them to (S_1, W_1) ∪ (S_2, W_2), plus the distance traveled when we move from (S_1, W_1) ∪ (S_2, W_2) to (S̃, W̃):

    d(P_1 ∪ P_2, S̃) ≤ ε · OPT(P_1 ∪ P_2, k) + d((S_1, W_1) ∪ (S_2, W_2), S̃)
                    ≤ ε · OPT(P_1 ∪ P_2, k) + ε̃ · OPT((S_1, W_1) ∪ (S_2, W_2), k).

Recall that earlier, when reducing the k-median problem to coreset construction, we showed that OPT((S_1, W_1) ∪ (S_2, W_2), k) ≤ (1 + 2ε) · OPT(P_1 ∪ P_2, k). Therefore,

    d(P_1 ∪ P_2, S̃) ≤ (ε + ε̃ · (1 + 2ε)) · OPT(P_1 ∪ P_2, k) ≤ (ε + 2ε̃) · OPT(P_1 ∪ P_2, k).

Based on this lemma, one can show by induction that the coresets maintained by the algorithm are all (k, ε)-coresets, completing the proof of correctness.

References

[KR99] S. G. Kolliopoulos and S. Rao. A nearly linear-time approximation scheme for the Euclidean k-median problem. In European Symposium on Algorithms (ESA), LNCS 1643, pages 378–389, 1999.