Notes on MapReduce Algorithms

Barna Saha

1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce

We are given a graph $G = (V, E)$ on $|V| = N$ vertices with $|E| = m = N^{1+c}$ edges for some constant $c > 0$. Our goal is to compute a minimum spanning tree (MST) of the graph.

Algorithm 1 MST computation in MapReduce

    Set $k = N^{c/2}$.
    Partition $V$ uniformly at random into $k$ parts of equal size $V_1, V_2, \ldots, V_k$, with $V_i \cap V_j = \emptyset$ for $i \neq j$ and $|V_i| = N/k$ for all $i$.
    Let $E_{i,j} \subseteq E$ be the set of edges induced by the vertex set $V_i \cup V_j$, that is, $E_{i,j} = \{(u, v) \in E : u, v \in V_i \cup V_j\}$.
    Distribute each $G_{i,j} = (V_i \cup V_j, E_{i,j})$ to its own server and compute its minimum spanning tree $M_{i,j}$.
    Distribute $H = \bigcup_{i,j} M_{i,j}$ to a single server and compute the final MST $M$ of $H$.
    return $M$

Theorem 1. The tree $M$ computed by Algorithm 1 is a minimum spanning tree of $G$.

Proof. The algorithm works by sparsifying the input graph. We show that $H$ contains all the relevant edges that may appear in the MST of $G$. Consider an edge $e = (u, v)$ that was discarded, that is, $e \in E(G)$ but $e \notin E(H)$. Observe that any edge $e = (u, v)$ is present in at least one subgraph $G_{i,j}$. If $e \notin M_{i,j}$, then by the cycle property of minimum spanning trees there must be some cycle $C \subseteq E_{i,j}$ on which $e$ is the heaviest edge. However, since $E_{i,j} \subseteq E$, we have exhibited a cycle in $G$ on which $e$ is also the heaviest edge. Therefore $e$ cannot be in the MST of $G$, and it can be safely discarded.

We now look at the memory usage at each machine.

Lemma 1. Let $k = N^{c/2}$. Then, with high probability, the size of every $E_{i,j}$ is $\tilde{O}(N^{1+c/2})$.

Proof. We can bound the total number of edges in $E_{i,j}$ by bounding the total degree of the vertices in $V_i$ and $V_j$:
$$|E_{i,j}| \le \sum_{v \in V_i} \deg(v) + \sum_{v \in V_j} \deg(v).$$
For the purpose of this proof, let us partition the vertices into groups by their degree: let $W_1$ be the set of vertices of degree at most 2, $W_2$ the set of vertices with degree 3 or 4, and in general $W_i$ the set of vertices with degree in $(2^{i-1}, 2^i]$. There are $\log N$ groups in total. Consider the number of vertices from group $W_i$ that are mapped (at random) to $V_j$. If the group is small, that is, $|W_i| < 24 N^{c/2} \log N$, then, since every degree is at most $N$,
$$\sum_{v \in W_i} \deg(v) \le N \cdot 24 N^{c/2} \log N = \tilde{O}(N^{1+c/2}).$$

If the group is large, define an indicator random variable $X_i^u$ which is 1 if $u \in W_i$ belongs to $V_j$. Then $|W_i \cap V_j| = \sum_{u \in W_i} X_i^u$ and
$$\mathbb{E}\big[|W_i \cap V_j|\big] = \frac{|W_i|}{k}.$$
By a simple application of the Chernoff bound, using $|W_i|/k \ge 24 \log N$,
$$\Pr\left(|W_i \cap V_j| > \frac{3}{2} \cdot \frac{|W_i|}{k}\right) \le \exp\left(-\frac{24 \log N}{3 \cdot 2^2}\right) = \frac{1}{N^2}.$$
Therefore, by a union bound over the groups,
$$\Pr\left(\exists i :\ |W_i \cap V_j| > \frac{3}{2} \cdot \frac{|W_i|}{k} \text{ and } |W_i| \ge 24 N^{c/2} \log N\right) \le \frac{\log N}{N^2}.$$
Hence, with probability at least $1 - \frac{\log N}{N^2}$, and using that degrees within a group differ by at most a factor of 2, the large groups contribute
$$\sum_i \sum_{v \in V_j \cap W_i} \deg(v) \le \sum_i \frac{3}{2} \cdot \frac{|W_i|}{k} \cdot 2^i \le \frac{3}{k} \sum_i \sum_{v \in W_i} \deg(v) = O\!\left(\frac{m}{k}\right) = O(N^{1+c/2}),$$
while the small groups contribute $\tilde{O}(N^{1+c/2})$ as shown above. Thus $\sum_{v \in V_j} \deg(v) = \tilde{O}(N^{1+c/2})$.

Lemma 1 tells us that, with high probability, each part has $\tilde{O}(N^{1+c/2})$ edges, so each reducer uses a sublinear amount of memory to compute $M_{i,j}$. There are $O(k^2) = O(N^c)$ parts in total, again sublinear in the input size. Each part outputs $M_{i,j}$, which contains at most $2N/k - 1$ edges. Hence the size of $H$ is at most $O(N^c \cdot N/k) = O(N^{1+c/2})$.
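To make Algorithm 1 concrete, here is a minimal Python sketch that simulates the computation on a single machine (all names are ours; a real MapReduce job would ship each $G_{i,j}$ to its own reducer). Kruskal's algorithm stands in for the local MST routine, and the final call plays the role of the single server that merges $H = \bigcup M_{i,j}$.

```python
import random
from itertools import combinations

def kruskal(edges):
    """Return a minimum spanning forest of a weighted edge list of (u, v, w) triples."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    forest = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                        # adding the edge creates no cycle
            parent[ru] = rv
            forest.append((u, v, w))
    return forest

def mapreduce_mst(vertices, edges, k):
    """Simulate Algorithm 1 with a random k-way vertex partition (k >= 2)."""
    vs = list(vertices)
    random.shuffle(vs)                                # random partition of V
    part = {v: i % k for i, v in enumerate(vs)}       # V_1, ..., V_k
    H = []                                            # union of the local MSTs
    for i, j in combinations(range(k), 2):            # one "reducer" per pair (i, j)
        E_ij = [(u, v, w) for (u, v, w) in edges
                if part[u] in (i, j) and part[v] in (i, j)]
        H.extend(kruskal(E_ij))                       # M_{i,j}
    return kruskal(H)                                 # final MST on a single machine
```

Running kruskal directly on the full edge list yields the same total tree weight as mapreduce_mst, which gives a quick sanity check of Theorem 1.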

2 An Alternate Algorithm for MST in MapReduce

Recall that we use $n$ to denote the total input size, that is, $n = N + m$. Fix some $\epsilon > 0$.

Algorithm 2 MST computation in MapReduce

    if $|E| \le N^{1+\epsilon}$ then
        Compute $T = \mathrm{MST}(E)$
        return $T$
    end if
    $\ell \leftarrow \Theta(|E| / N^{1+\epsilon})$
    Partition $E$ into $E_1, E_2, \ldots, E_\ell$ with $|E_i| = O(N^{1+\epsilon})$, using a universal hash function $h : E \to \{1, 2, \ldots, \ell\}$.
    In parallel: compute $T_i$, the minimum spanning tree of $G_i = (V, E_i)$.
    Recursively return the minimum spanning tree of $G' = (V, E' = \bigcup_i T_i)$.

Theorem 2. Algorithm 2 terminates after $\lceil c/\epsilon \rceil$ iterations and correctly returns the minimum spanning tree of the original graph $G$.

Proof. For correctness, note that any edge that is not part of the MST of a subgraph of $G$ is also not part of the MST of $G$, by the cycle property of minimum spanning trees. It remains to show that the memory constraints are not violated and that the total number of rounds is bounded.

First, since the partition is done randomly, an application of the Chernoff bound shows that, with high probability, no machine gets more than $2N^{1+\epsilon}$ edges. Also note that, after the first round,
$$\Big|\bigcup_i T_i\Big| \le \ell (N - 1) = \frac{N^{1+c}}{N^{1+\epsilon}} (N - 1) \le N^{1+c-\epsilon}.$$
After the second round,
$$\Big|\bigcup_i T_i\Big| \le \ell (N - 1) = \frac{N^{1+c-\epsilon}}{N^{1+\epsilon}} (N - 1) \le N^{1+c-2\epsilon}.$$
Continuing in this way, after $\lceil c/\epsilon \rceil$ iterations the input is small enough to fit in the memory of a single machine, and the algorithm terminates in $\lceil c/\epsilon \rceil$ rounds.
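The recursion of Algorithm 2 is equally short to sketch. The Python code below reuses the kruskal helper from the previous sketch; Python's built-in hash stands in for the universal hash function $h$, and the per-machine memory budget $N^{1+\epsilon}$ is made explicit as a parameter (the names are, again, ours).

```python
def alternate_mst(n_vertices, edges, eps=0.5):
    """Sketch of Algorithm 2: recursively thin the edge set (reuses kruskal above)."""
    budget = int(n_vertices ** (1 + eps))             # per-machine memory, N^{1+eps}
    if len(edges) <= budget:
        return kruskal(edges)                         # base case: one machine suffices
    ell = max(2, len(edges) // budget)                # number of machines, Theta(|E|/N^{1+eps})
    buckets = [[] for _ in range(ell)]
    for e in edges:
        buckets[hash(e) % ell].append(e)              # stand-in for the universal hash h
    # "In parallel": each machine keeps only its local spanning forest T_i.
    survivors = [e for bucket in buckets for e in kruskal(bucket)]
    return alternate_mst(n_vertices, survivors, eps)  # recurse on the union of the T_i
```

Each level of the recursion shrinks the edge set by roughly a factor of $N^{\epsilon}$, mirroring the round-by-round bound in the proof of Theorem 2.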

3 Densest Subgraph Computation

Definition. Let $G = (V, E)$ be an undirected graph. Given $S \subseteq V$, its density $\rho(S)$ is defined as
$$\rho(S) = \frac{|E(S)|}{|S|},$$
where $E(S)$ denotes the set of edges induced by $S$. The maximum density of the graph is then $\rho^*(G) = \max_{S \subseteq V} \rho(S)$. This definition extends naturally to weighted graphs.

A simple greedy algorithm. First, let us look at a very simple greedy algorithm. The algorithm repeatedly removes a minimum-degree vertex from the remaining graph and computes the density of the resulting subgraph. At the end, it returns the maximum density seen over all iterations. The algorithm is inherently sequential, as it removes one vertex at a time based on minimum degree.

This simple algorithm gives a $\frac{1}{2}$-approximation to the densest subgraph, as follows. Suppose the optimum subgraph is $S_{OPT}$ with density $\rho^*$. Every vertex in $S_{OPT}$ must have degree at least $\rho^*$ within $S_{OPT}$; otherwise, a simple calculation shows that removing a lower-degree vertex would improve the density, contradicting optimality. Consider the first iteration at which the greedy algorithm removes a vertex of $S_{OPT}$, and let $T \supseteq S_{OPT}$ be the set of vertices still under consideration at that time. The removed vertex has degree at least $\rho^*$ in $T$, and since it is a minimum-degree vertex, every vertex of $T$ has degree at least $\rho^*$. Thus the number of edges in $T$ is at least $\rho^* |T| / 2$, so the density of $T$ is at least $\rho^*/2$. Since we return the subgraph of maximum density, the returned subgraph has density at least $\frac{\rho^*}{2}$.

We now give a simple streaming algorithm, which can also be implemented in the MapReduce framework.

A simple streaming algorithm. Starting with the given graph $G$, the algorithm computes the current density $\rho(G)$ and removes all nodes (and their incident edges) whose degree is less than $(2 + 2\epsilon)\rho(G)$, for some fixed $\epsilon > 0$. If the resulting graph is non-empty, the algorithm recurses on the remaining graph, with node set denoted by $S$, again computing its density and removing all nodes whose degree is below the corresponding threshold; we denote these removed nodes by $A(S)$. The node set then shrinks to $S \setminus A(S)$, and the recursion continues in the same way.
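A sketch of the greedy algorithm in Python, assuming the graph is given as an adjacency-set dictionary (our own representation choice; a heap-based implementation would run in near-linear time, while this re-scan version is quadratic but shorter):

```python
def greedy_densest(adj):
    """1/2-approximation: repeatedly peel a minimum-degree vertex.

    adj: dict mapping each vertex to the set of its neighbors (simple undirected graph).
    Returns (best_density, best_vertex_set).
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # current number of edges
    best_density, best_set = 0.0, set(adj)
    while adj:
        density = m / len(adj)
        if density > best_density:
            best_density, best_set = density, set(adj)
        v = min(adj, key=lambda u: len(adj[u]))       # a minimum-degree vertex
        m -= len(adj[v])                              # its incident edges disappear
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best_density, best_set
```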

Lemma 2. The algorithm above obtains a $\frac{1}{2+2\epsilon}$-approximation to the densest subgraph problem.

Proof. Fix some optimal solution $S^*$. For each $i \in S^*$, $\deg_{S^*}(i) \ge \rho(S^*)$. Since $\sum_{i \in S} \deg_S(i) = 2|S|\rho(S)$, the average degree in $S$ is $2\rho(S) < (2+2\epsilon)\rho(S)$, so at least one node must be removed in every pass. Now, consider the first pass in which a node of the optimal solution $S^*$ is removed, that is, $A(S) \cap S^* \ne \emptyset$. This moment is guaranteed to exist, since $S$ eventually becomes empty. Clearly $S^* \subseteq S$. Let $i \in A(S) \cap S^*$. We have
$$\rho(S^*) \le \deg_{S^*}(i) \le \deg_S(i) < (2 + 2\epsilon)\rho(S).$$
This implies $\rho(S) > \frac{\rho(S^*)}{2+2\epsilon}$.

Lemma 3. The algorithm above terminates in $O(\log_{1+\epsilon} n)$ passes.

Proof. In each pass, every surviving node $i \in S \setminus A(S)$ has $\deg_S(i) \ge (2+2\epsilon)\rho(S)$, so
$$2|E(S)| = \sum_{i \in A(S)} \deg_S(i) + \sum_{i \in S \setminus A(S)} \deg_S(i) \ge 2(1+\epsilon)\big(|S| - |A(S)|\big)\rho(S) = 2(1+\epsilon)\big(|S| - |A(S)|\big)\frac{|E(S)|}{|S|}.$$
Thus
$$|A(S)| \ge \frac{\epsilon}{1+\epsilon}|S|, \qquad \text{or equivalently} \qquad |S \setminus A(S)| \le \frac{1}{1+\epsilon}|S|.$$
Therefore, the cardinality of the remaining set $S$ decreases by a factor of at least $1/(1+\epsilon)$ during each pass. Hence, the algorithm terminates in $O(\log_{1+\epsilon} n)$ passes.
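The bulk-removal passes of Lemmas 2 and 3 translate directly into code. The Python sketch below is, again, our own: each pass deletes every vertex whose degree falls below $(2+2\epsilon)\rho(S)$, and the best density seen across passes is returned.

```python
def peel_densest(adj, eps=0.1):
    """Bulk-peeling passes of the streaming algorithm: 1/(2+2*eps)-approximation."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # mutable local copy
    best_density, best_set = 0.0, set(adj)
    while adj:
        m = sum(len(nbrs) for nbrs in adj.values()) / 2
        if m == 0:
            break                                     # only isolated vertices remain
        density = m / len(adj)
        if density > best_density:
            best_density, best_set = density, set(adj)
        A = {v for v in adj if len(adj[v]) < (2 + 2 * eps) * density}  # A(S)
        for v in A:                                   # remove A(S) and incident edges
            for u in adj[v]:
                if u not in A:                        # nodes in A are deleted wholesale
                    adj[u].discard(v)
        for v in A:
            del adj[v]
    return best_density, best_set
```

By Lemma 2's averaging argument, $A$ is non-empty whenever edges remain, and by Lemma 3 the while loop runs $O(\log_{1+\epsilon} n)$ times.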

4 Estimating Density via Sampling

Let $\tilde{G}$ be the graph formed by sampling each edge of $G$ independently with probability
$$p = \frac{c}{\epsilon^2} \cdot \frac{N \log N}{M},$$
where $c$ is a large enough constant, $\frac{1}{2} > \epsilon > 0$ is the desired accuracy, $N = |V(G)|$ and $M = |E(G)|$. We may assume that $M$ is sufficiently large that $p < 1$; otherwise the entire graph fits on a single machine. Let $U$ be an arbitrary set of nodes. We use $\rho_G$ and $\rho_{\tilde{G}}$ to denote the density of a set of nodes in $G$ and $\tilde{G}$, respectively. We first relate the density of $U$ in $\tilde{G}$ to its density in $G$.

Lemma 4. For all $U \subseteq V(G)$,
$$\Pr\big(\rho_{\tilde{G}}(U) \ge p\,\mathrm{OPT}/10\big) \le N^{-8} \quad \text{if } \rho_G(U) \le \mathrm{OPT}/60,$$
$$\Pr\big(|\rho_{\tilde{G}}(U) - p\,\rho_G(U)| \ge \epsilon\,p\,\rho_G(U)\big) \le N^{-8} \quad \text{if } \rho_G(U) \ge \mathrm{OPT}/60.$$

Proof. Let $X$ denote the number of edges induced by $U$ in $\tilde{G}$. Then
$$\mathbb{E}[X] = \rho_G(U)\,|U|\,p.$$
Let $\mathrm{OPT}$ denote the optimum density; then $\mathrm{OPT} \ge \frac{M}{N}$, since the whole vertex set already achieves density $M/N$. Hence $p \ge \frac{c \log N}{\epsilon^2\,\mathrm{OPT}}$.

First, let us consider the case when $U$ does not have density close to $\mathrm{OPT}$, say $\rho_G(U) \le \mathrm{OPT}/60$. Since $\mathrm{OPT}/10 - \rho_G(U) \ge \mathrm{OPT}/10 - \mathrm{OPT}/60 = \mathrm{OPT}/12$, a Chernoff bound gives
$$\Pr\big(\rho_{\tilde{G}}(U) \ge p\,\mathrm{OPT}/10\big) = \Pr\big(X \ge p\,|U|\,\mathrm{OPT}/10\big) \le \Pr\left(X \ge p\,|U|\,\rho_G(U)\left(1 + \frac{\mathrm{OPT}}{12\,\rho_G(U)}\right)\right) \le \exp\left(-\frac{p\,|U|\,\mathrm{OPT}}{3 \cdot 12}\right) \le \exp\left(-\frac{c\,|U| \log N}{36\,\epsilon^2}\right) \le \frac{1}{N^{10|U|}},$$
for $c$ sufficiently large. On the other hand, if $\rho_G(U) \ge \mathrm{OPT}/60$, then
$$\Pr\big(|\rho_{\tilde{G}}(U) - p\,\rho_G(U)| \ge \epsilon\,p\,\rho_G(U)\big) = \Pr\big(|X - p\,|U|\,\rho_G(U)| \ge \epsilon\,p\,|U|\,\rho_G(U)\big) \le 2\exp\left(-\frac{\epsilon^2\,p\,|U|\,\rho_G(U)}{3}\right) \le 2\exp\left(-\frac{c\,|U| \log N}{180}\right) \le \frac{1}{N^{10|U|}},$$
where again we use that $c$ is a sufficiently large constant.

Now, there are at most $N^{|U|}$ subgraphs of size $|U|$. Hence, with probability at least $1 - \frac{1}{N^{9|U|}}$, the above concentration result holds for all subgraphs of size $|U|$. There are $N$ possible values of $|U|$. Thus, overall, again applying a union bound, the concentration result holds simultaneously for all subgraphs of $G$ with probability at least $1 - \frac{1}{N^8}$.

Theorem 3. Let $U^* = \arg\max_U \{\rho_{\tilde{G}}(U)\}$. Then, with high probability,
$$(1-\epsilon)^2\,\mathrm{OPT} \le \rho_G(U^*) \le \mathrm{OPT}.$$

Proof. The upper bound is immediate from the definition of $\mathrm{OPT}$. Consider the subgraph $S^*$ with optimum density $\mathrm{OPT}$ in $G$. By Lemma 4, its density in $\tilde{G}$ is at least $(1-\epsilon)\,p\,\mathrm{OPT}$, and hence $\rho_{\tilde{G}}(U^*) \ge (1-\epsilon)\,p\,\mathrm{OPT}$. Since $\epsilon < 1/2$, we have $(1-\epsilon) \ge 1/10$, so $\rho_{\tilde{G}}(U^*) > p\,\mathrm{OPT}/10$, and by Lemma 4 it must hold that $\rho_G(U^*) \ge \mathrm{OPT}/60$. But then, using $1/(1+\epsilon) \ge 1-\epsilon$,
$$\rho_G(U^*) \ge (1-\epsilon)\,\frac{1}{p}\,\rho_{\tilde{G}}(U^*) \ge (1-\epsilon)^2\,\mathrm{OPT}.$$
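Finally, a Python sketch of the sampling step. It assumes comparable vertex labels and at least one edge, and it reuses peel_densest from the previous sketch purely as an illustrative stand-in for whatever densest-subgraph routine is run on the sample; Theorem 3 is stated for the exact maximizer $U^*$, so the guarantee here is only heuristic.

```python
import math
import random

def sample_and_estimate(adj, eps=0.25, c=60.0):
    """Sample each edge with probability p = (c/eps^2) * N*log(N)/M, then rescale.

    Returns (estimated density, vertex set found on the sampled graph).
    """
    N = len(adj)
    M = sum(len(nbrs) for nbrs in adj.values()) // 2
    p = min(1.0, (c / eps ** 2) * N * math.log(N) / M)
    sampled = {v: set() for v in adj}
    for v in adj:
        for u in adj[v]:
            if v < u and random.random() < p:         # flip one coin per edge
                sampled[v].add(u)
                sampled[u].add(v)
    # Run a densest-subgraph routine on the (much smaller) sample; the peeling
    # heuristic from the previous section serves as a stand-in here.
    density, dense_set = peel_densest(sampled, eps)
    return density / p, dense_set                     # divide by p to undo sampling
```

The division by $p$ at the end mirrors the statement of Lemma 4: densities in $\tilde{G}$ concentrate around $p$ times their value in $G$.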