Multi-join Query Evaluation on Big Data Lecture 2
1 Multi-join Query Evaluation on Big Data, Lecture 2. Dan Suciu, March 2015.
2 Multi-join Query Evaluation: Outline. Part 1: Optimal Sequential Algorithms (Thursday 14:15-15:45). Part 2: Lower Bounds for Parallel Algorithms (Friday 14:15-15:45). Part 3: Optimal Parallel Algorithms (Saturday 9:00-10:30). Part 4: Data Skew (Saturday 11:00-12:30).
3 Brief Review of the AGM Bound. Q(x) = R_1(x_1), ..., R_l(x_l). Equal cardinalities |R_1| = ... = |R_l| = m: |Q| ≤ m^ρ, where ρ = fractional edge covering number of Q. General case |R_1| = m_1, ..., |R_l| = m_l: |Q| ≤ min_u m_1^{u_1} · · · m_l^{u_l}, where the minimum is over fractional edge covers u of Q.
4 Outline for Lecture 2. Background: Parallel Databases. The MPC Model. Algorithm for Triangles. Lower Bound on the Communication Cost. Lower Bound for General Queries. Summary.
5 Parallel Query Processing: Overview. Parallel Databases: queries and updates. GAMMA machine ('80s), today in all commercial DBMSs. Typically 10s of nodes. FO is in AC^0; SQL is embarrassingly parallel. Parallel Data Analytics: queries only (this course). MapReduce, Hadoop, PigLatin, Dremel, Scope, Spark. Typically 100s or 1000s of nodes. Data reshuffling / communication is the new bottleneck. Distributed State: updates (transactions); will not discuss. Replicate objects (3-5 times); central problem: consistency. E.g. Google, Amazon, Yahoo, Microsoft, Ebay. Eventual consistency (NoSQL, Dynamo, BigTable, PNUTS). Strong consistency: Paxos, Spanner.
8 Key Concepts. Input data of size m is partitioned on p servers, connected by a network. Query processing involves local computation and communication. Performance parameters [Gray & DeWitt '92]: Speedup: how does performance change when p increases? Ideal: linear speedup. Scaleup: how does performance change when p and m increase at the same rate? Ideal: constant scaleup. [Figure: two plots of speed, one against the number of processors p, one against p and the data size increasing together.]
9 Data Partition. Given a database of size m, partition it on p servers. Balanced partition: each server holds ≈ m/p data. Skewed partition: some server holds ≫ m/p data. Usually the input data is already partitioned, but we need to re-partition it for a particular problem: data reshuffling.
10 Example 1: Hash-Partitioned Join. Compute Q(x, y, z) = R(x, y) ⋈ S(y, z), where |R| = m_1, |S| = m_2. Input data: the input relations R, S are partitioned on the servers. Data reshuffling: given a hash function h : Domain → [p], send every tuple R(x, y) to server h(y) and every tuple S(y, z) to server h(y). Computation: in parallel, each server i computes the join R_i(x, y) ⋈ S_i(y, z) of its local fragments R_i, S_i. If y is a key, then the load per server is m_1/p + m_2/p (why?): linear speedup. Otherwise, we may have skew (why?). What is the worst speedup?
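The hash-partitioned join above can be sketched in a few lines. This is a minimal single-process simulation, not a distributed implementation: Python's built-in hash() stands in for h : Domain → [p], and the function name hash_partitioned_join is ours, not from the lecture.

```python
# Simulation of the hash-partitioned join for Q(x,y,z) = R(x,y) ⋈ S(y,z).
from collections import defaultdict

def hash_partitioned_join(R, S, p):
    # Reshuffle: send R(x, y) and S(y, z) to server h(y).
    servers_R = defaultdict(list)
    servers_S = defaultdict(list)
    for (x, y) in R:
        servers_R[hash(y) % p].append((x, y))
    for (y, z) in S:
        servers_S[hash(y) % p].append((y, z))
    # Local computation: each server i joins its fragments R_i and S_i.
    out = []
    for i in range(p):
        by_y = defaultdict(list)
        for (x, y) in servers_R[i]:
            by_y[y].append(x)
        for (y, z) in servers_S[i]:
            for x in by_y[y]:
                out.append((x, y, z))
    return out

R = [(1, 10), (2, 10), (3, 20)]
S = [(10, 100), (20, 200), (30, 300)]
print(sorted(hash_partitioned_join(R, S, p=4)))
# -> [(1, 10, 100), (2, 10, 100), (3, 20, 200)]
```

Since tuples that join share the same y, they are guaranteed to meet on server h(y), which is why each server can join purely locally.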
11 Understanding Skew. Given: R(x, y) with m items, p servers. Send each tuple R(x, y) to server h(y), where h is a random function. Claim 1: for each server u, its expected load is E[L_u] = m/p. Why? Claim 2: we say that R is skewed if some value y occurs more than m/p times in R; then some server has L_u > m/p. Claim 3: if R is not skewed, then max_u L_u is O(m/p) with high probability. (Details in Lecture 4.) Take-away: we assume no skew; then Max-load = O(Expected load).
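The claims above are easy to observe experimentally. The sketch below (our own illustration, with random.Random playing the role of the random function h) hashes m tuples on their y-value to p servers and compares the maximum load to the expected load m/p, once for skew-free data and once for data where a single y-value occurs m/2 times.

```python
# Max load under a random hash function, with and without skew.
import random

def max_load(ys, p, seed=0):
    rng = random.Random(seed)
    h = {}  # a random function Domain -> [p], sampled lazily
    loads = [0] * p
    for y in ys:
        if y not in h:
            h[y] = rng.randrange(p)
        loads[h[y]] += 1
    return max(loads)

m, p = 100_000, 100
# Skew-free: every y occurs 20 times, far below the m/p = 1000 threshold.
uniform = [i % 5000 for i in range(m)]
# Skewed: the value y = 0 occurs m/2 times, far above m/p.
skewed = [0] * (m // 2) + uniform[: m // 2]

print(max_load(uniform, p), max_load(skewed, p), m // p)
```

On the skew-free input the maximum load stays within a small constant factor of m/p (Claim 3); on the skewed input one server necessarily receives all m/2 copies of the heavy value (Claim 2).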
15 Example 2: Broadcast Join. Compute Q(x, y, z) = R(x, y) ⋈ S(y, z), where |R| = m_1, |S| = m_2. Input data: the input relations R, S are partitioned on the servers. Broadcast: send every tuple S(y, z) to every server. Computation: in parallel, each server u computes the join R_u(x, y) ⋈ S(y, z) of its local fragment R_u with the entire S. If m_2 ≤ m_1/p, then the broadcast join is very effective. Used a lot in practice.
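A sketch of the broadcast join, for contrast with the hash-partitioned join: here R stays in place on its servers and only the small relation S moves. The function name broadcast_join and the fragment representation are ours, for illustration.

```python
# Broadcast join: S is replicated to every server, R is never reshuffled.
from collections import defaultdict

def broadcast_join(R_fragments, S):
    # R_fragments[u] is the local fragment R_u already residing on server u.
    index = defaultdict(list)
    for (y, z) in S:            # every server receives a full copy of S
        index[y].append(z)
    out = []
    for R_u in R_fragments:     # in parallel: each server joins R_u with S
        for (x, y) in R_u:
            for z in index[y]:
                out.append((x, y, z))
    return out

frags = [[(1, 10)], [(2, 10), (3, 20)]]   # R pre-partitioned on 2 servers
print(sorted(broadcast_join(frags, [(10, 100), (20, 200)])))
# -> [(1, 10, 100), (2, 10, 100), (3, 20, 200)]
```

The communication cost per server is |S| = m_2, which is why the method pays off exactly when m_2 is below the m_1/p that a reshuffle of R would cost.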
16 Massively Parallel Communication Model (MPC) [Beame '13]. The MPC model is the following: p servers are connected by a network; servers are infinitely powerful. Input data of size m is initially uniformly partitioned. Computation = several rounds; one round = local computation plus global communication. The only cost is the communication. Definition (the load of an algorithm in the MPC model): L_u = maximum amount of data received by server u during any round; L = max_u L_u.
18 Massively Parallel Communication Model (MPC). [Figure: p servers, each initially holding O(m/p) of the input of size m; the algorithm runs in several rounds, each round = compute & communicate, with maximum communication load L per server per round.]
19 Load/Rounds Tradeoff in the MPC Model. Naive 1-round: send the entire data to server 1, compute locally; L = m. Naive p-rounds: at each round, send an m/p-fragment of the data to server 1, then compute locally; L = m/p. Ideal algorithms: 1 round, load L = m/p (but rarely possible). Real algorithms: O(1) rounds and L = O(m/p^{1-ε}), for 0 ≤ ε < 1.
23 Examples in the MPC Model. Example: join Q(x, y, z) = R(x, y), S(y, z). Round 1 (hash-partitioned join): send tuple R(x, y) to server h(y), send S(y, z) to server h(y), compute the join locally. Load: O(m/p) (assuming no skew). Example: triangles Q(x, y, z) = R(x, y), S(y, z), T(z, x). Round 1, hash-partitioned join: Aux(x, y, z) = R(x, y), S(y, z). Round 2, hash-partitioned join: Q(x, y, z) = Aux(x, y, z), T(z, x). Load: can be as high as m^2/p, because of the intermediate result! Can we compute triangles with a smaller load?
25 A Simple Lower Bound for the MPC Model. Let Q be a query with |R_1| = ... = |R_l| = m. Fact: any algorithm for Q with r rounds and load L satisfies r · L ≥ m/p^{1/ρ}. Proof: construct an instance where |Q| = m^ρ. One server receives at most r · L tuples from each R_j, hence finds at most (r · L)^ρ answers, by the AGM bound. All p servers together find at most p · (r · L)^ρ answers; since this must be at least m^ρ, it follows that r · L ≥ m/p^{1/ρ}. Question: for Q = R(x, y), S(y, z) the bound says r · L ≥ m/p^{1/2}, but we computed the join with load L = m/p in one round. Contradiction? Answer: the algorithm is for skew-free data, while the lower bound instance is skewed.
29 One-Round Algorithm for Triangles [Afrati & Ullman '10, Suri & Vassilvitskii '11]. Q(x, y, z) = R(x, y), S(y, z), T(z, x). Place the p servers in a cube: [p] ≅ [p^{1/3}] × [p^{1/3}] × [p^{1/3}]. Round 1: in parallel, each server does the following: send R(x, y) to all servers (h_1(x), h_2(y), ∗); send S(y, z) to all servers (∗, h_2(y), h_3(z)); send T(z, x) to all servers (h_1(x), ∗, h_3(z)); then compute Q locally. Theorem (Beame '14): (1) If h_1, h_2, h_3 are independent hash functions, then the expected load at each server u is E[L_u] = (m_1 + m_2 + m_3)/p^{2/3} =: L. [Why do we need independence?] (2) If the data has no skew, then the maximum load is O(L) w.h.p.
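The one-round algorithm above can be sketched as follows. This is a single-process simulation under assumptions of our own: p is a perfect cube p_side^3, and Python's hash(), salted differently per dimension, stands in for the three independent hash functions h_1, h_2, h_3.

```python
# One-round HyperCube-style triangle computation on a p_side^3 server cube.
from collections import defaultdict

def hypercube_triangles(R, S, T, p_side):
    h = lambda salt, v: hash((salt, v)) % p_side   # salted stand-in for h_1, h_2, h_3
    recv = defaultdict(lambda: ([], [], []))       # per-server fragments of R, S, T
    for (x, y) in R:          # R(x, y) -> all servers (h1(x), h2(y), *)
        for k in range(p_side):
            recv[(h(1, x), h(2, y), k)][0].append((x, y))
    for (y, z) in S:          # S(y, z) -> all servers (*, h2(y), h3(z))
        for i in range(p_side):
            recv[(i, h(2, y), h(3, z))][1].append((y, z))
    for (z, x) in T:          # T(z, x) -> all servers (h1(x), *, h3(z))
        for j in range(p_side):
            recv[(h(1, x), j, h(3, z))][2].append((z, x))
    out = set()
    for Ru, Su, Tu in recv.values():   # each server computes Q locally
        Sset, Tset = set(Su), set(Tu)
        for (x, y) in Ru:
            for (y2, z) in Sset:
                if y2 == y and (z, x) in Tset:
                    out.add((x, y, z))
    return sorted(out)

R = [(1, 2), (3, 4)]; S = [(2, 3), (4, 5)]; T = [(3, 1), (5, 3)]
print(hypercube_triangles(R, S, T, p_side=2))
# -> [(1, 2, 3), (3, 4, 5)]
```

Every triangle (x, y, z) is guaranteed to be found: its three tuples all reach the server with coordinates (h_1(x), h_2(y), h_3(z)), and each tuple is replicated to exactly p^{1/3} servers, which is where the m/p^{2/3} load comes from.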
32 Discussion. We will call the algorithm HyperCube, following [Beame '13]. Each tuple R(x, y) is replicated p^{1/3} times; hence load L = m/p^{2/3}. Partitioning R(x, y) → (h_1(x), h_2(y), ∗) is more tolerant to skew: each value x is sent to p^{1/3} buckets, and each y is sent to p^{1/3} buckets. Can tolerate degrees up to m/p^{1/3} (better than m/p). Notice that the algorithm only shuffles the data: each server still has to compute the query locally. Important to use a worst-case optimal algorithm locally. Non-linear speedup, because the load is m/p^{2/3}. Can we compute triangles with a load of m/p?
33 Lower Bound for Triangle Queries. Q(x, y, z) = R(x, y), S(y, z), T(z, x). Assume |R| + |S| + |T| = m. Denote L_lower = m/p^{2/3}. Theorem: any 1-round algorithm that computes Q must have load L ≥ L_lower, even on database instances without skew. Hence, HyperCube is optimal for computing triangles on permutations. Next, we prove the theorem.
34 Lower Bound for Triangle Queries. Q(x, y, z) = R(x, y), S(y, z), T(z, x). We assume R, S, T are permutations over a domain of size n. [Example, n = 4: permutation relations R(x, y), S(y, z), T(z, x), joined to produce Q(x, y, z).] Question: what is E[|Q|], over random permutations R, S, T? Theorem (Beame '13): denote L_lower = 3n/p^{2/3}. Let A be an algorithm for Q with load L < L_lower. Then the expected number of triangles returned by A is E[|A|] ≤ (L/L_lower)^{3/2} · E[|Q|].
37 Discussion: Inputs, Messages, Bits. Initially, the relations R, S, T are on disjoint servers; this is w.l.o.g. Stronger: assume that R is on server 1, S on server 2, T on server 3. Any server u receives three messages: msg_1 about R, msg_2 about S, msg_3 about T, with |msg_1| + |msg_2| + |msg_3| ≤ L bits. Messages may encode arbitrary information, e.g. bit 1 says "R is even", bit 2 says "R has a 17-cycle", etc. No lower bound is possible for a fixed input R, S, T: an algorithm may simply check for that input and encode it using L = O(1) bits. Instead, the lower bound is a statement about all permutations.
38 Notation: Bits vs. Tuples. Size of the db in #tuples: m = |R| + |S| + |T|. Size of the db in #bits: M = |R| log n + |S| log n + |T| log n = m log n, where n = size of the domain. If R is a random permutation, then size(R) = log(n!) ≈ n log n bits. L_lower = m/p^{2/3} tuples becomes L_lower = 3 log(n!)/p^{2/3} bits. We will prove the following, where L, L_lower are expressed in bits: E[|A|] ≤ (L/L_lower)^{3/2} · E[|Q|].
39 Proof Part 1: E[|Q|] on Random Inputs. Q(x, y, z) = R(x, y), S(y, z), T(z, x). Lemma: E[|Q|] = 1, where the expectation is over random permutations R, S, T. Proof. Note: for all i, j, k ∈ [n], P((i, j) ∈ R) = P((j, k) ∈ S) = P((k, i) ∈ T) = 1/n. Then E[|Q|] = Σ_{i,j,k} P((i, j) ∈ R ∧ (j, k) ∈ S ∧ (k, i) ∈ T) = Σ_{i,j,k} P((i, j) ∈ R) · P((j, k) ∈ S) · P((k, i) ∈ T) (independence) = n^3 · (1/n) · (1/n) · (1/n) = 1.
42 Discussion. In expectation, there is one triangle: E[|Q|] = 1. We will prove: if algorithm A has load L < L_lower, then E[|A|] ≤ (L/L_lower)^{3/2}.
43 Proof Part 2: What We Learn from a Message. Fix a server u ∈ [p], and a message msg_1 it received about R. Definition: the set of known tuples is K_1(msg_1) = {(i, j) : for every R, msg_1(R) = msg_1 implies (i, j) ∈ R}. Similarly K_2(msg_2), K_3(msg_3) are the known tuples in S and T. Observation: upon receiving msg_1, msg_2, msg_3, server u can output the triangle (i, j, k) iff (i, j) ∈ K_1(msg_1) ∧ (j, k) ∈ K_2(msg_2) ∧ (k, i) ∈ K_3(msg_3).
46 Discussion. The useful information we learn from a message is the set of known tuples. We show next: if L_1 = size(msg_1) is small, then K_1(msg_1) is small. If you know only a few bits, then you know only a few tuples. Later: if K_1, K_2, K_3 are small, the server knows only a few triangles (by AGM).
47 Proof Part 3: With Few Bits You Know Only Few Tuples. Denote H(R) = log(n!), the entropy of R. Proposition: let f_1 < 1. If msg_1(R) has ≤ f_1 · H(R) bits, then E[|K_1(msg_1(R))|] ≤ f_1 · n, where the expectation is over random permutations R. Proof. Denote k = |K_1(m_1)|. Then H(R | m_1) ≤ log[(n−k)!] ≤ (1 − k/n) · H(R), because log[(n−k)!]/(n−k) ≤ log(n!)/n. Now:
H(R) = H(R, msg_1(R)) (R determines msg_1(R))
= H(msg_1(R)) + Σ_{m_1} H(R | m_1) · P(m_1) (chain rule)
≤ f_1 · H(R) + Σ_{m_1} (1 − |K_1(m_1)|/n) · H(R) · P(m_1)
= f_1 · H(R) + [1 − (Σ_{m_1} |K_1(m_1)| · P(m_1))/n] · H(R)
= f_1 · H(R) + [1 − E[|K_1(m_1)|]/n] · H(R).
It follows: E[|K_1(m_1)|] ≤ f_1 · n.
54 Proof Part 4: Few Known Tuples Form Few Triangles. Continue to fix one server u; its load is L = L_1 + L_2 + L_3. Denote a_ij = P((i, j) ∈ K_1(msg_1(R))); similarly b_jk, c_ki for S, T. Denote by A_u the set of triangles returned by u:
E[|A_u|] = Σ_{i,j,k} a_ij · b_jk · c_ki ≤ ((Σ_ij a_ij^2) · (Σ_jk b_jk^2) · (Σ_ki c_ki^2))^{1/2} (Friedgut).
Now a_ij ≤ 1/n (why?) and Σ_ij a_ij = E[|K_1(msg_1(R))|] ≤ (L_1/log(n!)) · n (why?). It follows: Σ_ij a_ij^2 ≤ (1/n) · Σ_ij a_ij ≤ L_1/log(n!). Hence
E[|A_u|] ≤ (L_1/log(n!) · L_2/log(n!) · L_3/log(n!))^{1/2} ≤ ((L_1 + L_2 + L_3)/(3 log(n!)))^{3/2} = (L/M)^{3/2} triangles.
All p servers return E[|A|] ≤ p · (L/M)^{3/2} = (L/(M/p^{2/3}))^{3/2} = (L/L_lower)^{3/2} triangles. QED
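The key analytic step above is Friedgut's inequality, Σ_{i,j,k} a_ij b_jk c_ki ≤ ((Σ a^2)(Σ b^2)(Σ c^2))^{1/2}. A quick numeric sanity check on random nonnegative matrices (our own illustration; the function name friedgut_gap is ours):

```python
# Check Friedgut's inequality on random nonnegative n x n matrices.
import random

def friedgut_gap(n, seed):
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    c = [[rng.random() for _ in range(n)] for _ in range(n)]
    lhs = sum(a[i][j] * b[j][k] * c[k][i]
              for i in range(n) for j in range(n) for k in range(n))
    sq = lambda M: sum(v * v for row in M for v in row)
    rhs = (sq(a) * sq(b) * sq(c)) ** 0.5   # product of Frobenius norms
    return lhs, rhs

checks = [friedgut_gap(8, s) for s in range(5)]
print([lhs <= rhs for lhs, rhs in checks])
# -> [True, True, True, True, True]
```

In the proof, a, b, c are the known-tuple probabilities of one server, so the inequality caps the expected number of triangles it can output by its three message sizes.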
61 Discussion of the Lower Bound for the Triangle Query. An algorithm A with load L < L_lower = m/p^{2/3} has E[|A|] ≤ (L/L_lower)^{3/2} · E[|Q|]. The result is for skew-free data. There exists at least one instance on which A fails (trivial). Even if A is randomized, there exists at least one instance where A fails with high probability (Yao's lemma). Parallelism gets harder as p increases: an algorithm with load L = O(m/p^{1−ε}), where ε < 1/3, reports only O((1/p^{1/3−ε})^{3/2}) triangles in expectation. Fewer, as p increases!
63 Generalization to Full Conjunctive Queries. We discuss the equal-cardinality case today, and the general case in Lecture 3.
64 The Fractional Vertex Cover / Edge Packing. Hypergraph of Q = R_1(x_1), ..., R_l(x_l): nodes x_1, ..., x_k; edges R_1, ..., R_l. Definition: a fractional vertex cover of Q is a sequence v_1 ≥ 0, ..., v_k ≥ 0 such that for every edge R_j: Σ_{i : x_i ∈ R_j} v_i ≥ 1. A fractional edge packing of Q is a sequence u_1 ≥ 0, ..., u_l ≥ 0 such that for every node x_i: Σ_{j : x_i ∈ R_j} u_j ≤ 1. By LP duality: min_v Σ_i v_i = max_u Σ_j u_j = τ.
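The definitions are easy to check mechanically. The sketch below (our own illustration) encodes the triangle query's hypergraph and verifies that v = (1/2, 1/2, 1/2) is a fractional vertex cover and u = (1/2, 1/2, 1/2) a fractional edge packing; since both have total weight 3/2, duality pins down τ = 3/2 for the triangle query.

```python
# Feasibility checks for the triangle query Q = R(x,y), S(y,z), T(z,x).
edges = {"R": {"x", "y"}, "S": {"y", "z"}, "T": {"z", "x"}}
nodes = {"x", "y", "z"}

def is_vertex_cover(v):   # every edge must gather weight >= 1
    return all(sum(v[i] for i in e) >= 1 - 1e-9 for e in edges.values())

def is_edge_packing(u):   # every node must gather weight <= 1
    return all(sum(u[j] for j, e in edges.items() if i in e) <= 1 + 1e-9
               for i in nodes)

v = {"x": 0.5, "y": 0.5, "z": 0.5}
u = {"R": 0.5, "S": 0.5, "T": 0.5}
print(is_vertex_cover(v), is_edge_packing(u), sum(v.values()), sum(u.values()))
# -> True True 1.5 1.5
```

A matching pair of feasible solutions with equal weight is exactly a certificate of optimality for both LPs.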
66 The HyperCube Algorithm for General Queries Q(x) = R 1 (x 1 ),..., R l (x l ), R 1 =... = R l = m p servers. Dan Suciu Multi-Joins Lecture 2 March, / 34
67 The HyperCube Algorithm for General Queries Q(x) = R 1 (x 1 ),..., R l (x l ), R 1 =... = R l = m p servers. def v = (v 1,..., v k ) any fractional vertex cover; v 0 = i v i. Organize the p servers in a hypercube: [p] [p v 1 v0 ] [p Choose k independent hash functions h 1,..., h k v k v0 ]. Dan Suciu Multi-Joins Lecture 2 March, / 34
68 The HyperCube Algorithm for General Queries Q(x) = R 1 (x 1 ),..., R l (x l ), R 1 =... = R l = m p servers. def v = (v 1,..., v k ) any fractional vertex cover; v 0 = i v i. Organize the p servers in a hypercube: [p] [p v 1 v k v0 ] [p v0 ]. Choose k independent hash functions h 1,..., h k Round 1 Each server sends each tuple R j (x j1, x j2,...) to all servers whose coordinates j 1, j 2,... are h j1 (x j1 ), h j2 (x j2 ),... and broadcasts along the missing dimensions. Then, each server computes Q on its local data. Dan Suciu Multi-Joins Lecture 2 March, / 34
69 The HyperCube Algorithm for General Queries
Q(x̄) = R_1(x̄_1), ..., R_l(x̄_l), |R_1| = ... = |R_l| = m, p servers.
Let v = (v_1, ..., v_k) be any fractional vertex cover and define v_0 = ∑_i v_i.
Organize the p servers into a hypercube [p^{v_1/v_0}] × ... × [p^{v_k/v_0}] and choose k independent hash functions h_1, ..., h_k.
Round 1: each server sends each tuple R_j(x_{j_1}, x_{j_2}, ...) to all servers whose coordinates j_1, j_2, ... equal h_{j_1}(x_{j_1}), h_{j_2}(x_{j_2}), ..., broadcasting along the missing dimensions. Then each server computes Q on its local data.
Theorem. The load of the HyperCube algorithm is L = l · m / p^{1/v_0} = O(m / p^{1/v_0}).
Proof. Fix a server. E[# tuples of R_j at that server] = m / p^{(∑_{i : x_i ∈ R_j} v_i)/v_0} ≤ m / p^{1/v_0}, since ∑_{i : x_i ∈ R_j} v_i ≥ 1.
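The algorithm can be simulated end to end on one machine. The sketch below instantiates it for the triangle query with the vertex cover v = (1/2, 1/2, 1/2), so each variable gets share p^{1/3}; the toy hash functions, relation sizes, and helper names are illustrative assumptions, not part of the lecture. The union of the per-server local joins equals the sequential join:

```python
import itertools
import random

s = 3                                 # share per dimension; p = s**3 servers
h = {var: (lambda val, k=k: (val * 1000003 + k) % s)
     for k, var in enumerate("xyz")}  # one independent toy hash per variable

# Variables of each atom of Q(x,y,z) = R(x,y), S(y,z), T(z,x),
# in the atom's attribute order.
atom_vars = {"R": "xy", "S": "yz", "T": "zx"}

def destinations(rel, tup):
    """All server coordinates a tuple is sent to; the tuple is broadcast
    along the dimensions of the variables it does not mention."""
    binding = dict(zip(atom_vars[rel], tup))
    axes = [[h[var](binding[var])] if var in binding else range(s)
            for var in "xyz"]
    return itertools.product(*axes)

random.seed(0)
R = {(random.randrange(8), random.randrange(8)) for _ in range(25)}
S = {(random.randrange(8), random.randrange(8)) for _ in range(25)}
T = {(random.randrange(8), random.randrange(8)) for _ in range(25)}

# Round 1: route every tuple to its servers.
servers = {c: {"R": set(), "S": set(), "T": set()}
           for c in itertools.product(range(s), repeat=3)}
for rel, tuples in (("R", R), ("S", S), ("T", T)):
    for t in tuples:
        for c in destinations(rel, t):
            servers[c][rel].add(t)

# Each server computes Q on its local data; the union is the answer.
out = set()
for local in servers.values():
    for (x, y) in local["R"]:
        for (y2, z) in local["S"]:
            if y2 == y and (z, x) in local["T"]:
                out.add((x, y, z))

expected = {(x, y, z) for (x, y) in R for (y2, z) in S
            if y2 == y and (z, x) in T}
assert out == expected  # same answers as a sequential join
```

Correctness rests on the routing invariant: any output (x, y, z) has all three of its tuples delivered to the server (h_x(x), h_y(y), h_z(z)), so no answer is lost and no extra answer can be produced locally.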
70 Lower Bound for General Queries
Definition. R ⊆ [n]^r is called a matching of arity r if |R| = n and every column of R is a key.
Example: a matching R(x, y, z) of arity 3 with n = 4 has four tuples, and each of the columns x, y, z holds four distinct values.
Theorem. Suppose all arities are ≥ 2. For any edge packing u, denote L_lower = m / p^{1/∑_j u_j}. Then any algorithm A with load L < L_lower returns only E[|A|] ≤ (L / L_lower)^{∑_j u_j} · E[|Q|] answers, where the expectation is over random matchings.
Proof in the section today. (Q: what about arities = 1?)
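The random inputs in this theorem are easy to generate: since a matching has n tuples over domain [n] and every column is a key, each column is simply a permutation of the domain. A minimal sketch (the function name is mine, not from the lecture) that builds one from independent random permutations and checks the key property:

```python
import random

def random_matching(n, r, rng=random):
    """Return a relation of n tuples in [n]^r in which every column is a
    key: each column is an independent uniform permutation of 0..n-1."""
    cols = []
    for _ in range(r):
        perm = list(range(n))
        rng.shuffle(perm)
        cols.append(perm)
    return [tuple(col[i] for col in cols) for i in range(n)]

R = random_matching(n=4, r=3)   # the slide's example: arity 3, n = 4
assert len(R) == 4
# Every column is a key: the n values in each column are pairwise distinct.
assert all(len({t[j] for t in R}) == 4 for j in range(3))
```

Sampling each column as a uniform permutation is exactly what makes these inputs hard: each relation looks uniform to any fixed hash-based data placement.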
71 Lower Bound for General Queries
Corollary. The HyperCube algorithm is optimal, because min_v ∑_i v_i = max_u ∑_j u_j = τ.
72 Summary of Lecture 2
Parallel query engines today compute one join at a time; the issues are communication rounds, intermediate results, and skew.
The HyperCube algorithm: one round. Restrictions: equal cardinalities (generalized in Lecture 3) and skew-free databases (skew is open, but will be discussed in Lecture 4). HyperCube is more resilient to skew than a join.
A surprising fact: for parallel algorithms the lower bound is given by the fractional edge packing, while for sequential algorithms the lower bound is given by the fractional edge cover.
Multiple rounds: all lower bounds [Beame 13] use a weaker model. Open problem: show that an algorithm with load O(n/p) cannot compute Q = R_1(x_0, x_1), R_2(x_1, x_2), R_3(x_2, x_3), R_4(x_3, x_4), R_5(x_4, x_5) in two rounds, over random permutations.
Query Processing with Optimal Communication Cost
Query Processing with Optimal Communication Cost Magdalena Balazinska and Dan Suciu University of Washington AITF 2017 1 Context Past: NSF Big Data grant PhD student Paris Koutris received the ACM SIGMOD
More informationA Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries
A Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries Bas Ketsman Dan Suciu Abstract We study the optimal communication cost for computing a full conjunctive query
More informationCommunication Steps for Parallel Query Processing
Communication Steps for Parallel Query Processing Paul Beame, Paraschos Koutris and Dan Suciu University of Washington, Seattle, WA {beame,pkoutris,suciu}@cs.washington.edu ABSTRACT We consider the problem
More informationParallel-Correctness and Transferability for Conjunctive Queries
Parallel-Correctness and Transferability for Conjunctive Queries Tom J. Ameloot1 Gaetano Geck2 Bas Ketsman1 Frank Neven1 Thomas Schwentick2 1 Hasselt University 2 Dortmund University Big Data Too large
More informationSingle-Round Multi-Join Evaluation. Bas Ketsman
Single-Round Multi-Join Evaluation Bas Ketsman Outline 1. Introduction 2. Parallel-Correctness 3. Transferability 4. Special Cases 2 Motivation Single-round Multi-joins Less rounds / barriers Formal framework
More informationEarly stopping: the idea. TRB for benign failures. Early Stopping: The Protocol. Termination
TRB for benign failures Early stopping: the idea Sender in round : :! send m to all Process p in round! k, # k # f+!! :! if delivered m in round k- and p " sender then 2:!! send m to all 3:!! halt 4:!
More information7 RC Simulates RA. Lemma: For every RA expression E(A 1... A k ) there exists a DRC formula F with F V (F ) = {A 1,..., A k } and
7 RC Simulates RA. We now show that DRC (and hence TRC) is at least as expressive as RA. That is, given an RA expression E that mentions at most C, there is an equivalent DRC expression E that mentions
More informationc Copyright 2015 Paraschos Koutris
c Copyright 2015 Paraschos Koutris Query Processing for Massively Parallel Systems Paraschos Koutris A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
More informationLecture 26 December 1, 2015
CS 229r: Algorithms for Big Data Fall 2015 Prof. Jelani Nelson Lecture 26 December 1, 2015 Scribe: Zezhou Liu 1 Overview Final project presentations on Thursday. Jelani has sent out emails about the logistics:
More informationLecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.
CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the
More informationFactorized Relational Databases Olteanu and Závodný, University of Oxford
November 8, 2013 Database Seminar, U Washington Factorized Relational Databases http://www.cs.ox.ac.uk/projects/fd/ Olteanu and Závodný, University of Oxford Factorized Representations of Relations Cust
More informationHow to Optimally Allocate Resources for Coded Distributed Computing?
1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern
More informationComplexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler
Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard
More informationOutline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.
Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität
More informationNotes on MapReduce Algorithms
Notes on MapReduce Algorithms Barna Saha 1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce We are given a graph G = (V, E) on V = N vertices and E = m N 1+c edges for some constant c > 0. Our
More informationCOMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform
COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model
More informationDatabase Theory VU , SS Complexity of Query Evaluation. Reinhard Pichler
Database Theory Database Theory VU 181.140, SS 2018 5. Complexity of Query Evaluation Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 17 April, 2018 Pichler
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationFault-Tolerant Consensus
Fault-Tolerant Consensus CS556 - Panagiota Fatourou 1 Assumptions Consensus Denote by f the maximum number of processes that may fail. We call the system f-resilient Description of the Problem Each process
More informationLimits to Approximability: When Algorithms Won't Help You. Note: Contents of today s lecture won t be on the exam
Limits to Approximability: When Algorithms Won't Help You Note: Contents of today s lecture won t be on the exam Outline Limits to Approximability: basic results Detour: Provers, verifiers, and NP Graph
More informationComputational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg*
Computational Game Theory Spring Semester, 2005/6 Lecture 5: 2-Player Zero Sum Games Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* 1 5.1 2-Player Zero Sum Games In this lecture
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing Query Processing Decomposition Localization Optimization CS 347 Notes 3 2 Decomposition Same as in centralized system
More informationvariance of independent variables: sum of variances So chebyshev predicts won t stray beyond stdev.
Announcements No class monday. Metric embedding seminar. Review expectation notion of high probability. Markov. Today: Book 4.1, 3.3, 4.2 Chebyshev. Remind variance, standard deviation. σ 2 = E[(X µ X
More informationCausality and Time. The Happens-Before Relation
Causality and Time The Happens-Before Relation Because executions are sequences of events, they induce a total order on all the events It is possible that two events by different processors do not influence
More informationLower Bound Techniques for Multiparty Communication Complexity
Lower Bound Techniques for Multiparty Communication Complexity Qin Zhang Indiana University Bloomington Based on works with Jeff Phillips, Elad Verbin and David Woodruff 1-1 The multiparty number-in-hand
More informationLecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane
Randomized Algorithms Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized
More informationComplexity theory for fellow CS students
This document contains some basics of the complexity theory. It is mostly based on the lecture course delivered at CS dept. by Meran G. Furugyan. Differences are subtle. Disclaimer of warranties apply:
More informationInformation-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation
Information-Theoretic Lower Bounds on the Storage Cost of Shared Memory Emulation Viveck R. Cadambe EE Department, Pennsylvania State University, University Park, PA, USA viveck@engr.psu.edu Nancy Lynch
More informationCS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms
CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms Tim Roughgarden March 3, 2016 1 Preamble In CS109 and CS161, you learned some tricks of
More informationStream Computation and Arthur- Merlin Communication
Stream Computation and Arthur- Merlin Communication Justin Thaler, Yahoo! Labs Joint Work with: Amit Chakrabarti, Dartmouth Graham Cormode, University of Warwick Andrew McGregor, Umass Amherst Suresh Venkatasubramanian,
More informationSecond main application of Chernoff: analysis of load balancing. Already saw balls in bins example. oblivious algorithms only consider self packet.
Routing Second main application of Chernoff: analysis of load balancing. Already saw balls in bins example synchronous message passing bidirectional links, one message per step queues on links permutation
More informationRandomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010
Randomized Algorithms Lecture 4 Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010 1 Pairwise independent hash functions In the previous lecture we encountered two families
More informationA Lower Bound for Boolean Satisfiability on Turing Machines
A Lower Bound for Boolean Satisfiability on Turing Machines arxiv:1406.5970v1 [cs.cc] 23 Jun 2014 Samuel C. Hsieh Computer Science Department, Ball State University March 16, 2018 Abstract We establish
More informationProbabilistic Proofs of Existence of Rare Events. Noga Alon
Probabilistic Proofs of Existence of Rare Events Noga Alon Department of Mathematics Sackler Faculty of Exact Sciences Tel Aviv University Ramat-Aviv, Tel Aviv 69978 ISRAEL 1. The Local Lemma In a typical
More informationStochastic Submodular Cover with Limited Adaptivity
Stochastic Submodular Cover with Limited Adaptivity Arpit Agarwal Sepehr Assadi Sanjeev Khanna Abstract In the submodular cover problem, we are given a non-negative monotone submodular function f over
More informationarxiv: v7 [cs.db] 22 Dec 2015
It s all a matter of degree: Using degree information to optimize multiway joins Manas R. Joglekar 1 and Christopher M. Ré 2 1 Department of Computer Science, Stanford University 450 Serra Mall, Stanford,
More informationLecture 13: 04/23/2014
COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 13: 04/23/2014 Spring 2014 Scribe: Psallidas Fotios Administrative: Submit HW problem solutions by Wednesday,
More informationQueries and Materialized Views on Probabilistic Databases
Queries and Materialized Views on Probabilistic Databases Nilesh Dalvi Christopher Ré Dan Suciu September 11, 2008 Abstract We review in this paper some recent yet fundamental results on evaluating queries
More informationNP-Complete Reductions 2
x 1 x 1 x 2 x 2 x 3 x 3 x 4 x 4 12 22 32 CS 447 11 13 21 23 31 33 Algorithms NP-Complete Reductions 2 Prof. Gregory Provan Department of Computer Science University College Cork 1 Lecture Outline NP-Complete
More informationDatabase Theory VU , SS Ehrenfeucht-Fraïssé Games. Reinhard Pichler
Database Theory Database Theory VU 181.140, SS 2018 7. Ehrenfeucht-Fraïssé Games Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Pichler 15
More informationSection 6 Fault-Tolerant Consensus
Section 6 Fault-Tolerant Consensus CS586 - Panagiota Fatourou 1 Description of the Problem Consensus Each process starts with an individual input from a particular value set V. Processes may fail by crashing.
More informationThe Communication Complexity of Correlation. Prahladh Harsha Rahul Jain David McAllester Jaikumar Radhakrishnan
The Communication Complexity of Correlation Prahladh Harsha Rahul Jain David McAllester Jaikumar Radhakrishnan Transmitting Correlated Variables (X, Y) pair of correlated random variables Transmitting
More informationSymmetry and hardness of submodular maximization problems
Symmetry and hardness of submodular maximization problems Jan Vondrák 1 1 Department of Mathematics Princeton University Jan Vondrák (Princeton University) Symmetry and hardness 1 / 25 Submodularity Definition
More informationFinding similar items
Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the
More information(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ).
CMPSCI611: Verifying Polynomial Identities Lecture 13 Here is a problem that has a polynomial-time randomized solution, but so far no poly-time deterministic solution. Let F be any field and let Q(x 1,...,
More informationarxiv: v1 [cs.ai] 21 Sep 2014
Oblivious Bounds on the Probability of Boolean Functions WOLFGANG GATTERBAUER, Carnegie Mellon University DAN SUCIU, University of Washington arxiv:409.6052v [cs.ai] 2 Sep 204 This paper develops upper
More informationChapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.
Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should
More informationMa/CS 117c Handout # 5 P vs. NP
Ma/CS 117c Handout # 5 P vs. NP We consider the possible relationships among the classes P, NP, and co-np. First we consider properties of the class of NP-complete problems, as opposed to those which are
More informationCommunication Complexity
Communication Complexity Jaikumar Radhakrishnan School of Technology and Computer Science Tata Institute of Fundamental Research Mumbai 31 May 2012 J 31 May 2012 1 / 13 Plan 1 Examples, the model, the
More informationLecture 4 Chiu Yuen Koo Nikolai Yakovenko. 1 Summary. 2 Hybrid Encryption. CMSC 858K Advanced Topics in Cryptography February 5, 2004
CMSC 858K Advanced Topics in Cryptography February 5, 2004 Lecturer: Jonathan Katz Lecture 4 Scribe(s): Chiu Yuen Koo Nikolai Yakovenko Jeffrey Blank 1 Summary The focus of this lecture is efficient public-key
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 6: Computational Complexity of Learning Proper vs Improper Learning Learning Using FIND-CONS For any family of hypothesis
More informationOn Locating-Dominating Codes in Binary Hamming Spaces
Discrete Mathematics and Theoretical Computer Science 6, 2004, 265 282 On Locating-Dominating Codes in Binary Hamming Spaces Iiro Honkala and Tero Laihonen and Sanna Ranto Department of Mathematics and
More information6.842 Randomness and Computation Lecture 5
6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its
More informationLecture 4: Proof of Shannon s theorem and an explicit code
CSE 533: Error-Correcting Codes (Autumn 006 Lecture 4: Proof of Shannon s theorem and an explicit code October 11, 006 Lecturer: Venkatesan Guruswami Scribe: Atri Rudra 1 Overview Last lecture we stated
More informationLecture 21: P vs BPP 2
Advanced Complexity Theory Spring 206 Prof. Dana Moshkovitz Lecture 2: P vs BPP 2 Overview In the previous lecture, we began our discussion of pseudorandomness. We presented the Blum- Micali definition
More informationCS 347 Distributed Databases and Transaction Processing Notes03: Query Processing
CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina Zoltan Gyongyi CS 347 Notes 03 1 Query Processing! Decomposition! Localization! Optimization CS 347
More information10.4 The Kruskal Katona theorem
104 The Krusal Katona theorem 141 Example 1013 (Maximum weight traveling salesman problem We are given a complete directed graph with non-negative weights on edges, and we must find a maximum weight Hamiltonian
More informationCSE 20 DISCRETE MATH WINTER
CSE 20 DISCRETE MATH WINTER 2016 http://cseweb.ucsd.edu/classes/wi16/cse20-ab/ Today's learning goals Define and differentiate between important sets Use correct notation when describing sets: {...}, intervals
More informationFrom Secure MPC to Efficient Zero-Knowledge
From Secure MPC to Efficient Zero-Knowledge David Wu March, 2017 The Complexity Class NP NP the class of problems that are efficiently verifiable a language L is in NP if there exists a polynomial-time
More informationLecture 38. CSE 331 Dec 4, 2017
Lecture 38 CSE 331 Dec 4, 2017 Quiz 2 1:00-1:10pm Lecture starts at 1:15pm Final Reminder Review sessions/extra OH Shortest Path Problem Input: (Directed) Graph G=(V,E) and for every edge e has a cost
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More informationFlow Algorithms for Two Pipelined Filtering Problems
Flow Algorithms for Two Pipelined Filtering Problems Anne Condon, University of British Columbia Amol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn Ning Wu, Polytechnic
More informationLecture 12: Interactive Proofs
princeton university cos 522: computational complexity Lecture 12: Interactive Proofs Lecturer: Sanjeev Arora Scribe:Carl Kingsford Recall the certificate definition of NP. We can think of this characterization
More informationApplication: Bucket Sort
5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from
More information6.852: Distributed Algorithms Fall, Class 10
6.852: Distributed Algorithms Fall, 2009 Class 10 Today s plan Simulating synchronous algorithms in asynchronous networks Synchronizers Lower bound for global synchronization Reading: Chapter 16 Next:
More informationReal-Time Course. Clock synchronization. June Peter van der TU/e Computer Science, System Architecture and Networking
Real-Time Course Clock synchronization 1 Clocks Processor p has monotonically increasing clock function C p (t) Clock has drift rate For t1 and t2, with t2 > t1 (1-ρ)(t2-t1)
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationPrivacy-Preserving Data Mining
CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,
More informationPolynomial Identity Testing and Circuit Lower Bounds
Polynomial Identity Testing and Circuit Lower Bounds Robert Špalek, CWI based on papers by Nisan & Wigderson, 1994 Kabanets & Impagliazzo, 2003 1 Randomised algorithms For some problems (polynomial identity
More informationLecture 5 - Information theory
Lecture 5 - Information theory Jan Bouda FI MU May 18, 2012 Jan Bouda (FI MU) Lecture 5 - Information theory May 18, 2012 1 / 42 Part I Uncertainty and entropy Jan Bouda (FI MU) Lecture 5 - Information
More informationMethods that transform the original network into a tighter and tighter representations
hapter pproximation of inference: rc, path and i-consistecy Methods that transform the original network into a tighter and tighter representations all 00 IS 75 - onstraint Networks rc-consistency X, Y,
More informationCS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding
CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.
More informationDistributed Oblivious RAM for Secure Two-Party Computation
Seminar in Distributed Computing Distributed Oblivious RAM for Secure Two-Party Computation Steve Lu & Rafail Ostrovsky Philipp Gamper Philipp Gamper 2017-04-25 1 Yao s millionaires problem Two millionaires
More informationPCP Theorem and Hardness of Approximation
PCP Theorem and Hardness of Approximation An Introduction Lee Carraher and Ryan McGovern Department of Computer Science University of Cincinnati October 27, 2003 Introduction Assuming NP P, there are many
More informationLecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing
Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of
More informationLecture 2: January 18
CS271 Randomness & Computation Spring 2018 Instructor: Alistair Sinclair Lecture 2: January 18 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They
More informationCS632 Notes on Relational Query Languages I
CS632 Notes on Relational Query Languages I A. Demers 6 Feb 2003 1 Introduction Here we define relations, and introduce our notational conventions, which are taken almost directly from [AD93]. We begin
More informationTheory of Computer Science to Msc Students, Spring Lecture 2
Theory of Computer Science to Msc Students, Spring 2007 Lecture 2 Lecturer: Dorit Aharonov Scribe: Bar Shalem and Amitai Gilad Revised: Shahar Dobzinski, March 2007 1 BPP and NP The theory of computer
More informationNotes on Complexity Theory Last updated: November, Lecture 10
Notes on Complexity Theory Last updated: November, 2015 Lecture 10 Notes by Jonathan Katz, lightly edited by Dov Gordon. 1 Randomized Time Complexity 1.1 How Large is BPP? We know that P ZPP = RP corp
More informationINSTITUTE OF MATHEMATICS THE CZECH ACADEMY OF SCIENCES. A trade-off between length and width in resolution. Neil Thapen
INSTITUTE OF MATHEMATICS THE CZECH ACADEMY OF SCIENCES A trade-off between length and width in resolution Neil Thapen Preprint No. 59-2015 PRAHA 2015 A trade-off between length and width in resolution
More informationAlgorithms for Collective Communication. Design and Analysis of Parallel Algorithms
Algorithms for Collective Communication Design and Analysis of Parallel Algorithms Source A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Chapter 4, 2003. Outline One-to-all
More informationThe Complexity of a Reliable Distributed System
The Complexity of a Reliable Distributed System Rachid Guerraoui EPFL Alexandre Maurer EPFL Abstract Studying the complexity of distributed algorithms typically boils down to evaluating how the number
More informationTheoretical Cryptography, Lecture 10
Theoretical Cryptography, Lecture 0 Instructor: Manuel Blum Scribe: Ryan Williams Feb 20, 2006 Introduction Today we will look at: The String Equality problem, revisited What does a random permutation
More informationMath 262A Lecture Notes - Nechiporuk s Theorem
Math 6A Lecture Notes - Nechiporuk s Theore Lecturer: Sa Buss Scribe: Stefan Schneider October, 013 Nechiporuk [1] gives a ethod to derive lower bounds on forula size over the full binary basis B The lower
More informationCS505: Distributed Systems
Department of Computer Science CS505: Distributed Systems Lecture 10: Consensus Outline Consensus impossibility result Consensus with S Consensus with Ω Consensus Most famous problem in distributed computing
More informationLecture 16: Communication Complexity
CSE 531: Computational Complexity I Winter 2016 Lecture 16: Communication Complexity Mar 2, 2016 Lecturer: Paul Beame Scribe: Paul Beame 1 Communication Complexity In (two-party) communication complexity
More informationParallel Pipelined Filter Ordering with Precedence Constraints
Parallel Pipelined Filter Ordering with Precedence Constraints Amol Deshpande #, Lisa Hellerstein 2 # University of Maryland, College Park, MD USA amol@cs.umd.edu Polytechnic Institute of NYU, Brooklyn,
More informationAlgorithms and Theory of Computation. Lecture 11: Network Flow
Algorithms and Theory of Computation Lecture 11: Network Flow Xiaohui Bei MAS 714 September 18, 2018 Nanyang Technological University MAS 714 September 18, 2018 1 / 26 Flow Network A flow network is a
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationNew Attacks on the Concatenation and XOR Hash Combiners
New Attacks on the Concatenation and XOR Hash Combiners Itai Dinur Department of Computer Science, Ben-Gurion University, Israel Abstract. We study the security of the concatenation combiner H 1(M) H 2(M)
More informationValency Arguments CHAPTER7
CHAPTER7 Valency Arguments In a valency argument, configurations are classified as either univalent or multivalent. Starting from a univalent configuration, all terminating executions (from some class)
More informationV.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74
V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:
More informationLecture 4: DES and block ciphers
Lecture 4: DES and block ciphers Johan Håstad, transcribed by Ernir Erlingsson 2006-01-25 1 DES DES is a 64 bit block cipher with a 56 bit key. It selects a 64 bit block and modifies it depending on the
More informationModel Counting for Logical Theories
Model Counting for Logical Theories Wednesday Dmitry Chistikov Rayna Dimitrova Department of Computer Science University of Oxford, UK Max Planck Institute for Software Systems (MPI-SWS) Kaiserslautern
More informationNP-COMPLETE PROBLEMS. 1. Characterizing NP. Proof
T-79.5103 / Autumn 2006 NP-complete problems 1 NP-COMPLETE PROBLEMS Characterizing NP Variants of satisfiability Graph-theoretic problems Coloring problems Sets and numbers Pseudopolynomial algorithms
More informationCSE 20 DISCRETE MATH SPRING
CSE 20 DISCRETE MATH SPRING 2016 http://cseweb.ucsd.edu/classes/sp16/cse20-ac/ Today's learning goals Describe computer representation of sets with bitstrings Define and compute the cardinality of finite
More informationNotes for Lecture 27
U.C. Berkeley CS276: Cryptography Handout N27 Luca Trevisan April 30, 2009 Notes for Lecture 27 Scribed by Madhur Tulsiani, posted May 16, 2009 Summary In this lecture we begin the construction and analysis
More informationOn the Complexity of Mapping Pipelined Filtering Services on Heterogeneous Platforms
On the Complexity of Mapping Pipelined Filtering Services on Heterogeneous Platforms Anne Benoit, Fanny Dufossé and Yves Robert LIP, École Normale Supérieure de Lyon, France {Anne.Benoit Fanny.Dufosse
More informationFriendly Logics, Fall 2015, Lecture Notes 5
Friendly Logics, Fall 2015, Lecture Notes 5 Val Tannen 1 FO definability In these lecture notes we restrict attention to relational vocabularies i.e., vocabularies consisting only of relation symbols (or
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra
More information