CS 347 Parallel and Distributed Data Processing

Similar documents
CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

CS 347 Parallel and Distributed Data Processing

P Q1 Q2 Q3 Q4 Q5 Tot (60) (20) (20) (20) (60) (20) (200) You are allotted a maximum of 4 hours to complete this exam.

CS 347 Parallel and Distributed Data Processing

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

CSE 562 Database Systems

Relational Algebra on Bags. Why Bags? Operations on Bags. Example: Bag Selection. σ A+B < 5 (R) = A B

CS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce

Introduction to Cryptography Lecture 13

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Logic and Databases. Phokion G. Kolaitis. UC Santa Cruz & IBM Research Almaden. Lecture 4 Part 1

communication complexity lower bounds yield data structure lower bounds

Database Design and Implementation

Quiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions

Cut-and-Choose Yao-Based Secure Computation in the Online/Offline and Batch Settings

Benny Pinkas Bar Ilan University

Factorized Relational Databases Olteanu and Závodný, University of Oxford

7 RC Simulates RA. Lemma: For every RA expression E(A 1... A k ) there exists a DRC formula F with F V (F ) = {A 1,..., A k } and

Relational operations

Flow Algorithms for Two Pipelined Filtering Problems

Databases 2011 The Relational Algebra

Multi-join Query Evaluation on Big Data Lecture 2

2.4 Graphing Inequalities

CS425: Algorithms for Web Scale Data

Question 1. The Chinese University of Hong Kong, Spring 2018

ANALYSIS OF PRIVACY-PRESERVING ELEMENT REDUCTION OF A MULTISET

Jeffrey D. Ullman Stanford University

Flow Algorithms for Parallel Query Optimization

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Environment (Parallelizing Query Optimization)

A Dichotomy. in in Probabilistic Databases. Joint work with Robert Fink. for Non-Repeating Queries with Negation Queries with Negation

Lecture 21: Algebraic Computation Models

Relational Algebra and Calculus

Foundations of Databases

Lecture 19: Interactive Proofs and the PCP Theorem

Relational Algebra 2. Week 5

Fast Cut-and-Choose Based Protocols for Malicious and Covert Adversaries

CS54100: Database Systems

Lecture 4 February 2nd, 2017

Non-Interactive Zero-Knowledge Proofs of Non-Membership

A Worst-Case Optimal Multi-Round Algorithm for Parallel Computation of Conjunctive Queries

2. Accelerated Computations

Schema Refinement & Normalization Theory: Functional Dependencies INFS-614 INFS614, GMU 1

CSE 4502/5717 Big Data Analytics Spring 2018; Homework 1 Solutions

DRAFT. Algebraic computation models. Chapter 14

The Query Containment Problem: Set Semantics vs. Bag Semantics. Phokion G. Kolaitis University of California Santa Cruz & IBM Research - Almaden

INTRODUCTION TO RELATIONAL DATABASE SYSTEMS

Turing s Thesis. Fall Costas Busch - RPI!1

α-acyclic Joins Jef Wijsen May 4, 2017

An Efficient and Secure Protocol for Privacy Preserving Set Intersection

Verifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University

Communication Complexity

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

High Performance Computing

Query Processing. 3 steps: Parsing & Translation Optimization Evaluation

Schedule. Today: Jan. 17 (TH) Jan. 24 (TH) Jan. 29 (T) Jan. 22 (T) Read Sections Assignment 2 due. Read Sections Assignment 3 due.

Lecture 14: Secure Multiparty Computation

Lecture 11: Non-Interactive Zero-Knowledge II. 1 Non-Interactive Zero-Knowledge in the Hidden-Bits Model for the Graph Hamiltonian problem

1 Approximate Quantiles and Summaries

Lecture 3: Interactive Proofs and Zero-Knowledge

Lecture 4: DES and block ciphers

Exam 1. March 12th, CS525 - Midterm Exam Solutions

DESIGN THEORY FOR RELATIONAL DATABASES. csc343, Introduction to Databases Renée J. Miller and Fatemeh Nargesian and Sina Meraji Winter 2018

Quantum Mechanics - I Prof. Dr. S. Lakshmi Bala Department of Physics Indian Institute of Technology, Madras. Lecture - 16 The Quantum Beam Splitter

FIRST-ORDER QUERY EVALUATION ON STRUCTURES OF BOUNDED DEGREE

Lecture 11: Data Flow Analysis III

NP-COMPLETE PROBLEMS. 1. Characterizing NP. Proof

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

6.830 Lecture 11. Recap 10/15/2018

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

Advanced Topics in Cryptography

Approximate Query Processing Using Wavelets

CS475: Linear Equations Gaussian Elimination LU Decomposition Wim Bohm Colorado State University

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Session 4: Efficient Zero Knowledge. Yehuda Lindell Bar-Ilan University

UVA UVA UVA UVA. Database Design. Relational Database Design. Functional Dependency. Loss of Information

Data-Intensive Distributed Computing

Lecture 10: Data Flow Analysis II

Logical Provenance in Data-Oriented Workflows (Long Version)

Optimal Reductions between Oblivious Transfers using Interactive Hashing

A Tight Lower Bound for Dynamic Membership in the External Memory Model

Instructor: Sudeepa Roy

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Algorithms for Data Science

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004

Query answering using views

Chosen IV Statistical Analysis for Key Recovery Attacks on Stream Ciphers

Movies Title Director Actor

CS632 Notes on Relational Query Languages I

Busch Complexity Lectures: Turing s Thesis. Costas Busch - LSU 1

GAV-sound with conjunctive queries

Differential Privacy

Lecture 16 Oct 21, 2014

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Branch Prediction based attacks using Hardware performance Counters IIT Kharagpur

Finally the Weakest Failure Detector for Non-Blocking Atomic Commit

Extending Oblivious Transfers Efficiently

Lifted Inference: Exact Search Based Algorithms

Transcription:

CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 3: Query Processing

Query Processing Decomposition Localization Optimization CS 347 Notes 3 2

Decomposition Same as in centralized system 1. Normalization 2. Eliminating redundancy 3. Algebraic rewriting CS 347 Notes 3 3

Normalization Convert from general language to a standard form E.g., relational algebra CS 347 Notes 3 4

Normalization Example: select R.A, R.C from R, S where R.A = S.A and ((R.B = 1 and S.D = 2) or (R.C > 3 and S.D = 2)) R.A,R.C σ (R.B=1 R.C>3) S.D=2 A conjunctive normal form R S CS 347 Notes 3 5

Normalization Also detect invalid expressions select * from R where R.D = 3 R does not have D attribute CS 347 Notes 3 6

Eliminating Redundancy E.g., in conditions (S.A = 1) (S.A > 5) false (S.A < 10) (S.A < 5) S.A < 5 CS 347 Notes 3 7

Eliminating Redundancy E.g., sharing common sub-expressions S σ cond σ cond T S σ cond T R R R CS 347 Notes 3 8

Algebraic Rewriting E.g., pushing conditions down σ cond3 σ cond R S σ cond1 σ cond2 R S CS 347 Notes 3 9

Query Processing Decomposition One or more algebraic query trees on relations Localization Replacing relations by corresponding fragments CS 347 Notes 3 10

Localization Steps (1) Start with query (2) Replace relations by fragments (3) Push up and, σ down (apply CS 245 rules) (4) Simplify ~ eliminate unnecessary operations CS 347 Notes 3 11

Localization Notation for fragment [R : cond] conditions its tuples satisfy fragment CS 347 Notes 3 12

Example A (1) σ E=3 R CS 347 Notes 3 13

Example A (2) σ E=3 [R 1 : E<10] [R 2 : E 10] CS 347 Notes 3 14

Example A (3) σ E=3 σ E=3 [R 1 : E<10] [R 2 : E 10] CS 347 Notes 3 15

Example A (3) σ E=3 σ E=3 [R 1 : E<10] [R 2 : E 10] Ø CS 347 Notes 3 16

Example A (4) σ E=3 [R 1 : E<10] CS 347 Notes 3 17

Localization Rule 1 σ c1 [R: c2] σ c1 [R: c1 c2] [R: false] Ø CS 347 Notes 3 18

Example A σ E=3 [R 2 : E 10] σ E=3 [R 2 : E=3 E 10] σ E=3 [R 2 : false] Ø CS 347 Notes 3 19

Example B (1) A A = common attribute R S CS 347 Notes 3 20

Example B (2) A [R 1 : A<5] [R 2 : 5 A 10] [R 3 : A>10] [S 1 : A<5] [S 2 : A 5] CS 347 Notes 3 21

Example B (3) A A A A [R 1 : A<5] [S 1 : A<5] [R 1 : A<5] [S 2 : A 5] [R 3 : A>10] [S 1 : A<5] [R 3 : A>10] [S 2 : A 5] A A [R 2 : 5 A 10] [S 1 : A<5] [R 2 : 5 A 10] [S 2 : A 5] CS 347 Notes 3 22

Example B (3) A A A A [R 1 : A<5] [S 1 : A<5] [R 1 : A<5] [S 2 : A 5] [R 3 : A>10] [S 1 : A<5] [R 3 : A>10] [S 2 : A 5] A A [R 2 : 5 A 10] [S 1 : A<5] [R 2 : 5 A 10] [S 2 : A 5] CS 347 Notes 3 23

Example B (4) A A A [R 1 : A<5] [S 1 : A<5] [R 2 : 5 A 10] [S 2 : A 5] [R 3 : A>10] [S 2 : A 5] CS 347 Notes 3 24

Localization Rule 2 [R: c 1 ] [S: c 2 ] [R S: c 1 c 2 R.A=S.A] A A [R: false] Ø CS 347 Notes 3 25

Example B [R 1 : A < 5] [S 2 : A 5] A [R 1 A S 2 : R 1.A < 5 S 2.A 5 R 1.A=S 2.A] [R 1 A S 2 : false] Ø CS 347 Notes 3 26

Example C (2) K [R 1 : A<10] [R 2 : A 10] [S 1 : K=R.K R.A<10] [S 2 : K=R.K R.A 10] derived fragmentation CS 347 Notes 3 27

Example C (3) K K K K [R 1 ] [S 1 ] [R 1 ] [S 2 ] [S 1 ] [R 2 ] [R 2 ] [S 2 ] CS 347 Notes 3 28

Example C (3) K K K K [R 1 ] [S 1 ] [R 1 ] [S 2 ] [S 1 ] [R 2 ] [R 2 ] [S 2 ] CS 347 Notes 3 29

Example C (4) K K [R 1 : A<10] [S 1 : K=R.K R.A<10] [R 2 : A 10] [S 2 : K=R.K R.A 10] CS 347 Notes 3 30

Example C [R 1 : A < 10] [S 2 : K = R.K R.A 10] K [R 1 K S 2 : R 1.A < 10 S 2.K = R.K R.A 10 R 1.K = S 2.K] [R 1 K S 2 : false] Ø CS 347 Notes 3 31

Example D (1) A R R1 (K, A, B) R2 (K, C, D) vertical fragmentation CS 347 Notes 3 32

Example D (2) A K R 1 (K,A,B) R 2 (K,C,D) CS 347 Notes 3 33

Example D (2) A K K,A K,A R 1 (K,A,B) R 2 (K,C,D) CS 347 Notes 3 34

Example D (4) A R 1 (K, A, B) CS 347 Notes 3 35

Rule 3 Consider the vertical fragmentation R i = (R), A i A A i Then for any B A B (R) = B ( R i B A i Ø ) i CS 347 Notes 3 36

Query Processing Decomposition Localization Optimization Overview Joins and other operations Inter-operation parallelism Optimization strategies CS 347 Notes 3 37

Overview Optimization process 1. Generate query plans P 1 P 2 P 3 P n 2. Estimate size of intermediate results 3. Estimate cost of plan C 1 C 2 C 3 C n 4. Pick minimum CS 347 Notes 3 38

Overview Differences from centralized optimization New strategies for some operations Parallel/distributed sort Parallel/distributed join Semi-join Privacy preserving join Duplicate elimination Aggregation Selection Many ways to assign and schedule processors CS 347 Notes 3 39

Parallel/Distributed Sort Input a) Relation R on single site/disk b) Relation R fragmented/partitioned by sort attribute c) Relation R fragmented/partitioned by another attribute CS 347 Notes 3 40

Parallel/Distributed Sort Output a) Sorted R on single site/disk b) Sorted R fragments/partitions 5 6 10 12 15 19 20 21 50 F 1 F 2 F 3 CS 347 Notes 3 41

Basic Sort R(K, ) to be sorted on K Horizontally fragmented on K using vector (k 0, k 1,..., k n ) 7 3 k 0 k 1 11 10 20 17 14 27 22 CS 347 Notes 3 42

Basic Sort Algorithm 1. Sort each fragment independently 2. Ship results if necessary CS 347 Notes 3 43

Basic Sort Same idea on different architectures Shared nothing P 1 P 2 M F 1 M F 2 Shared memory P 1 P 2 F 1 M F 2 CS 347 Notes 3 44

Range-Partition Sort R(K, ) to be sorted on K R located at one or more site/disk, not fragmented on K CS 347 Notes 3 45

Range-Partition Sort Algorithm 1. Range partition on K 2. Basic sort R a R 1 local sort R 1 k 0 R b R 2 local sort R 2 result k 1 local sort R 3 R 3 CS 347 Notes 3 46

Partition Vectors Problem Select a good partition vector given fragments 7 52 11 14 31 8 15 11 32 17 10 12 4 R a R b R c CS 347 Notes 3 47

Partition Vectors Example approach Each site sends to coordinator MIN sort key MAX sort key Number of tuples Coordinator Computes vector and distributes to sites Decides the number of sites to perform local sorts CS 347 Notes 3 48

Partition Vectors Sample scenario Coordinator receives S A : MIN = 5 MAX = 9 # of tuples = 10 S B : MIN = 7 MAX = 16 # of tuples = 10 CS 347 Notes 3 49

Partition Vectors Sample scenario Coordinator receives S A : MIN = 5 MAX = 9 # of tuples = 10 S B : MIN = 7 MAX = 16 # of tuples = 10 Expected tuples 2 = 10 / (9 5 + 1) 1 = 10 / (16 7 + 1) 5 10 15 20 k 0? assuming we want to sort at 2 sites CS 347 Notes 3 50

Partition Vectors Sample scenario Expected tuples 2 1 Expected tuples with key < k 0 = half of total tuples 2(k 0 5) + (k 0 7) = 10 k 0 = (10 + 10 + 7) / 3 = 9 5 10 15 20 k 0? CS 347 Notes 3 51

Partition Vectors Variations Send more info to coordinator: a) Partition vector for local site b) Histogram 3 3 3 5 6 8 10 # of tuples local vector 5 6 7 8 9 10 CS 347 Notes 3 52

Partition Vectors Variations Multiple rounds 1. Sites send range and # of tuples to coordinator 2. Coordinator distributes preliminary vector V 0 3. Sites send coordinator # of tuples in each V 0 range 4. Coordinator computes and distributes final vector V f CS 347 Notes 3 53

Partition Vectors Variations Distributed algorithm that does not require a coordinator? CS 347 Notes 3 54

Parallel External Sort-Merge Same as range-partition sort except does sorting first R a local sort R a R 1 k 0 R b local sort R b R 2 result k 1 R 3 in order merge CS 347 Notes 3 55

Parallel/Distributed Join Input Relations R, S May or may not be fragmented Output R S Result at one or more sites CS 347 Notes 3 56

Partitioned Join (Equijoin) local join R 1 R A S 1 S A R B R 2 S 2 S B f(a) R 3 S 3 S C result f(a) CS 347 Notes 3 57

Partitioned Join (Equijoin) Same partition function f is used for both R and S f can be range or hash partitioning Local join can be of any type (use any CS 245 optimization) Various scheduling options, e.g., a) Partition R; partition S; join b) Partition R; build local hash table for R; partition S and join CS 347 Notes 3 58

Partitioned Join (Equijoin) We already know why it works [R 1 ] [R 2 ] [R 3 ] [S 1 ] [S 2 ] [S 3 ] [R 1 ] [S 1 ] [R 2 ] [S 2 ] [R 3 ] [S 3 ] Sometimes we partition just to make this join possible CS 347 Notes 3 59

Partitioned Join (Equijoin) Selecting a good partition function f is very important Goal is to make all R i + S i the same size CS 347 Notes 3 60

Asymmetric Fragment + Replicate Join local join R A R 1 S S A R B R 2 S S B partition R 3 S result union CS 347 Notes 3 61

Asymmetric Fragment + Replicate Join Can use any partition function f for R (even round robin) Can do any join, not just equijoin (e.g., R S) R.A<S.B CS 347 Notes 3 62

General Fragment + Replicate Join R A R 1 R 1 R B R 2 R 2 f partition 3 fragments R 3 R 3 n copies of each fragment CS 347 Notes 3 63

General Fragment + Replicate Join R A R 1 R 1 R B R 2 R 2 f partition 3 fragments R 3 R 3 n copies of each fragment S is partitioned in similar fashion CS 347 Notes 3 64

General Fragment + Replicate Join R 1 S 1 R 1 S 2 R 2 S 1 R 2 S 2 n m pairings of R, S fragments R 3 S 1 R 3 S 2 result CS 347 Notes 3 65

General Fragment + Replicate Join The asymmetric F+R join is special case of the general F+R Asymmetric F+R may be good if S is small Also works on non-equijoins CS 347 Notes 3 66

Semi-Join Goal: reduce communication traffic R S ( R S ) S or A A A R ( S R ) or A ( R S ) ( S R ) A A A A CS 347 Notes 3 67

Semi-Join R S A B A C R 2 a 10 b S 3 x 10 y 25 c 15 z 30 d 25 w 32 x CS 347 Notes 3 68

Semi-Join R S A B A C R 2 a 10 b S 3 x 10 y 25 c 15 z 30 d 25 w 32 x A R = [2, 10, 25, 30] CS 347 Notes 3 69

Semi-Join R S A B A C R 2 a 10 b S 3 x 10 y 25 c 15 z 30 d 25 w 32 x R S A R = [2, 10, 25, 30] 10 y S R = 25 w CS 347 Notes 3 70

Semi-Join A B A C R 2 a 10 b S 3 x 10 y 25 c 15 z 30 d 25 w 32 x Transmitted data With semi-join R ( S R ): T = 4 A + 2 A + C + result With join R S : T = 4 A + B + result better if B is large CS 347 Notes 3 71

Semi-Join Assume R is the smaller relation Then, in general, ( R S ) S is better than ( R S ) if A A A size( A S ) + size ( R S ) < size ( R ) A Can do similar comparison for other joins Only taking into account transmission cost CS 347 Notes 3 72

Semi-Join Trick: encode A S (or A R) as a bit vector keys in S 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 one bit/possible key CS 347 Notes 3 73

Semi-Join Useful for three way joins R S T CS 347 Notes 3 74

Semi-Join Useful for three way joins R S T Option 1: R S T where R = R S S = S T CS 347 Notes 3 75

Semi-Join Useful for three way joins R S T Option 1: R S T where R = R S S = S T Option 2: R S T where R = R S S = S T CS 347 Notes 3 76

Semi-Join Useful for three way joins R S T Option 1: R S T where R = R S S = S T Option 2: R S T where R = R S S = S T Many options ~ exponential in # of relations CS 347 Notes 3 77

Privacy Preserving Join Site 1 has R(A, B) Site 2 has S(A, C) Want to compute R S Site 1 should not discover any S info not in the join Site 2 should not discover any R info not in the join CS 347 Notes 3 78

Privacy Preserving Join Semi-join won t work If site 1 sends A R to site 2, site 2 learns all keys of R A B A C R a 1 b 1 A R = (a 1, a 2, a 3, a 4 ) S a 1 c 1 a 2 b 2 a 3 c 2 a 3 b 3 a 5 c 3 a 4 b 4 a 7 c 4 site 1 site 2 CS 347 Notes 3 79

Privacy Preserving Join Fix: send hash keys only Site 1 hashes each value of A before sending Site 2 hashes its own A values (same h) to see what tuples match R A B a 1 b 1 a 2 b 2 a 3 b 3 A R = (h(a 1 ), h(a 2 ), h(a 3 ), h(a 4 )) site 2 sees it has h(a 1 ), h(a 3 ) S A C a 1 c 1 a 3 c 2 a 5 c 3 a 4 b 4 a 7 c 4 (a1, c1), (a3, c3) site 1 site 2 CS 347 Notes 3 80

Privacy Preserving Join What is the problem? A B A C R a 1 b 1 A R = (h(a 1 ), h(a 2 ), h(a 3 ), h(a 4 )) S a 1 c 1 a 2 b 2 site 2 sees it has h(a 1 ), h(a 3 ) a 3 c 2 a 3 b 3 a 5 c 3 a 4 b 4 a 7 c 4 (a1, c1), (a3, c3) site 1 site 2 CS 347 Notes 3 81

Privacy Preserving Join What is the problem? A B A C R a 1 b 1 A R = (h(a 1 ), h(a 2 ), h(a 3 ), h(a 4 )) S a 1 c 1 a 2 b 2 site 2 sees it has h(a 1 ), h(a 3 ) a 3 c 2 a 3 b 3 a 5 c 3 a 4 b 4 (a1, c1), (a3, c3) a 7 c 4 site 1 site 2 Dictionary attack Site 2 can take all keys a 1, a 2, a 3, and checks if h(a 1 ), h(a 2 ), h(a 3 ), matches what site 1 sent CS 347 Notes 3 82

Privacy Preserving Join Our adversary model: honest but curious Dictionary attack is possible (cheating is internal and can t be caught) Sending incorrect keys not possible (cheater could be caught) CS 347 Notes 3 83

Privacy Preserving Join One solution Information Sharing Across Private Databases. Agrawal et al., SIGMOD 2003 Use commutative encryption functions E i (x) = x encrypted using a key private to site i E 1 (E 2 (x)) = E 2 (E 1 (x)) Shorthand: E 1 (x) is x, E 2 (x) is x, E 1 (E 2 (x)) is x CS 347 Notes 3 84

Privacy Preserving Join One solution R A B a 1 b 1 a 2 b 2 ( a 1, a 2, a 3, a 4 ) S A C a 1 c 1 a 3 c 2 a 3 b 3 ( a 1, a 2, a 3, a 4 ) a 5 c 3 a 4 b 4 a 7 c 4 ( a 1, a 3, a 5, a 7 ) site 1 site 2 Computes ( a 1, a 3, a 5, a 7 ), intersects with ( a 1, a 2, a 3, a 4 ) (a 1, b 1 ), (a 3, b 3 ) CS 347 Notes 3 85

Other Privacy Preserving Operations Inequality join Similarity join R S R.A>S.A R S sim(r.a,s.a)<e CS 347 Notes 3 86

Other Parallel Operations Duplicate elimination Sort first (in parallel) then eliminate duplicates in result Partition tuples (range or hash) and eliminate locally Aggregates Partition by grouping attributes; compute aggregate locally CS 347 Notes 3 87

Parallel/Distributed Aggregates R a id dept salary 1 toy 10 2 toy 20 3 sales 15 R b id dept salary 4 sales 5 5 toy 20 6 mgmt 15 7 sales 10 8 mgmt 30 sum (sal) group by dept CS 347 Notes 3 88

Parallel/Distributed Aggregates R a R b id dept salary 1 toy 10 2 toy 20 3 sales 15 id dept salary 4 sales 5 5 toy 20 6 mgmt 15 7 sales 10 8 mgmt 30 id dept salary 1 toy 10 2 toy 20 5 toy 20 6 mgmt 15 8 mgmt 30 id dept salary 3 sales 15 4 sales 5 7 sales 10 sum (sal) group by dept CS 347 Notes 3 89

Parallel/Distributed Aggregates R a id dept salary 1 toy 10 id dept salary 1 toy 10 sum dept sum toy 50 2 toy 20 2 toy 20 mgmt 45 3 sales 15 5 toy 20 6 mgmt 15 R b id dept salary 4 sales 5 5 toy 20 8 mgmt 30 id dept salary sum dept sum 6 mgmt 15 3 sales 15 sales 30 7 sales 10 4 sales 5 8 mgmt 30 7 sales 10 sum (sal) group by dept CS 347 Notes 3 90

Parallel/Distributed Aggregates R a id dept salary 1 toy 10 2 toy 20 3 sales 15 R b id dept salary 4 sales 5 5 toy 20 6 mgmt 15 7 sales 10 8 mgmt 30 sum (sal) group by dept with less data CS 347 Notes 3 91

Parallel/Distributed Aggregates R a id dept salary 1 toy 10 sum dept salary toy 30 2 toy 20 toy 20 3 sales 15 mgmt 45 R b id dept salary 4 sales 5 5 toy 20 sum dept salary 6 mgmt 15 sales 15 7 sales 10 sales 15 8 mgmt 30 sum (sal) group by dept with less data CS 347 Notes 3 92

Parallel/Distributed Aggregates R a id dept salary 1 toy 10 sum dept salary toy 30 sum dept sum toy 50 2 toy 20 toy 20 mgmt 45 3 sales 15 mgmt 45 R b id dept salary 4 sales 5 5 toy 20 sum dept salary sum dept sum 6 mgmt 15 sales 15 sales 30 7 sales 10 sales 15 8 mgmt 30 sum (sal) group by dept with less data CS 347 Notes 3 93

Parallel/Distributed Aggregates Enhancement Perform aggregate during partition to reduce data transmitted Does not work for all aggregate functions which ones? CS 347 Notes 3 94

Preview: MapReduce data A 1 map data B 1 reduce data C 1 data A 2 data B 2 data C 1 data A 3 CS 347 Notes 3 95

Parallel/Distributed Selection Straightforward if one can leverage range or hash partitioning How do indexes work? CS 347 Notes 3 96

Parallel/Distributed Selection Partition vector can act as the root of a distributed index k 0 k 1 local indexes site 1 site 2 site 3 CS 347 Notes 3 97

Parallel/Distributed Selection Distributed indexes on a non-partition attributes get complicated k 0 k 1 index sites tuple sites CS 347 Notes 3 98

Parallel/Distributed Selection Indexing schemes Which one is better in a distributed environment? How to make updates and expansion efficient? Where to store the directory and set of participants? Is global knowledge necessary? If the index is not too big, it may be better to keep it whole and replicate it CS 347 Notes 3 99

Query Processing Decomposition Localization Optimization Overview Joins and other operations Inter-operation parallelism Optimization strategies CS 347 Notes 3 100

Inter-Operation Parallelism Pipelined Independent CS 347 Notes 3 101

Pipelined Parallelism site 2 σ c S site 1 R site 1 R tuples matching σ c site 2 join S probe result CS 347 Notes 3 102

Pipelined Parallelism Pipelining cannot be used in all cases stream of R tuples stream of S tuples CS 347 Notes 3 103

Independent Parallelism site 1 site 2 R S T V 1. temp 1 R S temp 2 T V 2. result temp 1 temp 2 CS 347 Notes 3 104

Summary As we consider query plans for optimization, we must consider various new strategies for individual operations scheduling operations CS 347 Notes 3 105