CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 4: Query Optimization

Query Optimization Cost estimation Strategies for exploring plans Q min CS 347 Notes 4 2

Cost Estimation Based on estimating result sizes Like in centralized databases CS 347 Notes 4 3

Cost Estimation But # of IOs may not be the best metric E.g., transmission time may dominate work at site 1 T 1 work at site 2 T 2 answer time/$ CS 347 Notes 4 4

Cost Estimation Another reason why # of IOs is not enough: parallelism Plan A Plan B 100 IOs site 1 site 2 50 IOs 70 IOs site 3 50 IOs CS 347 Notes 4 5

Cost Estimation Cost metrics E.g., IOs, bytes transmitted, $, Additive Task 1 Response time metric Not additive Need scheduling and dependency info Task 2 Task 3 Skew is important CS 347 Notes 4 6

Cost Estimation Also take into account Start up cost Data distribution cost/time Resource contention (for memory, disk, network) Cost of assembling results CS 347 Notes 4 7

Cost Estimation Response time example site 1 site 2 site 3 site 4 start up distribution searching + sending results final processing CS 347 Notes 4 8

Search Strategies Exhaustive (with pruning) Hill climbing (greedy) Query separation CS 347 Notes 4 9

Exhaustive Search Consider all query plans (given a set of techniques for operators) Prune some plans Heuristics CS 347 Notes 4 10

Exhaustive Search Example R S T A B R > S > T R S T R S R T S R S T T S T R 2 1 2 1 (S R) T (T S) R ship S to R semi join ship T to S semi join 1 2 Prune because cross-product not necessary Prune because larger relation first CS 347 Notes 4 11

Search Strategies In generating plans, keep goal in mind E.g., if goal is parallelism (in system with fast network) Consider partitioning relations first E.g., if goal is reduction of network traffic Consider semi-joins CS 347 Notes 4 12

Hill Climbing Better plans 1 x initial plan Worse plans CS 347 Notes 4 13

Hill Climbing 2 Better plans 1 x initial plan Worse plans CS 347 Notes 4 14

Hill Climbing Example R S T V A B C relation site size R 1 10 S 2 20 T 3 30 V 4 40 tuple size = 1 Goal: minimize data transmission CS 347 Notes 4 15

Hill Climbing Initial plan Send relations to one site What site do we send all relations to? To site 1: cost = 20 + 30 + 40 = 90 To site 2: cost = 10 + 30 + 40 = 80 To site 3: cost = 10 + 20 + 40 = 70 To site 4: cost = 10 + 20 + 30 = 60 CS 347 Notes 4 16

Hill Climbing P 0 R (1 4) S (2 4) T (3 4) Compute R S T V at site 4 CS 347 Notes 4 17

Hill Climbing Local search Consider sending each relation to neighbor 4 R S 1 2 4 R S 1 2 R CS 347 Notes 4 18

Hill Climbing Assume size R S = 20 size S T = 5 size T V = 1 Option A 4 10 20 1 R S 2 4 R S 20 10 1 2 R cost = 30 cost = 30 No savings CS 347 Notes 4 19

Hill Climbing Option B 4 10 20 R S 1 2 4 S R 1 20 20 S 2 cost = 30 cost = 40 Worse off CS 347 Notes 4 20

Hill Climbing Option C 4 20 30 S T 2 3 4 T S 2 5 30 T 3 cost = 50 cost = 35 Win CS 347 Notes 4 21

Hill Climbing Option D 4 20 30 S T 2 3 4 S T 5 20 2 3 S cost = 50 cost = 25 Bigger win CS 347 Notes 4 22

Hill Climbing P 1 S (2 3) α = S T R (1 4) T (3 4) Compute answer at site 4 CS 347 Notes 4 23

Hill Climbing Repeat local search Treat α = S T as relation α R 4 1 3 α R 4 α vs. 1 3 4 R α 1 3 R CS 347 Notes 4 24

Hill Climbing Hill climbing may miss best plan E.g., best plan could be P best T (3 4) β = T V β (4 2) β' = β S β' (2 1) β'' = β' R β'' (1 4) (optional) Compute answer V 4 β'' β T 1 β' 2 3 R S T CS 347 Notes 4 25

Hill Climbing Hill climbing may miss best plan E.g., best plan could be P best T (3 4) β = T V β (4 2) β' = β S β' (2 1) β'' = β' R β'' (1 4) (optional) Compute answer = 30 = 1 = 1 = 1 = 33 total CS 347 Notes 4 26 β'' 4 1 2 3 β' R V S β T Costs could be low because β is very selective T

Search Strategies Exhaustive (with pruning) Hill climbing (greedy) Query separation CS 347 Notes 4 27

Query Separation Separate query into 2 or more steps Optimize each step independently CS 347 Notes 4 28

Query Separation Example Simple queries technique σ c1 A σ c2 σ c3 R S CS 347 Notes 4 29

Query Separation 1. Compute R = A [ σ c2 R ] S = A [ σ c3 S ] 2. Compute J = R S 3. Compute answer σ c1 { [ J σ c2 R ] [ J σ c3 S ] } σ c2 R σ c1 A σ c3 S CS 347 Notes 4 30

Query Separation 1. Compute R = A [ σ c2 R ] S = A [ σ c3 S ] 2. Compute J = R S Compute the A values in the answer first 3. Compute answer σ c1 { [ J σ c2 R ] [ J σ c3 S ] } Get tuples from sites matching A and compute answer next CS 347 Notes 4 31

Query Separation Simple query Relations have a single attribute Output has a single attribute E.g., J = R S CS 347 Notes 4 32

Query Separation Idea 1. Decompose query Local processing Simple query (or queries) Final processing 2. Optimize simple query CS 347 Notes 4 33

Query Separation Philosophy Hard part is distributed join Do this part with only keys; get the rest of the data later Simpler to optimize simple queries CS 347 Notes 4 34

Summary Cost estimation Optimization strategies Exhaustive (with pruning) Hill climbing (greedy) Query separation CS 347 Notes 4 35

Words of Wisdom Optimization is like chess playing May have to make sacrifices for later gains Move data, partition relations, build indexes CS 347 Notes 4 36