Module 10: Query Optimization

Belinda Harrington
5 years ago
Module Outline

10.1 Outline of Query Optimization
10.2 Motivating Example
10.3 Equivalences in the relational algebra
10.4 Heuristic optimization
10.5 Explosion of search space
10.6 Dynamic programming strategy (System R)

[Figure: DBMS architecture. Components shown: Web Forms, Applications, SQL Interface, SQL Commands; Query Processor (Parser, Optimizer, Plan Executor, Operator Evaluator); Files and Index Structures, Buffer Manager, Disk Space Manager; Transaction Manager, Lock Manager, Concurrency Control, Recovery Manager; Database (Index Files, Data Files, System Catalog). A "You are here!" marker points at the Query Processor.]

10.1 Outline of Query Optimization

The success of relational database technology is largely due to the system's ability to automatically find evaluation plans for declaratively specified queries.

Given some (SQL) query Q, the system
1 parses and analyzes Q,
2 derives a relational algebra expression E that computes Q,
3 transforms and simplifies E, and
4 annotates the operators in E with access methods and operator algorithms to obtain an evaluation plan P.

Discussed here are tasks 3 and 4: task 3 is often called algebraic (or re-write) query optimization, while task 4 is also called non-algebraic (or cost-based) query optimization.

10.2 Motivating Example

From query to plan

Example: List the airports from which flights operated by Swiss (airline code LX) fly to any German (DE) airport.

Airport:
    code  country  name
    FRA   DE       Frankfurt
    ZRH   CH       Zurich
    MUC   DE       Munich
    ...

Flight:
    from  to   airline
    FRA   ZRH  LX
    ZRH   MUC  LX
    FRA   MUC  US
    ...

SQL query Q:
    SELECT f.from
    FROM   Flight f, Airport a
    WHERE  f.to = a.code AND f.airline = 'LX' AND a.country = 'DE'

From query to plan (cont'd)

Relational algebra expression E that computes Q:

    $\pi_{from}(\sigma_{airline = 'LX' \wedge country = 'DE'}(Airport \bowtie_{to = code} Flight))$

From query to plan (cont'd)

One (of many) plan(s) P to evaluate Q, obtained by annotating the operators of E:

    $\pi_{from}$                                     [scan]
      $\sigma_{airline = 'LX' \wedge country = 'DE'}$      [scan]
        $\bowtie_{to = code}$                         [NL-join]
          Airport                                [heap scan]
          Flight                                 [index scan on to]
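To make the example concrete, here is a minimal Python sketch that evaluates E on the sample instance with a naive nested-loops strategy. Holding the relations as lists of dicts and the function name `evaluate` are assumptions made for illustration only; a DBMS would of course work on pages and access methods instead.

```python
# Sample instance from the motivating example (only the rows shown on the slide).
airport = [
    {"code": "FRA", "country": "DE", "name": "Frankfurt"},
    {"code": "ZRH", "country": "CH", "name": "Zurich"},
    {"code": "MUC", "country": "DE", "name": "Munich"},
]
flight = [
    {"from": "FRA", "to": "ZRH", "airline": "LX"},
    {"from": "ZRH", "to": "MUC", "airline": "LX"},
    {"from": "FRA", "to": "MUC", "airline": "US"},
]


def evaluate(airport, flight):
    """Nested-loops evaluation of the join, the selection, and the projection on `from`."""
    result = set()  # set semantics: duplicates are eliminated
    for a in airport:          # outer input
        for f in flight:       # inner input
            if f["to"] == a["code"] and f["airline"] == "LX" and a["country"] == "DE":
                result.add(f["from"])
    return result


print(evaluate(airport, flight))  # {'ZRH'}
```

Only the flight ZRH to MUC is operated by LX and lands in Germany, so the result contains the single airport ZRH.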

10.3 Equivalences in the relational algebra

Two relational algebra expressions $E_1$, $E_2$ are equivalent if on every legal database instance the two expressions generate the same set of tuples. Note: the order of tuples is irrelevant.

Such equivalences are denoted by equivalence rules of the form $E_1 \equiv E_2$ (such a rule may be applied by the system in both directions, $\rightarrow$ and $\leftarrow$).

We know those equivalence rules from the course Information Systems.

Some equivalence rules

1 Conjunctive selections can be deconstructed into a sequence of individual selections:
    $\sigma_{p_1 \wedge p_2}(E) \equiv \sigma_{p_1}(\sigma_{p_2}(E))$

2 Selection operations are commutative:
    $\sigma_{p_1}(\sigma_{p_2}(E)) \equiv \sigma_{p_2}(\sigma_{p_1}(E))$

3 Only the last projection in a sequence of projections is needed, the others can be omitted:
    $\pi_{L_1}(\pi_{L_2}(\cdots \pi_{L_n}(E) \cdots)) \equiv \pi_{L_1}(E)$

4 Selections can be combined with Cartesian products and joins:
    i)  $\sigma_p(E_1 \times E_2) \equiv E_1 \bowtie_p E_2$
    ii) $\sigma_p(E_1 \bowtie_q E_2) \equiv E_1 \bowtie_{p \wedge q} E_2$

Pictorial description of 4 i):

[Figure: the operator tree with $\sigma_p$ on top of $E_1 \times E_2$ is rewritten into the tree $E_1 \bowtie_p E_2$.]

5 Join operations are commutative:
    $E_1 \bowtie_p E_2 \equiv E_2 \bowtie_p E_1$

6 i) Natural joins (equality of common attributes) are associative:
    $(E_1 \bowtie E_2) \bowtie E_3 \equiv E_1 \bowtie (E_2 \bowtie E_3)$
  ii) General joins are associative in the following sense:
    $(E_1 \bowtie_p E_2) \bowtie_{q \wedge r} E_3 \equiv E_1 \bowtie_{p \wedge q} (E_2 \bowtie_r E_3)$
    where predicate r involves attributes of $E_2$, $E_3$ only.

7 Selection distributes over joins in the following ways:
  i) If predicate p involves attributes of $E_1$ only:
    $\sigma_p(E_1 \bowtie_q E_2) \equiv \sigma_p(E_1) \bowtie_q E_2$
  ii) If predicate p involves only attributes of $E_1$ and q involves only attributes of $E_2$:
    $\sigma_{p \wedge q}(E_1 \bowtie_r E_2) \equiv \sigma_p(E_1) \bowtie_r \sigma_q(E_2)$
    (this is a consequence of rules 7 i) and 1).

8 Projection distributes over join as follows:
    $\pi_{L_1 \cup L_2}(E_1 \bowtie_p E_2) \equiv \pi_{L_1}(E_1) \bowtie_p \pi_{L_2}(E_2)$
    if p involves attributes in $L_1 \cup L_2$ only and $L_i$ contains attributes of $E_i$ only.

9 The set operations union and intersection are commutative:
    $E_1 \cup E_2 \equiv E_2 \cup E_1$
    $E_1 \cap E_2 \equiv E_2 \cap E_1$

10 The set operations union and intersection are associative:
    $(E_1 \cup E_2) \cup E_3 \equiv E_1 \cup (E_2 \cup E_3)$
    $(E_1 \cap E_2) \cap E_3 \equiv E_1 \cap (E_2 \cap E_3)$

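11 The selection operation distributes over $\cup$, $\cap$, and $\setminus$:
    $\sigma_p(E_1 \cup E_2) \equiv \sigma_p(E_1) \cup \sigma_p(E_2)$
    $\sigma_p(E_1 \cap E_2) \equiv \sigma_p(E_1) \cap \sigma_p(E_2)$
    $\sigma_p(E_1 \setminus E_2) \equiv \sigma_p(E_1) \setminus \sigma_p(E_2)$
   Also:
    $\sigma_p(E_1 \cap E_2) \equiv \sigma_p(E_1) \cap E_2$
    $\sigma_p(E_1 \setminus E_2) \equiv \sigma_p(E_1) \setminus E_2$
    (this does not apply for $\cup$)

12 The projection operation distributes over $\cup$:
    $\pi_L(E_1 \cup E_2) \equiv \pi_L(E_1) \cup \pi_L(E_2)$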
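The rules can be checked on concrete data. The following Python sketch applies rule 7 i) (pushing a selection below a join) to the Airport/Flight instance of the motivating example and confirms that both sides produce the same set of tuples. The helper names `select`, `join`, and `as_set` are hypothetical and exist only for this illustration; a check on one instance illustrates the rule but does not prove it.

```python
def select(pred, rel):
    """Selection: keep the tuples of rel that satisfy pred."""
    return [t for t in rel if pred(t)]


def join(pred, left, right):
    """Theta-join: concatenate every pair of tuples that satisfies pred."""
    return [{**l, **r} for l in left for r in right if pred(l, r)]


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {tuple(sorted(t.items())) for t in rel}


airport = [
    {"code": "FRA", "country": "DE", "name": "Frankfurt"},
    {"code": "ZRH", "country": "CH", "name": "Zurich"},
    {"code": "MUC", "country": "DE", "name": "Munich"},
]
flight = [
    {"from": "FRA", "to": "ZRH", "airline": "LX"},
    {"from": "ZRH", "to": "MUC", "airline": "LX"},
    {"from": "FRA", "to": "MUC", "airline": "US"},
]


def is_german(t):
    return t["country"] == "DE"


def to_eq_code(a, f):
    return f["to"] == a["code"]


# Left-hand side of rule 7 i): select after the join.
late = select(is_german, join(to_eq_code, airport, flight))
# Right-hand side: select first, then join the smaller input.
early = join(to_eq_code, select(is_german, airport), flight)

print(as_set(late) == as_set(early))  # True
```

Selecting early shrinks the outer join input from three airports to two, which is exactly the effect the heuristics in the next section aim for.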

10.4 Heuristic optimization

Query optimizers use the equivalence rules of relational algebra to improve the expected performance of a given query in most cases.

The optimization is guided by the following heuristics:
(a) Break apart conjunctive selections into a sequence of simpler selections (rule 1; preparatory step for (b)).
(b) Move selections ($\sigma$) down the query tree for the earliest possible execution (rules 2, 7, 11; reduces the number of tuples processed).
(c) Replace $\sigma$-$\times$ pairs by $\bowtie$ (rule 4 i); avoids large intermediate results).
(d) Break apart lists of projection attributes and move them as far down the tree as possible, creating new projections where possible (rules 3, 8, 12; reduces tuple widths early).
(e) Perform the joins with the smallest expected result first.

Heuristic optimization: example

SQL query Q:
    SELECT p.ticketno
    FROM   Flight f, Passenger p, Crew c
    WHERE  f.flightno = p.flightno AND f.flightno = c.flightno
    AND    f.date = [...] AND f.to = 'FRA'
    AND    p.name = c.name AND c.job = 'Pilot'

(What would be a natural language formulation of Q?)

Canonical relational algebra expression (reflects the semantics of the SQL SELECT-FROM-WHERE block directly):

    $\pi_{p.ticketno}$
      $\sigma_{f.flightno = p.flightno \wedge f.flightno = c.flightno \wedge \dots \wedge c.job = 'Pilot'}$
        Flight f $\times$ Crew c $\times$ Passenger p

Heuristic optimization: example

1 Break apart conjunctive selection to prepare push-down of selections:

    $\pi_{p.ticketno}$
      $\sigma_{f.flightno = p.flightno}$
        $\sigma_{f.flightno = c.flightno}$
          $\sigma_{f.date = [...]}$
            $\sigma_{f.to = 'FRA'}$
              $\sigma_{p.name = c.name}$
                $\sigma_{c.job = 'Pilot'}$
                  Flight f $\times$ Crew c $\times$ Passenger p

Heuristic optimization: example

2 Push down selection as far as possible (but no further!):

    $\pi_{p.ticketno}$
      $\sigma_{f.flightno = c.flightno}$
        $\sigma_{p.name = c.name}$
          $\times$
            $\sigma_{f.flightno = p.flightno}$
              $\times$
                $\sigma_{f.to = 'FRA'}$
                  $\sigma_{f.date = [...]}$
                    Flight f
                Passenger p
            $\sigma_{c.job = 'Pilot'}$
              Crew c

Heuristic optimization: example

3 Re-unite sequences of selections into single conjunctive selections:

    $\pi_{p.ticketno}$
      $\sigma_{f.flightno = c.flightno \wedge p.name = c.name}$
        $\times$
          $\sigma_{f.flightno = p.flightno}$
            $\times$
              $\sigma_{f.to = 'FRA' \wedge f.date = [...]}$
                Flight f
              Passenger p
          $\sigma_{c.job = 'Pilot'}$
            Crew c

Heuristic optimization: example

4 Introduce projections to reduce tuple widths:

    $\pi_{p.ticketno}$
      $\sigma_{f.flightno = c.flightno \wedge p.name = c.name}$
        $\times$
          $\sigma_{f.flightno = p.flightno}$
            $\times$
              $\pi_{f.flightno}$
                $\sigma_{f.to = 'FRA' \wedge f.date = [...]}$
                  Flight f
              $\pi_{p.ticketno, p.flightno, p.name}$
                Passenger p
          $\pi_{c.flightno, c.name}$
            $\sigma_{c.job = 'Pilot'}$
              Crew c

Heuristic optimization: example

5 Combine Cartesian products and selections into joins:

    $\pi_{p.ticketno}$
      $\bowtie_{f.flightno = c.flightno \wedge p.name = c.name}$
        $\bowtie_{f.flightno = p.flightno}$
          $\pi_{f.flightno}$
            $\sigma_{f.to = 'FRA' \wedge f.date = [...]}$
              Flight f
          $\pi_{p.ticketno, p.flightno, p.name}$
            Passenger p
        $\pi_{c.flightno, c.name}$
          $\sigma_{c.job = 'Pilot'}$
            Crew c

Heuristic optimization: example

6 Relation Passenger presumably is the largest relation, so re-order the joins (associativity of general joins, rule 6 ii)):

    $\pi_{p.ticketno}$
      $\bowtie_{f.flightno = p.flightno \wedge p.name = c.name}$
        $\pi_{p.ticketno, p.flightno, p.name}$
          Passenger p
        $\bowtie_{f.flightno = c.flightno}$
          $\pi_{f.flightno}$
            $\sigma_{f.to = 'FRA' \wedge f.date = [...]}$
              Flight f
          $\pi_{c.flightno, c.name}$
            $\sigma_{c.job = 'Pilot'}$
              Crew c

Choosing an evaluation plan

When the optimizer annotates the resulting algebra expression E, it needs to consider the interaction of the chosen operator algorithms/access methods.

Choosing the cheapest (in terms of I/O) algorithm for each operation independently may not yield the overall cheapest plan P.

Example: merge join may be costlier than nested loops join (operands need to be sorted first), but yields output in sorted order (good for subsequent duplicate elimination, selection, grouping, ...).

We need to consider all possible plans and then choose the best one in a cost-based fashion.

10.5 Explosion of search space

Consider finding the best join order for the query $R_1 \bowtie R_2 \bowtie R_3 \bowtie R_4$.

Several join tree shapes (due to associativity, commutativity of $\bowtie$):
    bushy:       $(R_1 \bowtie R_2) \bowtie (R_3 \bowtie R_4)$
    left-deep:   $((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4$
    right-deep:  $R_1 \bowtie (R_2 \bowtie (R_3 \bowtie R_4))$

Number of different join orders for an n-way join:

    $\frac{(2n-2)!}{(n-1)!}$    (n = 7: 665 280, n = 10: 17 643 225 600)

Derivation of the number of possible join orderings

Let J(n) denote the number of different join orderings for a join of n argument relations. Obviously,

    $J(n) = T(n) \cdot n!$

with T(n) the number of different binary tree shapes and n! the number of leaf permutations. [1]

We can now derive T(n) inductively:

    $T(1) = 1, \qquad T(n) = \sum_{i=1}^{n-1} T(i) \cdot T(n-i)$

namely, T(n) = sum over all possibilities of T(left subtree) $\cdot$ T(right subtree).

It turns out that $T(n) = C(n-1)$, for C(n) the n-th Catalan number,

    $C(n) = \frac{1}{n+1} \binom{2n}{n} = \frac{(2n)!}{(n+1)! \, n!}$

Substituting $T(n) = C(n-1)$, we obtain

    $T(n) \cdot n! = \frac{(2(n-1))!}{(n-1)!}$

[1] see (Cormen et al., 1990)
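The derivation can be cross-checked numerically. The sketch below computes T(n) from the recurrence and compares J(n) = T(n) · n! against the closed form (2(n-1))!/(n-1)!; the function names are chosen for this illustration only.

```python
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def tree_shapes(n):
    """T(n): number of binary tree shapes with n leaves, via the recurrence."""
    if n == 1:
        return 1
    return sum(tree_shapes(i) * tree_shapes(n - i) for i in range(1, n))


def join_orders(n):
    """J(n) = T(n) * n!: tree shapes times leaf permutations."""
    return tree_shapes(n) * factorial(n)


def join_orders_closed_form(n):
    """(2(n-1))! / (n-1)!, obtained by substituting T(n) = C(n-1)."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (4, 7, 10):
    print(n, join_orders(n), join_orders_closed_form(n))
# 4 120 120
# 7 665280 665280
# 10 17643225600 17643225600
```

Even for a 4-way join there are already 120 orderings, which is why the following slides restrict the search space.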

Restricting the search space

Fact: query optimization will not be able to find the overall best plan.
Instead: optimizers try to avoid the really bad plans (I/O cost of different plans may differ substantially!).

Restrict the search space: consider left-deep join orders only (left is outer relation, right is inner):

    $((R_1 \bowtie R_2) \bowtie R_3) \bowtie R_4$

Left-deep trees may be evaluated in a fully pipelined fashion (inner input is a stored relation):
- intermediate results need not be written to temporary files,
- (Block) NL-join may profit from available indexes on the inner relation.

The number of possible left-deep join orders for an n-way join is only n!.

Single relation plans

The optimizer enumerates (generates) all possible plans to assess their cost.

If the query involves a single relation R only, consider single relation plans:
- Consider each available access method (e.g., heap scan, (un)clustered index scan) to access the tuples of R.
- Keep the access method involving the least estimated cost.

Cost estimates for single relation plans (System R style)

IBM System R (1970s): first successful relational database system, introduced most of the query optimization techniques still in use today.

Pragmatic yet successful cost model for access methods on relation R ($|R|$: number of pages of R, $\|R\|$: number of tuples of R):

    Access method                                Cost
    access primary key index I                   Height(I) + 1   if I is a B+ tree
                                                 2.2             if I is a hash index
    clustered index I matching predicate p       $(|I| + |R|) \cdot sel(p)$ [2]
    unclustered index I matching predicate p     $(|I| + \|R\|) \cdot sel(p)$
    sequential scan                              $|R|$

[2] If sel(p) is unknown, assume 1/[...]

Cost estimates for a single relation plan

Query Q:
    SELECT A
    FROM   R
    WHERE  B = c

Database profile: $|R| = 500$, $\|R\| = [...]$, $V(B, R) = 10$

    $\|Q\| \approx 1/V(B, R) \cdot \|R\| = 1/10 \cdot [...] = [...]$ tuples retrieved, where $1/V(B, R) = sel(B = c)$

1 Database maintains clustered index $I_B$ ($|I_B| = 50$) on attribute B:

    cost $= (|I_B| + |R|) \cdot 1/V(B, R) = (50 + 500) \cdot 1/10 = 55$ pages

Cost estimates for a single relation plan (cont'd)

2 Database maintains unclustered index $I_B$ ($|I_B| = 50$) on attribute B:

    cost $= (|I_B| + \|R\|) \cdot 1/V(B, R) = (50 + [...]) \cdot 1/10 = [...]$ pages

3 No index support, use sequential file scan to access R:

    cost $= |R| = 500$ pages

To evaluate query Q, use the clustered index $I_B$.
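The three estimates can be reproduced with a few lines of Python. This is only a sketch of the cost model as tabulated above: the function names are made up for this example, and the tuple count `rel_tuples = 40_000` is a hypothetical value chosen for illustration, since the tuple count of R is not legible in this copy of the slides.

```python
from fractions import Fraction


def clustered_index_cost(index_pages, rel_pages, sel):
    """(|I| + |R|) * sel(p): matching tuples are stored contiguously."""
    return (index_pages + rel_pages) * sel


def unclustered_index_cost(index_pages, rel_tuples, sel):
    """(|I| + ||R||) * sel(p): up to one page access per matching tuple."""
    return (index_pages + rel_tuples) * sel


def sequential_scan_cost(rel_pages):
    """|R|: read every page once."""
    return rel_pages


rel_pages = 500          # |R| from the database profile
index_pages = 50         # |I_B|
sel = Fraction(1, 10)    # 1 / V(B, R) with V(B, R) = 10
rel_tuples = 40_000      # ASSUMPTION: hypothetical tuple count, not from the source

costs = {
    "clustered index": clustered_index_cost(index_pages, rel_pages, sel),
    "unclustered index": unclustered_index_cost(index_pages, rel_tuples, sel),
    "sequential scan": sequential_scan_cost(rel_pages),
}
best = min(costs, key=costs.get)
print(best, costs[best])  # clustered index 55
```

Using `Fraction` keeps the arithmetic exact, so the clustered index estimate comes out as exactly 55 pages, matching the slide.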

Plans for multiple relation (join) queries

We need to make sure not to miss the best left-deep join plan. Degrees of freedom left:
1 For each base relation in the query, consider all access methods.
2 For each join operation, select a join algorithm.

How many possible query plans are left now?

Back-of-envelope calculation (query with n relations): assume j join algorithms available, i indexes per relation:

    #plans $\approx n! \cdot j^{n-1} \cdot (i + 1)^n$

Example: with n = 3 relations and j = 3, i = 2:

    #plans $\approx 3! \cdot 3^2 \cdot 3^3 = 1458$
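The back-of-envelope formula is easy to evaluate for other parameter choices. A small sketch (the function name `plan_count` is an assumption for illustration):

```python
from math import factorial


def plan_count(n, j, i):
    """n! join orders * j^(n-1) join algorithm choices * (i+1)^n access methods."""
    return factorial(n) * j ** (n - 1) * (i + 1) ** n


print(plan_count(3, 3, 2))  # 1458
print(plan_count(5, 3, 2))  # 2361960
```

The factor (i + 1) counts the i indexes plus the plain scan for each relation, so the estimate grows quickly with n even in the left-deep setting.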

Plan enumeration 1: example setup

Example query (n = 3):
    SELECT a.name, f.airline, c.name
    FROM   Airport a, Flight f, Crew c
    WHERE  f.to = a.code AND f.flightno = c.flightno

(Airport = A, Flight = F, Crew = C)

Assumptions:
- Available join algorithms: hash join, block NL-join, block INL-join
- Available indexes: clustered B+ tree index I on attribute Flight.to, $|I| = 50$
- $|A| = 500$, 80 tuples/page
- $|F| = 1000$, 100 tuples/page
- $|C| = 10$, [...] tuples/page
- [...] $F \bowtie A$ tuples fit on a page

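Plan enumeration 2: candidate plans

Enumerate n! left-deep join trees (3! = 6):

    $(A \bowtie F) \bowtie C$    $(A \bowtie C) \bowtie F$    $(F \bowtie A) \bowtie C$
    $(C \bowtie F) \bowtie A$    $(C \bowtie A) \bowtie F$    $(F \bowtie C) \bowtie A$

Prune plans with $\times$ immediately (note: there is no join predicate between A and C)! 4 candidate plans remain.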
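The enumeration and pruning step can be sketched in Python as follows. A tuple `(X, Y, Z)` stands for the left-deep tree (X ⋈ Y) ⋈ Z; the helper name `needs_cross_product` and the encoding of join predicates as unordered pairs are assumptions made for this illustration.

```python
from itertools import permutations

relations = ["A", "F", "C"]
# Join predicates of the example query: f.to = a.code and f.flightno = c.flightno.
join_predicates = {frozenset("AF"), frozenset("FC")}


def needs_cross_product(order):
    """True if some relation joins the tree without any predicate linking it in."""
    joined = {order[0]}
    for rel in order[1:]:
        if not any(frozenset((rel, other)) in join_predicates for other in joined):
            return True
        joined.add(rel)
    return False


all_orders = list(permutations(relations))
candidates = [o for o in all_orders if not needs_cross_product(o)]

print(len(all_orders), len(candidates))  # 6 4
for order in candidates:
    print(order)
```

The two pruned orders are exactly those that start by combining A and C, for which the query has no join predicate.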

Plan enumeration 3: join algorithm choices

Candidate plan: $(A \bowtie F) \bowtie C$

Possible join algorithm choices (join with C / join of A and F):
    NL-join / NL-join
    NL-join / H-join
    H-join / NL-join
    H-join / H-join

Repeat for the remaining 3 candidate plans.

Plan enumeration 4: access method choices

Candidate plan: (A NL-join F) NL-join C

Possible access method choices:
    (A NL-join F) NL-join C    with heap scan A, heap scan F, heap scan C
    (A INL-join F) NL-join C   with heap scan A, index scan on F.to, heap scan C

Repeat for the remaining candidate plans.

Plan enumeration 5: cost estimation

Estimate cost for candidate plan: (A INL-join F) NL-join C, with heap scan A, index scan F, heap scan C.

Cost heap scan A: 500 (pages)

Cost of $A \bowtie F$ (F.to is key):
    $\|A\| \cdot sel(A.code = F.to) \cdot (|F| + |I|) = [...]$

    $|A \bowtie F| = \|A \bowtie F\| / 100 = \|F\| / 100 = [...] / 100 = [...]$ (pages)

Cost of $(A \bowtie F) \bowtie C$:
    $|A \bowtie F| \cdot |C| = [...]$

Total estimated cost: [...]

Plan enumeration 5: cost estimation (cont'd)

Current candidate plan: (A NL-join F) NL-join C, with heap scans of A, F, and C.

Remember: $|A| = 500$, $|F| = 1\,000$, $|C| = 10$, $|A \bowtie F| = [...]$

NL-join: scan left input + scan right input once for each page in left input.

Total estimated cost:
    $|A| + |A| \cdot |F| + |A \bowtie F| \cdot |C| = [...]$

Plan enumeration 5: cost estimation (cont'd)

Current candidate plan: (A NL-join F) H-join C, with heap scans of A, F, and C.

Remember: $|A| = 500$, $|F| = 1\,000$, $|C| = 10$, $|A \bowtie F| = [...]$

NL-join: scan left input + scan right input once for each page in left input.
H-join (assume 2 passes): 2 $\cdot$ (scan both inputs + hash both inputs into buckets) + read hash buckets with join partners.

Total estimated cost:
    $|A| + |A| \cdot |F| + 2 \cdot |A \bowtie F| + 2 \cdot |C| + |(A \bowtie F) \bowtie C| = [...]$

Plan enumeration 5: cost estimation (cont'd)

Current candidate plan: (A H-join F) NL-join C, with heap scans of A, F, and C.

Remember: $|A| = 500$, $|F| = 1\,000$, $|C| = 10$, $|A \bowtie F| = [...]$

NL-join: scan left input + scan right input once for each page in left input.
H-join (assume 2 passes): 2 $\cdot$ (scan both inputs + hash both inputs into buckets) + read hash buckets with join partners.

Total estimated cost:
    $2 \cdot (|A| + |F|) + |A \bowtie F| + |A \bowtie F| \cdot |C| = [...]$

Plan enumeration 5: cost estimation (cont'd)

Current candidate plan: (A H-join F) H-join C, with heap scans of A, F, and C.

Remember: $|A| = 500$, $|F| = 1\,000$, $|C| = 10$, $|A \bowtie F| = [...]$

NL-join: scan left input + scan right input once for each page in left input.
H-join (assume 2 passes): 2 $\cdot$ (scan both inputs + hash both inputs into buckets) + read hash buckets with join partners.

Total estimated cost:
    $2 \cdot (|A| + |F|) + |A \bowtie F| + 2 \cdot (|A \bowtie F| + |C|) + |C| = [...]$
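As a worked instance of the nested-loops formula, the following sketch evaluates the plan that uses NL-joins for both joins with heap scans everywhere. Two things are assumptions of this sketch and not taken verbatim from the slides: the size of the intermediate result is set to 1000 pages (one result tuple per Flight tuple, i.e. 1000 pages times 100 tuples, stored at 100 result tuples per page), and the right input of the second join is taken to be Crew with 10 pages.

```python
def nl_rescan_cost(outer_pages, inner_pages):
    """Block NL-join: the inner input is scanned once per page of the outer input.

    The cost of reading the outer input itself is accounted for separately.
    """
    return outer_pages * inner_pages


pages_A = 500     # |A| from the example setup
pages_F = 1000    # |F| from the example setup
pages_C = 10      # |C| from the example setup
pages_AF = 1000   # ASSUMPTION: size of the intermediate result in pages

total = (
    pages_A                                  # heap scan of the outer relation A
    + nl_rescan_cost(pages_A, pages_F)       # scan F once per page of A
    + nl_rescan_cost(pages_AF, pages_C)      # scan C once per page of the A-F result
)
print(total)  # 510500
```

Under these assumptions the cost is dominated by the 500 rescans of Flight, which illustrates why the choice of join algorithm for the first join matters most in this plan.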

Repeated enumeration of identical sub-plans

The plan enumeration reconsiders the same sub-plans over and over again. Cost and result size of a sub-plan are independent of the larger embedding plan.

[Figure: the candidate plans for $(A \bowtie F) \bowtie C$ side by side, combining NL-join and H-join at the top with NL-join, H-join, and INL-join (index scan on F) for the join of A and F; the same sub-plans for joining A and F recur beneath different top-level joins.]

Idea: Remember already considered sub-plans in a memoization data structure. The resulting approach is known as dynamic programming.

10.6 Dynamic programming strategy (System R)

Divide plan enumeration into n passes (for a query with n joined relations):

1 Pass 1 (all 1-relation plans): Find the best 1-relation plans for each relation (i.e., select access method).
2 Pass 2 (all 2-relation plans): Find the best way to join plans of Pass 1 to another relation (generate left-deep trees: sub-plans of Pass 1 appear as outer in join).
...
n Pass n (all n-relation plans): Find the best way to join plans of Pass n-1 to the n-th relation (sub-plans of Pass n-1 appear as outer in join).

A (k-1)-relation sub-plan P is not combined with a k-th relation R unless there is a join condition between the relations in P and R, or all join conditions are already present in P (avoid $\times$ if possible).

Plan enumeration: pruning, interesting orders

For each sub-plan obtained this way, remember cost and result size estimates!

Pruning: For each subset of relations joined, keep only
- the cheapest sub-plan overall, plus
- the cheapest sub-plans that generate an intermediate result with an interesting order of tuples.

An interesting order is determined by
- the presence of an SQL ORDER BY clause in the query,
- the presence of an SQL GROUP BY clause in the query,
- join attributes of subsequent equi-joins (prepare for merge join).
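The pass structure described above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not System R's actual implementation: it assumes a toy cost model in which every join is a block NL-join and every base relation is read by a heap scan, it keeps only the single cheapest plan per subset (interesting orders are ignored), and the intermediate result sizes in `result_pages` are hypothetical values.

```python
from itertools import combinations


def best_left_deep_plan(relations, pages, result_pages, predicates):
    """Return (cost, join order) of the cheapest left-deep plan without cross products."""

    def size(rel_set):
        # Pages of a base relation or of a (hypothetical) intermediate result.
        if len(rel_set) == 1:
            return pages[next(iter(rel_set))]
        return result_pages[rel_set]

    def connected(outer, inner):
        # True if some join predicate links `inner` to a relation in `outer`.
        return any(frozenset((o, inner)) in predicates for o in outer)

    # Pass 1: best 1-relation plans (here: heap scan, cost = page count).
    best = {frozenset([r]): (pages[r], (r,)) for r in relations}

    # Pass k: extend the memoized best (k-1)-relation plans by one inner relation.
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            target = frozenset(subset)
            for inner in subset:
                outer = target - {inner}
                if outer not in best or not connected(outer, inner):
                    continue  # cross product avoidance
                cost = best[outer][0] + size(outer) * pages[inner]
                if target not in best or cost < best[target][0]:
                    best[target] = (cost, best[outer][1] + (inner,))
    return best[frozenset(relations)]


pages = {"A": 500, "F": 1000, "C": 10}
# ASSUMPTION: hypothetical sizes (in pages) of the 2-relation intermediate results.
result_pages = {frozenset("AF"): 1000, frozenset("FC"): 1000}
predicates = {frozenset("AF"), frozenset("FC")}

cost, order = best_left_deep_plan(["A", "F", "C"], pages, result_pages, predicates)
print(cost, order)  # 510010 ('C', 'F', 'A')
```

The dictionary `best` plays the role of the memoization data structure: each subset of relations is costed once and reused by every larger plan that contains it. Extending the sketch to interesting orders would mean keeping one entry per (subset, order) pair instead of one per subset.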

System R style plan enumeration

Example query:
    SELECT a.name, f.airline, c.name
    FROM   Airport a, Flight f, Crew c
    WHERE  f.to = a.code AND f.flightno = c.flightno

Now assume:
- Available join algorithms: merge join (M-join), block NL-join, block INL-join
- Available indexes: clustered B+ tree index I on A.code, height(I) = 3, $|I_{leaf}| = 500$
- $|A| = [...]$, 5 tuples/page
- $|F| = 10$, 10 tuples/page
- $|C| = 10$, 20 tuples/page
- 10 $F \bowtie A$ tuples fit on a page, 10 $F \bowtie C$ tuples fit on a page

System R: Pass 1 (1-relation plans)

Access methods for A:
1 heap scan: cost $= |A| = [...]$
2 index scan on A.code, index I: cost $= |I| + |A| = [...]$
Keep 1 and 2, since 2 has an interesting order on attribute to, which is a join attribute.

Access method for F:
1 heap scan: cost $= |F| = 10$

Access method for C:
1 heap scan: cost $= |C| = 10$

System R: Pass 2 (2-relation plans)

Start with a 1-relation plan to access A as outer: $A \bowtie^{?} F$

Heap scan of A as outer:
1 ? = NL-join: cost = [...] $|F|$ = [...]
2 ? = M-join (assume 2-way sort/merge): cost = [...] $|F|$ + [...] $|F|$ = [...]

Index scan of A as outer:
3 ? = NL-join: cost = [...] $|F|$ = [...]
4 ? = M-join (assume 2-way sort/merge): cost = [...] $|F|$ + [...] $|F|$ = [...]

Keep 4 only (N.B. uses interesting order in non-optimal sub-plan!).

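System R: Pass 2 (cont'd)

Start with F as outer: $F \bowtie^{?} A/C$

A as inner:
1 ? = NL-join, heap scan A: cost $= |F| + |F| \cdot |A| = [...]$
2 ? = INL-join, index scan A: cost $= |F| + \|F\| \cdot (height(I) + 1) = 10 + 100 \cdot (3 + 1) = 410$. Keep!
3 ? = M-join, heap scan A: cost $= |F| + |A| + 2 \cdot (|F| + |A|) = [...]$
4 ? = M-join, index scan A: cost = [...] $|F| + 2 \cdot |F|$ = [...]

C as inner:
5 ? = NL-join: cost $= |F| + |F| \cdot |C| = 10 + 10 \cdot 10 = 110$
6 ? = M-join: cost $= |F| + |C| + 2 \cdot (|F| + |C|) = 10 + 10 + 2 \cdot (10 + 10) = 60$. Keep!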

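System R: Pass 2 (cont'd)

Start with C as outer: $C \bowtie^{?} F$

1 ? = NL-join: cost $= |C| + |C| \cdot |F| = 10 + 10 \cdot 10 = 110$
2 ? = M-join: cost $= |C| + |F| + 2 \cdot (|C| + |F|) = 10 + 10 + 2 \cdot (10 + 10) = 60$. Keep!

N.B. $C \bowtie A$ is not enumerated because of cross product ($\times$) avoidance.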
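The Pass 2 estimates whose numbers are fully legible can be recomputed from the formulas shown. The sketch below encodes the three cost formulas as small functions; the function names are assumptions for illustration, and the tuple count of F (100) follows from 10 pages at 10 tuples/page.

```python
def block_nl_cost(outer_pages, inner_pages):
    """Scan the outer input, and scan the inner input once per outer page."""
    return outer_pages + outer_pages * inner_pages


def merge_join_cost(left_pages, right_pages):
    """2-way sort/merge as on the slides: |L| + |R| + 2 * (|L| + |R|)."""
    return left_pages + right_pages + 2 * (left_pages + right_pages)


def index_nl_cost(outer_pages, outer_tuples, index_height):
    """Scan the outer input, then one index lookup (height + 1 pages) per outer tuple."""
    return outer_pages + outer_tuples * (index_height + 1)


pages_F, pages_C = 10, 10
tuples_F = pages_F * 10   # 10 tuples/page
height_I = 3              # height of the B+ tree index on A.code

print(block_nl_cost(pages_F, pages_C))             # 110
print(merge_join_cost(pages_F, pages_C))           # 60
print(index_nl_cost(pages_F, tuples_F, height_I))  # 410
```

Because F and C have the same page count, the estimates for F as outer and C as outer coincide, which is why Pass 2 ends up with two equally cheap merge join plans for this pair.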

System R: further pruning of 2-relation plans

$A \bowtie F$:
1 M-join (index scan A, scan F): cost = [...], order on to
2 INL-join (scan F, index A): cost = 410, no order

$C \bowtie F$:
3 M-join (scan C, scan F): cost = 60, order on flightno
4 M-join (scan F, scan C): cost = 60, order on flightno

Keep 2 and 3 or 4 (order in 1 not interesting for subsequent join(s)).

System R: Pass 3 (3-relation plans)

Best $(A \bowtie F)$ sub-plan: cost = 410, no order, $|A \bowtie F| = 10$

1 (scan F INL-join index A) NL-join C: cost = [...] $|A \bowtie F| \cdot |C|$ = [...]
2 (scan F INL-join index A) M-join C: cost = [...] $|C| + 2 \cdot (|A \bowtie F| + |C|)$ = [...]

System R: Pass 3 (cont'd)

Best $(C \bowtie F)$ sub-plan: cost = 60, order on flightno, $|C \bowtie F| = 10$, $\|C \bowtie F\| = 100$

1 (scan F M-join scan C) NL-join A (scan): cost = [...]
2 (scan F M-join scan C) M-join A (scan): cost = [...]
3 (scan F M-join scan C) INL-join A (index): cost = [...] = 460
4 (scan F M-join scan C) M-join A (index scan): cost = [...]

System R: And the winner is...

(scan F M-join scan C) INL-join A (index): cost = 460

Observations:
- The best plan mixes join algorithms and exploits indexes.
- The worst plan had cost > [...] (exact cost unknown due to pruning).
- Optimization yielded a 1000-fold improvement over the worst plan!

Bibliography

Astrahan, M. M., Schkolnick, M., and Kim, W. (1980). Performance of the System R access path selection mechanism. In IFIP Congress, pages [...].

Chamberlin, D., Astrahan, M., Blasgen, M., Gray, J., King, W., Lindsay, B., Lorie, R., Mehl, J., Price, T., Putzolu, F., Selinger, P., Schkolnick, M., Shultz, D., Traiger, I., Wade, B., and Yost, R. (1981). History and evaluation of System/R. Communications of the ACM, 24(10): [...].

Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to Algorithms. MIT Press.

Jarke, M. and Koch, J. (1984). Query optimization in database systems. ACM Computing Surveys, 16(2): [...].

Ramakrishnan, R. and Gehrke, J. (2003). Database Management Systems. McGraw-Hill, New York, 3rd edition.

Kim, W., Reiner, D. S., and Batory, D. S., editors (1985). Query Processing in Database Systems. Springer-Verlag.