CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing
Spring 2016
Notes 4: Query Optimization

Query Optimization
- Cost estimation
- Strategies for exploring the space of plans for a query Q to find the minimum-cost plan

Cost Estimation
- Based on estimating result sizes
- Like in centralized databases
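To make the result-size idea concrete, here is a minimal sketch (not from the notes) of the textbook equi-join size estimate used in centralized optimizers; the function name and the statistics format are assumptions for illustration.

```python
# Sketch: classic equi-join size estimate, assuming we keep cardinalities
# and per-attribute distinct-value counts V(R, A) in a simple stats dict.
def estimate_join_size(stats_r, stats_s, attr):
    # |R JOIN S| ~= |R| * |S| / max(V(R, A), V(S, A))
    v = max(stats_r["distinct"][attr], stats_s["distinct"][attr])
    return stats_r["card"] * stats_s["card"] / v

stats_r = {"card": 10_000, "distinct": {"A": 100}}   # made-up statistics
stats_s = {"card": 2_000,  "distinct": {"A": 400}}
print(estimate_join_size(stats_r, stats_s, "A"))     # 10000 * 2000 / 400 = 50000.0
```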

Cost Estimation
- But the number of IOs may not be the best metric
- E.g., transmission time may dominate
[Figure: work at site 1 (T1), work at site 2 (T2), and transmission of the answer; cost measured as time or $]

Cost Estimation
- Another reason why the number of IOs is not enough: parallelism
[Figure: Plan A vs. Plan B, with IO counts of 100, 50, 70, and 50 spread over sites 1-3; a plan with more total IOs can still respond faster if its IOs proceed in parallel at different sites]

Cost Estimation
- Cost metrics such as IOs, bytes transmitted, or $ are additive
- A response time metric is not additive: it needs scheduling and dependency information (e.g., Task 1 feeding Tasks 2 and 3), and skew is important
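A small sketch (illustrative, not from the notes) of the difference: an additive metric simply sums task costs, while response time follows the critical path through the dependency graph, since independent tasks can overlap. The task names, costs, and dependencies below are made up.

```python
# Made-up tasks: t2 and t3 both depend on t1 and can run in parallel.
costs = {"t1": 10, "t2": 7, "t3": 4}
deps = {"t1": [], "t2": ["t1"], "t3": ["t1"]}

def additive_cost(costs):
    # Additive metric: total work, regardless of who runs when.
    return sum(costs.values())

def finish_time(task, memo=None):
    # Response time contribution: critical-path length up to and including this task.
    memo = {} if memo is None else memo
    if task not in memo:
        memo[task] = costs[task] + max(
            (finish_time(d, memo) for d in deps[task]), default=0)
    return memo[task]

print(additive_cost(costs))                 # 21
print(max(finish_time(t) for t in costs))   # 17: t1 (10) then t2 (7); t3 overlaps with t2
```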

Cost Estimation
Also take into account:
- Start-up cost
- Data distribution cost/time
- Resource contention (for memory, disk, network)
- Cost of assembling results

Cost Estimation
Response time example:
[Figure: timeline across sites 1-4 with phases start-up, distribution, searching + sending results, and final processing]
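As an illustration of the shape of the figure (the numbers below are invented): start-up and distribution happen up front, the per-site searching + sending overlaps across sites, and final processing starts only after the slowest site finishes.

```python
# Sketch: response time = start-up + distribution + slowest site + final processing.
# All numbers are made up for illustration.
startup, distribution, final_processing = 2, 3, 4
search_and_send = {"site1": 8, "site2": 11, "site3": 7, "site4": 9}  # runs in parallel

response_time = startup + distribution + max(search_and_send.values()) + final_processing
print(response_time)  # 2 + 3 + 11 + 4 = 20
```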

Search Strategies
- Exhaustive (with pruning)
- Hill climbing (greedy)
- Query separation

Exhaustive Search
- Consider all query plans (given a set of techniques for the operators)
- Prune some plans using heuristics

Exhaustive Search
Example: R ⋈_A S ⋈_B T, with |R| > |S| > |T|
[Figure: plan tree. Candidate first joins R ⋈ S, R ⋈ T, S ⋈ T; surviving branches expand into implementation choices such as (S ⋈ R) ⋈ T via "ship S to R" or a semijoin, and (T ⋈ S) ⋈ R via "ship T to S" or a semijoin]
- Prune the R ⋈ T branch: the cross-product is not necessary (R and T share no join attribute)
- Prune branches that put the larger relation first
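A minimal sketch of exhaustive enumeration with the two pruning heuristics from the slide (no cross-products; don't start with the larger relation of a join pair). The concrete sizes and the join-attribute map are assumptions chosen only to satisfy |R| > |S| > |T|.

```python
from itertools import permutations

# Assumed setup: R joins S on A, S joins T on B; sizes chosen so |R| > |S| > |T|.
sizes = {"R": 300, "S": 200, "T": 100}
join_attrs = {frozenset({"R", "S"}): "A", frozenset({"S", "T"}): "B"}

def plans():
    for order in permutations(sizes):                 # all join orders
        first, second = order[0], order[1]
        # Prune: first join would be a cross-product (no shared attribute).
        if frozenset({first, second}) not in join_attrs:
            continue
        # Prune: don't put the larger relation first.
        if sizes[first] > sizes[second]:
            continue
        yield order

for order in plans():
    print(" JOIN ".join(order))
# Surviving orders: S JOIN R JOIN T and T JOIN S JOIN R,
# matching the (S ⋈ R) ⋈ T and (T ⋈ S) ⋈ R branches on the slide.
```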

Search Strategies
In generating plans, keep the goal in mind:
- If the goal is parallelism (in a system with a fast network), consider partitioning relations first
- If the goal is reducing network traffic, consider semijoins

Hill Climbing
[Figure: plan space around an initial plan x, with better plans above and worse plans below; each local move (1, 2, ...) steps to a better neighboring plan]

Hill Climbing
Example: R ⋈_A S ⋈_B T ⋈_C V

  relation   site   size
  R          1      10
  S          2      20
  T          3      30
  V          4      40

Tuple size = 1. Goal: minimize data transmission.

Hill Climbing
Initial plan: send all relations to one site. Which site do we send them to?
- To site 1: cost = 20 + 30 + 40 = 90
- To site 2: cost = 10 + 30 + 40 = 80
- To site 3: cost = 10 + 20 + 40 = 70
- To site 4: cost = 10 + 20 + 30 = 60
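A tiny sketch of this initial-plan choice: ship everything to the site whose local relation is largest, i.e., the destination minimizing total bytes sent (sizes and home sites taken from the example above; tuple size 1).

```python
# Relation sizes and home sites from the example (tuple size = 1).
size = {"R": 10, "S": 20, "T": 30, "V": 40}
site = {"R": 1, "S": 2, "T": 3, "V": 4}

def ship_all_cost(dest):
    # Cost of shipping every relation that is not already stored at dest.
    return sum(size[r] for r in size if site[r] != dest)

costs = {d: ship_all_cost(d) for d in (1, 2, 3, 4)}
print(costs)                      # {1: 90, 2: 80, 3: 70, 4: 60}
print(min(costs, key=costs.get))  # 4
```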

Hill Climbing
Plan P0:
- R (1 → 4)
- S (2 → 4)
- T (3 → 4)
- Compute R ⋈ S ⋈ T ⋈ V at site 4

Hill Climbing
Local search: consider sending each relation to a neighboring site first
[Figure: e.g., instead of shipping R (1 → 4) and S (2 → 4) directly to site 4, ship R to site 2, join it with S there, and ship the result to site 4]

Hill Climbing
Assume |R ⋈ S| = 20, |S ⋈ T| = 5, |T ⋈ V| = 1.
Option A: ship R (1 → 2), compute R ⋈ S at site 2, ship the result (2 → 4).
- Direct shipping: cost = 10 + 20 = 30
- Option A: cost = 10 + 20 = 30
- No savings

Hill Climbing
Option B: ship S (2 → 1), compute S ⋈ R at site 1, ship the result (1 → 4).
- Direct shipping: cost = 10 + 20 = 30
- Option B: cost = 20 + 20 = 40
- Worse off

Hill Climbing
Option C: ship T (3 → 2), compute T ⋈ S at site 2, ship the result (2 → 4).
- Direct shipping: cost = 20 + 30 = 50
- Option C: cost = 30 + 5 = 35
- Win

Hill Climbing
Option D: ship S (2 → 3), compute S ⋈ T at site 3, ship the result (3 → 4).
- Direct shipping: cost = 20 + 30 = 50
- Option D: cost = 20 + 5 = 25
- Bigger win
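A sketch of how these four local moves can be scored (sizes and intermediate-result sizes are from the example; the helper names are mine): each move replaces "ship both relations directly to site 4" with "ship one relation to the other's site, join there, ship the result".

```python
# Sizes from the example; join_size holds the assumed intermediate-result sizes.
size = {"R": 10, "S": 20, "T": 30, "V": 40}
join_size = {frozenset({"R", "S"}): 20, frozenset({"S", "T"}): 5, frozenset({"T", "V"}): 1}

def direct_cost(x, y):
    # Ship x and y separately to the final site.
    return size[x] + size[y]

def join_first_cost(shipped, kept):
    # Ship `shipped` to `kept`'s site, join there, ship the result to the final site.
    return size[shipped] + join_size[frozenset({shipped, kept})]

for name, (shipped, kept) in {"A": ("R", "S"), "B": ("S", "R"),
                              "C": ("T", "S"), "D": ("S", "T")}.items():
    print(name, direct_cost(shipped, kept), "->", join_first_cost(shipped, kept))
# A 30 -> 30 (no savings), B 30 -> 40 (worse), C 50 -> 35 (win), D 50 -> 25 (bigger win)
```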

Hill Climbing
Plan P1:
- S (2 → 3); α = S ⋈ T computed at site 3
- R (1 → 4)
- α (3 → 4)
- Compute the answer at site 4

Hill Climbing
Repeat the local search, treating α = S ⋈ T as a relation.
[Figure: with α at site 3 and R at site 1, compare shipping α and R directly to site 4 vs. shipping α to site 1 (joining with R there) vs. shipping R to site 3 (joining with α there)]

Hill Climbing
Hill climbing may miss the best plan. E.g., the best plan could be P_best:
- T (3 → 4); β = T ⋈ V
- β (4 → 2); β' = β ⋈ S
- β' (2 → 1); β'' = β' ⋈ R
- β'' (1 → 4) (optional)
- Compute the answer
[Figure: data flow T → site 4 (join with V), β → site 2 (join with S), β' → site 1 (join with R), β'' → site 4]

Hill Climbing
The same plan P_best, with transmission costs:
- T (3 → 4): 30
- β (4 → 2): 1
- β' (2 → 1): 1
- β'' (1 → 4): 1
- Total = 33
Costs can be this low because β = T ⋈ V is very selective (size 1).
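A small sketch that totals the transmission cost of a plan written as a list of (payload size, source site, destination site) ship steps. Using the example's sizes, the plan hill climbing reaches (P1: R, S, and α shipped for a total of 35) costs more than P_best (33), which is not reachable by single local moves.

```python
# A plan, for costing purposes, is just its ship steps: (payload, from_site, to_site).
def plan_cost(steps):
    return sum(payload for payload, _src, _dst in steps)

# P1 from the hill-climbing run: R (1->4) = 10, S (2->3) = 20, alpha = S JOIN T (3->4) = 5.
p1 = [(10, 1, 4), (20, 2, 3), (5, 3, 4)]

# P_best from the slide: T (3->4) = 30, beta (4->2) = 1, beta' (2->1) = 1, beta'' (1->4) = 1.
p_best = [(30, 3, 4), (1, 4, 2), (1, 2, 1), (1, 1, 4)]

print(plan_cost(p1))      # 35
print(plan_cost(p_best))  # 33 -- better, but hill climbing never finds it
```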

Search Strategies
- Exhaustive (with pruning)
- Hill climbing (greedy)
- Query separation

Query Separation
- Separate the query into two or more steps
- Optimize each step independently

Query Separation
Example (simple-queries technique): σ_c1 [ σ_c2(R) ⋈_A σ_c3(S) ]
[Figure: query tree with σ_c1 at the root, a join on attribute A below it, and σ_c2(R) and σ_c3(S) as its inputs]

Query Separation
1. Compute R' = π_A[ σ_c2(R) ] and S' = π_A[ σ_c3(S) ]
2. Compute J = R' ⋈ S'
3. Compute the answer: σ_c1 { [ J ⋈ σ_c2(R) ] ⋈ [ J ⋈ σ_c3(S) ] }

Query Separation
1. Compute R' = π_A[ σ_c2(R) ] and S' = π_A[ σ_c3(S) ]
2. Compute J = R' ⋈ S' (compute the A values that appear in the answer first)
3. Compute the answer: σ_c1 { [ J ⋈ σ_c2(R) ] ⋈ [ J ⋈ σ_c3(S) ] } (get the matching tuples from the sites and compute the answer next)
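A sketch of the three steps on relations held as Python lists of dicts. The table contents and the predicates standing in for c1, c2, c3 are all invented: step 1 and the selections run locally, step 2 is the simple (keys-only) query, and step 3 fetches only the tuples whose A value appears in J.

```python
# Made-up local relations; c1/c2/c3 are placeholder predicates.
R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}, {"A": 3, "x": "r3"}]
S = [{"A": 2, "y": "s2"}, {"A": 3, "y": "s3"}, {"A": 4, "y": "s4"}]
c2 = lambda t: t["A"] != 3
c3 = lambda t: True
c1 = lambda t: True

# Step 1 (local): project the join attribute A out of the selected tuples.
R_keys = {t["A"] for t in R if c2(t)}
S_keys = {t["A"] for t in S if c3(t)}

# Step 2 (simple query): join the single-attribute relations.
J = R_keys & S_keys

# Step 3: pull only the matching tuples from each site and assemble the answer.
answer = [dict(r, **s)
          for r in R if c2(r) and r["A"] in J
          for s in S if c3(s) and s["A"] == r["A"] and c1(dict(r, **s))]
print(answer)   # [{'A': 2, 'x': 'r2', 'y': 's2'}]
```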

Query Separation
A simple query is one where:
- The relations have a single attribute
- The output has a single attribute
- E.g., J = R' ⋈ S'

Query Separation
Idea:
1. Decompose the query into local processing, a simple query (or queries), and final processing
2. Optimize the simple query

Query Separation
Philosophy:
- The hard part is the distributed join
- Do this part with only the keys; get the rest of the data later
- Simple queries are simpler to optimize

Summary
- Cost estimation
- Optimization strategies: exhaustive (with pruning), hill climbing (greedy), query separation

Words of Wisdom
- Optimization is like playing chess: you may have to make sacrifices for later gains
- Move data, partition relations, build indexes