Window-aware Load Shedding for Aggregation Queries over Data Streams


Window-aware Load Shedding for Aggregation Queries over Data Streams (1/26)
Nesime Tatbul, Stan Zdonik

Talk Outline (2/26)
- Background
- Load shedding in Aurora
- Windowed aggregation queries
- Window-aware load shedding
- Experimental results
- Related work
- Conclusions and future directions

Load Shedding in Aurora (3/26)
Aurora runs a query network with QoS specifications at its outputs.
- Problem: when load exceeds capacity, latency QoS degrades.
- Solution: insert drop operators into the query plan.
- Result: deliver approximate answers with low latency.

Subset-based Approximation (4/26)
- For all queries, the delivered tuple set must be a subset of the original query answer set.
  - Maximum subset measure (e.g., [SIGMOD 03, VLDB 04])
- Loss-tolerance QoS of Aurora [VLDB 03]
  - For each query, the number of consecutive result tuples missed (i.e., the gap tolerance) must be below a certain threshold.

Why Subset Results? (5/26)
- The application may expect to get correct values.
- Missing tuples get updated with more recent ones.
- Preserving the subset guarantee anywhere in the query plan helps understand how the error propagates.
- It depends on the application.

Windows (6/26)
Finite excerpts of a data stream, with two parameters: size (ω) and slide (δ).
Example: StockQuote(symbol, time, price), size = 10 min, slide = 5 min
(IBM, 10:00, 20) (INTC, 10:00, 15) (MSFT, 10:00, 22)
(IBM, 10:05, 18) (MSFT, 10:05, 21)
(IBM, 10:10, 18) (MSFT, 10:10, 20)
(IBM, 10:15, 20) (INTC, 10:15, 20) (MSFT, 10:15, 20) ..
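The size/slide pattern above can be sketched in a few lines (a hypothetical helper, not from the talk): it enumerates the time extents of the sliding windows over the StockQuote stream.

```python
def window_extents(start, end, size, slide):
    """Yield [lo, lo + size) extents of sliding windows whose starts lie in [start, end)."""
    lo = start
    while lo < end:
        yield (lo, lo + size)
        lo += slide

# Minutes since 10:00, size = 10, slide = 5, as in the StockQuote example:
print(list(window_extents(0, 20, size=10, slide=5)))
# [(0, 10), (5, 15), (10, 20), (15, 25)]
```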

Windowed Aggregation (7/26)
Apply an aggregate function on the window:
- Average, Sum, Count, Min, Max
- User-defined functions
Can have grouping, nesting, and sharing.
Example:
Filter (symbol = IBM) -> Aggregate (ω = 5 min, δ = 5 min, diff = high - low) -> Filter (diff > 5) -> Aggregate (ω = 60 min, δ = 60 min, count) -> Filter (count > 0)

Dropping from an Aggregation Query: Tuple-based Approach (8/26)
No drop: .. 30 15 30 20 10 30 -> Average (ω = 3, δ = 3) -> .. 25 20
Drop before the aggregate: non-subset result of nearly the same size
.. 30 15 30 20 10 30 -> Drop (p = 0.5) -> Average (ω = 3, δ = 3) -> .. 15 15
Drop after the aggregate: subset result of smaller size
.. 30 15 30 20 10 30 -> Average (ω = 3, δ = 3) -> .. 25 20 -> Drop (p = 0.5) -> a subset of .. 25 20

Dropping from an Aggregation Query: Window-based Approach (9/26)
No drop: .. 30 15 30 20 10 30 -> Average (ω = 3, δ = 3) -> .. 25 20
Drop before the aggregate: subset result of smaller size
.. 30 15 30 20 10 30 -> Window Drop (ω = 3, δ = 3, p = 0.5) -> Average (ω = 3, δ = 3) -> a subset of .. 25 20
This is window-aware load shedding.
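A minimal sketch of the contrast between the two slides above (my own toy code, with a deterministic drop standing in for p = 0.5): dropping tuples before the aggregate re-forms windows from survivors and yields non-subset answers, while dropping whole windows keeps every delivered answer exact.

```python
def averages(stream, size=3, slide=3):
    """Tumbling-window average (omega = size, delta = slide)."""
    return [sum(stream[i:i + size]) / size
            for i in range(0, len(stream) - size + 1, slide)]

stream = [30, 15, 30, 20, 10, 30]
exact = averages(stream)              # [25.0, 20.0]

# Tuple-based drop before the aggregate (every other tuple, a deterministic
# stand-in for p = 0.5): survivors re-form windows, so answers need not be
# a subset of the exact answers.
approx_tuple = averages(stream[::2])  # average of [30, 30, 10]

# Window-based drop (drop window 0, keep window 1): every delivered answer
# is one of the exact answers -- the subset guarantee.
approx_window = [exact[1]]

assert not set(approx_tuple) <= set(exact)
assert set(approx_window) <= set(exact)
print(exact, approx_tuple, approx_window)
```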

Window Drop Operator (10/26)
Functionality:
- Attach window keep/drop specifications to tuples.
- Discard tuples that can be early-dropped.
Key parameters:
- ω: window size
- δ: window slide
- p: drop probability (one per group)
- B: batch size (one per group)
Window specification values:
- -1: Don't care.
- 0: Drop the window.
- T (T > 0): Keep the window; keep all tuples with timestamp < T.

Window Drop for Nested Aggregation (Pipeline Arrangements) (11/26)
Plan: Window Drop (ω, δ, B) -> Aggregate 1 (ω_1, δ_1) -> ... -> Aggregate n (ω_n, δ_n, B)
Combination rule:
ω = (sum over i = 1..n of ω_i) - (n - 1)
δ = δ_n
B = B
Example: ω_1 = 3, δ_1 = 2 and ω_2 = 3, δ_2 = 3 give ω = 5, δ = 3.
(Aggregate 1 windows start at 1, 3, 5, 7, 9, ..; Aggregate 2 and the combined window drop start windows at 1, 4, 7, 10, ..)
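The pipeline rule above can be checked with a small helper (the function name is mine, not from the talk):

```python
def combine_pipeline(aggs):
    """Combine pipelined aggregates (upstream first) into one window-drop spec.
    aggs: list of (omega, delta); rule: omega = sum(omega_i) - (n - 1), delta = delta_n."""
    omega = sum(w for w, _ in aggs) - (len(aggs) - 1)
    delta = aggs[-1][1]
    return omega, delta

# Slide's example: (omega_1, delta_1) = (3, 2) feeding (omega_2, delta_2) = (3, 3):
print(combine_pipeline([(3, 2), (3, 3)]))  # (5, 3)
```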

Window Drop for Shared Aggregation (Fan-out Arrangements) (12/26)
Plan: Window Drop (ω, δ, B) feeding Aggregate 1 (ω_1, δ_1, B_1), ..., Aggregate n (ω_n, δ_n, B_n) in parallel.
Combination rule:
ω = lcm(δ_1, .., δ_n) + max over i = 1..n of extent(A_i), where extent(A_i) = ω_i - δ_i
δ = lcm(δ_1, .., δ_n)
B = min over i = 1..n of ( B_i / (lcm(δ_1, .., δ_n) / δ_i) )
Example: ω_1 = 3, δ_1 = 3 (extent = 0) and ω_2 = 3, δ_2 = 2 (extent = 1) give ω = 7, δ = 6.
(Common window start every 6 units: 1, 7, ..; Aggregate 1 windows start at 1, 4, 7, 10, ..; Aggregate 2 windows start at 1, 3, 5, 7, 9, ..)
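Likewise for the fan-out rule, a sketch with my own helper name (uses math.lcm, available in Python 3.9+):

```python
import math
from functools import reduce

def combine_fanout(aggs):
    """Combine aggregates sharing one input into a single window-drop spec.
    aggs: list of (omega, delta); extent(A_i) = omega_i - delta_i."""
    lcm = reduce(math.lcm, [d for _, d in aggs])  # common window-start period
    omega = lcm + max(w - d for w, d in aggs)
    return omega, lcm

# Slide's example: (3, 3) with extent 0 and (3, 2) with extent 1:
print(combine_fanout([(3, 3), (3, 2)]))  # (7, 6)
```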

Window Drop for Nesting + Sharing (Composite Arrangements) (13/26)
Apply the two basic rules in any order.
Plan: Window Drop (ω, δ, B) -> Aggregate 0 (ω_0 = 4, δ_0 = 1) feeding Aggregate 1 (ω_1 = 3, δ_1 = 3, B_1) and Aggregate 2 (ω_2 = 3, δ_2 = 2, B_2).
Fan-out, then pipeline: fan-out of Aggregates 1 and 2 gives ω = 7, δ = 6; pipelining that with Aggregate 0 gives ω = 10, δ = 6.
Pipeline, then fan-out: pipelining Aggregate 0 with Aggregate 1 gives ω = 6, δ = 3, and with Aggregate 2 gives ω = 6, δ = 2; fan-out of these gives ω = 10, δ = 6.

Window Drop in Action (14/26)
Mark tuples and apply early drops at the Window Drop (marks: drop, keep):
.. 30 15 30 20 10 30 -> Window Drop (ω = 3, δ = 3, p = 0.5) -> .. 30 20 10 30 with window specs .. 0 -1 -1 4
(early-droppable tuples are discarded; one fake tuple remains to carry its window spec)
Decode the marks at the Aggregate (actions: skip, open):
.. 30 20 10 30 with specs .. 0 -1 -1 4 -> Average (ω = 3, δ = 3) -> .. 20 with spec .. 1

Fake Tuples (15/26)
Fake tuples carry no data, only window specs. They arise when:
- the tuple content is not needed, as the tuple only indicates a window drop
- the tuple to mark is originally missing, or was filtered out during query processing
Overhead is low.

Early Drops (16/26)
Sliding windows may overlap. A window drop can discard a tuple if and only if all of its windows are going to be dropped.
At steady state, each tuple belongs to about ω/δ windows (i.e., early drops require B >= ω/δ).
If B < ω/δ, then there is no need to push the window drop further upstream than the first aggregate in the pipeline.
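A quick way to see the ω/δ figure (a hypothetical helper, not from the talk): count the sliding windows, aligned at multiples of δ, whose extent covers a given timestamp.

```python
def windows_containing(t, omega, delta):
    """Start times of windows [s, s + omega), s in {0, delta, 2*delta, ...},
    that contain timestamp t."""
    first = max(0, ((t - omega) // delta + 1) * delta)  # smallest start > t - omega
    return list(range(first, t + 1, delta))

# With omega = 10 and delta = 2, a steady-state tuple falls into omega/delta = 5 windows:
print(len(windows_containing(100, omega=10, delta=2)))  # 5
```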

Talk Outline (17/26)
- Background
- Load shedding in Aurora
- Windowed aggregation queries
- Window-aware load shedding
- Experimental results
- Related work
- Conclusions and future directions

Experiments (18/26)
Setup:
- Single-node Borealis
- Streams: (time, value) pairs
- Aggregation queries with delays
- Basic measure: % total loss
Goals:
- Performance on nested and shared query plans
- Effect of window parameters
- Processing overhead

Performance for Nested Plans (19/26)
Query: Delay 5ms -> Aggregate (10, 10) -> Delay 50ms -> Aggregate (100, 100) -> Delay 500ms
[Plot: % Loss (0-100) vs. % Excess Rate (7-100), comparing Window Drop and Random Drop]

Performance for Shared Plans (20/26)
Query: Delay 5ms feeding two branches: Aggregate (10, 10) -> Delay 50ms, and Aggregate (20, 20) -> Delay 100ms
[Plot: % Total Loss (0-100) vs. % Excess Rate (20-100), comparing Window Drop and Random Drop]

Effect of Window Slide (21/26)
Query: Window Drop (ω = 100, δ) -> Delay 5ms -> Aggregate (ω = 100, δ) -> Delay 5ms
[Plot: % Loss (0-100) vs. window slide δ (1, 2, 10, 100), at excess rates of 25%, 50%, and 66%]

Processing Overhead (22/26)
Query: Window Drop (ω, δ = ω) -> Filter -> Delay 10ms -> Aggregate (ω, δ = ω)
Throughput ratio (Window Drop with p = 0 vs. no Window Drop):

Window size (ω):     25    50    75    100
Selectivity = 1.0:   0.99  0.99  1.0   1.0
Selectivity = 0.5:   0.96  0.98  0.98  1.0

Related Work (23/26)
- Load shedding for aggregation queries: statistical approach of Babcock et al. [ICDE 04]
- Punctuation-based stream processing: Tucker et al. [TKDE 03], Li et al. [SIGMOD 05]
- Other load shedding work (examples):
  - General: Aurora [VLDB 03], Data Triage [ICDE 05]
  - On joins: Das et al. [SIGMOD 03], STREAM [VLDB 04]
  - With optimization: NiagaraCQ [ICDE 03, SIGMOD 04]
- Approximate query processing on aggregates: synopsis-based approaches, online aggregation [SIGMOD 97]

Conclusions (24/26)
We provide a window-aware load shedding technique for a general class of aggregation queries over data streams with:
- sliding windows
- user-defined aggregates
- nesting, sharing, grouping
Window Drop preserves window integrity, enabling:
- easy control of error propagation
- a subset guarantee for query results
- early load reduction that minimizes error

Future Directions (25/26)
- Prediction-based load shedding: can we guess the missing values? Statistical approaches vs. subset-based approaches
- Window-awareness on joins
- Memory-constrained environments
- Distributed environments

Questions? (26/26)

Decoding Window Specifications (27/26)

Window start? | Window Spec | Keep Boundary | To Do
yes           | T (T > 0)   | *             | Open window. Set boundary = T - (ω - 1). Set spec = T - (ω - 1).
yes           | 0 or -1     | within        | Open window. Set boundary = t + ω (if greater).
yes           | 0 or -1     | beyond        | Skip window.
no            | T (T > 0)   | *             | Set boundary = T - (ω - 1). Set spec = T - (ω - 1). Mark as fake tuple.
no            | 0           | *             | Mark as fake tuple.
no            | -1          | *             | Ignore.