Data Canopy. Accelerating Exploratory Statistical Analysis. Abdul Wasay Xinding Wei Niv Dayan Stratos Idreos

Similar documents
NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

Large-Scale Behavioral Targeting

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

Wavelets for Efficient Querying of Large Multidimensional Datasets

Claude Tadonki. MINES ParisTech PSL Research University Centre de Recherche Informatique

FPGA Implementation of a Predictive Controller

In-Database Factorised Learning fdbresearch.github.io

RAID+: Deterministic and Balanced Data Distribution for Large Disk Enclosures

Window-aware Load Shedding for Aggregation Queries over Data Streams

1 Approximate Quantiles and Summaries

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

416 Distributed Systems

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

django in the real world

ArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Shreyas Shinde

QR Decomposition in a Multicore Environment

Notation. Bounds on Speedup. Parallel Processing. CS575 Parallel Processing

P Q1 Q2 Q3 Q4 Q5 Tot (60) (20) (20) (20) (60) (20) (200) You are allotted a maximum of 4 hours to complete this exam.

Progressive & Algorithms & Systems

ECEN 689 Special Topics in Data Science for Communications Networks

Chart types and when to use them

Performance Metrics for Computer Systems. CASS 2018 Lavanya Ramapantulu

CSC 1700 Analysis of Algorithms: Warshall s and Floyd s algorithms

Weather Prediction Using Historical Data

Behavioral Simulations in MapReduce

Using Oracle Rdb Partitioned Lock Trees. Norman Lastovica Oracle Rdb Engineering November 13, 06

The Design Procedure. Output Equation Determination - Derive output equations from the state table

Statistical Clustering of Vesicle Patterns Practical Aspects of the Analysis of Large Datasets with R

Data analysis of massive data sets a Planck example

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem

Where to Find My Next Passenger?

RESEARCH ON THE DISTRIBUTED PARALLEL SPATIAL INDEXING SCHEMA BASED ON R-TREE

Multi-Approximate-Keyword Routing Query

GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications

Line Of Balance. Dr. Ahmed Elyamany

Summarizing Measured Data

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

The conceptual view. by Gerrit Muller University of Southeast Norway-NISE

Applied Cartography and Introduction to GIS GEOG 2017 EL. Lecture-2 Chapters 3 and 4

Introduction to Column Stores with MonetDB and Benchmark

Scikit-learn. scikit. Machine learning for the small and the many Gaël Varoquaux. machine learning in Python

LECTURE 04: LINEAR REGRESSION PT. 2. September 20, 2017 SDS 293: Machine Learning

Direct and Incomplete Cholesky Factorizations with Static Supernodes

Large-scale Linear RankSVM

Pysynphot: A Python Re Implementation of a Legacy App in Astronomy

Gridless DSMC. Spencer Olson. University of Michigan Naval Research Laboratory Now at: Air Force Research Laboratory

Sparse BLAS-3 Reduction

Non-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund

In-Database Learning with Sparse Tensors

Dictionary: an abstract data type

A Blackbox Polynomial System Solver on Parallel Shared Memory Computers

SPATIAL INDEXING. Vaibhav Bajpai

ArcGIS Deployment Pattern. Azlina Mahad

Lecture 11 Linear programming : The Revised Simplex Method

Weather Research and Forecasting (WRF) Performance Benchmark and Profiling. July 2012

An Integrative Model for Parallelism

High Performance Computing

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Towards Indexing Functions: Answering Scalar Product Queries Arijit Khan, Pouya Yanki, Bojana Dimcheva, Donald Kossmann

INF2220: algorithms and data structures Series 1

Combinational Logic Design Combinational Functions and Circuits

Statistics I Exercises Lesson 3 Academic year 2015/16

CS246 Final Exam, Winter 2011

Quiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions

MANAGING STORAGE STRUCTURES FOR SPATIAL DATA IN DATABASES ABSTRACT

Block AIR Methods. For Multicore and GPU. Per Christian Hansen Hans Henrik B. Sørensen. Technical University of Denmark

Hardware Design I Chap. 4 Representative combinational logic

CHAPTER 4: DATASETS AND CRITERIA FOR ALGORITHM EVALUATION

STAT 520: Forecasting and Time Series. David B. Hitchcock University of South Carolina Department of Statistics

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

6. Iterative Methods for Linear Systems. The stepwise approach to the solution...

ArcGIS GeoAnalytics Server: An Introduction. Sarah Ambrose and Ravi Narayanan

Reliability and Risk Analysis. Time Series, Types of Trend Functions and Estimates of Trends

Ensemble Methods and Random Forests

Stochastic Gradient Descent. CS 584: Big Data Analytics

Global Optimization of Common Subexpressions for Multiplierless Synthesis of Multiple Constant Multiplications

Data Exploration and Unsupervised Learning with Clustering

Enhancing Reuse of Constraint Solutions to Improve Symbolic Execution

CS6220: DATA MINING TECHNIQUES

Scalable 3D Spatial Queries for Analytical Pathology Imaging with MapReduce

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Digital Logic: Boolean Algebra and Gates. Textbook Chapter 3

Query Optimization: Exercise

META: An Efficient Matching-Based Method for Error-Tolerant Autocompletion

Predicting New Search-Query Cluster Volume

Complex Dynamics of Microprocessor Performances During Program Execution

ECE521 W17 Tutorial 1. Renjie Liao & Min Bai

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Patent Searching using Bayesian Statistics

SHIFT-SPLIT: I/O Efficient Maintenance of Wavelet-Transformed Multidimensional Data

Belief Update in CLG Bayesian Networks With Lazy Propagation

CSE 4201, Ch. 6. Storage Systems. Hennessy and Patterson

Concurrent Divide-and-Conquer Library

Database Design and Normalization

4.8 Efficiency Experts A Solidify Understanding Task

Transcription:

Accelerating Exploratory Statistical Analysis Abdul Wasay inding Wei Niv Dayan Stratos Idreos

Statistics are everywhere! Algorithms Systems Analytic Pipelines

80 Temperature 60 40 20 0 May 2017

80 Temperature 60 40 20 0 May 2017 Mean

80 Temperature 60 40 20 0 May 2017 Variance

+ Lemonade Sale 8 6 4 2 0 Correlation - Temperature Hot Choc. Sale 80 60 40 20 0 70 53 35 18 0 Correlation

Repetitive Statistics

Repetition takes multiple forms Query Range Q1 Time Q2 Q3 Column Sub-range

Repetition takes multiple forms Query Range Q1 Time Q2 Q3 Column Overlap

Repetition takes multiple forms Query Range Q1 S1 Time Q2 S2 Q3 S3 Column Different Statistics

Repetition takes multiple forms Query Range Q1 S1 Time Q2 S2 Q3 S3 Column Mixed

Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* 100 99.80 99.70 97.00 Queries (%) 50 54.65 36.93 0 SQLShare 4.00 SDSS * at least once SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* 100 99.80 99.70 97.00 Queries (%) Repetition 54.65is everywhere - between 50% to 99% 50 36.93 0 SQLShare 4.00 SDSS * at least once SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB

How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB Sequence of Queries Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB Sequence of Queries Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

How Do Existing Tools Perform? NumPy Modeltools MonetDB Normalized execution time 1 0.75 0.5 0.25 0 Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

How Do Existing Tools Perform? NumPy Modeltools MonetDB Normalized execution time 1 0.75 0.5 0.25 0 Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

Data

Existing systems always compute statistics from scratch Data

Statistical queries Library of building blocks Data

Statistical queries Library of building blocks Data

Statistical queries Library of building blocks Data

Statistical queries Avoid redundant data access to accelerate statistical analysis Library of building blocks Data

Statistic Basic Aggregates { { { { a chunk Data

Q: Monthly Variance Variance = 1 n 2 1 n 2 2 2 2 2 2 { { { { t Chunk size: 7 (a week)

Q: Monthly Variance Chunk size: 7 (a week) 2 2 2 2 { { { { t

1 n 2 1 2 Q: Monthly Variance n Chunk size: 7 (a week) 2 2 2 2 { { { { t

Var (first week) 1 7 2 i ( 1 7 i ) 2 Reuse between ranges Monthly mean 1 31 i Reuse between statistics Mean (first week) 1 7 i Mixed Chunk size: 7 (a week) 2 2 2 2

Var (first week) 1 7 2 i ( 1 7 i ) 2 Reuse between ranges Monthly mean 1 31 i Reuse between statistics Mean (first week) 1 7 i Mixed Chunk size: 7 (a week) 2 2 2 2

Var (first week) 1 7 2 i ( 1 7 i ) 2 Reuse between ranges Monthly mean 1 31 i Reuse between statistics Mean (first week) 1 7 i Mixed Chunk size: 7 (a week) 2 2 2 2

Var (first week) 1 7 2 i ( 1 7 i ) 2 Reuse between ranges Monthly mean 1 31 i Reuse between statistics Mean (first week) 1 7 i Mixed Chunk size: 7 (a week) 2 2 2 2

? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk? Updates Data? Build? Memory Pressure

? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk? Updates Data? Build? Memory Pressure

? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

Statistic Basic Aggregates

A basic aggregate Data column

A basic aggregate τ Data column

A basic aggregate f τ Data column

A basic aggregate f τ Data column τ(x) = x τ(x) = x 2

A basic aggregate f τ Data column f( ) f( ) f( ) f( )

A basic aggregate f τ Data column f( ) f( ) f( ) f( ) Example: max, min, sum, product

2 1 1 16 4 126 36 τ 6 64 8 9 3 f( ) = ( ) = 2

2 1 1 16 4 126 36 τ 6 64 8 9 3 f( ) = 1 1 ( ) = 2 max 4 4 8 6 τ 6 8 8 3 3 f( ) =max{ } ( ) =

Statistics are decomposed into building blocks f F τ τ f

1 1 4 4 22 + 6 τ 6 8 8 1 3 3 n ( ) = Arithmetic Mean 1 1 4.4 1 4 n 5 + 1 τ 6 1 8 1 3 ( ) =1

1 1 Y 4 4 576 * 6 τ 6 8 8 3 3 ( Y ) 1 n Geometric Mean ( ) = 3.6 1 1 1 4 n 5 + 1 τ 6 1 8 1 3 ( ) =1

f τ Data Column

f Chunk f f τ Data Column

Basic Aggregates Chunks Aggregate function f F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges Mixed F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

f f τ τ Data column Chunk f Transformation f Basic aggregates

f Segment trees f τ τ Data column Chunk f Transformation f Basic aggregates

5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

Basic aggregates Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data

7 22 10 3 42

One segment tree per basic aggregate per column Column a Column b

One segment tree per basic aggregate per column 2 2 Column a Column b

One segment tree per basic aggregate per column 2 2 Column a Column b

Decompose Statistic Basic Aggregates? Synthesize Store? Chunk size { { { { a chunk Data

Decompose Statistic Basic Aggregates? Synthesize Store? Chunk size { { { { a chunk Data

Calculate variance in temperature between May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance temperature May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance 2 N Recipe temperature May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance 2 Recipe N May 15 and Oct. 15 May a chunk June temperature July August Sept. Oct base data segment trees base data Chunk size = 1 month

Calculate variance in temperature between May 15 and Oct. 15 Plan 2 N Recipe temperature May June July August Sept. Oct base data segment trees base data fractal size = 1 month

Decompose Statistic Basic Aggregates Synthesize Store? Chunk size { { { { a chunk Data

Decompose Statistic Basic Aggregates Synthesize Store? Chunk size { { { { a chunk Data

base data Segment trees base data Chunk size = 1 month

base data Segment trees base data Chunk size = 1 month

Segment tree traversal O(log(n/c)) base data Segment trees base data n column size; c chunk size

Segment tree traversal Residual range scan O(log(n/c)) O(c) base data Segment trees base data n column size; c chunk size

Total cost = O(log(n/c) + c) Segment tree traversal Residual range scan O(log(n/c)) O(c) base data Segment trees base data n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Cache size I/O Chunk size n column size; c chunk size

Experimental Analysis

Range Query range distribution Range-Uniform Range-Zoom-in 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Online

Online Offline

Range Uniform Workload 1000 MonetDB Latency (ms) 100 10 Modeltools (R) NumPy (Python) Online 1 0 210 410 610 810 1010 1210 1410 1610 1810 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Range Uniform Workload 1000 MonetDB Latency (ms) 100 Modeltools (R) NumPy (Python) DC is up to 2 orders of magnitude faster 10 Online 1 0 210 410 610 810 1010 1210 1410 1610 1810 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

1000 432.53 Total execution time (s) 100 10 217.44 45.89 31.42 Range-Uniform* Range-Zoom-in* 1 1.12 1.05 No DC Online Offline * 2000 Queries 1 thread In-memory 40M rows, 100 columns, 5 statistic types

1000 432.53 Total execution time (s) 100 217.44 45.89 31.42 Range-Uniform* Range-Zoom-in* DC benefits 10 regardless of the exploration scenario 1 1.12 1.05 No DC Online Offline 1 thread In-memory 40M rows, 100 columns, 5 statistic types

12000 Total execution time (s) 10000 8000 6000 4000 2000 Collaborative Filtering Bayesian Classification Simple Linear Regression 0 Online No DC Individual Sequential 1 thread In-memory 40M rows, 100 columns, 5 statistic types

: Accelerating Exploratory Statistical Analysis Statistics are everywhere! Repetitive statistics and data access synthesizes statistics from basic ingredients

UERIOSITY Accelerate by synthesis Provide hints daslab.seas.harvard.edu/queriosity

daslab.seas.harvard.edu/ data-canopy queriosity Thank You!

Range-Zoom-in 10000 Response time (ms) 100 1 MonetDB NumPy (Python) Modeltools(R) Online DC 0.01 Range-Zoom-in 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Segment Tree in f({f( 1 ),f( 2 )}, {f( 3 ),f( 4 )}) f({f( 1 ),f( 2 )}) f({f( 3 ),f( 4 )}) i = { ( i )} f( 1 ) f( 2 ) f( 3 ) f( 4 ) Chunk i Leaves: basic aggregate on every chunk Internal node: Aggregate function applied to its two children *f() = f({ f(1), f(2), f(n) })

Operation Modes Offline Online Speculative Time Prep Now Future Prep Now Future Prep Now Future

sub-tree. needed by incoming queries. Column Existing rows New rows Query range Extend segment trees as needed Response time (s) 70 60 50 40 30 20 10 0 Insert Reconstruct x2 rows x2 columns x2 both 0 1x10 3 2x10 3 3x10 3 4x10 3 5x10 3 6x10 3 7x10 3 8x10 3 Total execution time (s) 25 20 15 10 5 0 Query Update 0 1 2 5 25 50 75 100 Query sequence Point updates (%) Inserts Updates

Performance under Memory Pressure Average response time (ms) 90 80 70 60 50 40 30 20 10 0 0 2.0x10 4 4.0x10 4 6.0x10 4 8.0x10 4 1.0x10 5 Query sequence

Handling Memory Pressure Base data in memory (%) Total execution time (s) 2500 2000 1500 1000 500 0 Phase 1 100 100 100 75 50 25 DC StatSys Phase 2 2.4 3.9 5.7 6.6 7.9 9.6 Number of rows (M)

Memory Feasibility 8GB 10 8 In-memory bivariate statistics 10 7 10 6 10 5 10 4 1M rows 10M rows 100M rows 1T rows 1 100 200 300 400 500 600 700 800 900 1000 Number of columns

Memory Footprint 100 s=s o Univariate DC Memory footprint (GB) 10 1 0.1 0.01 0.001 Bivariate DC s=64kb Max U workload Max U workload

Scaling with Queries Average response time (ms) 5 4 3 2 1 0 U Z 0 2x10 5 4x10 5 6x10 5 8x10 5 1x10 6 Query sequence

Selecting the Chunk Size 256 Total Execution time (s) 128 64 32 16 8 4 1M 10M 100M 1B 16 64 256 1024 4096 16384 65536 Chunk size (bytes)