Data Canopy. Accelerating Exploratory Statistical Analysis. Abdul Wasay Xinding Wei Niv Dayan Stratos Idreos

Accelerating Exploratory Statistical Analysis Abdul Wasay inding Wei Niv Dayan Stratos Idreos

Statistics are everywhere! Algorithms Systems Analytic Pipelines

80 Temperature 60 40 20 0 May 2017

80 Temperature 60 40 20 0 May 2017 Mean

80 Temperature 60 40 20 0 May 2017 Variance

+ Lemonade Sale 8 6 4 2 0 Correlation - Temperature Hot Choc. Sale 80 60 40 20 0 70 53 35 18 0 Correlation

Repetitive Statistics

Repetition takes multiple forms Query Range Q1 Time Q2 Q3 Column Sub-range

Repetition takes multiple forms Query Range Q1 Time Q2 Q3 Column Overlap

Repetition takes multiple forms Query Range Q1 S1 Time Q2 S2 Q3 S3 Column Different Statistics

Repetition takes multiple forms Query Range Q1 S1 Time Q2 S2 Q3 S3 Column Mixed

Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* 100 99.80 99.70 97.00 Queries (%) 50 54.65 36.93 0 SQLShare 4.00 SDSS * at least once SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* 100 99.80 99.70 97.00 Queries (%) Repetition 54.65is everywhere - between 50% to 99% 50 36.93 0 SQLShare 4.00 SDSS * at least once SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB

How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB Sequence of Queries Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

How Do Existing Tools Perform? NumPy Modeltools MonetDB Normalized execution time 1 0.75 0.5 0.25 0 Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

Data

Existing systems always compute statistics from scratch Data

Statistical queries Library of building blocks Data

Statistical queries Avoid redundant data access to accelerate statistical analysis Library of building blocks Data

Statistic Basic Aggregates { { { { a chunk Data

Q: Monthly Variance Variance = 1 n 2 1 n 2 2 2 2 2 2 { { { { t Chunk size: 7 (a week)

Q: Monthly Variance Chunk size: 7 (a week) 2 2 2 2 { { { { t

1 n 2 1 2 Q: Monthly Variance n Chunk size: 7 (a week) 2 2 2 2 { { { { t

Var (first week) 1 7 2 i ( 1 7 i ) 2 Reuse between ranges Monthly mean 1 31 i Reuse between statistics Mean (first week) 1 7 i Mixed Chunk size: 7 (a week) 2 2 2 2

? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk? Updates Data? Build? Memory Pressure

? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

Statistic Basic Aggregates

A basic aggregate Data column

A basic aggregate τ Data column

A basic aggregate f τ Data column

A basic aggregate f τ Data column τ(x) = x τ(x) = x 2

A basic aggregate f τ Data column f( ) f( ) f( ) f( )

A basic aggregate f τ Data column f( ) f( ) f( ) f( ) Example: max, min, sum, product

2 1 1 16 4 126 36 τ 6 64 8 9 3 f( ) = ( ) = 2

2 1 1 16 4 126 36 τ 6 64 8 9 3 f( ) = 1 1 ( ) = 2 max 4 4 8 6 τ 6 8 8 3 3 f( ) =max{ } ( ) =

Statistics are decomposed into building blocks f F τ τ f

1 1 4 4 22 + 6 τ 6 8 8 1 3 3 n ( ) = Arithmetic Mean 1 1 4.4 1 4 n 5 + 1 τ 6 1 8 1 3 ( ) =1

1 1 Y 4 4 576 * 6 τ 6 8 8 3 3 ( Y ) 1 n Geometric Mean ( ) = 3.6 1 1 1 4 n 5 + 1 τ 6 1 8 1 3 ( ) =1

f τ Data Column

f Chunk f f τ Data Column

Basic Aggregates Chunks Aggregate function f F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges Mixed F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

f f τ τ Data column Chunk f Transformation f Basic aggregates

f Segment trees f τ τ Data column Chunk f Transformation f Basic aggregates

5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

Basic aggregates Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data Logical chunk

59 38 21 16 22 10 11 Leaves 9 7 8 14 1 9 3 8 5 4 2 5 4 4 6 8 0 1 7 2 2 1 3 5 Data

7 22 10 3 42

One segment tree per basic aggregate per column Column a Column b

One segment tree per basic aggregate per column 2 2 Column a Column b

Decompose Statistic Basic Aggregates? Synthesize Store? Chunk size { { { { a chunk Data

Calculate variance in temperature between May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance temperature May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance 2 N Recipe temperature May 15 and Oct. 15

Calculate variance in temperature between May 15 and Oct. 15 variance 2 Recipe N May 15 and Oct. 15 May a chunk June temperature July August Sept. Oct base data segment trees base data Chunk size = 1 month

Calculate variance in temperature between May 15 and Oct. 15 Plan 2 N Recipe temperature May June July August Sept. Oct base data segment trees base data fractal size = 1 month

Decompose Statistic Basic Aggregates Synthesize Store? Chunk size { { { { a chunk Data

base data Segment trees base data Chunk size = 1 month

Segment tree traversal O(log(n/c)) base data Segment trees base data n column size; c chunk size

Segment tree traversal Residual range scan O(log(n/c)) O(c) base data Segment trees base data n column size; c chunk size

Total cost = O(log(n/c) + c) Segment tree traversal Residual range scan O(log(n/c)) O(c) base data Segment trees base data n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

Total cost = O(log(n/c) + c) Query cost Cache size I/O Chunk size n column size; c chunk size

Experimental Analysis

Range Query range distribution Range-Uniform Range-Zoom-in 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Online

Online Offline

Range Uniform Workload 1000 MonetDB Latency (ms) 100 10 Modeltools (R) NumPy (Python) Online 1 0 210 410 610 810 1010 1210 1410 1610 1810 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Range Uniform Workload 1000 MonetDB Latency (ms) 100 Modeltools (R) NumPy (Python) DC is up to 2 orders of magnitude faster 10 Online 1 0 210 410 610 810 1010 1210 1410 1610 1810 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

1000 432.53 Total execution time (s) 100 10 217.44 45.89 31.42 Range-Uniform* Range-Zoom-in* 1 1.12 1.05 No DC Online Offline * 2000 Queries 1 thread In-memory 40M rows, 100 columns, 5 statistic types

1000 432.53 Total execution time (s) 100 217.44 45.89 31.42 Range-Uniform* Range-Zoom-in* DC benefits 10 regardless of the exploration scenario 1 1.12 1.05 No DC Online Offline 1 thread In-memory 40M rows, 100 columns, 5 statistic types

12000 Total execution time (s) 10000 8000 6000 4000 2000 Collaborative Filtering Bayesian Classification Simple Linear Regression 0 Online No DC Individual Sequential 1 thread In-memory 40M rows, 100 columns, 5 statistic types

: Accelerating Exploratory Statistical Analysis Statistics are everywhere! Repetitive statistics and data access synthesizes statistics from basic ingredients

UERIOSITY Accelerate by synthesis Provide hints daslab.seas.harvard.edu/queriosity

daslab.seas.harvard.edu/ data-canopy queriosity Thank You!

Range-Zoom-in 10000 Response time (ms) 100 1 MonetDB NumPy (Python) Modeltools(R) Online DC 0.01 Range-Zoom-in 0 200 400 600 800 1000 1200 1400 1600 1800 2000 Query Sequence 1 thread In-memory 40M rows, 100 columns, 5 statistic types

Segment Tree in f({f( 1 ),f( 2 )}, {f( 3 ),f( 4 )}) f({f( 1 ),f( 2 )}) f({f( 3 ),f( 4 )}) i = { ( i )} f( 1 ) f( 2 ) f( 3 ) f( 4 ) Chunk i Leaves: basic aggregate on every chunk Internal node: Aggregate function applied to its two children *f() = f({ f(1), f(2), f(n) })

Operation Modes Offline Online Speculative Time Prep Now Future Prep Now Future Prep Now Future

sub-tree. needed by incoming queries. Column Existing rows New rows Query range Extend segment trees as needed Response time (s) 70 60 50 40 30 20 10 0 Insert Reconstruct x2 rows x2 columns x2 both 0 1x10 3 2x10 3 3x10 3 4x10 3 5x10 3 6x10 3 7x10 3 8x10 3 Total execution time (s) 25 20 15 10 5 0 Query Update 0 1 2 5 25 50 75 100 Query sequence Point updates (%) Inserts Updates

Performance under Memory Pressure Average response time (ms) 90 80 70 60 50 40 30 20 10 0 0 2.0x10 4 4.0x10 4 6.0x10 4 8.0x10 4 1.0x10 5 Query sequence

Handling Memory Pressure Base data in memory (%) Total execution time (s) 2500 2000 1500 1000 500 0 Phase 1 100 100 100 75 50 25 DC StatSys Phase 2 2.4 3.9 5.7 6.6 7.9 9.6 Number of rows (M)

Memory Feasibility 8GB 10 8 In-memory bivariate statistics 10 7 10 6 10 5 10 4 1M rows 10M rows 100M rows 1T rows 1 100 200 300 400 500 600 700 800 900 1000 Number of columns

Memory Footprint 100 s=s o Univariate DC Memory footprint (GB) 10 1 0.1 0.01 0.001 Bivariate DC s=64kb Max U workload Max U workload

Scaling with Queries Average response time (ms) 5 4 3 2 1 0 U Z 0 2x10 5 4x10 5 6x10 5 8x10 5 1x10 6 Query sequence

Selecting the Chunk Size 256 Total Execution time (s) 128 64 32 16 8 4 1M 10M 100M 1B 16 64 256 1024 4096 16384 65536 Chunk size (bytes)