Data Canopy: Accelerating Exploratory Statistical Analysis
Abdul Wasay, Xinding Wei, Niv Dayan, Stratos Idreos
Statistics are everywhere! In algorithms, in systems, in analytic pipelines.
[Charts: daily temperature through May 2017; its mean; its variance]
[Charts: temperature vs. lemonade sales shows positive correlation (+); temperature vs. hot chocolate sales shows negative correlation (-)]
Repetitive Statistics
Repetition takes multiple forms (query ranges over time, per column):
- Sub-range: a query range contained inside an earlier one
- Overlap: query ranges that partially overlap
- Different statistics: the same range queried for different statistics (S1, S2, S3)
- Mixed: combinations of the above
Exploratory Workloads Exhibit Repetition
Queries (%) that repeat at least once (SQLShare / SDSS):
- Column set repeats: 99.80 / 97.00
- Template repeats: 99.70 / 54.65
- Exactly repeats: 36.93 / 4.00
Repetition is everywhere: between 50% and 99% of queries repeat in some form.
SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment. Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016
How Do Existing Tools Perform? NumPy (Python), ModelTools (R), MonetDB
Sequence of queries: Q1 Mean, Q2 Variance, Q3 Covariance, Q4 Mean, Q5 Variance, Q6 Covariance
[Chart: normalized execution time of Q1-Q6 in each system]
Existing systems always compute statistics from scratch, going back to the base data for every query.
Data Canopy places a library of building blocks between statistical queries and the data, avoiding redundant data access to accelerate statistical analysis.
A statistic decomposes into basic aggregates, each computed over a chunk of the data.
Q: Monthly Variance (chunk size: 7, a week)
Variance = (1/n) Σ x_i² - ((1/n) Σ x_i)²
Both sums can be assembled chunk by chunk, one week at a time.
Chunk size: 7 (a week)
- Mean (first week) = (1/7) Σ x_i
- Var (first week) = (1/7) Σ x_i² - ((1/7) Σ x_i)²: reuse between statistics
- Monthly mean = (1/31) Σ x_i: reuse between ranges
- Mixed: both forms of reuse at once
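This reuse can be sketched in a few lines of NumPy. This is a minimal illustration, not Data Canopy's implementation; the data, the chunk size, and the function names are made up.

```python
# Cache per-chunk sums and sums of squares once; later queries reuse them.
import numpy as np

rng = np.random.default_rng(0)
temps = rng.uniform(40, 80, size=28)  # four weeks of daily temperatures
CHUNK = 7                             # one chunk = one week

# Basic aggregates, computed once per chunk.
chunks = temps.reshape(-1, CHUNK)
sums = chunks.sum(axis=1)             # sum(x)   per week
sums_sq = (chunks ** 2).sum(axis=1)   # sum(x^2) per week

def mean_of_weeks(w0, w1):
    """Mean over weeks [w0, w1), from the cached sums alone."""
    n = (w1 - w0) * CHUNK
    return sums[w0:w1].sum() / n

def var_of_weeks(w0, w1):
    """Variance over weeks [w0, w1): reuses the same cached aggregates."""
    n = (w1 - w0) * CHUNK
    m = sums[w0:w1].sum() / n
    return sums_sq[w0:w1].sum() / n - m * m

# Reuse between ranges: week 0 and the whole month share aggregates.
# Reuse between statistics: mean and variance share sums[].
assert np.isclose(mean_of_weeks(0, 4), temps.mean())
assert np.isclose(var_of_weeks(0, 4), temps.var())
```

Note that variance never rescans the raw values: it only combines the two cached arrays, which is exactly the form of reuse the slide describes.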
Design questions:
- Decompose: into which basic aggregates?
- Store: in what structure, and with what chunk size?
- Synthesize: how are statistics rebuilt from stored aggregates?
- Also: how to build, how to handle updates, and what to do under memory pressure?
Decompose: from statistic to basic aggregates
A basic aggregate applies a transformation τ to every value of a data column chunk, then combines the transformed values with an aggregate function f.
- Transformations: τ(x) = x, τ(x) = x², ...
- Aggregate functions: max, min, sum, product, ...
Example: column (1, 4, 6, 8, 3)
- τ(x) = x², f = sum: sum{1, 16, 36, 64, 9} = 126
- τ(x) = x, f = max: max{1, 4, 6, 8, 3} = 8
Statistics are decomposed into these building blocks: pairs of an aggregate function f and a transformation τ.
Example: column (1, 4, 6, 8, 3)
- Count: f = +, τ(x) = 1 gives n = 5
- Arithmetic mean = (1/n) Σ x_i = 22 / 5 = 4.4 (f = +, τ(x) = x)
- Geometric mean = (Π x_i)^(1/n) = 576^(1/5) ≈ 3.6 (f = product, τ(x) = x)
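A toy version of such a library of building blocks, using only the slide's worked example; the dictionary keys and variable names are illustrative, not Data Canopy's actual API.

```python
# Each basic aggregate is an aggregate function f over a transformation tau.
import math

column = [1, 4, 6, 8, 3]  # the worked example from the slides

basic = {
    "sum":     sum(x     for x in column),   # f = +,       tau(x) = x
    "sum_sq":  sum(x * x for x in column),   # f = +,       tau(x) = x^2
    "count":   sum(1     for x in column),   # f = +,       tau(x) = 1
    "product": math.prod(column),            # f = product, tau(x) = x
}

# Statistics are then synthesized without rescanning the column.
arithmetic_mean = basic["sum"] / basic["count"]            # 22 / 5 = 4.4
geometric_mean = basic["product"] ** (1 / basic["count"])  # 576^(1/5) ≈ 3.56
variance = basic["sum_sq"] / basic["count"] - arithmetic_mean ** 2
```

Any statistic expressible over these four aggregates can now be answered from the cache; adding a new statistic type means adding a recipe, not a new scan.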
The full decomposition: data column, split into chunks; transformation τ applied per value; aggregate function f applied per chunk; the results are the basic aggregates. The choice of (f, τ) and of chunk boundaries covers different statistic types, overlapping ranges, sub-ranges, and mixed repetition.
Roadmap: Decompose done. Next: Store (structure and chunk size).
Store: basic aggregates are kept in segment trees.
Example (aggregate = sum, logical chunk = 2 values):
Data: 5 4 | 2 5 | 4 4 | 6 8 | 0 1 | 7 2 | 2 1 | 3 5
Leaves (basic aggregate per chunk): 9 7 8 14 1 9 3 8
Internal nodes: 16 22 10 11; then 38 21; root 59
A range query is answered from a handful of tree nodes, e.g. 7 + 22 + 10 + 3 = 42.
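The tree above can be reproduced with a standard iterative segment tree. This sketch hard-codes f = sum and assumes a power-of-two number of chunks for simplicity; it is an illustration, not the paper's implementation.

```python
# Segment tree over per-chunk basic aggregates, using the slides' example.
data = [5, 4, 2, 5, 4, 4, 6, 8, 0, 1, 7, 2, 2, 1, 3, 5]
CHUNK = 2

# Leaves: the basic aggregate (sum) of every chunk -> [9, 7, 8, 14, 1, 9, 3, 8]
leaves = [sum(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

# Internal nodes: f applied to the two children, stored as a flat array
# with the root at index 1 (a standard iterative segment tree layout).
n = len(leaves)
tree = [0] * n + leaves
for i in range(n - 1, 0, -1):
    tree[i] = tree[2 * i] + tree[2 * i + 1]

def range_sum(lo, hi):
    """Sum over chunks [lo, hi): touches O(log n) nodes, not O(n)."""
    lo += n; hi += n; acc = 0
    while lo < hi:
        if lo & 1: acc += tree[lo]; lo += 1
        if hi & 1: hi -= 1; acc += tree[hi]
        lo //= 2; hi //= 2
    return acc

assert tree[1] == 59            # root = sum of everything
assert range_sum(1, 7) == 42    # answered as 7 + 22 + 10 + 3
```

The query for chunks 1 through 6 reads exactly the four nodes named on the slide (a leaf, two internal nodes, a leaf) instead of six leaves.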
One segment tree per basic aggregate per column, e.g. separate trees for Σx and Σx² on each of columns a and b.
Roadmap: Decompose and Store done. Next: Synthesize.
Q: Calculate variance in temperature between May 15 and Oct. 15 (chunk size = 1 month).
Recipe: variance = (1/N) Σ x² - ((1/N) Σ x)².
Plan: the full months (June through September) come from the segment trees; the partial months at either end (May 15-31 and Oct. 1-15) come from residual scans of the base data.
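One way to sketch this plan in NumPy, with made-up daily temperatures; the per-month aggregates here stand in for segment-tree leaves (a full tree is omitted for brevity), and all names and indices are illustrative.

```python
# Synthesize Var = (1/N)*sum(x^2) - ((1/N)*sum(x))^2 from cached full-month
# aggregates plus residual base-data scans for the partial months.
import numpy as np

rng = np.random.default_rng(1)
days = rng.uniform(40, 90, size=184)   # May 1 .. Oct 31 (184 days)
month_len = [31, 30, 31, 31, 30, 31]   # May .. Oct
starts = np.cumsum([0] + month_len[:-1])

# "Segment tree leaves": per-month sum and sum of squares,
# one tree per basic aggregate.
m_sum = [days[s:s + l].sum() for s, l in zip(starts, month_len)]
m_ssq = [(days[s:s + l] ** 2).sum() for s, l in zip(starts, month_len)]

# Query: May 15 is day index 14; Oct 15 is index 167 (exclusive end 168).
lo, hi = 14, 168
# Full months June..September come from the cached aggregates...
s = sum(m_sum[1:5]); ssq = sum(m_ssq[1:5])
# ...and the partial May and October ranges from residual base-data scans.
for part in (days[lo:31], days[starts[5]:hi]):
    s += part.sum(); ssq += (part ** 2).sum()

n = hi - lo
var = ssq / n - (s / n) ** 2
assert np.isclose(var, days[lo:hi].var())
```

The residual scans touch only 32 of the 154 queried days; everything else is served from aggregates computed at most once.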
Roadmap: Decompose, Store, and Synthesize done. Next: choosing the chunk size.
A plan touches the segment trees for whole chunks and the base data for the partial chunks at either end (chunk size = 1 month).
Query cost (n = column size, c = chunk size):
- Segment tree traversal: O(log(n/c))
- Residual range scan: O(c)
Total cost = O(log(n/c) + c)
Total cost = O(log(n/c) + c): as the chunk size c grows, traversal cost falls but residual scan cost rises, so query cost is minimized at an intermediate chunk size. Cache size and I/O granularity also shape the choice. (n = column size, c = chunk size)
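The tradeoff can be made concrete with a toy cost model; the weights w_tree and w_scan below are invented for illustration and are not the paper's calibrated constants.

```python
# Toy model of Total = O(log2(n/c) + c): tiny chunks pay in tree depth,
# huge chunks pay in residual scans, and the minimum sits in between.
import math

def query_cost(n, c, w_tree=1.0, w_scan=0.05):
    """Modeled cost of one range query: tree traversal + residual scan."""
    return w_tree * math.log2(n / c) + w_scan * c

n = 40_000_000  # rows, as in the experiments
costs = {c: query_cost(n, c) for c in (2, 8, 64, 512, 4096, 65536)}
best = min(costs, key=costs.get)

# Both extremes lose to the intermediate chunk size.
assert query_cost(n, 2) > query_cost(n, best) < query_cost(n, 65536)
```

With these particular weights the sweet spot among the candidates is c = 64; changing the weights (e.g. to reflect cache line size or I/O granularity) shifts the minimum, which is why the chunk size is tuned rather than fixed.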
Experimental Analysis
Setup: range queries drawn from two distributions, Range-Uniform and Range-Zoom-in; 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types.
Two ways of running Data Canopy: Online and Offline.
Range-Uniform workload: [chart, latency (ms, log scale) over a 2000-query sequence] DC (Online) is up to two orders of magnitude faster than MonetDB, Modeltools (R), and NumPy (Python). (1 thread, in-memory, 40M rows, 100 columns, 5 statistic types)
Total execution time over 2000 queries (Range-Uniform / Range-Zoom-in): No DC 432.53s / 217.44s; Online 45.89s / 31.42s; Offline 1.12s / 1.05s. DC benefits regardless of the exploration scenario. (1 thread, in-memory, 40M rows, 100 columns, 5 statistic types)
Machine learning workloads (collaborative filtering, Bayesian classification, simple linear regression): [chart, total execution time] Online DC compared against No DC with statistics computed individually or sequentially. (1 thread, in-memory, 40M rows, 100 columns, 5 statistic types)
Data Canopy: Accelerating Exploratory Statistical Analysis. Statistics are everywhere; statistical queries and data access are repetitive; Data Canopy synthesizes statistics from basic aggregates.
Queriosity: accelerate by synthesis; provide hints. daslab.seas.harvard.edu/queriosity
Thank You! daslab.seas.harvard.edu/data-canopy, daslab.seas.harvard.edu/queriosity
Range-Zoom-in workload: [chart, response time (ms, log scale) over a 2000-query sequence] Online DC vs. MonetDB, NumPy (Python), and Modeltools (R). (1 thread, in-memory, 40M rows, 100 columns, 5 statistic types)
Segment tree structure: leaves store the basic aggregate f(chunk_i) of every chunk; each internal node stores the aggregate function applied to its two children, so the root of a four-chunk tree is f({f(1), f(2), f(3), f(4)}) = f( f({f(1), f(2)}), f({f(3), f(4)}) ). This requires f to be decomposable: f over all chunks equals f({f(1), f(2), ..., f(n)}).
Operation modes (when is the preparation work done relative to query time?):
- Offline: built in a preparation phase, before queries arrive
- Online: built incrementally as queries arrive
- Speculative: built ahead of anticipated future queries
Inserts and updates: segment trees are extended as needed by incoming queries; new rows and columns only materialize the sub-trees that query ranges touch. [Charts: response time of inserts vs. full reconstruction when doubling rows, columns, or both; total execution time as the share of point updates grows from 0% to 100%]
Performance under Memory Pressure: [chart, average response time (ms) over the query sequence]
Handling Memory Pressure: [chart, total execution time of DC vs. StatSys in two phases as the number of rows grows and the share of base data in memory falls from 100% to 25%]
Memory Feasibility: [chart, number of in-memory bivariate statistics vs. number of columns, for 1M to 1T rows, against an 8GB budget]
Memory Footprint: [chart, memory footprint (GB, log scale) of univariate vs. bivariate DC, for chunk sizes s = s_o and s = 64kb, under the Max U workload]
Scaling with Queries: [chart, average response time (ms) over a 1M-query sequence for the U and Z workloads]
Selecting the Chunk Size: [chart, total execution time vs. chunk size (16 bytes to 64KB) for 1M to 1B rows]