Data Canopy. Accelerating Exploratory Statistical Analysis. Abdul Wasay Xinding Wei Niv Dayan Stratos Idreos

1 Data Canopy: Accelerating Exploratory Statistical Analysis Abdul Wasay, Xinding Wei, Niv Dayan, Stratos Idreos

2 Statistics are everywhere! Algorithms Systems Analytic Pipelines

3 [Plot: Temperature, May 2017]

4 [Plot: Temperature, May 2017] Mean

5 [Plot: Temperature, May 2017] Variance

6 Temperature and Lemonade Sale: positive (+) correlation. Temperature and Hot Choc. Sale: negative (-) correlation

7 Repetitive Statistics

8 Repetition takes multiple forms: Sub-range [Diagram: query ranges of Q1, Q2, Q3 on a column, over time]

9 Repetition takes multiple forms: Overlap [Diagram: query ranges of Q1, Q2, Q3 on a column, over time]

10 Repetition takes multiple forms: Different Statistics [Diagram: queries Q1, Q2, Q3 compute statistics S1, S2, S3 on a column, over time]

11 Repetition takes multiple forms: Mixed [Diagram: queries Q1, Q2, Q3 compute statistics S1, S2, S3 on a column, over time]

12 Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* Queries (%) SQLShare 4.00 SDSS * at least once SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

13 Exploratory Workloads Exhibit Repetition Column set repeats* Template repeats* Exactly repeats* Queries (%) 54.65 SQLShare 4.00 SDSS * at least once Repetition is everywhere: between 50% and 99% SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment. Shrainik Jain, Dominik Moritz, Bill Howe, Ed Lazowska. SIGMOD 2016

14 How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB

15 How Do Existing Tools Perform? NumPy (Python) ModelTools (R) MonetDB Sequence of Queries Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

17 How Do Existing Tools Perform? NumPy Modeltools MonetDB Normalized execution time Q1- Mean Q2- Var. Q3- Cov. Q4- Mean Q5- Var. Q6- Cov.

19 Data

20 Existing systems always compute statistics from scratch Data

22 Statistical queries Library of building blocks Data

25 Statistical queries Avoid redundant data access to accelerate statistical analysis Library of building blocks Data

26 Statistic Basic Aggregates { { { { a chunk Data

27-29 Q: Monthly Variance $\mathrm{Var} = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$ Chunk size: 7 (a week)

30-33 Chunk size: 7 (a week). Var (first week) $= \frac{1}{7}\sum_{i=1}^{7} x_i^2 - \left(\frac{1}{7}\sum_{i=1}^{7} x_i\right)^2$: reuse between ranges. Monthly mean $= \frac{1}{31}\sum_{i=1}^{31} x_i$: reuse between statistics. Mean (first week) $= \frac{1}{7}\sum_{i=1}^{7} x_i$: mixed.
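
The reuse pattern on these slides can be sketched in Python. This is a minimal illustration under stated assumptions, not Data Canopy's implementation: the helper names (`chunk_aggregates`, `mean_from`, `variance_from`) and the sample temperatures are invented for the example. Each weekly chunk stores three basic aggregates (sum, sum of squares, count), and the mean and variance of any chunk-aligned range are synthesized from those triples without touching the data again.

```python
import statistics

CHUNK_SIZE = 7  # one week, as on the slide


def chunk_aggregates(column, chunk_size=CHUNK_SIZE):
    """Return one (sum, sum of squares, count) triple per chunk."""
    triples = []
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]
        triples.append((sum(chunk), sum(x * x for x in chunk), len(chunk)))
    return triples


def _totals(triples):
    """Add up the per-chunk triples component by component."""
    s = sum(t[0] for t in triples)
    sq = sum(t[1] for t in triples)
    n = sum(t[2] for t in triples)
    return s, sq, n


def mean_from(triples):
    s, _, n = _totals(triples)
    return s / n


def variance_from(triples):
    # Population variance: (1/n) * sum(x^2) - ((1/n) * sum(x))^2
    s, sq, n = _totals(triples)
    return sq / n - (s / n) ** 2


# Hypothetical daily temperatures for a 31-day month.
temps = [60 + (3 * day) % 25 for day in range(31)]
aggs = chunk_aggregates(temps)  # built once, reused by every query below

monthly_var = variance_from(aggs)         # whole month
first_week_var = variance_from(aggs[:1])  # reuse between ranges
monthly_mean = mean_from(aggs)            # reuse between statistics
first_week_mean = mean_from(aggs[:1])     # mixed

# Sanity check against a from-scratch computation.
assert abs(monthly_var - statistics.pvariance(temps)) < 1e-9
```

Only the first pass reads the base data; the four statistics afterwards touch at most five triples each.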

34 ? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk? Updates Data? Build? Memory Pressure

36 ? Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

37 Statistic Basic Aggregates

38 A basic aggregate Data column

39 A basic aggregate τ Data column

40 A basic aggregate f τ Data column

41 A basic aggregate f τ Data column $\tau(x) = x$, $\tau(x) = x^2$

42 A basic aggregate f τ Data column f( ) f( ) f( ) f( )

43 A basic aggregate f τ Data column f( ) f( ) f( ) f( ) Example: max, min, sum, product
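
As a hedged sketch of the definition above, a basic aggregate can be read as "transform every value with τ, then combine with f". The function name `basic_aggregate` and the sample chunk are assumptions made for illustration only.

```python
import math


def basic_aggregate(f, tau, chunk):
    """Apply the transformation tau to each value, then aggregate with f."""
    return f(tau(x) for x in chunk)


def identity(x):
    return x


def square(x):
    return x * x


chunk = [3, 1, 4, 1, 5]  # hypothetical chunk of a data column

total = basic_aggregate(sum, identity, chunk)          # f = sum, tau(x) = x
sum_squares = basic_aggregate(sum, square, chunk)      # f = sum, tau(x) = x^2
largest = basic_aggregate(max, identity, chunk)        # f = max
smallest = basic_aggregate(min, identity, chunk)       # f = min
product = basic_aggregate(math.prod, identity, chunk)  # f = product
```

All four aggregation functions named on the slide share one property that matters for chunking: applying f to per-chunk results gives the same answer as applying f to the whole range.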

44-45 Examples: $f(X) = \sum_{x \in X} x$ with $\tau(x) = x^2$; $f(X) = \max\{X\}$ with $\tau(x) = x$

46 Statistics are decomposed into building blocks f F τ τ f

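47 Arithmetic Mean $= \frac{1}{n}\sum_{i} \tau(x_i)$ with $\tau(x) = x$, where $n = \sum_{i} \tau(x_i)$ with $\tau(x) = 1$

48 Geometric Mean $= \left(\prod_{i} \tau(x_i)\right)^{1/n}$ with $\tau(x) = x$, where $n = \sum_{i} \tau(x_i)$ with $\tau(x) = 1$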
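
The two means can be written as statistic functions F over basic aggregates. The sketch below assumes the decomposition described on these slides (a sum or product with τ(x) = x, and a count obtained as a sum with τ(x) = 1); the function names and sample values are hypothetical.

```python
import math


def arithmetic_mean(sum_x, count):
    """F for the arithmetic mean: (sum of x) / n."""
    return sum_x / count


def geometric_mean(prod_x, count):
    """F for the geometric mean: (product of x) ** (1 / n)."""
    return prod_x ** (1.0 / count)


values = [1, 2, 4, 8]  # hypothetical column values

# Basic aggregates: each is an aggregation f over transformed values tau(x).
sum_x = sum(values)              # f = sum,     tau(x) = x
count = sum(1 for _ in values)   # f = sum,     tau(x) = 1
prod_x = math.prod(values)       # f = product, tau(x) = x

mean = arithmetic_mean(sum_x, count)
gmean = geometric_mean(prod_x, count)
```

Both statistics share the count aggregate, so once it is stored for one of them the other gets it for free.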

49 f τ Data Column

50 f Chunk f f τ Data Column

51 Basic Aggregates Chunks Aggregate function f F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

52 Basic Aggregates Chunks Aggregate function f Statistic type F Statistic function f f τ τ Data column Chunk Transformation f Basic aggregates

53 Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

54 Basic Aggregates Chunks Aggregate function f Statistic type Overlapping ranges Mixed F Statistic function f f τ τ Data column Chunk Sub-ranges Transformation f Basic aggregates

55 Decompose Statistic Basic Aggregates? Synthesize? Store Chunk size? { { { { a chunk Data

57 f f τ τ Data column Chunk f Transformation f Basic aggregates

58 f Segment trees f τ τ Data column Chunk f Transformation f Basic aggregates

59 Data Logical chunk

60 Basic aggregates Leaves Data Logical chunk

61 Leaves Data Logical chunk

63 Leaves Data

66 One segment tree per basic aggregate per column Column a Column b

67 One segment tree per basic aggregate per column 2 2 Column a Column b

68 One segment tree per basic aggregate per column 2 2 Column a Column b
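
A minimal sketch of "one segment tree per basic aggregate per column", assuming an array-backed tree whose leaves hold per-chunk aggregates. The `SegmentTree` class, the `trees` dictionary keyed by `(column, aggregate)`, the chunk size of 4, and the sample columns are all illustrative choices, not Data Canopy's actual layout.

```python
import operator


class SegmentTree:
    """Array-backed segment tree over per-chunk basic aggregates.

    `combine` must be associative and commutative (e.g. addition, max, min).
    """

    def __init__(self, leaves, combine):
        self.size = len(leaves)
        self.combine = combine
        # nodes[size:] are the leaves; nodes[1:size] are internal nodes.
        self.nodes = [None] * self.size + list(leaves)
        for i in range(self.size - 1, 0, -1):
            self.nodes[i] = combine(self.nodes[2 * i], self.nodes[2 * i + 1])

    def _merge(self, acc, value):
        return value if acc is None else self.combine(acc, value)

    def query(self, lo, hi):
        """Combine the aggregates of chunks lo..hi-1; None for an empty range."""
        result = None
        lo += self.size
        hi += self.size
        while lo < hi:
            if lo & 1:
                result = self._merge(result, self.nodes[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = self._merge(result, self.nodes[hi])
            lo //= 2
            hi //= 2
        return result


CHUNK = 4  # hypothetical chunk size, in rows
columns = {"a": list(range(1, 13)), "b": [2] * 12}  # hypothetical data

trees = {}
for name, data in columns.items():
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    trees[(name, "sum")] = SegmentTree([sum(c) for c in chunks], operator.add)
    trees[(name, "sum_sq")] = SegmentTree(
        [sum(v * v for v in c) for c in chunks], operator.add
    )

# Sum of column a over chunks 1 and 2 (rows 4..11), without reading the rows.
sum_a = trees[("a", "sum")].query(1, 3)
```

Keeping the trees separate per aggregate and per column means a query only touches the trees its statistic needs.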

69 Decompose Statistic Basic Aggregates? Synthesize Store? Chunk size { { { { a chunk Data

71 Calculate variance in temperature between May 15 and Oct. 15

72 Calculate variance in temperature between May 15 and Oct. 15. Statistic: variance. Column: temperature. Range: May 15 to Oct. 15

73 Calculate variance in temperature between May 15 and Oct. 15. Recipe for variance: $\sum x^2$, $\sum x$, $N$, over column temperature and range May 15 to Oct. 15

74 Calculate variance in temperature between May 15 and Oct. 15. Chunk size = 1 month, so the chunks are May, June, July, August, Sept., Oct. The partially covered chunks (May and Oct.) are read from base data; the fully covered chunks (June to Sept.) are read from segment trees

75 Calculate variance in temperature between May 15 and Oct. 15. Plan: apply the recipe ($\sum x^2$, $\sum x$, $N$) to temperature, reading May and Oct. from base data and June to Sept. from segment trees. Chunk size = 1 month
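
The plan on this slide can be sketched as follows. This is an illustration under simplifying assumptions: every "month" is a fixed 30-row chunk, the per-chunk aggregates are kept in plain lists (a direct sum over them stands in for the segment tree lookup), and all names and data are hypothetical.

```python
import statistics


def build_chunk_sums(data, chunk_size, tau):
    """One basic aggregate (sum of tau(x)) per chunk."""
    return [
        sum(tau(x) for x in data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def range_sum(data, chunk_sums, chunk_size, lo, hi, tau):
    """Sum tau(x) over data[lo:hi], using stored aggregates for whole chunks."""
    first_full = -(-lo // chunk_size)  # ceil(lo / chunk_size)
    end_full = hi // chunk_size        # one past the last fully covered chunk
    if first_full >= end_full:         # no chunk is fully covered: plain scan
        return sum(tau(x) for x in data[lo:hi])
    covered = sum(chunk_sums[first_full:end_full])  # stand-in for the tree
    head = sum(tau(x) for x in data[lo:first_full * chunk_size])  # residual
    tail = sum(tau(x) for x in data[end_full * chunk_size:hi])    # residual
    return head + covered + tail


def identity(x):
    return x


def square(x):
    return x * x


def range_variance(data, sums, sums_sq, chunk_size, lo, hi):
    """Recipe for variance: sum(x^2), sum(x), and N over data[lo:hi]."""
    n = hi - lo
    s = range_sum(data, sums, chunk_size, lo, hi, identity)
    sq = range_sum(data, sums_sq, chunk_size, lo, hi, square)
    return sq / n - (s / n) ** 2


MONTH = 30  # simplifying assumption: every month is a 30-row chunk
temps = [50 + (11 * day) % 37 for day in range(6 * MONTH)]  # May..Oct, made up
sums = build_chunk_sums(temps, MONTH, identity)
sums_sq = build_chunk_sums(temps, MONTH, square)

# "May 15 to Oct. 15": rows 14..164. June to Sept. come from the aggregates,
# the second half of May and the first half of Oct. from the base data.
var_may15_oct15 = range_variance(temps, sums, sums_sq, MONTH, 14, 165)
assert abs(var_may15_oct15 - statistics.pvariance(temps[14:165])) < 1e-9
```

Of the 151 rows in the range, only the 31 residual rows are scanned; the other 120 are covered by four stored aggregates per recipe term.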

76 Decompose Statistic Basic Aggregates Synthesize Store? Chunk size { { { { a chunk Data

78 base data Segment trees base data Chunk size = 1 month

80 Segment tree traversal: O(log(n/c)). [Diagram: base data, segment trees, base data] n: column size; c: chunk size

81 Segment tree traversal: O(log(n/c)). Residual range scan: O(c). [Diagram: base data, segment trees, base data] n: column size; c: chunk size

82 Total cost = O(log(n/c) + c). Segment tree traversal: O(log(n/c)). Residual range scan: O(c). [Diagram: base data, segment trees, base data] n: column size; c: chunk size

83 Total cost = O(log(n/c) + c) Query cost Chunk size n column size; c chunk size

88 Total cost = O(log(n/c) + c) Query cost Cache size I/O Chunk size n column size; c chunk size
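
To make the trade-off behind the cost formula concrete, the sketch below evaluates a weighted version of it over a sweep of chunk sizes. The weights `NODE_COST` and `SCAN_COST` are hypothetical constants introduced only for illustration (the slides give the asymptotic form, not constants); `n` is set to the 40M rows used in the experiments.

```python
import math


def query_cost(n, c, node_cost, scan_cost):
    """Weighted form of log(n/c) + c: tree traversal plus residual scan."""
    return node_cost * math.log2(n / c) + scan_cost * c


N_ROWS = 40_000_000  # column size n
NODE_COST = 100.0    # hypothetical cost of visiting one tree node
SCAN_COST = 1.0      # hypothetical cost of scanning one base-data value

candidates = [2 ** k for k in range(21)]  # chunk sizes 1 .. 2^20
best = min(
    candidates,
    key=lambda c: query_cost(N_ROWS, c, NODE_COST, SCAN_COST),
)

# Setting the derivative of the weighted cost to zero gives
# c* = node_cost / (scan_cost * ln 2), independent of n.
analytic_optimum = NODE_COST / (SCAN_COST * math.log(2))
```

Under these made-up weights, small chunks pay for a deeper tree and large chunks pay for longer residual scans, so the cost curve has an interior minimum; the real minimum depends on hardware factors such as the cache size and I/O noted on the slide.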

89 Experimental Analysis

90 Range Query range distribution Range-Uniform Range-Zoom-in 1 thread In-memory 40M rows, 100 columns, 5 statistic types

91 Online

92 Online Offline

93 Range Uniform Workload [Plot: latency (ms) vs. query sequence for MonetDB, Modeltools (R), NumPy (Python), and Online] 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types

94 Range Uniform Workload: DC is up to 2 orders of magnitude faster [Plot: latency (ms) vs. query sequence for MonetDB, Modeltools (R), NumPy (Python), and Online] 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types

95 Total execution time (s) Range-Uniform* Range-Zoom-in* No DC Online Offline * 2000 Queries 1 thread In-memory 40M rows, 100 columns, 5 statistic types

96 Total execution time (s) Range-Uniform* Range-Zoom-in* DC benefits regardless of the exploration scenario No DC Online Offline 1 thread In-memory 40M rows, 100 columns, 5 statistic types

97 [Plot: total execution time (s) for Collaborative Filtering, Bayesian Classification, and Simple Linear Regression; series: Online, No DC Individual, Sequential] 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types

98 Data Canopy: Accelerating Exploratory Statistical Analysis. Statistics are everywhere! Repetitive statistics and data access. Data Canopy synthesizes statistics from basic ingredients

99 QUERIOSITY Accelerate by synthesis Provide hints daslab.seas.harvard.edu/queriosity

100 daslab.seas.harvard.edu/data-canopy daslab.seas.harvard.edu/queriosity Thank You!

101 Range-Zoom-in [Plot: response time (ms) vs. Range-Zoom-in query sequence for MonetDB, NumPy (Python), Modeltools (R), and Online DC] 1 thread, in-memory, 40M rows, 100 columns, 5 statistic types

102 Segment Tree in Data Canopy. Root: $f(\{f(X_1), f(X_2)\}, \{f(X_3), f(X_4)\})$. Internal nodes: $f(\{f(X_1), f(X_2)\})$ and $f(\{f(X_3), f(X_4)\})$. Leaves: $f(X_1)$, $f(X_2)$, $f(X_3)$, $f(X_4)$, where $X_i = \{\tau(x_i)\}$ for chunk $i$. Leaves: basic aggregate on every chunk. Internal node: aggregate function applied to its two children. * $f(X) = f(\{f(X_1), f(X_2), \ldots, f(X_n)\})$

103 Operation Modes: Offline, Online, Speculative [Diagram: one timeline per mode, each divided into Prep, Now, Future]

104 Inserts and Updates. Extend segment trees as needed. [Diagram: a column with existing rows and new rows, and a query range] [Plot: response time (s) vs. query sequence for Insert and Reconstruct, with x2 rows, x2 columns, and x2 both] [Plot: total execution time (s), split into Query and Update, vs. point updates (%)]

105 Performance under Memory Pressure [Plot: average response time (ms) vs. query sequence]

106 Handling Memory Pressure [Plots: base data in memory (%) and total execution time (s) per phase, for DC and StatSys, vs. number of rows (M)]

107 Memory Feasibility [Plot: in-memory bivariate statistics within 8GB, number of columns for increasing numbers of rows]

108 Memory Footprint [Plot: memory footprint (GB) of univariate DC and bivariate DC; annotations: s=64kb, Max, U workload]

109 Scaling with Queries [Plot: average response time (ms) vs. query sequence, up to 1x10^6 queries, for the U and Z workloads]

110 Selecting the Chunk Size [Plot: total execution time (s) vs. chunk size (bytes); series labelled 10M, 100M, 1B]
