Processing Aggregate Queries over Continuous Data Streams

Size: px

Start display at page:

Download "Processing Aggregate Queries over Continuous Data Streams"

Ginger Francis
6 years ago
Views:

1 Processing Aggregate Queries over Continuous Data Streams Alin Dobra Computer Science Department Cornell University April 15, 2003

2 Relational Database Systems did dname 15 Legal 17 Marketing 3 Development Departments did eid ename did 10 Nicole Emma 15 5 Dan 15 2 Mike 17 Employees Join operator: did dname eid ename 15 Legal 11 Emma 15 Legal 5 Dan 17 Marketing 10 Nicole 17 Marketing 2 Mike Subset of cross product that satisfies constraint Relations contain no redundant information. Joins put information back together 2

3 Relational vs Streaming Data Relational Database Systems Relations reside on disk Access to data under complete control Queries refer to current state and posed once New Applications: network monitoring, sensor networks, telecom call records Data arrives in the form of continuous stream Order and rate or arrival not under the control of the system Queries are long running Most assumptions in relational database systems break Computation over Data Streams Frequency moments [AGM96], distinct values [FM96], norms [I00] Stream processing engines: NIAGARA, STREAM, Aurora 3

4 Processing Network Data Streams Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Challenge: deal with large amounts of streaming data using small memory approximate answers often suffice (e.g., trend/pattern analysis Other applications: one pass queries, sensor networks 4

5 Streaming Computational Model Synopses for R 1... Synopses for R r... Memory Stream for R 1 Stream for R 2 Stream for R r Stream Query-Processing Engine Approximate answers to queries Q 1,...,Q q Query Workload Q 1 Q q Computational model: Single Pass: each record examined at most once Fixed Order: no assumption about the arrival order can be made Small Space: log or poly-log in data stream size main memory algorithms 5

6 Class of Queries: Aggregates over Joins Equi-join J specified by relations R 1,...,R r and join constraints R 1.a 1 = R 2.a 2,... J = R 1 R r = { } t R 1 R r t. R1.a = t. 1 R2.a,... 2 R 1 a 1 a 2 R 2 Queries specified by a join and an aggregate: COUNT, SUM COUNT(R 1 R r is the number of tuples of the join Self join size: SJ(R = COUNT(R R 6

7 Example: Two-Way Join COUNT Query Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G = fg T = [ ] = = 13 Space requirement proportional with size of f and g Multi dimensional data space can be prohibitive E.g. Three attributes, each with domains of size words 7

8 Stream Data Synopses Frequency table maintenance over streams requires too much space Summarization required Conventional data summaries fall short Quantiles and 1-d histograms [MSL98], [GK 01], [GKMS02] Cannot capture attribute correlations Little support for approximation guarantees Samples (e.g., using Reservoir Sampling Perform poorly for non foreign key joins [AGMS99] Cannot handle deletions of records Multi-dimensional histograms/wavelets Construction requires multiple passes over the data Concurrent work [TGIK02] Our approach: use sketches [AMS96],[AGMS99] 8

9 Outline of the Talk Background Introduction to sketches Log space in the size of stream and size of domain of attributes Provable probabilistic guarantees Insertions and deletions possible Sketch partitioning Sketch sharing Future work 9

10 Recall Example Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G= fg T = [ ] = = 13 10

11 Main Idea: Basic Sketching Technique (Alon, Gibbons, Matias, Szegedy 1999 Summarize information in the stream with a single number sketch Use sketches to recover query result Sketches: Choose random ξ 1,ξ 2,ξ 3 { 1,1} Stream F: a Stream G: a X F = ξ 1 + ξ 1 + ξ 2 + ξ 3 + ξ 1 + ξ 3 X G = ξ 3 + ξ 1 + ξ 3 + ξ 1 + ξ 1 Claim X = X F X G estimates COUNT(F a G 11

12 Analysis of Basic Sketching Technique Random vectors: ξ = [ ξ 1...ξ n ] random vector of ±1 values, called projection vector Sketches: Sketch of F, X F = t F ξ t.a = i f i ξ i = fξ T Sketch of G, X G = t G ξ t.a = i g i ξ i = gξ T X = X F X G estimates COUNT(F a G since if E[ξ T ξ ] = I i 1 i 2,E[ξ i1 ξ i2 ] = 0 X is not precise if i 1 i 2 i 3 i 4, E[ξ i1 ξ i2 ξ i3 ξ i4 ] = 0 E[X] = E[fξ T ξ g T ] = fe[ξ T ξ ]g T = fig T = fg T Var(X 2ff T gg T = 2 SJ(F SJ(G ξ need not have elements fully independent. 4-wise independence suffices ξ can be efficiently generated from small seeds [ABI86] ξ vector is not stored. Required elements generated on the fly from seeds 12

13 !!,-,- ( ( % %$ Sketch Error Reduction Standard Solution Estimation of COUNT(F a G from single sketches of F and G is too noisy Solution: Average 8 ε 2 Var(X E 2 [X] E 2 [X] Var(X independent copies of X to reduce error to ε Compute median of 2log1/δ such averages to increase confidence to 1 δ Stream F Sketches of F Seeds Stream G ξ ξ Sketches of G Independent copies of X Average Median COUNT(F G(1 ± ε with prob (1 δ Memory required weakly dependent on the size of the stream: logn + 1 bits/sketch 13

14 Properties of Sketches Advantages: Space requirement logarithmic in the size of the stream and of the domain of the attributes Provable probabilistic approximation guarantees Both insertion and deletion possible For deletion tuples simply decrement the sketch instead of incrementing Simplicity Disadvantages: They apply only to the query COUNT(F a G When Var(X E 2 [X] 1 a large amount of space required 14

15 Estimation of COUNT(R 1 R r (Dobra, Garofalakis, Gehrke, Rastogi, 2002 (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 15

16 Our Contributions Unless distributions in F and G are similar Var(X 1 E 2 [X] Need to produce a large number of copies of X Contributions: Two novel methods to reduce the error of estimates Sketch partitioning (Dobra, Garofkalakis, Gehrke, Rastogi, 2002 Use extra information to intelligently partition the problem The partitioning results in error reduction Sketch sharing (Dobra, Garofkalakis, Gehrke, Rastogi, 2003 Applicable when multiple queries are simultaneously answered Share sketches between queries thus saving space Multi query optimization problem 16

17 Outline of the Talk Background Introduction to sketches Sketch partitioning Use statistical information to reduce error Sketch sharing Future work 17

18 Sketch Partitioning [DGGR02] Large variance means loose estimation guarantees Building estimate with smaller variance reduces error Consider query: COUNT(F a G i f i g i E[X] = 165, E 2 [X] = Var(X 2 SJ(F SJ(G = 2( ( = Idea: Split I = {1,2,3,4} into I 1 = {1,3} and I 2 = {2,4} F I 1 F 1 X 1 I 2 F 2 X = X 1 + X 2 G I 1 G 1 I 2 G 2 E[X ] = COUNT(F 1 G 1 + COUNT(F 2 G 2 = COUNT(F G X 2 18

19 Sketch Partitioning (cont. Estimation of COUNT(F 1 G 1 i f i g i Var(X 1 2 SJ(F 1 SJ(G 1 = 2( ( = Estimation of COUNT(F 2 G 2 i f i g i Var(X 2 2 SJ(F 2 SJ(G 2 = 2( ( = Var(X = Var(X 1 + Var(X 2 = Improvement Var(X/m Var(X /(m/2 = ε Intelligently splitting the domain of the join attribute reduction in error Question: Assuming we know f and g, what is the best way to split? 19

20 Binary Sketch Partitioning Optimization Problem: Find partition I = I 1 I 2 space allocation m 1 for X 1, m 2 for X 2 s.t. m 1 + m 2 = m that minimizes where Var(X 1 m 1 + Var(X 2 m 2, Var(X k 2 fi 2 i I k i I k g 2 i. Solution: Allocate space proportional to Var(X k Transformed optimization criterion: Ψ(I 1 = fi 2 g 2 i + i I 1 i I 1 Naive solution: check all 2 I ways to partition fi 2 g 2 i i I 2 i I 2 20

21 Breiman s Theorem[BFOS84]: the problem has the property that: Binary Sketch Partitioning argmin I1,I 2 i I 1 q i Φ For q i > 0,r i > 0 and Φ(x concave, an optimum of ( i I1 q i r i i I1 q i + i I 2 q i Φ i I 1, j I 2,r i < r j ( i I2 q i r i i I2 q i Corollary: The solution of the optimization problem can be found in O( I log I by sorting i I in increasing order of r i and considering splits only in this order. Application 1: Finding the split point for discrete attribute in classification tree construction. Taking Φ(x = 2x(1 x, q i = P[X = x i ], r i = P[C = c 0 X = x i ] the criterion is gini index (impurity after split. Application 2: Find the optimal binary partitioning: Φ(x = x, q i = f i 2 i I fi 2, r i = g2 i i f 2, criterion becomes Ψ(I 1 i I f 2 i 21

22 Algorithm: Binary Sketch Partitioning Algorithm Order elements of I in increasing order of g i f i Choose the partitioning in this order that minimizes Ψ(I 1 For the optimal partitioning take memory allocation proportional to Var(X k Example: Optimal partition i g i f i Optimal memory allocation 5 : 6 Ordering within I: 1,3,2,4 Optimal partition: I 1 = {1,3}, I 2 = {2,4} Considering splits only in this order matches the intuition 22

23 Sketch Partitioning: Extensions Using Imperfect Information: Can get good partitioning without knowledge of perfect frequency tables Use rough unidimensional histograms instead time and space dependency on number of buckets instead of I provable approximation quality Generalization: K-ary split partitioning Needs generalization of Breiman s Theorem Dynamic programming algorithm Multi-way joins NP-hard With independence assumption becomes tractable decomposes into independent unidimensional problems 23

24 Experimental Study Datasets: Synthetic data sets [Vitter & Wang 99]: Frequency Rectangular regions in multi-dimensional attribute space uniformly distributed. Domain size Attribute Attribute Each region has a Zipfian distribution Number of tuples in each region Zipf distributed Total number of tuples: 10M To simulate correlations we perturbed the placement of regions Census data set ( reported in [DGGR02] Comparison: Query load: estimation using basic method JOIN-COUNT queries Error metric: relative error = 100 actual approx actual % 24

25 Join of two independent unidimensional relations on a common attribute buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 4000 words. 45% - 65% improvement Partitioning in two good enough; finer histograms do not help much 25

26 Chain join of three bidimentional relations on two common attributes buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 9000 words. 30% - 50% improvement Same trends 26

27 Outline of the Talk Background Introduction to sketches Sketch partitioning Sketch sharing Exploit commonality in the queries to reduce error Future work 27

28 Answering multiple queries simultaneously [DGGR, under review] Can apply the basic technique for each join independently COUNT(R 1 R 2 R 1 a ξ (1 a R 2 t R1 ξ (1 t.a t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 R 1 R 3 R a ξ (2 a b ξ (3 a 2 t R1 ξ (2 t.a t R3 ξ (2 t.a ξ (3 t.b t R2 ξ (3 t.a If sketches are reused fewer sketches have to be maintained The total available space is divided between fewer sketches Question: How and when can sketches be reused? 28

29 Sketch Sharing R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a Sharing the sketch of R 1 for both joins Use ξ (1 to enforce both <R 1.a,R 2.a> and <R 1.a,R 3.a> constraints Number of sketches reduced from 5 to 4 Each sketch gets 25% more space 29

30 Sketch Sharing R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b Sketch sharing for R 1 ξ (1 on both <R 1.a,R 2.a> and <R 3.b,R 2.a> edges ξ (1 used on all edges Number of sketches reduced from 5 to 3 Each sketch gets 66% more space Simple sketch sharing algorithm 30

31 Sketch Sharing Problem: Estimate COUNT(R 1 R 2 R 3 = f R1 f T R 2 f T R 3 R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Φ = E [ COUNT(R 1 R 3 R 2 ξ t.a ξ t.a ξ t.b ξ t.a t R 1 t R 2 t R 3 = E[f R1 ξ T ξ f T R 2 ξ T ξ f T R 3 ] = f R1 E[ξ T ξ f T R 2 ξ T ξ ]f T R 3 ] t R3 ξ (1 t.a ξ (1 t.b f R1 f T R 2 f T R 3 since E[ξ T ξ f T R 2 ξ T ξ ] f T R 2 Φ COUNT(R 1 R 2 R 3 sharing sketch for R 2 results in wrong estimation Apparent reason: ξ (1 assigned to both constraints of red join 31

32 Estimation of COUNT(R 1 R r (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 32

33 Sketch Sharing: Conflicts in Sharing Schemes Correctness: within each join a different ξ -family is assigned to each join constraint. R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 No conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b 33

34 Sketch Sharing: Algorithm Problem: Given a set of queries to be approximated compute, the sketch sharing scheme with no conflicts that has the smaller number of sketches. Characterization: Problem is NP-hard. Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible 34

35 Key Observation: Space Allocation Problem Allocating identical space to each sketch might not optimize average error/max error Relative square error for Q ε 2 Q Var(X M Q E 2 [X] = W Q M Q m v memory allocation of sketch v V, M Q memory allocation for query Q Q Space allocation problem: Find m v,v V and M Q,Q Q that minimizes either Average Error: W Q, or Q Q M Q W Q Max Error: max Q Q M Q such that v V, Q Q(v : M Q m v (sketch-query constraint v V m v = m (total memory constraint 35

36 Solving the Space Allocation Problem Can integrate space allocation problems into sketch sharing algorithm Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible In Step 2 of algorithm solve the space allocation problem for each pair pick among the candidate pairs the one that reduces error the most Minimize max error: reduces to classic optimization problem Can be solved in O( V log V Minimize average error: complex convex optimization problem Integer version NP-hard Continuous version gives solution with good approximation efficient custom solution 36

37 Continuous Average Error Problem Optimization problem: Strictly convex optimization problem min W Q Φ(M Q Q Q v V, Q Q(v : v V m v = m M Q m v Solvable in polynomial time with general convex solvers. Slow in practice Solution satisfies the Karush-Kuhn-Tucker conditions: Q Q : W Q Φ (M Q + v V (Q µ v,q = 0 v V : Q Q(v µ v,q = λ Q Q, v J : µ v,q (m v M Q = 0 m v = M v J 37

38 Continuous Average Error Problem KKT equations not directly solvable Key idea: define equivalence relation v V and Q Q(v are equivalent if m v = M Q in optimal solution + closure Induces equivalence classes C 1,C 2,... over V Q C 2 C 4 C 1 C 3 C 7 C 5 C 6 Using KKT conditions can show: Is enough to find the equivalence classes C 1,C 2,... to get the solution Can get conditions that equivalence classes satisfy Can setup a Max-flow problem that finds divider between classes 38

39 Experiments TPC-H Frequency PART Supplier Attribute PARTKEY 1 SUPPKEY NATIONKEY Attribute PARTKEY 9 PARTSUP SUPPKEY 8 CUSTOMER PARTKEY 7 SUPPKEY 6 CUSTKEY 4 LINEITEM ORDERKEY 5 ORDERS ORDERKEY 10 Q 1 1,2 Q 9 1 Q 17 8,9 Q 25 2,7 Q 2 4,5 Q 10 6,7 Q 18 5,9 Q 26 1,6 Q 3 3,4,5 Q 11 5,8 Q 19 6,8 Q 27 3,8 Q 4 4,5,8 Q Q 20 7,8 Q 28 1,2,3 Q 5 4,5,8,9 Q 13 4 Q 21 8 Q 29 2,3,4 Q 6 2 Q 14 3 Q 22 6 Q 7 5 Q 15 3,4 Q 23 7 Q 8 9 Q 16 5,8 Q 24 2,3 W 1 = Q 1 : Q 12, W 2 = Q 1 : Q 29 39

40 Average Error for Workload W sketch sharing no sharing 25 Relative Error(% Total Memory(words Relative error reduced by 35% Number of sketches reduced from 34 to 16 40

41 Average Error for Workload W sketch sharing no sharing 40 Relative Error(% Total Memory(words Relative error reduced by 45%. Larger reduction Number of sketches reduced from 82 to 25 41

42 Summary Basic sketch technique Summarize streams by projecting them on random vectors Log space, provable estimation guarantees, deletions Require a lot of space when variance is large Contributions: Sketch partitioning Uses extra information to intelligently split domain of attributes Results in better approximation scheme Sketch sharing Exploits commonality in queries Space and computation is shared among queries Both methods lead to interesting optimization problems 42

43 Future Work: Applications Data-mining Monitoring Statistics, Correlations Aggregates over Joins Aggregates over joins are building blocks for higher level applications Useful for correlating information from multiple sources Integrate approximate query processing with data-mining and monitoring 43

44 Future Work: Distributed Computation Sketches are linear: SK(D 1 D 2 = SK(D 1 + SK(D 2 Sketches are useful for distributed computation No need to move data. Send sketches instead Computation of aggregates over joins as easy as computation of SUM Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Network Monitoring Peer-to-peer/sensor network Gossip-style decentralized algorithms for aggregate queries (with Kempe & Gehrke 44

45 Future work Extensions Combination with other synopses data-structures Distinct value queries Retroactive queries Improvements in the presence of extra knowledge Types of extra knowledge Schema information: foreign keys, size of relations Statistical information Workload information Find novel ways to use extra knowledge Approximate stream query processing system Integration into existing system or new system 45

46 Questions 46

Sketching Sampled Data Streams

Sketching Sampled Data Streams Sketchng Sampled Data Streams Florn Rusu and Aln Dobra CISE Department Unversty of Florda March 31, 2009 Motvaton & Goal Motvaton Multcore processors How to use all the processng power? Parallel algorthms