Approximate Query Processing Using Wavelets

Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, Kyuseok Shim
Presented by Guanghua Yan

Outline
- Approximate query processing: problem and prior solutions
- Another solution: wavelets
- Using wavelets to construct a synopsis
  - 1-D Haar wavelets
  - Multi-D Haar wavelets
  - Construction of the synopsis
- Query processing in the wavelet domain
  - Select
  - Project
  - Join
  - Rendering the result
- Experimental evaluation
- Conclusions

Why do we need Approximate Query Processing?
- Characteristics of DSS (decision support system) applications
  - Huge amounts of data (GB/TB)
  - High query complexity
  - Stringent response-time requirements
- An EXACT answer is NOT always required
  - DSS applications are exploratory in nature
  - Aggregate queries: precision to the penny? No
  - A fast, approximate answer is preferable
- Problem: an SQL query run directly on the data warehouse (GB/TB) returns exact answers, but with a long response time
- Approximate query processing: approximate answers with a quick response

How does Approximate Query Processing work?
1. Construct compact relations (MB) from the data warehouse (GB/TB), in advance
2. Rewrite the SQL query with the transformation algebra
3. Run the transformed SQL query on the compact relations: fast response times, approximate answers

Previous Work
Construct compact relations using:
- Random sampling (AQUA system)
  - Accurate for aggregate queries (COUNT, SUM, AVG)
  - Not suitable when joins are involved (the join of two samples has too few tuples)
  - Not suitable for non-aggregate queries
- Histograms (Ioannidis and Poosala)
  - Effectiveness at high dimensions is unclear
  - Construction and storage are costly (curse of dimensionality)
  - Must be expanded for joins (a join makes the dimensionality even higher)
- Wavelets (Vitter and Wang)
  - Effective for aggregate queries even at high dimensions
  - Limited query processing scope (only range-sum queries)

Overview of the work in this paper
- Construct a compact synopsis of the interesting tables using multiresolution wavelet decomposition (done in advance)
  - Fast: a single pass over the relation in the best case, otherwise a logarithmic number of passes
- SQL queries are answered by working only on the compact relations, i.e. entirely in the wavelet (compressed) domain
  - Fast response times
  - Results are converted back to the relational domain (rendering) at the end
  - All types of queries are supported: aggregate and non-aggregate
- Fast, accurate, general

Overview of the work: the big picture
- Step 1: Construct compact relations (MB) from the data warehouse (GB/TB), in advance
- Step 2: Rewrite the SQL query with the transformation algebra and run the transformed query on the compact relations (fast response times, approximate answers)
- Step 3: Render the query result into a result relation (if needed)

Step 1: Construct the synopsis with wavelet decomposition
- 1-D Haar wavelets
- Multi-D Haar wavelets
- Construction of the synopsis

What's decomposition? Vector decomposition
- V = (1, 2, 3, 4)
  V = 1 * (1, 0, 0, 0) + 2 * (0, 1, 0, 0) + 3 * (0, 0, 1, 0) + 4 * (0, 0, 0, 1)
- 1, 2, 3, 4 are called coefficients
- b1 = (1, 0, 0, 0) is called a basis vector
- A coefficient is the dot product of the data with a basis vector: 3 = (1, 2, 3, 4) . (0, 0, 1, 0)
- Orthogonal: given two basis vectors b_i and b_j,
    b_i . b_j = 1 if i = j, 0 otherwise
  - No redundancy, regular, easy to reconstruct
- Looks useless (from (1, 2, 3, 4) to (1, 2, 3, 4)) except for the idea of decomposition

What's decomposition? The idea of decomposition
- Fix a set of basis vectors (or basis functions)
- Compute a set of coefficients
  - Multiplying the original data by one basis vector gives one coefficient
  - Dot product (vectors) vs. inner product (functions)
  - # of basis vectors = # of coefficients = # of elements in the original data
- Represent the original data (or function) by a set of coefficients in terms of the basis
- Motivation
  - Find new features of the data (Fourier)
  - Compress the data (wavelets, in this paper)
- The original data can be reconstructed (easy for an orthogonal basis)
  - Multiply each coefficient by the corresponding basis vector
  - Sum up all the products
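The decompose-then-reconstruct idea can be sketched in a few lines of Python. This is only an illustration: the function names `dot`, `decompose`, and `reconstruct` are made up here, and dividing by the squared length of each basis vector is an assumption that lets the same code work for orthogonal bases that are not normalized.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def decompose(data, basis):
    """One coefficient per basis vector: <data, b> / <b, b>."""
    return [dot(data, b) / dot(b, b) for b in basis]

def reconstruct(coeffs, basis):
    """Sum of coefficient * basis vector over the whole basis."""
    n = len(basis[0])
    out = [0.0] * n
    for c, b in zip(coeffs, basis):
        for i in range(n):
            out[i] += c * b[i]
    return out

# The standard basis from the vector example: coefficients equal the data.
standard_basis = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
V = (1, 2, 3, 4)
std_coeffs = decompose(V, standard_basis)

# A different orthogonal basis (hypothetical choice) gives different
# coefficients but reconstructs the same data.
other_basis = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 0, 0), (0, 0, 1, -1)]
other_coeffs = decompose(V, other_basis)
rebuilt = reconstruct(other_coeffs, other_basis)
```

With the standard basis the coefficients are just the data again; with the second basis they change, yet the reconstruction returns the original vector.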

What's decomposition? Function decomposition
- Fourier transform and inverse transform
- Basis functions: cosine and sine functions
- Widely used in engineering
- Problems:
  1. Loses time resolution; good for periodic signals
  2. Basis functions are fixed

What's decomposition? Wavelet decomposition
- Shares the idea of the Fourier transform, with time resolution added
- Wavelet function (mother wavelet)
- Basis functions: scaled and shifted versions of the mother wavelet
  - Orthogonal
  - Vanishing moments, compact support, regularity
- Wavelet decomposition generates compact representations that exploit the local structure of the function
- A wavelet decomposition uses a scaling function and a wavelet function
- Problem: which wavelet decomposition to use? (Haar, CDF(2, X), CDF(3, X), Daubechies series)

Background on Wavelets: 1-D Haar Wavelets
- Why Haar wavelets?
  - Simplest wavelet function
  - Fast to compute (averaging and differencing)
  - Perform well in practice (image compression)
- What does a Haar wavelet decomposition look like? First example
  (values left of "|" are averages, values right of "|" are detail coefficients)

    Original data:  56 40  8 24 48 48 40 16
    After step 1:   48 16 48 28 |  8 -8  0 12
    After step 2:   32 38 | 16 10  8 -8  0 12
    After step 3:   35 | -3 16 10  8 -8  0 12

Haar Wavelet Functions
- Scaling function ("father wavelet"):
    h_0(t) = 1   for t in [0, 1)
             0   otherwise
- Wavelet function ("mother wavelet"):
    h_1(t) =  1  for t in [0, 1/2)
             -1  for t in [1/2, 1)
              0  otherwise
- Figure: each function plotted on [0, 1], together with its scaled and its scaled-and-shifted versions

1-D Haar basis functions ("daughter wavelets")
- Scaled and shifted versions of the mother wavelet
- Scaling function:
    h0: ( 1,  1,  1,  1,  1,  1,  1,  1)
- Wavelet functions:
    h1: ( 1,  1,  1,  1, -1, -1, -1, -1)
    h2: ( 1,  1, -1, -1,  0,  0,  0,  0)
    h3: ( 0,  0,  0,  0,  1,  1, -1, -1)
    h4: ( 1, -1,  0,  0,  0,  0,  0,  0)
    h5: ( 0,  0,  1, -1,  0,  0,  0,  0)
    h6: ( 0,  0,  0,  0,  1, -1,  0,  0)
    h7: ( 0,  0,  0,  0,  0,  0,  1, -1)
- This is the set of basis functions (complete decomposition) for a signal S of length 8
- The vector for each basis function is a sampling of that function
- Multiplying S by each basis vector gives one coefficient (result: 8 coefficients)
- Connection with the first example:
    S            = 56 40  8 24 48 48 40 16
    coefficients = 35 -3 16 10  8 -8  0 12

Compute the 1-D Haar wavelet decomposition: by linear algebra
- Decomposition matrix M_a (collect the 8 basis vectors, one per column)
  - The dot product of any two columns is ZERO
  - Normalizing each column is easy

         | 1  1  1  0  1  0  0  0 |
         | 1  1  1  0 -1  0  0  0 |
         | 1  1 -1  0  0  1  0  0 |
  M_a =  | 1  1 -1  0  0 -1  0  0 |
         | 1 -1  0  1  0  0  1  0 |
         | 1 -1  0  1  0  0 -1  0 |
         | 1 -1  0 -1  0  0  0  1 |
         | 1 -1  0 -1  0  0  0 -1 |

- Decomposition (complete)
  - Given any signal S of length 8, multiplying S by M_a gives the wavelet decomposition: Y = S * M_a
- Reconstruction
  - Make M_a orthogonal (normalized columns), so that M_a^-1 = M_a^T
  - S = Y * M_a^-1 = Y * M_a^T
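A small sketch of the matrix form, in plain Python. The helper names (`haar_basis`, `decomposition_matrix`, `row_times_matrix`) are invented for this illustration. One assumption to note: each column is scaled by the inverse of its squared length rather than made unit-length, because that scaling reproduces the averages and differences of the running example; with unit-length columns the transpose would serve as the inverse instead.

```python
def haar_basis(n):
    """Unnormalized Haar basis vectors for length n (n a power of two)."""
    basis = [[1] * n]                      # scaling function
    width = n
    while width >= 2:
        half = width // 2
        for start in range(0, n, width):   # one wavelet per block of this width
            vec = [0] * n
            for i in range(start, start + half):
                vec[i] = 1
            for i in range(start + half, start + width):
                vec[i] = -1
            basis.append(vec)
        width = half
    return basis

def decomposition_matrix(n):
    """Matrix whose column j is basis vector j divided by its squared length."""
    basis = haar_basis(n)
    norms = [sum(x * x for x in b) for b in basis]
    return [[basis[j][i] / norms[j] for j in range(n)] for i in range(n)]

def row_times_matrix(row, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(row))) for j in range(cols)]

S = [56, 40, 8, 24, 48, 48, 40, 16]
Y = row_times_matrix(S, decomposition_matrix(8))   # decomposition
S_back = row_times_matrix(Y, haar_basis(8))        # rows of the basis invert it
```

Here `Y` should match the coefficients of the first example, and multiplying `Y` by the unscaled basis vectors (as rows) recovers `S`.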

Compute the 1-D Haar wavelet decomposition: scale by scale
- Decomposition
  - Pairwise averaging and differencing (one scale of decomposition)
  - Distribution: put the averages (approximation coefficients) together and the differences (detail coefficients) together
  - Repeat the above on the averages until only one average is left (recursive, complete decomposition)
  - Result: the last average plus all detail coefficients
- Reconstruction
  - Exactly the inverse of the decomposition

How does the 1-D Haar wavelet work? Example
(values left of "|" are original values or averages, values right of "|" are detail coefficients)

    Original data:                  56 40  8 24 48 48 40 16
    Step 1, pairwise (avg, diff):   48  8 16 -8 48  0 28 12
    Step 1, distributed:            48 16 48 28 |  8 -8  0 12
    Step 2, pairwise (avg, diff):   32 16 38 10 |  8 -8  0 12
    Step 2, distributed:            32 38 | 16 10  8 -8  0 12
    Step 3:                         35 | -3 16 10  8 -8  0 12

- Decomposition (log n steps needed): 3 steps give the complete decomposition of 8 values
- Reconstruction: the exact inverse of the above process
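The averaging-and-differencing walk-through can be written as a short Python sketch. The function names `haar_decompose` and `haar_reconstruct` are illustrative, and the input length is assumed to be a power of two.

```python
def haar_decompose(signal):
    """Complete 1-D Haar decomposition by pairwise averaging and differencing.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    current = list(signal)
    details = []
    while len(current) > 1:
        avgs = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        diffs = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = diffs + details   # coarser details go in front of finer ones
        current = avgs              # recurse on the averages only
    return current + details

def haar_reconstruct(coeffs):
    """Exact inverse: expand each average a with its detail d into (a + d, a - d)."""
    current = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded = []
        for a, d in zip(current, diffs):
            expanded.extend([a + d, a - d])
        current = expanded
    return current

data = [56, 40, 8, 24, 48, 48, 40, 16]
coeffs = haar_decompose(data)
restored = haar_reconstruct(coeffs)
```

Each pass halves the number of averages, which is where the log n step count comes from.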

Where are the compression and the approximation? Thresholding
- Set a threshold value C
- Replace the wavelet coefficients whose absolute value is less than C with ZERO
- More zeros among the wavelet coefficients
- Compression: store ONLY the nonzero coefficients
- The more similar the data values, the more compression we get
- How much does this influence the original data?

    No thresholding
      Original data:       56 40  8 24 48 48 40 16
      Coefficients:        35 -3 16 10  8 -8  0 12
      Reconstructed data:  56 40  8 24 48 48 40 16

    Threshold C = 4
      Original data:       56 40  8 24 48 48 40 16
      Coefficients:        35  0 16 10  8 -8  0 12
      Reconstructed data:  59 43 11 27 45 45 37 13

    Threshold C = 9
      Original data:       56 40  8 24 48 48 40 16
      Coefficients:        35  0 16 10  0  0  0 12
      Reconstructed data:  51 51 19 19 45 45 37 13
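The thresholding example can be checked with a small Python sketch. The names `threshold`, `to_sparse`, and `haar_reconstruct` are chosen for illustration; storing the survivors in a dict keyed by position is one assumed way to "store only nonzero" coefficients.

```python
def threshold(coeffs, c):
    """Zero every coefficient whose absolute value is below c."""
    return [0 if abs(w) < c else w for w in coeffs]

def to_sparse(coeffs):
    """Keep only the nonzero coefficients, keyed by their position."""
    return {i: w for i, w in enumerate(coeffs) if w != 0}

def haar_reconstruct(coeffs):
    """Inverse 1-D Haar transform for [average, coarse ... fine details]."""
    current = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded = []
        for a, d in zip(current, diffs):
            expanded.extend([a + d, a - d])
        current = expanded
    return current

coeffs = [35, -3, 16, 10, 8, -8, 0, 12]
approx_c4 = haar_reconstruct(threshold(coeffs, 4))
approx_c9 = haar_reconstruct(threshold(coeffs, 9))
stored_c9 = to_sparse(threshold(coeffs, 9))
```

With C = 9 only four of the eight coefficients need to be stored, at the price of a coarser reconstruction.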

Haar wavelet compression and approximation
Figure: original signal (blue line) versus reconstructed signal (red line) for the two thresholds.
- Threshold C = 4: reconstructed data 59 43 11 27 45 45 37 13
- Threshold C = 9: reconstructed data 51 51 19 19 45 45 37 13

Background on Wavelets: Multi-D Haar Wavelets
- The data cube has multiple dimensions (of equal length)
- Two extensions: standard decomposition and nonstandard decomposition
- Standard decomposition
  1. Fix an ordering of the data dimensions, say 1, 2, ..., d
  2. For dimension k, fix the other (d - 1) dimensions to get a 1-D row vector
  3. Perform the complete 1-D Haar wavelet decomposition on that 1-D vector
  4. Repeat steps 2 and 3 for each dimension, in the order fixed in step 1
- Nonstandard decomposition
  1. Fix an ordering of the data dimensions, say 1, 2, ..., d
  2. In this order, perform one scale of 1-D Haar decomposition along each dimension
  3. Collect the averages together and repeat step 2 on the averages
  - Conceptually: slide a hyperbox of size 2 x 2 x ... x 2 (= 2^d cells) over the data

Multi-D Haar Wavelets (nonstandard): one 2 x 2 step
- Layout of the four cells (Dim 1 = x axis, Dim 2 = y axis):
    c  d
    a  b
- One step along Dim 1 (x axis): each row becomes an average and a difference
    (c + d) / 2    (c - d) / 2
    (a + b) / 2    (a - b) / 2
- One step along Dim 2 (y axis) on those results, e.g.
    S = ((a + b) / 2 + (c + d) / 2) / 2 = (a + b + c + d) / 4
- Wavelet coefficients:
    S  = (a + b + c + d) / 4
    d1 = (a + c - b - d) / 4
    d2 = (a + b - c - d) / 4
    d3 = (a + d - c - b) / 4
- Rebuilding:
    a = S + d1 + d2 + d3
    b = S + d2 - d1 - d3
    c = S + d1 - d2 - d3
    d = S + d3 - d1 - d2
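A quick Python check of one 2 x 2 nonstandard step. The cell layout (a, b in the bottom row and c, d in the top row) and the sign pattern of d1, d2, d3 are assumptions of this sketch, chosen so that decomposition and rebuilding are exact inverses; the function names are illustrative.

```python
def haar_2x2(a, b, c, d):
    """One nonstandard Haar step on a 2 x 2 block laid out as  c d / a b."""
    s = (a + b + c + d) / 4      # overall average
    d1 = (a + c - b - d) / 4     # detail along dim 1 (left column minus right)
    d2 = (a + b - c - d) / 4     # detail along dim 2 (bottom row minus top)
    d3 = (a + d - c - b) / 4     # diagonal detail
    return s, d1, d2, d3

def inverse_2x2(s, d1, d2, d3):
    """Rebuild the four cells from the average and the three details."""
    a = s + d1 + d2 + d3
    b = s + d2 - d1 - d3
    c = s + d1 - d2 - d3
    d = s + d3 - d1 - d2
    return a, b, c, d

block = (1, 2, 3, 4)             # hypothetical cell values a, b, c, d
coeffs = haar_2x2(*block)
rebuilt = inverse_2x2(*coeffs)
```

In the full nonstandard decomposition this step is applied to every 2 x 2 block, and the averages `s` are then gathered and processed again.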

Multi-D Haar Wavelets: example (figure)

Multi-D Haar Coefficients: Semantics and Representation
- Question: what is the contribution of each coefficient W in rebuilding the data array, and how should a coefficient be stored?
- Answer: W = <R, S, v>
  - R: the d-dimensional support hyper-rectangle of W
  - S: the sign information for all d-dimensional cells in W.R
  - v: the magnitude of the coefficient W
- R and S depend only on the Haar basis function
- v depends on the original data

Multi-D Haar Coefficients: Semantics and Representation (example)
Figure: a 2-D data array A and its array of wavelet coefficients Wa, each indexed 0..3 along both dimensions.
- For W = Wa[1, 2]:
    W.v = 2
    W.R.bound[1].lo = 2    W.R.bound[1].hi = 3
    W.R.bound[2].lo = 0    W.R.bound[2].hi = 1
    W.S.sign[1].lo, W.S.sign[1].hi, W.S.sign[2].lo, W.S.sign[2].hi: each is + or -
    W.S.schg[1] = 2        W.S.schg[2] = 1
- Rebuilding a cell: A[0,1] is a signed combination of Wa[0,0], Wa[0,1], Wa[1,0], Wa[1,1], Wa[0,2], Wa[2,0], Wa[2,2]; the terms 2.5, 1 and .5 combine, with their signs, to 3

Notation used in the paper

Construction of Compact Relations: wavelet decomposition of the JFD matrix
- Relation (numeric attributes) -> Joint Frequency Distribution (JFD) matrix
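As a rough illustration of what the JFD matrix holds, the sketch below counts how often each combination of attribute values occurs in a two-attribute relation. The tuples are made-up data, and `build_jfd` and `to_dense` are hypothetical helper names.

```python
from collections import Counter

def build_jfd(tuples):
    """Joint frequency distribution: attribute-value combination -> count."""
    return Counter(tuples)

def to_dense(jfd, shape):
    """Expand a sparse 2-D JFD into a dense matrix indexed [attr1][attr2]."""
    rows, cols = shape
    matrix = [[0] * cols for _ in range(rows)]
    for (x, y), count in jfd.items():
        matrix[x][y] = count
    return matrix

# Hypothetical relation with two numeric attributes, each in 0..7.
relation = [(1, 2), (1, 2), (3, 0), (1, 2), (6, 5)]
jfd = build_jfd(relation)
dense = to_dense(jfd, (8, 8))
```

The dense matrix is what the multi-dimensional Haar decomposition would be applied to; in practice the matrix is very sparse, so a real implementation would avoid materializing it.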

Thresholding
- Retain the k coefficients with the largest absolute value after normalization
  - This minimizes the overall mean squared error
- The set of coefficients retained after thresholding is the wavelet-coefficient synopsis
- All SQL queries run on the synopsis
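Retaining the k largest coefficients can be sketched as follows. The sketch assumes the values passed in have already been normalized (the normalization itself is not shown), and `keep_top_k` is an illustrative name.

```python
import heapq

def keep_top_k(coeffs, k):
    """Keep the k coefficients with the largest absolute (normalized) value.

    coeffs maps a coefficient's position to its normalized value.
    """
    top = heapq.nlargest(k, coeffs.items(), key=lambda item: abs(item[1]))
    return dict(top)

# Values from the 1-D running example, treated here as already normalized.
coeffs = dict(enumerate([35, -3, 16, 10, 8, -8, 0, 12]))
synopsis = keep_top_k(coeffs, 4)
```

Compared with a fixed threshold C, choosing k directly controls the size of the synopsis.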

Summary of Step 1: wavelet decomposition and construction of the synopsis
- 1-D Haar wavelet decomposition
  - Simple and fast to compute
  - Pairwise averaging and differencing, applied recursively
- Multi-D Haar wavelet decomposition
  - Nonstandard extension
  - Alternates between dimensions
- Thresholding
  - Drops the smallest coefficients
  - Lossy data compression, hence approximation
- How to store coefficients
  - Semantics of the notation W = (R, S, v)
  - SQL queries operate on the coefficients

Query Processing (Step 2)
- The entire processing happens in the compressed (wavelet) domain
- Compressed domain (FAST): wavelet synopses -> querying in the wavelet domain -> query results in the wavelet domain -> render -> final approximate results
- Relation domain (SLOW): wavelet synopses -> render -> approximate relations -> querying in the relation domain -> final approximate results

Query Processing
- Each operator (e.g., select, project, join, aggregates)
  - Input: set of coefficients
  - Output: set of coefficients
- Finally, a rendering step
  - Input: set of coefficients
  - Output: (multi)set of tuples
- Figure: an operator tree in which selects, a project, and a join pass sets of coefficients upward, with a render step at the top producing a set of tuples
- Questions
  - How do we map the query algebra?
  - Can we maintain the semantics of the coefficients?

Query algebra mapping: Selection (definition)
- Select_pred(W_T)
  - T is a d-dimensional relation
  - W_T is T's wavelet synopsis
- pred = (l_i1 <= D_i1 <= h_i1) ^ ... ^ (l_ik <= D_ik <= h_ik)
  - A k-dimensional range selection
  - Ranges are defined for k dimensions, D = {D_i1, D_i2, ..., D_ik}
  - Ranges are unspecified for the remaining (d - k) dimensions; for such a dimension D_x the range is its whole domain
- Example on the next slide

Query algebra mapping: Selection (example)
- Domains: D1: (0, 7), D2: (0, 7)
- pred = (1 <= D1 <= 4) ^ (2 <= D2 <= 6), D = {D1, D2}
- Relation (each row is one nonzero cell of the JFD matrix; the figure shows the same cells with the query range marked):

    Dim D1 (Attr1)   Dim D2 (Attr2)   Count
    0                6                6
    1                2                3
    1                3                4
    1                5                6
    1                6                8
    2                6                7
    3                0                1
    4                2                3
    5                2                2
    6                1                3
    6                2                2
    6                5                1
    6                6                3

- In the relation domain, we are interested only in the cells inside the query range
- In the wavelet domain, we are interested only in the coefficients that contribute to those cells

Query algebra mapping: Selection (mapping)

  1. For each W in W_T do
  2.   If for every D_ij in D                      /* check overlap */
           (l_ij <= W.R.bound[i_j].lo <= h_ij) or
           (W.R.bound[i_j].lo <= l_ij <= W.R.bound[i_j].hi)
       then goto 3 else goto 5
  3.   For all D_ij in D do                        /* the overlap is the new hyper-rectangle */
           W.R.bound[i_j].lo := max{l_ij, W.R.bound[i_j].lo}
           W.R.bound[i_j].hi := min{h_ij, W.R.bound[i_j].hi}
           if W.R.bound[i_j].hi < W.S.schg[i_j] then        /* no sign change any more */
               W.S.schg[i_j] := W.R.bound[i_j].lo
               W.S.sign[i_j] := [W.S.sign[i_j].lo, W.S.sign[i_j].lo]
           elseif W.R.bound[i_j].lo >= W.S.schg[i_j] then   /* no sign change any more */
               W.S.schg[i_j] := W.R.bound[i_j].lo
               W.S.sign[i_j] := [W.S.sign[i_j].hi, W.S.sign[i_j].hi]
  4.   Output the updated W: W_S := W_S U {W}
  5.   Goto 1 and select the next W

Figure: coefficients W1..W4 over dimensions D1 and D2 with the query range marked, shown before and after selection.
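The selection mapping can be sketched in Python as below. The `Coefficient` class and its field names are a hypothetical encoding of W = <R, S, v> (dimensions are 0-indexed here), with the convention that cells below `schg` take `sign_lo` and cells from `schg` upward take `sign_hi`.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

@dataclass
class Coefficient:
    lo: List[int]        # W.R.bound[i].lo
    hi: List[int]        # W.R.bound[i].hi
    sign_lo: List[int]   # W.S.sign[i].lo (+1 or -1)
    sign_hi: List[int]   # W.S.sign[i].hi (+1 or -1)
    schg: List[int]      # W.S.schg[i]; equals lo[i] when there is no sign change
    v: float             # magnitude

def select(synopsis: List[Coefficient],
           ranges: Dict[int, Tuple[int, int]]) -> List[Coefficient]:
    """Keep coefficients overlapping every selection range, clipped to the range."""
    result = []
    for w in synopsis:
        overlaps = all((l <= w.lo[i] <= h) or (w.lo[i] <= l <= w.hi[i])
                       for i, (l, h) in ranges.items())
        if not overlaps:
            continue
        out = replace(w, lo=list(w.lo), hi=list(w.hi), sign_lo=list(w.sign_lo),
                      sign_hi=list(w.sign_hi), schg=list(w.schg))
        for i, (l, h) in ranges.items():
            out.lo[i] = max(l, w.lo[i])
            out.hi[i] = min(h, w.hi[i])
            if out.hi[i] < w.schg[i]:        # only the low-sign part survives
                out.schg[i] = out.lo[i]
                out.sign_hi[i] = w.sign_lo[i]
            elif out.lo[i] >= w.schg[i]:     # only the high-sign part survives
                out.schg[i] = out.lo[i]
                out.sign_lo[i] = w.sign_hi[i]
        result.append(out)
    return result

w = Coefficient(lo=[0], hi=[7], sign_lo=[1], sign_hi=[-1], schg=[4], v=3.0)
low_part = select([w], {0: (1, 3)})
high_part = select([w], {0: (5, 6)})
straddle = select([w], {0: (2, 5)})
missed = select([Coefficient([0], [1], [1], [-1], [1], 2.0)], {0: (4, 6)})
```

The magnitude `v` is never touched by selection; only the hyper-rectangle and the sign information change.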

Query algebra mapping: Projection (definition)
- Project_{X_i1, X_i2, ..., X_ik}(W_T)
  - T is a d-dimensional relation
  - W_T is T's wavelet synopsis
- X_i1, X_i2, ..., X_ik are the attributes we are interested in
- The remaining (d - k) dimensions are projected out, one by one
- Example on the next slide

Query algebra mapping: Projection (example)
- D1 is retained; D2 is projected out (eliminated)
- Figure: the JFD matrix and, next to it, the result of the projection; the projected counts for D1 = 6 down to D1 = 0 are 9, 2, 3, 1, 7, 21, 6
- Rows with D1 = 6 before the projection:

    Dim D1 (Attr1)   Dim D2 (Attr2)   Count
    6                1                3
    6                2                2
    6                5                1
    6                6                3

- After the projection:

    Dim D1 (Attr1)   Count
    6                9

- In the relation domain, sum the elements in each row along the eliminated dimension
- In the wavelet domain, sum the contribution of each coefficient along the eliminated dimension

Query algebra mapping: Projection (mapping)

  1. For each D_j to be projected out do
  2.   For every W in W_T do
  2.1      Set W.v := W.v * P_j, where
               P_j = (W.R.bound[j].hi - W.S.schg[j] + 1) * W.S.sign[j].hi
                   + (W.S.schg[j] - W.R.bound[j].lo) * W.S.sign[j].lo
  2.2      Discard dimension D_j (hyper-rectangle and sign information) from W
  3. Goto 1 and select the next D_j

- In step 2, summing up the contributions of W along D_j is what projects D_j out
- In short, for each W: multiply W.v by the product of P_j over all projected-out dimensions D_j, then discard those dimensions
- Figure (project on D1): a coefficient W1 with no sign change along D2 has its value scaled by the number X of cells it spans along D2; a coefficient W2 with a sign change is scaled by the signed difference between the cell counts X1 and X2 on the two sides of the change
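A sketch of the projection step in Python. The `Coefficient` class is a hypothetical encoding of W = <R, S, v> with 0-indexed dimensions, and `project_out` is an illustrative name; the factor follows the P_j formula, with `sign_hi` covering cells from `schg` to `hi` and `sign_lo` covering cells from `lo` to `schg - 1`.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Coefficient:
    lo: List[int]        # W.R.bound[i].lo
    hi: List[int]        # W.R.bound[i].hi
    sign_lo: List[int]   # W.S.sign[i].lo (+1 or -1)
    sign_hi: List[int]   # W.S.sign[i].hi (+1 or -1)
    schg: List[int]      # W.S.schg[i]; equals lo[i] when there is no sign change
    v: float

def projection_factor(w: Coefficient, j: int) -> int:
    """P_j: signed number of cells of w along dimension j."""
    return ((w.hi[j] - w.schg[j] + 1) * w.sign_hi[j]
            + (w.schg[j] - w.lo[j]) * w.sign_lo[j])

def project_out(w: Coefficient, dims: List[int]) -> Coefficient:
    """Sum w's contribution along each dimension in dims, then drop them."""
    v = w.v
    for j in dims:
        v *= projection_factor(w, j)
    keep = [i for i in range(len(w.lo)) if i not in dims]
    return Coefficient(lo=[w.lo[i] for i in keep], hi=[w.hi[i] for i in keep],
                       sign_lo=[w.sign_lo[i] for i in keep],
                       sign_hi=[w.sign_hi[i] for i in keep],
                       schg=[w.schg[i] for i in keep], v=v)

# 2-D coefficient: sign change along dimension 0, none along dimension 1.
w = Coefficient(lo=[0, 0], hi=[3, 3], sign_lo=[1, 1], sign_hi=[-1, 1],
                schg=[2, 0], v=2.0)
without_dim1 = project_out(w, [1])   # factor 4: four cells, all positive
without_dim0 = project_out(w, [0])   # factor 0: two positive, two negative
```

A coefficient whose positive and negative halves are equally long along the eliminated dimension contributes nothing after projection, which is why its value becomes 0.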

Query algebra mapping: Equi-join (definition)
- Join_pred(W_T1, W_T2)
  - Dim(T1) = d1, Dim(T2) = d2
  - W_T1 is the wavelet synopsis of T1, W_T2 is the wavelet synopsis of T2
- pred = (X1_1 = X2_1) ^ ... ^ (X1_k = X2_k), where Xj_i is the i-th attribute of T_j
  - pred is k-dimensional, with k <= d1 and k <= d2
  - WLOG, assume the join attributes are the first k dimensions of both T1 and T2
  - Let D = (D_1, D_2, ..., D_k)
- The result has (d1 + d2 - k) dimensions
- Example on the next slide

Query algebra mapping: Equi-join (example)
- Figure: the JFD matrix of Relation1 (dimensions D1, D2) and the JFD matrix of Relation2 (dimensions D1, D3), joined along D1; the two highlighted cells have the same value on D1
- Relation1:

    Dim D1 (Attr1)   Dim D2 (Attr2)   Count
    6                2                7

- Relation2:

    Dim D1 (Attr1)   Dim D3 (Attr3)   Count
    6                3                3

- Join along D1:

    Dim D1 (Attr1)   Dim D2 (Attr2)   Dim D3 (Attr3)   Count
    6                2                3                21

- In the relation domain, join count = 7 * 3
- In the wavelet domain, consider all pairs of coefficients, check joinability, and compute the new coefficients

Query algebra mapping: Equi-join (example, continued)
- Case 1: no overlap along the join dimension D1
  - Output nothing
- Case 2: overlap
  - Cell A(X1, X2) and cell B(X1, X3)
  - W11 and W12 cover A (W12 not shown)
  - W21 and W22 cover B (W22 not shown)
  - Join result for (X1, X2, X3):
      (W11.v + W12.v) * (W21.v + W22.v)
        = W11.v * W21.v + W11.v * W22.v + W12.v * W21.v + W12.v * W22.v
  - So consider each coefficient pair separately; for the pair shown, W.v = W11.v * W21.v
- The join range along any dimension can contain at most one true sign change, due to the complete-containment property of the Haar wavelet decomposition
- Figure: W11 over (D1, D2) and W21 over (D1, D3); without overlap on D1 nothing is produced, with overlap a new coefficient over (D1, D2, D3) is produced

Equi-join: mapping

  1. For each pair (W1, W2) with W1 in W_T1 and W2 in W_T2 do
  2.   If for every D_i in D                      /* check overlap in the k join dimensions */
           (W1.R.bound[i].lo <= W2.R.bound[i].lo <= W1.R.bound[i].hi) OR
           (W2.R.bound[i].lo <= W1.R.bound[i].lo <= W2.R.bound[i].hi)
       then goto 3 else goto 7
  3.   For each join dimension D_i in D do        /* steps 3-6 build a new coefficient on the join range */
  3.1      W.R.bound[i].lo := max{W1.R.bound[i].lo, W2.R.bound[i].lo}     /* set the join boundary */
           W.R.bound[i].hi := min{W1.R.bound[i].hi, W2.R.bound[i].hi}
  3.2      For j = 1, 2                           /* compute sign info; S_j is a temporary sign-vector variable */
               if W.R.bound[i].hi < W_j.S.schg[i] then
                   S_j := [W_j.S.sign[i].lo, W_j.S.sign[i].lo]
               elseif W.R.bound[i].lo >= W_j.S.schg[i] then
                   S_j := [W_j.S.sign[i].hi, W_j.S.sign[i].hi]
               else
                   S_j := W_j.S.sign[i]
  3.3      W.S.sign[i] := [S_1.lo * S_2.lo, S_1.hi * S_2.hi]
  3.4      If W.S.sign[i].lo == W.S.sign[i].hi then
               W.S.schg[i] := W.R.bound[i].lo
  3.5      else
               W.S.schg[i] := max over j = 1, 2 of {W_j.S.schg[i] : W_j.S.schg[i] in [W.R.bound[i].lo, W.R.bound[i].hi]}
  4.   For each non-join dimension D_i, i = k + 1, ..., d1 do            /* steps 4-5 inherit the non-join dimensions */
           W.R.bound[i] := W1.R.bound[i]
           W.S.sign[i]  := W1.S.sign[i]
           W.S.schg[i]  := W1.S.schg[i]
  5.   For each non-join dimension D_i, i = d1 + 1, ..., d1 + d2 - k do
           W.R.bound[i] := W2.R.bound[i - d1 + k]
           W.S.sign[i]  := W2.S.sign[i - d1 + k]
           W.S.schg[i]  := W2.S.schg[i - d1 + k]
  6.   Set W.v := W1.v * W2.v and output W: W_S := W_S U {W}
  7.   Goto 1 and select another pair
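The pairwise join step can be sketched in Python as follows. The `Coefficient` class is a hypothetical encoding of W = <R, S, v> with 0-indexed dimensions, the first `k` dimensions of both inputs are assumed to be the join dimensions, and `join_pair` is an illustrative name.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Coefficient:
    lo: List[int]        # W.R.bound[i].lo
    hi: List[int]        # W.R.bound[i].hi
    sign_lo: List[int]   # W.S.sign[i].lo (+1 or -1)
    sign_hi: List[int]   # W.S.sign[i].hi (+1 or -1)
    schg: List[int]      # W.S.schg[i]; equals lo[i] when there is no sign change
    v: float

def restricted_sign(w: Coefficient, i: int, lo: int, hi: int) -> Tuple[int, int]:
    """Sign pair of w along dimension i once clipped to [lo, hi]."""
    if hi < w.schg[i]:
        return w.sign_lo[i], w.sign_lo[i]
    if lo >= w.schg[i]:
        return w.sign_hi[i], w.sign_hi[i]
    return w.sign_lo[i], w.sign_hi[i]

def join_pair(w1: Coefficient, w2: Coefficient, k: int) -> Optional[Coefficient]:
    """Join two coefficients on their first k dimensions; None if they do not overlap."""
    for i in range(k):
        if not ((w1.lo[i] <= w2.lo[i] <= w1.hi[i]) or (w2.lo[i] <= w1.lo[i] <= w2.hi[i])):
            return None
    lo, hi, s_lo, s_hi, schg = [], [], [], [], []
    for i in range(k):
        l, h = max(w1.lo[i], w2.lo[i]), min(w1.hi[i], w2.hi[i])
        a, b = restricted_sign(w1, i, l, h), restricted_sign(w2, i, l, h)
        sl, sh = a[0] * b[0], a[1] * b[1]
        if sl == sh:
            change = l                      # no sign change on the join range
        else:
            change = max(w.schg[i] for w in (w1, w2) if l <= w.schg[i] <= h)
        lo.append(l); hi.append(h); s_lo.append(sl); s_hi.append(sh); schg.append(change)
    # Non-join dimensions are inherited: first those of w1, then those of w2.
    for w in (w1, w2):
        lo += w.lo[k:]; hi += w.hi[k:]
        s_lo += w.sign_lo[k:]; s_hi += w.sign_hi[k:]; schg += w.schg[k:]
    return Coefficient(lo, hi, s_lo, s_hi, schg, w1.v * w2.v)

w1 = Coefficient(lo=[0, 0], hi=[7, 3], sign_lo=[1, 1], sign_hi=[-1, 1], schg=[4, 0], v=2.0)
w2 = Coefficient(lo=[4, 2], hi=[7, 5], sign_lo=[1, 1], sign_hi=[-1, -1], schg=[6, 4], v=3.0)
joined = join_pair(w1, w2, 1)
disjoint = join_pair(Coefficient([0, 0], [3, 3], [1, 1], [-1, 1], [2, 0], 1.0), w2, 1)
```

Running this over every pair from the two synopses yields the output synopsis; the number of output coefficients is bounded by the product of the two input sizes.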

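Query algebra mapping: Equi-join (example figures)
- Coefficients that do not overlap on the join dimension D1: NOTHING is output
- Coefficients that overlap on D1: a new coefficient over (D1, D2, D3) with val = val1 * val2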

Query algebra mapping: Equi-join (example figures, continued)
- Two further overlapping pairs, each producing a new coefficient over (D1, D2, D3) with val = val1 * val2

Summary of Step 2: query algebra mapping (non-aggregate operators only)
- Selection
  - Update the wavelet coefficients whose hyper-rectangle overlaps the selection range
- Projection
  - Sum up each wavelet coefficient's contribution along all dimensions to be projected out
- Join
  - Create new wavelet coefficients
  - Hyper-rectangle = the join range plus the non-join dimensions
  - Compute the sign information
- Results need to be rendered
  - The output of the above operators is a set of wavelet coefficients
  - It must be converted into a database relation

Rendering (Step 3)
- Go back from the wavelet domain to database relations
- The semantics of the wavelet coefficients are unchanged
  - Range, sign, sign change, magnitude
- The inverse wavelet decomposition is easy
  - Sum up the contributions of all coefficients to each cell
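A naive rendering sketch in Python: every coefficient adds its signed magnitude to each cell of its hyper-rectangle. This direct cell-by-cell loop is only meant to show the semantics and is not claimed to be the rendering algorithm of the paper; the tuple layout of a coefficient and the name `render` are assumptions of this sketch.

```python
from itertools import product

def render(coefficients):
    """Sum the contribution of every coefficient to every cell it covers.

    Each coefficient is a tuple (lo, hi, schg, sign_lo, sign_hi, v) with one
    list entry per dimension; cells below schg[i] take sign_lo[i], the rest
    take sign_hi[i]. Returns a dict mapping cell -> approximate count.
    """
    cells = {}
    for lo, hi, schg, sign_lo, sign_hi, v in coefficients:
        ranges = [range(l, h + 1) for l, h in zip(lo, hi)]
        for cell in product(*ranges):
            sign = 1
            for i, x in enumerate(cell):
                sign *= sign_lo[i] if x < schg[i] else sign_hi[i]
            cells[cell] = cells.get(cell, 0) + sign * v
    return cells

# 1-D coefficients of the data (1, 2, 3, 4): overall average plus three details.
coefficients = [
    ([0], [3], [0], [1], [1], 2.5),     # average, no sign change
    ([0], [3], [2], [1], [-1], -1.0),   # coarse detail
    ([0], [1], [1], [1], [-1], -0.5),   # fine detail, left half
    ([2], [3], [3], [1], [-1], -0.5),   # fine detail, right half
]
cells = render(coefficients)
```

With all coefficients retained the cells come back exactly; after thresholding, the same loop yields the approximate counts.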

Experimental Results
- Compare the wavelet-based technique
  - With sampling and histograms
  - In terms of efficiency and accuracy
- Measuring accuracy (error metrics)
  - Aggregate queries: absolute relative error
  - Non-aggregate queries: EMD (earth mover's distance) error
- Query types
  - SELECT, SELECT-SUM, SELECT-JOIN, SELECT-JOIN-SUM

Datasets and Queries
- Synthetic data set
- Real data set: CENSUS Population Survey (www.census.gov), 1992 and 1994
  - 4-D data: age (0-17), education level (0-46), income (0-41), hrs/week (0-13)
  - JFD matrix size: 2 million cells (32 * 64 * 64 * 16)
  - Relation sizes (2 relations): ~16,000
  - Density: ~0.001
- Queries
  - Selects: 5 <= age < 10 ^ 10 <= income < 15, selectivity ~6%
  - Joins: join on age between the 1992 and 1994 data
  - Sum: sum on age

Query Execution Time
- A two-dimensional synthetic data set is used
- Running time on the base relation is 3.6 seconds (with enough memory)
- Sampling is not included here
  - It yields too few join tuples
- Wavelets run faster than histograms
  - By more than two orders of magnitude
  - Histograms are expanded to generate the tuple-value distribution
  - Wavelets are expanded only at the very end

Query Execution Accuracy


Conclusion
Wavelets are an effective tool for general-purpose approximate query answering:
- Fast query processing (entirely in the wavelet, i.e. compressed, domain)
- Low synopsis construction cost
- High accuracy even at high dimensions
- Can handle all types of queries