Improving Retrieval Cost by Choosing the Best Wavelet Decomposition for Multidimensional Datasets

Dimitris Sacharidis, Cyrus Shahabi, Huseyin Balli, Andreas Xeros, Antonio Ortega
University of Southern California, Los Angeles, CA
{sacharid, shahabi, balli,

This research has been funded in part by NSF grants EEC (IMSC ERC), IIS (ITR), and IIS (CAREER), and unrestricted cash gifts from the Okawa Foundation and Microsoft.

Abstract

Wavelets have been extensively used for approximate, progressive or even exact evaluation of queries. However, the complete wavelet transform is not always the optimal form in which to store the data. We exploit the properties of the full tree of the wavelet decomposition in order to find a representation for the dataset that minimizes the retrieval cost, assuming a given query workload. We develop techniques to answer queries of range size r with a retrieval cost of O(log^n r), while maintaining an update cost of O(log^n S), for an n-dimensional cube of size S^n. Furthermore, we investigate the space and time trade-off associated with storing extra coefficients to reduce the cost even further.

1  Introduction

Wavelets have been widely accepted in the database community as a method for dealing with large multidimensional datasets. Their origin lies in signal processing applications, where they are mainly used as a tool for signal compression, due to their inherent multiresolution properties. By applying these ideas to multidimensional databases, we can achieve approximate, progressive or even fast exact answers to queries.

Throughout this paper we try to minimize the retrieval cost, which is the number of data values (coefficients) that must be retrieved from the database to answer a query. To achieve this we must choose the best representation for the dataset, given some prior knowledge of the type of queries typically submitted. We would like to develop methods to fine-tune the database and, towards this, we explore and exploit the properties of the full tree of the wavelet decomposition (the wavelet packet transform).

Our first contribution comes from a simple observation: for point queries, i.e., range queries where the size of the range is one, it is more efficient to query the raw data rather than use any transformation. On the other hand, for a very large range query, the full wavelet transformation may be the best representation for the data. In this paper we show that for general database queries, the complete wavelet transform is not always the optimal form in which to store the data. This is because the objective, unlike in traditional signal compression applications, is no longer to reconstruct the entire signal (or database); rather, the goal is to reconstruct an arbitrary subset of the data defined by a range query. For this reason, we consider all possible non-redundant representations for the dataset that exist in the full tree of the wavelet decomposition, only two of which are the untransformed dataset and its complete wavelet transformation.

In this paper we also proceed one step further. We suggest techniques for cutting down retrieval cost in applications where a moderate increase in storage space is acceptable. We investigate the space and time trade-off associated with increasing the amount of stored data in order to significantly decrease the retrieval cost. Given a limit on the available storage space and some collected statistics on the type of typical queries, we select the best representation for a dataset.
The idea, which again originates in signal processing, is to have an over-complete representation (multiple transformations) of the dataset. This means that online, at the time of query submission, there are alternative ways to answer the query, and consequently the one that minimizes the number of retrievals can be selected. Traditionally, wavelets have been used as an approximation tool; in this paper, however, we focus on the inherent pre-aggregation properties of the wavelet decomposition. Let us note that even though the discussion proving the effectiveness and usefulness of our approaches is somewhat detailed and complicated, the actual proposed algorithms are elegant and straightforward to implement, and they can lead to significant improvements in the retrieval cost with a completely negligible online overhead per query.

To be able to optimize the database and find the best form in which to store the data, we must have some knowledge about the characteristics of the queries typically submitted to the system. There is a significant number of publications on collecting and extracting query workload statistics, and many DBMSs support such techniques and utilize them to measure system performance, or even to automatically perform administrative tasks such as dropping/adding indexes [2]. In this paper, we use the term workload to refer to a set of queries submitted to the database in the past. At first, we assume that the available workload can perfectly predict the characteristics of the future queries, and we optimize the database accordingly. Later, we investigate what happens when the workload used to calculate the best form is somewhat inaccurate, that is, it fails to capture the future query behavior.

We begin our discussion with an overview of wavelets and how they are used to answer queries in Section 3. In Section 4 we provide an algorithm to select the best form for the data, given that there is no additional storage space available. Later, in Section 5 we propose two techniques for taking advantage of any available extra storage space. The first technique, Best Level-Set, is a direct extension of the algorithm introduced in Section 4, whereas the second one, Best Branch-Set, is more sophisticated and leads to improvements on the order of 25% for additional storage of 20%. Section 6 provides comparative results for all methods, using both synthetic and real-world query workloads and datasets. Finally, in Section 7 we conclude our discussion.

2  Related Work

Online Analytical Processing (OLAP) applications have exhaustively dealt with the problem of providing fast and exact answers to range queries over large and multidimensional datasets that can be seen as hypercubes. The main intuition behind all techniques is to store the dataset in some pre-aggregated form, so as to speed up range aggregate queries. Improving query response time comes, of course, at the expense of increasing the update cost. Therefore, it becomes an issue of finding the appropriate balance between retrieval and update cost, with respect to the underlying application. Wavelets are used as a mathematical tool to create a multiresolution and pre-aggregated view of a multidimensional dataset [1, 5, 6, 4, 7]. They have been extensively studied in signal processing applications and recently appeared in the database community to replace Prefix-Sum-like techniques. Consider, as an example of such a Prefix-Sum-like technique, the Dynamic Data Cube [?], where aggregates are pre-computed over blocks of increasing size by powers of 2, to provide a multi-resolution view of the dataset, exactly like wavelets. In our previous work [5] we have shown that Haar-based wavelets can achieve a perfect balance between retrieval cost and update cost. In this paper, we attempt to decrease retrieval cost even further with no increase in update cost, by taking into consideration the type of queries typically submitted. To the best of our knowledge, there has been no previous work on exploring the full wavelet packet transform to select the best wavelet decomposition given a query workload.

3  Background on Wavelets

3.1  Levels of Decomposition

In this section we provide the necessary definitions and the terminology that we use throughout this paper. We first assume single-dimensional datasets. Later, in Appendix B, we relax this assumption and discuss the extension to the multidimensional case. Let us assume that the dataset d is a single-dimensional vector of length N, which is a power of 2. Then, this dataset can be seen as a point in the N-dimensional space ℝ^N. We exploit the multiresolution property of the Discrete Wavelet Transform to decompose this vector space into a set of orthogonal subspaces.
The original space V_0 = ℝ^N is decomposed into V_1, which provides a coarse or averaged view of V_0, and into W_1, which is the orthogonal complement of V_1 in V_0. Both of these spaces have dimension N/2 and thus contain half as many basis vectors. The basis vectors in V_1 and W_1 are orthogonal to each other and form a basis for the original space: V_0 = V_1 ⊕ W_1. The projection of the data vector d on the basis vectors of V_1 and W_1 results in N coefficients, which are the first level of decomposition. The space V_1 can be further decomposed into V_2 and W_2 to provide an even coarser view of V_0 together with the corresponding details. The original space is then decomposed into W_1, W_2 and V_2, which is the second level of decomposition: V_0 = V_2 ⊕ W_1 ⊕ W_2. In general, at the k-th level of decomposition we have V_0 = V_k ⊕ W_1 ⊕ W_2 ⊕ ... ⊕ W_k, which is an average view at resolution level k plus the details of the previous levels. This chain of multiresolution analysis of the original space V_0 continues up to level log N and can be seen in Figure 1. At this point let us note that throughout this paper we use Haar wavelets. Extension to other filters can be achieved similarly to our work in [5].

Figure 1. Multiresolution Property of Wavelets

In the remainder of the paper we will use the notation WD_k(d) to refer to the k-th level of decomposition of vector d, or, in a context where a vector is implied, we will simply refer to WD_k as the k-th level of decomposition. We will use the term branch to refer to all the basis vectors of a subspace, either V_k or W_k. This terminology is used in the context of signal processing applications, where filter banks composed of a low-pass and a high-pass filter are chained to perform the wavelet decomposition. Furthermore, we will use the terms low-pass branch, average coefficients, or simply averages to refer to the coefficients that result from the projection of the data vector onto the basis vectors of a V subspace. Similarly, we will use the terms high-pass branch, detail coefficients, or simply details to refer to the coefficients that result from the projection of the data vector onto the basis vectors of a W subspace.

In total, there are 3N − 2 distinct coefficients across all levels of decomposition, consisting of N − 1 details and 2N − 1 averages. We will use the term full tree (FT) to refer to all of these coefficients; the name comes from the tree structure of the wavelet decomposition. Thus, FT = WD_0 ∪ WD_1 ∪ ... ∪ WD_{log N} and |FT| = 3N − 2. An example of a full tree of decomposition for a vector of size 8 is also shown in Figure 1. The averages, including the original vector, are shaded, whereas the details are shown in white. The last level of decomposition, which is the complete wavelet transform, consists of the final average and all the details.

3.2  Answering Queries using Wavelets

Let us assume that our dataset is stored in the vector d of size N. In this paper we restrict ourselves to the case of answering sum queries defined on a range R. Our techniques can be generalized to arbitrary polynomial range-sum queries similarly to our previous work in [5]. A range-sum query Q(R, d) of range R on the data vector d is the summation of the values of the data vector that are contained in the range. The answer to such a query is given by the summation of the values of the data vector d for each cell ξ contained in the range: a = Σ_{ξ ∈ R} d(ξ). We can rewrite this summation as the dot product of two vectors, the data vector d and the query vector q. The query vector is a vector of the same size as the data vector that has the value one in all cells within the range R and zero in all cells outside that range.

Definition 1: A range-sum query Q(R, d) of range R on the data vector d corresponds to a query vector q, and the answer to this query is given by the inner product of d and q:
a = Σ_{ξ ∈ R} d(ξ) = ⟨q, d⟩ = Σ_i q[i] · d[i]

Let us provide a very useful lemma that applies to any transformation that preserves the Euclidean norm, including the Discrete Wavelet Transform and, more importantly, the wavelet decomposition at any level. This lemma is essentially the generalized Parseval equality applied to our case.

Lemma 1: If d̂ is the wavelet decomposition at level k of the data vector d, i.e. d̂ = WD_k(d), and q̂ is similarly the wavelet decomposition at level k of the query vector q, i.e. q̂ = WD_k(q), then
⟨q, d⟩ = ⟨q̂, d̂⟩, that is, Σ_i q[i] · d[i] = Σ_j q̂[j] · d̂[j]

This means that a range-sum query can be answered using either the untransformed vectors or some orthogonal transformation of them. The reason for choosing to transform the data into the wavelet domain will become apparent with Theorem 1. The intuition is that we would like to find a representation for the query that has the least number of non-zero values, so that the number of required data values from the database is minimized; this is formalized in Section 4. Therefore, it is logical to ask what happens at each level of decomposition. Recall that the wavelet transform is a recursive procedure, where at each step, termed an iteration, an input produces two outputs of half length, low-pass (averages) and high-pass (details), and the procedure continues using the low-pass output as the new input.

Lemma 2: At any iteration of the wavelet transform, an input vector, all of whose non-zeros have the same value and form a contiguous range (the lemma also applies to the case where the first and last values of the range are not the same as the rest of the range), has a low-pass output of the same form but with half as many non-zeros, and a high-pass output of at most 2 non-zeros.
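To make the decomposition and Lemmas 1 and 2 concrete, the following is a minimal Python sketch, assuming orthonormal Haar filters; the function names (haar_step, wavelet_decomposition) are ours and not part of any particular library. It builds WD_k and checks that the inner product of a query vector and a data vector is preserved at every level.

```python
import numpy as np

def haar_step(v):
    """One Haar iteration: split v (even length) into low-pass averages
    and high-pass details, using orthonormal filters."""
    v = np.asarray(v, dtype=float)
    low = (v[0::2] + v[1::2]) / np.sqrt(2.0)
    high = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    return low, high

def wavelet_decomposition(v, k):
    """WD_k(v): the averages at level k followed by the details of levels 1..k."""
    low, details = np.asarray(v, dtype=float), []
    for _ in range(k):
        low, high = haar_step(low)
        details.append(high)
    return np.concatenate([low] + details)

# Lemma 1 (generalized Parseval): <q, d> is the same at every level.
N = 16
d = np.random.rand(N)                       # data vector
q = np.zeros(N); q[3:11] = 1.0              # range-sum query over cells 3..10
for k in range(int(np.log2(N)) + 1):
    assert np.isclose(np.dot(wavelet_decomposition(q, k),
                             wavelet_decomposition(d, k)),
                      np.dot(q, d))
```

Because each iteration is an orthonormal change of basis, the check holds regardless of the level in which the data is actually stored.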
The proof of Lemma 2, as well as the other proofs, can be found in Appendix A.

Figure 2. Wavelets of a Range-Sum Query

This means that, because of the specific form of the query vector, an iteration of the wavelet transform halves the number of non-zeros required to be processed at the next iteration. Also, because the query consists of equal and contiguous values, the only detail coefficients occur at the edges of the range, resulting in a maximum of 2 coefficients, one per edge; see Figure 2. Applying Lemma 2 recursively, one can easily derive a bound for the number of non-zero coefficients at each level of decomposition.

Theorem 1: For a range-sum query vector of size N defined over a range of length r, the number of non-zero coefficients in its wavelet decomposition at level k is not more than r/2^k + 2k + 1.

All previous work in this area considered only the last level of decomposition, and the corresponding result was that a query vector of size N, defined over a range of size r, completely decomposed into the wavelet domain, has less than 2 log N + 2 non-zero coefficients. One of our strongest contributions is that we have shown that the number of non-zero coefficients depends on the size of the range r and not on the domain size N, if we consider intermediate levels of decomposition. The next theorem summarizes the results seen so far.
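As an illustration of Lemma 2 and Theorem 1, the sketch below (reusing haar_step and wavelet_decomposition from the previous sketch) counts the non-zero coefficients of a range query at every level and prints them next to a Theorem-1-style bound; the particular range is an arbitrary example of ours.

```python
import math
import numpy as np

def nonzeros_per_level(N, start, length):
    """Count non-zero coefficients of the query vector for the range
    [start, start+length-1] at each level, and compare with the upper
    bound ceil(r / 2^k) + 2k + 1 (a Theorem-1-style bound)."""
    q = np.zeros(N)
    q[start:start + length] = 1.0
    for k in range(int(math.log2(N)) + 1):
        qk = wavelet_decomposition(q, k)
        actual = int(np.count_nonzero(np.round(qk, 12)))
        bound = math.ceil(length / 2 ** k) + 2 * k + 1
        print(f"level {k}: {actual} non-zero coefficients (bound {bound})")

nonzeros_per_level(N=64, start=13, length=21)
```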

Theorem 2: The answer to a range-sum query Q(R, d) of size |R| = r, when the query vector q and the data vector d are both transformed to the same level of decomposition k, is given by a = ⟨q̂, d̂⟩ and can be computed with O(r/2^k + 2k) retrievals from the database.

Up to now, we were restricting ourselves to the one-dimensional case, where the data is stored in a vector. In the most general case the dataset is of a multidimensional nature and can be viewed as a multidimensional hypercube [3]. We use the tensor product multivariate wavelet transform to decompose the cube into the wavelet domain; that is, we apply the wavelet transform for each dimension of the data cube. As a result the following theorem can be shown.

Theorem 3: The cost of answering an n-dimensional sum query defined over a hyper-rectangle range is the product of the costs of answering each of the n single-dimensional range-sum queries defining the hyper-rectangle range.

Furthermore, the algorithms presented in this paper are easily extended to n dimensions by applying them independently for each dimension; therefore, their time and space complexity scales linearly with respect to the dimensionality. An extended discussion can be found in Appendix B. For the remainder of this paper we assume single-dimensional vectors, for the sake of simplicity.

4  Optimal Wavelet Decomposition without Extra Storage Space

4.1  Finding the Optimal Level of Decomposition

In this section we find the optimal level of decomposition in which to store the data vector. We assume a storage space of identical size to the data vector. The optimal level is the one that results in the least number of non-zero query coefficients, and thus the least number of retrievals from the database, for a given query workload. Given a query q we define the cost of answering this query as the number of non-zero coefficients in the transformed query vector. Since we can have log N + 1 possible transformations of the query vector, we will use cost(q | WD_i) to refer to the cost of query q transformed at the i-th level of decomposition.

Definition 2: The cost of answering a query q in the i-th level of decomposition WD_i is defined as the number of non-zero coefficients contained in the transformed query vector WD_i(q) and will be denoted as cost(q | WD_i).

We have already seen, in Theorem 1, that there is a strong correlation between the range of the query and the cost of the query at a particular level of decomposition. It is therefore expected that we can always find the level of decomposition which minimizes the cost for a given query, just by examining each level. However, the next theorem states that this is not necessary, since the optimal level can be chosen among only three candidate levels. The proof is easily derived from Lemma 2. One can see that the interesting (candidate) levels are those where the gain of halving the non-zero averages is less than 2, which is the cost of going into the next level (because of the non-zero details).

Theorem 4: The level that minimizes the cost of a query q = Q(R, d) of size r = |R| is either the lowest level where there are at least 4 averages (log r − 1), the lowest level where there are at least 2 averages (log r), or the lowest level where there is exactly 1 average. The exact criteria are provided in the proof.

Now, we have a way to identify the optimal level given a single query.
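Since the exact cost can be computed level by level, the optimal level for a single query can also be found by brute force, which is what the following sketch does (again reusing wavelet_decomposition from the earlier sketch); Theorem 4 says the minimizer is always one of a few candidate levels around log r. The function name and the example range are ours.

```python
import math
import numpy as np

def best_level_for_query(N, start, length):
    """Exact cost(q | WD_k) for every level k, and the level minimizing it."""
    q = np.zeros(N)
    q[start:start + length] = 1.0
    costs = [int(np.count_nonzero(np.round(wavelet_decomposition(q, k), 12)))
             for k in range(int(math.log2(N)) + 1)]
    return min(range(len(costs)), key=costs.__getitem__), costs

level, costs = best_level_for_query(N=128, start=37, length=21)
print(level, costs)
```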
The workload, however, comprises more than one query, so we have to aggregate across all queries in the workload for every level of decomposition. In this paper we treat all queries as equal, so the aggregation degenerates to a summation. However, one can choose a different aggregation function, e.g. a weighted average, to better reflect the importance of each individual query.

Definition 3: For a query workload Work we define the total cost for the i-th level of decomposition as the summation of the costs for each individual query:
C_i = Σ_{q_j ∈ Work} cost(q_j | WD_i)

The algorithm Optimal Level shown below greedily selects the level of decomposition that has the least cost for a given workload. By keeping the data vector in that optimal level we guarantee that queries included in the workload are answered in the most efficient way; in other words, our database is better tuned for that workload. In the experimental section we investigate what happens when answering queries not belonging to the workload.

Algorithm 1: Optimal Level
  Input: Work            // workload
  Output: WD_b           // best level
  Var: C                 // total-cost array
  foreach q ∈ Work do
1   for i ← 0 to log N do
2     C[i] ← C[i] + cost(q | WD_i)
  C[b] ← min C           // find the minimum cost
  return WD_b            // b is the best level

The lines numbered 1 and 2 in Algorithm 1 can be executed in time O(log N) by using the methodology described in Theorem 1. This means that the algorithm for selecting the best level has time complexity O(m log N) and space complexity O(log N) for a workload of m queries.
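A direct Python transcription of Algorithm 1 is given below as a sketch; cost_fn stands for any implementation of cost(q | WD_i), for example the exact non-zero count used in the previous sketches, and queries are assumed to be given as (start, length) pairs. These names are ours, not the paper's.

```python
import math

def optimal_level(workload, N, cost_fn):
    """Algorithm 1 (Optimal Level): accumulate, for each level i, the total
    cost of the workload, then return the level with the minimum total."""
    levels = int(math.log2(N)) + 1
    total = [0] * levels
    for q in workload:
        for i in range(levels):            # lines 1-2 of Algorithm 1
            total[i] += cost_fn(q, i)
    return min(range(levels), key=total.__getitem__)
```

For a workload of m queries this performs O(m log N) cost evaluations, matching the complexity stated above.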

5  Optimal Wavelet Decomposition with Extra Storage Space

The objective of this section is to take advantage of the case where there is enough storage space to hold not only the decomposed data but also some additional coefficients, in order to improve the system performance. In Section 5.1 we restrict ourselves to the case where the extra coefficients are all part of another level of decomposition. Later, in Section 5.2, we relax this restriction.

5.1  Optimal Level-Set

In this section we find the levels of decomposition that minimize the cost of answering queries from the workload, given a restriction on the available storage size. We use the term Level-Set and the symbol SWD to refer to a set of levels of decomposition: SWD = ∪_i WD_i. Before proceeding any further, let us provide a definition of the cost of answering a query given a Level-Set.

Definition 4: The cost of answering a query q using a Level-Set SWD, cost(q | SWD), is the minimum cost of answering the query using just one level of decomposition WD_i ∈ SWD:
cost(q | SWD) = min_{WD_i ∈ SWD} cost(q | WD_i)

This means that we assume that, for a given query, a Level-Set has the same cost as the best cost among all the levels it includes. The total cost of answering queries from a workload for any Level-Set can be calculated using an aggregation function; here we simply use the summation.

5.1.1  Search Space for Finding the Optimal Level-Set

We already know that there are log N + 1 levels of decomposition, so there can be at most 2^{log N + 1} − 1 = 2N − 1 Level-Sets. However, since there is a restriction on the available space, we expect the search space to be significantly smaller. To prove that, we must first calculate the size of a Level-Set. One straightforward observation is that for any two levels of decomposition there is a number of common coefficients. These are all the detail coefficients of the lowest level of decomposition. The only exception is the 0-th level of decomposition (i.e. the raw data), which has no common coefficients with any other level.

Lemma 3: Let us assume two levels of decomposition WD_k and WD_m, where WD_k is the lowest level, 0 < k < m. The number of common coefficients in these two levels is
|WD_k ∩ WD_m| = Σ_{i=1}^{k} N/2^i = (1 − 1/2^k) · N,
whereas the total number of coefficients included in the Level-Set of these two levels is
|WD_k ∪ WD_m| = (1 + 1/2^k) · N.

In Figure 3 the shared coefficients are shown in dark grey, whereas the non-common coefficients are shown in light grey, for two levels k and m.

Figure 3. Common coefficients among two levels

The extension of Lemma 3 to a Level-Set consisting of more than 2 levels leads to the following theorem.

Theorem 5: The size of a Level-Set SWD = {WD_k1, WD_k2, ..., WD_km} consisting of m > 1 levels of decomposition WD_k1, WD_k2, ..., WD_km in increasing order (1 ≤ k_1 < k_2 < ... < k_m) is
|SWD| = (1 + Σ_{i=1}^{m−1} 1/2^{k_i}) · N.
When the untransformed data (WD_0) is included, the size always increases by N.

We can easily verify the validity of this theorem if we consider a Level-Set that includes all levels of decomposition except the 0-th one. Theorem 5 suggests that this Level-Set has size |WD_1 ∪ ... ∪ WD_{log N}| = 2N − 2, as expected.

Corollary 1: The size of a Level-Set SWD, where WD_k1, k_1 > 0, is the lowest level of decomposition included in the Level-Set, is bounded by |SWD| < (1 + 1/2^{k_1 − 1}) · N.

The previous corollary puts a bound on the size of a Level-Set, which can be used to prune the search space of Level-Sets given a restriction on the available space.
Let us use S to indicate the available space; then N < S ≤ 2N − 2, that is, we do not consider the 0-th level of decomposition. Equivalently, the additional storage space is a fraction α of N, where α ∈ (0, 1 − 2/N]. Therefore S = (1 + α)N.

Theorem 6: Given a storage space of S = (1 + α)N, where α ∈ (0, 1 − 2/N], a Level-Set that has size less than S can only contain levels after β, where β is the smallest integer greater than or equal to 1 + log(1/α). Levels before β can only comprise singleton Level-Sets. There exist at most N/2^β such Level-Sets.
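The sketch below turns Lemma 3, Theorem 5 and Theorem 6 into code: it computes the size of a Level-Set, the threshold β for a given α, and enumerates the candidate Level-Sets that respect the storage budget. It is a minimal sketch and the helper names are ours; α = 1/4 matches the worked example that follows.

```python
import math
from itertools import combinations

def level_set_size(levels, N):
    """Theorem 5: the deepest level contributes N coefficients; every other
    level k adds only its low-pass branch (N / 2^k averages, or N for level 0)."""
    levels = sorted(set(levels))
    size = N
    for k in levels[:-1]:
        size += N if k == 0 else N // 2 ** k
    return size

def beta(alpha):
    """Theorem 6: levels shallower than beta can only appear as singletons."""
    return math.ceil(1 + math.log2(1.0 / alpha))

def candidate_level_sets(N, alpha):
    S = (1 + alpha) * N
    b, top = beta(alpha), int(math.log2(N))
    singles = [(k,) for k in range(1, min(b, top + 1))]   # shallow levels, alone
    deep = range(min(b, top + 1), top + 1)                # levels eligible to combine
    combos = [c for r in range(1, len(deep) + 1) for c in combinations(deep, r)]
    return [ls for ls in singles + combos if level_set_size(ls, N) <= S]

print(len(candidate_level_sets(N=256, alpha=0.25)))
```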

For example, assume that the extra storage available is N/4, that is, α = 1/4. Theorem 6 suggests that we can only have levels after β. In our case β = 3, which means that levels 1 and 2 cannot be included in any Level-Set that obeys the space restriction; of course, they can be considered as two single-item Level-Sets.

5.1.2  Optimal Level-Set Algorithm

To select the optimal Level-Set among those that have size less than S = (1 + α)N < 2N − 2 for a workload of m queries, the following steps are required.
1. For each level, find the cost cost(q | WD_i) of answering each query of the workload.
2. Calculate β = ⌈1 + log(1/α)⌉.
3. Create the candidate Level-Sets.
4. For each SWD in the candidates, calculate cost(q | SWD) from cost(q | WD_i) for all WD_i ∈ SWD.
5. Select the Level-Set with the minimum total cost.

Algorithm 2, Optimal Level-Set, greedily selects the Level-Set among those that have size less than S = (1 + α)N < 2N − 2 which minimizes the total cost of answering queries belonging to a workload of m queries. The first step of calculating cost(q | WD_i) for 0 ≤ i ≤ log N takes time O(m log N). There are at most 2N/2^β candidate Level-Sets and each one has at most log N − β + 1 levels. For each Level-Set and each query the best level must be chosen, which leads to a time cost of m(log N − β + 1) · 2N/2^β = O(mN log N). The space required is for storing the cost of answering m queries for log N − β + 1 levels, that is, m(log N − β + 1) = O(m log N).

Algorithm 2: Optimal Level-Set
  Input: Work            // query workload
  Input: S = (1 + α)N    // storage space
  Output: SWD_b          // optimal Level-Set
  Var: β                 // β value
  Var: C                 // total-cost array
  Var: candidates        // array of candidate Level-Sets
  foreach q ∈ Work do
    for i ← 0 to log N do
      calculate cost(q | WD_i)
  β ← ⌈1 + log(1/α)⌉
  candidates ← createCandidates(β)
  foreach SWD ∈ candidates do
    foreach q ∈ Work do
      calculate cost(q | SWD)
      C[SWD] ← C[SWD] + cost(q | SWD)
  return SWD_b ← argmin_{SWD} C

5.2  Optimal Branch-Set

In this section we still assume that the available storage space exceeds the size of the data vector, but we no longer restrict ourselves to complete levels of decomposition. Instead, we increase the granularity from levels to branches of the wavelet decomposition. We find the best set of branches to store, and we show how this can increase the efficiency of our query answering system. We use the term Branch-Set and the symbol SB to refer to a set of branches. A Branch-Set can contain branches that form complete levels of decomposition, as well as branches that do not form complete levels because some required high-pass branches are missing. Recall that a level of decomposition WD_k is composed of the low-pass branch lb_k and all the previous high-pass branches hb_l, 0 < l ≤ k:
WD_k = lb_k ∪ hb_1 ∪ ... ∪ hb_k
The stray branches that do not form levels are low-pass branches and form what we call a pseudo-level of decomposition.

Definition 5: The k-th pseudo-level of decomposition is defined by a set of branches that contains the k-th low-pass branch and any of the previous high-pass branches hb_l, 0 < l ≤ k.

Consequently, a level of decomposition is also a pseudo-level, while the opposite is not generally true. Such a definition allows us to define a Branch-Set as a set of pseudo-levels, just as a Level-Set is a set of complete levels. In addition, a Branch-Set must contain at least one full level of decomposition, so that there exists an orthogonal representation of the original vector (that is, the set of branches must form an overcomplete basis, a frame, for V_0; see Figure 1).
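The following sketch fixes a concrete representation for branches and pseudo-levels, with branches written as ('lb', k) or ('hb', k) pairs; it checks Definition 5 and the requirement that a Branch-Set contains at least one complete level. The representation and the helper names are assumptions of ours, not notation from the paper.

```python
def level_branches(k):
    """The branches making up the complete level WD_k: its low-pass branch
    plus all high-pass branches of levels 1..k (WD_0 is just the raw data)."""
    return {("lb", k)} | {("hb", l) for l in range(1, k + 1)}

def is_pseudo_level(branches, k):
    """Definition 5: the k-th low-pass branch plus any subset of hb_1..hb_k."""
    return ("lb", k) in branches and all(
        kind == "hb" and 1 <= l <= k for kind, l in branches - {("lb", k)})

def contains_complete_level(branch_set, max_level):
    """A valid Branch-Set must contain at least one complete level WD_m,
    so that an orthogonal representation of the original vector exists."""
    return any(level_branches(m) <= set(branch_set) for m in range(max_level + 1))

bs = level_branches(3) | {("lb", 5), ("hb", 4)}      # WD_3 plus a pseudo-level at level 5
print(is_pseudo_level({("lb", 5), ("hb", 4)}, 5))     # True
print(contains_complete_level(bs, max_level=8))       # True
```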
Let WD_m be the lowest such level of decomposition; then a Branch-Set contains a complete level as well as a number of pseudo-levels: SB = WD_m ∪ (∪_i wd_i).

5.2.1  Search Space for Finding the Optimal Branch-Set

Since we know that there are 2 log N + 1 branches for a vector of size N, we expect that in the worst case there are potentially 2^{2 log N + 1} − 1 = 2N² − 1 Branch-Sets. However, because of the storage space restriction, the number of candidate Branch-Sets is much smaller. To prove this, we must calculate the size of an arbitrary Branch-Set.

Theorem 7: Let SB be a Branch-Set that contains the m-th level of decomposition, where m is the lowest present level. Then the size of SB is bounded by |SB| < (1 + 1/2^{m−1}) · N.

Corollary 1 of Section 5.1 can also be derived from this general theorem. Assuming an available storage space of S = (1 + α)N, we can also prove the following theorem, which exactly determines the search space for Branch-Sets given a storage space restriction.

Theorem 8: Let S = (1 + α)N be the available storage, where α ∈ (0, 1 − 2/N]. For a Branch-Set containing the m-th level of decomposition as the lowest level present, the branches that do not belong to WD_m can only belong to levels greater than β, where β is the smallest integer greater than or equal to 1 + log(1/α). There exist less than (N² log N) / 2^{2β} possible Branch-Sets.
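Using the same ('lb' | 'hb', k) representation and the level_branches helper from the previous sketch, the size of a Branch-Set can be computed as below; it reflects the counting behind Theorem 7 (a branch at level k holds N / 2^k coefficients, and the branches of the complete level WD_m add up to N). The helper names are ours.

```python
def branch_size(kind, k, N):
    """Branch lb_k or hb_k has N / 2^k coefficients (lb_0 is the raw data)."""
    return N if k == 0 else N // 2 ** k

def branch_set_size(branch_set, N):
    """|SB|: add up the distinct branches it contains."""
    return sum(branch_size(kind, k, N) for kind, k in set(branch_set))

bs = level_branches(3) | {("lb", 5), ("hb", 4)}   # from the previous sketch
print(branch_set_size(bs, N=256))                 # 280, below the Theorem 7 bound of 320
```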

5.2.2  Answering a Query given a Branch-Set

In Section 5.1 we used one of the available levels in the Level-Set to answer a query. We could do the same thing here using pseudo-levels instead of levels, but then there would be no improvement in query cost over Section 5.1. Instead, we take advantage of the higher granularity that pseudo-levels provide. First, we split the original query into subqueries, and then for each subquery we decide which pseudo-level to use. The gain in retrieval cost is twofold: (a) intra-query gain, because each subquery can be answered using the most suitable pseudo-level, and (b) inter-query gain, in the case where two or more subqueries require common (shared) coefficients.

The following lemma, which is a result of the linearity of the inner product and of Lemma 1, formalizes query splitting and shows how each subquery can be answered in a different level of decomposition. A query vector q transformed into the k-th level of decomposition is denoted as q^k.

Lemma 4: The answer to a query vector q that can be written as the summation of n signed subqueries, q = Σ_{i=1}^{n} s_i · q_i where s_i ∈ {−1, 1}, can be calculated as
⟨q, d⟩ = Σ_{i=1}^{n} s_i · ⟨q_i, d⟩ = Σ_{i=1}^{n} s_i · ⟨q_i^{k_i}, d^{k_i}⟩

Among all possible ways to split a query, there is one that minimizes the retrieval cost for the query, given a Branch-Set. This is the optimal query splitting, and we use it to define the cost of a query for a given Branch-Set.

Definition 6: Given a set of available branches SB, the retrieval cost for answering a query q is the minimum retrieval cost among all possible query splittings.

Finding the retrieval cost for a query given a Branch-Set is a daunting task. Therefore, we must seek alternatives to blindly searching all possible query splittings. Let us assume, for a while, that there is no storage space restriction. Then, splitting the query in an optimal way is quite straightforward; we only need to store every low-pass branch, resulting in a storage space of S = 2N − 1. A query is split into subqueries so that each can be answered by a single average coefficient in a low-pass branch. Using the notion of buckets, an algorithm that splits the query would always select the bucket that best matches the query range. An example query of range r is shown in Figure 4. The best-fitting buckets (subqueries) are drawn with thick lines, black when positive and light grey when negative. We claim that, given a query range, selecting the bucket that leaves the least space to fill is optimal.

Theorem 9: Given a range-sum query, when the full tree of decomposition is available, an algorithm that greedily selects at each step the coefficient that corresponds to the bucket that leaves the least space to fill is optimal.

Figure 4. Optimal Query Split given a Full Tree
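A simplified version of the greedy split of Theorem 9 is sketched below: it covers the query range with as few aligned dyadic buckets as possible, largest first, each bucket corresponding to one average coefficient of some low-pass branch. The paper's greedy also allows signed buckets that overshoot the range (the grey buckets of Figure 4); this sketch omits that refinement.

```python
def dyadic_cover(s, f):
    """Cover the inclusive range [s, f] with aligned dyadic buckets,
    always taking the largest bucket that starts at s and fits in the range."""
    buckets = []
    while s <= f:
        size = 1
        while s % (2 * size) == 0 and s + 2 * size - 1 <= f:
            size *= 2
        buckets.append((s, size))          # bucket covering [s, s + size - 1]
        s += size
    return buckets

print(dyadic_cover(13, 33))    # [(13, 1), (14, 2), (16, 16), (32, 2)]
```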
Keeping in mind that Theorem 9 suggests that a query is split in such a way that each subquery is essentially an average coefficient in a low-pass branch, let us look at an example, shown in Figure 5. Three subqueries, marked with a cross, cannot be answered by a single coefficient at their own low-pass branch. The available branches are shown with a solid line, whereas the missing branches are shown with a dotted line. Reconstruction is necessary either from the low-pass branch lb_a, or from the high-pass branches and the low-pass branch lb_b. The main concern is to find a reconstruction scheme that minimizes the total cost. Looking at Figure 5, one would assume that for each of the three subqueries there are two choices, either going up or going down in the tree. This is not the case, however. For example, if sq_2 is reconstructed by going down rather than up, then it is easy to see that going down should be the choice for sq_3 as well. Besides the obvious reason that the distance to lb_a is longer for sq_3 (the longer the reconstruction path, the more coefficients it involves), there can be some overlapping of coefficients which further reduces the cost of going down for subqueries sq_2 and sq_3.

Figure 5. Reconstruction of Subqueries

This observation generalizes as follows: for any two subqueries sq_1, sq_2 that fall between two subsequent low-pass branches, there can be no crossing between the chosen reconstruction paths. We use this result to calculate the retrieval cost of a query, given a Branch-Set.
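The cost of the two reconstruction paths can be written down directly, as in the sketch below: a single average of a missing level k can be rebuilt either from the finer available low-pass branch at level j < k (a scaled sum of 2^(k-j) averages), or from the coarser available low-pass branch at level i > k together with one detail per intermediate level (1 + (i - k) coefficients). This is our sketch of the per-subquery choice only; it deliberately ignores the sharing of coefficients between neighbouring subqueries that the Query Split algorithm below exploits.

```python
def reconstruction_cost(k, lower, upper):
    """Coefficients needed to rebuild one average coefficient of the missing
    low-pass branch lb_k, assuming lb_lower (lower < k) is available, and
    lb_upper (upper > k) is available together with hb_{k+1}..hb_upper."""
    from_finer = 2 ** (k - lower)        # sum of the averages underneath it
    from_coarser = 1 + (upper - k)       # one average plus one detail per level
    return min(from_finer, from_coarser)

# e.g. missing level 4 between available low-pass branches at levels 2 and 7
print(reconstruction_cost(4, lower=2, upper=7))   # min(4, 4) = 4
```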

We propose Algorithm 3, named Query Split, which finds the best place to put a barrier between the two available low-pass branches, such that all subqueries falling below the barrier are reconstructed from the branches below, and all subqueries falling above the barrier are reconstructed from the branches above. This algorithm joins all subqueries that are reconstructed by going down in the tree, to take into account the overlapping coefficients. The time complexity of the Query Split algorithm is O(log² N), as shown in Appendix C.

Algorithm 3: Query Split
  Input: q               // query
  Input: bs              // Branch-Set
  Output: SQ = {sq_i}    // set of subqueries
  SQ ← split(q)          // create the original set
  foreach two consecutive low-pass branches lb_a and lb_b do
    foreach missing low-pass branch lb_k between lb_a and lb_b do
      barrier ← k
      lsubqs_k                        // (left) subqueries between lb_a and lb_k
      rsubqs_k                        // (right) subqueries between lb_k and lb_b
      rsubq_k ← ∪ rsubqs_k            // join the right subqueries to account for overlapping coefficients
      lcost ← cost(lsubqs_k | lb_a)   // cost of reconstructing the left subqueries from lb_a
      rcost ← cost(rsubq_k | lb_b)    // cost of reconstructing the right subquery from lb_b
      totalcost_k ← lcost + rcost     // add the costs
    select k that minimizes totalcost_k
    add the subqueries lsubqs_k and the query rsubq_k to SQ
  return SQ

5.2.3  Optimal Branch-Set Algorithm

To select the optimal Branch-Set among those that have size less than S = (1 + α)N < 2N − 2 for a workload of m queries, the following steps are required.
1. Calculate β = ⌈1 + log(1/α)⌉.
2. Create the candidate Branch-Sets.
3. For each Branch-Set BS in the candidates, calculate cost(q | BS) using the Query Split algorithm.
4. Select the Branch-Set with the minimum total cost.

These steps are summarized in the Optimal Branch-Set algorithm shown below. In the worst case there can be 2N² possible Branch-Sets, so the time complexity of this algorithm is 2N² · m · log² N = O(m (N log N)²), since the dominating cost in the inner loop is the Query Split algorithm.

Algorithm 4: Optimal Branch-Set
  Input: Work            // query workload
  Input: S = (1 + α)N    // storage space
  Output: BS_b           // optimal Branch-Set
  Var: β                 // β value
  Var: C                 // total-cost array
  Var: candidates        // array of candidate Branch-Sets
  β ← ⌈1 + log(1/α)⌉
  candidates ← createCandidates(β)
  foreach BS ∈ candidates do
    foreach q ∈ Work do
      SQ ← QuerySplit(q, BS)                    // split the query given the Branch-Set
      cost(q | BS) ← Σ_{sq ∈ SQ} cost(sq | BS)  // sum the costs of the subqueries
      C[BS] ← C[BS] + cost(q | BS)
  return BS_b ← argmin_{BS} C

6  Experimental Results

We have conducted 7 series of experiments in order to determine the behavior of our proposed techniques. In Section 6.1 we restrict ourselves to the case where no extra storage is available; we create synthetic workloads and vary the range size distribution, to emphasize the superiority of our approach over the full wavelet transformation. In Section 6.2 we investigate the performance of our techniques as the available storage space increases. In Section 6.3 the available storage is fixed but the range size distribution is changed. Later, in Section 6.4, we use larger vectors, only to point out that the main observations still hold. In Section 6.5 we study the effect of noise in the workload. Finally, in the last two sections we measure the performance on a real-life workload.

6.1  Retrieval Cost with no Extra Storage

First, we measure the improvement in retrieval cost obtained by selecting the optimal level of decomposition instead of the complete wavelet transform when there is no additional storage space.
Figure 6 shows the improvement of the best level compared to the wavelet level for workloads of varying range size. The horizontal axis shows the ratio of the average query range size to the database size. As shown in the figure, our method exhibits a constant improvement over the traditional wavelet transform, which can be as high as 30%. The wavelet level is only good for very large ranges that request over 80% of the data in each dimension. Therefore, for a multidimensional dataset we advise the use of our techniques to select the best level of decomposition for each dimension independently. Recall that in the n-dimensional case, the cost of our techniques is only n times the cost for one dimension, while at the same time the improvement for n dimensions is the product of the improvements for a single dimension.

Figure 6. Retrieval Cost (no Extra Space)

6.2  Effect of the Available Storage Space

We investigate how the proposed methods, Level-Sets and Branch-Sets, perform under different storage space restrictions. We used a single-dimensional data vector of size 256 and a synthetic workload of 100 random queries. Based on this workload we selected the best level, the best Level-Set and the best Branch-Set, and counted the retrieval cost (number of coefficients needed) for this workload.

Storage Increase   Gain over Untransformed   Gain over Wavelet   Gain over Best Level
 0%                90.46%                     5.83%               0.00%
20%                92.45%                    25.54%              20.92%
40%                93.30%                    33.87%              29.78%
60%                94.02%                    40.96%              37.30%
80%                94.02%                    40.96%              37.30%

Table 1. Gain of Branch-Set over other methods

The available storage space was varied from 100% to 180% of the size of the data vector, and the retrieval cost for each method is shown in Figure 7a. One straightforward observation is that the full wavelet decomposition (the last level of decomposition), marked as Wavelet in Figure 7a, is, as expected, not the best level. At an available storage space of 100% we observe that the best Level-Set and Branch-Set behave exactly like the best level, since there is no additional space to take advantage of. As we increase the space, the best Branch-Set approach clearly outperforms the best Level-Set, with more than 30% gain in the retrieval cost at 60% extra storage. Figure 7a shows that there is no significant gain for any method after a certain percentage (160%) of available space. The gain in retrieval cost for the best Branch-Set compared to the wavelet level, the untransformed level and the best level is shown in Table 1.

Figure 7. Space and Workload Distribution: (a) effect of available storage space; (b) effect of range distribution

6.3  Effect of the Query Range Size Distribution

In this section we would like to see how the proposed methods cope with different types of query workload. We have created a number of synthetic workloads and have compared the performance of each method. Each workload consists of 100 queries whose range sizes follow a Gaussian distribution with different values for the mean and variance. The experiments were applied to a one-dimensional vector of 128 data values, and the retrieval cost for each method is shown in Figure 7b. For both the Level-Set and Branch-Set methods, the available storage was fixed at 120% of the data vector size. The first workload has the range size uniformly distributed, whereas the rest have a Gaussian range size distribution with mean and standard deviation shown in the parentheses, respectively. The comparison among methods must be restricted to one workload at a time, as comparing between different workloads has no meaning. Once more, one can observe that the wavelet level is almost never the best level to store the data; it is optimal only when very large range queries are submitted, on the order of magnitude of the data vector size (128), as is the case for the (100, 8) distribution. The main observation, however, is that the Branch-Set always has a constant gain in retrieval cost over the best level. The Level-Set shows not much gain and performs like the best level. This happens because the range size is somewhat fixed; any level included besides the best level does not help. Note that this is not the case for the randomly distributed workload.

6.4  Scaling to Larger Data Vectors

In this section we investigate how each method performs when the size of the data vector increases. Figure 8a shows the retrieval cost for a workload of 100 queries, where the range size is fixed to 40% of the data vector size and the extra storage is fixed at 20% more than the data vector size. The data vector size is increased starting from 64. Figure 8a shows that the relative performance of each method remains the same and is indifferent to the size of the data vector, for a workload that has a particular range size distribution.

Figure 8. Scaleup and Noise Experiments: (a) domain size scale-up; (b) noisy workload

6.5  Effect of Noise in the Workload

In this section we examine how the proposed methods deal with the case where the workload is not a good estimation for predicting future queries. Assume that the workload gathered is work and the future queries submitted to the system form a workload work′. Up to now we were implicitly assuming that work ≈ work′, that is, the workload gathered provides a good estimation of future queries. In this section we no longer make this assumption, and we model the inaccuracy of workload work as additive Gaussian noise on the start and end points of the range of each query. The variance of the Gaussian noise is proportional to a percentage of the range of the query. Figure 8b shows the performance drop that occurs for various values of this percentage. This performance drop is defined as the increase in retrieval cost compared to the retrieval cost when the workload is completely known (no noise), which is shown in black in Figure 8b. As expected, the performance drop increases as the percentage of noise increases; however, both the predicted best Level-Set and Branch-Set still outperform the actual best level.

In conclusion, we observe that the proposed techniques perform well even in the presence of moderate noise in the gathered workload. When the noise is high, this means that the workload is insufficient to capture future queries, so it would be better to recalculate the best Level-Set or Branch-Set using a newer workload.

6.6  Performance with a Real Query Workload

In this section we measure the performance for a real dataset, using a history of submitted queries. Our dataset TEMPERATURE is a 4-dimensional real-world dataset which measures the temperature at points all over the globe at different altitudes, sampled twice every day for 5 months. The 4 dimensions are latitude, longitude, altitude and time, and the measure attribute is temperature. The corresponding sizes of these dimensions are 64, 128, 16 and 256, respectively, which leads to a data cube of more than 33 million cells. A history of 100 queries was used in this experiment. Half of them were randomly selected and used as the training workload, in order to select the best level, Level-Set and Branch-Set. The other half of the history was then used to test the performance of each method. The available storage space was fixed to 150% of the data cube by allowing extra space of around 11% per dimension (1.11^4 ≈ 1.5). Figure 9 shows the performance of each selected method (Calculated) compared to the method that would be selected if the second half of the queries were known (Best). The performance is measured by the retrieval cost, which is on the order of billions of coefficients. The results show that the first half of the history was a very good approximation of the other half, thus the performance of each method is almost identical to the performance of the ideal one. Aside from that, Figure 9 shows a clear improvement of more than 30% over the best level, let alone the full wavelet transform.

Figure 9. Multidimensional Dataset

6.7  Online Overhead

The response time for a query is dominated by the time associated with the retrieval of coefficients. Hence, in this paper we have focused on trying to minimize the retrieval cost. Our proposed techniques have an online overhead for each submitted query. However, this overhead is minimal and can easily be neglected. To prove this claim we have measured the overhead time for 1000 queries submitted on a data vector of size 1024, with available storage at 120%. The total overhead for a Level-Set is 61 ms and for a Branch-Set is 642 ms. Although the overhead for a Branch-Set is 10 times as much as that of a Level-Set, it is still negligible.

7  Conclusion

We have seen that the complete wavelet transformation can behave suboptimally under certain query workloads. To address this, we proposed algorithms that select the optimal form in which to store the data, in order to minimize the retrieval cost, by taking advantage of a given workload. In the case where additional storage is available, it can be used to further reduce the retrieval cost by storing over-complete representations of the dataset. Our Branch-Set approach leads to great improvement with minimal online overhead, as shown in the experimental section.

References

[1] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In VLDB 2000, Proceedings of the 26th International Conference on Very Large Data Bases.
[2] S. Chaudhuri. Self-tuning database systems. In Proc. IDEAS.
[3] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proc. of the 12th International Conference on Data Engineering.
[4] D. Lemire. Wavelet-based relative prefix sum methods for range sum queries in data cubes. In Proceedings of CASCON. IBM.
[5] R. Schmidt and C. Shahabi. ProPolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. In Conference on Extending Database Technology (EDBT 02), Lecture Notes in Computer Science. Springer.
[6] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD 1999, Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press.
[7] Y.-L. Wu, D. Agrawal, and A. El Abbadi. Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In CIKM 2000, Proceedings of the 9th International Conference on Information and Knowledge Management. ACM.

APPENDIX

A  Term Explanation

Table 2. Table of Notation
  FT:          full tree of the wavelet decomposition
  lb:          low-pass branch (averages)
  hb:          high-pass branch (details)
  WD_k:        k-th level of decomposition
  wd_k:        k-th pseudo-level of decomposition
  q, d:        untransformed query and data vector
  N:           domain size of the data vector
  S:           available storage space, expressed as S = (1 + α)N
  q^k:         k-th level of decomposition of q
  Q(R, d):     range-sum query of size r = |R|
  cost(q | X): retrieval cost for query q given X
  work:        workload of m queries

Proof of Theorems

Theorem 1: For a range-sum query vector of size N defined over a range of length r, the number of non-zero coefficients in its wavelet decomposition at level k is not more than r/2^k + 2k + 1.

Proof: At each iteration of the wavelet decomposition there are always at most 2 detail coefficients, one for each edge of the range. There are no details for the rest of the range, since it is composed of a series of 1s. The average coefficients, on the other hand, are halved at each iteration, until they become just two. Therefore, at any level k, as long as the halving continues, we have a total of at most r/2^k + 2k coefficients, r/2^k averages and 2k details. When the halving of the averages stops, we may end up with a worst case of two average coefficients for a number of levels. In general, at level k the number of averages can be no more than r/2^k + 1 and the details no more than 2k. Adding these we obtain the bound described in the theorem.

Theorem 4: The level that minimizes the cost for a query q of range size r is either log r − 1, the lowest level where there are at least 4 averages; log r, the lowest level where there are at least 2 averages; or level p, the lowest level where there is exactly 1 average. The exact criteria are given in the proof.

Proof: Let i_s and i_f be the start and finish indices of the range of the query. Then, let n_1 be the highest integer such that i_s mod 2^{n_1} = 0; similarly, n_2 is the highest integer such that (i_f + 1) mod 2^{n_2} = 0. Without loss of generality assume that n_1 ≤ n_2. Let us think of a level of decomposition as containing averaging buckets of equal size, where for each level down the tree of decomposition the size of the buckets is doubled. Figure 10 portrays such a view of a query, where n_1 is the highest level at which the left edge of the query is perfectly aligned with a bucket; the same applies for the right edge with level n_2. Observe that the edges are aligned for levels before n_1 as well. Essentially, this means that for all levels before n_1 there are no detail coefficients, and for levels between n_1 and n_2 there is exactly 1 detail, the one for the left edge. For levels beyond n_2 we would expect to always have 2 details, one for each edge; however, this is not the case, as we will see. Let p be the lowest level that contains one bucket that completely covers the required range; in the worst case p can be the highest level of decomposition (level p is the lowest level such that i_s div 2^p = i_f div 2^p). It should be clear that for all levels beyond p we have exactly one detail and, of course, exactly one average coefficient. However, at level p we may have either 1 or even 0 details. No details is only possible when the two averaging buckets of level p − 1 contain exactly the same number of elements of the range. In other words, this anomaly happens only in the case when the range is symmetric with respect to the bucket of level p.
To summarize, the number of non-zero details at level k is 0 if k < n_1, 1 when n_1 ≤ k < n_2, 2 when n_2 ≤ k < p, either 0 or 1 when k = p, and 1 when k > p.

Figure 10. Averaging Buckets

Returning to the question of selecting the level that minimizes the non-zero coefficients, we can now argue that if we are at level k − 1, going to level k means that, depending on the position of k relative to n_1, n_2 and p, we increase the number of non-zero details by at most 2 coefficients. In addition, the non-zero averages are at least halved, which means that the gain of going from level k − 1 to k is no more than half the average coefficients at level k − 1. By combining the observations for the detail and the average coefficients, we have that going to the next level is desired if the gain introduced by halving the averages is more than the loss of adding 1 or 2 coefficients. Actually, the only case we have to examine is when the halving results in a gain of only 1 coefficient and at the same time we are at a level between n_2 and p. This can happen at the lowest level when the averages are exactly 4 and misaligned with the bucket edges, or when there are fewer than 4 averages. Let x = log r − 1; then at that level we have at least 4 average coefficients. If the number of averages is more than 4, r/2^x > 4, then it makes sense to stop at the next level x + 1 = log r, where there is a gain of at least 2 coefficients for any relative position of x. If the number of averages is exactly 4, r/2^x = 4, and additionally the number of details at the next level is 2, that is, n_2 ≤ x + 1 < p, it is better to stop at level
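The case analysis in this proof can be checked numerically; the sketch below (reusing haar_step from the first sketch) prints, for a hypothetical range of our choosing, the alignment levels n_1 and n_2 of the two edges, the covering level p, and the number of new details produced at each iteration.

```python
import math
import numpy as np

def detail_pattern(N, i_s, i_f):
    """Empirical number of new details per level for the range [i_s, i_f]."""
    top = int(math.log2(N))
    n1 = max(k for k in range(top + 1) if i_s % 2 ** k == 0)
    n2 = max(k for k in range(top + 1) if (i_f + 1) % 2 ** k == 0)
    p = next(k for k in range(top + 1) if i_s // 2 ** k == i_f // 2 ** k)
    print(f"n1={n1}, n2={n2}, p={p}")
    low = np.zeros(N); low[i_s:i_f + 1] = 1.0
    for k in range(1, top + 1):
        low, high = haar_step(low)
        print(f"level {k}: {int(np.count_nonzero(np.round(high, 12)))} new details")

detail_pattern(N=64, i_s=8, i_f=43)
```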


More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir School of Computer Science Carleton University, Ottawa, Canada http://www.scs.carleton.ca/ jamuir 23 October

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Searching Dimension Incomplete Databases

Searching Dimension Incomplete Databases IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO., JANUARY 3 Searching Dimension Incomplete Databases Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, and Wei Wang Abstract

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Wavelets and Multiresolution Processing (Wavelet Transforms) Christophoros Nikou cnikou@cs.uoi.gr University of Ioannina - Department of Computer Science 2 Contents Image pyramids

More information

Ch. 10 Vector Quantization. Advantages & Design

Ch. 10 Vector Quantization. Advantages & Design Ch. 10 Vector Quantization Advantages & Design 1 Advantages of VQ There are (at least) 3 main characteristics of VQ that help it outperform SQ: 1. Exploit Correlation within vectors 2. Exploit Shape Flexibility

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

Improved Algorithms for Module Extraction and Atomic Decomposition

Improved Algorithms for Module Extraction and Atomic Decomposition Improved Algorithms for Module Extraction and Atomic Decomposition Dmitry Tsarkov tsarkov@cs.man.ac.uk School of Computer Science The University of Manchester Manchester, UK Abstract. In recent years modules

More information

Defining the Discrete Wavelet Transform (DWT)

Defining the Discrete Wavelet Transform (DWT) Defining the Discrete Wavelet Transform (DWT) can formulate DWT via elegant pyramid algorithm defines W for non-haar wavelets (consistent with Haar) computes W = WX using O(N) multiplications brute force

More information

MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING

MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING Yannick Morvan, Dirk Farin University of Technology Eindhoven 5600 MB Eindhoven, The Netherlands email: {y.morvan;d.s.farin}@tue.nl Peter

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London Distributed Data Fusion with Kalman Filters Simon Julier Computer Science Department University College London S.Julier@cs.ucl.ac.uk Structure of Talk Motivation Kalman Filters Double Counting Optimal

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1) Correlated subqueries Query Optimization CPS Advanced Database Systems SELECT CID FROM Course Executing correlated subquery is expensive The subquery is evaluated once for every CPS course Decorrelate!

More information

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks Plausible Clocks with Bounded Inaccuracy Causality & Concurrency a b exists a path from a to b Brad Moore, Paul Sivilotti Computer Science & Engineering The Ohio State University paolo@cse.ohio-state.edu

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

An Introduction to Wavelets and some Applications

An Introduction to Wavelets and some Applications An Introduction to Wavelets and some Applications Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France An Introduction to Wavelets and some Applications p.1/54

More information

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes A Piggybacing Design Framewor for Read-and Download-efficient Distributed Storage Codes K V Rashmi, Nihar B Shah, Kannan Ramchandran, Fellow, IEEE Department of Electrical Engineering and Computer Sciences

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 Machine Learning for Signal Processing Sparse and Overcomplete Representations Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 1 Key Topics in this Lecture Basics Component-based representations

More information

Randomness-in-Structured Ensembles for Compressed Sensing of Images

Randomness-in-Structured Ensembles for Compressed Sensing of Images Randomness-in-Structured Ensembles for Compressed Sensing of Images Abdolreza Abdolhosseini Moghadam Dep. of Electrical and Computer Engineering Michigan State University Email: abdolhos@msu.edu Hayder

More information

State of the art Image Compression Techniques

State of the art Image Compression Techniques Chapter 4 State of the art Image Compression Techniques In this thesis we focus mainly on the adaption of state of the art wavelet based image compression techniques to programmable hardware. Thus, an

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Wavelets and Multiresolution Processing () Christophoros Nikou cnikou@cs.uoi.gr University of Ioannina - Department of Computer Science 2 Contents Image pyramids Subband coding

More information

Detailed Derivation of Theory of Hierarchical Data-driven Descent

Detailed Derivation of Theory of Hierarchical Data-driven Descent Detailed Derivation of Theory of Hierarchical Data-driven Descent Yuandong Tian and Srinivasa G. Narasimhan Carnegie Mellon University 5000 Forbes Ave, Pittsburgh, PA 15213 {yuandong, srinivas}@cs.cmu.edu

More information

Topic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number

Topic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number Topic Contents Factoring Methods Unit 3 The smallest divisor of an integer The GCD of two numbers Generating prime numbers Computing prime factors of an integer Generating pseudo random numbers Raising

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

ECEN 689 Special Topics in Data Science for Communications Networks

ECEN 689 Special Topics in Data Science for Communications Networks ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 5 Optimizing Fixed Size Samples Sampling as

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

CS383, Algorithms Spring 2009 HW1 Solutions

CS383, Algorithms Spring 2009 HW1 Solutions Prof. Sergio A. Alvarez http://www.cs.bc.edu/ alvarez/ 21 Campanella Way, room 569 alvarez@cs.bc.edu Computer Science Department voice: (617) 552-4333 Boston College fax: (617) 552-6790 Chestnut Hill,

More information

Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters

Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters Zhaojun Wo and Israel Koren Department of Electrical and Computer Engineering University of Massachusetts, Amherst,

More information

A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy

A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy Nir Halman Hebrew University of Jerusalem halman@huji.ac.il July 3, 2016 Abstract Given n elements

More information

Multidimensional Divide and Conquer 1 Skylines

Multidimensional Divide and Conquer 1 Skylines Yufei Tao ITEE University of Queensland The next few lectures will be dedicated to an important technique: divide and conquer. You may have encountered the technique in an earlier algorithm course, but

More information

AS computer hardware technology advances, both

AS computer hardware technology advances, both 1 Best-Harmonically-Fit Periodic Task Assignment Algorithm on Multiple Periodic Resources Chunhui Guo, Student Member, IEEE, Xiayu Hua, Student Member, IEEE, Hao Wu, Student Member, IEEE, Douglas Lautner,

More information

Notes on the Dual Ramsey Theorem

Notes on the Dual Ramsey Theorem Notes on the Dual Ramsey Theorem Reed Solomon July 29, 2010 1 Partitions and infinite variable words The goal of these notes is to give a proof of the Dual Ramsey Theorem. This theorem was first proved

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Configuring Spatial Grids for Efficient Main Memory Joins

Configuring Spatial Grids for Efficient Main Memory Joins Configuring Spatial Grids for Efficient Main Memory Joins Farhan Tauheed, Thomas Heinis, and Anastasia Ailamaki École Polytechnique Fédérale de Lausanne (EPFL), Imperial College London Abstract. The performance

More information

Linear Programming: Simplex

Linear Programming: Simplex Linear Programming: Simplex Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Linear Programming: Simplex IMA, August 2016

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Scribes: Ellis Weng, Andrew Owens February 11, 2010 1 Introduction In this lecture, we will introduce our second paradigm for

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Algorithms for pattern involvement in permutations

Algorithms for pattern involvement in permutations Algorithms for pattern involvement in permutations M. H. Albert Department of Computer Science R. E. L. Aldred Department of Mathematics and Statistics M. D. Atkinson Department of Computer Science D.

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

More information

Digital Image Processing

Digital Image Processing Digital Image Processing, 2nd ed. Digital Image Processing Chapter 7 Wavelets and Multiresolution Processing Dr. Kai Shuang Department of Electronic Engineering China University of Petroleum shuangkai@cup.edu.cn

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

COS 598D - Lattices. scribe: Srdjan Krstic

COS 598D - Lattices. scribe: Srdjan Krstic COS 598D - Lattices scribe: Srdjan Krstic Introduction In the first part we will give a brief introduction to lattices and their relevance in some topics in computer science. Then we show some specific

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Wavelets and Multiresolution Processing

Wavelets and Multiresolution Processing Wavelets and Multiresolution Processing Wavelets Fourier transform has it basis functions in sinusoids Wavelets based on small waves of varying frequency and limited duration In addition to frequency,

More information

Nearest Neighbor Search with Keywords: Compression

Nearest Neighbor Search with Keywords: Compression Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 14 October 16, 2013 CPSC 467, Lecture 14 1/45 Message Digest / Cryptographic Hash Functions Hash Function Constructions Extending

More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. X, MONTH 2007 1 A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir Abstract We present a simple algorithm

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 54, NO 2, FEBRUARY 2006 423 Underdetermined Blind Source Separation Based on Sparse Representation Yuanqing Li, Shun-Ichi Amari, Fellow, IEEE, Andrzej Cichocki,

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value A -Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value Shuhui Li, Miao Song, Peng-Jun Wan, Shangping Ren Department of Engineering Mechanics,

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Asymptotic redundancy and prolixity

Asymptotic redundancy and prolixity Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE 5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Uncertainty Relations for Shift-Invariant Analog Signals Yonina C. Eldar, Senior Member, IEEE Abstract The past several years

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

MATH Linear Algebra

MATH Linear Algebra MATH 304 - Linear Algebra In the previous note we learned an important algorithm to produce orthogonal sequences of vectors called the Gramm-Schmidt orthogonalization process. Gramm-Schmidt orthogonalization

More information

Caesar s Taxi Prediction Services

Caesar s Taxi Prediction Services 1 Caesar s Taxi Prediction Services Predicting NYC Taxi Fares, Trip Distance, and Activity Paul Jolly, Boxiao Pan, Varun Nambiar Abstract In this paper, we propose three models each predicting either taxi

More information

MAA507, Power method, QR-method and sparse matrix representation.

MAA507, Power method, QR-method and sparse matrix representation. ,, and representation. February 11, 2014 Lecture 7: Overview, Today we will look at:.. If time: A look at representation and fill in. Why do we need numerical s? I think everyone have seen how time consuming

More information

Compute the Fourier transform on the first register to get x {0,1} n x 0.

Compute the Fourier transform on the first register to get x {0,1} n x 0. CS 94 Recursive Fourier Sampling, Simon s Algorithm /5/009 Spring 009 Lecture 3 1 Review Recall that we can write any classical circuit x f(x) as a reversible circuit R f. We can view R f as a unitary

More information

Feasibility Conditions for Interference Alignment

Feasibility Conditions for Interference Alignment Feasibility Conditions for Interference Alignment Cenk M. Yetis Istanbul Technical University Informatics Inst. Maslak, Istanbul, TURKEY Email: cenkmyetis@yahoo.com Tiangao Gou, Syed A. Jafar University

More information

arxiv: v1 [cs.dm] 22 Mar 2014

arxiv: v1 [cs.dm] 22 Mar 2014 Online Square-into-Square Packing Sándor P. Fekete Hella-Franziska Hoffmann arxiv:03.5665v [cs.dm] Mar 0 Abstract In 967, Moon and Moser proved a tight bound on the critical density of squares in squares:

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2019 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

On the Complexity of Partitioning Graphs for Arc-Flags

On the Complexity of Partitioning Graphs for Arc-Flags Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 17, no. 3, pp. 65 99 (013) DOI: 10.7155/jgaa.0094 On the Complexity of Partitioning Graphs for Arc-Flags Reinhard Bauer Moritz Baum Ignaz

More information

Designing Information Devices and Systems I Spring 2018 Homework 11

Designing Information Devices and Systems I Spring 2018 Homework 11 EECS 6A Designing Information Devices and Systems I Spring 28 Homework This homework is due April 8, 28, at 23:59. Self-grades are due April 2, 28, at 23:59. Submission Format Your homework submission

More information

LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT

LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT ANALYSIS FOR HIGH SCHOOL TEACHERS LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT ROTHSCHILD CAESARIA COURSE, 2011/2 1. Rational numbers: how to define them? Rational numbers were discovered

More information

17.1 Correctness of First-Order Tableaux

17.1 Correctness of First-Order Tableaux Applied Logic Lecture 17: Correctness and Completeness of First-Order Tableaux CS 4860 Spring 2009 Tuesday, March 24, 2009 Now that we have introduced a proof calculus for first-order logic we have to

More information

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford Probabilistic Model Checking Michaelmas Term 2011 Dr. Dave Parker Department of Computer Science University of Oxford Probabilistic model checking System Probabilistic model e.g. Markov chain Result 0.5

More information

1 Shortest Vector Problem

1 Shortest Vector Problem Lattices in Cryptography University of Michigan, Fall 25 Lecture 2 SVP, Gram-Schmidt, LLL Instructor: Chris Peikert Scribe: Hank Carter Shortest Vector Problem Last time we defined the minimum distance

More information

Very Sparse Random Projections

Very Sparse Random Projections Very Sparse Random Projections Ping Li, Trevor Hastie and Kenneth Church [KDD 06] Presented by: Aditya Menon UCSD March 4, 2009 Presented by: Aditya Menon (UCSD) Very Sparse Random Projections March 4,

More information