Improving Retrieval Cost by Choosing the Best Wavelet Decomposition for Multidimensional Datasets

Dimitris Sacharidis, Cyrus Shahabi, Huseyin Balli, Andreas Xeros, Antonio Ortega
University of Southern California, Los Angeles, CA
{sacharid, shahabi, balli,

This research has been funded in part by NSF grants EEC (IMSC ERC), IIS (ITR), and IIS (CAREER), and unrestricted cash gifts from the Okawa Foundation and Microsoft.

Abstract

Wavelets have been extensively used for approximate, progressive or even exact evaluation of queries. However, the complete wavelet transform is not always the optimal form in which to store the data. We exploit the properties of the full tree of the wavelet decomposition in order to find a representation for the dataset that minimizes the retrieval cost, assuming a given query workload. We develop techniques to answer queries of range size r with a retrieval cost of O(log^n r), while maintaining an update cost of O(log^n S), for an n-dimensional cube of size S^n. Furthermore, we investigate the space and time trade-off associated with storing extra coefficients to reduce the cost even further.

1  Introduction

Wavelets have been widely accepted in the database community as a method for dealing with large multidimensional datasets. Their origin lies in signal processing applications, where they are mainly used as a tool for signal compression, due to their inherent multiresolution properties. By applying these ideas to multidimensional databases, we can achieve approximate, progressive or even fast exact answers to queries.

Throughout this paper we try to minimize the retrieval cost, which is the number of data values (coefficients) that must be retrieved from the database to answer a query. To achieve this we must choose the best representation for the dataset, given some prior knowledge of the type of queries typically submitted. We would like to develop methods to fine-tune the database and, towards this, we explore and exploit the properties of the full tree of the wavelet decomposition (the wavelet packet transform).

Our first contribution comes from a simple observation: for point queries, i.e., range queries where the size of the range is one, it is more efficient to query the raw data rather than use any transformation. On the other hand, for a very large range query, the full wavelet transformation may be the best representation for the data. In this paper we show that for general database queries, the complete wavelet transform is not always the optimal form in which to store the data. This is because the objective, unlike in traditional signal compression applications, is no longer to reconstruct the entire signal (or database); rather, the goal is to reconstruct an arbitrary subset of the data defined by a range query. For this reason, we consider all possible non-redundant representations for the dataset that exist in the full tree of the wavelet decomposition, only two of which are the untransformed dataset and its complete wavelet transformation.

In this paper we also proceed one step further. We suggest techniques for cutting down retrieval cost in applications where a moderate increase in storage space is acceptable. We investigate the space and time trade-off associated with increasing the amount of stored data in order to significantly decrease the retrieval cost. Given a limit on the available storage space and some collected statistics on the type of typical queries, we select the best representation for a dataset.
The idea, which again originates in signal processing, is to have an over-complete representation (multiple transformations) of the dataset. This means that online, at the time of query submission, there are alternative ways to answer the query, and consequently the one that minimizes the number of retrievals can be selected. Traditionally, wavelets have been used as an approximation tool; in this paper, however, we focus on the inherent pre-aggregation properties of the wavelet decomposition. Let us note that even though the discussion proving the effectiveness and usefulness of our approaches is somewhat detailed and complicated, the actual proposed algorithms are elegant and straightforward to implement, and they can lead to significant improvements in the retrieval cost with a completely negligible online overhead per query.

To be able to optimize the database and find the best form in which to store the data, we must have some knowledge about the characteristics of the queries typically submitted to the system. There is a significant number of publications on collecting and extracting query workload statistics, and many DBMSs support such techniques and utilize them to measure system performance, or even to automatically perform administrative tasks such as dropping/adding indexes [2]. In this paper, we use the term workload to refer to a set of queries submitted to the database in the past. At first, we assume that the available workload can perfectly predict the characteristics of the future queries, and we optimize the database accordingly. Later, we investigate what happens when the workload used to calculate the best form is somewhat inaccurate, that is, it fails to capture the future query behavior.

We begin our discussion with an overview of wavelets and how they are used to answer queries in Section 3. In Section 4 we provide an algorithm to select the best form for the data, given that there is no additional storage space available. Later, in Section 5 we propose two techniques for taking advantage of any available extra storage space. The first technique, Best Level-Set, is a direct extension of the algorithm introduced in Section 4, whereas the second one, Best Branch-Set, is more sophisticated and leads to improvements on the order of 25% for additional storage of 20%. Section 6 provides comparative results for all methods, using both synthetic and real-world query workloads and datasets. Finally, in Section 7 we conclude our discussion.

2  Related Work

Online Analytical Processing (OLAP) applications have exhaustively dealt with the problem of providing fast and exact answers to range queries over large and multidimensional datasets that can be seen as hypercubes. The main intuition behind all techniques is to store the dataset in some pre-aggregated form, so as to speed up range aggregate queries. Improving query response time comes, of course, at the expense of increasing the update cost. Therefore, it becomes an issue of finding the appropriate balance between retrieval and update cost, with respect to the underlying application. Wavelets are used as a mathematical tool to create a multiresolution and pre-aggregated view of a multidimensional dataset [1, 5, 6, 4, 7]. They have been extensively studied in signal processing applications and recently appeared in the database community to replace Prefix-Sum-like techniques. Consider, as an example of such a Prefix-Sum-like technique, the Dynamic Data Cube [?], where aggregates are pre-computed over blocks of increasing size by powers of 2, to provide a multi-resolution view of the dataset, exactly like wavelets. In our previous work [5] we have shown that Haar-based wavelets can achieve a perfect balance between retrieval cost and update cost. In this paper, we attempt to decrease retrieval cost even further with no increase in update cost, by taking into consideration the type of queries typically submitted. To the best of our knowledge, there has been no previous work on exploring the full wavelet packet transform to select the best wavelet decomposition given a query workload.

3  Background on Wavelets

3.1  Levels of Decomposition

In this section we provide the necessary definitions and the terminology that we use throughout this paper. We first assume single-dimensional datasets. Later, in Appendix B, we relax this assumption and discuss the extension to the multidimensional case. Let us assume that the dataset d is a single-dimensional vector of length N, which is a power of 2. Then, this dataset can be seen as a point in the N-dimensional space ℝ^N. We exploit the multiresolution property of the Discrete Wavelet Transform to decompose this vector space into a set of orthogonal subspaces.
The original space V_0 = ℝ^N is decomposed into V_1, which provides a coarse or averaged view of V_0, and into W_1, which is the orthogonal complement of V_1 in V_0. Both of these spaces have dimension N/2 and thus contain half as many basis vectors. The basis vectors in V_1 and W_1 are orthogonal to each other and form a basis for the original space: V_0 = V_1 ⊕ W_1. The projection of the data vector d on the basis vectors of V_1 and W_1 results in N coefficients, which are the first level of decomposition. The space V_1 can be further decomposed into V_2 and W_2 to provide an even coarser view of V_0 together with the corresponding details. The original space is then decomposed into W_1, W_2 and V_2, which is the second level of decomposition: V_0 = V_2 ⊕ W_1 ⊕ W_2. In general, at the k-th level of decomposition we have V_0 = V_k ⊕ W_1 ⊕ W_2 ⊕ ... ⊕ W_k, which is an average view at resolution level k plus the details of the previous levels. This chain of multiresolution analysis of the original space V_0 continues up to level log N and can be seen in Figure 1. At this point let us note that throughout this paper we use Haar wavelets. Extension to other filters can be achieved similarly to our work in [5].

Figure 1. Multiresolution Property of Wavelets

In the remainder of the paper we will use the notation WD_k(d) to refer to the k-th level of decomposition of vector d, or, in a context where a vector is implied, we will simply refer to WD_k as the k-th level of decomposition. We will use the term branch to refer to all the basis vectors of a subspace, either V_k or W_k. This terminology is used in the context of signal processing applications, where filter banks composed of a low-pass and a high-pass filter are chained to perform the wavelet decomposition. Furthermore, we will use the terms low-pass branch, average coefficients, or simply averages to refer to the coefficients that result from the projection of the data vector onto the basis vectors of a V subspace. Similarly, we will use the terms high-pass branch, detail coefficients, or simply details to refer to the coefficients that result from the projection of the data vector onto the basis vectors of a W subspace.

In total, there are 3N − 2 distinct coefficients across all levels of decomposition, consisting of N − 1 details and 2N − 1 averages. We will use the term full tree (FT) to refer to all of these coefficients; the name comes from the tree structure of the wavelet decomposition. Thus, FT = WD_0 ∪ WD_1 ∪ ... ∪ WD_{log N} and |FT| = 3N − 2. An example of a full tree of decomposition for a vector of size 8 is also shown in Figure 1. The averages, including the original vector, are shaded, whereas the details are shown in white. The last level of decomposition, which is the complete wavelet transform, consists of the final average and all the details.

3.2  Answering Queries using Wavelets

Let us assume that our dataset is stored in the vector d of size N. In this paper we restrict ourselves to the case of answering sum queries defined on a range R. Our techniques can be generalized to arbitrary polynomial range-sum queries similarly to our previous work in [5]. A range-sum query Q(R, d) of range R on the data vector d is the summation of the values of the data vector that are contained in the range. The answer to such a query is given by the summation of the values of the data vector d for each cell ξ contained in the range: a = Σ_{ξ ∈ R} d(ξ). We can rewrite this summation as the dot product of two vectors, the data vector d and the query vector q. The query vector is a vector of the same size as the data vector that has the value one in all cells within the range R and zero in all cells outside that range.

Definition 1: A range-sum query Q(R, d) of range R on the data vector d corresponds to a query vector q, and the answer to this query is given by the inner product of d and q:
a = Σ_{ξ ∈ R} d(ξ) = ⟨q, d⟩ = Σ_i q[i] · d[i]

Let us provide a very useful lemma that applies to any transformation that preserves the Euclidean norm, including the Discrete Wavelet Transform and, more importantly, the wavelet decomposition at any level. This lemma is essentially the generalized Parseval equality applied to our case.

Lemma 1: If d̂ is the wavelet decomposition at level k of the data vector d, i.e. d̂ = WD_k(d), and q̂ is similarly the wavelet decomposition at level k of the query vector q, i.e. q̂ = WD_k(q), then
⟨q, d⟩ = ⟨q̂, d̂⟩, that is, Σ_i q[i] · d[i] = Σ_j q̂[j] · d̂[j]

This means that a range-sum query can be answered using either the untransformed vectors or some orthogonal transformation of them. The reason for choosing to transform the data into the wavelet domain will become apparent with Theorem 1. The intuition is that we would like to find a representation for the query that has the least number of non-zero values, so that the number of required data values from the database is minimized; this is formalized in Section 4. Therefore, it is logical to ask what happens at each level of decomposition. Recall that the wavelet transform is a recursive procedure, where at each step, termed an iteration, an input produces two outputs of half length, low-pass (averages) and high-pass (details), and the procedure continues using the low-pass output as the new input.

Lemma 2: At any iteration of the wavelet transform, an input vector, all of whose non-zeros have the same value and form a contiguous range (the lemma also applies to the case where the first and last values of the range are not the same as the rest of the range), has a low-pass output of the same form but with half as many non-zeros, and a high-pass output of at most 2 non-zeros.
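To make the decomposition and Lemmas 1 and 2 concrete, the following is a minimal Python sketch, assuming orthonormal Haar filters; the function names (haar_step, wavelet_decomposition) are ours and not part of any particular library. It builds WD_k and checks that the inner product of a query vector and a data vector is preserved at every level.

```python
import numpy as np

def haar_step(v):
    """One Haar iteration: split v (even length) into low-pass averages
    and high-pass details, using orthonormal filters."""
    v = np.asarray(v, dtype=float)
    low = (v[0::2] + v[1::2]) / np.sqrt(2.0)
    high = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    return low, high

def wavelet_decomposition(v, k):
    """WD_k(v): the averages at level k followed by the details of levels 1..k."""
    low, details = np.asarray(v, dtype=float), []
    for _ in range(k):
        low, high = haar_step(low)
        details.append(high)
    return np.concatenate([low] + details)

# Lemma 1 (generalized Parseval): <q, d> is the same at every level.
N = 16
d = np.random.rand(N)                       # data vector
q = np.zeros(N); q[3:11] = 1.0              # range-sum query over cells 3..10
for k in range(int(np.log2(N)) + 1):
    assert np.isclose(np.dot(wavelet_decomposition(q, k),
                             wavelet_decomposition(d, k)),
                      np.dot(q, d))
```

Because each iteration is an orthonormal change of basis, the check holds regardless of the level in which the data is actually stored.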
The proof of Lemma 2, as well as the other proofs, can be found in Appendix A.

Figure 2. Wavelets of a Range-Sum Query

This means that, because of the specific form of the query vector, an iteration of the wavelet transform halves the number of non-zeros required to be processed at the next iteration. Also, because the query consists of equal and contiguous values, the only detail coefficients occur at the edges of the range, resulting in a maximum of 2 coefficients, one per edge; see Figure 2. Applying Lemma 2 recursively, one can easily derive a bound for the number of non-zero coefficients at each level of decomposition.

Theorem 1: For a range-sum query vector of size N defined over a range of length r, the number of non-zero coefficients in its wavelet decomposition at level k is not more than r/2^k + 2k + 1.

All previous work in this area considered only the last level of decomposition, and the corresponding result was that a query vector of size N, defined over a range of size r, completely decomposed into the wavelet domain, has less than 2 log N + 2 non-zero coefficients. One of our strongest contributions is that we have shown that the number of non-zero coefficients depends on the size of the range r and not on the domain size N, if we consider intermediate levels of decomposition. The next theorem summarizes the results seen so far.
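As an illustration of Lemma 2 and Theorem 1, the sketch below (reusing haar_step and wavelet_decomposition from the previous sketch) counts the non-zero coefficients of a range query at every level and prints them next to a Theorem-1-style bound; the particular range is an arbitrary example of ours.

```python
import math
import numpy as np

def nonzeros_per_level(N, start, length):
    """Count non-zero coefficients of the query vector for the range
    [start, start+length-1] at each level, and compare with the upper
    bound ceil(r / 2^k) + 2k + 1 (a Theorem-1-style bound)."""
    q = np.zeros(N)
    q[start:start + length] = 1.0
    for k in range(int(math.log2(N)) + 1):
        qk = wavelet_decomposition(q, k)
        actual = int(np.count_nonzero(np.round(qk, 12)))
        bound = math.ceil(length / 2 ** k) + 2 * k + 1
        print(f"level {k}: {actual} non-zero coefficients (bound {bound})")

nonzeros_per_level(N=64, start=13, length=21)
```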

Theorem 2: The answer to a range-sum query Q(R, d) of size |R| = r, when the query vector q and the data vector d are both transformed to the same level of decomposition k, is given by a = ⟨q̂, d̂⟩ and can be computed with O(r/2^k + 2k) retrievals from the database.

Up to now, we were restricting ourselves to the one-dimensional case, where the data is stored in a vector. In the most general case the dataset is of a multidimensional nature and can be viewed as a multidimensional hypercube [3]. We use the tensor product multivariate wavelet transform to decompose the cube into the wavelet domain; that is, we apply the wavelet transform for each dimension of the data cube. As a result the following theorem can be shown.

Theorem 3: The cost of answering an n-dimensional sum query defined over a hyper-rectangle range is the product of the costs of answering each of the n single-dimensional range-sum queries defining the hyper-rectangle range.

Furthermore, the algorithms presented in this paper are easily extended to n dimensions by applying them independently for each dimension; therefore, their time and space complexity scales linearly with respect to the dimensionality. An extended discussion can be found in Appendix B. For the remainder of this paper we assume single-dimensional vectors, for the sake of simplicity.

4  Optimal Wavelet Decomposition without Extra Storage Space

4.1  Finding the Optimal Level of Decomposition

In this section we find the optimal level of decomposition in which to store the data vector. We assume a storage space of identical size to the data vector. The optimal level is the one that results in the least number of non-zero query coefficients, and thus the least number of retrievals from the database, for a given query workload. Given a query q we define the cost of answering this query as the number of non-zero coefficients in the transformed query vector. Since we can have log N + 1 possible transformations of the query vector, we will use cost(q | WD_i) to refer to the cost of query q transformed at the i-th level of decomposition.

Definition 2: The cost of answering a query q in the i-th level of decomposition WD_i is defined as the number of non-zero coefficients contained in the transformed query vector WD_i(q) and will be denoted as cost(q | WD_i).

We have already seen, in Theorem 1, that there is a strong correlation between the range of the query and the cost of the query at a particular level of decomposition. It is therefore expected that we can always find the level of decomposition which minimizes the cost for a given query, just by examining each level. However, the next theorem states that this is not necessary, since the optimal level can be chosen among only three candidate levels. The proof is easily derived from Lemma 2. One can see that the interesting (candidate) levels are those where the gain of halving the non-zero averages is less than 2, which is the cost of going into the next level (because of the non-zero details).

Theorem 4: The level that minimizes the cost of a query q = Q(R, d) of size r = |R| is either the lowest level where there are at least 4 averages (log r − 1), the lowest level where there are at least 2 averages (log r), or the lowest level where there is exactly 1 average. The exact criteria are provided in the proof.

Now, we have a way to identify the optimal level given a single query.
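Since the exact cost can be computed level by level, the optimal level for a single query can also be found by brute force, which is what the following sketch does (again reusing wavelet_decomposition from the earlier sketch); Theorem 4 says the minimizer is always one of a few candidate levels around log r. The function name and the example range are ours.

```python
import math
import numpy as np

def best_level_for_query(N, start, length):
    """Exact cost(q | WD_k) for every level k, and the level minimizing it."""
    q = np.zeros(N)
    q[start:start + length] = 1.0
    costs = [int(np.count_nonzero(np.round(wavelet_decomposition(q, k), 12)))
             for k in range(int(math.log2(N)) + 1)]
    return min(range(len(costs)), key=costs.__getitem__), costs

level, costs = best_level_for_query(N=128, start=37, length=21)
print(level, costs)
```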
The workload, however, comprises more than one query, so we have to aggregate across all queries in the workload for every level of decomposition. In this paper we treat all queries as equal, so the aggregation degenerates to a summation. However, one can choose a different aggregation function, e.g. a weighted average, to better reflect the importance of each individual query.

Definition 3: For a query workload Work we define the total cost for the i-th level of decomposition as the summation of the costs for each individual query:
C_i = Σ_{q_j ∈ Work} cost(q_j | WD_i)

The algorithm Optimal Level shown below greedily selects the level of decomposition that has the least cost for a given workload. By keeping the data vector in that optimal level we guarantee that queries included in the workload are answered in the most efficient way; in other words, our database is better tuned for that workload. In the experimental section we investigate what happens when answering queries not belonging to the workload.

Algorithm 1: Optimal Level
  Input: Work            // workload
  Output: WD_b           // best level
  Var: C                 // total-cost array
  foreach q ∈ Work do
1   for i ← 0 to log N do
2     C[i] ← C[i] + cost(q | WD_i)
  C[b] ← min C           // find the minimum cost
  return WD_b            // b is the best level

The lines numbered 1 and 2 in Algorithm 1 can be executed in time O(log N) by using the methodology described in Theorem 1. This means that the algorithm for selecting the best level has time complexity O(m log N) and space complexity O(log N) for a workload of m queries.
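A direct Python transcription of Algorithm 1 is given below as a sketch; cost_fn stands for any implementation of cost(q | WD_i), for example the exact non-zero count used in the previous sketches, and queries are assumed to be given as (start, length) pairs. These names are ours, not the paper's.

```python
import math

def optimal_level(workload, N, cost_fn):
    """Algorithm 1 (Optimal Level): accumulate, for each level i, the total
    cost of the workload, then return the level with the minimum total."""
    levels = int(math.log2(N)) + 1
    total = [0] * levels
    for q in workload:
        for i in range(levels):            # lines 1-2 of Algorithm 1
            total[i] += cost_fn(q, i)
    return min(range(levels), key=total.__getitem__)
```

For a workload of m queries this performs O(m log N) cost evaluations, matching the complexity stated above.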

5  Optimal Wavelet Decomposition with Extra Storage Space

The objective of this section is to take advantage of the case where there is enough storage space to hold not only the decomposed data but also some additional coefficients, in order to improve the system performance. In Section 5.1 we restrict ourselves to the case where the extra coefficients are all part of another level of decomposition. Later, in Section 5.2, we relax this restriction.

5.1  Optimal Level-Set

In this section we find the levels of decomposition that minimize the cost of answering queries from the workload, given a restriction on the available storage size. We use the term Level-Set and the symbol SWD to refer to a set of levels of decomposition: SWD = ∪_i WD_i. Before proceeding any further, let us provide a definition of the cost of answering a query given a Level-Set.

Definition 4: The cost of answering a query q using a Level-Set SWD, cost(q | SWD), is the minimum cost of answering the query using just one level of decomposition WD_i ∈ SWD:
cost(q | SWD) = min_{WD_i ∈ SWD} cost(q | WD_i)

This means that we assume that, for a given query, a Level-Set has the same cost as the best cost among all the levels it includes. The total cost of answering queries from a workload for any Level-Set can be calculated using an aggregation function; here we simply use the summation.

5.1.1  Search Space for Finding the Optimal Level-Set

We already know that there are log N + 1 levels of decomposition, so there can be at most 2^{log N + 1} − 1 = 2N − 1 Level-Sets. However, since there is a restriction on the available space, we expect the search space to be significantly smaller. To prove that, we must first calculate the size of a Level-Set. One straightforward observation is that for any two levels of decomposition there is a number of common coefficients. These are all the detail coefficients of the lowest level of decomposition. The only exception is the 0-th level of decomposition (i.e. the raw data), which has no common coefficients with any other level.

Lemma 3: Let us assume two levels of decomposition WD_k and WD_m, where WD_k is the lowest level, 0 < k < m. The number of common coefficients in these two levels is
|WD_k ∩ WD_m| = Σ_{i=1}^{k} N/2^i = (1 − 1/2^k) · N,
whereas the total number of coefficients included in the Level-Set of these two levels is
|WD_k ∪ WD_m| = (1 + 1/2^k) · N.

In Figure 3 the shared coefficients are shown in dark grey, whereas the non-common coefficients are shown in light grey, for two levels k and m.

Figure 3. Common coefficients among two levels

The extension of Lemma 3 to a Level-Set consisting of more than 2 levels leads to the following theorem.

Theorem 5: The size of a Level-Set SWD = {WD_k1, WD_k2, ..., WD_km} consisting of m > 1 levels of decomposition WD_k1, WD_k2, ..., WD_km in increasing order (1 ≤ k_1 < k_2 < ... < k_m) is
|SWD| = (1 + Σ_{i=1}^{m−1} 1/2^{k_i}) · N.
When the untransformed data (WD_0) is included, the size always increases by N.

We can easily verify the validity of this theorem if we consider a Level-Set that includes all levels of decomposition except the 0-th one. Theorem 5 suggests that this Level-Set has size |WD_1 ∪ ... ∪ WD_{log N}| = 2N − 2, as expected.

Corollary 1: The size of a Level-Set SWD, where WD_k1, k_1 > 0, is the lowest level of decomposition included in the Level-Set, is bounded by |SWD| < (1 + 1/2^{k_1 − 1}) · N.

The previous corollary puts a bound on the size of a Level-Set, which can be used to prune the search space of Level-Sets given a restriction on the available space.
Let us use S to indicate the available space; then N < S ≤ 2N − 2, that is, we do not consider the 0-th level of decomposition. Equivalently, the additional storage space is a fraction α of N, where α ∈ (0, 1 − 2/N]. Therefore S = (1 + α)N.

Theorem 6: Given a storage space of S = (1 + α)N, where α ∈ (0, 1 − 2/N], a Level-Set that has size less than S can only contain levels after β, where β is the smallest integer greater than or equal to 1 + log(1/α). Levels before β can only comprise singleton Level-Sets. There exist at most N/2^β such Level-Sets.
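The sketch below turns Lemma 3, Theorem 5 and Theorem 6 into code: it computes the size of a Level-Set, the threshold β for a given α, and enumerates the candidate Level-Sets that respect the storage budget. It is a minimal sketch and the helper names are ours; α = 1/4 matches the worked example that follows.

```python
import math
from itertools import combinations

def level_set_size(levels, N):
    """Theorem 5: the deepest level contributes N coefficients; every other
    level k adds only its low-pass branch (N / 2^k averages, or N for level 0)."""
    levels = sorted(set(levels))
    size = N
    for k in levels[:-1]:
        size += N if k == 0 else N // 2 ** k
    return size

def beta(alpha):
    """Theorem 6: levels shallower than beta can only appear as singletons."""
    return math.ceil(1 + math.log2(1.0 / alpha))

def candidate_level_sets(N, alpha):
    S = (1 + alpha) * N
    b, top = beta(alpha), int(math.log2(N))
    singles = [(k,) for k in range(1, min(b, top + 1))]   # shallow levels, alone
    deep = range(min(b, top + 1), top + 1)                # levels eligible to combine
    combos = [c for r in range(1, len(deep) + 1) for c in combinations(deep, r)]
    return [ls for ls in singles + combos if level_set_size(ls, N) <= S]

print(len(candidate_level_sets(N=256, alpha=0.25)))
```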

For example, assume that the extra storage available is N/4, that is, α = 1/4. Theorem 6 suggests that we can only have levels after β. In our case β = 3, which means that levels 1 and 2 cannot be included in any Level-Set that obeys the space restriction; of course, they can be considered as two single-item Level-Sets.

5.1.2  Optimal Level-Set Algorithm

To select the optimal Level-Set among those that have size less than S = (1 + α)N < 2N − 2 for a workload of m queries, the following steps are required.
1. For each level, find the cost cost(q | WD_i) of answering each query of the workload.
2. Calculate β = ⌈1 + log(1/α)⌉.
3. Create the candidate Level-Sets.
4. For each SWD in the candidates, calculate cost(q | SWD) from cost(q | WD_i) for all WD_i ∈ SWD.
5. Select the Level-Set with the minimum total cost.

Algorithm 2, Optimal Level-Set, greedily selects the Level-Set among those that have size less than S = (1 + α)N < 2N − 2 which minimizes the total cost of answering queries belonging to a workload of m queries. The first step of calculating cost(q | WD_i) for 0 ≤ i ≤ log N takes time O(m log N). There are at most 2N/2^β candidate Level-Sets and each one has at most log N − β + 1 levels. For each Level-Set and each query the best level must be chosen, which leads to a time cost of m(log N − β + 1) · 2N/2^β = O(mN log N). The space required is for storing the cost of answering m queries for log N − β + 1 levels, that is, m(log N − β + 1) = O(m log N).

Algorithm 2: Optimal Level-Set
  Input: Work            // query workload
  Input: S = (1 + α)N    // storage space
  Output: SWD_b          // optimal Level-Set
  Var: β                 // β value
  Var: C                 // total-cost array
  Var: candidates        // array of candidate Level-Sets
  foreach q ∈ Work do
    for i ← 0 to log N do
      calculate cost(q | WD_i)
  β ← ⌈1 + log(1/α)⌉
  candidates ← createCandidates(β)
  foreach SWD ∈ candidates do
    foreach q ∈ Work do
      calculate cost(q | SWD)
      C[SWD] ← C[SWD] + cost(q | SWD)
  return SWD_b ← argmin_{SWD} C

5.2  Optimal Branch-Set

In this section we still assume that the available storage space exceeds the size of the data vector, but we no longer restrict ourselves to complete levels of decomposition. Instead, we increase the granularity from levels to branches of the wavelet decomposition. We find the best set of branches to store, and we show how this can increase the efficiency of our query answering system. We use the term Branch-Set and the symbol SB to refer to a set of branches. A Branch-Set can contain branches that form complete levels of decomposition, as well as branches that do not form complete levels because some required high-pass branches are missing. Recall that a level of decomposition WD_k is composed of the low-pass branch lb_k and all the previous high-pass branches hb_l, 0 < l ≤ k:
WD_k = lb_k ∪ hb_1 ∪ ... ∪ hb_k
The stray branches that do not form levels are low-pass branches and form what we call a pseudo-level of decomposition.

Definition 5: The k-th pseudo-level of decomposition is defined by a set of branches that contains the k-th low-pass branch and any of the previous high-pass branches hb_l, 0 < l ≤ k.

Consequently, a level of decomposition is also a pseudo-level, while the opposite is not generally true. Such a definition allows us to define a Branch-Set as a set of pseudo-levels, just as a Level-Set is a set of complete levels. In addition, a Branch-Set must contain at least one full level of decomposition, so that there exists an orthogonal representation of the original vector (that is, the set of branches must form an overcomplete basis, a frame, for V_0; see Figure 1).
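The following sketch fixes a concrete representation for branches and pseudo-levels, with branches written as ('lb', k) or ('hb', k) pairs; it checks Definition 5 and the requirement that a Branch-Set contains at least one complete level. The representation and the helper names are assumptions of ours, not notation from the paper.

```python
def level_branches(k):
    """The branches making up the complete level WD_k: its low-pass branch
    plus all high-pass branches of levels 1..k (WD_0 is just the raw data)."""
    return {("lb", k)} | {("hb", l) for l in range(1, k + 1)}

def is_pseudo_level(branches, k):
    """Definition 5: the k-th low-pass branch plus any subset of hb_1..hb_k."""
    return ("lb", k) in branches and all(
        kind == "hb" and 1 <= l <= k for kind, l in branches - {("lb", k)})

def contains_complete_level(branch_set, max_level):
    """A valid Branch-Set must contain at least one complete level WD_m,
    so that an orthogonal representation of the original vector exists."""
    return any(level_branches(m) <= set(branch_set) for m in range(max_level + 1))

bs = level_branches(3) | {("lb", 5), ("hb", 4)}      # WD_3 plus a pseudo-level at level 5
print(is_pseudo_level({("lb", 5), ("hb", 4)}, 5))     # True
print(contains_complete_level(bs, max_level=8))       # True
```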
Let WD_m be the lowest such level of decomposition; then a Branch-Set contains a complete level as well as a number of pseudo-levels: SB = WD_m ∪ (∪_i wd_i).

5.2.1  Search Space for Finding the Optimal Branch-Set

Since we know that there are 2 log N + 1 branches for a vector of size N, we expect that in the worst case there are potentially 2^{2 log N + 1} − 1 = 2N² − 1 Branch-Sets. However, because of the storage space restriction, the number of candidate Branch-Sets is much smaller. To prove this, we must calculate the size of an arbitrary Branch-Set.

Theorem 7: Let SB be a Branch-Set that contains the m-th level of decomposition, where m is the lowest present level. Then the size of SB is bounded by |SB| < (1 + 1/2^{m−1}) · N.

Corollary 1 of Section 5.1 can also be derived from this general theorem. Assuming an available storage space of S = (1 + α)N, we can also prove the following theorem, which exactly determines the search space for Branch-Sets given a storage space restriction.

Theorem 8: Let S = (1 + α)N be the available storage, where α ∈ (0, 1 − 2/N]. For a Branch-Set containing the m-th level of decomposition as the lowest level present, the branches that do not belong to WD_m can only belong to levels greater than β, where β is the smallest integer greater than or equal to 1 + log(1/α). There exist less than (N² log N) / 2^{2β} possible Branch-Sets.
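Using the same ('lb' | 'hb', k) representation and the level_branches helper from the previous sketch, the size of a Branch-Set can be computed as below; it reflects the counting behind Theorem 7 (a branch at level k holds N / 2^k coefficients, and the branches of the complete level WD_m add up to N). The helper names are ours.

```python
def branch_size(kind, k, N):
    """Branch lb_k or hb_k has N / 2^k coefficients (lb_0 is the raw data)."""
    return N if k == 0 else N // 2 ** k

def branch_set_size(branch_set, N):
    """|SB|: add up the distinct branches it contains."""
    return sum(branch_size(kind, k, N) for kind, k in set(branch_set))

bs = level_branches(3) | {("lb", 5), ("hb", 4)}   # from the previous sketch
print(branch_set_size(bs, N=256))                 # 280, below the Theorem 7 bound of 320
```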

5.2.2  Answering a Query given a Branch-Set

In Section 5.1 we used one of the available levels in the Level-Set to answer a query. We could do the same thing here using pseudo-levels instead of levels, but then there would be no improvement in query cost over Section 5.1. Instead, we take advantage of the higher granularity that pseudo-levels provide. First, we split the original query into subqueries, and then for each subquery we decide which pseudo-level to use. The gain in retrieval cost is twofold: (a) intra-query gain, because each subquery can be answered using the most suitable pseudo-level, and (b) inter-query gain, in the case where two or more subqueries require common (shared) coefficients.

The following lemma, which is a result of the linearity of the inner product and of Lemma 1, formalizes query splitting and shows how each subquery can be answered in a different level of decomposition. A query vector q transformed into the k-th level of decomposition is denoted as q^k.

Lemma 4: The answer to a query vector q that can be written as the summation of n signed subqueries, q = Σ_{i=1}^{n} s_i · q_i where s_i ∈ {−1, 1}, can be calculated as
⟨q, d⟩ = Σ_{i=1}^{n} s_i · ⟨q_i, d⟩ = Σ_{i=1}^{n} s_i · ⟨q_i^{k_i}, d^{k_i}⟩

Among all possible ways to split a query, there is one that minimizes the retrieval cost for the query, given a Branch-Set. This is the optimal query splitting, and we use it to define the cost of a query for a given Branch-Set.

Definition 6: Given a set of available branches SB, the retrieval cost for answering a query q is the minimum retrieval cost among all possible query splittings.

Finding the retrieval cost for a query given a Branch-Set is a daunting task. Therefore, we must seek alternatives to blindly searching all possible query splittings. Let us assume, for a while, that there is no storage space restriction. Then, splitting the query in an optimal way is quite straightforward; we only need to store every low-pass branch, resulting in a storage space of S = 2N − 1. A query is split into subqueries so that each can be answered by a single average coefficient in a low-pass branch. Using the notion of buckets, an algorithm that splits the query would always select the bucket that best matches the query range. An example query of range r is shown in Figure 4. The best-fitting buckets (subqueries) are drawn with thick lines, black when positive and light grey when negative. We claim that, given a query range, selecting the bucket that leaves the least space to fill is optimal.

Theorem 9: Given a range-sum query, when the full tree of decomposition is available, an algorithm that greedily selects at each step the coefficient that corresponds to the bucket that leaves the least space to fill is optimal.

Figure 4. Optimal Query Split given a Full Tree
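A simplified version of the greedy split of Theorem 9 is sketched below: it covers the query range with as few aligned dyadic buckets as possible, largest first, each bucket corresponding to one average coefficient of some low-pass branch. The paper's greedy also allows signed buckets that overshoot the range (the grey buckets of Figure 4); this sketch omits that refinement.

```python
def dyadic_cover(s, f):
    """Cover the inclusive range [s, f] with aligned dyadic buckets,
    always taking the largest bucket that starts at s and fits in the range."""
    buckets = []
    while s <= f:
        size = 1
        while s % (2 * size) == 0 and s + 2 * size - 1 <= f:
            size *= 2
        buckets.append((s, size))          # bucket covering [s, s + size - 1]
        s += size
    return buckets

print(dyadic_cover(13, 33))    # [(13, 1), (14, 2), (16, 16), (32, 2)]
```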
Keeping in mind that Theorem 9 suggests that a query is split in such a way that each subquery is essentially an average coefficient in a low-pass branch, let us look at an example, shown in Figure 5. Three subqueries, marked with a cross, cannot be answered by a single coefficient at their own low-pass branch. The available branches are shown with a solid line, whereas the missing branches are shown with a dotted line. Reconstruction is necessary either from the low-pass branch lb_a, or from the high-pass branches and the low-pass branch lb_b. The main concern is to find a reconstruction scheme that minimizes the total cost. Looking at Figure 5, one would assume that for each of the three subqueries there are two choices, either going up or going down in the tree. This is not the case, however. For example, if sq_2 is reconstructed by going down rather than up, then it is easy to see that going down should be the choice for sq_3 as well. Besides the obvious reason that the distance to lb_a is longer for sq_3 (the longer the reconstruction path, the more coefficients it involves), there can be some overlapping of coefficients which further reduces the cost of going down for subqueries sq_2 and sq_3.

Figure 5. Reconstruction of Subqueries

This observation generalizes as follows: for any two subqueries sq_1, sq_2 that fall between two subsequent low-pass branches, there can be no crossing between the chosen reconstruction paths. We use this result to calculate the retrieval cost of a query, given a Branch-Set.
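The cost of the two reconstruction paths can be written down directly, as in the sketch below: a single average of a missing level k can be rebuilt either from the finer available low-pass branch at level j < k (a scaled sum of 2^(k-j) averages), or from the coarser available low-pass branch at level i > k together with one detail per intermediate level (1 + (i - k) coefficients). This is our sketch of the per-subquery choice only; it deliberately ignores the sharing of coefficients between neighbouring subqueries that the Query Split algorithm below exploits.

```python
def reconstruction_cost(k, lower, upper):
    """Coefficients needed to rebuild one average coefficient of the missing
    low-pass branch lb_k, assuming lb_lower (lower < k) is available, and
    lb_upper (upper > k) is available together with hb_{k+1}..hb_upper."""
    from_finer = 2 ** (k - lower)        # sum of the averages underneath it
    from_coarser = 1 + (upper - k)       # one average plus one detail per level
    return min(from_finer, from_coarser)

# e.g. missing level 4 between available low-pass branches at levels 2 and 7
print(reconstruction_cost(4, lower=2, upper=7))   # min(4, 4) = 4
```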

We propose Algorithm 3, named Query Split, which finds the best place to put a barrier between the two available low-pass branches, such that all subqueries falling below the barrier are reconstructed from the branches below, and all subqueries falling above the barrier are reconstructed from the branches above. This algorithm joins all subqueries that are reconstructed by going down in the tree, to take into account the overlapping coefficients. The time complexity of the Query Split algorithm is O(log² N), as shown in Appendix C.

Algorithm 3: Query Split
  Input: q               // query
  Input: bs              // Branch-Set
  Output: SQ = {sq_i}    // set of subqueries
  SQ ← split(q)          // create the original set
  foreach two consecutive low-pass branches lb_a and lb_b do
    foreach missing low-pass branch lb_k between lb_a and lb_b do
      barrier ← k
      lsubqs_k                        // (left) subqueries between lb_a and lb_k
      rsubqs_k                        // (right) subqueries between lb_k and lb_b
      rsubq_k ← ∪ rsubqs_k            // join the right subqueries to account for overlapping coefficients
      lcost ← cost(lsubqs_k | lb_a)   // cost of reconstructing the left subqueries from lb_a
      rcost ← cost(rsubq_k | lb_b)    // cost of reconstructing the right subquery from lb_b
      totalcost_k ← lcost + rcost     // add the costs
    select k that minimizes totalcost_k
    add the subqueries lsubqs_k and the query rsubq_k to SQ
  return SQ

5.2.3  Optimal Branch-Set Algorithm

To select the optimal Branch-Set among those that have size less than S = (1 + α)N < 2N − 2 for a workload of m queries, the following steps are required.
1. Calculate β = ⌈1 + log(1/α)⌉.
2. Create the candidate Branch-Sets.
3. For each Branch-Set BS in the candidates, calculate cost(q | BS) using the Query Split algorithm.
4. Select the Branch-Set with the minimum total cost.

These steps are summarized in the Optimal Branch-Set algorithm shown below. In the worst case there can be 2N² possible Branch-Sets, so the time complexity of this algorithm is 2N² · m · log² N = O(m (N log N)²), since the dominating cost in the inner loop is the Query Split algorithm.

Algorithm 4: Optimal Branch-Set
  Input: Work            // query workload
  Input: S = (1 + α)N    // storage space
  Output: BS_b           // optimal Branch-Set
  Var: β                 // β value
  Var: C                 // total-cost array
  Var: candidates        // array of candidate Branch-Sets
  β ← ⌈1 + log(1/α)⌉
  candidates ← createCandidates(β)
  foreach BS ∈ candidates do
    foreach q ∈ Work do
      SQ ← QuerySplit(q, BS)                    // split the query given the Branch-Set
      cost(q | BS) ← Σ_{sq ∈ SQ} cost(sq | BS)  // sum the costs of the subqueries
      C[BS] ← C[BS] + cost(q | BS)
  return BS_b ← argmin_{BS} C

6  Experimental Results

We have conducted 7 series of experiments in order to determine the behavior of our proposed techniques. In Section 6.1 we restrict ourselves to the case where no extra storage is available; we create synthetic workloads and vary the range size distribution, to emphasize the superiority of our approach over the full wavelet transformation. In Section 6.2 we investigate the performance of our techniques as the available storage space increases. In Section 6.3 the available storage is fixed but the range size distribution is changed. Later, in Section 6.4, we use larger vectors, only to point out that the main observations still hold. In Section 6.5 we study the effect of noise in the workload. Finally, in the last two sections we measure the performance on a real-life workload.

6.1  Retrieval Cost with no Extra Storage

First, we measure the improvement in retrieval cost obtained by selecting the optimal level of decomposition instead of the complete wavelet transform when there is no additional storage space.
Figure 6 shows the improvement of the best level compared to the wavelet level for workloads of varying range size. The horizontal axis shows the ratio of the average query range size to the database size. As shown in the figure, our method exhibits a constant improvement over the traditional wavelet transform, which can be as high as 30%. The wavelet level is only good for very large ranges that request over 80% of the data in each dimension. Therefore, for a multidimensional dataset we advise the use of our techniques to select the best level of decomposition for each dimension independently. Recall that in the n-dimensional case, the cost of our techniques is only n times the cost for one dimension, while at the same time the improvement for n dimensions is the product of the improvements for a single dimension.

Figure 6. Retrieval Cost (no Extra Space)

6.2  Effect of the Available Storage Space

We investigate how the proposed methods, Level-Sets and Branch-Sets, perform under different storage space restrictions. We used a single-dimensional data vector of size 256 and a synthetic workload of 100 random queries. Based on this workload we selected the best level, the best Level-Set and the best Branch-Set, and counted the retrieval cost (number of coefficients needed) for this workload.

Storage Increase   Gain over Untransformed   Gain over Wavelet   Gain over Best Level
 0%                90.46%                     5.83%               0.00%
20%                92.45%                    25.54%              20.92%
40%                93.30%                    33.87%              29.78%
60%                94.02%                    40.96%              37.30%
80%                94.02%                    40.96%              37.30%

Table 1. Gain of Branch-Set over other methods

The available storage space was varied from 100% to 180% of the size of the data vector, and the retrieval cost for each method is shown in Figure 7a. One straightforward observation is that the full wavelet decomposition (the last level of decomposition), marked as Wavelet in Figure 7a, is, as expected, not the best level. At an available storage space of 100% we observe that the best Level-Set and Branch-Set behave exactly like the best level, since there is no additional space to take advantage of. As we increase the space, the best Branch-Set approach clearly outperforms the best Level-Set, with more than 30% gain in the retrieval cost at 60% extra storage. Figure 7a shows that there is no significant gain for any method after a certain percentage (160%) of available space. The gain in retrieval cost for the best Branch-Set compared to the wavelet level, the untransformed level and the best level is shown in Table 1.

Figure 7. Space and Workload Distribution: (a) effect of available storage space; (b) effect of range distribution

6.3  Effect of the Query Range Size Distribution

In this section we would like to see how the proposed methods cope with different types of query workload. We have created a number of synthetic workloads and have compared the performance of each method. Each workload consists of 100 queries whose range sizes follow a Gaussian distribution with different values for the mean and variance. The experiments were applied to a one-dimensional vector of 128 data values, and the retrieval cost for each method is shown in Figure 7b. For both the Level-Set and Branch-Set methods, the available storage was fixed at 120% of the data vector size. The first workload has the range size uniformly distributed, whereas the rest have a Gaussian range size distribution with mean and standard deviation shown in the parentheses, respectively. The comparison among methods must be restricted to one workload at a time, as comparing between different workloads has no meaning. Once more, one can observe that the wavelet level is almost never the best level to store the data; it is optimal only when very large range queries are submitted, on the order of magnitude of the data vector size (128), as is the case for the (100, 8) distribution. The main observation, however, is that the Branch-Set always has a constant gain in retrieval cost over the best level. The Level-Set shows not much gain and performs like the best level. This happens because the range size is somewhat fixed; any level included besides the best level does not help. Note that this is not the case for the randomly distributed workload.

6.4  Scaling to Larger Data Vectors

In this section we investigate how each method performs when the size of the data vector increases. Figure 8a shows the retrieval cost for a workload of 100 queries, where the range size is fixed to 40% of the data vector size and the extra storage is fixed at 20% more than the data vector size. The data vector size is increased starting from 64. Figure 8a shows that the relative performance of each method remains the same and is indifferent to the size of the data vector, for a workload that has a particular range size distribution.

Figure 8. Scaleup and Noise Experiments: (a) domain size scale-up; (b) noisy workload

6.5  Effect of Noise in the Workload

In this section we examine how the proposed methods deal with the case where the workload is not a good estimation for predicting future queries. Assume that the workload gathered is work and the future queries submitted to the system form a workload work′. Up to now we were implicitly assuming that work ≈ work′, that is, the workload gathered provides a good estimation of future queries. In this section we no longer make this assumption, and we model the inaccuracy of workload work as additive Gaussian noise on the start and end points of the range of each query. The variance of the Gaussian noise is proportional to a percentage of the range of the query. Figure 8b shows the performance drop that occurs for various values of this percentage. This performance drop is defined as the increase in retrieval cost compared to the retrieval cost when the workload is completely known (no noise), which is shown in black in Figure 8b. As expected, the performance drop increases as the percentage of noise increases; however, both the predicted best Level-Set and Branch-Set still outperform the actual best level.

In conclusion, we observe that the proposed techniques perform well even in the presence of moderate noise in the gathered workload. When the noise is high, this means that the workload is insufficient to capture future queries, so it would be better to recalculate the best Level-Set or Branch-Set using a newer workload.

6.6  Performance with a Real Query Workload

In this section we measure the performance for a real dataset, using a history of submitted queries. Our dataset TEMPERATURE is a 4-dimensional real-world dataset which measures the temperature at points all over the globe at different altitudes, sampled twice every day for 5 months. The 4 dimensions are latitude, longitude, altitude and time, and the measure attribute is temperature. The corresponding sizes of these dimensions are 64, 128, 16 and 256, respectively, which leads to a data cube of more than 33 million cells. A history of 100 queries was used in this experiment. Half of them were randomly selected and used as the training workload, in order to select the best level, Level-Set and Branch-Set. The other half of the history was then used to test the performance of each method. The available storage space was fixed to 150% of the data cube by allowing extra space of around 11% per dimension (1.11^4 ≈ 1.5). Figure 9 shows the performance of each selected method (Calculated) compared to the method that would be selected if the second half of the queries were known (Best). The performance is measured by the retrieval cost, which is on the order of billions of coefficients. The results show that the first half of the history was a very good approximation of the other half, thus the performance of each method is almost identical to the performance of the ideal one. Aside from that, Figure 9 shows a clear improvement of more than 30% over the best level, let alone the full wavelet transform.

Figure 9. Multidimensional Dataset

6.7  Online Overhead

The response time for a query is dominated by the time associated with the retrieval of coefficients. Hence, in this paper we have focused on trying to minimize the retrieval cost. Our proposed techniques have an online overhead for each submitted query. However, this overhead is minimal and can easily be neglected. To prove this claim we have measured the overhead time for 1000 queries submitted on a data vector of size 1024, with available storage at 120%. The total overhead for a Level-Set is 61 ms and for a Branch-Set is 642 ms. Although the overhead for a Branch-Set is 10 times as much as that of a Level-Set, it is still negligible.

7  Conclusion

We have seen that the complete wavelet transformation can behave suboptimally under certain query workloads. To address this, we proposed algorithms that select the optimal form in which to store the data, in order to minimize the retrieval cost, by taking advantage of a given workload. In the case where additional storage is available, it can be used to further reduce the retrieval cost by storing over-complete representations of the dataset. Our Branch-Set approach leads to great improvement with minimal online overhead, as shown in the experimental section.

References

[1] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In VLDB 2000, Proceedings of the 26th International Conference on Very Large Data Bases.
[2] S. Chaudhuri. Self-tuning database systems. In Proc. IDEAS.
[3] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proc. of the 12th International Conference on Data Engineering.
[4] D. Lemire. Wavelet-based relative prefix sum methods for range sum queries in data cubes. In Proceedings of CASCON. IBM.
[5] R. Schmidt and C. Shahabi. ProPolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries. In Conference on Extending Database Technology (EDBT 02), Lecture Notes in Computer Science. Springer.
[6] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD 1999, Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM Press.
[7] Y.-L. Wu, D. Agrawal, and A. El Abbadi. Using wavelet decomposition to support progressive and approximate range-sum queries over data cubes. In CIKM 2000, Proceedings of the 9th International Conference on Information and Knowledge Management. ACM.

APPENDIX

A  Term Explanation

Table 2. Table of Notation
  FT:          full tree of the wavelet decomposition
  lb:          low-pass branch (averages)
  hb:          high-pass branch (details)
  WD_k:        k-th level of decomposition
  wd_k:        k-th pseudo-level of decomposition
  q, d:        untransformed query and data vector
  N:           domain size of the data vector
  S:           available storage space, expressed as S = (1 + α)N
  q^k:         k-th level of decomposition of q
  Q(R, d):     range-sum query of size r = |R|
  cost(q | X): retrieval cost for query q given X
  work:        workload of m queries

Proof of Theorems

Theorem 1: For a range-sum query vector of size N defined over a range of length r, the number of non-zero coefficients in its wavelet decomposition at level k is not more than r/2^k + 2k + 1.

Proof: At each iteration of the wavelet decomposition there are always at most 2 detail coefficients, one for each edge of the range. There are no details for the rest of the range, since it is composed of a series of 1s. The average coefficients, on the other hand, are halved at each iteration, until they become just two. Therefore, at any level k, as long as the halving continues, we have a total of at most r/2^k + 2k coefficients, r/2^k averages and 2k details. When the halving of the averages stops, we may end up with a worst case of two average coefficients for a number of levels. In general, at level k the number of averages can be no more than r/2^k + 1 and the details no more than 2k. Adding these we obtain the bound described in the theorem.

Theorem 4: The level that minimizes the cost for a query q of range size r is either log r − 1, the lowest level where there are at least 4 averages; log r, the lowest level where there are at least 2 averages; or level p, the lowest level where there is exactly 1 average. The exact criteria are given in the proof.

Proof: Let i_s and i_f be the start and finish indices of the range of the query. Then, let n_1 be the highest integer such that i_s mod 2^{n_1} = 0; similarly, n_2 is the highest integer such that (i_f + 1) mod 2^{n_2} = 0. Without loss of generality assume that n_1 ≤ n_2. Let us think of a level of decomposition as containing averaging buckets of equal size, where for each level down the tree of decomposition the size of the buckets is doubled. Figure 10 portrays such a view of a query, where n_1 is the highest level at which the left edge of the query is perfectly aligned with a bucket; the same applies for the right edge with level n_2. Observe that the edges are aligned for levels before n_1 as well. Essentially, this means that for all levels before n_1 there are no detail coefficients, and for levels between n_1 and n_2 there is exactly 1 detail, the one for the left edge. For levels beyond n_2 we would expect to always have 2 details, one for each edge; however, this is not the case, as we will see. Let p be the lowest level that contains one bucket that completely covers the required range; in the worst case p can be the highest level of decomposition (level p is the lowest level such that i_s div 2^p = i_f div 2^p). It should be clear that for all levels beyond p we have exactly one detail and, of course, exactly one average coefficient. However, at level p we may have either 1 or even 0 details. No details is only possible when the two averaging buckets of level p − 1 contain exactly the same number of elements of the range. In other words, this anomaly happens only in the case when the range is symmetric with respect to the bucket of level p.
To summarize, the number of non-zero details at level k is 0 if k < n_1, 1 when n_1 ≤ k < n_2, 2 when n_2 ≤ k < p, either 0 or 1 when k = p, and 1 when k > p.

Figure 10. Averaging Buckets

Returning to the question of selecting the level that minimizes the non-zero coefficients, we can now argue that if we are at level k − 1, going to level k means that, depending on the position of k relative to n_1, n_2 and p, we increase the number of non-zero details by at most 2 coefficients. In addition, the non-zero averages are at least halved, which means that the gain of going from level k − 1 to k is no more than half the average coefficients at level k − 1. By combining the observations for the detail and the average coefficients, we have that going to the next level is desired if the gain introduced by halving the averages is more than the loss of adding 1 or 2 coefficients. Actually, the only case we have to examine is when the halving results in a gain of only 1 coefficient and at the same time we are at a level between n_2 and p. This can happen at the lowest level when the averages are exactly 4 and misaligned with the bucket edges, or when there are fewer than 4 averages. Let x = log r − 1; then at that level we have at least 4 average coefficients. If the number of averages is more than 4, r/2^x > 4, then it makes sense to stop at the next level x + 1 = log r, where there is a gain of at least 2 coefficients for any relative position of x. If the number of averages is exactly 4, r/2^x = 4, and additionally the number of details at the next level is 2, that is, n_2 ≤ x + 1 < p, it is better to stop at level
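The case analysis in this proof can be checked numerically; the sketch below (reusing haar_step from the first sketch) prints, for a hypothetical range of our choosing, the alignment levels n_1 and n_2 of the two edges, the covering level p, and the number of new details produced at each iteration.

```python
import math
import numpy as np

def detail_pattern(N, i_s, i_f):
    """Empirical number of new details per level for the range [i_s, i_f]."""
    top = int(math.log2(N))
    n1 = max(k for k in range(top + 1) if i_s % 2 ** k == 0)
    n2 = max(k for k in range(top + 1) if (i_f + 1) % 2 ** k == 0)
    p = next(k for k in range(top + 1) if i_s // 2 ** k == i_f // 2 ** k)
    print(f"n1={n1}, n2={n2}, p={p}")
    low = np.zeros(N); low[i_s:i_f + 1] = 1.0
    for k in range(1, top + 1):
        low, high = haar_step(low)
        print(f"level {k}: {int(np.count_nonzero(np.round(high, 12)))} new details")

detail_pattern(N=64, i_s=8, i_f=43)
```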


More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir School of Computer Science Carleton University, Ottawa, Canada http://www.scs.carleton.ca/ jamuir 23 October

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

Searching Dimension Incomplete Databases

Searching Dimension Incomplete Databases IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO., JANUARY 3 Searching Dimension Incomplete Databases Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, and Wei Wang Abstract

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Wavelets and Multiresolution Processing (Wavelet Transforms) Christophoros Nikou cnikou@cs.uoi.gr University of Ioannina - Department of Computer Science 2 Contents Image pyramids

More information

Ch. 10 Vector Quantization. Advantages & Design

Ch. 10 Vector Quantization. Advantages & Design Ch. 10 Vector Quantization Advantages & Design 1 Advantages of VQ There are (at least) 3 main characteristics of VQ that help it outperform SQ: 1. Exploit Correlation within vectors 2. Exploit Shape Flexibility

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

Improved Algorithms for Module Extraction and Atomic Decomposition

Improved Algorithms for Module Extraction and Atomic Decomposition Improved Algorithms for Module Extraction and Atomic Decomposition Dmitry Tsarkov tsarkov@cs.man.ac.uk School of Computer Science The University of Manchester Manchester, UK Abstract. In recent years modules

More information

Defining the Discrete Wavelet Transform (DWT)

Defining the Discrete Wavelet Transform (DWT) Defining the Discrete Wavelet Transform (DWT) can formulate DWT via elegant pyramid algorithm defines W for non-haar wavelets (consistent with Haar) computes W = WX using O(N) multiplications brute force

More information

MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING

MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING MATCHING-PURSUIT DICTIONARY PRUNING FOR MPEG-4 VIDEO OBJECT CODING Yannick Morvan, Dirk Farin University of Technology Eindhoven 5600 MB Eindhoven, The Netherlands email: {y.morvan;d.s.farin}@tue.nl Peter

More information

COS597D: Information Theory in Computer Science October 19, Lecture 10

COS597D: Information Theory in Computer Science October 19, Lecture 10 COS597D: Information Theory in Computer Science October 9, 20 Lecture 0 Lecturer: Mark Braverman Scribe: Andrej Risteski Kolmogorov Complexity In the previous lectures, we became acquainted with the concept

More information

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London

Distributed Data Fusion with Kalman Filters. Simon Julier Computer Science Department University College London Distributed Data Fusion with Kalman Filters Simon Julier Computer Science Department University College London S.Julier@cs.ucl.ac.uk Structure of Talk Motivation Kalman Filters Double Counting Optimal

More information

Lecture Notes on Inductive Definitions

Lecture Notes on Inductive Definitions Lecture Notes on Inductive Definitions 15-312: Foundations of Programming Languages Frank Pfenning Lecture 2 September 2, 2004 These supplementary notes review the notion of an inductive definition and

More information

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1) Correlated subqueries Query Optimization CPS Advanced Database Systems SELECT CID FROM Course Executing correlated subquery is expensive The subquery is evaluated once for every CPS course Decorrelate!

More information

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks Plausible Clocks with Bounded Inaccuracy Causality & Concurrency a b exists a path from a to b Brad Moore, Paul Sivilotti Computer Science & Engineering The Ohio State University paolo@cse.ohio-state.edu

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

An Introduction to Wavelets and some Applications

An Introduction to Wavelets and some Applications An Introduction to Wavelets and some Applications Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France An Introduction to Wavelets and some Applications p.1/54

More information

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes

A Piggybacking Design Framework for Read-and Download-efficient Distributed Storage Codes A Piggybacing Design Framewor for Read-and Download-efficient Distributed Storage Codes K V Rashmi, Nihar B Shah, Kannan Ramchandran, Fellow, IEEE Department of Electrical Engineering and Computer Sciences

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013

Machine Learning for Signal Processing Sparse and Overcomplete Representations. Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 Machine Learning for Signal Processing Sparse and Overcomplete Representations Bhiksha Raj (slides from Sourish Chaudhuri) Oct 22, 2013 1 Key Topics in this Lecture Basics Component-based representations

More information

Randomness-in-Structured Ensembles for Compressed Sensing of Images

Randomness-in-Structured Ensembles for Compressed Sensing of Images Randomness-in-Structured Ensembles for Compressed Sensing of Images Abdolreza Abdolhosseini Moghadam Dep. of Electrical and Computer Engineering Michigan State University Email: abdolhos@msu.edu Hayder

More information

State of the art Image Compression Techniques

State of the art Image Compression Techniques Chapter 4 State of the art Image Compression Techniques In this thesis we focus mainly on the adaption of state of the art wavelet based image compression techniques to programmable hardware. Thus, an

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Wavelets and Multiresolution Processing () Christophoros Nikou cnikou@cs.uoi.gr University of Ioannina - Department of Computer Science 2 Contents Image pyramids Subband coding

More information

Detailed Derivation of Theory of Hierarchical Data-driven Descent

Detailed Derivation of Theory of Hierarchical Data-driven Descent Detailed Derivation of Theory of Hierarchical Data-driven Descent Yuandong Tian and Srinivasa G. Narasimhan Carnegie Mellon University 5000 Forbes Ave, Pittsburgh, PA 15213 {yuandong, srinivas}@cs.cmu.edu

More information

Topic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number

Topic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number Topic Contents Factoring Methods Unit 3 The smallest divisor of an integer The GCD of two numbers Generating prime numbers Computing prime factors of an integer Generating pseudo random numbers Raising

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

ECEN 689 Special Topics in Data Science for Communications Networks

ECEN 689 Special Topics in Data Science for Communications Networks ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 5 Optimizing Fixed Size Samples Sampling as

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space

Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) 1.1 The Formal Denition of a Vector Space Linear Algebra (part 1) : Vector Spaces (by Evan Dummit, 2017, v. 1.07) Contents 1 Vector Spaces 1 1.1 The Formal Denition of a Vector Space.................................. 1 1.2 Subspaces...................................................

More information

CS383, Algorithms Spring 2009 HW1 Solutions

CS383, Algorithms Spring 2009 HW1 Solutions Prof. Sergio A. Alvarez http://www.cs.bc.edu/ alvarez/ 21 Campanella Way, room 569 alvarez@cs.bc.edu Computer Science Department voice: (617) 552-4333 Boston College fax: (617) 552-6790 Chestnut Hill,

More information

Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters

Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters Synthesis of Saturating Counters Using Traditional and Non-traditional Basic Counters Zhaojun Wo and Israel Koren Department of Electrical and Computer Engineering University of Massachusetts, Amherst,

More information

A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy

A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy A Deterministic Fully Polynomial Time Approximation Scheme For Counting Integer Knapsack Solutions Made Easy Nir Halman Hebrew University of Jerusalem halman@huji.ac.il July 3, 2016 Abstract Given n elements

More information

Multidimensional Divide and Conquer 1 Skylines

Multidimensional Divide and Conquer 1 Skylines Yufei Tao ITEE University of Queensland The next few lectures will be dedicated to an important technique: divide and conquer. You may have encountered the technique in an earlier algorithm course, but

More information

AS computer hardware technology advances, both

AS computer hardware technology advances, both 1 Best-Harmonically-Fit Periodic Task Assignment Algorithm on Multiple Periodic Resources Chunhui Guo, Student Member, IEEE, Xiayu Hua, Student Member, IEEE, Hao Wu, Student Member, IEEE, Douglas Lautner,

More information

Notes on the Dual Ramsey Theorem

Notes on the Dual Ramsey Theorem Notes on the Dual Ramsey Theorem Reed Solomon July 29, 2010 1 Partitions and infinite variable words The goal of these notes is to give a proof of the Dual Ramsey Theorem. This theorem was first proved

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Configuring Spatial Grids for Efficient Main Memory Joins

Configuring Spatial Grids for Efficient Main Memory Joins Configuring Spatial Grids for Efficient Main Memory Joins Farhan Tauheed, Thomas Heinis, and Anastasia Ailamaki École Polytechnique Fédérale de Lausanne (EPFL), Imperial College London Abstract. The performance

More information

Linear Programming: Simplex

Linear Programming: Simplex Linear Programming: Simplex Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Linear Programming: Simplex IMA, August 2016

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Scribes: Ellis Weng, Andrew Owens February 11, 2010 1 Introduction In this lecture, we will introduce our second paradigm for

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Algorithms for pattern involvement in permutations

Algorithms for pattern involvement in permutations Algorithms for pattern involvement in permutations M. H. Albert Department of Computer Science R. E. L. Aldred Department of Mathematics and Statistics M. D. Atkinson Department of Computer Science D.

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan

Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step

More information

Digital Image Processing

Digital Image Processing Digital Image Processing, 2nd ed. Digital Image Processing Chapter 7 Wavelets and Multiresolution Processing Dr. Kai Shuang Department of Electronic Engineering China University of Petroleum shuangkai@cup.edu.cn

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

COS 598D - Lattices. scribe: Srdjan Krstic

COS 598D - Lattices. scribe: Srdjan Krstic COS 598D - Lattices scribe: Srdjan Krstic Introduction In the first part we will give a brief introduction to lattices and their relevance in some topics in computer science. Then we show some specific

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Wavelets and Multiresolution Processing

Wavelets and Multiresolution Processing Wavelets and Multiresolution Processing Wavelets Fourier transform has it basis functions in sinusoids Wavelets based on small waves of varying frequency and limited duration In addition to frequency,

More information

Nearest Neighbor Search with Keywords: Compression

Nearest Neighbor Search with Keywords: Compression Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 14 October 16, 2013 CPSC 467, Lecture 14 1/45 Message Digest / Cryptographic Hash Functions Hash Function Constructions Extending

More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. X, MONTH 2007 1 A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir Abstract We present a simple algorithm

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 54, NO 2, FEBRUARY 2006 423 Underdetermined Blind Source Separation Based on Sparse Representation Yuanqing Li, Shun-Ichi Amari, Fellow, IEEE, Andrzej Cichocki,

More information

Decision Tree Learning

Decision Tree Learning Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value A -Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value Shuhui Li, Miao Song, Peng-Jun Wan, Shangping Ren Department of Engineering Mechanics,

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Asymptotic redundancy and prolixity

Asymptotic redundancy and prolixity Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Spring 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Spring 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate

More information

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE

5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER /$ IEEE 5742 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 12, DECEMBER 2009 Uncertainty Relations for Shift-Invariant Analog Signals Yonina C. Eldar, Senior Member, IEEE Abstract The past several years

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

MATH Linear Algebra

MATH Linear Algebra MATH 304 - Linear Algebra In the previous note we learned an important algorithm to produce orthogonal sequences of vectors called the Gramm-Schmidt orthogonalization process. Gramm-Schmidt orthogonalization

More information

Caesar s Taxi Prediction Services

Caesar s Taxi Prediction Services 1 Caesar s Taxi Prediction Services Predicting NYC Taxi Fares, Trip Distance, and Activity Paul Jolly, Boxiao Pan, Varun Nambiar Abstract In this paper, we propose three models each predicting either taxi

More information

MAA507, Power method, QR-method and sparse matrix representation.

MAA507, Power method, QR-method and sparse matrix representation. ,, and representation. February 11, 2014 Lecture 7: Overview, Today we will look at:.. If time: A look at representation and fill in. Why do we need numerical s? I think everyone have seen how time consuming

More information

Compute the Fourier transform on the first register to get x {0,1} n x 0.

Compute the Fourier transform on the first register to get x {0,1} n x 0. CS 94 Recursive Fourier Sampling, Simon s Algorithm /5/009 Spring 009 Lecture 3 1 Review Recall that we can write any classical circuit x f(x) as a reversible circuit R f. We can view R f as a unitary

More information

Feasibility Conditions for Interference Alignment

Feasibility Conditions for Interference Alignment Feasibility Conditions for Interference Alignment Cenk M. Yetis Istanbul Technical University Informatics Inst. Maslak, Istanbul, TURKEY Email: cenkmyetis@yahoo.com Tiangao Gou, Syed A. Jafar University

More information

arxiv: v1 [cs.dm] 22 Mar 2014

arxiv: v1 [cs.dm] 22 Mar 2014 Online Square-into-Square Packing Sándor P. Fekete Hella-Franziska Hoffmann arxiv:03.5665v [cs.dm] Mar 0 Abstract In 967, Moon and Moser proved a tight bound on the critical density of squares in squares:

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2019 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2019 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

On the Complexity of Partitioning Graphs for Arc-Flags

On the Complexity of Partitioning Graphs for Arc-Flags Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 17, no. 3, pp. 65 99 (013) DOI: 10.7155/jgaa.0094 On the Complexity of Partitioning Graphs for Arc-Flags Reinhard Bauer Moritz Baum Ignaz

More information

Designing Information Devices and Systems I Spring 2018 Homework 11

Designing Information Devices and Systems I Spring 2018 Homework 11 EECS 6A Designing Information Devices and Systems I Spring 28 Homework This homework is due April 8, 28, at 23:59. Self-grades are due April 2, 28, at 23:59. Submission Format Your homework submission

More information

LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT

LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT ANALYSIS FOR HIGH SCHOOL TEACHERS LECTURE 3. RATIONAL NUMBERS: AN EXAMPLE OF MATHEMATICAL CONSTRUCT ROTHSCHILD CAESARIA COURSE, 2011/2 1. Rational numbers: how to define them? Rational numbers were discovered

More information

17.1 Correctness of First-Order Tableaux

17.1 Correctness of First-Order Tableaux Applied Logic Lecture 17: Correctness and Completeness of First-Order Tableaux CS 4860 Spring 2009 Tuesday, March 24, 2009 Now that we have introduced a proof calculus for first-order logic we have to

More information

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford Probabilistic Model Checking Michaelmas Term 2011 Dr. Dave Parker Department of Computer Science University of Oxford Probabilistic model checking System Probabilistic model e.g. Markov chain Result 0.5

More information

1 Shortest Vector Problem

1 Shortest Vector Problem Lattices in Cryptography University of Michigan, Fall 25 Lecture 2 SVP, Gram-Schmidt, LLL Instructor: Chris Peikert Scribe: Hank Carter Shortest Vector Problem Last time we defined the minimum distance

More information

Very Sparse Random Projections

Very Sparse Random Projections Very Sparse Random Projections Ping Li, Trevor Hastie and Kenneth Church [KDD 06] Presented by: Aditya Menon UCSD March 4, 2009 Presented by: Aditya Menon (UCSD) Very Sparse Random Projections March 4,

More information