STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS


STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By

LIXIA CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

© 2011 Lixia Chen

To my family

ACKNOWLEDGMENTS

First and foremost, I would like to express my deep and sincere gratitude to my PhD adviser, Dr. Alin Dobra. I thank him for his invaluable support, guidance, and inspiration in my research. I have learned a lot from him in all aspects of my research: computing techniques, formulating problems, and presenting research results. This thesis would not have been possible without his help. I thank Dr. Arunava Banerjee, Dr. Sanjay Ranka, Dr. Tamer Kahveci, and Dr. Ronald Randles for serving on my supervisory committee. Dr. Banerjee's solid theoretical background enlightened me in exploring new possibilities. Dr. Randles provided helpful suggestions on my work. Dr. Kahveci and Dr. Ranka encouraged me in hard times.

I would like to convey my utmost thanks to my family, who mean everything to me. My husband, Yuchu Tong, has been with me through all the happy and hard times. He always brings fun to my life and encourages me to face the challenges in my studies. I enjoy every moment with him. I am very grateful to my parents for their love and encouragement. They have always been there, ready to support me. They are not only good parents but also excellent teachers to me. They inspired my interest in science and opened my mind to it. Many thanks also go to my sisters and my brother. Every moment with them is wonderful. I thank my newborn daughter for making me happier than ever.

I also want to thank all my friends for their support in my studies and life. An incomplete list includes Xuelian Xiao, Wei Peng, Yixi Ouyang, Jiangyan Xu and Wangyuan Zhang. We had a memorable time, which made my graduate studies enriched and enjoyable. Finally, I thank the National Science Foundation grant NSF-CAREER-IIS for the financial support of this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   Related Work
   Confidence Intervals from Moments
   Contributions

2 HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES
   Background
   Problem Formulation
      Size of Join Problem
         Unidimensional Size of Join Estimation Problem
         Multidimensional Size of Join Estimation Problem
      Selectivity Estimation Problem
      Aggregates over Joins Problem
      Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates
   Histograms as Function Approximators and Statistical Nonparametric Models
   Random Shuffling Assumption
      Definition of Uniform Random Shuffling Assumption
      Moments under Random Shuffling Assumption
   Histograms under Random Shuffling Assumption
      One-bucket Histograms
      Histograms with Aligned Buckets
   Comparison with Sampling and Sketches
      Sampling
      Sketches
      Comments on Comparison
   Histograms when Random Shuffling Assumption Does Not Hold
      Random Histograms on Arbitrary Problems
      Random Histograms for Self-join Size Computation
      Comments on End-biased Histograms
   Generalization to Multidimensional Histograms
      General Random Shuffling Properties
      Multidimensional Random Shuffling Distribution
      One-bucket Multidimensional Histogram
      Low Dimensional Histogram
      Estimating Size of Star-Joins using One-bucket Histograms
   XSketch Estimator under Random Shuffling Assumption
      Introduction
      XSketches
      Shuffled XSketches
   Comments on Random Shuffling Assumption
   Summary

3 AGGREGATION OVER PROBABILISTIC DATABASES WITH CONFIDENCE INTERVAL
   Background
   Preliminaries
      Queries and Equivalent Algebraic Expressions
      Probabilistic Database as a Description of a Probability Space
      Use of Kronecker Symbol δ_ij
   Analysis for Multi-relation SUM Aggregates for Tuple-independent Model
      Dependence in Tuples after Aggregations
      M² Computation of Moments
      Reducing the Time Complexity
   General Framework of Computing Moments
      Notation
      General Framework for M log M Computation
      Computation of Moments
   InDB Techniques
      InDB M² Technique
      InDB M log M Technique
   Middleware M log M Technique
      Algorithms
   Extensions
      Non-linear Aggregates
      GROUP-BY Queries
   Experiments
      Experiments on Tuple-independent Model
      Experiments on Graphical Model
   The Central Limit Theorem and Probabilistic Aggregates
   Summary

4 PROBABILISTIC AGGREGATION WITH DUPLICATE ELIMINATION
   Background
   Preliminaries
      Queries and Equivalent Algebraic Expressions
      Query Evaluation on Probabilistic Databases
   Analysis of Multi-relation Sum Aggregation with Duplicate Removal
      Probability of Correlated Groups
      Correlations in PTIME Queries
      Moments of Aggregates with Duplicate Elimination
   Hash-based Algorithm
   Experiments
      Implementation
      Results of Experiments
   Summary

5 CONCLUSIONS AND FUTURE WORK
   Dissertation Summary
   Future Directions

APPENDIX

A GENERAL UNIDIMENSIONAL HISTOGRAMS
B PROOF OF PROPOSITION
C PROOF OF PROPOSITION
D PROOF OF THEOREM
E PROOF OF THEOREM
F PROOF OF LEMMA
G PROOF OF THEOREM

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Notations used in the chapter
Relation R1
Relation R2
The result of π_{customer, probability}(R1 ⋈ R2)
The result of π_{customer, product, probability}(R1 ⋈ R2)
Group u
Group v

LIST OF FIGURES

2-1 Frequency graph of relation F
2-2 Frequency graph of relation G
2-3 Histogram of the size of the join result
The Layers of Sequences
Total Time On TPC-H Queries, 1GB dataset
Aggregate Time On TPC-H Queries, 1GB dataset
Total Time On TPC-H Queries, 10GB dataset
Aggregate Time On TPC-H Queries, 10GB dataset
The graphical model used in the experiments
The experiment results in the graphical model
PDF of sum of independent discrete random variables
CDF of sum of independent discrete random variables
PDF of sum of partly independent discrete random variables
CDF of sum of partly independent discrete random variables
Nested correlations in u, v, ū, v̄
Hierarchical structure of correlations
λ = 1: Running time on Q
λ = 1: Running time on Q
λ = 2: Running time on Q
λ = 2: Running time on Q
λ = 4: Running time on Q
λ = 4: Running time on Q

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By Lixia Chen

December 2011

Chair: Alin Dobra
Major: Computer Engineering

Explosive data growth presents significant challenges for fast query processing. Computing exact answers to complex queries over large databases can take a long time because of the large volume of data that must be accessed. Approximate query processing provides an essential solution to this problem. Versatile approximation methods are available for use in databases today. In order to evaluate the effectiveness of these approximations, it is desirable to provide confidence intervals for the estimators; the confidence intervals effectively provide error guarantees for the estimated values. The focus of this dissertation is approximating database queries and providing the corresponding confidence intervals. We first revisit histograms, the approximation method used in traditional databases, and interpret them as a statistical method. We then estimate aggregates over probabilistic databases and provide efficient algorithms to compute them.

The traditional assumption for interpreting histograms, and for justifying approximate query processing methods based on them, is that all elements in a bucket have the same frequency; this is called the uniform distribution assumption. We show that a significantly less restrictive statistical assumption, namely that the elements within a bucket are randomly arranged even though they might have different frequencies, leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, we analyze the behavior of both unidimensional and multidimensional histograms and provide tight error guarantees for the quality of the approximations.

As an example of how the statistical theory of histograms can be extended, we show how XSketches, an approximation technique for XML queries that uses histograms as building blocks, can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators yields a complete statistical model and error analysis for XSketches.

Characterizing SUM-like aggregates over probabilistic databases is considered a hard problem because the size of the probability space is exponential in the database size. In this dissertation, we aim to compute aggregates over probabilistic databases together with confidence intervals. Both methods for computing confidence intervals, distribution dependent and distribution independent, require the expected value and variance of the probabilistic aggregates. We develop a general framework to compute moments of aggregates in probabilistic databases. Based on this framework, we derive mathematical formulas for the first two moments of aggregates for graphical models and tuple-independent models. We then present InDB and Middleware algorithms to evaluate the aggregates efficiently. Our prototype implementation using Postgres suggests that our characterization of aggregates incurs little overhead for the tuple-independent model and manageable overhead for the graphical model. We also extend the computation of the moments to AVERAGE-like aggregates and GROUP-BYs to show the usefulness of the proposed methods.

To extend the above analysis of aggregates over probabilistic databases, we study aggregates with duplicate removal. We analyze the properties of PTIME queries and conclude that the correlations introduced by projections are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal and design a hash-based algorithm to implement them. Our comprehensive experiments show that this algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.


CHAPTER 1
INTRODUCTION

With the advancements in science and technology, more and more information needs to be stored in databases. According to the JASON report [50], the data volume is projected to grow to hundreds of petabytes. When the data growth rate exceeds the hardware growth rate predicted by Moore's law, processing large data may take even longer than it does today. Approximate query processing provides an essential method for fast query response times over large data sets. In addition, exact answers are not necessary in some cases, and approximate query processing is well suited to them. Many applications, such as data mining, online analytical processing, and decision support systems, belong to this category due to their exploratory nature and the associated imprecision in the query or in the use of the query result.

Probabilistic databases have attracted much attention in recent years because of the increasing demand for processing uncertain data. Since data in a probabilistic database are uncertain by nature, the size of the probability space is exponential in the size of the database. Computing the exact answers of complex queries over such databases is inefficient and challenging. Query approximation becomes an intuitive alternative.

One natural question to ask about approximation methods is how effective the estimated values are and how much they vary. For example, it is very hard for a person to decide whether to buy a stock given a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimated values are the same, but the latter has the potential to be very risky whereas the first is safe. In this dissertation, we explore this question: how effective are the approximation methods? The prerequisite for answering this question is a statistical characterization of the approximation methods. Our research revolves around these two questions and studies approximate query processing, both for traditional databases and for probabilistic databases, from a statistical point of view.

1.1 Related Work

Versatile approximation methods have been proposed in the literature. Sampling is one of them and is widely studied and used in databases. Sampling methods use a small part of the data to estimate the distribution of a large dataset. There is a large body of research on sampling; we briefly introduce just some of it. Sample synopses can be computed online: Haas [36, 38] computed sample synopses of aggregations at runtime and gradually refined them during processing; later, the DBO system [54] extended this method and removed the limitation that the approximation fit in main memory. Sample synopses can also be precomputed: Acharya [5] precomputed synopses for foreign key joins to compensate for the loss of uniformity, and the AQUA project [4, 5, 27-29] at Bell Labs used samples to precompute synopses for approximate query answering. Of all the approximation methods, sampling techniques are the most flexible, the most studied, and have shown success in the context of databases. However, the uniform sampling requirement limits their applicability. In addition, if sampling methods operate on queries with low selectivity or on highly skewed data, large errors result. Although Ganti [26] and Chaudhuri [13] aimed to tackle these problems, both solutions are heuristics.

Histograms have better performance than sampling methods on queries with low selectivity. They are simple and easy to construct and operate. Because of these advantages, histograms have proven to be successful approximation methods in commercial databases, and extensive research has been done on them. A nice survey of the histogram work can be found in Ioannidis's paper [45]. Types of histograms proposed in the literature include Equi-width histograms [82], Equi-depth histograms [70], and several others. To capture all the classes of histograms, Poosala [81] provided a taxonomy that can represent different types of histograms such as Equi-sum histograms [70, 82], V-optimal histograms [81], and spline histograms [64].

Since approximation errors are inevitable for histograms, many techniques have been proposed to minimize different error measures. Some of them focus on constructing buckets offline.

Poosala and his collaborators [48, 81] proposed V-optimal histograms to minimize the sum of squared errors. Guha [33, 34] constructed optimal histograms for range queries that minimize relative errors; subsequent work [32] provided linear-time approximation algorithms to construct histograms. Recently, Cormode [15, 17] extended histogram research to probabilistic databases and proposed a general framework for different error metrics. Other techniques take advantage of feedback from query execution to optimize histograms. Aboulnaga [3] first proposed Self-tuning histograms. Later, Lim [69] used Two-phase methods to construct Self-tuning histograms. Bruno [11] and Srivastava [92] built multidimensional histograms by exploiting workload information and the entropy maximization principle. Kaushik [59] addressed distinct-value problems when constructing histograms from query feedback.

Although there is abundant literature on histograms, the amount of theoretical work characterizing histograms as approximation methods for database queries is surprisingly small. Piatetsky-Shapiro and Connell [76] provided the first theoretical characterization of histograms, deriving worst-case and average-case error guarantees for Equi-width and Equi-depth histograms used for selectivity estimation. The other theoretical characterizations of histograms can be found in the work of Ioannidis and his collaborators [41-46, 79]; their work is the only one applicable to the estimation of aggregates over joins. Most of this work is concerned with the optimality of histograms [41, 43, 44], for which, interestingly enough, the issue of computing the error of histograms can be cleverly avoided (the technical means to do this is to rely on majorization theory instead of a direct optimization). Most of [42], and small parts of the other papers we mentioned, are concerned with actually characterizing the error of histograms, but most of the results apply only to One-bucket histograms; the only exception is worst-case error estimation, which results in unrealistically large bounds.

The uniform frequency assumption was never formalized when it was used to derive average error bounds for histograms [76]; only implicit explanations of what the assumption says were made throughout the histogram literature.

Even if the uniform frequency assumption were formalized as a completely decorrelated placement of tuples in a bucket, it would ignore the skew within the bucket and thus provide looser bounds on behavior.

Recently, uncertain data have emerged in many areas, such as information extraction [25, 65] and sensor networks [84, 87]. Research has combined histograms and wavelets with probabilistic databases. Cormode [17] studied optimal histogram bucket boundaries and wavelet coefficients under a given error metric in probabilistic databases, and Cormode [15] presented a dynamic programming framework to find optimal probabilistic histograms for different error metrics.

Besides histogram methods, other approximation work has been done on probabilistic databases. In order to allow uncertain data in databases, previous work [10, 12, 22, 25, 31, 66, 77] modeled the uncertain data and extended the standard relational algebra to a probabilistic algebra. Suciu [18, 19] further studied the complexity of evaluating queries over probabilistic databases and proved that computing the probability of a Boolean query on a disjoint-independent database has #P complexity.

Aggregates over probabilistic databases, perceived as a much harder problem by the community, have attracted some attention in recent years. Ross [86] studied aggregation over probabilistic databases; the focus is on probabilistic databases with attribute uncertainty, where the probability of each attribute lies in a bounded interval. [6, 95] developed the TRIO system for managing uncertainty and the lineage of data; aggregation over TRIO is based on the possible worlds model, and therefore operations are simple to implement but intractable in most situations. [51, 52] and [16] studied aggregates over probabilistic data streams. The problem in [51, 52] is to estimate the expected value of various aggregates over a single probabilistic data stream (or probabilistic relation); they derived an efficient method to estimate AVERAGE for the One-relation case. [16] studied the same problem together with the estimation of the size of the join of two relations.

The analysis provided in these papers is restricted to the expectation and variance for the One-relation case and the expectation for the Two-relation case. Furthermore, the aggregate is restricted to COUNT (the work is only concerned with frequency moments). It is important to note that the problem solved in all these works is harder, since the estimation has to be performed in small space (the data streaming problem). It would be interesting to investigate how the formulas we derive could be approximated using small space as well. [55-57, 88-90] used graphical models to represent correlated tuples, but little work has been done on aggregates; [88] only presented the distribution of an AVERAGE query on 500 tuples. MayBMS [9, 30, 40, 61-63], MCDB [37, 49, 75] and PIP [60] are probabilistic DBMSs that have implemented expected values of aggregates.

The research above focused on the expected value of aggregates. Inspired by the same observation, namely that the expected value of an aggregate cannot capture its distribution clearly, [85] studied the problem of dealing with HAVING predicates that necessarily use aggregates. The basic problem they consider is: compute the probability that, for a given group, the aggregate α is in relationship θ with the constant k, i.e., α θ k. The types of aggregates considered are MIN, MAX, COUNT, and SUM, and θ is a comparison operator like >. Only integer constants k are supported, since the operations are performed on the semiring S_{k+1}. The probabilities of events α < k are in fact the cumulative distribution function (c.d.f.) of the aggregate α at the point k. The efficient computation of such probabilities can be readily used to compute confidence intervals for α by essentially inverting the c.d.f.; this can be accomplished efficiently using binary search, since the c.d.f. is monotone. Unfortunately, most of the results in [85] are negative. For most queries, computing exactly the probability of the event α θ k has #P complexity. Even for the queries for which the computation is polynomial (this is the case for MIN, MAX, COUNT, and SUM(y), but only for α-safe plans and y a single attribute), the complexity is linear in the constant k involved.
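To make the c.d.f.-inversion idea concrete, here is a minimal Python sketch. The `cdf` oracle stands in for the probability computation of [85] (its actual interface is not specified here), and `lo`/`hi` are assumed integer bounds on the aggregate; all names are illustrative.

```python
def invert_cdf(cdf, target, lo, hi):
    """Smallest integer k in [lo, hi] with cdf(k) >= target (cdf monotone)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if cdf(mid) >= target:
            hi = mid          # k = mid already reaches the target probability
        else:
            lo = mid + 1      # need a larger k
    return lo

def ci_from_cdf(cdf, alpha, lo, hi):
    """Two-sided (1 - alpha) confidence interval for the aggregate."""
    return (invert_cdf(cdf, alpha / 2, lo, hi),
            invert_cdf(cdf, 1 - alpha / 2, lo, hi))
```

Each call performs O(log(hi - lo)) evaluations of the c.d.f., which is why the cost of computing P(α < k) dominates the overall complexity.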

This is especially troublesome for SUM aggregates, since k can be as large as the product of the size of the domain of the aggregate and the size of the group.

1.2 Confidence Intervals from Moments

Confidence intervals for estimators address the effectiveness of the approximated values. By providing a lower and an upper limit, a confidence interval gives an error bound for the estimator, which is essential for users. For example, it is very hard for a person to decide whether to buy a stock given a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimate is the same, but the latter has the potential to be very risky whereas the first is safe.

The standard way to obtain confidence intervals for a random variable X is to compute the first two moments E[X] and Var[X], and then to use either a distribution dependent or a distribution independent bound. The distribution dependent bounds assume that the type of the distribution is known and is one of the two-parameter distributions. The most common situation is the application of the Central Limit Theorem, which states that the distribution of sums of independent random variables is asymptotically normal, or of a similar result. Irrespective of how the normality of the distribution is justified, the confidence interval with confidence $1 - \alpha$ based on moments and normality is

$$\left[\, E[X] - z_{\alpha/2}\sqrt{\mathrm{Var}[X]},\;\; E[X] + z_{\alpha/2}\sqrt{\mathrm{Var}[X]} \,\right]$$

with $z_{\alpha/2}$ the upper $\alpha/2$ quantile of the $N(0, 1)$ distribution.

An alternative is to use the distribution independent bounds based on Chebyshev's inequality to provide conservative bounds (the bounds are correct irrespective of the distribution but might be unnecessarily large). This bound requires E[X] and Var[X] as well. Let $\alpha$ be any positive real number, $E[X] = \mu$, and $\mathrm{Var}[X] = \sigma^2$; the bound is

$$P(|X - \mu| \geq \alpha) \leq \frac{\sigma^2}{\alpha^2}$$
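As an illustration of the two bounds, the following minimal Python sketch computes both kinds of intervals from E[X] and Var[X]. It is our own example, not code from the dissertation, and the use of `statistics.NormalDist` for the normal quantile is an implementation choice.

```python
import math
from statistics import NormalDist

def normal_ci(mean, var, alpha=0.05):
    """CLT-based interval: E[X] +/- z_{alpha/2} * sqrt(Var[X])."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    half = z * math.sqrt(var)
    return mean - half, mean + half

def chebyshev_ci(mean, var, alpha=0.05):
    """Distribution-independent interval: P(|X - mean| >= t) <= var / t^2,
    so t = sqrt(var / alpha) guarantees coverage of at least 1 - alpha."""
    t = math.sqrt(var / alpha)
    return mean - t, mean + t

# The stock example from the text: same expected value, very different risk.
print(normal_ci(10_000, 25))      # tight interval around $10,000
print(chebyshev_ci(10_000, 25))   # wider, but valid for any distribution
```

The Chebyshev interval is noticeably wider for the same moments, which is exactly the conservativeness the text describes.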

The two types of bounds we discussed above require the computation of E[X] and Var[X]. Usually E[X] is easy to compute, but Var[X] poses significant problems. Unfortunately, it is not possible to avoid the computation of Var[X] and still obtain reasonable confidence intervals. If only E[X] is known, only Markov's inequality or Hoeffding bounds can be produced. Both can be reasonably efficient if multiple copies of the random variable are available and averaged, but both are inefficient if this is not the case. As we will see in this thesis, we have only one copy of the random variable that characterizes the aggregate, thus Var[X] is strictly required if reasonable confidence bounds are to be produced.

For all the estimates in this dissertation, either the distribution independent bounds could be used to obtain a strict characterization of the results, or the normal distribution based bounds, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables and so the Central Limit Theorem applies. In view of the above discussion, in order to simplify the exposition and the comparison, throughout the dissertation we will just provide results in the form of expected values and variances or squared errors (the variance is equal to the squared error if the estimator is unbiased). Actual error guarantees can be obtained straightforwardly using the above mentioned techniques.

1.3 Contributions

The common assumption about histograms is the uniform frequency assumption, under which histograms perform well only when the frequency values in a bucket are uniformly distributed. This uniform assumption turns out not to be strictly necessary in our work; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequencies. The first problem we address is a more general statistical assumption for histograms and the behavior of histograms under this assumption. The moments of unidimensional and multidimensional histograms are formulated under this new assumption.

Although extensive work has been done on probabilistic databases, most of it only provides the expected value of queries, which is not enough for users to make decisions. [73] approximated confidence computation in probabilistic databases, but it only estimates the probabilities of DNFs. MCDB is the only system capable of computing tight confidence intervals but, in the case of rare events, it requires prohibitively expensive evaluation since it is based on sampling. The second and third problems we tackle are estimating aggregates over probabilistic databases, especially aggregates over multiple relations. We provide confidence intervals for our estimators and efficient algorithms to evaluate them.

More precisely, we made the following contributions:

- We formulate a new statistical assumption, random shuffling of frequencies within a bucket, that is more general, and thus more likely to hold, than the uniform frequency assumption. As we will show, this new assumption does not change the way histograms are used for approximating results of queries, thus it is consistent with all the previous work on histograms, but, importantly from a practitioner's point of view, it explains why and when histograms behave well as approximators. Statistically, the random shuffling assumption holds when there is no correlation between the frequencies in the two relations being joined, so it is likely to hold quite often in practice.

- We provide tight minimum error guarantees for both unidimensional and multidimensional histograms when the random shuffling assumption holds. [42] is the only other work that provides tight error guarantees for estimation using histograms, in that case worst-case guarantees. The errors we derive allow us to prove theoretically that, when the random shuffling assumption holds, histograms are well suited to estimate aggregation queries and strictly superior to sampling and sketching. At the same time, we provide compelling theoretical evidence that, when the random shuffling assumption does not hold, histograms are, on average, poor approximators compared to sampling and sketching.

- We apply the random shuffling assumption to XSketch [78]. As is the case for all uses of histograms in the literature, XSketches make the uniform frequency assumption for the histograms they use as an ingredient. This prevents a full statistical model for XSketches from being developed, with the result that the error cannot be analyzed. By combining the random shuffling assumption with the other statistical assumptions made in [78], we complete the statistical model, show that XSketches are unbiased estimators if the random shuffling assumption holds, and compute the error under the same assumption. This is an example of how the theory we developed in this dissertation can be extended to methods that use histograms as a building block.

- We derive a general framework for computing the confidence bounds of SUM- and AVERAGE-like aggregates over joins of multiple relations for probabilistic databases. The framework needs only a loose assumption on the probabilistic model used. We apply the framework to multiple models: the Tuple-independent model and the graphical model. The applications are straightforward, a testament to the power of the general framework. Applying the general framework to variations of the above models, and to the majority of other models in the literature, is just as easy.

- Based on our general framework for aggregates over probabilistic databases, we present algorithms that remove the need to perform computation over the Cross-products of the M matching tuples. The main algorithm has time complexity O(M log M) and is applicable to a large class of probabilistic models.

- We implement the theoretical results both using query rewriting in pure SQL and using a C++ and SQL combination for aggregates over probabilistic databases. We evaluate the performance of our algorithms on the TPC-H dataset and show that they are competitive with the computation of aggregates in Non-probabilistic databases for the Tuple-independent model and reasonable for the graphical model.

- We study probabilistic aggregates with duplicate removal and analyze the properties of PTIME probabilistic queries. We conclude that the correlations conveyed by projection are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal.

- We design an efficient hash-based algorithm to implement estimators of probabilistic aggregates with duplicate removal.

The rest of this dissertation is organized as follows. In Chapter 2, we present the random shuffling assumption for histograms and analyze the behavior of histograms under this assumption. Chapter 3 formulates aggregates over probabilistic databases and provides efficient algorithms to compute the estimates and the corresponding confidence intervals. Chapter 4 discusses aggregates with duplicate removal, derives formulas for the moments, and then implements efficient algorithms to evaluate them. In the final chapter, conclusions are drawn and future work is presented.

CHAPTER 2
HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES

2.1 Background

Histograms are among the most widely used and extensively studied approximation techniques for aggregate queries [41, 42, 44]. The traditional interpretation of histograms, irrespective of the type, is that the frequencies of items in a bucket are approximated by the average frequency of the bucket, and this average is used instead of the original frequencies in any computation. For example, histograms can be used to estimate the size of the join of two relations F and G. While this interpretation is intuitive and provides simple recipes for performing operations with histograms, it suggests that the histogram approximation of the frequency distribution will work well in the approximation process only if the frequency distribution is smooth and can be locally approximated using the uniform distribution assumption of histograms. This, as suggested by the following example and as apparent from the histogram literature, turns out not to be strictly necessary; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequencies.

Example 1. Let F and G be two relations, each with a single attribute A over a common domain. We generate both F and G to have Zipf distributions with Zipf coefficient 0.5 and with average frequency as close to 100 as possible while keeping all frequencies integers. In relation F the frequency is decreasing (see Figure 2-1); in relation G the frequencies are randomly shuffled (see Figure 2-2). Observe from these figures that the One-bucket histogram approximation of the frequency, the line at 99.54, is a very poor estimate of the frequencies; thus we would expect poor performance when we use One-bucket histograms to estimate the size of the Equi-join of F and G.

(This chapter has been submitted for publication to Information Systems Databases: Their Creation, Management and Utilization and is reprinted with permission.)

Figure 2-1. Frequency graph of relation F.
Figure 2-2. Frequency graph of relation G.
Figure 2-3. Histogram of the size of the join result.

In the scenario described above, the One-bucket histogram prediction is the same irrespective of the particular shuffling of the domain of G. In the particular case depicted in the figures, the true size of the join is a mere 1% smaller than the prediction; to make sure this is not happening by chance, we picked 1000 random shufflings of G and plotted the distribution of $|F \bowtie_A G|$ in Figure 2-3. Notice that the sizes of the joins are compactly distributed (within a 10% relative error) around the prediction made using the One-bucket histogram.

As the previous example suggests, even though within a bucket the uniform frequency approximation is rather crude, the result of approximating the size of the join is surprisingly good and statistically stable. This prompts the question: why is this happening in spite of the uniform approximation not holding? As we will show in this chapter, the answer is that a more general statistical hypothesis holds, namely that the placement of the frequencies in the two relations is uncorrelated. This observation is the starting point for the current work.

In the rest of the chapter, we first formalize the problem we are solving in Section 2.2, followed by the explanation of histograms as function approximators and statistical nonparametric models in Section 2.3. The formalization of the random shuffling assumption is made in Section 2.4. We analyze the behavior of unidimensional histograms under the random shuffling assumption in Section 2.5 and compare it with the behavior of sampling and sketching in Section 2.6. Section 2.7 analyzes the behavior of histograms when the random shuffling assumption does not hold. Section 2.8 generalizes the random shuffling assumption to multidimensional histograms. The random shuffling assumption is extended to XSketches in Section 2.9. Section 2.10 comments on the random shuffling assumption, and the discussion is concluded in Section 2.11.

2.2 Problem Formulation

The general problem we are trying to solve is approximating aggregates over joins. As we will show in this section, both the selectivity estimation problem and the general SUM-like aggregates over joins problem can be rephrased as size-of-join estimation problems. Thus, all the results developed for the latter can be extended straightforwardly to the other two problems.

2.2.1 Size of Join Problem

Unidimensional size of join estimation problem. Let F and G be two relations, each with a single attribute A with domain I. Furthermore, let $f_i$ and $g_i$ be the frequency of the value i in F and G, respectively. With this, the size-of-join problem is to estimate the quantity

$$|F \bowtie_A G| = \sum_{i \in I} f_i \, g_i \qquad (2\text{-}1)$$

given synopses of relations F and G (if full information is available, we can simply compute the sum to get the exact answer).

Multidimensional size of join estimation problem. Let $F_1(A_1), \ldots, F_m(A_m)$ and $G(A_1, \ldots, A_m)$ be $m + 1$ relations, where $F_k$ and $G$ have the common attribute $A_k$ with domain $I_k$. Let $f_{i_k}$ be the frequency of the value $i_k$ in $F_k$ and $g_{(i_1,\ldots,i_m)}$ the frequency of the tuple $(i_1, \ldots, i_m)$ in $G$. Then the size of the join is

$$|F_1 \bowtie_{A_1} G \bowtie_{A_2} \cdots \bowtie_{A_m} F_m| = \sum_{i_1 \in I_1} \cdots \sum_{i_m \in I_m} f_{i_1} \cdots f_{i_m} \, g_{(i_1,\ldots,i_m)} \qquad (2\text{-}2)$$

The problem of multidimensional size-of-join estimation is to approximate Equation 2-2 given synopses of the relations $F_k$ and $G$.
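For concreteness, here is a small Python sketch of Equation 2-1 together with the corresponding One-bucket histogram estimate, which replaces every frequency by the bucket average and thus predicts N·f̄·ḡ. The frequency data below are illustrative, not the Zipf data of Example 1.

```python
import random

def size_of_join(f, g):
    """Exact |F join_A G| = sum_i f_i * g_i (Equation 2-1)."""
    return sum(fi * gi for fi, gi in zip(f, g))

def one_bucket_estimate(f, g):
    """One-bucket histogram: every frequency replaced by the bucket average."""
    n = len(f)
    fbar = sum(f) / n
    gbar = sum(g) / n
    return n * fbar * gbar

# Illustrative frequencies over a common domain of size 1000.
random.seed(0)
f = [random.randint(50, 150) for _ in range(1000)]
g = [random.randint(50, 150) for _ in range(1000)]
print(size_of_join(f, g), one_bucket_estimate(f, g))
```

Because the two frequency vectors here are generated independently, the placements are uncorrelated and the two numbers come out close, previewing the behavior the random shuffling assumption explains.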

Table 2-1. Notations used in the chapter

Symbol(s)          Meaning
F, G               Relations
A, A_k             Join attributes
I, I_k             Domains of join attributes A, A_k
N                  Size of domain I
i, j, i', j'       Indices going over domain I
f_i, g_i           Frequencies of value i in relations F, G
f̄, ḡ               Average frequencies in relations F, G
SJ(F)              Σ_{i∈I} f_i², the self-join size of F
SqErr(F)           Σ_{i∈I} (f_i − f̄)², the squared error of F
I_l                The l-th bucket of domain I
I(C)               Identity function: 1 when C is true, 0 otherwise
σ                  Uniform random permutation
x_σ                Random variable modeling the frequency
n                  Number of buckets
N                  Size of the sample
m                  Number of attributes
P[p]               Probability that predicate p holds
E[X]               Expected value of random variable X
Var[X]             Variance of X: Var[X] = E[X²] − E[X]²
Cov(X, Y)          Covariance of X and Y: Cov(X, Y) = E[XY] − E[X]E[Y]
(i_1, …, i_m)      Indices going over domain I with m dimensions
T                  XML document nodes

2.2.2 Selectivity Estimation Problem

For selectivity estimation problems, we show how they can be reduced to size-of-join estimation problems in the case of Bi-dimensional selectivity; the reduction easily generalizes to multidimensional selectivity. This means that the results we develop for size-of-join estimation readily apply to this problem as well. We also show an alternative reduction for arbitrary selectivity predicates that is not as efficient but works in all scenarios. Let G be a relation with two attributes A and B with domains I and J, respectively.

Given $I' \subseteq I$ and $J' \subseteq J$, estimate the quantity

$$\sigma_{I' \times J'}(G) = \sum_{i \in I'} \sum_{j \in J'} g_{ij}$$

With I(C) the identity function that takes value 1 if condition C is true and value 0 otherwise, by simply setting up relations F and H so that $f_i = I(i \in I')$ and $h_j = I(j \in J')$, we have:

$$|F \bowtie_A G \bowtie_B H| = \sum_{i \in I} \sum_{j \in J} f_i \, g_{ij} \, h_j = \sum_{i \in I} \sum_{j \in J} I(i \in I') \, g_{ij} \, I(j \in J') = \sum_{i \in I'} \sum_{j \in J'} g_{ij} = \sigma_{I' \times J'}(G)$$

Observe that the joins are with unidimensional relations; this is important since, in general, the smaller the dimensionality, the more efficient the estimation. Essentially the same technique can be applied to the multidimensional case when the selection predicate can be rewritten as a conjunction of predicates, one for each attribute. In that case, the selectivity estimation problem is reduced to the problem of estimating the size of a Star-join involving a unidimensional virtual relation for each attribute and the original relation.

This can be generalized further by considering disjunctions of conjunctions of predicates involving individual attributes (or expressions that can be rewritten in this way). In this case, the expression can be rewritten as a disjunction of conjunctions with the extra property that the conjunctions do not contain common cases. This extra property ensures that the estimate of the selectivity of the initial predicate is simply the sum of the estimates for the individual conjunctions. Note that the selectivity of the predicate is thus the size of the join of the virtual relations with the original relation. This involves joins on multiple attributes between two relations, which are more complicated than joins on a single attribute. Moreover, constructing synopses of such virtual relations can be challenging (the tuples satisfying the selection predicate might actually have to be enumerated one by one).
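The reduction can be verified mechanically. Below is a minimal Python sketch that builds the indicator relations F and H and evaluates the triple join as the double sum above; the relation G, stored as a dictionary of frequencies g[(i, j)], is illustrative.

```python
def selectivity_via_join(g, I_sel, J_sel, I_dom, J_dom):
    """sigma_{I' x J'}(G) computed as |F join_A G join_B H| with
    f_i = I(i in I') and h_j = I(j in J')."""
    f = {i: 1 if i in I_sel else 0 for i in I_dom}   # indicator relation F
    h = {j: 1 if j in J_sel else 0 for j in J_dom}   # indicator relation H
    return sum(f[i] * g.get((i, j), 0) * h[j]
               for i in I_dom for j in J_dom)

# Illustrative bidimensional frequency table g[(i, j)].
g = {(1, 1): 3, (1, 2): 5, (2, 1): 7, (2, 2): 11}
I_dom, J_dom = {1, 2}, {1, 2}
print(selectivity_via_join(g, {1}, {1, 2}, I_dom, J_dom))  # 3 + 5 = 8
```

The indicator frequencies simply zero out every term outside I' × J', leaving exactly the selectivity sum.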

2.2.3 Aggregates over Joins Problem

In order to gain insight and to ease the understanding, we primarily focus on the problem of computing aggregates over the join of two relations; we comment towards the end of the section on how these ideas extend to larger joins. Let F and G be two relations that contain a join attribute A and possibly other attributes. Let us first look at aggregates of the form

$$\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G) = \sum_{t \in F \bowtie_A G} F_F(t.F) \, F_G(t.G)$$

with $t.F$ the part of the tuple in the join that comes from relation F (similarly for G), and $F_F$ and $F_G$ arbitrary functions. The only requirement for this to work is to be able to rewrite the expression summed over the join as a product of expressions depending on the attributes of the two relations; if such a rewriting is possible, we say that the aggregate is relation factorizable. To evaluate the sum of a relation factorizable aggregate over the join of F and G, we observe that:

$$\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G) = \sum_{t \in F \bowtie_A G} F_F(t.F) \, F_G(t.G) = \sum_{i \in I} \; \sum_{\substack{t \in F \bowtie_A G \\ t.A = i}} F_F(t.F) \, F_G(t.G) = \sum_{i \in I} \Big( \sum_{\substack{t \in F \\ t.A = i}} F_F(t) \Big) \Big( \sum_{\substack{t \in G \\ t.A = i}} F_G(t) \Big) = \sum_{i \in I} f'_i \, g'_i \qquad (2\text{-}3)$$

where $f'_i$ and $g'_i$ are just compact notations for the expressions $\sum_{t \in F,\, t.A = i} F_F(t)$ and $\sum_{t \in G,\, t.A = i} F_G(t)$, respectively. The important observation is that we can use any method designed for size-of-join estimation to estimate this aggregate as well, by simply replacing $f_i$ by $f'_i$ and $g_i$ by $g'_i$, since then the expression in Equation 2-1 is identical to the last expression in Equation 2-3.
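A short Python sketch of the rewriting in Equation 2-3: each relation is folded into per-value partial aggregates f'_i and g'_i, whose dot product gives the aggregate over the join. The relation layout and the functions F_F and F_G used here are illustrative.

```python
from collections import defaultdict

def fold(relation, key, func):
    """f'_i = sum of func(t) over tuples t of the relation with t[key] = i."""
    acc = defaultdict(float)
    for t in relation:
        acc[t[key]] += func(t)
    return acc

def sum_over_join(F, G, key, FF, FG):
    """SUM_{FF * FG}(F join_key G) evaluated via Equation 2-3,
    without materializing the join."""
    f, g = fold(F, key, FF), fold(G, key, FG)
    return sum(f[i] * g[i] for i in f.keys() & g.keys())

# Illustrative relations: sum of price * qty over the join on 'a'.
F = [{"a": 1, "price": 2.0}, {"a": 1, "price": 3.0}, {"a": 2, "price": 4.0}]
G = [{"a": 1, "qty": 10}, {"a": 2, "qty": 5}]
total = sum_over_join(F, G, "a", lambda t: t["price"], lambda t: t["qty"])
print(total)  # (2 + 3) * 10 + 4 * 5 = 70
```

Setting FF and FG to the constant function 1 recovers the plain size of join, which is why any size-of-join estimator carries over unchanged.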

Thus, computing such aggregates is as easy as computing sizes of joins; the complexity is in the join, not in the expression being summed up.

With the ability to compute estimates of aggregates of the form $\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G)$, we can immediately compute aggregates of the form AVG and STD as well. For example, to estimate $\mathrm{AVG}_B(F(A, B) \bowtie_A G(A))$ we can estimate $\mathrm{SUM}_B(F(A, B) \bowtie_A G(A))$ and $|F \bowtie_A G|$ and simply take their ratio. The ideas used to reduce aggregate estimation problems of the form $\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G)$ to size-of-join estimation problems can be readily generalized to star joins involving multiple relations; a similar rewriting can be made.

Since both the selectivity estimation problem and the COUNT, SUM, AVG, and STD aggregate estimation problems can be reduced to size-of-join problems, for the rest of the chapter we focus only on the size-of-join problem. The problem of estimating MIN and MAX aggregates cannot be reduced to size-of-join problems, but no approximate methods for estimating such aggregates exist either. The main obstacle to such developments is the fact that there is no way to predict extreme values using statistical methods unless very strong statistical assumptions are made (particular distributions of the data have to be assumed).

2.2.4 Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates

The standard technique [8, 91] to obtain error guarantees, i.e., confidence intervals, for an estimate is to compute the expected value and the variance and then to use either distribution independent bounds, given by Chernoff's and Chebyshev's inequalities, or distribution dependent bounds. In the latter case, usually the Central Limit Theorem or one of its generalizations is used to argue that the distribution of the estimate is close to normal, and then error bounds based on a normal distribution with the same expected value and variance are produced. For all the estimates in this chapter, either the distribution independent bounds could be used to obtain a strict characterization of the results, or the normal distribution based bounds, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables and so the Central Limit Theorem applies.

In view of the above discussion, in order to simplify the exposition and the comparison, throughout the chapter we will just provide results in the form of expected values and variances or squared errors (the variance is equal to the squared error if the estimator is unbiased). Actual error guarantees can be obtained straightforwardly using the above mentioned techniques.

2.3 Histograms as Function Approximators and Statistical Nonparametric Models

Histograms were first introduced in Statistics as an alternative to parametric models (parametric models are models that depend on a small, fixed set of parameters; for example, normal distributions are parametric models since they depend on only two parameters, the mean and the variance). The main idea is to approximate, in a function approximation sense, the probability density function of an unknown distribution. The histogram can then be used instead of the unknown probability density function to characterize the distribution. While results from Statistics clearly point out that no guarantee with respect to the goodness of the approximation of the p.d.f. can be given, good guarantees can be provided for the computation of the c.d.f. This is particularly useful when the cumulative distribution function has to be determined at various points (for example, to produce confidence intervals). It is easier to explain why this is the case by translating the problem into a database problem: selectivity estimation. With our problem definition, if we have a selection predicate of the form G.A <= 10 and we want to estimate its selectivity, by constructing (or using a previously constructed) histogram on this attribute, the estimate is simply the sum of the masses (i.e., total number of tuples) of the buckets completely included in the range (-∞, 10] plus the proportional part of the bucket that overlaps the point 10.
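A minimal Python sketch of this computation, with buckets represented as (lo, hi, mass) triples over half-open ranges; the equi-width bucket layout below is an illustrative choice, not a construction from this chapter.

```python
def estimate_leq(buckets, q):
    """Estimate |sigma_{A <= q}(G)| from a histogram whose buckets are
    (lo, hi, mass) over [lo, hi): full mass for buckets inside (-inf, q],
    a proportional fraction for the bucket that overlaps q."""
    total = 0.0
    for lo, hi, mass in buckets:
        if hi <= q:
            total += mass                         # fully covered bucket
        elif lo <= q:
            total += mass * (q - lo) / (hi - lo)  # partially overlapping bucket
    return total

# Illustrative equi-width histogram on attribute A.
buckets = [(0, 5, 120), (5, 10, 80), (10, 15, 40)]
print(estimate_leq(buckets, 10))   # 200: only full buckets, no error source
print(estimate_leq(buckets, 12))   # 200 + 40 * 2/5 = 216
```

Only the partially overlapping bucket contributes any estimation error, which is the point made in the next paragraph.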

From the point of view of selectivity estimation, by the above discussion, the error comes only from the bucket that partially overlaps the interval; for the buckets that are fully included in the interval there is no source of error. This observation suggests that the error is kept under control since, in absolute value, it is less than the mass of the overlapping bucket. In the context of statistics there is an extra complication coming from the fact that the data provided form a sample, in which case natural statistical fluctuations would result in errors for the counts in each bucket. This is not a problem in databases, since the histogram is constructed over the entire dataset, not just a sample.

2.4 Random Shuffling Assumption

Under the traditional uniform assumption, histograms should perform well only when the average frequency approximates the frequencies in a bucket well. As we have seen in the introduction (Example 1), the One-bucket histogram behaved exceptionally well even though the average frequency was a poor approximator of the frequencies. The uniform frequency assumption does not explain this good behavior. Instead of proposing a new type of histogram, the goal in this section is to find a better explanation and to explore it with statistical analysis. We want to formalize a statistical model that can be used to characterize histograms.

The starting point for our investigation is the random rearrangement of the elements of the domain of relation G in Example 1. The rearrangement did not change the skew of the distribution; it just decorrelated the matching of the frequencies in F and G. We formalize this random rearrangement in this section as the statistically well-defined notion of the random shuffling assumption. As we will see, this formalization leads to formulas for the error of histograms and allows us to compare them with other approximation methods like sketching and sampling. Later in the chapter we ask the counterpart question: what is the error behavior of histograms if the assumption is false?
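Before formalizing the assumption, its effect is easy to probe empirically in the spirit of Example 1: fix the frequencies of F, randomly shuffle the placement of the frequencies of G many times, and observe how tightly the resulting join sizes concentrate around the One-bucket prediction. A minimal Python sketch follows; the Zipf-like generator and all parameters are illustrative, not the exact data of Example 1.

```python
import random

def zipf_freqs(n, s=0.5, target_avg=100):
    """Integer Zipf-like frequencies with average close to target_avg."""
    raw = [1.0 / (i + 1) ** s for i in range(n)]
    scale = target_avg * n / sum(raw)
    return [max(1, round(scale * r)) for r in raw]

n = 1000
f = zipf_freqs(n)                 # decreasing frequencies, as in relation F
g = zipf_freqs(n)
prediction = n * (sum(f) / n) * (sum(g) / n)   # One-bucket histogram estimate

random.seed(0)
sizes = []
for _ in range(1000):
    random.shuffle(g)             # one random shuffling of the domain of G
    sizes.append(sum(fi * gi for fi, gi in zip(f, g)))

rel_err = [abs(s - prediction) / prediction for s in sizes]
print(max(rel_err))   # typically small: join sizes concentrate near the prediction
```

The tight concentration appears even though the flat average is a poor pointwise approximation of either frequency vector, which is exactly the phenomenon the random shuffling assumption is designed to capture.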


More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

c 2011 Nisha Somnath

c 2011 Nisha Somnath c 2011 Nisha Somnath HIERARCHICAL SUPERVISORY CONTROL OF COMPLEX PETRI NETS BY NISHA SOMNATH THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Aerospace

More information

Integrated reliable and robust design

Integrated reliable and robust design Scholars' Mine Masters Theses Student Research & Creative Works Spring 011 Integrated reliable and robust design Gowrishankar Ravichandran Follow this and additional works at: http://scholarsmine.mst.edu/masters_theses

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

August 27, Review of Algebra & Logic. Charles Delman. The Language and Logic of Mathematics. The Real Number System. Relations and Functions

August 27, Review of Algebra & Logic. Charles Delman. The Language and Logic of Mathematics. The Real Number System. Relations and Functions and of August 27, 2015 and of 1 and of 2 3 4 You Must Make al Connections and of Understanding higher mathematics requires making logical connections between ideas. Please take heed now! You cannot learn

More information

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Randomized Algorithms Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized

More information

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds Lecture 25 of 42 PAC Learning, VC Dimension, and Mistake Bounds Thursday, 15 March 2007 William H. Hsu, KSU http://www.kddresearch.org/courses/spring2007/cis732 Readings: Sections 7.4.17.4.3, 7.5.17.5.3,

More information

Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets

Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets Jacob Feldman School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853,

More information

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Part II: Strings of Pearls G n,r with Biased Perturbations Jörg Sameith Graduiertenkolleg

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

CONVERGENCE OF RANDOM SERIES AND MARTINGALES

CONVERGENCE OF RANDOM SERIES AND MARTINGALES CONVERGENCE OF RANDOM SERIES AND MARTINGALES WESLEY LEE Abstract. This paper is an introduction to probability from a measuretheoretic standpoint. After covering probability spaces, it delves into the

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Discrete Structures Proofwriting Checklist

Discrete Structures Proofwriting Checklist CS103 Winter 2019 Discrete Structures Proofwriting Checklist Cynthia Lee Keith Schwarz Now that we re transitioning to writing proofs about discrete structures like binary relations, functions, and graphs,

More information

Where do pseudo-random generators come from?

Where do pseudo-random generators come from? Computer Science 2426F Fall, 2018 St. George Campus University of Toronto Notes #6 (for Lecture 9) Where do pseudo-random generators come from? Later we will define One-way Functions: functions that are

More information

6.842 Randomness and Computation Lecture 5

6.842 Randomness and Computation Lecture 5 6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its

More information

INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS. Tao Jiang. A Thesis

INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS. Tao Jiang. A Thesis INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS Tao Jiang A Thesis Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the

More information

Errata and Proofs for Quickr [2]

Errata and Proofs for Quickr [2] Errata and Proofs for Quickr [2] Srikanth Kandula 1 Errata We point out some errors in the SIGMOD version of our Quickr [2] paper. The transitivity theorem, in Proposition 1 of Quickr, has a revision in

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

Examining the accuracy of the normal approximation to the poisson random variable

Examining the accuracy of the normal approximation to the poisson random variable Eastern Michigan University DigitalCommons@EMU Master's Theses and Doctoral Dissertations Master's Theses, and Doctoral Dissertations, and Graduate Capstone Projects 2009 Examining the accuracy of the

More information

Factorized Relational Databases Olteanu and Závodný, University of Oxford

Factorized Relational Databases   Olteanu and Závodný, University of Oxford November 8, 2013 Database Seminar, U Washington Factorized Relational Databases http://www.cs.ox.ac.uk/projects/fd/ Olteanu and Závodný, University of Oxford Factorized Representations of Relations Cust

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Probabilistic Characterization of Nearest Neighbor Classifier

Probabilistic Characterization of Nearest Neighbor Classifier Noname manuscript No. (will be inserted by the editor) Probabilistic Characterization of Nearest Neighbor Classifier Amit Dhurandhar Alin Dobra Received: date / Accepted: date Abstract The k-nearest Neighbor

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Copyright 2018 by James A. Bernhard Contents 1 Vector spaces 3 1.1 Definitions and basic properties.................

More information

Randomized Complexity Classes; RP

Randomized Complexity Classes; RP Randomized Complexity Classes; RP Let N be a polynomial-time precise NTM that runs in time p(n) and has 2 nondeterministic choices at each step. N is a polynomial Monte Carlo Turing machine for a language

More information

Divisible Load Scheduling

Divisible Load Scheduling Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute

More information

2.3 Some Properties of Continuous Functions

2.3 Some Properties of Continuous Functions 2.3 Some Properties of Continuous Functions In this section we look at some properties, some quite deep, shared by all continuous functions. They are known as the following: 1. Preservation of sign property

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]

More information

Computability of Heyting algebras and. Distributive Lattices

Computability of Heyting algebras and. Distributive Lattices Computability of Heyting algebras and Distributive Lattices Amy Turlington, Ph.D. University of Connecticut, 2010 Distributive lattices are studied from the viewpoint of effective algebra. In particular,

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Lecture Wigner-Ville Distributions

Lecture Wigner-Ville Distributions Introduction to Time-Frequency Analysis and Wavelet Transforms Prof. Arun K. Tangirala Department of Chemical Engineering Indian Institute of Technology, Madras Lecture - 6.1 Wigner-Ville Distributions

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Experiment 2 Random Error and Basic Statistics

Experiment 2 Random Error and Basic Statistics PHY9 Experiment 2: Random Error and Basic Statistics 8/5/2006 Page Experiment 2 Random Error and Basic Statistics Homework 2: Turn in at start of experiment. Readings: Taylor chapter 4: introduction, sections

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

SOME RESOURCE ALLOCATION PROBLEMS

SOME RESOURCE ALLOCATION PROBLEMS SOME RESOURCE ALLOCATION PROBLEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

2. Probability. Chris Piech and Mehran Sahami. Oct 2017

2. Probability. Chris Piech and Mehran Sahami. Oct 2017 2. Probability Chris Piech and Mehran Sahami Oct 2017 1 Introduction It is that time in the quarter (it is still week one) when we get to talk about probability. Again we are going to build up from first

More information

Some notes on streaming algorithms continued

Some notes on streaming algorithms continued U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.

More information

2.5.2 Basic CNF/DNF Transformation

2.5.2 Basic CNF/DNF Transformation 2.5. NORMAL FORMS 39 On the other hand, checking the unsatisfiability of CNF formulas or the validity of DNF formulas is conp-complete. For any propositional formula φ there is an equivalent formula in

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

2012 IEEE International Symposium on Information Theory Proceedings

2012 IEEE International Symposium on Information Theory Proceedings Decoding of Cyclic Codes over Symbol-Pair Read Channels Eitan Yaakobi, Jehoshua Bruck, and Paul H Siegel Electrical Engineering Department, California Institute of Technology, Pasadena, CA 9115, USA Electrical

More information

Compiling Knowledge into Decomposable Negation Normal Form

Compiling Knowledge into Decomposable Negation Normal Form Compiling Knowledge into Decomposable Negation Normal Form Adnan Darwiche Cognitive Systems Laboratory Department of Computer Science University of California Los Angeles, CA 90024 darwiche@cs. ucla. edu

More information

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models Contents Mathematical Reasoning 3.1 Mathematical Models........................... 3. Mathematical Proof............................ 4..1 Structure of Proofs........................ 4.. Direct Method..........................

More information

Finding Frequent Items in Probabilistic Data

Finding Frequent Items in Probabilistic Data Finding Frequent Items in Probabilistic Data Qin Zhang, Hong Kong University of Science & Technology Feifei Li, Florida State University Ke Yi, Hong Kong University of Science & Technology SIGMOD 2008

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Reynold Cheng, Dmitri V. Kalashnikov Sunil Prabhakar The Hong Kong Polytechnic University, Hung Hom, Kowloon,

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Price: $25 (incl. T-Shirt, morning tea and lunch) Visit:

Price: $25 (incl. T-Shirt, morning tea and lunch) Visit: Three days of interesting talks & workshops from industry experts across Australia Explore new computing topics Network with students & employers in Brisbane Price: $25 (incl. T-Shirt, morning tea and

More information

Exact and Approximate Equilibria for Optimal Group Network Formation

Exact and Approximate Equilibria for Optimal Group Network Formation Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich and Bugra Caskurlu Computer Science Department, RPI, 110 8th Street, Troy, NY 12180 {eanshel,caskub}@cs.rpi.edu Abstract.

More information

THE SURE-THING PRINCIPLE AND P2

THE SURE-THING PRINCIPLE AND P2 Economics Letters, 159: 221 223, 2017 DOI 10.1016/j.econlet.2017.07.027 THE SURE-THING PRINCIPLE AND P2 YANG LIU Abstract. This paper offers a fine analysis of different versions of the well known sure-thing

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information