STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS


STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By

LIXIA CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

© 2011 Lixia Chen

To my family

ACKNOWLEDGMENTS

First and foremost, I would like to express my deep and sincere gratitude to my PhD adviser, Dr. Alin Dobra. I thank him for his invaluable support, guidance, and inspiration in my research. I have learned a lot from him in all aspects of my research: computing techniques, formulating problems, and presenting research results. This thesis would not have been possible without his help. I thank Dr. Arunava Banerjee, Dr. Sanjay Ranka, Dr. Tamer Kahveci, and Dr. Ronald Randles for serving on my supervisory committee. Dr. Banerjee's solid theoretical background enlightened me in exploring new possibilities. Dr. Randles provided helpful suggestions on my work. Dr. Kahveci and Dr. Ranka encouraged me in hard times.

I would like to convey my utmost thanks to my family, who mean everything to me. My husband, Yuchu Tong, has been with me through all the happy and hard times. He always brings fun to my life and encourages me to face the challenges in my studies. I enjoy every moment with him. I am very grateful to my parents for their love and encouragement. They have always been there, ready to support me. They are not only good parents but also excellent teachers to me. They inspired my interest in science and opened my mind to it. Many thanks also go to my sisters and my brother. Every moment with them is wonderful. I thank my newborn daughter for making me happier than ever.

I also want to thank all my friends for their support in my studies and life. An incomplete list includes Xuelian Xiao, Wei Peng, Yixi Ouyang, Jiangyan Xu and Wangyuan Zhang. We had a memorable time, which made my graduate studies enriched and enjoyable. Finally, I thank the National Science Foundation grant NSF-CAREER-IIS for the financial support of this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   Related Work
   Confidence Intervals from Moments
   Contributions

2 HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES
   Background
   Problem Formulation
      Size of Join Problem
         Unidimensional Size of Join Estimation Problem
         Multidimensional Size of Join Estimation Problem
      Selectivity Estimation Problem
      Aggregates over Joins Problem
      Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates
   Histograms as Function Approximators and Statistical Nonparametric Models
   Random Shuffling Assumption
      Definition of Uniform Random Shuffling Assumption
      Moments under Random Shuffling Assumption
   Histograms under Random Shuffling Assumption
      One-bucket Histograms
      Histograms with Aligned Buckets
   Comparison with Sampling and Sketches
      Sampling
      Sketches
      Comments on Comparison
   Histograms when Random Shuffling Assumption Does Not Hold
      Random Histograms on Arbitrary Problems
      Random Histograms for Self-join Size Computation
      Comments on End-biased Histograms
   Generalization to Multidimensional Histograms
      General Random Shuffling Properties
      Multidimensional Random Shuffling Distribution
      One-bucket Multidimensional Histogram
      Low Dimensional Histogram
      Estimating Size of Star-Joins using One-bucket Histograms
   XSketch Estimator under Random Shuffling Assumption
      Introduction
      XSketches
      Shuffled XSketches
   Comments on Random Shuffling Assumption
   Summary

3 AGGREGATION OVER PROBABILISTIC DATABASES WITH CONFIDENCE INTERVAL
   Background
   Preliminaries
      Queries and Equivalent Algebraic Expressions
      Probabilistic Database as a Description of a Probability Space
      Use of Kronecker Symbol δ_ij
   Analysis for Multi-relation SUM Aggregates for Tuple-independent Model
      Dependence in Tuples after Aggregations
      M² Computation of Moments
      Reducing the Time Complexity
   General Framework of Computing Moments
      Notation
      General Framework for M log M Computation
      Computation of Moments
   InDB Techniques
      InDB M² Technique
      InDB M log M Technique
   Middleware M log M Technique
      Algorithms
   Extensions
      Non-linear Aggregates
      GROUP-BY Queries
   Experiments
      Experiments on Tuple-independent Model
      Experiments on Graphical Model
   The Central Limit Theorem and Probabilistic Aggregates
   Summary

4 PROBABILISTIC AGGREGATION WITH DUPLICATE ELIMINATION
   Background
   Preliminaries
      Queries and Equivalent Algebraic Expressions
      Query Evaluation on Probabilistic Databases
   Analysis of Multi-relation Sum Aggregation with Duplicate Removal
      Probability of Correlated Groups
      Correlations in PTIME Queries
      Moments of Aggregates with Duplicate Elimination
   Hash-based Algorithm
   Experiments
      Implementation
      Results of Experiments
   Summary

5 CONCLUSIONS AND FUTURE WORK
   Dissertation Summary
   Future Directions

APPENDIX

A GENERAL UNIDIMENSIONAL HISTOGRAMS
B PROOF OF PROPOSITION
C PROOF OF PROPOSITION
D PROOF OF THEOREM
E PROOF OF THEOREM
F PROOF OF LEMMA
G PROOF OF THEOREM

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Notations used in the chapter
Relation R1
Relation R2
The result of π_{customer, probability}(R1 ⋈ R2)
The result of π_{customer, product, probability}(R1 ⋈ R2)
Group u
Group v

LIST OF FIGURES

2-1 Frequency graph of relation F
2-2 Frequency graph of relation G
2-3 Histogram of the size of the join result
The Layers of Sequences
Total Time On TPC-H Queries, 1GB dataset
Aggregate Time On TPC-H Queries, 1GB dataset
Total Time On TPC-H Queries, 10GB dataset
Aggregate Time On TPC-H Queries, 10GB dataset
The graphical model used in the experiments
The experiment results in the graphical model
PDF of sum of independent discrete random variables
CDF of sum of independent discrete random variables
PDF of sum of partly independent discrete random variables
CDF of sum of partly independent discrete random variables
Nested correlations in u, v, ū, v̄
Hierarchical structure of correlations
λ = 1: Running time on Q
λ = 1: Running time on Q
λ = 2: Running time on Q
λ = 2: Running time on Q
λ = 4: Running time on Q
λ = 4: Running time on Q

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

By Lixia Chen

December 2011

Chair: Alin Dobra
Major: Computer Engineering

Explosive data growth presents significant challenges for fast query processing. Computing exact answers to complex queries over large databases can take a long time because of the large volume of data that must be accessed. Approximate query processing provides an essential solution to this problem. Versatile approximation methods are available for use in databases today. In order to evaluate the effectiveness of these approximations, it is desirable to provide confidence intervals for the estimators; the confidence intervals effectively provide error guarantees for the estimated values. The focus of this dissertation is approximating database queries and providing the corresponding confidence intervals. We first revisit histograms, the approximation method used in traditional databases, and interpret them as a statistical method. We then estimate aggregates over probabilistic databases and provide efficient algorithms to compute them.

The traditional assumption for interpreting histograms, and for justifying approximate query processing methods based on them, is that all elements in a bucket have the same frequency; this is called the uniform distribution assumption. We show that a significantly less restrictive statistical assumption, namely that the elements within a bucket are randomly arranged even though they might have different frequencies, leads to identical formulas for approximating aggregate queries using histograms. Under this assumption, we analyze the behavior of both unidimensional and multidimensional histograms and provide tight error guarantees for the quality of the approximations.

As an example of how the statistical theory of histograms can be extended, we show how XSketches, an approximation technique for XML queries that uses histograms as building blocks, can be statistically analyzed. The combination of the random shuffling assumption and the other statistical assumptions associated with XSketch estimators yields a complete statistical model and error analysis for XSketches.

Characterizing SUM-like aggregates over probabilistic databases is considered a hard problem because the size of the probability space is exponential in the database size. In this dissertation, we aim to compute aggregates over probabilistic databases together with confidence intervals. Both methods for computing confidence intervals, distribution dependent and distribution independent, require the expected value and variance of the probabilistic aggregates. We develop a general framework to compute moments of aggregates in probabilistic databases. Based on this framework, we derive mathematical formulas for the first two moments of aggregates for graphical models and tuple-independent models. We then present InDB and Middleware algorithms to evaluate the aggregates efficiently. Our prototype implementation using Postgres suggests that our characterization of aggregates incurs little overhead for the tuple-independent model and manageable overhead for the graphical model. We also extend the computation of the moments to AVERAGE-like aggregates and GROUP-BYs to show the usefulness of the proposed methods.

To extend the above analysis of aggregates over probabilistic databases, we study aggregates with duplicate removal. We analyze the properties of PTIME queries and conclude that the correlations introduced by projections are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal and design a hash-based algorithm to implement them. Our comprehensive experiments show that this algorithm has a small overhead when the correlation rate is small and a reasonable overhead when the correlation rate is large.


CHAPTER 1
INTRODUCTION

With the advancements in science and technology, more and more information needs to be stored in databases. According to the JASON report [50], the data volume is projected to grow to hundreds of petabytes. When the data growth rate exceeds the hardware growth rate predicted by Moore's law, processing large data may take even longer than it does today. Approximate query processing provides an essential method for fast query response times over large data sets. In addition, exact answers are not necessary in some cases, and approximate query processing is well suited to them. Many applications, such as data mining, online analytical processing, and decision support systems, belong to this category due to their exploratory nature and the associated imprecision in the query or in the use of the query result.

Probabilistic databases have attracted much attention in recent years because of the increasing demand for processing uncertain data. Since data in a probabilistic database are uncertain by nature, the size of the probability space is exponential in the size of the database. Computing the exact answers of complex queries over such databases is inefficient and challenging. Query approximation becomes an intuitive alternative.

One natural question to ask about approximation methods is how effective the estimated values are and how much they vary. For example, it is very hard for a person to decide whether to buy a stock given a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimated values are the same, but the latter has the potential to be very risky whereas the first is safe. In this dissertation, we explore this question: how effective are the approximation methods? The prerequisite for answering this question is a statistical characterization of the approximation methods. Our research revolves around these two questions and studies approximate query processing, both for traditional databases and for probabilistic databases, from a statistical point of view.

1.1 Related Work

Versatile approximation methods have been proposed in the literature. Sampling is one of them and is widely studied and used in databases. Sampling methods use a small part of the data to estimate the distribution of a large dataset. There is a large body of research on sampling; we briefly introduce just some of it. Sample synopses can be computed online: Haas [36, 38] computed sample synopses of aggregations at runtime and gradually refined them during processing; later, the DBO system [54] extended this method and removed the limitation that the approximation fit in main memory. Sample synopses can also be precomputed: Acharya [5] precomputed synopses for foreign key joins to compensate for the loss of uniformity, and the AQUA project [4, 5, 27-29] at Bell Labs used samples to precompute synopses for approximate query answering. Of all the approximation methods, sampling techniques are the most flexible, the most studied, and have shown success in the context of databases. However, the uniform sampling requirement limits their applicability. In addition, if sampling methods operate on queries with low selectivity or on highly skewed data, large errors result. Although Ganti [26] and Chaudhuri [13] aimed to tackle these problems, both solutions are heuristics.

Histograms have better performance than sampling methods on queries with low selectivity. They are simple and easy to construct and operate. Because of these advantages, histograms have proven to be successful approximation methods in commercial databases, and extensive research has been done on them. A nice survey of the histogram work can be found in Ioannidis's paper [45]. Types of histograms proposed in the literature include Equi-width histograms [82], Equi-depth histograms [70], and several others. To capture all the classes of histograms, Poosala [81] provided a taxonomy that can represent different types of histograms such as Equi-sum histograms [70, 82], V-optimal histograms [81], and spline histograms [64].

Since approximation errors are inevitable for histograms, many techniques have been proposed to minimize different error measures. Some of them focus on constructing buckets offline.

Poosala and his collaborators [48, 81] proposed V-optimal histograms to minimize the sum of squared errors. Guha [33, 34] constructed optimal histograms for range queries that minimize relative errors; subsequent work [32] provided linear-time approximation algorithms to construct histograms. Recently, Cormode [15, 17] extended histogram research to probabilistic databases and proposed a general framework for different error metrics. Other techniques take advantage of feedback from query execution to optimize histograms. Aboulnaga [3] first proposed Self-tuning histograms. Later, Lim [69] used Two-phase methods to construct Self-tuning histograms. Bruno [11] and Srivastava [92] built multidimensional histograms by exploiting workload information and the entropy maximization principle. Kaushik [59] addressed distinct-value problems when constructing histograms from query feedback.

Although there is abundant literature on histograms, the amount of theoretical work characterizing histograms as approximation methods for database queries is surprisingly small. Piatetsky-Shapiro and Connell [76] provided the first theoretical characterization of histograms, deriving worst-case and average-case error guarantees for Equi-width and Equi-depth histograms used for selectivity estimation. The other theoretical characterizations of histograms can be found in the work of Ioannidis and his collaborators [41-46, 79]; their work is the only one applicable to the estimation of aggregates over joins. Most of this work is concerned with the optimality of histograms [41, 43, 44], for which, interestingly enough, the issue of computing the error of histograms can be cleverly avoided (the technical means to do this is to rely on majorization theory instead of a direct optimization). Most of [42], and small parts of the other papers we mentioned, are concerned with actually characterizing the error of histograms, but most of the results apply only to One-bucket histograms; the only exception is worst-case error estimation, which results in unrealistically large bounds.

The uniform frequency assumption was never formalized when it was used to derive average error bounds for histograms [76]; only implicit explanations of what the assumption says were made throughout the histogram literature.

Even if the uniform frequency assumption were formalized as a completely decorrelated placement of tuples in a bucket, it would ignore the skew within the bucket and thus provide looser bounds on behavior.

Recently, uncertain data have emerged in many areas, such as information extraction [25, 65] and sensor networks [84, 87]. Research has combined histograms and wavelets with probabilistic databases. Cormode [17] studied optimal histogram bucket boundaries and wavelet coefficients under a given error metric in probabilistic databases, and Cormode [15] presented a dynamic programming framework to find optimal probabilistic histograms for different error metrics.

Besides histogram methods, other approximation work has been done on probabilistic databases. In order to allow uncertain data in databases, previous work [10, 12, 22, 25, 31, 66, 77] modeled the uncertain data and extended the standard relational algebra to a probabilistic algebra. Suciu [18, 19] further studied the complexity of evaluating queries over probabilistic databases and proved that computing the probability of a Boolean query on a disjoint-independent database has #P complexity.

Aggregates over probabilistic databases, perceived as a much harder problem by the community, have attracted some attention in recent years. Ross [86] studied aggregation over probabilistic databases; the focus is on probabilistic databases with attribute uncertainty, where the probability of each attribute lies in a bounded interval. [6, 95] developed the TRIO system for managing uncertainty and the lineage of data; aggregation over TRIO is based on the possible worlds model, and therefore operations are simple to implement but intractable in most situations. [51, 52] and [16] studied aggregates over probabilistic data streams. The problem in [51, 52] is to estimate the expected value of various aggregates over a single probabilistic data stream (or probabilistic relation); they derived an efficient method to estimate AVERAGE for the One-relation case. [16] studied the same problem together with the estimation of the size of the join of two relations.

The analysis provided in these papers is restricted to the expectation and variance for the One-relation case and the expectation for the Two-relation case. Furthermore, the aggregate is restricted to COUNT (the work is only concerned with frequency moments). It is important to note that the problem solved in all these works is harder, since the estimation has to be performed in small space (the data streaming problem). It would be interesting to investigate how the formulas we derive could be approximated using small space as well. [55-57, 88-90] used graphical models to represent correlated tuples, but little work has been done on aggregates; [88] only presented the distribution of an AVERAGE query on 500 tuples. MayBMS [9, 30, 40, 61-63], MCDB [37, 49, 75] and PIP [60] are probabilistic DBMSs that have implemented expected values of aggregates.

The research above focused on the expected value of aggregates. Inspired by the same observation, namely that the expected value of an aggregate cannot capture its distribution clearly, [85] studied the problem of dealing with HAVING predicates that necessarily use aggregates. The basic problem they consider is: compute the probability that, for a given group, the aggregate α is in relationship θ with the constant k, i.e., α θ k. The types of aggregates considered are MIN, MAX, COUNT, and SUM, and θ is a comparison operator like >. Only integer constants k are supported, since the operations are performed on the semiring S_{k+1}. The probabilities of events α < k are in fact the cumulative distribution function (c.d.f.) of the aggregate α at the point k. The efficient computation of such probabilities can be readily used to compute confidence intervals for α by essentially inverting the c.d.f.; this can be accomplished efficiently using binary search, since the c.d.f. is monotone. Unfortunately, most of the results in [85] are negative. For most queries, computing exactly the probability of the event α θ k has #P complexity. Even for the queries for which the computation is polynomial (this is the case for MIN, MAX, COUNT, and SUM(y), but only for α-safe plans and y a single attribute), the complexity is linear in the constant k involved.
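To make the c.d.f.-inversion idea concrete, here is a minimal Python sketch. The `cdf` oracle stands in for the probability computation of [85] (its actual interface is not specified here), and `lo`/`hi` are assumed integer bounds on the aggregate; all names are illustrative.

```python
def invert_cdf(cdf, target, lo, hi):
    """Smallest integer k in [lo, hi] with cdf(k) >= target (cdf monotone)."""
    while lo < hi:
        mid = (lo + hi) // 2
        if cdf(mid) >= target:
            hi = mid          # k = mid already reaches the target probability
        else:
            lo = mid + 1      # need a larger k
    return lo

def ci_from_cdf(cdf, alpha, lo, hi):
    """Two-sided (1 - alpha) confidence interval for the aggregate."""
    return (invert_cdf(cdf, alpha / 2, lo, hi),
            invert_cdf(cdf, 1 - alpha / 2, lo, hi))
```

Each call performs O(log(hi - lo)) evaluations of the c.d.f., which is why the cost of computing P(α < k) dominates the overall complexity.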

This is especially troublesome for SUM aggregates, since k can be as large as the product of the size of the domain of the aggregate and the size of the group.

1.2 Confidence Intervals from Moments

Confidence intervals for estimators address the effectiveness of the approximated values. By providing a lower and an upper limit, a confidence interval gives an error bound for the estimator, which is essential for users. For example, it is very hard for a person to decide whether to buy a stock given a forecast price of $10,000, because a 95% confidence interval of $10,000 ± 10 and a 95% confidence interval of $10,000 ± 10,000 make a lot of difference. In both cases the estimate is the same, but the latter has the potential to be very risky whereas the first is safe.

The standard way to obtain confidence intervals for a random variable X is to compute the first two moments E[X] and Var[X], and then to use either a distribution dependent or a distribution independent bound. The distribution dependent bounds assume that the type of the distribution is known and is one of the two-parameter distributions. The most common situation is the application of the Central Limit Theorem, which states that the distribution of sums of independent random variables is asymptotically normal, or of a similar result. Irrespective of how the normality of the distribution is justified, the confidence interval with confidence $1 - \alpha$ based on moments and normality is

$$\left[\, E[X] - z_{\alpha/2}\sqrt{\mathrm{Var}[X]},\;\; E[X] + z_{\alpha/2}\sqrt{\mathrm{Var}[X]} \,\right]$$

with $z_{\alpha/2}$ the upper $\alpha/2$ quantile of the $N(0, 1)$ distribution.

An alternative is to use the distribution independent bounds based on Chebyshev's inequality to provide conservative bounds (the bounds are correct irrespective of the distribution but might be unnecessarily large). This bound requires E[X] and Var[X] as well. Let $\alpha$ be any positive real number, $E[X] = \mu$, and $\mathrm{Var}[X] = \sigma^2$; the bound is

$$P(|X - \mu| \geq \alpha) \leq \frac{\sigma^2}{\alpha^2}$$
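As an illustration of the two bounds, the following minimal Python sketch computes both kinds of intervals from E[X] and Var[X]. It is our own example, not code from the dissertation, and the use of `statistics.NormalDist` for the normal quantile is an implementation choice.

```python
import math
from statistics import NormalDist

def normal_ci(mean, var, alpha=0.05):
    """CLT-based interval: E[X] +/- z_{alpha/2} * sqrt(Var[X])."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile of N(0, 1)
    half = z * math.sqrt(var)
    return mean - half, mean + half

def chebyshev_ci(mean, var, alpha=0.05):
    """Distribution-independent interval: P(|X - mean| >= t) <= var / t^2,
    so t = sqrt(var / alpha) guarantees coverage of at least 1 - alpha."""
    t = math.sqrt(var / alpha)
    return mean - t, mean + t

# The stock example from the text: same expected value, very different risk.
print(normal_ci(10_000, 25))      # tight interval around $10,000
print(chebyshev_ci(10_000, 25))   # wider, but valid for any distribution
```

The Chebyshev interval is noticeably wider for the same moments, which is exactly the conservativeness the text describes.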

The two types of bounds we discussed above require the computation of E[X] and Var[X]. Usually E[X] is easy to compute, but Var[X] poses significant problems. Unfortunately, it is not possible to avoid the computation of Var[X] and still obtain reasonable confidence intervals. If only E[X] is known, only Markov's inequality or Hoeffding bounds can be produced. Both can be reasonably efficient if multiple copies of the random variable are available and averaged, but both are inefficient if this is not the case. As we will see in this thesis, we have only one copy of the random variable that characterizes the aggregate, thus Var[X] is strictly required if reasonable confidence bounds are to be produced.

For all the estimates in this dissertation, either the distribution independent bounds could be used to obtain a strict characterization of the results, or the normal distribution based bounds, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables and so the Central Limit Theorem applies. In view of the above discussion, in order to simplify the exposition and the comparison, throughout the dissertation we will just provide results in the form of expected values and variances or squared errors (the variance is equal to the squared error if the estimator is unbiased). Actual error guarantees can be obtained straightforwardly using the above mentioned techniques.

1.3 Contributions

The common assumption about histograms is the uniform frequency assumption, under which histograms perform well only when the frequency values in a bucket are uniformly distributed. This uniform assumption turns out not to be strictly necessary in our work; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequencies. The first problem we address is a more general statistical assumption for histograms and the behavior of histograms under this assumption. The moments of unidimensional and multidimensional histograms are formulated under this new assumption.

Although extensive work has been done on probabilistic databases, most of it only provides the expected value of queries, which is not enough for users to make decisions. [73] approximated confidence computation in probabilistic databases, but it only estimates the probabilities of DNFs. MCDB is the only system capable of computing tight confidence intervals but, in the case of rare events, it requires prohibitively expensive evaluation since it is based on sampling. The second and third problems we tackle are estimating aggregates over probabilistic databases, especially aggregates over multiple relations. We provide confidence intervals for our estimators and efficient algorithms to evaluate them.

More precisely, we made the following contributions:

- We formulate a new statistical assumption, random shuffling of frequencies within a bucket, that is more general, and thus more likely to hold, than the uniform frequency assumption. As we will show, this new assumption does not change the way histograms are used for approximating results of queries, thus it is consistent with all the previous work on histograms, but, importantly from a practitioner's point of view, it explains why and when histograms behave well as approximators. Statistically, the random shuffling assumption holds when there is no correlation between the frequencies in the two relations being joined, so it is likely to hold quite often in practice.

- We provide tight minimum error guarantees for both unidimensional and multidimensional histograms when the random shuffling assumption holds. [42] is the only other work that provides tight error guarantees for estimation using histograms, in that case worst-case guarantees. The errors we derive allow us to prove theoretically that, when the random shuffling assumption holds, histograms are well suited to estimate aggregation queries and strictly superior to sampling and sketching. At the same time, we provide compelling theoretical evidence that, when the random shuffling assumption does not hold, histograms are, on average, poor approximators compared to sampling and sketching.

- We apply the random shuffling assumption to XSketch [78]. As is the case for all uses of histograms in the literature, XSketches make the uniform frequency assumption for the histograms they use as an ingredient. This prevents a full statistical model for XSketches from being developed, with the result that the error cannot be analyzed. By combining the random shuffling assumption with the other statistical assumptions made in [78], we complete the statistical model, show that XSketches are unbiased estimators if the random shuffling assumption holds, and compute the error under the same assumption. This is an example of how the theory we developed in this dissertation can be extended to methods that use histograms as a building block.

- We derive a general framework for computing the confidence bounds of SUM- and AVERAGE-like aggregates over joins of multiple relations for probabilistic databases. The framework needs only a loose assumption on the probabilistic model used. We apply the framework to multiple models: the Tuple-independent model and the graphical model. The applications are straightforward, a testament to the power of the general framework. Applying the general framework to variations of the above models, and to the majority of other models in the literature, is just as easy.

- Based on our general framework for aggregates over probabilistic databases, we present algorithms that remove the need to perform computation over the Cross-products of the M matching tuples. The main algorithm has time complexity O(M log M) and is applicable to a large class of probabilistic models.

- We implement the theoretical results both using query rewriting in pure SQL and using a C++ and SQL combination for aggregates over probabilistic databases. We evaluate the performance of our algorithms on the TPC-H dataset and show that they are competitive with the computation of aggregates in Non-probabilistic databases for the Tuple-independent model and reasonable for the graphical model.

- We study probabilistic aggregates with duplicate removal and analyze the properties of PTIME probabilistic queries. We conclude that the correlations conveyed by projection are hierarchical and can be factorized recursively. We derive the first two moments of estimators of probabilistic aggregates with duplicate removal.

- We design an efficient hash-based algorithm to implement estimators of probabilistic aggregates with duplicate removal.

The rest of this dissertation is organized as follows. In Chapter 2, we present the random shuffling assumption for histograms and analyze the behavior of histograms under this assumption. Chapter 3 formulates aggregates over probabilistic databases and provides efficient algorithms to compute the estimates and the corresponding confidence intervals. Chapter 4 discusses aggregates with duplicate removal, derives formulas for the moments, and then implements efficient algorithms to evaluate them. In the final chapter, conclusions are drawn and future work is presented.

CHAPTER 2
HISTOGRAMS AS STATISTICAL ESTIMATORS FOR AGGREGATE QUERIES

2.1 Background

Histograms are among the most widely used and extensively studied approximation techniques for aggregate queries [41, 42, 44]. The traditional interpretation of histograms, irrespective of the type, is that the frequencies of items in a bucket are approximated by the average frequency of the bucket, and this average is used instead of the original frequencies in any computation. For example, histograms can be used to estimate the size of the join of two relations F and G. While this interpretation is intuitive and provides simple recipes for performing operations with histograms, it suggests that the histogram approximation of the frequency distribution will work well in the approximation process only if the frequency distribution is smooth and can be locally approximated using the uniform distribution assumption of histograms. This, as suggested by the following example and as apparent from the histogram literature, turns out not to be strictly necessary; histograms might work well even when the average frequency in a bucket is a very rough approximation of the actual frequencies.

Example 1. Let F and G be two relations, each with a single attribute A over a common domain. We generate both F and G to have Zipf distributions with Zipf coefficient 0.5 and with average frequency as close to 100 as possible while keeping all frequencies integers. In relation F the frequency is decreasing (see Figure 2-1); in relation G the frequencies are randomly shuffled (see Figure 2-2). Observe from these figures that the One-bucket histogram approximation of the frequency, the line at 99.54, is a very poor estimate of the frequencies; thus we would expect poor performance when we use One-bucket histograms to estimate the size of the Equi-join of F and G.

(This chapter has been submitted for publication to Information Systems Databases: Their Creation, Management and Utilization and is reprinted with permission.)

Figure 2-1. Frequency graph of relation F.
Figure 2-2. Frequency graph of relation G.
Figure 2-3. Histogram of the size of the join result.

In the scenario described above, the One-bucket histogram prediction is the same irrespective of the particular shuffling of the domain of G. In the particular case depicted in the figures, the true size of the join is a mere 1% smaller than the prediction; to make sure this is not happening by chance, we picked 1000 random shufflings of G and plotted the distribution of $|F \bowtie_A G|$ in Figure 2-3. Notice that the sizes of the joins are compactly distributed (within a 10% relative error) around the prediction made using the One-bucket histogram.

As the previous example suggests, even though within a bucket the uniform frequency approximation is rather crude, the result of approximating the size of the join is surprisingly good and statistically stable. This prompts the question: why is this happening in spite of the uniform approximation not holding? As we will show in this chapter, the answer is that a more general statistical hypothesis holds, namely that the placement of the frequencies in the two relations is uncorrelated. This observation is the starting point for the current work.

In the rest of the chapter, we first formalize the problem we are solving in Section 2.2, followed by the explanation of histograms as function approximators and statistical nonparametric models in Section 2.3. The formalization of the random shuffling assumption is made in Section 2.4. We analyze the behavior of unidimensional histograms under the random shuffling assumption in Section 2.5 and compare it with the behavior of sampling and sketching in Section 2.6. Section 2.7 analyzes the behavior of histograms when the random shuffling assumption does not hold. Section 2.8 generalizes the random shuffling assumption to multidimensional histograms. The random shuffling assumption is extended to XSketches in Section 2.9. Section 2.10 comments on the random shuffling assumption, and the discussion is concluded in Section 2.11.

2.2 Problem Formulation

The general problem we are trying to solve is approximating aggregates over joins. As we will show in this section, both the selectivity estimation problem and the general SUM-like aggregates over joins problem can be rephrased as size-of-join estimation problems. Thus, all the results developed for the latter can be extended straightforwardly to the other two problems.

2.2.1 Size of Join Problem

Unidimensional size of join estimation problem. Let F and G be two relations, each with a single attribute A with domain I. Furthermore, let $f_i$ and $g_i$ be the frequency of the value i in F and G, respectively. With this, the size-of-join problem is to estimate the quantity

$$|F \bowtie_A G| = \sum_{i \in I} f_i \, g_i \qquad (2\text{-}1)$$

given synopses of relations F and G (if full information is available, we can simply compute the sum to get the exact answer).

Multidimensional size of join estimation problem. Let $F_1(A_1), \ldots, F_m(A_m)$ and $G(A_1, \ldots, A_m)$ be $m + 1$ relations, where $F_k$ and $G$ have the common attribute $A_k$ with domain $I_k$. Let $f_{i_k}$ be the frequency of the value $i_k$ in $F_k$ and $g_{(i_1,\ldots,i_m)}$ the frequency of the tuple $(i_1, \ldots, i_m)$ in $G$. Then the size of the join is

$$|F_1 \bowtie_{A_1} G \bowtie_{A_2} \cdots \bowtie_{A_m} F_m| = \sum_{i_1 \in I_1} \cdots \sum_{i_m \in I_m} f_{i_1} \cdots f_{i_m} \, g_{(i_1,\ldots,i_m)} \qquad (2\text{-}2)$$

The problem of multidimensional size-of-join estimation is to approximate Equation 2-2 given synopses of the relations $F_k$ and $G$.
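For concreteness, here is a small Python sketch of Equation 2-1 together with the corresponding One-bucket histogram estimate, which replaces every frequency by the bucket average and thus predicts N·f̄·ḡ. The frequency data below are illustrative, not the Zipf data of Example 1.

```python
import random

def size_of_join(f, g):
    """Exact |F join_A G| = sum_i f_i * g_i (Equation 2-1)."""
    return sum(fi * gi for fi, gi in zip(f, g))

def one_bucket_estimate(f, g):
    """One-bucket histogram: every frequency replaced by the bucket average."""
    n = len(f)
    fbar = sum(f) / n
    gbar = sum(g) / n
    return n * fbar * gbar

# Illustrative frequencies over a common domain of size 1000.
random.seed(0)
f = [random.randint(50, 150) for _ in range(1000)]
g = [random.randint(50, 150) for _ in range(1000)]
print(size_of_join(f, g), one_bucket_estimate(f, g))
```

Because the two frequency vectors here are generated independently, the placements are uncorrelated and the two numbers come out close, previewing the behavior the random shuffling assumption explains.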

Table 2-1. Notations used in the chapter

Symbol(s)          Meaning
F, G               Relations
A, A_k             Join attributes
I, I_k             Domains of join attributes A, A_k
N                  Size of domain I
i, j, i', j'       Indices going over domain I
f_i, g_i           Frequencies of value i in relations F, G
f̄, ḡ               Average frequencies in relations F, G
SJ(F)              Σ_{i∈I} f_i², the self-join size of F
SqErr(F)           Σ_{i∈I} (f_i − f̄)², the squared error of F
I_l                The l-th bucket of domain I
I(C)               Identity function: 1 when C is true, 0 otherwise
σ                  Uniform random permutation
x_σ                Random variable modeling the frequency
n                  Number of buckets
N                  Size of the sample
m                  Number of attributes
P[p]               Probability that predicate p holds
E[X]               Expected value of random variable X
Var[X]             Variance of X: Var[X] = E[X²] − E[X]²
Cov(X, Y)          Covariance of X and Y: Cov(X, Y) = E[XY] − E[X]E[Y]
(i_1, …, i_m)      Indices going over domain I with m dimensions
T                  XML document nodes

2.2.2 Selectivity Estimation Problem

For selectivity estimation problems, we show how they can be reduced to size-of-join estimation problems in the case of Bi-dimensional selectivity; the reduction easily generalizes to multidimensional selectivity. This means that the results we develop for size-of-join estimation readily apply to this problem as well. We also show an alternative reduction for arbitrary selectivity predicates that is not as efficient but works in all scenarios. Let G be a relation with two attributes A and B with domains I and J, respectively.

Given $I' \subseteq I$ and $J' \subseteq J$, estimate the quantity

$$\sigma_{I' \times J'}(G) = \sum_{i \in I'} \sum_{j \in J'} g_{ij}$$

With I(C) the identity function that takes value 1 if condition C is true and value 0 otherwise, by simply setting up relations F and H so that $f_i = I(i \in I')$ and $h_j = I(j \in J')$, we have:

$$|F \bowtie_A G \bowtie_B H| = \sum_{i \in I} \sum_{j \in J} f_i \, g_{ij} \, h_j = \sum_{i \in I} \sum_{j \in J} I(i \in I') \, g_{ij} \, I(j \in J') = \sum_{i \in I'} \sum_{j \in J'} g_{ij} = \sigma_{I' \times J'}(G)$$

Observe that the joins are with unidimensional relations; this is important since, in general, the smaller the dimensionality, the more efficient the estimation. Essentially the same technique can be applied to the multidimensional case when the selection predicate can be rewritten as a conjunction of predicates, one for each attribute. In that case, the selectivity estimation problem is reduced to the problem of estimating the size of a Star-join involving a unidimensional virtual relation for each attribute and the original relation.

This can be generalized further by considering disjunctions of conjunctions of predicates involving individual attributes (or expressions that can be rewritten in this way). In this case, the expression can be rewritten as a disjunction of conjunctions with the extra property that the conjunctions do not contain common cases. This extra property ensures that the estimate of the selectivity of the initial predicate is simply the sum of the estimates for the individual conjunctions. Note that the selectivity of the predicate is thus the size of the join of the virtual relations with the original relation. This involves joins on multiple attributes between two relations, which are more complicated than joins on a single attribute. Moreover, constructing synopses of such virtual relations can be challenging (the tuples satisfying the selection predicate might actually have to be enumerated one by one).
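The reduction can be verified mechanically. Below is a minimal Python sketch that builds the indicator relations F and H and evaluates the triple join as the double sum above; the relation G, stored as a dictionary of frequencies g[(i, j)], is illustrative.

```python
def selectivity_via_join(g, I_sel, J_sel, I_dom, J_dom):
    """sigma_{I' x J'}(G) computed as |F join_A G join_B H| with
    f_i = I(i in I') and h_j = I(j in J')."""
    f = {i: 1 if i in I_sel else 0 for i in I_dom}   # indicator relation F
    h = {j: 1 if j in J_sel else 0 for j in J_dom}   # indicator relation H
    return sum(f[i] * g.get((i, j), 0) * h[j]
               for i in I_dom for j in J_dom)

# Illustrative bidimensional frequency table g[(i, j)].
g = {(1, 1): 3, (1, 2): 5, (2, 1): 7, (2, 2): 11}
I_dom, J_dom = {1, 2}, {1, 2}
print(selectivity_via_join(g, {1}, {1, 2}, I_dom, J_dom))  # 3 + 5 = 8
```

The indicator frequencies simply zero out every term outside I' × J', leaving exactly the selectivity sum.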

2.2.3 Aggregates over Joins Problem

In order to gain insight and to ease the understanding, we primarily focus on the problem of computing aggregates over the join of two relations; we comment towards the end of the section on how these ideas extend to larger joins. Let F and G be two relations that contain a join attribute A and possibly other attributes. Let us first look at aggregates of the form

$$\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G) = \sum_{t \in F \bowtie_A G} F_F(t.F) \, F_G(t.G)$$

with $t.F$ the part of the tuple in the join that comes from relation F (similarly for G), and $F_F$ and $F_G$ arbitrary functions. The only requirement for this to work is to be able to rewrite the expression summed over the join as a product of expressions depending on the attributes of the two relations; if such a rewriting is possible, we say that the aggregate is relation factorizable. To evaluate the sum of a relation factorizable aggregate over the join of F and G, we observe that:

$$\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G) = \sum_{t \in F \bowtie_A G} F_F(t.F) \, F_G(t.G) = \sum_{i \in I} \; \sum_{\substack{t \in F \bowtie_A G \\ t.A = i}} F_F(t.F) \, F_G(t.G) = \sum_{i \in I} \Big( \sum_{\substack{t \in F \\ t.A = i}} F_F(t) \Big) \Big( \sum_{\substack{t \in G \\ t.A = i}} F_G(t) \Big) = \sum_{i \in I} f'_i \, g'_i \qquad (2\text{-}3)$$

where $f'_i$ and $g'_i$ are just compact notations for the expressions $\sum_{t \in F,\, t.A = i} F_F(t)$ and $\sum_{t \in G,\, t.A = i} F_G(t)$, respectively. The important observation is that we can use any method designed for size-of-join estimation to estimate this aggregate as well, by simply replacing $f_i$ by $f'_i$ and $g_i$ by $g'_i$, since then the expression in Equation 2-1 is identical to the last expression in Equation 2-3.
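A short Python sketch of the rewriting in Equation 2-3: each relation is folded into per-value partial aggregates f'_i and g'_i, whose dot product gives the aggregate over the join. The relation layout and the functions F_F and F_G used here are illustrative.

```python
from collections import defaultdict

def fold(relation, key, func):
    """f'_i = sum of func(t) over tuples t of the relation with t[key] = i."""
    acc = defaultdict(float)
    for t in relation:
        acc[t[key]] += func(t)
    return acc

def sum_over_join(F, G, key, FF, FG):
    """SUM_{FF * FG}(F join_key G) evaluated via Equation 2-3,
    without materializing the join."""
    f, g = fold(F, key, FF), fold(G, key, FG)
    return sum(f[i] * g[i] for i in f.keys() & g.keys())

# Illustrative relations: sum of price * qty over the join on 'a'.
F = [{"a": 1, "price": 2.0}, {"a": 1, "price": 3.0}, {"a": 2, "price": 4.0}]
G = [{"a": 1, "qty": 10}, {"a": 2, "qty": 5}]
total = sum_over_join(F, G, "a", lambda t: t["price"], lambda t: t["qty"])
print(total)  # (2 + 3) * 10 + 4 * 5 = 70
```

Setting FF and FG to the constant function 1 recovers the plain size of join, which is why any size-of-join estimator carries over unchanged.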

Thus, computing such aggregates is as easy as computing sizes of joins; the complexity is in the join, not in the expression being summed up.

With the ability to compute estimates of aggregates of the form $\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G)$, we can immediately compute aggregates of the form AVG and STD as well. For example, to estimate $\mathrm{AVG}_B(F(A, B) \bowtie_A G(A))$ we can estimate $\mathrm{SUM}_B(F(A, B) \bowtie_A G(A))$ and $|F \bowtie_A G|$ and simply take their ratio. The ideas used to reduce aggregate estimation problems of the form $\mathrm{SUM}_{F_F \cdot F_G}(F \bowtie_A G)$ to size-of-join estimation problems can be readily generalized to star joins involving multiple relations; a similar rewriting can be made.

Since both the selectivity estimation problem and the COUNT, SUM, AVG, and STD aggregate estimation problems can be reduced to size-of-join problems, for the rest of the chapter we focus only on the size-of-join problem. The problem of estimating MIN and MAX aggregates cannot be reduced to size-of-join problems, but no approximate methods for estimating such aggregates exist either. The main obstacle to such developments is the fact that there is no way to predict extreme values using statistical methods unless very strong statistical assumptions are made (particular distributions of the data have to be assumed).

2.2.4 Comments on Obtaining Error Guarantees from Expected Value and Variance Estimates

The standard technique [8, 91] to obtain error guarantees, i.e., confidence intervals, for an estimate is to compute the expected value and the variance and then to use either distribution independent bounds, given by Chernoff's and Chebyshev's inequalities, or distribution dependent bounds. In the latter case, usually the Central Limit Theorem or one of its generalizations is used to argue that the distribution of the estimate is close to normal, and then error bounds based on a normal distribution with the same expected value and variance are produced. For all the estimates in this chapter, either the distribution independent bounds could be used to obtain a strict characterization of the results, or the normal distribution based bounds, since all estimates can be expressed as weighted sums of independent identically distributed (iid) random variables and so the Central Limit Theorem applies.

In view of the above discussion, in order to simplify the exposition and the comparison, throughout the chapter we will just provide results in the form of expected values and variances or squared errors (the variance is equal to the squared error if the estimator is unbiased). Actual error guarantees can be obtained straightforwardly using the above mentioned techniques.

2.3 Histograms as Function Approximators and Statistical Nonparametric Models

Histograms were first introduced in Statistics as an alternative to parametric models (parametric models are models that depend on a small, fixed set of parameters; for example, normal distributions are parametric models since they depend on only two parameters, the mean and the variance). The main idea is to approximate, in a function approximation sense, the probability density function of an unknown distribution. The histogram can then be used instead of the unknown probability density function to characterize the distribution. While results from Statistics clearly point out that no guarantee with respect to the goodness of the approximation of the p.d.f. can be given, good guarantees can be provided for the computation of the c.d.f. This is particularly useful when the cumulative distribution function has to be determined at various points (for example, to produce confidence intervals). It is easier to explain why this is the case by translating the problem into a database problem: selectivity estimation. With our problem definition, if we have a selection predicate of the form G.A <= 10 and we want to estimate its selectivity, by constructing (or using a previously constructed) histogram on this attribute, the estimate is simply the sum of the masses (i.e., total number of tuples) of the buckets completely included in the range (-∞, 10] plus the proportional part of the bucket that overlaps the point 10.
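A minimal Python sketch of this computation, with buckets represented as (lo, hi, mass) triples over half-open ranges; the equi-width bucket layout below is an illustrative choice, not a construction from this chapter.

```python
def estimate_leq(buckets, q):
    """Estimate |sigma_{A <= q}(G)| from a histogram whose buckets are
    (lo, hi, mass) over [lo, hi): full mass for buckets inside (-inf, q],
    a proportional fraction for the bucket that overlaps q."""
    total = 0.0
    for lo, hi, mass in buckets:
        if hi <= q:
            total += mass                         # fully covered bucket
        elif lo <= q:
            total += mass * (q - lo) / (hi - lo)  # partially overlapping bucket
    return total

# Illustrative equi-width histogram on attribute A.
buckets = [(0, 5, 120), (5, 10, 80), (10, 15, 40)]
print(estimate_leq(buckets, 10))   # 200: only full buckets, no error source
print(estimate_leq(buckets, 12))   # 200 + 40 * 2/5 = 216
```

Only the partially overlapping bucket contributes any estimation error, which is the point made in the next paragraph.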

From the point of view of selectivity estimation, by the above discussion, the error comes only from the bucket that partially overlaps the interval; for the buckets that are fully included in the interval there is no source of error. This observation suggests that the error is kept under control since, in absolute value, it is less than the mass of the overlapping bucket. In the context of statistics there is an extra complication coming from the fact that the data provided form a sample, in which case natural statistical fluctuations would result in errors for the counts in each bucket. This is not a problem in databases, since the histogram is constructed over the entire dataset, not just a sample.

2.4 Random Shuffling Assumption

Under the traditional uniform assumption, histograms should perform well only when the average frequency approximates the frequencies in a bucket well. As we have seen in the introduction (Example 1), the One-bucket histogram behaved exceptionally well even though the average frequency was a poor approximator of the frequencies. The uniform frequency assumption does not explain this good behavior. Instead of proposing a new type of histogram, the goal in this section is to find a better explanation and to explore it with statistical analysis. We want to formalize a statistical model that can be used to characterize histograms.

The starting point for our investigation is the random rearrangement of the elements of the domain of relation G in Example 1. The rearrangement did not change the skew of the distribution; it just decorrelated the matching of the frequencies in F and G. We formalize this random rearrangement in this section as the statistically well-defined notion of the random shuffling assumption. As we will see, this formalization leads to formulas for the error of histograms and allows us to compare them with other approximation methods like sketching and sampling. Later in the chapter we ask the counterpart question: what is the error behavior of histograms if the assumption is false?
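Before formalizing the assumption, its effect is easy to probe empirically in the spirit of Example 1: fix the frequencies of F, randomly shuffle the placement of the frequencies of G many times, and observe how tightly the resulting join sizes concentrate around the One-bucket prediction. A minimal Python sketch follows; the Zipf-like generator and all parameters are illustrative, not the exact data of Example 1.

```python
import random

def zipf_freqs(n, s=0.5, target_avg=100):
    """Integer Zipf-like frequencies with average close to target_avg."""
    raw = [1.0 / (i + 1) ** s for i in range(n)]
    scale = target_avg * n / sum(raw)
    return [max(1, round(scale * r)) for r in raw]

n = 1000
f = zipf_freqs(n)                 # decreasing frequencies, as in relation F
g = zipf_freqs(n)
prediction = n * (sum(f) / n) * (sum(g) / n)   # One-bucket histogram estimate

random.seed(0)
sizes = []
for _ in range(1000):
    random.shuffle(g)             # one random shuffling of the domain of G
    sizes.append(sum(fi * gi for fi, gi in zip(f, g)))

rel_err = [abs(s - prediction) / prediction for s in sizes]
print(max(rel_err))   # typically small: join sizes concentrate near the prediction
```

The tight concentration appears even though the flat average is a poor pointwise approximation of either frequency vector, which is exactly the phenomenon the random shuffling assumption is designed to capture.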


More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

c 2011 Nisha Somnath

c 2011 Nisha Somnath c 2011 Nisha Somnath HIERARCHICAL SUPERVISORY CONTROL OF COMPLEX PETRI NETS BY NISHA SOMNATH THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Aerospace

More information

Integrated reliable and robust design

Integrated reliable and robust design Scholars' Mine Masters Theses Student Research & Creative Works Spring 011 Integrated reliable and robust design Gowrishankar Ravichandran Follow this and additional works at: http://scholarsmine.mst.edu/masters_theses

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

August 27, Review of Algebra & Logic. Charles Delman. The Language and Logic of Mathematics. The Real Number System. Relations and Functions

August 27, Review of Algebra & Logic. Charles Delman. The Language and Logic of Mathematics. The Real Number System. Relations and Functions and of August 27, 2015 and of 1 and of 2 3 4 You Must Make al Connections and of Understanding higher mathematics requires making logical connections between ideas. Please take heed now! You cannot learn

More information

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds

Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Randomized Algorithms Lecture 5: The Principle of Deferred Decisions. Chernoff Bounds Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized

More information

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds

Lecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds Lecture 25 of 42 PAC Learning, VC Dimension, and Mistake Bounds Thursday, 15 March 2007 William H. Hsu, KSU http://www.kddresearch.org/courses/spring2007/cis732 Readings: Sections 7.4.17.4.3, 7.5.17.5.3,

More information

Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets

Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets Assortment Optimization under the Multinomial Logit Model with Nested Consideration Sets Jacob Feldman School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853,

More information

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Part II: Strings of Pearls G n,r with Biased Perturbations Jörg Sameith Graduiertenkolleg

More information

Quick Sort Notes , Spring 2010

Quick Sort Notes , Spring 2010 Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

CONVERGENCE OF RANDOM SERIES AND MARTINGALES

CONVERGENCE OF RANDOM SERIES AND MARTINGALES CONVERGENCE OF RANDOM SERIES AND MARTINGALES WESLEY LEE Abstract. This paper is an introduction to probability from a measuretheoretic standpoint. After covering probability spaces, it delves into the

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Discrete Structures Proofwriting Checklist

Discrete Structures Proofwriting Checklist CS103 Winter 2019 Discrete Structures Proofwriting Checklist Cynthia Lee Keith Schwarz Now that we re transitioning to writing proofs about discrete structures like binary relations, functions, and graphs,

More information

Where do pseudo-random generators come from?

Where do pseudo-random generators come from? Computer Science 2426F Fall, 2018 St. George Campus University of Toronto Notes #6 (for Lecture 9) Where do pseudo-random generators come from? Later we will define One-way Functions: functions that are

More information

6.842 Randomness and Computation Lecture 5

6.842 Randomness and Computation Lecture 5 6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its

More information

INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS. Tao Jiang. A Thesis

INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS. Tao Jiang. A Thesis INFORMATION APPROACH FOR CHANGE POINT DETECTION OF WEIBULL MODELS WITH APPLICATIONS Tao Jiang A Thesis Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the

More information

Errata and Proofs for Quickr [2]

Errata and Proofs for Quickr [2] Errata and Proofs for Quickr [2] Srikanth Kandula 1 Errata We point out some errors in the SIGMOD version of our Quickr [2] paper. The transitivity theorem, in Proposition 1 of Quickr, has a revision in

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

Examining the accuracy of the normal approximation to the poisson random variable

Examining the accuracy of the normal approximation to the poisson random variable Eastern Michigan University DigitalCommons@EMU Master's Theses and Doctoral Dissertations Master's Theses, and Doctoral Dissertations, and Graduate Capstone Projects 2009 Examining the accuracy of the

More information

Factorized Relational Databases Olteanu and Závodný, University of Oxford

Factorized Relational Databases   Olteanu and Závodný, University of Oxford November 8, 2013 Database Seminar, U Washington Factorized Relational Databases http://www.cs.ox.ac.uk/projects/fd/ Olteanu and Závodný, University of Oxford Factorized Representations of Relations Cust

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Probabilistic Characterization of Nearest Neighbor Classifier

Probabilistic Characterization of Nearest Neighbor Classifier Noname manuscript No. (will be inserted by the editor) Probabilistic Characterization of Nearest Neighbor Classifier Amit Dhurandhar Alin Dobra Received: date / Accepted: date Abstract The k-nearest Neighbor

More information

The Growth of Functions. A Practical Introduction with as Little Theory as possible

The Growth of Functions. A Practical Introduction with as Little Theory as possible The Growth of Functions A Practical Introduction with as Little Theory as possible Complexity of Algorithms (1) Before we talk about the growth of functions and the concept of order, let s discuss why

More information

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound

Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Abstract & Applied Linear Algebra (Chapters 1-2) James A. Bernhard University of Puget Sound Copyright 2018 by James A. Bernhard Contents 1 Vector spaces 3 1.1 Definitions and basic properties.................

More information

Randomized Complexity Classes; RP

Randomized Complexity Classes; RP Randomized Complexity Classes; RP Let N be a polynomial-time precise NTM that runs in time p(n) and has 2 nondeterministic choices at each step. N is a polynomial Monte Carlo Turing machine for a language

More information

Divisible Load Scheduling

Divisible Load Scheduling Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute

More information

2.3 Some Properties of Continuous Functions

2.3 Some Properties of Continuous Functions 2.3 Some Properties of Continuous Functions In this section we look at some properties, some quite deep, shared by all continuous functions. They are known as the following: 1. Preservation of sign property

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]

More information

Computability of Heyting algebras and. Distributive Lattices

Computability of Heyting algebras and. Distributive Lattices Computability of Heyting algebras and Distributive Lattices Amy Turlington, Ph.D. University of Connecticut, 2010 Distributive lattices are studied from the viewpoint of effective algebra. In particular,

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Lecture Wigner-Ville Distributions

Lecture Wigner-Ville Distributions Introduction to Time-Frequency Analysis and Wavelet Transforms Prof. Arun K. Tangirala Department of Chemical Engineering Indian Institute of Technology, Madras Lecture - 6.1 Wigner-Ville Distributions

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Experiment 2 Random Error and Basic Statistics

Experiment 2 Random Error and Basic Statistics PHY9 Experiment 2: Random Error and Basic Statistics 8/5/2006 Page Experiment 2 Random Error and Basic Statistics Homework 2: Turn in at start of experiment. Readings: Taylor chapter 4: introduction, sections

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

SOME RESOURCE ALLOCATION PROBLEMS

SOME RESOURCE ALLOCATION PROBLEMS SOME RESOURCE ALLOCATION PROBLEMS A Dissertation Presented to the Faculty of the Graduate School of Cornell University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

More information

2. Probability. Chris Piech and Mehran Sahami. Oct 2017

2. Probability. Chris Piech and Mehran Sahami. Oct 2017 2. Probability Chris Piech and Mehran Sahami Oct 2017 1 Introduction It is that time in the quarter (it is still week one) when we get to talk about probability. Again we are going to build up from first

More information

Some notes on streaming algorithms continued

Some notes on streaming algorithms continued U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.

More information

2.5.2 Basic CNF/DNF Transformation

2.5.2 Basic CNF/DNF Transformation 2.5. NORMAL FORMS 39 On the other hand, checking the unsatisfiability of CNF formulas or the validity of DNF formulas is conp-complete. For any propositional formula φ there is an equivalent formula in

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

2012 IEEE International Symposium on Information Theory Proceedings

2012 IEEE International Symposium on Information Theory Proceedings Decoding of Cyclic Codes over Symbol-Pair Read Channels Eitan Yaakobi, Jehoshua Bruck, and Paul H Siegel Electrical Engineering Department, California Institute of Technology, Pasadena, CA 9115, USA Electrical

More information

Compiling Knowledge into Decomposable Negation Normal Form

Compiling Knowledge into Decomposable Negation Normal Form Compiling Knowledge into Decomposable Negation Normal Form Adnan Darwiche Cognitive Systems Laboratory Department of Computer Science University of California Los Angeles, CA 90024 darwiche@cs. ucla. edu

More information

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models Contents Mathematical Reasoning 3.1 Mathematical Models........................... 3. Mathematical Proof............................ 4..1 Structure of Proofs........................ 4.. Direct Method..........................

More information

Finding Frequent Items in Probabilistic Data

Finding Frequent Items in Probabilistic Data Finding Frequent Items in Probabilistic Data Qin Zhang, Hong Kong University of Science & Technology Feifei Li, Florida State University Ke Yi, Hong Kong University of Science & Technology SIGMOD 2008

More information

Computational Learning Theory

Computational Learning Theory 1 Computational Learning Theory 2 Computational learning theory Introduction Is it possible to identify classes of learning problems that are inherently easy or difficult? Can we characterize the number

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments

Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Evaluation of Probabilistic Queries over Imprecise Data in Constantly-Evolving Environments Reynold Cheng, Dmitri V. Kalashnikov Sunil Prabhakar The Hong Kong Polytechnic University, Hung Hom, Kowloon,

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Price: $25 (incl. T-Shirt, morning tea and lunch) Visit:

Price: $25 (incl. T-Shirt, morning tea and lunch) Visit: Three days of interesting talks & workshops from industry experts across Australia Explore new computing topics Network with students & employers in Brisbane Price: $25 (incl. T-Shirt, morning tea and

More information

Exact and Approximate Equilibria for Optimal Group Network Formation

Exact and Approximate Equilibria for Optimal Group Network Formation Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich and Bugra Caskurlu Computer Science Department, RPI, 110 8th Street, Troy, NY 12180 {eanshel,caskub}@cs.rpi.edu Abstract.

More information

THE SURE-THING PRINCIPLE AND P2

THE SURE-THING PRINCIPLE AND P2 Economics Letters, 159: 221 223, 2017 DOI 10.1016/j.econlet.2017.07.027 THE SURE-THING PRINCIPLE AND P2 YANG LIU Abstract. This paper offers a fine analysis of different versions of the well known sure-thing

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information