A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases

Thomas Bernecker#, Tobias Emrich#, Hans-Peter Kriegel#, Nikos Mamoulis*, Matthias Renz#, Andreas Züfle#
# Department of Computer Science, Ludwig-Maximilians-Universität München, Oettingenstr. 67, Munich, Germany
  {bernecker, emrich, kriegel, renz, zuefle}@dbs.ifi.lmu.de
* Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong
  nikos@cs.hku.hk

Abstract—In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probability density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows us to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.

I. INTRODUCTION

In the past two decades, there has been a great deal of interest in developing efficient and effective methods for similarity queries, e.g. k-nearest neighbor search, reverse k-nearest neighbor search and ranking in spatial, temporal, multimedia and sensor databases. Many applications dealing with such data have to cope with uncertain or imprecise data. In this work, we introduce a novel scalable pruning approach to identify candidates for a class of probabilistic similarity queries. Generally speaking, probabilistic similarity queries compute, for each database object o ∈ D, the probability that a given query predicate is fulfilled. Our approach addresses probabilistic similarity queries where the query predicate is based on object (distance) relations, i.e. the event that an object B belongs to the result set depends on the relation between its distance to the query object R and the distance of another object A to the query object. As examples, we apply our novel pruning method to the most prominent queries of the above-mentioned class, including the probabilistic k-nearest neighbor (PkNN) query, the probabilistic reverse k-nearest neighbor (PRkNN) query and the probabilistic inverse ranking query.

Fig. 1. A dominates B w.r.t. R with high probability.

A. Uncertainty Model

In this paper, we assume that the database D consists of multi-attribute objects o_1, ..., o_N that may have uncertain attribute values. An uncertain attribute is defined as follows:

Definition 1 (Probabilistic Attribute). A probabilistic attribute attr of object o_i is a random variable drawn from a probability distribution with density function f_i^attr. An uncertain object o_i has at least one uncertain attribute value.
The function f_i denotes the multi-dimensional probability density function (PDF) of o_i that combines the density functions f_i^attr of all probabilistic attributes attr of o_i. Following the convention of uncertain databases [6], [8], [9], [4], [2], [24], we assume that f_i is (minimally) bounded by an uncertainty region R_i such that ∀x ∉ R_i: f_i(x) = 0 and ∫_{R_i} f_i(x) dx ≤ 1. Specifically, the case ∫_{R_i} f_i(x) dx < 1 implements existential uncertainty, i.e. object o_i may not exist in the database at all with a probability greater than zero. In this paper we focus on the case ∫_{R_i} f_i(x) dx = 1, but the proposed concepts can be easily adapted to existentially uncertain objects.

Although our approach is also applicable to unbounded PDFs, e.g. Gaussian PDFs, here we assume that f_i exceeds zero only within a bounded region. This is a realistic assumption because the spectrum of possible attribute values is usually bounded, and it is commonly used in related work, e.g. [8], [9] and [6]. Even if f_i is given as an unbounded PDF, a common strategy is to truncate PDF tails with negligible probabilities and normalize the resulting PDF. In particular, [6] shows that for a reasonably low truncation threshold, the impact on the accuracy of probabilistic ranking queries is quite low, while the impact on query performance is very high. In this way, each uncertain object can be considered as a d-dimensional rectangle with an associated multi-dimensional object PDF (cf. Figure 1). Here, we assume that uncertain attributes may be mutually dependent. Therefore the object PDF can have any arbitrary form and, in general, cannot simply be derived from the marginal distributions of the uncertain attributes. Note that in many applications a discrete uncertainty model is appropriate, meaning that the probability distribution of an uncertain object is given by a finite number of alternatives, each assigned a probability. This can be seen as a special case of our model.

B. Problem Formulation

We address the problem of detecting, for a given uncertain object B, the number of uncertain objects of an uncertain database D that are closer to (i.e. dominate) a reference object R than B. We call this number the domination count of B w.r.t. R, as defined below:

Definition 2 (Domination). Consider an uncertain database D = {o_1, ..., o_N} and an uncertain reference object R. Let A, B ∈ D. Dom(A, B, R) is the random indicator variable that is 1 iff A dominates B w.r.t. R, formally:

Dom(A, B, R) = 1, if dist(a, r) < dist(b, r); 0, otherwise,

where a, b and r are samples drawn from the PDFs of A, B and R, respectively, and dist is a distance function on vector objects.¹

¹ We assume Euclidean distance for the remainder of the paper, but the techniques can be applied to any L_p norm.

Definition 3 (Domination Count). Consider an uncertain database D = {o_1, ..., o_N} and an uncertain reference object R. For each uncertain object B ∈ D, let DomCount(B, R) be the random variable of the number of uncertain objects A ∈ D (A ≠ B) that are closer to R than B:

DomCount(B, R) = Σ_{A ∈ D, A ≠ B} Dom(A, B, R)

DomCount(B, R) is the sum of N − 1 not necessarily identically distributed and not necessarily independent Bernoulli variables. The problem solved in this paper is to efficiently compute the probability distribution of DomCount(B, R) (B ∈ D), formally introduced by means of the probabilistic domination (cf. Section III) and the probabilistic domination count (cf. Section IV). Determining domination is a central module for most types of similarity queries in order to identify true hits and true drops (pruning). In the context of probabilistic similarity queries, knowledge about the PDF of DomCount(B, R) can be used to find out whether B satisfies the query predicate. For example, for a probabilistic 5NN query with probability threshold τ = 10% and query object Q, an object B can be pruned (returned as a true hit) if the probability P(DomCount(B, Q) < 5) is less (more) than 10%.
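To make Definitions 2 and 3 concrete, the following minimal Monte Carlo sketch estimates the PDF of DomCount(B, R) by sampling possible worlds. It is included only as a reference point and is not part of the paper's method, whose contribution is precisely to avoid this kind of costly computation; all function names are hypothetical. Note that each world draws R only once and reuses it for every A, which is exactly the dependency revisited in Section IV.

    import numpy as np

    def domcount_pdf_mc(sample_B, sample_R, sample_others, n_worlds=10_000):
        """Estimate P(DomCount(B, R) = k) by sampling possible worlds.

        Each sample_* argument is a callable mapping n to an (n, d) array
        of draws from the object's PDF; inter-object independence is assumed.
        """
        b = sample_B(n_worlds)                  # one row = one possible world
        r = sample_R(n_worlds)                  # R is drawn once per world ...
        dist_b = np.linalg.norm(b - r, axis=1)  # ... and reused for every A
        count = np.zeros(n_worlds, dtype=int)
        for sample_A in sample_others:          # Dom(A, B, R) per world
            a = sample_A(n_worlds)
            count += np.linalg.norm(a - r, axis=1) < dist_b
        # empirical PDF of the domination count
        return np.bincount(count, minlength=len(sample_others) + 1) / n_worlds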
C. Overview

Given an uncertain database D = {o_1, ..., o_N} and an uncertain reference object R, our objective is to efficiently derive the distribution of DomCount(B, R) for any uncertain object B ∈ D and use it in the computation of probabilistic similarity queries. First (Section III), we build on the methodology of [5] to efficiently find the complete set of objects in D that definitely dominate (are dominated by) B w.r.t. R. At the same time, we find the set of objects whose dominance relationship to B is uncertain. Using a decomposition technique, for each object A in this set, we can derive a lower and an upper bound for PDom(A, B, R), i.e. the probability that A dominates B w.r.t. R. In Section IV, we show that due to dependencies between object distances to R, these probabilities cannot be combined in a straightforward manner to approximate the distribution of DomCount(B, R). We propose a solution that copes with these dependencies and introduce techniques that help to compute the probabilistic domination count in an efficient way. In particular, we prove that the bounds of PDom(A, B, R) are mutually independent if they are computed without a decomposition of B and R. Then, we provide a class of uncertain generating functions that use these bounds to build the distribution of DomCount(B, R). We then propose an algorithm which progressively refines DomCount(B, R) by iteratively decomposing the objects that influence its computation (Section V). Section VI shows how to apply this iterative probabilistic domination count refinement process to evaluate several types of probabilistic similarity queries. In Section VII, we experimentally demonstrate the effectiveness and efficiency of our probabilistic pruning methods for various parameter settings on artificial and real-world datasets.

II. RELATED WORK

The management of uncertain data has gained increasing interest in diverse application fields, e.g. sensor monitoring [2], traffic analysis, location-based services [27] etc. Thus, modelling probabilistic databases has become very important in the literature, e.g. [23], [24]. In general, these models can be classified into two types: discrete and continuous uncertainty models. Discrete models represent each uncertain object by a discrete set of alternative values, each associated with a probability.

This model is in general adopted for probabilistic databases, where tuples are associated with existential probabilities, e.g. [4], [9], [25], [6]. In this work, we concentrate on the continuous model, in which an uncertain object is represented by a probability density function (PDF) within the vector space. In general, similarity search methods based on this model involve expensive integrations of the PDFs; hence, special approximation and indexing techniques for efficient query processing are typically employed [3], [26]. Uncertain similarity query processing has focused on various aspects. A lot of existing work dealing with uncertain data addresses probabilistic nearest neighbor (NN) queries for certain query objects [8] and for uncertain queries [7]. To reduce the computational effort, [9] adds threshold constraints in order to retrieve only objects whose probability of being the nearest neighbor exceeds a user-specified threshold, controlling the desired confidence required in a query answer. Similar query semantics in probabilistic databases are provided by top-k nearest neighbor queries [6], where the k most probable results of being the nearest neighbor of a certain query point are returned. Existing solutions for probabilistic k-nearest neighbor (kNN) queries are restricted to expected distances of the uncertain objects to the query object [22] or also use a threshold constraint [0]. However, the use of expected distances does not adhere to the possible world semantics and may thus produce very inaccurate results that may have a very small probability of being an actual result ([25], [9]). Several approaches return the full result of a query as a ranking of probabilistic objects according to their distance to a certain query point [4], [4], [9], [25]. However, all these prior works have in common that the query is given as a single (certain) point. To the best of our knowledge, k-nearest neighbor queries as well as ranking queries on uncertain data where the query object is allowed to be uncertain have not been addressed so far. Probabilistic reverse nearest neighbor (RNN) queries on data based on discrete and continuous uncertainty models have been addressed in [7]. Similar to our solution, the uncertainty regions of the data are modelled by MBRs. Based on these approximations, the authors of [7] are able to apply a combination of spatial, metric and probabilistic pruning criteria to efficiently answer queries. All of the above approaches that use MBRs as approximations for uncertain objects utilize the minimum/maximum distance approximations in order to remove possible candidates. However, the pruning power can be improved using geometry-based pruning techniques, as shown in [5]. In this context, [20] introduces a geometric pruning technique that can be utilized to answer monochromatic and bichromatic probabilistic RNN queries for arbitrary object distributions. The framework that we introduce in this paper can be used to answer probabilistic (threshold) kNN queries and probabilistic reverse (threshold) kNN queries, as well as probabilistic ranking and inverse ranking queries, for uncertain query objects.

III. SIMILARITY DOMINATION ON UNCERTAIN DATA

In this section, we tackle the following problem: Given three uncertain objects A, B and R in a multidimensional space R^d, determine whether object A is closer to R than B w.r.t. a distance function defined on the objects in R^d. If this is the case, we say A dominates B w.r.t. R.
In contrast to [5], where this problem is solved for certain data, in the context of uncertain objects this domination relation is not a predicate that is either true or false, but rather a (dichotomous) random variable as defined in Definition 2. In the example depicted in Figure 1, there are three uncertain objects A, B and R, each bounded by a rectangle representing the possible locations of the object in R^2. The PDFs of A, B and R are depicted as well. In this scenario, we cannot determine for sure whether object A dominates B w.r.t. R. However, it is possible to determine that object A dominates object B w.r.t. R with a high probability. The problem at issue is to determine the domination probability, defined as:

Definition 4 (Probabilistic Domination). Given three uncertain objects A, B and R, the probabilistic domination PDom(A, B, R) denotes the probability that A dominates B w.r.t. R.

Naively, we can compute PDom(A, B, R) by simply integrating the probability of all possible worlds in which A dominates B w.r.t. R, exploiting inter-object independence:

PDom(A, B, R) = ∫_{a ∈ A} ∫_{b ∈ B} ∫_{r ∈ R} δ(a, b, r) · P(A = a) · P(B = b) · P(R = r) da db dr,

where δ(a, b, r) is the following indicator function:

δ(a, b, r) = 1, if dist(a, r) < dist(b, r); 0, else.

The problem of this naive approach is the computational cost of the triple integral. The integrals of the PDFs of A, B and R may in general not be representable as a closed-form expression, and the integral of δ(a, b, r) does not have a closed-form expression. Therefore, an expensive numeric approximation is required for this approach. In the rest of this section we propose methods that efficiently derive bounds for PDom(A, B, R), which can be used to prune objects while avoiding integral computations.

A. Complete Domination

First, we show how to detect whether A completely dominates B w.r.t. R (i.e. whether PDom(A, B, R) = 1) regardless of the probability distributions assigned to the rectangular uncertainty regions. The state-of-the-art criterion to detect spatial domination on rectangular uncertainty regions is the use of minimum/maximum distance approximations. This criterion states that A dominates B w.r.t. R if the minimum distance between R and B is greater than the maximum distance between R and A. Although correct, this criterion is not tight (cf. [5]), i.e. not each case where A dominates B w.r.t. R is detected by the min/max-domination criterion.

Fig. 2. Similarity domination: (a) complete domination, (b) probabilistic domination.

The problem is that the dependency between the two distances, between A and R and between B and R, is ignored. Obviously, the distance between A and R as well as the distance between B and R depends on the location of R. However, since R can only have a unique location within its uncertainty region, both distances are mutually dependent. Therefore, we adopt the spatial domination concepts proposed in [5] for rectangular uncertainty regions.

Corollary 1 (Complete Domination). Let A, B, R be uncertain objects with rectangular uncertainty regions. Then the following statement holds:

Σ_{i=1}^{d} max_{r_i ∈ {R_i^min, R_i^max}} (MaxDist(A_i, r_i)^p − MinDist(B_i, r_i)^p) < 0  ⇒  PDom(A, B, R) = 1,

where A_i, B_i and R_i denote the projection intervals of the respective rectangular uncertainty regions of A, B and R on the i-th dimension; R_i^min (R_i^max) denotes the lower (upper) bound of interval R_i, and p corresponds to the used L_p norm. The functions MaxDist(A_i, r_i) and MinDist(A_i, r_i) denote the maximal (respectively minimal) distance between the one-dimensional interval A_i and the one-dimensional point r_i.

Corollary 1 follows directly from [5]; the inequality is true if and only if, for all points a ∈ A, b ∈ B, r ∈ R, a is closer to r than b. Translated into the possible worlds model, this is equivalent to the statement that A is closer to R than B in any possible world, which in turn means that PDom(A, B, R) = 1. In addition, it holds that

Corollary 2. PDom(A, B, R) = 1 ⇒ PDom(B, A, R) = 0.

In the example depicted in Figure 2(a), the grey region on the right shows all points that are definitely closer to A than to B, and the grey region on the left shows all points that are definitely closer to B than to A. Consequently, A dominates B (B dominates A) if R completely falls into the right (left) grey-shaded half-space.²

² Note that the grey regions are not explicitly computed; we only include them in Figure 2(a) for illustration purposes.

B. Probabilistic Domination

Now, we consider the case where A does not completely dominate B w.r.t. R. In consideration of the possible world semantics, there may exist worlds in which A dominates B w.r.t. R, but not all possible worlds satisfy this criterion. Let us consider the example shown in Figure 2(b), where the uncertainty region of A is decomposed into five partitions, each assigned to one of the five grey-shaded regions illustrating which points are closer to the respective partition of A than to B. As we can see, R completely falls into only three of the grey-shaded regions. This means that A does not completely dominate B w.r.t. R. However, we know that in some possible worlds (at least in all possible worlds where A is located in A_1, A_2 or A_3), A does dominate B w.r.t. R. The question at issue is how to determine the probability PDom(A, B, R) that A dominates B w.r.t. R in an efficient way. The key idea is to decompose the uncertainty region of an object X into subregions for which we know the probability that X is located in that subregion (as done for object A in our example). Therefore, if neither Dom(A, B, R) nor Dom(B, A, R) holds, then there may still exist subregions A′ ⊆ A, B′ ⊆ B and R′ ⊆ R such that A′ dominates B′ w.r.t. R′.
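As a concrete illustration, here is a small sketch (all names hypothetical; rectangles are given as per-dimension (lo, hi) intervals) of the corner-wise test of Corollary 1. Applied to subregions, the same test also instantiates the indicator δ(A′, B′, R′) introduced next.

    def max_dist_1d(lo, hi, r):
        """Maximal distance between the 1D interval [lo, hi] and the point r."""
        return max(abs(r - lo), abs(r - hi))

    def min_dist_1d(lo, hi, r):
        """Minimal distance between the 1D interval [lo, hi] and the point r."""
        return max(lo - r, r - hi, 0.0)

    def completely_dominates(A, B, R, p=2):
        """Corollary 1: True only if A dominates B w.r.t. R in every possible world.

        A, B, R are rectangular uncertainty regions, each a sequence of
        (lo, hi) intervals, one per dimension; p selects the L_p norm.
        """
        total = 0.0
        for (a_lo, a_hi), (b_lo, b_hi), (r_lo, r_hi) in zip(A, B, R):
            # maximize over the two corners r_i in {R_i^min, R_i^max}
            total += max(
                max_dist_1d(a_lo, a_hi, r) ** p - min_dist_1d(b_lo, b_hi, r) ** p
                for r in (r_lo, r_hi)
            )
        return total < 0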
Given disjunctive decomposition schemes A, B and R, we can identify triples of subregions (A′ ∈ A, B′ ∈ B, R′ ∈ R) for which Dom(A′, B′, R′) holds. Let δ(A′, B′, R′) be the following indicator function:

δ(A′, B′, R′) = 1, if Dom(A′, B′, R′); 0, else.

Lemma 1. Let A, B and R be uncertain objects with disjunctive object decompositions A, B and R, respectively. To derive a lower bound PDom_LB(A, B, R) of the probability PDom(A, B, R) that A dominates B w.r.t. R, we can accumulate the probabilities of combinations of these subregions as follows:

PDom_LB(A, B, R) = Σ_{A′ ∈ A, B′ ∈ B, R′ ∈ R} P(a ∈ A′) · P(b ∈ B′) · P(r ∈ R′) · δ(A′, B′, R′),

where P(x ∈ X′) denotes the probability that object X is located within the region X′.

Proof: The probability of a combination (A′, B′, R′) can be computed by P(a ∈ A′) · P(b ∈ B′) · P(r ∈ R′) due to the assumption of mutually independent objects. These probabilities can be aggregated due to the assumption of disjunctive subregions, which implies that any two different combinations of subregions (A′ ∈ A, B′ ∈ B, R′ ∈ R) and (A′′ ∈ A, B′′ ∈ B, R′′ ∈ R) with A′ ≠ A′′ ∨ B′ ≠ B′′ ∨ R′ ≠ R′′ must represent disjunctive sets of possible worlds. Obviously, in all possible worlds defined by combinations (A′, B′, R′) where δ(A′, B′, R′) = 1, A dominates B w.r.t. R. But not all possible worlds where A dominates B w.r.t. R are covered by these combinations and, thus, some such worlds do not contribute to PDom_LB(A, B, R). Consequently, PDom_LB(A, B, R) lower-bounds PDom(A, B, R).
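A direct transcription of Lemma 1 (and of the upper bound from Lemma 2, stated next), reusing the hypothetical completely_dominates() sketch from above; each decomposition is given as a list of (rectangle, probability) pairs:

    def pdom_lb(parts_A, parts_B, parts_R, p=2):
        """Lemma 1: lower bound on PDom(A, B, R) from disjunctive decompositions.

        Each argument is a list of (rect, prob) pairs, where rect is a sequence
        of per-dimension (lo, hi) intervals and the probs of one decomposition
        sum to 1.
        """
        lb = 0.0
        for rect_a, pa in parts_A:
            for rect_b, pb in parts_B:
                for rect_r, pr in parts_R:
                    if completely_dominates(rect_a, rect_b, rect_r, p):
                        lb += pa * pb * pr   # delta(A', B', R') = 1
        return lb

    def pdom_ub(parts_A, parts_B, parts_R, p=2):
        """Lemma 2: PDom_UB(A, B, R) = 1 - PDom_LB(B, A, R)."""
        return 1.0 - pdom_lb(parts_B, parts_A, parts_R, p)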

Fig. 3. A_1 and A_2 each dominate B w.r.t. R with a probability of 50%.

Analogously, we can define an upper bound of PDom(A, B, R):

Lemma 2. An upper bound PDom_UB(A, B, R) of PDom(A, B, R) can be derived as follows:

PDom_UB(A, B, R) = 1 − PDom_LB(B, A, R).

Naturally, the more refined the decompositions are, the tighter the bounds that can be computed and the higher the corresponding cost of deriving them. In particular, starting from the entire MBRs of the objects, we can progressively partition them to iteratively derive tighter bounds for their dependency relationships until a desired degree of certainty is achieved (based on some threshold). However, in the next section we show that the domination count DomCount(B, R) of a given object B (cf. Definition 3), which is the main module of prominent probabilistic queries, cannot be straightforwardly derived with the use of these bounds, and we propose a methodology based on generating functions for this purpose.

IV. PROBABILISTIC DOMINATION COUNT

In Section III we described how to conservatively and progressively approximate the probability that A dominates B w.r.t. R. Given these approximations PDom_LB(A, B, R) and PDom_UB(A, B, R), the next problem is to accumulate these probabilities to get an approximation of the domination count DomCount(B, R) of an object B w.r.t. R (cf. Definition 3). To give an intuition of how challenging this problem is, we first present, in Section IV-A, a naive solution that can yield incorrect results due to ignoring dependencies between domination relations. To avoid the problem of dependent domination relations, we first show in Section IV-B how to exploit object independence to derive domination bounds that are mutually independent. Afterwards, in Section IV-C, we introduce a new class of uncertain generating functions that can be used to derive bounds for the domination count efficiently, as we show in Section IV-D. Finally, in Section IV-E, we show how to improve our domination count approximation by considering disjunctive subsets of possible worlds for which a more accurate approximation can be computed.

A. The Problem of Domination Dependencies

To compute DomCount(B, R), a straightforward solution is to first approximate PDom(A, B, R) for all A ∈ D using the technique proposed in Section III. Then, given these probabilities, we can apply the technique of uncertain generating functions (cf. Section IV-C) to approximate the probability that exactly 0, exactly 1, ..., exactly N uncertain objects dominate B. However, this approach ignores possible dependencies between domination relationships. Although we assume independence between objects, the random variables Dom(A_1, B, R) and Dom(A_2, B, R) are mutually dependent, because the distance between A_1 and R depends on the distance between A_2 and R, since object R can only appear once. Consider the following example:

Example 1. Consider a database of three certain objects B, A_1 and A_2 and the uncertain reference object R, as shown in Figure 3. For simplicity, objects A_1 and A_2 have the same position in this example. The task is to determine the domination count of B w.r.t. R. The domination half-spaces of A_1 and A_2 are depicted as well. Let us assume that A_1 (A_2) dominates B with a probability of PDom(A_1, B, R) = PDom(A_2, B, R) = 50%. Recall that this probability can be computed by integration or approximated with arbitrary precision using the technique of Section III.
However, in this example, the probability that both A_1 and A_2 dominate B is not simply 50% · 50% = 25%, as the generating function technique would return. The reason for the wrong result in this example is that the generating function technique requires mutually independent random variables. However, in this example it holds that if and only if R falls into the domination half-space of A_1, it also falls into the domination half-space of A_2. Thus we have the dependency Dom(A_1, B, R) ⇔ Dom(A_2, B, R), and the probability that B is dominated by both A_1 and A_2 is P(Dom(A_1, B, R)) · P(Dom(A_2, B, R) | Dom(A_1, B, R)) = 0.5 · 1 = 0.5.

B. Domination Approximations Based on Independent Objects

In general, domination relations may have arbitrary correlations. Therefore, we present a way to compute the domination count DomCount(B, R) while accounting for the dependencies between domination relations.

Complete Domination: In an initial step, complete domination serves as a filter which allows us to detect those objects A ∈ D that definitely dominate a specific object B w.r.t. R, and those objects that definitely do not dominate B w.r.t. R, by means of evaluating PDom(A, B, R). It is important to note that complete domination relations are mutually independent, since complete domination is evaluated on the entire uncertainty regions of the objects. After applying complete domination, we have detected the objects that dominate B in all, or in no, possible worlds. Consequently, we get a first approximation of the domination count DomCount(B, R): obviously, it must be at least the number N_1 of objects that completely dominate B and at most |D| − M, where M is the number of objects that dominate B in no possible world, i.e. P(DomCount(B, R) = k) = 0 for k < N_1 and k > |D| − M. Nevertheless, for N_1 ≤ k ≤ |D| − M we still have a very bad approximation of the domination count probability of 0 ≤ P(DomCount(B, R) = k) ≤ 1.

Probabilistic Domination: In order to refine this probability distribution, we have to consider the set of influence objects influenceObjects = {A_1, ..., A_C}, which neither completely dominate B nor are completely dominated by B w.r.t. R. For each A_i ∈ influenceObjects, 0 < PDom(A_i, B, R) < 1. For these objects, we can compute the probabilities PDom(A_1, B, R), ..., PDom(A_C, B, R) according to the methodology of Section III. However, due to the mutual dependencies between domination relations (cf. Section IV-A), we cannot simply use these probabilities directly, as they may produce incorrect results. Instead, we can use the observation that the objects A_i are mutually independent and each candidate object A_i only appears in a single one of the domination relations Dom(A_1, B, R), ..., Dom(A_C, B, R). Exploiting this observation, we can decompose only the objects A_1, ..., A_C to obtain mutually independent bounds for the probabilities PDom(A_1, B, R), ..., PDom(A_C, B, R), as stated by the following lemma:

Lemma 3. Let A_1, ..., A_C be uncertain objects with disjunctive object decompositions A_1, ..., A_C, respectively. Also, let B and R be uncertain objects (without any decomposition). The lower (upper) bound PDom_LB(A_i, B, R) (PDom_UB(A_i, B, R)) as defined in Lemma 1 (Lemma 2) of the random variable Dom(A_i, B, R) is independent of the random variable Dom(A_j, B, R) (1 ≤ i ≠ j ≤ C).

Proof: Consider the random variable Dom(A_i, B, R) conditioned on the event Dom(A_j, B, R) = 1. Using Lemma 1, we can derive the lower bound probability of Dom(A_i, B, R) = 1 given Dom(A_j, B, R) = 1 as follows:

PDom_LB(A_i, B, R | Dom(A_j, B, R) = 1) = Σ_{A_i′ ∈ A_i, B′ ∈ B, R′ ∈ R} [P(a_i ∈ A_i′ | Dom(A_j, B, R) = 1) · P(b ∈ B′ | Dom(A_j, B, R) = 1) · P(r ∈ R′ | Dom(A_j, B, R) = 1) · δ(A_i′, B′, R′)]

Now we exploit that B and R are not decomposed, thus B′ = B and R′ = R, and therefore P(b ∈ B′ | Dom(A_j, B, R) = 1) = 1 = P(b ∈ B′) and P(r ∈ R′ | Dom(A_j, B, R) = 1) = 1 = P(r ∈ R′). We obtain:

PDom_LB(A_i, B, R | Dom(A_j, B, R) = 1) = Σ_{A_i′ ∈ A_i, B′ ∈ B, R′ ∈ R} [P(a_i ∈ A_i′ | Dom(A_j, B, R) = 1) · P(b ∈ B′) · P(r ∈ R′) · δ(A_i′, B′, R′)]

Next we exploit that P(a_i ∈ A_i′ | Dom(A_j, B, R) = 1) = P(a_i ∈ A_i′), since A_i is independent of Dom(A_j, B, R), and obtain:

PDom_LB(A_i, B, R | Dom(A_j, B, R) = 1) = Σ_{A_i′ ∈ A_i, B′ ∈ B, R′ ∈ R} [P(a_i ∈ A_i′) · P(b ∈ B′) · P(r ∈ R′) · δ(A_i′, B′, R′)] = PDom_LB(A_i, B, R)

Analogously, it can be shown that PDom_UB(A_i, B, R | Dom(A_j, B, R) = 1) = PDom_UB(A_i, B, R).

In summary, we can now derive, for each object A_i, a lower and an upper bound of the probability that A_i dominates B w.r.t. R. However, these bounds may still be rather loose, since so far we only consider the full uncertainty regions of B and R, without any decomposition. In Section IV-E, we will show how to obtain more accurate, still mutually independent probability bounds based on decompositions of B and R. Due to the mutual independence of the lower and upper probability bounds, these probabilities can now be used to get an approximation of the domination count of B. In order to do this efficiently, we adapt the generating functions technique proposed in [9]. The main challenge here is to extend the generating function technique in order to cope with probability bounds instead of concrete probability values. It can be shown that a straightforward solution based on the existing generating functions technique, applied to the lower/upper probability bounds in an appropriate way, does solve the given problem efficiently, but overestimates the domination count probability and thus does not yield good probability bounds. Rather, we have to redesign the generating functions technique such that lower/upper probability bounds can be handled correctly.
C. Uncertain Generating Functions (UGFs)

In this subsection, we give a brief survey of the existing generating function technique (for more details refer to [9]) and then propose our new technique of uncertain generating functions.

Generating Functions: Consider a set of N mutually independent, but not necessarily identically distributed Bernoulli {0, 1} random variables X_1, ..., X_N. Let P(X_i) denote the probability that X_i = 1. The problem is to efficiently compute the distribution of the sum

Σ_{i=1}^{N} X_i = Σ_{i=1}^{N} Dom(A_i, B, R)

of these random variables. A naive solution would be to count, for each 0 ≤ k ≤ N, all combinations with exactly k occurrences of X_i = 1 and accumulate the respective probabilities of these combinations. This approach, however, has a complexity of O(2^N). In [5], an approach was proposed that achieves an O(N) complexity using the Poisson Binomial Recurrence. Note that O(N) time is asymptotically optimal in general, since the computation involves at least O(N) values, namely P(X_i), 1 ≤ i ≤ N. In the following, we propose a different approach that, albeit having the same linear asymptotic complexity, has other advantages, as we will see. We apply the concept of generating functions as proposed in the context of probabilistic ranking in [9]. Consider the function F(x) = Π_{i=1}^{N} (a_i + b_i·x). The coefficient of x^k in F(x) is given by

Σ_{|β| = k} Π_{i: β_i = 0} a_i · Π_{i: β_i = 1} b_i,

where β = (β_1, ..., β_N) is a Boolean vector, and |β| denotes the number of 1s in β. Now consider the following generating function:

F^i = Π_{j ≤ i} (1 − P(X_j) + P(X_j)·x) = Σ_{j ≥ 0} c_j x^j.

The coefficient c_j of x^j in the expansion of F^i is the probability that exactly j of the random variables X_1, ..., X_i are 1. Since F^i contains at most i + 1 non-zero terms, and by observing that F^i = F^{i−1} · (1 − P(X_i) + P(X_i)·x), we note that F^i can be computed in O(i) time given F^{i−1}. Since F^0 = 1·x^0 = 1, we conclude that F^N can be computed in O(N^2) time. If only the first k coefficients are required (i.e. the coefficients c_j where j < k), this cost can be reduced to O(k·N) by simply dropping the summands c_j x^j where j ≥ k.

Example 2. As an example, consider three independent random variables X_1, X_2 and X_3. Let P(X_1) = 0.2, P(X_2) = 0.1 and P(X_3) = 0.3, and let k = 2. Then:

F^1 = F^0 · (0.8 + 0.2x) = 0.2x^1 + 0.8x^0
F^2 = F^1 · (0.9 + 0.1x) = 0.02x^2 + 0.26x^1 + 0.72x^0 =* 0.26x^1 + 0.72x^0
F^3 = F^2 · (0.7 + 0.3x) =* 0.398x^1 + 0.504x^0

Thus, P(DomCount(B) = 0) = 50.4% and P(DomCount(B) = 1) = 39.8%, and we obtain P(DomCount(B) < 2) = 90.2%. Thus, B can be reported as a true hit if τ is not greater than 90.2%. Equations marked by * exploit that we only need to compute the coefficients c_j where j < k = 2.
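For reference, a minimal sketch (a hypothetical helper, not from the paper) of this coefficient-wise expansion; the commented call reproduces Example 2:

    def count_pdf(probs, k=None):
        """Expand F^N = prod_i (1 - P(X_i) + P(X_i) * x) coefficient by coefficient.

        Returns coeffs with coeffs[j] = P(sum_i X_i = j) for independent
        Bernoulli variables. If k is given, coefficients c_j with j >= k are
        dropped after every step, reducing the cost from O(N^2) to O(k * N).
        """
        coeffs = [1.0]                        # F^0 = 1 * x^0
        for p in probs:
            nxt = [0.0] * (len(coeffs) + 1)
            for j, c in enumerate(coeffs):    # multiply by (1 - p + p * x)
                nxt[j] += c * (1 - p)
                nxt[j + 1] += c * p
            coeffs = nxt if k is None else nxt[:k]
        return coeffs

    # Example 2: count_pdf([0.2, 0.1, 0.3], k=2) -> [0.504, 0.398]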
Uncertain Generating Functions: Given a set of N independent but not necessarily identically distributed Bernoulli {0, 1} random variables X_i, 1 ≤ i ≤ N, let P_LB(X_i) (P_UB(X_i)) be a lower (upper) bound approximation of the probability P(X_i = 1), and consider the random variable Σ_{i=1}^{N} X_i. We make the following observation: the lower and upper bound probabilities P_LB(X_i) and P_UB(X_i) correspond to the probabilities of the three following events:

- X_i = 1 definitely holds, with a probability of at least P_LB(X_i).
- X_i = 0 definitely holds, with a probability of at least 1 − P_UB(X_i).
- It is unknown whether X_i = 0 or X_i = 1, with the remaining probability of P_UB(X_i) − P_LB(X_i).

Based on this observation, we consider the following uncertain generating function (UGF):

F_N = Π_{i=1,...,N} [P_LB(X_i)·x + (P_UB(X_i) − P_LB(X_i))·y + (1 − P_UB(X_i))] = Σ_{i,j ≥ 0} c_{i,j} x^i y^j.

The coefficient c_{i,j} has the following meaning: with a probability of c_{i,j}, B is definitely dominated at least i times, and possibly dominated another 0 to j times. Therefore, the minimum probability that Σ_{i=1}^{N} X_i = k is c_{k,0}, since that is the probability that exactly k random variables X_i are 1. The maximum probability that Σ_{i=1}^{N} X_i = k is Σ_{i ≤ k, i+j ≥ k} c_{i,j}, i.e. the total probability of all possible combinations in which Σ_{i=1}^{N} X_i = k may hold. Therefore, we obtain an approximated PDF of Σ_{i=1}^{N} X_i, in which each probability P(Σ_{i=1}^{N} X_i = k) is given by a conservative and a progressive approximation.

Example 3. Let P_LB(X_1) = 20%, P_UB(X_1) = 50%, P_LB(X_2) = 60% and P_UB(X_2) = 80%. The generating function for the random variable Σ_{i=1}^{2} X_i is the following:

F_2 = (0.2x + 0.3y + 0.5)(0.6x + 0.2y + 0.2) = 0.12x^2 + 0.22xy + 0.06y^2 + 0.34x + 0.16y + 0.1

This implies that, with a probability of at least 12%, Σ_{i=1}^{2} X_i = 2. In addition, with a probability of 22% plus 6%, it may hold that Σ_{i=1}^{2} X_i = 2, so that we obtain a probability bound of 12%–40% for the random event Σ_{i=1}^{2} X_i = 2. Analogously, Σ_{i=1}^{2} X_i = 1 with a probability of 34%–78% and Σ_{i=1}^{2} X_i = 0 with a probability of 10%–32%.

Fig. 4. Approximated PDF of Σ_{i=1}^{2} X_i.

Each expansion F_l can be obtained from the expansion of F_{l−1} as follows:

F_l = F_{l−1} · [P_LB(X_l)·x + (P_UB(X_l) − P_LB(X_l))·y + (1 − P_UB(X_l))].

We note that F_l contains at most Σ_{i=1}^{l+1} i non-zero terms (one c_{i,j} for each combination of i and j where i + j ≤ l). Therefore, the total complexity to compute F_l is O(l^3).
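Again only as an illustration (a hypothetical helper under the term ordering used above), the UGF can be expanded with a sparse dictionary of (i, j) exponents; the commented call reproduces Example 3:

    def ugf(lb, ub):
        """Expand F_N = prod_i [lb_i*x + (ub_i - lb_i)*y + (1 - ub_i)].

        Returns {(i, j): c_ij}, the coefficient of x^i y^j: with probability
        c_ij, B is definitely dominated at least i times and possibly
        dominated another 0..j times.
        """
        coeffs = {(0, 0): 1.0}
        for p_lb, p_ub in zip(lb, ub):
            terms = (((1, 0), p_lb), ((0, 1), p_ub - p_lb), ((0, 0), 1.0 - p_ub))
            nxt = {}
            for (i, j), c in coeffs.items():
                for (di, dj), w in terms:
                    if w > 0.0:
                        key = (i + di, j + dj)
                        nxt[key] = nxt.get(key, 0.0) + c * w
            coeffs = nxt
        return coeffs

    # Example 3: ugf([0.2, 0.6], [0.5, 0.8]) ->
    # {(2,0): 0.12, (1,1): 0.22, (0,2): 0.06, (1,0): 0.34, (0,1): 0.16, (0,0): 0.10}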

D. Efficient Domination Count Approximation Using UGFs

We can directly use the uncertain generating functions proposed in the previous section to derive bounds for the probability distribution of the domination count DomCount(B, R). Again, let D = {A_1, ..., A_N} be an uncertain object database and let B and R be uncertain objects in R^d. Let Dom(A_i, B, R), 1 ≤ i ≤ N, denote the random Bernoulli event that A_i dominates B w.r.t. R.³ Also recall that the domination count is defined as the random variable that is the sum of the domination indicator variables of all uncertain objects in the database (cf. Definition 3). Considering the generating function

F_N = Π_{i=1,...,N} [P_LB(Dom(A_i, B, R))·x + (P_UB(Dom(A_i, B, R)) − P_LB(Dom(A_i, B, R)))·y + (1 − P_UB(Dom(A_i, B, R)))] = Σ_{i,j ≥ 0} c_{i,j} x^i y^j,     (1)

we can efficiently compute lower and upper bounds of the probability that DomCount(B, R) = k for 0 ≤ k ≤ |D|, as discussed in Section IV-C, because the independence property of the random variables required by the generating functions is satisfied due to Lemma 3.

³ That is, X[Dom(A_i, B, R)] = 1 iff A_i dominates B w.r.t. R and X[Dom(A_i, B, R)] = 0 otherwise.

Lemma 4. A lower bound DomCount_LB^k(B, R) of the probability that DomCount(B, R) = k is given by

DomCount_LB^k(B, R) = c_{k,0},

and an upper bound DomCount_UB^k(B, R) of the probability that DomCount(B, R) = k is given by

DomCount_UB^k(B, R) = Σ_{i ≤ k, i+j ≥ k} c_{i,j}.

Example 4. Assume a database containing the uncertain objects A_1, A_2, B and R. The task is to determine a lower (upper) bound of the domination count probability DomCount_LB^k(B, R) (DomCount_UB^k(B, R)) of B w.r.t. R. Assume that, by decomposing A_1 and A_2 and using the probabilistic domination approximation technique proposed in Section III-B, we determine that A_1 has a minimum probability PDom_LB(A_1, B, R) of dominating B of 20% and a maximum probability PDom_UB(A_1, B, R) of 50%. For A_2, PDom_LB(A_2, B, R) is 60% and PDom_UB(A_2, B, R) is 80%. By applying the technique of the previous subsection, we get the same generating function as in Example 3 and thus the same approximated PDF for DomCount(B, R) as depicted in Figure 4.

To compute the uncertain generating function, and thus the probabilistic domination count of an object in an uncertain database of size N, the total complexity is O(N^3). The reason is that the maximal number of coefficients of the generating function F_x is quadratic in x, since F_x contains the coefficients c_{i,j} where i + j ≤ x, that is, at most x^2/2 coefficients. Since we have to compute F_x for each x (1 ≤ x ≤ N), the total time complexity is O(N^3). Note that only candidate objects c ∈ Cand for which complete domination cannot be detected (cf. Section III-A) have to be considered in the generating functions. Thus, the total runtime to compute DomCount_LB^k(B, R) as well as DomCount_UB^k(B, R) is O(|Cand|^3). In addition, we will show in Section VI how to reduce, specifically for kNN and RkNN queries, the total time complexity to O(k^2 · |Cand|).

Discussion: In the extended version of this paper ([3]), we show that instead of applying the uncertain generating function to approximate the domination count of B, two regular generating functions can be used: one that uses the progressive (lower) probability bounds P_LB(Dom(A_i, B, R)) and one that uses the conservative (upper) probability bounds P_UB(Dom(A_i, B, R)). However, we give an intuition and a formal proof that using regular generating functions yields looser bounds for the approximated domination count.
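Lemma 4 translates directly into a small helper on top of the hypothetical ugf() sketch above; the commented call corresponds to Example 4 (and to the bounds listed in Example 3):

    def domcount_bounds(lb, ub):
        """Lemma 4: per-count probability bounds from the UGF coefficients."""
        coeffs = ugf(lb, ub)
        n = len(lb)
        lo = [coeffs.get((k, 0), 0.0) for k in range(n + 1)]
        hi = [sum(c for (i, j), c in coeffs.items() if i <= k and i + j >= k)
              for k in range(n + 1)]
        return lo, hi

    # Example 4: domcount_bounds([0.2, 0.6], [0.5, 0.8])
    # -> lo = [0.10, 0.34, 0.12], hi = [0.32, 0.78, 0.40]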
E. Efficient Domination Count Approximation Based on Disjunctive Worlds

Since the uncertain objects B and R appear in each of the domination relations PDom(A_1, B, R), ..., PDom(A_C, B, R) that are to be evaluated, we cannot split objects B and R independently (cf. Section IV-A). The reason for this dependency is that knowledge about the predicate Dom(A_i, B, R) may impose constraints on the positions of B and R. Thus, for a partition B′ ∈ B, the probability PDom(A_j, B′, R) may change given Dom(A_i, B, R) (1 ≤ i, j ≤ C, i ≠ j). However, note:

Lemma 5. Given fixed partitions B′ ∈ B and R′ ∈ R, the random variables Dom(A_i, B′, R′), 1 ≤ i ≤ C, are mutually independent.

Proof: Similar to the proof of Lemma 3.

This allows us to individually consider the subset of possible worlds where b ∈ B′ and r ∈ R′ and use Lemma 5 to efficiently compute the approximated domination count probabilities DomCount_LB^k(B′, R′) and DomCount_UB^k(B′, R′) under the condition that B falls into a partition B′ ∈ B and R falls into a partition R′ ∈ R. This can be performed for each pair (B′, R′) ∈ B × R, where B and R denote the decompositions of B and R, respectively. Now, we can treat pairs of partitions (B′, R′) ∈ B × R independently, since all pairs of partitions represent disjunctive sets of possible worlds due to the assumption of a disjunctive partitioning. Exploiting this independence, the PDF of the domination count DomCount(B, R) of the complete objects B and R can then be obtained by creating an uncertain generating function for each pair (B′, R′) to derive a lower and an upper bound of P(DomCount(B′, R′) = k), and then computing the weighted sum of these bounds as follows:

DomCount_LB^k(B, R) = Σ_{B′ ∈ B, R′ ∈ R} DomCount_LB^k(B′, R′) · P(B′) · P(R′).
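In code, the weighted combination is a few lines on top of domcount_bounds(); again a hypothetical sketch, where each pair carries its weight P(B′) · P(R′) and the per-object bounds valid under that pair:

    def domcount_bounds_partitioned(pairs, n):
        """Section IV-E: combine per-(B', R') bounds into bounds for (B, R).

        pairs: list of (weight, lb, ub), where weight = P(B') * P(R') and
        lb/ub are the per-object domination probability bounds that hold
        conditioned on b in B' and r in R'. n is the number of candidates.
        """
        lo = [0.0] * (n + 1)
        hi = [0.0] * (n + 1)
        for w, lb, ub in pairs:
            p_lo, p_hi = domcount_bounds(lb, ub)
            for k in range(n + 1):
                lo[k] += w * p_lo[k]
                hi[k] += w * p_hi[k]
        return lo, hi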

The complete algorithm of our domination count approximation approach can be found in the next section.

V. IMPLEMENTATION

Algorithm 1 is a complete method for iteratively computing and refining the probabilistic domination count for a given object B and a reference object R. The algorithm starts by detecting complete domination (cf. Section III-A). For each object that completely dominates B, a counter CompleteDominationCount is increased, and each object that is completely dominated by B is removed from further consideration, since it has no influence on the domination count of B. The remaining objects, which may have a probability greater than zero and less than one of dominating B, are stored in a set influenceObjects. The set influenceObjects is then used to compute the probabilistic domination count (DomCount_LB, DomCount_UB):⁴ The main loop of the probabilistic domination count approximation starts in line 14. In each iteration, B, R and all influence objects are partitioned. For each combination of partitions B′ and R′, and each database object A_i ∈ influenceObjects, the probability PDom(A_i, B′, R′) is approximated (cf. Section IV-B). These domination probability bounds are used to build an uncertain generating function (cf. Section IV-D) for the domination count of B′ w.r.t. R′. Finally, these domination counts are aggregated over all pairs of partitions B′, R′ into the domination count DomCount(B, R) (cf. Section IV-E). The main loop continues until a domain- and user-specific stop criterion is satisfied. For example, for a threshold kNN query, a stop criterion is to decide whether the lower (upper) bound of the probability that B has a domination count of less than k exceeds (falls below) the given threshold.

⁴ DomCount_LB and DomCount_UB are lists containing, at each position i, a lower and an upper bound for P(DomCount(B, R) = i), respectively. This notation is equivalent to a single uncertain domination count PDF.

The progressive decomposition of objects (line 15) can be facilitated by precomputed split points of the object PDFs. More specifically, we can iteratively split each object X by means of a median-split-based bisection method and use a kd-tree [2] to hierarchically organize the resulting partitions. The kd-tree is a binary tree. The root of a kd-tree represents the complete region of an uncertain object. Every node implicitly defines a splitting hyperplane that divides the space into two subspaces. This hyperplane is perpendicular to a chosen split axis and located at the median of the node's distribution along this axis. The advantage is that, for each node in the kd-tree, the probability of the respective subregion X′ is simply given by 0.5^{X′.level}, where X′.level is the level of X′. In addition, the bounds of a subregion X′ can be determined by backtracking to the root. In general, for continuously partitioned uncertain objects, the corresponding kd-tree may have an infinite height; for practical reasons, however, the height h of the kd-tree is limited. The choice of h is a trade-off between approximation quality and efficiency: for a very large h, considering each leaf node is similar to applying integration to the PDFs, which yields an exact result; however, the number of leaf nodes, and thus the worst-case complexity, increases exponentially in h. Note that our experiments (cf. Section VII) show that a low h value is sufficient to yield reasonably tight approximation bounds.
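A sketch of such a decomposition node (hypothetical field and method names), illustrating why a median split yields the 0.5^level probability without any integration:

    class KdNode:
        """One subregion of an uncertain object's median-split kd-tree (Section V)."""

        def __init__(self, box, level=0):
            self.box = box          # list of per-dimension (lo, hi) intervals
            self.level = level      # the root has level 0
            self.left = self.right = None

        def prob(self):
            # every split bisects the probability mass at the PDF median,
            # so a node at depth `level` carries mass 0.5 ** level
            return 0.5 ** self.level

        def split(self, axis, median):
            """Bisect this subregion at the precomputed PDF median along `axis`."""
            lo, hi = self.box[axis]
            left_box, right_box = list(self.box), list(self.box)
            left_box[axis], right_box[axis] = (lo, median), (median, hi)
            self.left = KdNode(left_box, self.level + 1)
            self.right = KdNode(right_box, self.level + 1)
            return self.left, self.right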
Yet it has to be noted that, in the general case of continuous uncertainty, our proposed approach may only return an approximation of the exact probabilistic domination count. However, such an approximation may be sufficient to decide a given predicate, as we will see in Section VI; and even in the case where the approximation does not suffice to decide the query predicate, it gives the user a confidence value, based on which the user may decide whether to include an object in the result.

Algorithm 1 Probabilistic Domination Count Approximation
Require: R, B, D
 1: influenceObjects = ∅
 2: CompleteDominationCount = 0
 3: // complete domination
 4: for all A_i ∈ D do
 5:   if DDC_Optimal(A_i, B, R) then
 6:     CompleteDominationCount++
 7:   else if ¬DDC_Optimal(B, A_i, R) then
 8:     influenceObjects = influenceObjects ∪ {A_i}
 9:   end if
10: end for
11: // probabilistic domination count
12: DomCount_LB = [0, ..., 0]  // length |D|
13: DomCount_UB = [1, ..., 1]  // length |D|
14: while ¬stopCriterion do
15:   split(R), split(B), split(A_i ∈ influenceObjects)
16:   for all B′ ∈ B, R′ ∈ R do
17:     cand_LB = [0, ..., 0]  // length |influenceObjects|
18:     cand_UB = [1, ..., 1]  // length |influenceObjects|
19:     for all 0 ≤ i < |influenceObjects| do
20:       A_i = influenceObjects[i]
21:       for all A_i′ ∈ A_i do
22:         if DDC_Optimal(A_i′, B′, R′) then
23:           cand_LB[i] += P(A_i′)
24:         else if DDC_Optimal(B′, A_i′, R′) then
25:           cand_UB[i] −= P(A_i′)
26:         end if
27:       end for
28:     end for
29:     compute DomCount_LB(B′, R′) and DomCount_UB(B′, R′) using UGFs
30:     for all 0 ≤ i < |D| do
31:       DomCount_LB[i] += DomCount_LB(B′, R′)[i] · P(B′) · P(R′)
32:       DomCount_UB[i] += DomCount_UB(B′, R′)[i] · P(B′) · P(R′)
33:     end for
34:   end for
35:   ShiftRight(DomCount_LB, CompleteDominationCount)
36:   ShiftRight(DomCount_UB, CompleteDominationCount)
37: end while
38: return (DomCount_LB, DomCount_UB)

VI. APPLICATIONS

In this section, we outline how the probabilistic domination count can be used to efficiently evaluate a variety of probabilistic similarity query types, namely the probabilistic inverse similarity ranking query [2], the probabilistic threshold k-NN query [0], the probabilistic threshold reverse k-NN query and the probabilistic similarity ranking query [4], [4], [9], [25]. We start with the probabilistic inverse ranking query, because it can be derived trivially from the probabilistic domination count introduced in Section IV.

In the following, let D = {A_1, ..., A_N} be an uncertain database containing uncertain objects A_1, ..., A_N.

Corollary 3. Let B and R be uncertain objects. The task is to determine the probabilistic ranking distribution Rank(B, R) of B w.r.t. the similarity to R, i.e. the distribution of the position Rank(B, R) of object B in a complete similarity ranking of A_1, ..., A_N, B w.r.t. the distance to an uncertain reference object R. Using our techniques, we can compute Rank(B, R) as follows:

P(Rank(B, R) = i) = P(DomCount(B, R) = i − 1)

The above corollary is evident, since the proposition "B has rank i" is equivalent to the proposition "B is dominated by i − 1 objects".

The most prominent probabilistic similarity search query is the probabilistic threshold kNN query.

Corollary 4. Let Q = R be an uncertain query object and let k be a scalar. The problem is to find all uncertain objects kNN_τ(Q) that are the k-nearest neighbors of Q with a probability of at least τ. Using our techniques, we can compute the probability P_kNN(B, Q) that an object B is a kNN of Q as follows:

P_kNN(B, Q) = Σ_{i=0}^{k−1} P(DomCount(B, Q) = i)

The above corollary is evident, since the proposition "B is a kNN of Q" is equivalent to the proposition "B is dominated by fewer than k objects". To decide whether B is a kNN of Q, i.e. whether B ∈ kNN_τ(Q), we just need to check if P_kNN(B, Q) > τ. Next we show how to answer probabilistic threshold RkNN queries.

Corollary 5. Let Q = R be an uncertain query object and let k be a scalar. The problem is to find all uncertain objects A_i that have Q as one of their kNNs with a probability of at least τ, that is, all objects A_i for which it holds that Q ∈ kNN_τ(A_i). Using our techniques, we can compute the probability P_RkNN(B, Q) that an object B is an RkNN of Q as follows:

P_RkNN(B, Q) = Σ_{i=0}^{k−1} P(DomCount(Q, B) = i)

The intuition here is that an object B is an RkNN of Q if and only if Q is dominated less than k times w.r.t. B. For kNN and RkNN queries, the total complexity to compute the uncertain generating function can be improved from O(|Cand|^3) to O(|Cand| · k^2), since it can be observed from Corollaries 4 and 5 that for kNN and RkNN queries we only require the section of the PDF of DomCount(B, R) where DomCount(B, R) < k, i.e. we only need to know the probabilities P(DomCount(B, R) = x), x < k. This can be exploited to improve the runtime of the computation of the PDF of DomCount(B, R) as follows: Consider the iterative computation of the generating functions F_1, ..., F_|Cand|. For each F_l, 1 ≤ l ≤ |Cand|, we only need to consider the coefficients c_{i,j} where i < k, since only these coefficients have an influence on P(DomCount(B, R) = x), x < k (cf. Section IV-C). In addition, we can merge all coefficients c_{i,j}, c_{i′,j′} where i = i′, i + j > k and i′ + j′ > k, since all these coefficients only differ in their influence on the upper bounds of P(DomCount(B, R) = x), x ≥ k, and are treated equally for P(DomCount(B, R) = x), x < k. Thus, each F_l contains at most Σ_{i=1}^{k+1} i coefficients (one c_{i,j} for each combination of i and j where i + j ≤ k), reducing the total complexity to O(k^2 · |Cand|).
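As a minimal sketch (hypothetical helpers, reusing domcount_bounds() from Section IV-D), Corollary 4's threshold predicate can be decided, or left open for further refinement, directly from the bounded DomCount PDF:

    def pknn_bounds(dc_lo, dc_hi, k):
        """Corollary 4: bounds on P_kNN(B, Q) = P(DomCount(B, Q) < k).

        dc_lo[i] and dc_hi[i] bound P(DomCount(B, Q) = i). Summing the upper
        bounds may be loose, so 1 minus the sum of the remaining lower bounds
        serves as a second valid upper bound; we keep the tighter one.
        """
        lo = sum(dc_lo[:k])
        hi = min(sum(dc_hi[:k]), 1.0 - sum(dc_lo[k:]))
        return lo, hi

    def threshold_knn_predicate(dc_lo, dc_hi, k, tau):
        """True hit, False drop, or None if the bounds are still too loose."""
        lo, hi = pknn_bounds(dc_lo, dc_hi, k)
        if lo >= tau:
            return True      # B is certainly in the result
        if hi < tau:
            return False     # B can be pruned
        return None          # iterate Algorithm 1 further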
Finally, we show how to compute the expected rank (cf. [4]) of an uncertain object.

Corollary 6. Let Q = R be an uncertain query object. The problem is to rank the uncertain objects A_i according to their expected rank E(Rank(A_i)) w.r.t. the distance to Q. The expected rank of an uncertain object A_i can be computed as follows:

E(Rank(A_i)) = Σ_{k=0}^{N−1} P(DomCount(A_i, Q) = k) · (k + 1)

Other probabilistic similarity queries (e.g. kNN and RkNN queries with a different uncertainty predicate instead of a threshold τ) can be approximated efficiently using our techniques as well. Details are omitted due to space constraints.

VII. EXPERIMENTAL EVALUATION

In this section, we review the characteristics of the proposed algorithm on synthetic and real-world data. The algorithm will be referred to as IDCA (Iterative Domination Count Approximation). We performed experiments under various parameter settings. Unless otherwise stated, for 100 queries, we chose B to be the object with the 10th smallest MinDist to the reference object R. We used a synthetic dataset with 10,000 objects modeled as 2D rectangles. The degree of uncertainty of the objects in each dimension is modeled by their relative extent. The extents were generated uniformly at random up to a fixed maximum value. For the evaluation on real-world data, we utilized the International Ice Patrol (IIP) Iceberg Sightings Dataset.⁵ This dataset contains information about iceberg activity in the North Atlantic. The latitude and longitude values of sighted icebergs serve as certain 2D mean values for the 6,216 probabilistic objects that we generated. Based on the date and the time of the latest sighting, we added Gaussian noise to each object, such that the time passed since the latest sighting corresponds to the degree of uncertainty (i.e. the extent). The extents were normalized w.r.t. the extent of the data space.

⁵ The IIP dataset is available at the National Snow and Ice Data Center (NSIDC) web site.


More information

Chapter 3 Deterministic planning

Chapter 3 Deterministic planning Chapter 3 Deterministic planning In this chapter we describe a number of algorithms for solving the historically most important and most basic type of planning problem. Two rather strong simplifying assumptions

More information

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Spatial Database. Ahmad Alhilal, Dimitris Tsaras Spatial Database Ahmad Alhilal, Dimitris Tsaras Content What is Spatial DB Modeling Spatial DB Spatial Queries The R-tree Range Query NN Query Aggregation Query RNN Query NN Queries with Validity Information

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Nearest Neighbor Search with Keywords

Nearest Neighbor Search with Keywords Nearest Neighbor Search with Keywords Yufei Tao KAIST June 3, 2013 In recent years, many search engines have started to support queries that combine keyword search with geography-related predicates (e.g.,

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Computational Learning Theory

Computational Learning Theory CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Data Mining and Machine Learning

Data Mining and Machine Learning Data Mining and Machine Learning Concept Learning and Version Spaces Introduction Concept Learning Generality Relations Refinement Operators Structured Hypothesis Spaces Simple algorithms Find-S Find-G

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Robotics 2 Data Association. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard

Robotics 2 Data Association. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard Robotics 2 Data Association Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Wolfram Burgard Data Association Data association is the process of associating uncertain measurements to known tracks. Problem

More information

Searching Dimension Incomplete Databases

Searching Dimension Incomplete Databases IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 6, NO., JANUARY 3 Searching Dimension Incomplete Databases Wei Cheng, Xiaoming Jin, Jian-Tao Sun, Xuemin Lin, Xiang Zhang, and Wei Wang Abstract

More information

What is (certain) Spatio-Temporal Data?

What is (certain) Spatio-Temporal Data? What is (certain) Spatio-Temporal Data? A spatio-temporal database stores triples (oid, time, loc) In the best case, this allows to look up the location of an object at any time 2 What is (certain) Spatio-Temporal

More information

First-Order Logic. Chapter Overview Syntax

First-Order Logic. Chapter Overview Syntax Chapter 10 First-Order Logic 10.1 Overview First-Order Logic is the calculus one usually has in mind when using the word logic. It is expressive enough for all of mathematics, except for those concepts

More information

Multi-Dimensional Top-k Dominating Queries

Multi-Dimensional Top-k Dominating Queries Noname manuscript No. (will be inserted by the editor) Multi-Dimensional Top-k Dominating Queries Man Lung Yiu Nikos Mamoulis Abstract The top-k dominating query returns k data objects which dominate the

More information

Scalable Algorithms for Distribution Search

Scalable Algorithms for Distribution Search Scalable Algorithms for Distribution Search Yasuko Matsubara (Kyoto University) Yasushi Sakurai (NTT Communication Science Labs) Masatoshi Yoshikawa (Kyoto University) 1 Introduction Main intuition and

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

A General Testability Theory: Classes, properties, complexity, and testing reductions

A General Testability Theory: Classes, properties, complexity, and testing reductions A General Testability Theory: Classes, properties, complexity, and testing reductions presenting joint work with Luis Llana and Pablo Rabanal Universidad Complutense de Madrid PROMETIDOS-CM WINTER SCHOOL

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Static-Priority Scheduling. CSCE 990: Real-Time Systems. Steve Goddard. Static-priority Scheduling

Static-Priority Scheduling. CSCE 990: Real-Time Systems. Steve Goddard. Static-priority Scheduling CSCE 990: Real-Time Systems Static-Priority Scheduling Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/realtimesystems Static-priority Scheduling Real-Time Systems Static-Priority

More information

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data

Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Relative Improvement by Alternative Solutions for Classes of Simple Shortest Path Problems with Uncertain Data Part II: Strings of Pearls G n,r with Biased Perturbations Jörg Sameith Graduiertenkolleg

More information

Constraint-based Subspace Clustering

Constraint-based Subspace Clustering Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Matrix factorization models for patterns beyond blocks. Pauli Miettinen 18 February 2016

Matrix factorization models for patterns beyond blocks. Pauli Miettinen 18 February 2016 Matrix factorization models for patterns beyond blocks 18 February 2016 What does a matrix factorization do?? A = U V T 2 For SVD that s easy! 3 Inner-product interpretation Element (AB) ij is the inner

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

Heuristics for The Whitehead Minimization Problem

Heuristics for The Whitehead Minimization Problem Heuristics for The Whitehead Minimization Problem R.M. Haralick, A.D. Miasnikov and A.G. Myasnikov November 11, 2004 Abstract In this paper we discuss several heuristic strategies which allow one to solve

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Description Logics: an Introductory Course on a Nice Family of Logics. Day 2: Tableau Algorithms. Uli Sattler

Description Logics: an Introductory Course on a Nice Family of Logics. Day 2: Tableau Algorithms. Uli Sattler Description Logics: an Introductory Course on a Nice Family of Logics Day 2: Tableau Algorithms Uli Sattler 1 Warm up Which of the following subsumptions hold? r some (A and B) is subsumed by r some A

More information

Can Vector Space Bases Model Context?

Can Vector Space Bases Model Context? Can Vector Space Bases Model Context? Massimo Melucci University of Padua Department of Information Engineering Via Gradenigo, 6/a 35031 Padova Italy melo@dei.unipd.it Abstract Current Information Retrieval

More information

Efficient Approximation for Restricted Biclique Cover Problems

Efficient Approximation for Restricted Biclique Cover Problems algorithms Article Efficient Approximation for Restricted Biclique Cover Problems Alessandro Epasto 1, *, and Eli Upfal 2 ID 1 Google Research, New York, NY 10011, USA 2 Department of Computer Science,

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS

CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS 1 Language There are several propositional languages that are routinely called classical propositional logic languages. It is due to the functional dependency

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Distribution-specific analysis of nearest neighbor search and classification

Distribution-specific analysis of nearest neighbor search and classification Distribution-specific analysis of nearest neighbor search and classification Sanjoy Dasgupta University of California, San Diego Nearest neighbor The primeval approach to information retrieval and classification.

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

Basing Decisions on Sentences in Decision Diagrams

Basing Decisions on Sentences in Decision Diagrams Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Basing Decisions on Sentences in Decision Diagrams Yexiang Xue Department of Computer Science Cornell University yexiang@cs.cornell.edu

More information

VC-DENSITY FOR TREES

VC-DENSITY FOR TREES VC-DENSITY FOR TREES ANTON BOBKOV Abstract. We show that for the theory of infinite trees we have vc(n) = n for all n. VC density was introduced in [1] by Aschenbrenner, Dolich, Haskell, MacPherson, and

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Search and Lookahead. Bernhard Nebel, Julien Hué, and Stefan Wölfl. June 4/6, 2012

Search and Lookahead. Bernhard Nebel, Julien Hué, and Stefan Wölfl. June 4/6, 2012 Search and Lookahead Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg June 4/6, 2012 Search and Lookahead Enforcing consistency is one way of solving constraint networks:

More information

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity Cell-obe oofs and Nondeterministic Cell-obe Complexity Yitong Yin Department of Computer Science, Yale University yitong.yin@yale.edu. Abstract. We study the nondeterministic cell-probe complexity of static

More information

SEARCHING THROUGH SPATIAL RELATIONSHIPS USING THE 2DR-TREE

SEARCHING THROUGH SPATIAL RELATIONSHIPS USING THE 2DR-TREE SEARCHING THROUGH SPATIAL RELATIONSHIPS USING THE DR-TREE Wendy Osborn Department of Mathematics and Computer Science University of Lethbridge 0 University Drive W Lethbridge, Alberta, Canada email: wendy.osborn@uleth.ca

More information

where X is the feasible region, i.e., the set of the feasible solutions.

where X is the feasible region, i.e., the set of the feasible solutions. 3.5 Branch and Bound Consider a generic Discrete Optimization problem (P) z = max{c(x) : x X }, where X is the feasible region, i.e., the set of the feasible solutions. Branch and Bound is a general semi-enumerative

More information

MATH 117 LECTURE NOTES

MATH 117 LECTURE NOTES MATH 117 LECTURE NOTES XIN ZHOU Abstract. This is the set of lecture notes for Math 117 during Fall quarter of 2017 at UC Santa Barbara. The lectures follow closely the textbook [1]. Contents 1. The set

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Model Complexity of Pseudo-independent Models

Model Complexity of Pseudo-independent Models Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract

More information

A Single-Exponential Fixed-Parameter Algorithm for Distance-Hereditary Vertex Deletion

A Single-Exponential Fixed-Parameter Algorithm for Distance-Hereditary Vertex Deletion A Single-Exponential Fixed-Parameter Algorithm for Distance-Hereditary Vertex Deletion Eduard Eiben a, Robert Ganian a, O-joung Kwon b a Algorithms and Complexity Group, TU Wien, Vienna, Austria b Logic

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms CSE 101, Winter 2018 Design and Analysis of Algorithms Lecture 5: Divide and Conquer (Part 2) Class URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ A Lower Bound on Convex Hull Lecture 4 Task: sort the

More information

Trustworthy, Useful Languages for. Probabilistic Modeling and Inference

Trustworthy, Useful Languages for. Probabilistic Modeling and Inference Trustworthy, Useful Languages for Probabilistic Modeling and Inference Neil Toronto Dissertation Defense Brigham Young University 2014/06/11 Master s Research: Super-Resolution Toronto et al. Super-Resolution

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Decomposing Bent Functions

Decomposing Bent Functions 2004 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 8, AUGUST 2003 Decomposing Bent Functions Anne Canteaut and Pascale Charpin Abstract In a recent paper [1], it is shown that the restrictions

More information

Foundations of Artificial Intelligence

Foundations of Artificial Intelligence Foundations of Artificial Intelligence 7. Propositional Logic Rational Thinking, Logic, Resolution Joschka Boedecker and Wolfram Burgard and Bernhard Nebel Albert-Ludwigs-Universität Freiburg May 17, 2016

More information

The τ-skyline for Uncertain Data

The τ-skyline for Uncertain Data CCCG 2014, Halifax, Nova Scotia, August 11 13, 2014 The τ-skyline for Uncertain Data Haitao Wang Wuzhou Zhang Abstract In this paper, we introduce the notion of τ-skyline as an alternative representation

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Binary Decision Diagrams

Binary Decision Diagrams Binary Decision Diagrams Literature Some pointers: H.R. Andersen, An Introduction to Binary Decision Diagrams, Lecture notes, Department of Information Technology, IT University of Copenhagen Tools: URL:

More information

Entailment with Conditional Equality Constraints (Extended Version)

Entailment with Conditional Equality Constraints (Extended Version) Entailment with Conditional Equality Constraints (Extended Version) Zhendong Su Alexander Aiken Report No. UCB/CSD-00-1113 October 2000 Computer Science Division (EECS) University of California Berkeley,

More information

Polynomial Space. The classes PS and NPS Relationship to Other Classes Equivalence PS = NPS A PS-Complete Problem

Polynomial Space. The classes PS and NPS Relationship to Other Classes Equivalence PS = NPS A PS-Complete Problem Polynomial Space The classes PS and NPS Relationship to Other Classes Equivalence PS = NPS A PS-Complete Problem 1 Polynomial-Space-Bounded TM s A TM M is said to be polyspacebounded if there is a polynomial

More information

Principles of AI Planning

Principles of AI Planning Principles of 7. Planning as search: relaxed Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg June 8th, 2010 How to obtain a heuristic STRIPS heuristic Relaxation and abstraction A

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Zhixiang Chen (chen@cs.panam.edu) Department of Computer Science, University of Texas-Pan American, 1201 West University

More information