Skylines. Yufei Tao. ITEE, University of Queensland. INFS4205/7205.



Today we will discuss problems closely related to multi-criteria optimization, where one aims to identify objects that strike a good balance (often optimal in some sense) among multiple dimensions. Our focus will be on how to find such objects efficiently. In particular, we will consider points indexed by an R-tree, and study algorithms that leverage the R-tree to reduce cost effectively.

Monotonically Increasing Functions Let p be a d-dimensional point in R^d, and let f : R^d -> R be a function that computes a score f(p) for p. We say that f is monotonically increasing if the score increases whenever any coordinate of p increases.

Top-1 Search Let us first look at one of the most natural operations in multi-criteria optimization. Let P be a set of d-dimensional points in R^d. Given a monotonically increasing function f, a top-1 query finds the point in P that has the smallest score. In the special case where multiple points share the smallest score, all of them should be returned. f is often referred to as the preference function of the query. Without loss of generality, we assume that the coordinates of all points are positive (think: why is this a fair assumption?).
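The semantics above can be captured by a brute-force scan; a minimal Python sketch on a small hypothetical 2D dataset (an R-tree would avoid examining every point, but the answer is the same — note that all tied points must be reported):

```python
def top1(points, f):
    """Return ALL points of `points` attaining the smallest score under f
    (ties are possible, and every tied point must be reported)."""
    best = min(f(x, y) for x, y in points)
    return [p for p in points if f(*p) == best]

# Hypothetical dataset; all coordinates positive, as assumed in the text.
pts = [(1, 9), (2, 4), (5, 3), (4, 1), (7, 7), (3, 6)]
print(top1(pts, lambda x, y: x + y))  # [(4, 1)] -- its score 5 is smallest
```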

Example If f(x, y) = x + y, the top-1 point is p. [Figure: the example dataset p 1, ..., p 13 in the plane]

Applications of Top-1 Search Find the hotel that minimizes price + distance (to the beach). Find the second-hand car that minimizes price + 0.1 * mileage. Find the interviewee that maximizes a weighted sum of education, experience, and personality. ...

Top-1 Search is NN Search in Disguise! Assuming that the dataset P is indexed by an R-tree, we can answer a top-1 query by directly applying a nearest neighbor algorithm discussed before. Specifically, the top-1 result is precisely the NN of the origin of the data space (i.e., the query point) under the distance function f. Think: what is the mindist of an MBR to the query point? [Figure: an R-tree on p 1, ..., p 13, with MBRs r 1, ..., r 7 and nodes u 1, ..., u 7]

Drawback of Top-1 Search: The Preference Function In general, it is difficult to decide which preference function f to use. For example, assume that the x-dimension corresponds to the price of a hotel and the y-dimension to its distance to the beach. Why is f(x, y) = x + y a good function to use? Why not weight the two dimensions differently, or use something more complex? [Figure: the example dataset p 1, ..., p 13]

To remedy the drawback, we need to do away with preference functions. But then we fall into a dilemma: how do we return top-1 points? The skyline operator achieves the purpose with an interesting idea: instead of returning only 1 object, how about returning a small set of objects that is guaranteed to cover the result of any top-1 query (i.e., regardless of the preference function)?

Dominance A point p dominates a point p' if the coordinate of p is smaller than or equal to that of p' in all dimensions, and strictly smaller in at least one dimension. [Figure: the example dataset p 1, ..., p 13] For example, in the figure one point dominates another, which in turn dominates p 5. It is clear that dominance is transitive: if p dominates p' and p' dominates p'', then p dominates p''.
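The dominance test translates directly into code; a small sketch (the point tuples here are hypothetical):

```python
def dominates(p, q):
    """p dominates q: p is <= q in every dimension and < in at least one."""
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

print(dominates((2, 3), (4, 3)))  # True: equal in y, strictly smaller in x
print(dominates((2, 3), (2, 3)))  # False: a point never dominates itself
print(dominates((1, 5), (5, 1)))  # False: the two points are incomparable
```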

Skyline Search Let P be a set of d-dimensional points in R^d. The skyline of P is the set of points in P that are not dominated by any other point. [Figure: the example dataset] The skyline is {p 1, p, p 9, p 1}.
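A quadratic-time scan computes the skyline straight from the definition; a minimal sketch on a hypothetical dataset (the R-tree algorithms discussed later do much better):

```python
def dominates(p, q):
    # p <= q everywhere, strictly < somewhere
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline(points):
    """Keep exactly the points not dominated by any other point: O(n^2)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

pts = [(1, 9), (2, 4), (5, 3), (4, 1), (7, 7), (3, 6)]
print(skyline(pts))  # [(1, 9), (2, 4), (4, 1)]
```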

Now we can talk about how the skyline remedies the aforementioned drawback of top-1 search. Theorem: For any monotonically increasing function, a top-1 point is definitely in the skyline. Proof: Suppose that this is not true, and that there exists a function f for which a top-1 point p is not in the skyline. This means that p is dominated by at least one other point p'. The monotonicity of f tells us that f(p') < f(p), which contradicts the fact that p is a top-1 point.
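The theorem can also be checked empirically: positively weighted linear functions are one family of monotonically increasing functions, and for every one of them the top-1 point must fall in the skyline. A small randomized sanity check (all names and the random dataset are hypothetical):

```python
import random

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

random.seed(42)
pts = [(random.randint(1, 50), random.randint(1, 50)) for _ in range(200)]
sky = set(skyline(pts))
for _ in range(1000):
    # positive weights => f is monotonically increasing in each coordinate
    w1, w2 = random.uniform(0.1, 5.0), random.uniform(0.1, 5.0)
    top = min(pts, key=lambda p: w1 * p[0] + w2 * p[1])
    assert top in sky  # a top-1 point is always a skyline point
print("all 1000 random preference functions agreed")
```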

Recall that our goal is to return a small set of points that is guaranteed to cover the top-1 result of any preference function. The next theorem states that the skyline is indeed the smallest such subset. Theorem: Every point in the skyline is definitely a top-1 point of some monotonically increasing function. The proof is not required, but the instructor will outline the idea behind it.

Next we will discuss how to find the skyline efficiently using an R-tree. We will consider two scenarios: What if NN search is the only available operator on the R-tree (plus, of course, the standard insertion/deletion operators)? Think: why could this happen? We will discuss how to use NN search as a black box to find the skyline. What if we are allowed to design an algorithm of our own on the R-tree? We will come up with another algorithm that solves the problem optimally.

First Algorithm: Using NN Search as a Black Box Lemma: Let p be the (Euclidean) NN of the origin of the data space, among all the points in the dataset P. Then p must be in the skyline of P. Proof: Suppose that this is not true. Then there exists another point p' that dominates p. This, however, means that p' must be closer to the origin than p, creating a contradiction.

NN Search as a Black Box Motivated by the previous observation, we can find the skyline by repeating the next two operations until the R-tree is empty: Find the NN p of the origin of the data space, and include p in the skyline. Delete p from the tree, as well as all the points that are dominated by p. Think: why is the algorithm correct?
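The two repeated operations can be sketched as follows; the brute-force minimum below is merely a stand-in for the R-tree NN query, and the dataset is hypothetical:

```python
import math

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline_via_nn(points):
    """Repeat: take the NN of the origin, report it, then delete it
    together with every point it dominates, until nothing remains."""
    remaining = list(points)
    sky = []
    while remaining:
        p = min(remaining, key=lambda q: math.dist((0, 0), q))  # the NN query
        sky.append(p)
        remaining = [q for q in remaining if q != p and not dominates(p, q)]
    return sky

pts = [(1, 9), (2, 4), (5, 3), (4, 1), (7, 7), (3, 6)]
print(skyline_via_nn(pts))  # [(4, 1), (2, 4), (1, 9)]
```

Note that the skyline points come out in ascending order of distance to the origin, mirroring the lemma above.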

Example [Figure: the dataset before (left) and after (right) the deletions] The first NN query finds p. The figure on the right shows the dataset after the corresponding deletions.

Example [Figure: the dataset before (left) and after (right) the deletions] The second NN query finds p 9. The figure on the right shows the dataset after the corresponding deletions.

Example [Figure: the remaining dataset] The third NN query finds p 1 and p 1. The dataset becomes empty after the corresponding deletions. Final skyline = {p, p 9, p 1, p 1}.

The previous algorithm must access all the points in the dataset (why? every point is either reported as a skyline point, or must be located in order to be deleted). The next algorithm remedies this defect by accessing (typically) only a part of the dataset.

Second Algorithm: Branch-and-Bound Skyline (BBS) The BBS algorithm can be thought of as a variation of the BF (best-first) algorithm for NN search. It accesses the nodes of the R-tree in ascending order of their mindists to the origin. Different from NN search, however, once an MBR is dominated in its entirety by a skyline point already found, it can be pruned. Next we will illustrate the algorithm through an example.

BBS First, we access the root and put its MBRs in a sorted list H, namely H = {(r 7, ), (r, )}. Note that the sorting key of an MBR is its mindist to the origin. [Figure: the R-tree on p 1, ..., p 13, with MBRs r 1, ..., r 7 and nodes u 1, ..., u 7]

BBS The algorithm then visits node u 7, after which H = {(r 3, 13), (r, ), (r, 0), (r 5, )}. [Figure: the R-tree]

BBS We now visit u 3, which is a leaf node, and insert all the points therein into H, which becomes H = {(p, 1), (r, ), (p 9, 9), (r, 0), (p 7, 5), (r 5, )}. [Figure: the R-tree]

BBS Now we remove p from H. This is the first skyline point found. H is now {(r, ), (p 9, 9), (r, 0), (p 7, 5), (r 5, )}. [Figure: the R-tree]

BBS Next, access u, and update H to {(p 9, 9), (r, 0), (p 7, 5), (r, 1), (r 1, 5), (r 5, )}. [Figure: the R-tree]

BBS Remove from H the next point p 9, which is the second skyline point found. H is now {(r, 0), (p 7, 5), (r, 1), (r 1, 5), (r 5, )}. [Figure: the R-tree]

BBS The next MBR r in H can be directly discarded, because its lower-left corner is dominated by the skyline point p. This implies that no point in r can possibly be in the skyline. Following the same reasoning, both the next point p 7 and the following MBR r are discarded. Currently H = {(r 1, 5), (r 5, )}. Skyline points found so far = {p, p 9}. [Figure: the R-tree]

BBS Next, access the leaf node u 1, after which H becomes H = {(p 3, 0), (p 1, ), (r 5, ), (p, )}. [Figure: the R-tree] The next entry p 3 is discarded (it is dominated). Then we come to p 1, which is added to the skyline; the skyline is now {p 1, p, p 9}.

BBS Now H = {(r 5, ), (p, )}. The rest of the algorithm proceeds in the same manner. We discover the 4th skyline point p 1 in r 5. [Figure: the R-tree]

Pseudocode of BBS
algorithm BBS
1. insert the MBR of the root into a sorted list H
   /* entries in H are sorted by their mindists to the origin */
2. SKY = empty set /* current result */
3. while H is not empty do
4.   remove the first entry e from H
5.   if e is the MBR of a node u then
6.     if no point in SKY dominates the lower-left corner of e then
7.       visit u and insert all MBRs/points therein into H
8.   else /* e is a point */
9.     if no point in SKY dominates e then
10.      add e to SKY
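The pseudocode translates into the runnable sketch below. Rather than a full R-tree, it assumes a hypothetical one-level tree — a root whose entries are leaf MBRs, each leaf holding points — and keeps H as a heap ordered by (squared) mindist to the origin; all function and variable names are illustrative:

```python
import heapq
from itertools import count

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def sq_mindist(lo):
    # With positive coordinates, the mindist of an MBR to the origin is the
    # distance to its lower-left corner `lo`; squaring preserves the order.
    return sum(c * c for c in lo)

def bbs(leaves):
    """leaves: list of lists of points (one list per leaf of the tree).
    Returns the skyline, discovered in ascending order of mindist."""
    tie = count()     # tie-breaker so the heap never compares payloads
    heap = []         # entries: (sq_mindist, tie, kind, payload)
    for leaf in leaves:  # the "root": one MBR per leaf (lines 1)
        lo = tuple(min(c) for c in zip(*leaf))  # lower-left corner of the MBR
        heapq.heappush(heap, (sq_mindist(lo), next(tie), 'leaf', (lo, leaf)))
    sky = []          # SKY (line 2)
    while heap:       # line 3
        _, _, kind, payload = heapq.heappop(heap)       # line 4
        corner = payload[0] if kind == 'leaf' else payload
        if any(dominates(s, corner) for s in sky):
            continue  # lines 6/9: the whole entry is dominated -> prune
        if kind == 'point':
            sky.append(payload)                         # line 10
        else:
            for p in payload[1]:                        # line 7: visit the node
                heapq.heappush(heap, (sq_mindist(p), next(tie), 'point', p))
    return sky

leaves = [[(1, 9), (2, 4)], [(5, 3), (4, 1)], [(7, 7), (3, 6)]]
print(bbs(leaves))  # [(4, 1), (2, 4), (1, 9)]
```

On this toy input, the third leaf is pruned wholesale: its lower-left corner (3, 6) is dominated by the already-found skyline point (2, 4), so neither of its points is ever pushed into H.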

Optimality of BBS BBS is optimal, i.e., it accesses the least number of nodes among all algorithms that correctly find the skyline using the same R-tree (without modifying the structure). To prove this, let us define the search region as the set of points in R^d that are not dominated by any skyline point. [Figure: in our previous example, the search region is the shaded area below the staircase of the skyline points] It is easy to see that any correct algorithm must access all the nodes whose MBRs intersect the search region.

Optimality of BBS We can show that BBS accesses only the nodes whose MBRs intersect the search region. Assume, for contradiction, that the algorithm needs to visit a node u whose MBR r is disjoint from the region. It follows that some skyline point p dominates the lower-left corner of r. Thus, p has a smaller mindist (i.e., distance) to the origin than r. The sequence of points/nodes processed by BBS is in ascending order of mindist to the origin (prove this by yourself). Hence, p is processed before u. However, once p is processed, it enters the skyline, making it impossible for the algorithm to visit u.