Towards Indexing Functions: Answering Scalar Product Queries Arijit Khan, Pouya Yanki, Bojana Dimcheva, Donald Kossmann

Systems Group, ETH Zurich

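Moving Objects Intersection Finding
Position at a future time instance t, computed from the parameters stored in the moving object database:
- Circular motion, parameters (r, ω): $x = r\cos(\omega t)$, $y = r\sin(\omega t)$
- Linear motion, parameters (p, q, u, v): $x = p + ut$, $y = q + vt$
1/15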

Moving Objects Intersection Finding
Find all object pairs that will be within distance S at time instance t:
$A\alpha_1 + B\alpha_2 + C\alpha_3 + D\alpha_4 + E\alpha_5 + F\alpha_6 + G\alpha_7 \le S^2$
Moving object database parameters: (r, ω) and (p, q, u, v).
  $\alpha_1 = r^2 + p^2 + q^2 + 2rp + 2rq$        $A = 1$
  $\alpha_2 = 2[u(r-p) + v(r-q)]$                 $B = t$
  $\alpha_3 = -2rp$                               $C = 1 + \sin(\omega t)$
  $\alpha_4 = -2rq$                               $D = 1 + \cos(\omega t)$
  $\alpha_5 = -2ru$                               $E = t[1 + \sin(\omega t)]$
  $\alpha_6 = -2rv$                               $F = t[1 + \cos(\omega t)]$
  $\alpha_7 = u^2 + v^2$                          $G = t^2$
2/15


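Moving Objects Intersection Finding
Find all object pairs that will be within distance S at time instance t:
$A\alpha_1 + B\alpha_2 + C\alpha_3 + D\alpha_4 + E\alpha_5 + F\alpha_6 + G\alpha_7 \le S^2$
Scalar Product Query: $(A, B, C, D, E, F, G) \cdot (\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \alpha_6, \alpha_7) \le S^2$
- $(A, \dots, G)$: query parameters (unknown)
- $(\alpha_1, \dots, \alpha_7)$: function of the stored parameters (r, ω) and (p, q, u, v) (known)
- $S^2$: inequality parameter (unknown)
2/15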

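The idea that a squared distance between two moving objects can be written as a scalar product can be checked numerically. The sketch below is an independent derivation, not the slide's coefficients: it expands the squared distance with plain cos(ωt) and sin(ωt) terms instead of the slide's 1 + sin(ωt) and 1 + cos(ωt) terms, and it assumes that ω is known when the query vector is built. The function names and the sample values are hypothetical.

```python
import math

def query_vector(t, omega):
    """Query-side vector; depends only on the time t (and omega, assumed known)."""
    c, s = math.cos(omega * t), math.sin(omega * t)
    return (1.0, t, c, s, t * c, t * s, t * t)

def pair_vector(r, p, q, u, v):
    """Data-side vector; depends only on the stored motion parameters."""
    return (r * r + p * p + q * q,   # constant term
            2 * (p * u + q * v),     # coefficient of t
            -2 * r * p,              # coefficient of cos(omega t)
            -2 * r * q,              # coefficient of sin(omega t)
            -2 * r * u,              # coefficient of t cos(omega t)
            -2 * r * v,              # coefficient of t sin(omega t)
            u * u + v * v)           # coefficient of t^2

def squared_distance(t, omega, r, p, q, u, v):
    """Direct squared distance between the circular and the linear object."""
    dx = r * math.cos(omega * t) - (p + u * t)
    dy = r * math.sin(omega * t) - (q + v * t)
    return dx * dx + dy * dy

def scalar_product(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical sample values, chosen only for illustration.
t, omega, r, p, q, u, v = 2.5, 0.7, 3.0, 1.0, -2.0, 0.5, 1.5
lhs = scalar_product(query_vector(t, omega), pair_vector(r, p, q, u, v))
rhs = squared_distance(t, omega, r, p, q, u, v)
```

With this form, "within distance S at time t" becomes the test `lhs <= S * S`, where only the query vector and S change from query to query.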


More Applications: Complex SQL Function
Patient dataset for heart-rate prediction:
  Patient ID   S   B
  P1           5   80
  P2           6   40
  P3           4   70
  P4           6   50
ARIMA time series prediction model: heart rate at time t = S × t + B
Find all patients for whom the predicted heart rate at time t is more than an input threshold H.
  CREATE FUNCTION Critical_Patient (
    INPUT double Threshold, double Time
    RETURN PatientID FROM Patient
    WHERE S * Time + B > H
  )
Time and H are unknown until query time.
3/15

More Applications: Complex SQL Function
Over the patient dataset for heart-rate prediction, the predicate S × Time + B > H (Time and H unknown) is a scalar product query:
$(\mathrm{Time}, 1) \cdot (S, B) > H$
- (Time, 1): query parameters (unknown)
- (S, B): function (known)
- H: inequality parameter (unknown)
4/15

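The patient example can be worked through as a scalar product over the four rows from the slide. This is a minimal sketch that scans every row; the function name and the sample values Time = 10 and H = 105 are hypothetical and do not come from the slides.

```python
# Patient table from the slide: PatientID -> (S, B).
PATIENTS = {"P1": (5, 80), "P2": (6, 40), "P3": (4, 70), "P4": (6, 50)}

def critical_patients(patients, time, threshold):
    """Return patients whose predicted heart rate (Time, 1) . (S, B) exceeds the threshold."""
    query = (time, 1)  # query parameters, known only at query time
    return [pid for pid, sb in patients.items()
            if sum(qi * fi for qi, fi in zip(query, sb)) > threshold]

# Predicted rates at Time = 10: P1 = 130, P2 = 100, P3 = 110, P4 = 110.
print(critical_patients(PATIENTS, time=10, threshold=105))  # ['P1', 'P3', 'P4']
```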

Problem Statement
Inequality Query
Find all data points x that satisfy: $\langle a, F(x) \rangle \le b$
Applications:
- moving-object-intersection finding
- half-space range search
- complex SQL functions
Top-k Nearest Neighbor Query
Find the top-k data points x satisfying $\langle a, F(x) \rangle \le b$ that also minimize: $|\langle a, F(x) \rangle - b| \, / \, \lVert a \rVert$
Applications:
- top-k nearest points to hyper plane
- active learning
5/15
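Both query types can be answered without any index by scanning all data points, which makes the definitions concrete. The sketch below is such a linear scan, not the planar index and not necessarily the baseline used in the experiments. The function names, the feature map F(x) = (x, x²), and the sample data are hypothetical.

```python
import heapq
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def inequality_query(data, F, a, b):
    """All data points x with <a, F(x)> <= b, found by a full scan."""
    return [x for x in data if dot(a, F(x)) <= b]

def top_k_nearest(data, F, a, b, k):
    """Among points satisfying the inequality, the k closest to the hyperplane <a, y> = b."""
    norm_a = math.sqrt(dot(a, a))
    hits = inequality_query(data, F, a, b)
    return heapq.nsmallest(k, hits, key=lambda x: abs(dot(a, F(x)) - b) / norm_a)

# Hypothetical example: the predicate x^2 - 5x <= -4 as a scalar product query.
F = lambda x: (x, x * x)
a, b = (-5, 1), -4
data = [0, 1, 2, 3, 4, 5]
print(inequality_query(data, F, a, b))   # [1, 2, 3, 4]
print(top_k_nearest(data, F, a, b, 2))   # [1, 4]
```

The function F is applied to the stored data only, so it can be evaluated once in advance; a and b arrive with each query.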

Related Work
THEORY
- Half-space Range Searching: Agarwal et al. [PODS 98], Matousek et al. [Computational Geometry 92]
- Linear Constraint Queries: Goldstein et al. [PODS 97]
MACHINE LEARNING
- Nearest Neighbor Queries: Liu et al. [ICML 12], Jain et al. [NIPS 10]
DATABASES
- Top-k Queries with Ranking Function: Chang et al. [SIGMOD 00], in et al. [SIGMOD 07], Li et al. [SIGMOD 05], Ilyas et al. [ACM Comp. Survey 08], Hristidis et al. [SIGMOD 01], Ram et al. [KDD 12]
- Index for Moving Objects: Nascimento et al. [R-tree, SAC 98], Sistla et al. [ICDE 97], Kollios et al. [PODS 99], Saltenis et al. [TPR-Tree, SIGMOD 00], Jensen et al. [B^x-Tree, VLDB 04], Tao et al. [SIGMOD 02], Zhang et al. [MBR Tree, VLDB J 12]
6/15


Planar Index: Geometrical Indexing
One planar index = a collection of parallel hyperplanes passing through the data points.
[Figure: data points x1 to x6 with the parallel index hyperplanes through them; the hyperplanes carry the labels r1 to r6 and p1 to p6.]
Query: $\langle a, x \rangle \le b$
[Figure: the query hyperplane, marked q1 and q2, drawn over the index hyperplanes. The data points fall into three groups.]
- Accept: x1, x2
- Verify: x3, x4
- Reject: x5, x6
Query Processing using Planar Index
7/15

Planar Index: Time and Space Complexity
[Figure: Accept, Verify, and Reject regions for the query $\langle a, x \rangle \le b$. Query Processing using Planar Index.]
- Index time: O(n log n)
- Index space: O(n)
- Query processing time: O(d log n + t) to O(d n)
8/15
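The accept, verify, and reject regions can be sketched with a sorted list of projections and two binary searches. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes all data coordinates are non-negative and that both the query vector a and the index normal have strictly positive entries. Under those assumptions, a point with key = ⟨normal, x⟩ satisfies key · min(a_i / normal_i) ≤ ⟨a, x⟩ ≤ key · max(a_i / normal_i), which gives the two thresholds. The class and method names are hypothetical.

```python
from bisect import bisect_right

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

class PlanarIndex:
    """Data points sorted by their projection onto one fixed normal vector."""

    def __init__(self, points, normal):
        self.normal = normal
        keyed = sorted((dot(normal, x), x) for x in points)  # O(n log n) build
        self.keys = [key for key, _ in keyed]
        self.points = [x for _, x in keyed]

    def regions(self, a, b):
        """Return (lo, hi): points[:lo] accept, points[lo:hi] verify, points[hi:] reject."""
        ratios = [ai / ci for ai, ci in zip(a, self.normal)]
        accept_key = b / max(ratios)  # key <= accept_key implies <a, x> <= b
        reject_key = b / min(ratios)  # key >  reject_key implies <a, x> >  b
        return bisect_right(self.keys, accept_key), bisect_right(self.keys, reject_key)

    def query(self, a, b):
        """All points with <a, x> <= b; only the verify region is tested point by point."""
        lo, hi = self.regions(a, b)
        verified = [x for x in self.points[lo:hi] if dot(a, x) <= b]
        return self.points[:lo] + verified

# Hypothetical data, chosen only for illustration.
index = PlanarIndex([(1, 1), (2, 1), (1, 3), (4, 4), (6, 5), (8, 9)], normal=(1, 1))
print(index.regions((1, 2), 12))  # (3, 5)
print(index.query((1, 2), 12))    # [(1, 1), (2, 1), (1, 3), (4, 4)]
```

In this example three points are accepted and one is rejected without evaluating the query on them; only the two points in the verify region need a scalar product with a.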

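Multiple Planar Indices
Query: $\langle a, x \rangle \le b$
[Figure: the same query drawn against multiple planar indices.]
9/15

Best Index Selection at Query Time
Query: $\langle a, x \rangle \le b$
[Figure: Accept, Verify, and Reject regions of the query under two planar indices. The planar index at right is better for the given query.]
9/15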
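The slides do not state how the best index is chosen. One plausible criterion, shown here purely as an assumption, is to pick the index whose hyperplanes are closest to parallel with the query hyperplane, since few points then lie between the two. The sketch measures this with the cosine between the query vector a and each index normal; the function names and sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

def best_index(normals, a):
    """Position of the index normal that is most nearly parallel to the query vector a."""
    return max(range(len(normals)), key=lambda i: abs(cosine(normals[i], a)))

# Hypothetical normals of three planar indices and one query vector.
normals = [(1, 1), (1, 3), (3, 1)]
print(best_index(normals, (1, 2)))  # 1
```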

Top-k Nearest Neighbor Query
Query: $\langle a, x \rangle \le b$; find the top-k closest points to the query hyperplane (k = 2 in the walkthrough).
[Figure: data points x1 to x8 around the query hyperplane, with Reject and Verify regions marked.]
Walkthrough across the builds:
- Points in the Reject region are discarded.
- Points in the Verify region are processed in steps (labeled Process 6, 5; Process 4; Process 3; Process 2) while a Top-2 buffer holds the current best candidates.
- A lower bound distance is used to prune the remaining points.
10/15
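The buffer-and-prune walkthrough can be sketched as a scan over points sorted by their projection onto an index normal. This is an illustration under assumptions, not the authors' algorithm: it assumes non-negative data coordinates and strictly positive a and normal, so that a point with key = ⟨normal, x⟩ has ⟨a, x⟩ ≤ key · max(a_i / normal_i). For points that satisfy the query, the distance to the query hyperplane is therefore at least (b − key · max ratio) / ‖a‖, and this bound grows as the key shrinks. The function name and the data are hypothetical.

```python
import heapq
import math
from bisect import bisect_right

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def top_k_nearest(points, normal, a, b, k):
    """Top-k points with <a, x> <= b closest to the hyperplane <a, x> = b.

    Returns a list of (distance, point) pairs, nearest first.
    """
    norm_a = math.sqrt(dot(a, a))
    ratios = [ai / ci for ai, ci in zip(a, normal)]
    max_ratio, min_ratio = max(ratios), min(ratios)
    keyed = sorted((dot(normal, x), x) for x in points)  # built once in a real index
    keys = [key for key, _ in keyed]
    # Reject: points with key > b / min_ratio cannot satisfy the inequality.
    start = bisect_right(keys, b / min_ratio)
    buffer = []  # max-heap on distance (stored negated), at most k entries
    for key, x in reversed(keyed[:start]):
        lower_bound = (b - key * max_ratio) / norm_a
        if len(buffer) == k and lower_bound >= -buffer[0][0]:
            break  # Prune: no remaining point can beat the current buffer
        value = dot(a, x)
        if value > b:
            continue  # fails verification
        dist = (b - value) / norm_a
        if len(buffer) < k:
            heapq.heappush(buffer, (-dist, x))
        elif dist < -buffer[0][0]:
            heapq.heapreplace(buffer, (-dist, x))
    return sorted((-neg, x) for neg, x in buffer)

# Hypothetical data, chosen only for illustration.
pts = [(1, 1), (2, 1), (1, 3), (4, 4), (6, 5), (8, 9)]
result = top_k_nearest(pts, normal=(1, 1), a=(1, 2), b=12, k=2)
```

On this data the scan rejects (8, 9) outright, fails (6, 5) on verification, buffers (4, 4) and (1, 3), and then stops at (2, 1) because its lower bound already exceeds the worst buffered distance.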

List of Experiments
Datasets:
- Real-World: CMoment, Ctexture, Electricity Consumption
- Synthetic: Independent, Correlated, Anti-Correlated
List of Experiments:
- Efficiency vs. No of Index
- Efficiency vs. No of Dimension
- Efficiency vs. Randomness of Query
- Efficiency vs. Query Selectivity
- Pruning Capacity vs. No of Index
- Pruning Capacity vs. No of Dimension
- Pruning Capacity vs. Randomness of Query
- Pruning Capacity vs. Query Selectivity
- Scalability of Index Building, Query Processing
- Dynamic Index Updating
- Memory Usage of Planar Index
Experimentally Evaluated Planar Index in:
- Moving-Object Intersection
- Top-k Nearest Neighbor Query
11/15

Dataset and Query
Datasets:
  Dataset                    # Data Points   # Dimension   Attribute Range
  CMoment (Real-World)       68,040          9             ( 4.15, 4.59)
  Independent (Synthetic)    1,000,000       2-14          (1, 100)
Query: $Q_1\alpha_1 + Q_2\alpha_2 + \dots + Q_d\alpha_d \le 75\,(Q_1 + Q_2 + \dots + Q_d)$
- Query Selectivity: right-hand side of the query
- Randomness of Query (QR): $Q_i \in (1, n)$
12/15

Efficiency (Real-World Dataset)
# Dimension = 9, # Index = 100
1.13 to 4.5 times better than Baseline
13/15

Efficiency (Synthetic Dataset)
# Index = 100
- Dimension = 2: 12 to 170 times better than Baseline
- Dimension = 6: 1.6 to 400 times better than Baseline
14/15

Application: Moving Object Intersection
Intersection Finding among 5K × 5K Moving Objects
- Objects moving with uniform velocity: 12 to 50 times better than Baseline
- Objects moving with acceleration: 27 to 55 times better than Baseline
15/15

Conclusion
- Scalar product query: widely applicable
- Planar index: one generalized index for many problems
- Application in moving object intersection finding
Future Work: Dynamic updates in planar indices based on past query workload
Software and Dataset (Publicly Available): http://people.inf.ethz.ch/khana/software/scalar.tar.gz