LOCALITY SENSITIVE HASHING FOR BIG DATA

Similar documents
CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

4 Locality-sensitive hashing using stable distributions

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Lecture 14 October 16, 2014

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Randomized Algorithms

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

Composite Quantization for Approximate Nearest Neighbor Search

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

Tailored Bregman Ball Trees for Effective Nearest Neighbors

Lecture 17 03/21, 2017

CS246 Final Exam, Winter 2011

Algorithms for Data Science: Lecture on Finding Similar Items

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

Carnegie Mellon Univ. Dept. of Computer Science Database Applications. SAMs - Detailed outline. Spatial Access Methods - problem

Knowledge Discovery and Data Mining 1 (VO) ( )

Nearest Neighbor Search with Keywords

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Locality Sensitive Hashing

Behavioral Simulations in MapReduce

High-Dimensional Indexing by Distributed Aggregation

Slides based on those in:

Ch. 10 Vector Quantization. Advantages & Design

Configuring Spatial Grids for Efficient Main Memory Joins

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Spatial Database. Ahmad Alhilal, Dimitris Tsaras

Multi-Approximate-Keyword Routing Query

Spectral Hashing: Learning to Leverage 80 Million Images

Jeffrey D. Ullman Stanford University

Querying. 1st Semester 2008/2009

Large-Scale Behavioral Targeting

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Similarity Search in High Dimensions II. Piotr Indyk MIT

Introduction to Randomized Algorithms III

Ad Placement Strategies

6.854 Advanced Algorithms

Star-Structured High-Order Heterogeneous Data Co-clustering based on Consistent Information Theory

Faster Algorithms for Sparse Fourier Transform

Similarity Search. Stony Brook University CSE545, Fall 2016

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Introduction to Machine Learning CMU-10701

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

L11: Pattern recognition principles

compare to comparison and pointer based sorting, binary trees

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Copyright 2000, Kevin Wayne 1

Clustering with k-means and Gaussian mixture distributions

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Efficient and Effective Similarity Search based on Earth Mover's Distance

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

ECE 592 Topics in Data Science

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Analysis of Algorithms. Unit 5 - Intractable Problems

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

CSCI-567: Machine Learning (Spring 2019)

Announcements. CSE332: Data Abstractions Lecture 2: Math Review; Algorithm Analysis. Today. Mathematical induction. Dan Grossman Spring 2010

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Faster Algorithms for Sparse Fourier Transform

Lin-Kernighan Heuristic. Simulated Annealing

Bloom Filters, general theory and variants

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

B490 Mining the Big Data

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

Graph Analysis Using Map/Reduce

CSE373: Data Structures and Algorithms Lecture 2: Math Review; Algorithm Analysis. Hunter Zahn Summer 2016

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Nearest Neighbor Search in Spatial

Decision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

P, NP, NP-Complete, and NP-hard

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Succinct Approximate Counting of Skewed Data

Lecture 8 HASHING!!!!!

arxiv: v1 [cs.db] 2 Sep 2014

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Binary Decision Diagrams and Symbolic Model Checking

Similarity Join Size Estimation using Locality Sensitive Hashing

A Tight Lower Bound for Dynamic Membership in the External Memory Model

Proximity problems in high dimensions

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

CS246 Final Exam. March 16, :30AM - 11:30AM

Clustering with k-means and Gaussian mixture distributions

Solving Corrupted Quadratic Equations, Provably

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

Privacy-Preserving Data Mining

SPATIAL INDEXING. Vaibhav Bajpai

Bloom Filters and Locality-Sensitive Hashing

Making Nearest Neighbors Easier. Restrictions on Input Algorithms for Nearest Neighbor Search: Lecture 4. Outline. Chapter XI

TASM: Top-k Approximate Subtree Matching

CS425: Algorithms for Web Scale Data

Analysis of Algorithms I: Perfect Hashing

Transcription:

1 LOCALITY SENSITIVE HASHING FOR BIG DATA Wei Wang, University of New South Wales

Outline 2 NN and c-ANN; Existing LSH methods for large data; Our approach: SRS [VLDB 2015]; Conclusions

NN and c-ANN Queries 3 A set of points D = {O_1, ..., O_n} in d-dimensional Euclidean space (d is large, e.g., hundreds). Given a query point q, find the closest point, O*, in D. Relaxed version (c-ANN, aka (1+ε)-ANN): return a point whose distance to q is at most c·dist(O*, q); such a point may be returned only with at least constant probability. [Figure: query q with radii r and cr in d-dimensional space; D = {a, b, x}, NN = a, c-ANN = x.]

Motivations 4 Theoretical interest: a fundamental geometric problem (the post-office problem), known to be hard due to high dimensionality. Many applications: feature vectors in data mining and multimedia DBs; applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics.

Applications 5 SIFT feature matching; image retrieval, understanding, scene rendering; video retrieval.

Challenges 6 Curse of dimensionality / concentration of measure: hard to find algorithms sub-linear in n and polynomial in d; the approximate version (c-ANN) is not much easier (more details on the next slide). Large data size: 1 KB for a single point with 256 dims, so 1B points = 1 TB; ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image, hence 1.4B feature vectors. High dimensionality: finding similar words in D = 198 million tweets [JMLR 2013].

Existing Solutions 7 NN: linear scan is (practically) the best approach using linear space and time. Known results: O(d^5 log n) query time with O(n^(2d+δ)) space; O(d·n^(1−ε(d))) query time with O(dn) space; linear scan: O(dn/B) I/Os, O(dn) space. (1+ε)-ANN: LSH is the best approach using sub-quadratic space. O(log n + 1/ε^((d−1)/2)) query time with O(n·log(1/ε)) space; probabilistic tests remove the exponential dependency on d; Fast JLT: O(d·log d + ε^(−3)·log² n) query time, O(n^max(2, ε^(−2))) space; LSH-based: Õ(d·n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1).

Approximate NN for Multimedia Retrieval 8 Cover-tree, spill-tree; reduce to NN search with Hamming distance; dimensionality reduction (e.g., PCA); quantization-based approaches (e.g., CK-Means). They are heuristic methods (i.e., no quality guarantee) or rely on some hidden characteristics of the data.

Locality Sensitive Hashing (LSH) 9 Equality search via hashing: Index: store o into bucket h(o). Query: retrieve every o in the bucket h(q) and verify whether o = q. LSH: draw h from an LSH family, where Pr[h(q) = h(o)] decreases with Dist(q, o) (roughly ∝ 1/Dist(q, o); technically the collision probability also depends on a radius r); h: R^d → Z. Near-by points (blue) have more chance of colliding with q than faraway points (red). [Figure: query q with radii r and cr in d-dimensional space.]
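To make such a family concrete, here is a minimal sketch of the standard 2-stable (E2LSH-style) hash h(o) = floor((a·o + b)/w), one common instantiation of an LSH family for Euclidean distance; the dimensionality, bucket width w, and test vectors below are illustrative choices, not values from the talk.

```python
import numpy as np

def make_e2lsh_hash(d, w, rng):
    """Sample one hash h(o) = floor((a . o + b) / w) from the 2-stable
    (Gaussian) LSH family; nearby points collide more often than far ones."""
    a = rng.standard_normal(d)      # 2-stable random direction
    b = rng.uniform(0.0, w)         # random offset within one bucket width
    return lambda o: int(np.floor((np.dot(a, o) + b) / w))

rng = np.random.default_rng(0)
d, w = 64, 4.0                      # illustrative dimensionality and bucket width
h = make_e2lsh_hash(d, w, rng)

q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)   # close to q
far = q + 10.0 * rng.standard_normal(d)   # far from q
# Typically prints "True False": the near point usually lands in q's bucket,
# the far point almost never does.
print(h(q) == h(near), h(q) == h(far))
```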

LSH: Indexing & Query Processing 10 (constant success probability). Indexing, for a fixed r: sig(o) = (h_1(o), h_2(o), ..., h_k(o)); store o into bucket sig(o); repeat this L times (boosting); iteratively increase r. Query processing: search with a fixed r, i.e., the (r, c·r)-NN problem: retrieve and verify points in the bucket sig(q); control the false-positive size. Galloping search to find the first good r: incurs additional cost and gives only a c² quality guarantee. [Figure: query q with radii r and cr in d-dimensional space.]
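A toy sketch of this scheme using the same E2LSH-style hashes: k concatenated hash values form sig(o), L independent tables are probed at query time, and candidates are verified against the true distance. The class name LSHIndex and the parameters k, L, w are illustrative, not the speaker's implementation.

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Toy (r, c*r)-NN index: L tables, each keyed by a k-wise signature."""
    def __init__(self, d, k, L, w, rng):
        self.w = w
        # One (k x d) projection matrix and one offset vector per table.
        self.A = [rng.standard_normal((k, d)) for _ in range(L)]
        self.B = [rng.uniform(0.0, w, size=k) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _sig(self, t, o):
        return tuple(np.floor((self.A[t] @ o + self.B[t]) / self.w).astype(int))

    def insert(self, oid, o):
        for t, table in enumerate(self.tables):
            table[self._sig(t, o)].append(oid)

    def query(self, q, data, r):
        # Union of the L buckets of sig(q), then verify true distances.
        cand = {oid for t, table in enumerate(self.tables)
                for oid in table.get(self._sig(t, q), [])}
        return [oid for oid in cand if np.linalg.norm(data[oid] - q) <= r]

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32))
idx = LSHIndex(d=32, k=8, L=10, w=4.0, rng=rng)   # illustrative parameters
for i, o in enumerate(data):
    idx.insert(i, o)
print(idx.query(data[0] + 0.01, data, r=1.0))     # finds point 0
```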

Locality Sensitive Hashing (LSH) 11 Standard LSH reduces c²-ANN to many (c^i, c^(i+1))-NN problems. Superior empirical query performance + theoretical guarantees without assumptions on data distributions. Main problem: huge super-linear index size, Õ(log(dist_NN)·n^(1+ρ+o(1))), typically hundreds of times larger than the data size.

Dealing with Large Data /1 12 Approach 1: a single hash table but probing more buckets. Multi-probe [VLDB 07] (Princeton): probe the ±1, ±2, ... buckets around sig(q); heuristic method, sensitive to parameter settings. Entropy LSH [SODA 06] (Stanford): create multiple perturbed queries q_i from q and probe the buckets sig(q_i); O(n^(2/c)) perturbed queries with O(1) hash tables.

Dealing with Large Data /2 13 Approach 2: distributed LSH. Simple distributed LSH: use a (key, value) store to implement (distributed) hash tables; key → machine ID; network traffic O(n^Θ(1)). Layered LSH [CIKM 12] (Stanford): use (h(key), key + value), where h(·) is an LSH; network traffic O((log n)^0.5) w.h.p.; 13 nodes for 64-dim color histograms of the 80 million Tiny Images dataset.

Dealing with Large Data /3 14 Approach 3: parallel LSH. PLSH [VLDB 13] (MIT + Intel): partition the data across nodes; broadcast the query and let each node process its data locally. Meticulous hardware-oriented optimizations: fit the indexes into main memory at each node; SIMD, cache-friendly bitmaps, 2-level hashing for TLB optimization, other optimizations for streaming data; build predictive performance models for parameter tuning. 100 servers (64 GB memory each) for 1B streaming tweets and queries.

Dealing with Large Data /4 15 Approach 4: LSH on external memory with index-size reduction. LSB-forest [SIGMOD 09, TODS 10]: a different reduction from c²-ANN to (c^i, c^(i+1))-NN problems; O((dn/B)^0.5) query, O((dn/B)^1.5) space. C2LSH [SIGMOD 12]: does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections; O(n·log(n)/B) query, O(n·log(n)/B) space, with small constants in the O(·). SRS (ours): O(n/B) query, O(n/B) space.

Weakness of Ext-Memory LSH Methods 16 Existing methods use super-linear space: thousands (or more) of hash tables are needed if done rigorously. People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval: can only handle c where c = x², for integer x ≥ 2, to enable reusing the hash tables (merging buckets); valuable information is lost (due to quantization); updates? (changes to n and c). Index sizes: Dataset (size) | LSB-forest | C2LSH | SRS (ours): Audio (40 MB) | 1500 MB | 127 MB | 2 MB

SRS: Our Proposed Method 17 Solves c-ANN queries with O(n) query time and O(n) space, with constant probability; the constants hidden in O(·) are very small; the early-termination condition is provably effective. Advantages: small index, rich functionality, simple. Central idea: a c-ANN query in d dims becomes a kNN query in m dims with filtering; model the distribution of m 2-stable random projections.

2-stable Random Projection 18 Let D be the 2-stable distribution = the standard normal distribution N(0, 1). For two i.i.d. random variables A ~ D, B ~ D: x·A + y·B ~ (x² + y²)^(1/2)·D. Illustration: for a vector v and a random direction r_1 with i.i.d. N(0, 1) entries, the projection ⟨v, r_1⟩ ~ N(0, ‖v‖²), i.e., a normal with standard deviation ‖v‖ (similarly for r_2).
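In display form, this is the standard 2-stability property of the Gaussian, with its direct consequence for projections (my restatement, consistent with the slide):

```latex
\[
  x A + y B \;\overset{d}{=}\; \sqrt{x^2 + y^2}\,\mathcal{N}(0,1)
  \quad\Longrightarrow\quad
  \langle v, r\rangle = \sum_{i=1}^{d} v_i\,r_i \;\sim\; \mathcal{N}\!\bigl(0,\ \lVert v\rVert_2^2\bigr)
  \ \text{ for } r \text{ with i.i.d. } \mathcal{N}(0,1)\text{ entries.}
\]
```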

Dist(O) and ProjDist(O) and Their Relationship 19 Let v = q − O and let z_i = ⟨v, r_i⟩ be the i-th projection, so each z_i ~ N(0, ‖v‖²); think of the z_i as samples. O in d dims becomes, under m 2-stable random projections, the point (z_1, ..., z_m) in m dims. Then z_1² + ... + z_m² ~ ‖v‖²·χ²_m, i.e., a scaled chi-squared distribution with m degrees of freedom. Ψ_m(x): the cdf of the standard χ²_m distribution. [Figure: O and q in d dims mapped to Proj(O) and Proj(q) in m dims; Dist(O) vs ProjDist(O).]
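A quick numerical sanity check of this relationship; numpy/scipy, the sample sizes, and the test point x = 4 are illustrative choices of mine, not part of the talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, m, trials = 64, 6, 20_000

v = rng.standard_normal(d)                 # v = q - O, an arbitrary difference vector
R = rng.standard_normal((trials, m, d))    # a fresh set of m projections per trial
ratio = ((R @ v) ** 2).sum(axis=1) / np.dot(v, v)   # ProjDist^2 / Dist^2

# If the claim holds, ratio follows a chi-squared distribution with m d.o.f.
print(ratio.mean(), ratio.var())                         # ~ m = 6 and ~ 2m = 12
print((ratio <= 4.0).mean(), stats.chi2.cdf(4.0, df=m))  # empirical vs Psi_m(4)
```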

LSH-like Property 20 Intuitive idea: if Dist(O_1) ≪ Dist(O_2), then ProjDist(O_1) < ProjDist(O_2) with high probability. But the converse is NOT true: with only a few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-NN objects. However, we can bound the expected number of such cases (say T). Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test.

Indexing 21 Finding the minimum m. Input: n, c, T (the max # of points to be accessed by the algorithm). Output: m (the # of 2-stable random projections) and T′ (a better bound on T). m = O(n/T); we use T = O(n), so m = O(1), which yields a linear-space index. Generate m 2-stable random projections, giving n projected points in an m-dimensional space. Index these projections using any index that supports incremental kNN search, e.g., an R-tree. Space cost: O(m·n) = O(n).
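A minimal index-construction sketch along these lines; scipy's cKDTree stands in for the R-tree (it answers k-nearest queries rather than exposing a true incremental cursor), and all sizes and seeds are illustrative rather than the paper's settings.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
n, d, m = 50_000, 128, 6                  # m = O(1) projections -> linear-size index

data = rng.standard_normal((n, d))        # the original d-dimensional points
proj = rng.standard_normal((d, m))        # m 2-stable (Gaussian) random projections
proj_data = data @ proj                   # the index payload: n points in m dims

tree = cKDTree(proj_data)                 # stand-in for an R-tree with incremental kNN
q = rng.standard_normal(d)
dists, ids = tree.query(q @ proj, k=50)   # the first 50 points in projected-kNN order
print(ids[:10])
```

A disk-based structure such as the R-tree is what keeps the query cost at O(n/B) pages; the in-memory tree here only illustrates the shape of the index.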

SRS-12(T′, c, p_τ) 22 Compute proj(q). Do incremental kNN search from proj(q): for k = 1 to T′: compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i); if the early-termination test (c, p_τ) = TRUE, BREAK. // stopping conditions. Return O_min. Early-termination test: Ψ_m(c²·ProjDist²(O_k) / Dist²(O_min)) > p_τ. Main Theorem: SRS-12 returns a c-ANN point with probability p_τ − f(m, c), with O(n) I/O cost. Example: c = 4, d = 256, m = 6, T′ = 0.00242n, B = 1024, p_τ = 0.18 gives Index = 0.0059n, Query = 0.0084n, success probability = 0.13.
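A self-contained sketch of this query loop under simplifying assumptions: a full sort of projected distances stands in for the R-tree's incremental kNN, scipy.stats.chi2.cdf plays the role of Ψ_m, and the parameter values (T′, c, p_τ) are illustrative rather than the tuned ones from the theorem.

```python
import numpy as np
from scipy.stats import chi2

def srs_query(data, q, proj, T_prime, c, p_tau):
    """SRS-style c-ANN query: scan points in order of projected distance,
    keep the best exact distance seen, and stop early once the chi-squared
    test certifies o_min is a c-ANN with probability >= p_tau."""
    m = proj.shape[1]
    proj_data = data @ proj                      # n points in m dims (the index)
    proj_q = q @ proj
    proj_d2 = ((proj_data - proj_q) ** 2).sum(axis=1)
    order = np.argsort(proj_d2)                  # stands in for incremental kNN

    o_min, best_d2 = None, np.inf
    for idx in order[:T_prime]:
        d2 = ((data[idx] - q) ** 2).sum()        # exact distance in d dims
        if d2 < best_d2:
            o_min, best_d2 = idx, d2
        # Early-termination test: Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if chi2.cdf(c * c * proj_d2[idx] / best_d2, df=m) > p_tau:
            break
    return o_min

rng = np.random.default_rng(3)
n, d, m = 10_000, 128, 6
data = rng.standard_normal((n, d))
q = rng.standard_normal(d)
proj = rng.standard_normal((d, m))               # m 2-stable random projections
ans = srs_query(data, q, proj, T_prime=250, c=4.0, p_tau=0.18)
print(ans, np.linalg.norm(data[ans] - q))
```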

Variations of SRS-12(T′, c, p_τ) 23 Same framework: compute proj(q); do incremental kNN search from proj(q): for k = 1 to T′: compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i); if the early-termination test (c, p_τ) = TRUE, BREAK; return O_min. // stopping conditions. Variants: 1. SRS-1: better quality; query cost is O(T′). 2. SRS-2: best quality; query cost bounded by O(n); handles c = 1. 3. SRS-12+(T′, c, p_τ): better quality; query cost bounded by O(T′). All with success probability at least p_τ.

Other Results 24 Uniformly optimal method for L_2. Can easily be extended to support top-k c-ANN queries (k > 1): there was no previously known guarantee on the correctness of returned results; we guarantee correctness with probability at least p_τ whenever the algorithm stops due to the early-termination condition (100% in practice, 97% in theory).

Experiment Setup 25 Algorithms: LSB-forest [SIGMOD 09, TODS 10], C2LSH [SIGMOD 12], SRS-* [VLDB 15]. Data: see the next slide. Measures: index size, query cost, result quality, success probability.

Datasets 26 [Table of the datasets used, with sizes 37 GB, 122 GB, 820 GB, and 12.5 PB.]

Tiny Image Dataset (8M pts, 384 dims) 27 Fastest: SRS; slowest: C2LSH; quality is the other way around. SRS has comparable quality to C2LSH yet much lower cost. SRS-* dominates LSB-forest.

Approximate Nearest Neighbor 28 Empirically better than the theoretical guarantee: with 15% of the I/Os of a linear scan, it returns the NN with probability 71%; with 62% of the I/Os of a linear scan, with probability 99.7%.

Large Dataset (1 Billion Points, 128 dims) 29 Works on a PC (4 GB memory, 500 GB hard disk).

Theoretical Guarantees Are Important 30 [Figure: success probability in a hard setting, 0.78 vs 0.29.]

Summary 31 Index size is the bottleneck for traditional LSH-based methods: it limits the data size one node can support, and many nodes with large memory are needed to scale to billions of data points. SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space. Our index size is approximately m/d of the size of the data file. Can easily be generalized to a parallel implementation. Ongoing work: scaling to even larger datasets using data partitioning; accommodating rapid updates; other theoretical work for LSH.

Q&A 32 Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html


Backup Slides: Analysis 34

Stopping Condition 35 Near point: the NN point; its distance is r. Far points: points whose distance is > c·r. Then for any β > 0 and any o: Pr[ProjDist(o) ≤ β·r | o is a near point] = Ψ_m(β²); Pr[ProjDist(o) ≤ β·r | o is a far point] ≤ Ψ_m(β²/c²); both because ProjDist²(o)/Dist²(o) ~ χ²_m. P1 = Pr[the NN point is projected before β·r] = Ψ_m(β²). P2 = Pr[# of far points projected before β·r < T′] ≥ 1 − Ψ_m(β²/c²)·(n/T′), by Markov's inequality. Choose β such that P1 + P2 − 1 > 0; feasible due to the good concentration bound for χ²_m.
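The same two bounds in display form, with the Markov-inequality step for P2 made explicit (β is simply the threshold multiplier introduced above):

```latex
\[
P_1 = \Pr\bigl[\mathrm{ProjDist}(o^*) \le \beta r\bigr] = \Psi_m(\beta^2),
\qquad
\Pr\bigl[\mathrm{ProjDist}(o) \le \beta r \mid \mathrm{Dist}(o) > c r\bigr] \le \Psi_m\!\bigl(\beta^2/c^2\bigr),
\]
\[
P_2 = \Pr\bigl[\#\{\text{far points projected before } \beta r\} < T'\bigr]
\;\ge\; 1 - \frac{n\,\Psi_m(\beta^2/c^2)}{T'}
\quad\text{(Markov)},
\qquad \text{choose } \beta \text{ with } P_1 + P_2 - 1 > 0.
\]
```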

Choosing β 36 [Plot: probability (y-axis) of seeing an object at a given projected distance (x-axis, 0 to 100), for an object at distance 1 (blue) and an object at distance 4 (red), with c = 4; the χ²_m mode is m − 2, so the blue curve peaks at 4 and the red curve at 4·c² = 64; the threshold β·r is marked.]

ProjDist(O_T′): Case I 37 Consider cases where both conditions hold (re. the near and far points), which happens with probability ≥ P1 + P2 − 1. [Figure: distances to q in d dims (the NN point within r, fewer than n "far" points beyond c·r) and projected distances to proj(q) in m dims (fewer than T′ "far" points before β·r); ProjDist(O_T′) is marked beyond β·r.] Here O_min = the NN point.

ProjDist(O_T′): Case II 38 Consider cases where both conditions hold (re. the near and far points), which happens with probability ≥ P1 + P2 − 1. [Figure: same setup as Case I, but ProjDist(O_T′) is marked before β·r.] Here O_min = a c-NN point.

Early-termination Condition 39 We omit the proof here. It also relies on the fact that the sum of the m squared projected distances follows a scaled χ²_m distribution. Key to handling the case where c = 1: it returns the NN point with guaranteed probability, which is impossible for LSH-based methods. It also guarantees the correctness of the top-k c-ANN points returned when the algorithm stops due to this condition; no previous method gives such a guarantee.