LOCALITY SENSITIVE HASHING FOR BIG DATA
Wei Wang, University of New South Wales
Outline
- NN and c-ANN
- Existing LSH methods for large data
- Our approach: SRS [VLDB 2015]
- Conclusions
NN and c-ANN Queries
- A set of points, D = {O_1, ..., O_n}, in d-dimensional Euclidean space (d is large, e.g., hundreds)
- Given a query point q, find the closest point, O*, in D
- Relaxed version (c-ANN, aka (1+ε)-ANN): return a point whose distance to q is at most c · dist(O*, q)
- May return a c-ANN point with at least constant probability
- [Figure: query q with radii r and cr; D = {a, b, x}; NN = a, and x is a valid c-ANN answer]
Motivations
- Theoretical interest
  - Fundamental geometric problem: the post-office problem
  - Known to be hard due to high dimensionality
- Many applications
  - Feature vectors: data mining, multimedia databases
  - Applications based on similarity queries, e.g., quantization in coding/compression, recommendation systems, bioinformatics/chemoinformatics
Applications
- SIFT feature matching
- Image retrieval, understanding, scene rendering
- Video retrieval
Challenges
- Curse of dimensionality / concentration of measure
  - Hard to find algorithms sub-linear in n and polynomial in d
  - The approximate version (c-ANN) is not much easier (details on the next slide)
- Large data size
  - 1 KB for a single point with 256 dimensions; 1B points = 1 TB
  - ImageNet (Stanford & Princeton): 14 million images, ~100 feature vectors per image → 1.4B feature vectors
- High dimensionality
  - E.g., finding similar words in D = 198 million tweets [JMLR 2013]
Existing Solutions
- NN: linear scan is (practically) the best approach using linear space and time
  - O(d^5 · log n) query time, O(n^(2d+δ)) space
  - O(d · n^(1−ε(d))) query time, O(dn) space
  - Linear scan: O(dn/B) I/Os, O(dn) space
- (1+ε)-ANN: LSH is the best approach using sub-quadratic space
  - O(log n + 1/ε^((d−1)/2)) query time, O(n · log(1/ε)) space
  - Probabilistic tests remove the exponential dependency on d
  - Fast JLT: O(d · log d + ε^(−3) · log^2 n) query time, O(n^max(2, ε^(−2))) space
  - LSH-based: Õ(d · n^(ρ+o(1))) query time, Õ(n^(1+ρ+o(1)) + nd) space, where ρ = 1/(1+ε) + o_c(1)
Approximate NN for Multimedia Retrieval
- Cover-tree, spill-tree
- Reduction to NN search with Hamming distance
- Dimensionality reduction (e.g., PCA)
- Quantization-based approaches (e.g., CK-Means)
- These are heuristic methods (i.e., no quality guarantee) or rely on some hidden characteristics of the data
Locality Sensitive Hashing (LSH)
- LSH is the best approach using sub-quadratic space
- Equality search
  - Index: store o into bucket h(o)
  - Query: retrieve every o in the bucket h(q); verify whether o = q
- LSH: h is drawn from an LSH family, h: R^d → Z, such that Pr[h(q) = h(o)] decreases with Dist(q, o) (technically, the guarantee is stated with respect to a radius r)
- Near-by points (blue) have more chance of colliding with q than faraway points (red)
- [Figure: query q with radii r and cr; near point a vs. far point b in d-dimensional space]
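The collision property above can be sketched with the standard 2-stable (Euclidean) LSH family h(o) = ⌊(a·o + b)/w⌋. The dimension d, bucket width w, trial count, and sample points below are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 16, 4.0  # illustrative dimension and bucket width

def make_hash(rng, d, w):
    """One Euclidean LSH function: h(o) = floor((a.o + b) / w)."""
    a = rng.standard_normal(d)   # 2-stable (Gaussian) projection vector
    b = rng.uniform(0, w)        # random offset in [0, w)
    return lambda o: int(np.floor((a @ o + b) / w))

q = rng.standard_normal(d)
near = q + 0.1 * rng.standard_normal(d)   # point close to q
far = q + 10.0 * rng.standard_normal(d)   # point far from q

trials = 2000
near_coll = far_coll = 0
for _ in range(trials):
    h = make_hash(rng, d, w)
    near_coll += h(q) == h(near)
    far_coll += h(q) == h(far)

# The near-by point collides with q far more often than the faraway one.
print(near_coll, far_coll)
```

Estimating the two collision frequencies over many independently drawn hash functions makes the "near points collide more" property directly visible.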
LSH: Indexing & Query Processing (Constant Success Probability)
- Indexing (for a fixed r)
  - sig(o) = (h_1(o), h_2(o), ..., h_k(o)); store o into bucket sig(o)
  - Repeat this L times (boosting)
  - Iteratively increase r
- Query processing
  - Search with a fixed r — the (r, c·r)-NN problem: retrieve and verify the points in bucket sig(q); control the false positive size
  - Galloping search to find the first good r — incurs additional cost and gives only a c^2 quality guarantee
- [Figure: query q with radii r and cr in d-dimensional space]
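The composite-signature scheme above (k concatenated hashes per table, L tables) can be sketched as follows; the parameters d, k, L, w and the synthetic data are illustrative assumptions, and a real system would also iterate over increasing radii r:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
d, k, L, w = 16, 4, 8, 4.0  # illustrative parameters

# L composite signatures, each concatenating k Euclidean LSH values
params = [(rng.standard_normal((k, d)), rng.uniform(0, w, size=k))
          for _ in range(L)]
tables = [defaultdict(list) for _ in range(L)]

def sig(A, b, o):
    """Composite signature: the tuple (h_1(o), ..., h_k(o))."""
    return tuple(np.floor((A @ o + b) / w).astype(int))

data = rng.standard_normal((200, d))
for i, o in enumerate(data):
    for (A, b), table in zip(params, tables):
        table[sig(A, b, o)].append(i)

def query(q):
    """Retrieve candidates from the L buckets sig_1(q), ..., sig_L(q)."""
    cand = set()
    for (A, b), table in zip(params, tables):
        cand.update(table.get(sig(A, b, q), []))
    return cand  # caller then verifies candidates by exact distance

q = data[0] + 0.05 * rng.standard_normal(d)
cands = query(q)
print(len(cands))
```

Concatenating k hashes sharpens each bucket (fewer false positives), while repeating over L tables boosts the probability that a true near neighbor is retrieved in at least one of them.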
Locality Sensitive Hashing (LSH)
- Standard LSH reduces c^2-ANN to many R(c^i, c^(i+1))-NN problems
- Superior empirical query performance + theoretical guarantees without assumptions on data distributions
- Main problem: huge super-linear index size, Õ(log(dist_NN) · n^(1+ρ+o(1)))
  - Typically hundreds of times larger than the data size
Dealing with Large Data /1
- Approach 1: a single hash table, but probing more buckets
- Multi-probe [VLDB 07] (Princeton)
  - Probe the ±1, ±2, ... buckets around sig(q)
  - Heuristic method; sensitive to parameter settings
- Entropy LSH [SODA 06] (Stanford)
  - Create multiple perturbed queries q_i from q and probe the buckets sig(q_i)
  - O(n^(2/c)) perturbed queries with O(1) hash tables
Dealing with Large Data /2
- Approach 2: distributed LSH
- Simple distributed LSH
  - Use a (key, value) store to implement (distributed) hash tables; key → machine ID
  - Network traffic: O(n^Θ(1))
- Layered LSH [CIKM 12] (Stanford)
  - Use (h(key), key + value), where h(·) is itself an LSH function
  - Network traffic: O((log n)^0.5) w.h.p.
  - 13 nodes for 64-dim color histograms of the 80 million Tiny Images
Dealing with Large Data /3
- Approach 3: parallel LSH
- PLSH [VLDB 13] (MIT + Intel)
  - Partition data across nodes; broadcast the query and let each node process its data locally
  - Meticulous hardware-oriented optimizations: fit the indexes into main memory at each node; SIMD; cache-friendly bitmaps; 2-level hashing for TLB optimization; other optimizations for streaming data
  - Build predictive performance models for parameter tuning
  - 100 servers (64 GB memory each) for 1B streaming tweets and queries
Dealing with Large Data /4
- Approach 4: LSH on external memory with index size reduction
- LSB-forest [SIGMOD 09, TODS 10]: a different reduction from c^2-ANN to R(c^i, c^(i+1))-NN problems
  - O((dn/B)^0.5) query, O((dn/B)^1.5) space
- C2LSH [SIGMOD 12]: does not use composite hash keys; performs fine-granular counting of the number of collisions in m LSH projections
  - O(n · log(n)/B) query, O(n · log(n)/B) space; small constants in the O(·)
- SRS (ours): O(n/B) query, O(n/B) space
Weakness of External-Memory LSH Methods
- Existing methods use super-linear space
  - Thousands (or more) of hash tables needed if rigorous
  - People resort to hashing into binary codes (and using Hamming distance) for multimedia retrieval
- Can only handle approximation ratios c = x^2, for integer x ≥ 2, to enable reusing the hash tables (merging buckets)
- Valuable information lost (due to quantization)
- Updates? (changes to n and c)

  Dataset       | LSB-forest | C2LSH  | SRS (ours)
  Audio, 40 MB  | 1500 MB    | 127 MB | 2 MB
SRS: Our Proposed Method
- Solves c-ANN queries with O(n) query time and O(n) space, with constant probability
  - The constants hidden in the O(·) are very small
  - The early-termination condition is provably effective
- Advantages: small index, rich functionality, simple
- Central idea: reduce a c-ANN query in d dimensions to a kNN query in m dimensions, plus filtering
  - Model the distribution of m 2-stable random projections
2-stable Random Projection
- Let D be the 2-stable distribution = the standard normal distribution N(0, 1)
- For two i.i.d. random variables A ~ D, B ~ D: x·A + y·B ~ (x^2 + y^2)^(1/2) · D
- [Figure: vector v and random projection vectors r_1, r_2; ⟨v, r_1⟩ ~ ǁvǁ · N(0, 1)]
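The 2-stability property above can be checked numerically: with A, B standard normal, x·A + y·B should have standard deviation (x² + y²)^(1/2). The values x = 3, y = 4 and the sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = 3.0, 4.0
n = 200_000

A = rng.standard_normal(n)  # A ~ N(0, 1)
B = rng.standard_normal(n)  # B ~ N(0, 1), independent of A
z = x * A + y * B

# 2-stability: x*A + y*B ~ sqrt(x^2 + y^2) * N(0, 1), so the empirical
# standard deviation should be close to sqrt(9 + 16) = 5.
print(z.std())
```

This is exactly why a Gaussian projection of a vector v yields a value distributed as ǁvǁ · N(0, 1): each coordinate contributes its share to the combined scale (Σ v_i²)^(1/2).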
Dist(O), ProjDist(O), and Their Relationship
- Let v = O − Q, so Dist(O) = ǁvǁ; O in d dimensions is mapped by m 2-stable random projections r_1, ..., r_m to (z_1, ..., z_m) in m dimensions, where z_i = ⟨v, r_i⟩ ~ ǁvǁ · N(0, 1)
- Think of the z_i as samples
- ProjDist^2(O) = z_1^2 + ... + z_m^2 ~ ǁvǁ^2 · χ^2_m, i.e., a scaled chi-squared distribution with m degrees of freedom
- Ψ_m(x): cdf of the standard χ^2_m distribution
- [Figure: O and Q in d dimensions; Proj(O) and Proj(Q) in m dimensions; Dist(O) vs. ProjDist(O)]
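The scaled chi-squared relationship above can be verified empirically: the ratio ProjDist²(O)/Dist²(O) should follow χ²_m, whose mean is m. The dimension d, m, and trial count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, trials = 64, 6, 20_000

v = rng.standard_normal(d)   # v = O - Q, so ||v|| = Dist(O)
dist2 = v @ v                # Dist^2(O)

ratios = []
for _ in range(trials):
    R = rng.standard_normal((m, d))  # m independent 2-stable projections
    z = R @ v                        # projected coordinates z_1, ..., z_m
    ratios.append((z @ z) / dist2)   # ProjDist^2(O) / Dist^2(O)

# ProjDist^2 / Dist^2 ~ chi-squared with m degrees of freedom; E[chi2_m] = m.
print(np.mean(ratios))
```

Because this ratio's distribution is fully known (it does not depend on d or on the data), SRS can compute exact probabilities about projected distances via the cdf Ψ_m.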
LSH-like Property
- Intuitive idea: if Dist(O_1) ≪ Dist(O_2), then ProjDist(O_1) < ProjDist(O_2) with high probability
- But the converse is NOT true: with few projections, the NN object in the projected space is most likely not the NN object in the original space, as many far-away objects are projected before the NN/c-NN objects
- However, we can bound the expected number of such cases (say, by T)
- Solution: perform incremental k-NN search in the projected space until T objects have been accessed, plus an early-termination test
Indexing
- Find the minimum m
  - Input: n, c, T (max # of points to access by the algorithm)
  - Output: m (# of 2-stable random projections), T' (a better bound on T)
  - m = O(n/T); we use T = O(n), so m = O(1), to achieve a linear-size index
- Generate m 2-stable random projections → n projected points in an m-dimensional space
- Index these projections using any index that supports incremental kNN search, e.g., an R-tree
- Space cost: O(m · n) = O(n)
SRS-12(T, c, p_τ)
- Compute proj(q)
- Do incremental kNN search from proj(q); for k = 1 to T:  // stopping condition 1
  - Compute Dist(O_k); maintain O_min = argmin_{1≤i≤k} Dist(O_i)
  - If the early-termination test (c, p_τ) = TRUE, BREAK  // stopping condition 2
- Return O_min
- Early-termination test: Ψ_m(c^2 · ProjDist^2(O_k) / Dist^2(O_min)) > p_τ
- Main theorem: SRS-12 returns a c-NN point with probability p_τ − f(m, c), with O(n) I/O cost
- Example: c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, p_τ = 0.18 → index = 0.0059n, query cost = 0.0084n, success probability = 0.13
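The query algorithm above can be sketched as follows, using a full sort over projected distances in place of an R-tree's incremental kNN search. The dataset, n, d, and the cap T are illustrative assumptions; m, c, p_τ follow the example on the slide:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n, d, m = 2000, 64, 6
c, p_tau = 4.0, 0.18
T = 54  # illustrative cap on the number of points accessed

data = rng.standard_normal((n, d))
R = rng.standard_normal((m, d))   # the m 2-stable random projections
proj = data @ R.T                 # the n points in the m-dim projected space

def srs_query(q):
    """Sketch of SRS: examine points in order of projected distance,
    stopping after T points or when the early-termination test fires."""
    pq = q @ R.T
    # A real implementation uses incremental kNN on an R-tree here.
    order = np.argsort(np.linalg.norm(proj - pq, axis=1))
    o_min, best = None, np.inf
    for i in order[:T]:
        dist = np.linalg.norm(data[i] - q)
        if dist < best:
            o_min, best = i, dist
        proj_dist2 = np.sum((proj[i] - pq) ** 2)
        # Early termination: Psi_m(c^2 * ProjDist^2(O_k) / Dist^2(O_min)) > p_tau
        if chi2.cdf(c * c * proj_dist2 / best ** 2, df=m) > p_tau:
            break
    return o_min, best

q = rng.standard_normal(d)
approx_i, approx_dist = srs_query(q)
true_dist = np.linalg.norm(data - q, axis=1).min()
print(f"returned distance is {approx_dist / true_dist:.2f}x the true NN distance")
```

The test uses the cdf Ψ_m of χ²_m (scipy's `chi2.cdf`): once the current projected distance is so large that, even scaled by c², a genuinely closer point would almost surely have been projected earlier, the search can stop.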
Variations of SRS-12(T, c, p_τ)
[Algorithm as on the previous slide: incremental kNN search from proj(q), maintaining O_min, with the two stopping conditions]
1. SRS-1: better quality; query cost is O(T)
2. SRS-2: best quality; query cost bounded by O(n); handles c = 1
3. SRS-12+(T, c, p_τ): better quality; query cost bounded by O(T)
All with success probability at least p_τ
Other Results
- Uniformly optimal method for L_2
- Can be easily extended to support top-k c-ANN queries (k > 1)
  - No previously known guarantee on the correctness of returned results
  - We guarantee correctness with probability at least p_τ if the algorithm stops due to the early-termination condition
  - 100% in practice (97% in theory)
Experiment Setup
- Algorithms
  - LSB-forest [SIGMOD 09, TODS 10]
  - C2LSH [SIGMOD 12]
  - SRS-* [VLDB 15]
- Data: see next slide
- Measures: index size, query cost, result quality, success probability
Datasets
[Table of datasets; only the sizes survive extraction — 122 GB, 12.5 PB, 820 GB, 37 GB]
Tiny Images Dataset (8M points, 384 dims)
- Fastest: SRS; slowest: C2LSH — quality the other way around
- SRS has comparable quality to C2LSH yet much lower cost
- SRS-* dominates LSB-forest
Approximate Nearest Neighbor
- Empirically better than the theoretical guarantee
  - With 15% of the I/Os of a linear scan, returns the NN with probability 71%
  - With 62% of the I/Os of a linear scan, returns the NN with probability 99.7%
Large Dataset (1 Billion Points, 128 dims)
- Works on a PC (4 GB memory, 500 GB hard disk)
Theoretical Guarantees Are Important
- [Figure: success probability in a hard setting — 0.78 vs. 0.29]
Summary
- Index size is the bottleneck for traditional LSH-based methods
  - Limits the data size one node can support
  - Needs many nodes with large memory to scale to billions of data points
- SRS reduces c-ANN queries in arbitrarily high-dimensional space to kNN queries in a low-dimensional space
  - Our index size is approximately m/d of the size of the data file
  - Can be easily generalized to a parallel implementation
- Ongoing work
  - Scaling to even larger datasets using data partitioning
  - Accommodating rapid updates
  - Other theoretical work on LSH
Q&A
Similarity Query Processing Project Homepage: http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
Backup Slides: Analysis
Stopping Condition
- Near point: the NN point, whose distance is r; far points: points whose distance is > c·r
- Then for any β > 0 and any o:
  - Pr[ProjDist(o) ≤ β·r | o is a near point] ≥ Ψ_m(β^2)
  - Pr[ProjDist(o) ≤ β·r | o is a far point] ≤ Ψ_m(β^2 / c^2)
  - Both because ProjDist^2(o) / Dist^2(o) ~ χ^2_m
- P1 = Pr[the NN point is projected before β·r] ≥ Ψ_m(β^2)
- P2 = Pr[# of far points projected before β·r < T] > 1 − Ψ_m(β^2 / c^2) · (n/T)
- Choose β such that P1 + P2 − 1 > 0
  - Feasible due to the good concentration bound for χ^2_m
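The feasibility claim above can be checked numerically using the χ²_m cdf (scipy's `chi2.cdf`): scan candidate thresholds β and verify that some β gives P1 + P2 − 1 > 0. The values m = 6, c = 4, and T ≈ 0.00242n echo the example in the main theorem; n and the scan grid are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2

m, c = 6, 4.0
n, T = 1_000_000, 2420  # illustrative n, with T = 0.00242 * n

best_beta, best_prob = None, -np.inf
for beta in np.linspace(0.5, 5.0, 200):
    # P1: the NN point (distance r) is projected before beta*r
    p1 = chi2.cdf(beta ** 2, df=m)
    # P2: fewer than T far points (distance > c*r) projected before beta*r,
    # by Markov's inequality on their expected count
    p2 = 1 - chi2.cdf(beta ** 2 / c ** 2, df=m) * (n / T)
    if p1 + p2 - 1 > best_prob:
        best_prob, best_beta = p1 + p2 - 1, beta

print(best_beta, best_prob)
```

The scan finds a β with a strictly positive lower bound on the success probability, consistent with the slide's claim that such a β always exists thanks to the concentration of χ²_m.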
Choosing β
[Figure: probability density of seeing an object at projected distance x, for an object at distance 1 (blue) and at distance 4 (red); c = 4; mode = m − 2; the two scales 4 and 4·c^2 = 64 bracket the choice of β·r]
ProjDist(O_T): Case I
- Consider cases where both conditions hold (re. near and far points); this happens with probability ≥ P1 + P2 − 1
- [Figure: in d dims, the NN point is within r and far points are beyond c·r; in m dims, fewer than T far points are projected before β·r, and β·r ≤ ProjDist(O_T)]
- Since the NN point is projected before β·r, it is among the first T points accessed → O_min = the NN point
ProjDist(O_T): Case II
- Consider cases where both conditions hold (re. near and far points); this happens with probability ≥ P1 + P2 − 1
- [Figure: same setting as Case I, but ProjDist(O_T) < β·r]
- Since fewer than T far points are projected before β·r, at least one of the T accessed points is not far, i.e., within c·r → O_min = a c-NN point
Early-termination Condition
- Proof omitted here; it also relies on the fact that the sum of the m squared projected distances follows a scaled χ^2_m distribution
- Key to handling the case where c = 1
  - Returns the NN point with guaranteed probability — impossible for LSH-based methods
- Guarantees the correctness of the top-k c-ANN points returned when stopped by this condition
  - No such guarantee by any previous method