Similarity Search in High Dimensions II. Piotr Indyk MIT

Similar documents
Lecture 14 October 16, 2014

Locality Sensitive Hashing

Lecture 17 03/21, 2017

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Beyond Locality-Sensitive Hashing

4 Locality-sensitive hashing using stable distributions

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

Approximate Nearest Neighbor (ANN) Search in High Dimensions

Geometry of Similarity Search

Proximity problems in high dimensions

Approximate Voronoi Diagrams

Algorithms for Data Science: Lecture on Finding Similar Items

Locality-sensitive Hashing without False Negatives

Succinct Data Structures for Approximating Convex Functions with Applications

Bloom Filters and Locality-Sensitive Hashing

A Comparison of Extended Fingerprint Hashing and Locality Sensitive Hashing for Binary Audio Fingerprints

Algorithms for Nearest Neighbors

Hash-based Indexing: Application, Impact, and Realization Alternatives

Heavy Hitters. Piotr Indyk MIT. Lecture 4

Polynomial Representations of Threshold Functions and Algorithmic Applications. Joint with Josh Alman (Stanford) and Timothy M.

arxiv:cs/ v1 [cs.cg] 7 Feb 2006

Lecture 2. Frequency problems

Lower bounds on Locality Sensitive Hashing

Approximate Nearest Neighbor Problem in High Dimensions. Alexandr Andoni

Optimal Data-Dependent Hashing for Approximate Near Neighbors

arxiv: v2 [cs.ds] 21 May 2017

16 Embeddings of the Euclidean metric

Search Algorithms. Analysis of Algorithms. John Reif, Ph.D. Prepared by

LOCALITY SENSITIVE HASHING FOR BIG DATA

Streaming Algorithms for Set Cover. Piotr Indyk With: Sepideh Mahabadi, Ali Vakilian

Fast Dimension Reduction

arxiv: v3 [cs.ds] 7 Jan 2016

Data Structures and Algorithms

A Fast and Simple Algorithm for Computing Approximate Euclidean Minimum Spanning Trees

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Geometric Problems in Moderate Dimensions

14.1 Finding frequent elements in stream

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Approximate Pattern Matching and the Query Complexity of Edit Distance

On Approximating the Depth and Related Problems

Data structures Exercise 1 solution. Question 1. Let s start by writing all the functions in big O notation:

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

Bloom Filters, Minhashes, and Other Random Stuff

arxiv: v2 [cs.ds] 10 Sep 2016

Making Nearest Neighbors Easier. Restrictions on Input Algorithms for Nearest Neighbor Search: Lecture 4. Outline. Chapter XI

An Algorithmist s Toolkit Nov. 10, Lecture 17

Distribution-specific analysis of nearest neighbor search and classification

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Algorithms for Streaming Data

APPROXIMATION ALGORITHMS FOR PROXIMITY AND CLUSTERING PROBLEMS YOGISH SABHARWAL

Trusting the Cloud with Practical Interactive Proofs

Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search

ITEC2620 Introduction to Data Structures

Introduction to Algorithms March 10, 2010 Massachusetts Institute of Technology Spring 2010 Professors Piotr Indyk and David Karger Quiz 1

Computational Complexity - Pseudocode and Recursions

Approximate Clustering via Core-Sets

Optimal Lower Bounds for Locality Sensitive Hashing (except when q is tiny)

Streaming and communication complexity of Hamming distance

arxiv: v2 [cs.ds] 17 Apr 2018

CSCI Honor seminar in algorithms Homework 2 Solution

Cell-Probe Lower Bounds for the Partial Match Problem

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

An efficient approximation for point-set diameter in higher dimensions

Nearest Neighbor Searching Under Uncertainty. Wuzhou Zhang Supervised by Pankaj K. Agarwal Department of Computer Science Duke University

Declaring Independence via the Sketching of Sketches. Until August Hire Me!

A New Algorithm for Finding Closest Pair of Vectors

Non-Uniform Graph Partitioning

Linear Spectral Hashing

Set Similarity Search Beyond MinHash

Sorting algorithms. Sorting algorithms

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

arxiv: v1 [cs.cg] 5 Sep 2018

Transforming Hierarchical Trees on Metric Spaces

(1 + ε)-approximation for Facility Location in Data Streams

Space-Time Tradeoffs for Approximate Nearest Neighbor Searching

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Spectral Hashing: Learning to Leverage 80 Million Images

Similarity searching, or how to find your neighbors efficiently

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Chapter 11. Min Cut Min Cut Problem Definition Some Definitions. By Sariel Har-Peled, December 10, Version: 1.

Optimal Las Vegas Locality Sensitive Data Structures

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

COMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform

PCP Theorem and Hardness of Approximation

Lecture 4: Codes based on Concatenation

Optimal compression of approximate Euclidean distances

R ij = 2. Using all of these facts together, you can solve problem number 9.

CS151 Complexity Theory. Lecture 15 May 22, 2017

Chapter 5. Divide and Conquer CLRS 4.3. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Randomized Algorithms III Min Cut

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

2. Exact String Matching

Faster Algorithms for Sparse Fourier Transform

Introduction to Quantum Information Processing CS 467 / CS 667 Phys 667 / Phys 767 C&O 481 / C&O 681

Data-Dependent Hashing via Nonlinear Spectral Gaps

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

COMS E F15. Lecture 17: Sublinear-time algorithms

Randomized Sorting Algorithms Quick sort can be converted to a randomized algorithm by picking the pivot element randomly. In this case we can show th

Transcription:

Similarity Search in High Dimensions II Piotr Indyk MIT

Approximate Near(est) Neighbor
- c-approximate Nearest Neighbor: build a data structure which, for any query q, returns p' ∈ P with ‖p'-q‖ ≤ cr, where r is the distance from q to its nearest neighbor.
- c-approximate r-near Neighbor: build a data structure which, for any query q: if there is a point p ∈ P with ‖p-q‖ ≤ r, it returns p' ∈ P with ‖p'-q‖ ≤ cr.
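As a point of reference for these definitions, here is a minimal brute-force sketch of the c-approximate r-near neighbor guarantee (Euclidean distance assumed; the function name is illustrative). A real data structure only has to match this input/output behavior, not this running time.

```python
import math

def approx_r_near_neighbor(P, q, r, c):
    """Brute-force stand-in for a c-approximate r-near neighbor structure.

    Guarantee: if some p in P has dist(p, q) <= r, return some p' in P
    with dist(p', q) <= c * r.  Returning None is only allowed when no
    point of P lies within distance r of q.
    """
    best = min(P, key=lambda p: math.dist(p, q))
    if math.dist(best, q) <= c * r:
        return best
    return None  # then in particular no point is within r of q
```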

Algorithms for c-approximate Near Neighbor

Space            | Time                 | Comment             | Norm     | Ref
dn + n^O(1/ε²)   | d·log n / ε² (or 1)  | c = 1+ε             | Hamm, l₂ | [KOR'98, IM'98]
n^Ω(1/ε²)        | O(1)                 |                     |          | [AIP'06]
dn + n^(1+ρ(c))  | dn^ρ(c)              | ρ(c) = 1/c          | Hamm, l₂ | [IM'98], [GIM'98], [Cha'02]
                 |                      | ρ(c) < 1/c          | l₂       | [DIIM'04]
dn·log s         | dn^σ(c)              | σ(c) = O(log c / c) | Hamm, l₂ | [Ind'01]
dn + n^(1+ρ(c))  | dn^ρ(c)              | ρ(c) = 1/c² + o(1)  | l₂       | [AI'06]
                 |                      | σ(c) = O(1/c)       | l₂       | [Pan'06]

Reductions
- c(1+γ)-approximate Nearest Neighbor reduces to c-approximate Near Neighbor.
- Easy:
  - Space multiplied by (log Δ)/γ, where Δ is the spread, i.e., all distances in P are in [1, Δ]
  - Query time multiplied by log((log Δ)/γ)
  - Probability of failure multiplied by (log Δ)/γ
  - Idea: build data structures with r = 1/2, (1/2)(1+γ), (1/2)(1+γ)², ..., O(Δ); to answer a query, do binary search over the values of r (see the sketch below)
- Hard: replace log Δ by log n
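A minimal sketch of the "easy" reduction, assuming a black-box near-neighbor structure (here a hypothetical brute-force stand-in, Euclidean distance) and ignoring the failure-probability bookkeeping: build one structure per radius in the geometric sequence and binary-search over them at query time.

```python
import math

class BruteForceNear:
    """Hypothetical stand-in for a c-approximate r-near neighbor structure."""
    def __init__(self, points, r, c):
        self.points, self.r, self.c = points, r, c

    def query(self, q):
        # Some point within c*r of q, or None (allowed only if nothing is within r).
        best = min(self.points, key=lambda p: math.dist(p, q))
        return best if math.dist(best, q) <= self.c * self.r else None

def build_radii(delta, gamma):
    """Radii 1/2, (1/2)(1+gamma), (1/2)(1+gamma)^2, ..., up to O(delta)."""
    radii, r = [], 0.5
    while r <= 2 * delta:
        radii.append(r)
        r *= 1 + gamma
    return radii

def approx_nearest(structures, q):
    """Binary search over the radii: smallest structure that answers non-None."""
    lo, hi, answer = 0, len(structures) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = structures[mid].query(q)
        if p is not None:
            answer, hi = p, mid - 1
        else:
            lo = mid + 1
    return answer

# Usage: structures = [BruteForceNear(P, r, c) for r in build_radii(delta, gamma)]
#        approx_nearest(structures, q)
```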

General reduction [Har-Peled-Indyk-Motwani 11]
Assume we have a data structure for dynamic c-approximate Near Neighbor under a metric D which, for parameters n, c, f, has:
- T(n,c,f) construction time
- S(n,c,f) space bound
- Q(n,c,f) query time
- U(n,c,f) update time
Then we get a data structure for c(1+O(γ))-approximate Nearest Neighbor with:
- O(T(n,c,f)/γ · log²n + n log n · [Q(n,c,f) + U(n,c,f)]) expected construction time
- O(S(n,c,f)/γ · log²n) space bound
- Q(n,c,f) · O(log n) query time
- O(f · log n) failure probability
Generalizes, improves, simplifies and merges [Indyk-Motwani 98] and [Har-Peled 01].

Intro
- Basic idea: use different scales (i.e., radii r) for different clouds of points
- At most n² scales in total
- Would like (log n)² per point, on average
- We will see a simplified reduction, from approximate nearest neighbor to exact near neighbor, under a simplifying assumption
- The actual reduction is a little more complex, but follows the same idea

Example

Notation
- UB_P(r) = ∪_{p ∈ P} B(p, r)
- CC_P(r) is the partitioning of P induced by the connected components of UB_P(r)
- r_med is the smallest value of r such that UB_P(r) has a component of size at least n/2 + 1
- UB_med = UB_P(r_med), CC_med = CC_P(r_med)
- Simplifying assumption: UB_P(r_med) has a component of size exactly n/2 + 1
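A minimal brute-force sketch of these definitions (Euclidean distance assumed, all names illustrative): two balls B(p, r) and B(p', r) intersect iff dist(p, p') ≤ 2r, so the components of UB_P(r) are the connected components of that intersection graph, and the component structure can only change at half a pairwise distance.

```python
import itertools, math

def components(P, r):
    """CC_P(r): partition of P induced by the connected components of UB_P(r)."""
    comps = []
    for p in P:
        hits = [C for C in comps if any(math.dist(p, x) <= 2 * r for x in C)]
        merged = [p] + [x for C in hits for x in C]
        comps = [C for C in comps if C not in hits] + [merged]
    return comps

def median_radius(P):
    """r_med: smallest r such that UB_P(r) has a component of size >= n/2 + 1."""
    n = len(P)
    candidates = sorted(math.dist(p, q) / 2 for p, q in itertools.combinations(P, 2))
    return next(r for r in candidates
                if max(len(C) for C in components(P, r)) >= n // 2 + 1)
```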

A simplified reduction
- Set r_top = Θ(n · r_med · log(n)/γ)
- Exact near neighbor data structures NN: for i = 0, ..., log_{1+γ}(2·r_top/r_med), create NN(P, r_med·(1+γ)^i / 2)
  (inner radius = r_med/2, outer radius = r_top)
- For each component C ∈ CC_med, recurse on C
- Recurse on P' ⊆ P that contains one point per component C ∈ CC_med (at most n/2 points)
- Note that the recursion has depth O(log n)
(A combined construction-and-search sketch follows after the Search slide below.)

Search
1. Use NN(P, r_med/2) to check whether D(q, P) < r_med/2. If yes, recurse on the component C containing q.
2. Else use NN(P, r_top) to check whether D(q, P) > r_top. If yes, recurse on P'.
3. Else perform binary search on the structures NN(P, r_med·(1+γ)^i/2).
Correctness for Cases 1 and 3 follows from the procedure. Case 2 needs a little work:
- Each contraction (replacing a component by a single representative) introduces an additive error of up to n·r_med
- But the distance to the nearest neighbor is lower-bounded by r_top = Θ(n·r_med·log(n)/γ)
- So the accumulated relative error is at most (1 + n·r_med/r_top)^O(log n) = (1 + γ/log n)^O(log n) = 1 + O(γ)
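Putting the last two slides together, here is a minimal brute-force sketch of the simplified reduction (construction plus search), assuming Euclidean distance. Exhaustive scans stand in for the exact NN(P, r) structures, the components/median_radius helpers repeat the ones from the Notation sketch so the block stays self-contained, the class and function names are illustrative, and a guard handles inputs that violate the "size exactly n/2 + 1" simplifying assumption.

```python
import itertools, math

def components(P, r):
    """Connected components of UB_P(r): balls B(p, r), B(p', r) meet iff dist <= 2r."""
    comps = []
    for p in P:
        hits = [C for C in comps if any(math.dist(p, x) <= 2 * r for x in C)]
        comps = [C for C in comps if C not in hits] + [[p] + [x for C in hits for x in C]]
    return comps

def median_radius(P):
    """Smallest r (half a pairwise distance) giving a component of size >= n/2 + 1."""
    cands = sorted(math.dist(p, q) / 2 for p, q in itertools.combinations(P, 2))
    return next(r for r in cands
                if max(len(C) for C in components(P, r)) >= len(P) // 2 + 1)

class ANNNode:
    """One level of the recursion from the 'simplified reduction' slide."""
    def __init__(self, P, gamma):
        self.P = list(P)
        n = len(self.P)
        self.brute = n <= 1
        if self.brute:
            return
        self.r_med = median_radius(self.P)
        self.r_top = n * self.r_med * math.log(n) / gamma   # Theta(n r_med log(n)/gamma)
        self.comps = components(self.P, self.r_med)
        if len(self.comps) == 1:
            # Input violates the simplifying assumption; fall back to brute force here.
            self.brute = True
            return
        self.children = [ANNNode(C, gamma) for C in self.comps]
        reps = [C[0] for C in self.comps]        # P': one representative per component
        self.rep_child = ANNNode(reps, gamma)

    def query(self, q):
        """Approximate nearest neighbor of q among self.P."""
        if self.brute:
            return min(self.P, key=lambda p: math.dist(p, q)) if self.P else None
        # Brute force plays the role of the exact near-neighbor structures NN(P, r).
        nearest = min(self.P, key=lambda p: math.dist(p, q))
        d = math.dist(nearest, q)
        if d < self.r_med / 2:                   # Case 1: stay inside q's component
            for C, child in zip(self.comps, self.children):
                if nearest in C:
                    return child.query(q)
        if d > self.r_top:                       # Case 2: contract components to reps
            return self.rep_child.query(q)
        return nearest                           # Case 3: the binary search over the
                                                 # NN(P, r) structures collapses to the
                                                 # exact answer in this brute-force sketch

# Usage: root = ANNNode(points, gamma=0.1); root.query(q)   (points as tuples)
```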

Space
- Let B(n) be the maximum number of points stored by the data structure
- Space = O(B(n) · log_{1+γ}(r_top/r_med))
- We have B(n) = max_{k; n_1+n_2+...+n_k = n} [ Σ_i B(n_i) + B(k) + n ], subject to k ≤ n/2 and 1 ≤ n_i ≤ n/2
- This solves to B(n) = O(n log n)

Construction time
- Estimating r_med (see the sketch below):
  - Select a point p uniformly at random from P
  - Return r* = median of the set {D(p, p') : p' ∈ P}
  - We have r_med ≤ r* ≤ (n-1)·r_med with probability > 1/2
- Approximating CC_P(r*): n queries and updates to NN with radius r*
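A minimal sketch of the r_med estimate from this slide (Euclidean distance assumed; the function name is illustrative): pick a random point of P and return the median of its distances to all of P.

```python
import math, random, statistics

def estimate_r_med(P):
    """Return r* = median distance from a uniformly random p in P to the points of P.

    Per the slide, r_med <= r* <= (n-1) * r_med holds with probability > 1/2.
    """
    p = random.choice(P)
    return statistics.median(math.dist(p, q) for q in P)
```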

Algorithms for c-approximate Near Neighbor

Space            | Time                 | Comment             | Norm     | Ref
dn + n^O(1/ε²)   | d·log n / ε² (or 1)  | c = 1+ε             | Hamm, l₂ | [KOR'98, IM'98]
n^Ω(1/ε²)        | O(1)                 |                     |          | [AIP'06]
dn + n^(1+ρ(c))  | dn^ρ(c)              | ρ(c) = 1/c          | Hamm, l₂ | [IM'98], [GIM'98], [Cha'02]
                 |                      | ρ(c) < 1/c          | l₂       | [DIIM'04]
dn·log s         | dn^σ(c)              | σ(c) = O(log c / c) | Hamm, l₂ | [Ind'01]
dn + n^(1+ρ(c))  | dn^ρ(c)              | ρ(c) = 1/c² + o(1)  | l₂       | [AI'06]
                 |                      | σ(c) = O(1/c)       | l₂       | [Pan'06]

Locality-Sensitive Hashing [Indyk-Motwani 98]
- Idea: construct hash functions g: R^d → U such that for any points p, q:
  - If ‖p-q‖ ≤ r, then Pr[g(p)=g(q)] is "not-so-small" (rather than high)
  - If ‖p-q‖ > cr, then Pr[g(p)=g(q)] is small
- Then we can solve the problem by hashing
- Related work: [Paturi-Rajasekaran-Reif 95, Greene-Parnas-Yao 94, Karp-Waarts-Zweig 95, Califano-Rigoutsos 93, Broder 97]

LSH
- A family H of functions h: R^d → U is called (P1, P2, r, cr)-sensitive if for any p, q:
  - if ‖p-q‖ < r then Pr[h(p) = h(q)] > P1
  - if ‖p-q‖ > cr then Pr[h(p) = h(q)] < P2
- Example: Hamming distance
  - KOR'98: h(p) = Σ_{i ∈ S} p_i·u_i mod 2
  - IM'98: h(p) = p_i, i.e., the i-th bit of p
- Probabilities (for the bit-sampling functions): Pr[h(p) = h(q)] = 1 - H(p,q)/d
- Example: p = 10010010, q = 11010110
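A minimal sketch of the IM'98 bit-sampling family for Hamming distance (the helper names are illustrative): h(p) returns one fixed, randomly chosen coordinate of p, and over the random choice of that coordinate Pr[h(p) = h(q)] = 1 - H(p,q)/d.

```python
import random

def make_bit_sampling_hash(d):
    """IM'98-style LSH for Hamming distance: h(p) = p[i] for one random fixed i."""
    i = random.randrange(d)
    return lambda p: p[i]

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

# Empirical check of Pr[h(p) = h(q)] = 1 - H(p,q)/d on the slide's example strings.
p, q = "10010010", "11010110"
d = len(p)
trials = 100_000
collisions = 0
for _ in range(trials):
    h = make_bit_sampling_hash(d)   # the same h is applied to both points
    collisions += h(p) == h(q)
print(collisions / trials, 1 - hamming(p, q) / d)   # both approximately 0.75
```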

Algorithm
- We use functions of the form g(p) = <h_1(p), h_2(p), ..., h_k(p)>
- Preprocessing:
  - Select g_1, ..., g_L
  - For all p ∈ P, hash p to buckets g_1(p), ..., g_L(p)
- Query:
  - Retrieve the points from buckets g_1(q), g_2(q), ..., until either the points from all L buckets have been retrieved, or the total number of points retrieved exceeds 3L
  - Answer the query based on the retrieved points
- Total time: O(dL)
(A sketch of this scheme follows below.)
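A minimal sketch of this scheme for Hamming distance over bit strings, reusing bit-sampling h's as in the previous sketch; the parameters k and L and the 3L cutoff come from the slide, while the class and method names are illustrative.

```python
import random
from collections import defaultdict

class HammingLSH:
    """L tables; table i is keyed by g_i(p) = (p[j_1], ..., p[j_k]) for random coordinates j."""
    def __init__(self, d, k, L):
        self.gs = [[random.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, g, p):
        return tuple(p[j] for j in g)

    def insert(self, p):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].append(p)

    def query(self, q, r, c):
        """Return some point within Hamming distance c*r of q, or None.

        Scans buckets g_1(q), g_2(q), ... and gives up after retrieving 3L points.
        """
        L = len(self.gs)
        retrieved = 0
        for g, table in zip(self.gs, self.tables):
            for p in table[self._key(g, q)]:
                if sum(a != b for a, b in zip(p, q)) <= c * r:
                    return p
                retrieved += 1
                if retrieved > 3 * L:
                    return None
        return None

# Usage: index = HammingLSH(d=8, k=3, L=10)
#        for p in points: index.insert(p)
#        index.query("11010110", r=1, c=2)
```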

Analysis [IM 98, Gionis-Indyk-Motwani 99]
- Lemma 1: the algorithm solves c-approximate NN with:
  - Number of hash functions: L = n^ρ, where ρ = log(1/P1)/log(1/P2)
  - Constant success probability per query q
- Lemma 2: for the Hamming LSH functions, we have ρ = 1/c
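A quick numeric check of Lemma 2, assuming the bit-sampling probabilities P1 = 1 - r/d and P2 = 1 - cr/d for the family on the LSH slide; the concrete values of d, r, c, n are just for illustration.

```python
import math

d, r, c, n = 1000, 10, 2, 10**6
P1 = 1 - r / d               # collision probability at distance r
P2 = 1 - c * r / d           # collision probability at distance c*r
rho = math.log(1 / P1) / math.log(1 / P2)
L = n ** rho                 # number of hash functions from Lemma 1
print(rho, 1 / c, round(L))  # rho is about 0.4975, i.e. close to (and at most) 1/c = 0.5
```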

Proof of Lemma 1 by picture
- Points in {0,1}^d with d = 10; distance ranges from 0 to d = 10
- The figure plots the collision probability as a function of the distance, for k = 1..3 and L = 1..3 (recall: L = #indices g, k = #h's concatenated in each g)
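The curves on that slide are, for the bit-sampling family, the probability that p and q share a bucket in at least one of the L tables, i.e. 1 - (1 - (1 - dist/d)^k)^L; this formula is an assumption about what the (missing) figure showed, reconstructed from the standard LSH analysis. A small sketch printing those values instead of plotting them:

```python
d = 10

def collision_prob(dist, k, L):
    """Prob. that g_i(p) = g_i(q) for at least one of L tables, with g made of
    k independent bit-sampling h's and Hamming distance H(p, q) = dist."""
    p_single = 1 - dist / d            # Pr[h(p) = h(q)] for a single sampled bit
    return 1 - (1 - p_single ** k) ** L

for k in (1, 2, 3):
    for L in (1, 2, 3):
        row = [round(collision_prob(dist, k, L), 2) for dist in range(d + 1)]
        print(f"k={k}, L={L}: {row}")
```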