Similarity searching, or how to find your neighbors efficiently


Similarity searching, or how to find your neighbors efficiently
Robert Krauthgamer, Weizmann Institute of Science
CS Research Day for Prospective Students, May 1, 2009

Background

Geometric spaces and techniques are useful in tackling computational problems, and arise in diverse application areas, e.g. data analysis, machine learning, networking, and combinatorial optimization.

Definition: A metric space is a set of points M endowed with a distance function d_M(·,·). Points can model various data objects, e.g. documents, images, biological sequences (or network nodes); distances can model (dis)similarity between objects (or latency). Common examples: Euclidean and L_p norms, hyperbolic distance, Hamming distance, edit distance, earthmover distance. Metrics arise also in linear and semidefinite programming (LP, SDP) relaxations.

Similarity Searching [Basic Problem]

Nearest Neighbor Search (NNS): preprocess a dataset S of n points; given a query point q, quickly find the closest a in S, i.e. argmin_{a in S} d_M(a,q). The naive solution uses no preprocessing and answers queries in time O(n); the ultimate goal (holy grail) is preprocessing O(n) and query time O(log n). This is a key problem with many applications and has been studied extensively; it is difficult in high dimension (the curse of dimensionality). Algorithms for (1+ε)-approximation in L_1 and in L_2 achieve either query time O(log n) with preprocessing n^{O(1/ε^2)}, or query time O(n^{1/(1+ε)}) with preprocessing O(n^{1+1/(1+ε)}) [Indyk-Motwani 98, Kushilevitz-Ostrovsky-Rabani 98].
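
As a point of reference for the bounds above, here is a minimal linear-scan NNS in Python (no preprocessing, O(n) time per query); the function names and the choice of Euclidean distance are illustrative additions, not part of the talk:

    import math

    def euclidean(p, q):
        # d_M(p, q) for points given as coordinate tuples
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def nearest_neighbor(S, q, dist=euclidean):
        # Naive NNS: no preprocessing, scan all n points per query.
        return min(S, key=lambda a: dist(a, q))

    S = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
    print(nearest_neighbor(S, (2.5, 1.5)))   # -> (3.0, 1.0)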

NNS in General Metrics

Black-box model: access to pairwise distances only. Suppose M is a uniform metric, i.e. d(x,y) = 1 for all distinct x,y in M; then for some queries the time is Ω(n), even for approximate NNS. This depicts the difficulty of NNS for high-dimensional data, and such data sets exist already in R^{log n}. Is this the only obstacle to efficient NNS? What about data that looks low-dimensional?

A Metric Notion of Dimension

Definition: The ball B(x,r) consists of all points within distance r > 0 from x in M. The dimension of M, denoted dim(M), is the minimum k such that every ball can be covered by 2^k balls of half the radius (in the slide's illustration, 2^k ≈ 7). Defined by [Gupta-K.-Lee 03], inspired by [Assouad 83, Clarkson 97]. Call a metric doubling if dim(M) = O(1). This notion captures every norm on R^k, and is robust to taking subsets, unions of sets, small distortion of distances, etc., unlike previous suggestions based on the cardinality |B(x,r)| [Plaxton-Richa-Rajaraman 97, Karger-Ruhl 02, Faloutsos-Kamel 94, K.-Lee 03].
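
To make the definition concrete, here is a small sketch (my own illustration, not from the talk): it greedily covers the ball B(x,r) of a finite point set by balls of radius r/2, so the count it returns upper-bounds the number of half-radius balls needed at that scale; taking log_2 of the maximum over sampled x and r gives a rough estimate of dim(M):

    def cover_count(points, dist, x, r):
        # Greedily pick centers inside B(x, r) so that every point of the ball
        # lies within r/2 of some center; len(centers) upper-bounds the number
        # of half-radius balls needed to cover B(x, r).
        ball = [p for p in points if dist(p, x) <= r]
        centers = []
        for p in ball:
            if all(dist(p, c) > r / 2 for c in centers):
                centers.append(p)
        return len(centers)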

NNS in Doubling Metrics

Theorem [K.-Lee 04a]: There is a simple (1+ε)-NNS scheme with query time (1/ε)^{O(dim(S))} · log Φ, where Φ = d_max/d_min is the spread; preprocessing (storage) n · 2^{O(dim(S))}; and insertion/deletion time 2^{O(dim(S))} · log Φ · loglog Φ. It outperforms previous schemes [Plaxton-Richa-Rajaraman 98, Clarkson 99, Karger-Ruhl 02]: it is simpler, more widely applicable, deterministic, and needs no a priori information. It nearly matches the Euclidean low-dimensional case [Arya et al. 94], and explains empirical successes (it's just easy). Subsequent enhancements: optimal storage O(n) [Beygelzimer-Kakade-Langford 06], also implemented with very good empirical results; bounds independent of Φ [K.-Lee 04b, Mendel-HarPeled 05, Gottlieb-Cole 06]; improvements for Euclidean metrics [Indyk-Naor 06, Dasgupta-Freund 08].

Nets

Motivation: approximate the metric at a single scale r > 0, providing a spatial sample of the metric space (e.g., grids in R^2 at scales 1/2, 1/4, 1/8). General approach: choose representatives Y iteratively.

Definition: Y ⊆ M is called an r-net if both:
1. For all y_1, y_2 in Y, d(y_1, y_2) ≥ r [packing]
2. For all x in M\Y, there is y in Y with d(x, y) < r [covering]
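
For a finite point set, such a net can be built by the standard greedy pass sketched below (the function name is mine; any processing order works):

    def r_net(points, dist, r):
        # Greedy r-net: kept points are pairwise >= r apart (packing), and every
        # discarded point is within < r of some kept point (covering).
        net = []
        for x in points:
            if all(dist(x, y) >= r for y in net):
                net.append(x)
        return net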

Navigating Nets

NNS scheme (vanilla version). Preprocessing: compute a 2^i-net Y_i for every i in Z, and add local links from each 2^i-net point to nearby 2^{i-1}-net points. Query algorithm: iteratively go to finer and finer nets, finding at each scale the net point y_i in Y_i closest to the query q. Then d(q, y_i) ≤ OPT + 2^i and d(y_i, y_{i-1}) ≤ 2·OPT + 2^i + 2^{i-1}. Thus the number of local links per point is 2^{O(dim(S))}.
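
A runnable sketch of this descent, reusing euclidean and r_net from the earlier snippets. To stay short it rescans each whole net instead of following precomputed local links, descends to the finest scale instead of stopping early at a (1+ε)-approximate answer, and uses one simple valid filtering radius, which need not match the exact rule in [K.-Lee 04a]:

    import math

    def build_nets(points, dist):
        # One net per scale 2^i, from above the diameter down to below the
        # minimum interpoint distance (so the finest net is the whole set).
        gaps = [dist(a, b) for a in points for b in points if a != b]
        i_min = math.floor(math.log2(min(gaps)))
        i_max = math.ceil(math.log2(max(gaps)))
        return {i: r_net(points, dist, 2 ** i) for i in range(i_min, i_max + 1)}

    def navigate(nets, dist, q):
        # Go from the coarsest net to the finest, keeping at each scale only the
        # net points close enough to q; the closest point of each net always
        # survives the filter, so the final minimum is the exact nearest neighbor.
        levels = sorted(nets, reverse=True)
        Z = nets[levels[0]]
        for i in levels:
            radius = min(dist(q, z) for z in Z) + 2 ** i
            Z = [y for y in nets[i] if dist(q, y) <= radius]
        return min(Z, key=lambda y: dist(q, y))

    pts = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (10.0, 10.0)]
    print(navigate(build_nets(pts, euclidean), euclidean, (2.5, 1.5)))  # -> (3.0, 1.0)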

Embeddings [Basic Technique]

An embedding of M into l_1 is a mapping f: M → R^m. We say f has distortion K ≥ 1 if d_M(x,y) ≤ ||f(x)-f(y)||_1 ≤ K · d_M(x,y) for all x,y in M.

Example (the slide's figure): a tree metric on points x, y, z, w embedded into R^2, with f(x)=(1,0), f(z)=(-1,0), etc.; part of the tree embeds with distortion 1, but where to place f(w)? The resulting distortion is 4/3.

Another example: the discrete cube {0,1}^r with d(x,y) = ||x-y||_2, under the identity map: ||x-y||_2 ≤ ||x-y||_1 ≤ √r · ||x-y||_2, so distortion = √r? Nah.

A very powerful concept, with many applications (including NNS).
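
Distortion is easy to measure for a concrete finite map; the helper below (added for illustration) computes the distortion of a given embedding between two finite metrics, and reproduces the √r figure for the identity map on the cube with r = 3:

    from itertools import combinations

    def distortion(points, d_M, f, d_target):
        # Distortion of f = (max expansion) * (max contraction); equivalently the
        # smallest K such that, after rescaling the image,
        # d_M(x,y) <= d_target(f(x),f(y)) <= K * d_M(x,y) for all pairs.
        ratios = [d_target(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
        return max(ratios) / min(ratios)

    cube = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    l2 = lambda u, v: sum((ui - vi) ** 2 for ui, vi in zip(u, v)) ** 0.5
    l1 = lambda u, v: sum(abs(ui - vi) for ui, vi in zip(u, v))
    print(distortion(cube, l2, lambda x: x, l1))  # -> 3 ** 0.5, the √r bound for r = 3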

A Few Embedding Theorems

Every n-point metric embeds into l_2 (and thus into l_1) with distortion O(log n) [Bourgain 86]; this is tight on expanders. Every n-point tree metric embeds into l_1 isometrically, and into l_2 with distortion O((loglog n)^{1/2}) [Matousek 99]. If M is doubling, then the snowflake metric d_M^{1/2} embeds into l_2 (and thus into l_1) with distortion O(1) [Assouad 83]. Every doubling tree metric embeds into l_2 with distortion O(1) [Gupta-K.-Lee 03].

Some Open Problems [Embeddings]

Dimension reduction in l_2. Conjecture: if M is Euclidean (a subset of l_2) and doubling, then it embeds with distortion O(1) into low-dimensional l_2 (or l_1). Known: dimension O(log n), where n = #points (not the doubling dimension) [Johnson-Lindenstrauss 84]. Known: the snowflake metric d_M^{1/2} admits such an embedding [Assouad 83].

Planar graphs. Conjecture: every planar metric embeds into l_1 with distortion O(1).
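
For reference, the O(log n) dimension bound of Johnson-Lindenstrauss is achieved by a random linear map; a minimal sketch, with the constant 8 and the Gaussian entries as illustrative choices:

    import math, random

    def jl_project(points, eps):
        # Johnson-Lindenstrauss: a random Gaussian linear map to
        # m = O(log n / eps^2) dimensions preserves all pairwise Euclidean
        # distances up to a factor (1 +/- eps) with high probability.
        n, d = len(points), len(points[0])
        m = max(1, math.ceil(8 * math.log(n) / eps ** 2))
        R = [[random.gauss(0, 1) / math.sqrt(m) for _ in range(d)] for _ in range(m)]
        return [[sum(r[j] * p[j] for j in range(d)) for r in R] for p in points]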

Edit Distance [Specific Metric]

Edit distance (ED) between two strings x, y in Σ^d: the minimum number of character insertions / deletions / substitutions needed to transform one string into the other. Extensively used, with many applications and variants. Examples: ED(and, an) = 1; ED(0101010, 1010101) = 2.

Computational problems:
1. Computing ED for two input strings. Currently, the best runtime is essentially quadratic, O(d^2/log^2 d) [Masek-Paterson 80]. (A minimal version of the underlying dynamic program follows below.)
2. Quick approximation (near-linear time). Currently, the best approximation factor is 2^{Õ(√(log d))} [Andoni-Onak 08]. In a smoothed model: O(1/ε) approximation in time d^{1+ε} [Andoni-K. 07].
3. Estimating ED in restricted computational models: sublinear time, the data-stream model, or limited communication.
4. NNS under ED (polynomial preprocessing, sublinear query time). Currently, the best bounds are obtained via embedding into L_1.
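
The dynamic program referred to in item 1, in its plain O(d^2) form (without the log-factor speedups of Masek-Paterson):

    def edit_distance(x, y):
        # Classic dynamic program over prefixes, O(|x|*|y|) time.
        prev = list(range(len(y) + 1))
        for i, xi in enumerate(x, 1):
            cur = [i]
            for j, yj in enumerate(y, 1):
                cur.append(min(prev[j] + 1,                 # delete xi
                               cur[j - 1] + 1,              # insert yj
                               prev[j - 1] + (xi != yj)))   # substitute / match
            prev = cur
        return prev[-1]

    print(edit_distance("and", "an"))           # 1
    print(edit_distance("0101010", "1010101"))  # 2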

Ulam's metric

Definitions: a string s in Σ^d is called a permutation if it consists of distinct symbols. Ulam's metric = edit distance on the set of permutations [Ulam 72]. For simplicity, suppose Σ = {1,...,d} and ignore substitutions. Example: x = 123456789, y = 234657891.

Motivations: a permutation can model a ranking, a deck of cards, a sequence of genes, etc. It is a special case of edit distance, useful for developing techniques.
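
With substitutions ignored, the distance between two permutations over the same symbols equals 2·(d − LCS), and the LCS can be computed as a longest increasing subsequence after relabeling; a short sketch of this standard reduction (my addition, not from the slides):

    import bisect

    def ulam_distance(x, y):
        # Insertions/deletions only: ED(x, y) = |x| + |y| - 2 * LCS(x, y).
        # Relabel y by each symbol's position in x; since x has distinct symbols,
        # LCS(x, y) equals the longest increasing subsequence of the relabeling.
        pos = {s: i for i, s in enumerate(x)}
        seq = [pos[s] for s in y if s in pos]
        tails = []                      # patience sorting computes |LIS|
        for v in seq:
            k = bisect.bisect_left(tails, v)
            if k == len(tails):
                tails.append(v)
            else:
                tails[k] = v
        return (len(x) - len(tails)) + (len(y) - len(tails))

    print(ulam_distance("123456789", "234657891"))  # 4: two symbol moves, each = delete + insert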

Embedding of permutations

Theorem [Charikar-K. 06]: Edit distance on permutations (aka Ulam's metric) embeds into l_1 with distortion O(log d). (This distortion bound is optimal [Andoni-K. 07].)

Proof. Define f with one coordinate per pair of distinct symbols a, b, namely f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)). Intuition: sign(f_{a,b}(P)) indicates whether a appears before b in P; thus |f_{a,b}(P) - f_{a,b}(Q)| measures whether {a,b} is an inversion in P vs. Q.

Claim 1: ||f(P) - f(Q)||_1 ≤ O(log d) · ED(P,Q). Assume w.l.o.g. that ED(P,Q) = 2, i.e. Q is obtained from P by moving one symbol s; the general case then follows by the triangle inequality over a sequence P = P_0, P_1, ..., P_t = Q, namely ||f(P) - f(Q)||_1 ≤ Σ_{j=1}^{t} ||f(P_j) - f(P_{j-1})||_1. Total contribution from coordinates with s ∈ {a,b}: 2 · Σ_k (1/k) ≤ O(log d). From the other coordinates: Σ_k k · (1/k - 1/(k+1)) ≤ O(log d).
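
A small experiment with this embedding; the exact normalization in [Charikar-K. 06] may differ from the 1/(position gap) form used here, which follows the description above, and the code reuses ulam_distance from the previous sketch:

    from itertools import combinations

    def ck_embedding(P):
        # One coordinate per pair {a, b}: the sign records which of a, b appears
        # first in P, and the magnitude decays with the distance between them.
        pos = {s: i for i, s in enumerate(P)}
        return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(P), 2)}

    def l1_diff(f, g):
        return sum(abs(f[c] - g[c]) for c in f)

    P, Q = "123456789", "234657891"
    # Claims 1 and 2 bound the first value below in terms of the second.
    print(l1_diff(ck_embedding(P), ck_embedding(Q)), ulam_distance(P, Q))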

Embedding of permutations (continued)

Theorem [Charikar-K. 06]: Edit distance on permutations (aka Ulam's metric) embeds into l_1 with distortion O(log d).

Proof (continued). Claim 1: ||f(P) - f(Q)||_1 ≤ O(log d) · ED(P,Q) (previous slide). Claim 2: ||f(P) - f(Q)||_1 ≥ ¼ · ED(P,Q) [alternative proof by Gopalan-Jayram-K.]. Assume w.l.o.g. that P = identity. Edit Q into P = identity using quicksort: choose a random pivot (say in Q = 234657891), delete all characters inverted with respect to the pivot, and repeat recursively on the left and right portions. Now argue that ED(P,Q) ≤ 2 · E[#quicksort deletions] ≤ 4 · ||f(P) - f(Q)||_1. The surviving subsequence is increasing, thus ED(P,Q) ≤ 2 · #deletions. For every inversion (a,b) in Q: Pr[a deleted by pivot b] ≤ 1/(|Q^{-1}[a] - Q^{-1}[b]| + 1) ≤ 2 · |f_{a,b}(P) - f_{a,b}(Q)|.
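
The random deletion process in Claim 2 is easy to simulate; the sketch below (my illustration, with P = identity implicit) estimates E[#quicksort deletions] for the running example, to be compared against ED(P,Q)/2 = 2:

    import random
    from statistics import mean

    def quicksort_deletions(a):
        # One run of the Claim 2 process on a list of distinct integers: pick a
        # random pivot, delete every element inverted w.r.t. it (larger but before
        # it, or smaller but after it), then recurse on the two surviving sides.
        if len(a) <= 1:
            return 0
        i = random.randrange(len(a))
        pivot = a[i]
        left = [v for v in a[:i] if v < pivot]
        right = [v for v in a[i + 1:] if v > pivot]
        deleted = len(a) - 1 - len(left) - len(right)
        return deleted + quicksort_deletions(left) + quicksort_deletions(right)

    Q = [int(c) for c in "234657891"]
    print(mean(quicksort_deletions(Q) for _ in range(5000)))  # E[#deletions] >= ED(P,Q)/2 = 2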

Open Problems [Sublinear Algorithms]

Estimate, in sublinear time, the distance between an input permutation P and the identity [testing distance to monotonicity]: if the distance is δ·d, use only Õ(1/δ) queries to P. An O(log d)-approximation follows from the embedding; actually, a factor-2 approximation is known. Is there a lower bound for approximation 1+ε? Question: does deciding whether the distance is < d/100 or > d/99 require d^{Ω(1)} queries to P?

Similar questions for block operations [transposition distance]: a 3-approximation is known (exercise). Is there a lower bound for 1+ε approximation? Remark: this distance is not known to be computable in polynomial time.

Research Objectives

Two intertwined goals:
1. Understand the complexity of metric spaces from a mathematical and computational perspective, e.g. identify geometric properties that reveal a simple underlying structure.
2. Develop algorithmic techniques that exploit such characteristics, e.g. design tools that dissect a metric into smaller pieces, or faithfully convert it into a simpler metric.

Concrete directions: find new NNS algorithms for different metrics (in particular in high dimension); classify and characterize metric spaces via embeddings.