Locality Sensitive Hashing


February 1

LSH in Hamming space

The following discussion focuses on the notion of Locality Sensitive Hashing, which was first introduced in [5]. We focus on the case of the Hamming metric, but LSH can be seen as a general framework which applies to several metrics, e.g. the l_2 metric.

Definition 1 (Hamming distance). Given two strings x, y ∈ {0,1}^d, the Hamming distance d_H(x, y) is the number of positions at which x and y differ. For example, let x = 0110 and y = 0011. Then d_H(x, y) = 2.

We focus on the problem of Approximate Nearest Neighbor search in subsets of ({0,1}^d, d_H) when the dimension is high (assume d ≫ log n). Instead of solving the Approximate Nearest Neighbor problem directly, we solve the Approximate Near Neighbor problem, which is defined as follows.

Definition 2 (Approximate Near Neighbor problem). Let P ⊆ {0,1}^d. Given r > 0 and ε > 0, build a data structure s.t. for any query q ∈ {0,1}^d it does the following:

- if ∃ p ∈ P s.t. d_H(p, q) ≤ r, then report a point p' ∈ P s.t. d_H(p', q) ≤ (1 + ε)r,
- if ∀ p ∈ P, d_H(p, q) > (1 + ε)r, then report "no".

The data structure that we present here is randomized, and there is a probability of failure. More precisely, the following will be proven.

Theorem 3. Let P ⊆ {0,1}^d. Given r > 0 and ε > 0, the LSH data structure satisfies the following. Fix any query q ∈ {0,1}^d:

- if ∃ p ∈ P s.t. d_H(p, q) ≤ r, then, if the preprocessing succeeds for q, the data structure reports a point p' ∈ P s.t. d_H(p', q) ≤ (1 + ε)r,
- if ∀ p ∈ P, d_H(p, q) > (1 + ε)r, then it reports "no".

The preprocessing succeeds for q with constant probability. The space required is O(dn + n^{1 + 1/(1+ε)} log n), the preprocessing time is O(d n^{1 + 1/(1+ε)} log n), and the query time is O(d n^{1/(1+ε)} log n).

The method is based on the idea of using hash functions which have the nice property that, with good probability, they map similar strings (or, more generally, points) to the same buckets.

Definition 4. Let r_1 < r_2 and p_1 > p_2. We call a family H of hash functions (r_1, r_2, p_1, p_2)-sensitive if for any x, y ∈ {0,1}^d,

    d_H(x, y) ≤ r_1 ⟹ Pr[h(x) = h(y)] ≥ p_1,
    d_H(x, y) ≥ r_2 ⟹ Pr[h(x) = h(y)] ≤ p_2.
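As a quick illustration (a minimal sketch, not part of the original notes), the Hamming distance on bit-strings can be computed as follows; the example strings are the ones used above.

    def hamming_distance(x: str, y: str) -> int:
        """Number of positions at which the two bit-strings differ."""
        assert len(x) == len(y)
        return sum(1 for a, b in zip(x, y) if a != b)

    print(hamming_distance("0110", "0011"))  # prints 2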

In the Hamming metric case we define the following family of functions.

Definition 5 (Family of hash functions). Let H = {h_i(x) = x_i | x = (x_1, …, x_d), i ∈ {1, …, d}}. Obviously, |H| = d. Pick h ∈ H uniformly at random. Then

    Pr[h(x) = h(y)] = 1 − d_H(x, y)/d.

Corollary 6. The family H is (r, cr, 1 − r/d, 1 − cr/d)-sensitive, where r > 0, c > 1.

However, the probabilities 1 − r/d and 1 − cr/d can be close to each other.

Definition 7. Given a parameter k, define the new family G(H):

    G(H) = {g : {0,1}^d → {0,1}^k | g(x) = (h_1(x), …, h_k(x))}.

In other words, a function g chosen uniformly at random from G(H) projects a point p ∈ {0,1}^d onto k randomly and independently chosen coordinates. Obviously, |G(H)| = d^k. Now, we choose uniformly at random L functions g_1, …, g_L ∈ G(H).

Preprocessing algorithm.

    for i from 1 to L do
        Pick uniformly at random g_i ∈ G(H).
        For each p ∈ P, assign p to bucket g_i(p) (in hash table T_i).

The preprocessing time is O(L · n · d · k). The space usage: L hash tables with n pointers to strings per table ⟹ O(L · n). In order to store the n points themselves we need O(d · n) space.

Query algorithm.

    for i from 1 to L do
        for each string p in bucket g_i(q) do
            if number of retrieved strings > 3L then
                return "no"
            end if
            if d_H(q, p) ≤ cr then
                return p
            end if

The query time is O(L(k + d)).

Let p* be any r-near neighbor of q. The execution of our algorithm is successful if both of the following events happen:

    A: ∃ i ∈ {1, …, L} s.t. g_i(p*) = g_i(q).
    B: Fewer than 3L useless strings (strings at distance ≥ cr from q) lie in the buckets g_i(q), i ∈ {1, …, L}.

Let p_1 = 1 − r/d and p_2 = 1 − cr/d. For a given j, Pr[g_j(p*) = g_j(q)] ≥ p_1^k. Setting k = log_{1/p_2} n yields

    Pr[g_j(p*) = g_j(q)] ≥ n^{−ln(1/p_1)/ln(1/p_2)}.
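The preprocessing and query algorithms above can be sketched in a few lines of Python. This is an illustrative toy implementation, not part of the original notes: points are bit-strings, the parameter names k, L, c, r follow the text, and the concrete values in the usage lines are arbitrary. The usage applies the parameter setting k = log_{1/p_2} n and L = n^{ln(1/p_1)/ln(1/p_2)} from the analysis.

    import math
    import random
    from collections import defaultdict

    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))

    def preprocess(P, d, k, L):
        # Each g_i is a tuple of k coordinates chosen uniformly at random,
        # i.e. a random member of G(H).
        gs = [tuple(random.randrange(d) for _ in range(k)) for _ in range(L)]
        tables = [defaultdict(list) for _ in range(L)]
        for p in P:
            for g, T in zip(gs, tables):
                T[tuple(p[i] for i in g)].append(p)   # bucket g_i(p) of T_i
        return gs, tables

    def query(q, gs, tables, c, r):
        L = len(gs)
        retrieved = 0
        for g, T in zip(gs, tables):
            for p in T[tuple(q[i] for i in g)]:
                retrieved += 1
                if retrieved > 3 * L:
                    return None                        # report "no"
                if hamming(p, q) <= c * r:
                    return p
        return None                                    # report "no"

    # Arbitrary toy instance; k and L follow the analysis in the text.
    n, d, r, c = 1000, 100, 5, 2
    p1, p2 = 1 - r / d, 1 - c * r / d
    k = math.ceil(math.log(n) / math.log(1 / p2))      # k = log_{1/p2} n
    L = math.ceil(n ** (math.log(1 / p1) / math.log(1 / p2)))
    P = ["".join(random.choice("01") for _ in range(d)) for _ in range(n)]
    gs, tables = preprocess(P, d, k, L)
    # Expected: a point within distance c*r of the query (here P[0] itself).
    print(query(P[0], gs, tables, c, r))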

3 Hence, P r[a] (1 n ln 1/p 1 ln 1/p ) L. Setting L = n ln 1/p 1 ln 1/p P r[a] (1 1 L )L 1 e. we obtain: Let p P s.t. d H (p, q) c r. Given j, P r[g j (p ) = g j (q)] p k = 1 n. The expected number of strings p P s.t. d H (p, q) cr and also lie in the same bucket with q, is L. Hence 1, P r[b] 1 3. After setting the parameters we conclude: Query time: O(dn ln 1/p 1 ln 1/p log n), Space: O(dn + n 1+ ln 1/p 1 ln 1/p log n), Preprocessing time O(dn 1+ ln 1/p 1 ln 1/p log n). We finally notice that ln 1/p 1 c=1+ɛ = ln 1/p and we omit the technical details. ln(1 r/d) ln(1 (1 + ɛ)r/d) ɛ High probability. The probability can be amplified by repetition. We can achieve 1 n c for any constant c > 0 by building O(log n) data structures as in Theorem 3. Solving the Approximate Nearest Neighbor problem. The idea is to do binary search over the range of distances 1,..., d. Better complexity bounds can be obtained by binary search over the distances 1, (1 + ɛ), (1 + ɛ),..., d. However, in other metrics it is not obvious that someone can solve the Approximate Nearest Neighbor problem with Approximate Near Neighbor data structures. A solution to this problem is obtained in [4] and can be stated as follows. Theorem 8. Let P be a given set of n points in a metric space, and let c = 1 + ɛ > 1, f (0, 1), and γ (1/n, 1) be prescribed parameters. Assume that we are given a data structure for the (c, r)-approximate near neighbor that uses space S(n, c, f), has query time Q(n, c, f), and has failure probability f. Then there exists a data structure for answering c(1 + O(γ))-NN queries in time O(log n)q(n, c, f) with failure probability O(f log n). The resulting data structure uses O(S(n, c, f)/γ log n) space. LSH in l In the previous section we have seen an LSH family for the Hamming metric. It is known that the data structure obtained there can be used in order to solve the problem in l. This is obtained by a non-trivial reduction which translates the ANN problem in l to the ANN problem in the Hamming space [4]. The first LSH function directly applicable to the l metric can be described as follows. Definition 9 (LSH family for l ). Let p R d and v N(0, 1) d. Let also w a parameter (to be defined later) and t [0, w] chosen uniformly at random. Then, h(p) = p, v + t. w 1 Recall Markov s inquality: P r(x α) E[X], where α > 0. α Meaning the d-dimensional standard normal distribution. 3

Now let p, q ∈ R^d. We have

    Pr[h(p) = h(q)] = ∫₀^w Pr[|⟨p, v⟩ − ⟨q, v⟩| = x] (1 − x/w) dx.

Now we have seen³ that ⟨p, v⟩ − ⟨q, v⟩ = ⟨p − q, v⟩ ∼ N(0, ‖p − q‖²). Hence,

    Pr[h(p) = h(q) | w] = ∫₀^w (2 / (√(2π) ‖p − q‖)) exp(−x²/(2‖p − q‖²)) (1 − x/w) dx.

Notice that in l_2 we can assume w.l.o.g. that r = 1. Then, for approximation ratio 1 + ε, we need to make the following two probabilities as distinct as possible:

    Pr[h(p) = h(q) | ‖p − q‖ = 1, w] = ∫₀^w (2/√(2π)) exp(−x²/2) (1 − x/w) dx,

    Pr[h(p) = h(q) | ‖p − q‖ = 1 + ε, w] = ∫₀^w (2 / (√(2π)(1 + ε))) exp(−x²/(2(1 + ε)²)) (1 − x/w) dx.

By the previous discussion of the LSH framework we can see that we need to focus on minimizing the term

    ρ_w = log(1/ Pr[h(p) = h(q) | ‖p − q‖ = 1, w]) / log(1/ Pr[h(p) = h(q) | ‖p − q‖ = 1 + ε, w]).

Indeed, in [3] they prove the following.

Lemma 10. There exists w such that ρ_w ≤ 1/(1 + ε).

The above has been verified by numerical computations.

Some intuition behind the LSH family. We will now try to give a more intuitive description of the LSH family defined above. First we randomly project the points⁴ and then apply a randomly shifted grid⁵ with cell side-width w. The functions g : R^d → N^k which are implied by the above discussion⁶ then simply return the id of the corresponding cell in the randomly shifted grid.

Better results. In [1], a better exponent is achieved (roughly 1/(1 + ε)²), which is known to be nearly optimal. In [2], an even better exponent is achieved by designing an algorithmic scheme which depends on the dataset and is no longer oblivious to the points, namely data-dependent LSH.

³ jl.pdf
⁴ Target dimension log n/ε² ⟹ distances are approximately preserved.
⁵ Random shift implies positive probability of including two close points in the same cell.
⁶ Recall that in the LSH scheme we concatenate functions of the first family H.
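Such a numerical check can be reproduced with a rough Monte Carlo sketch (not from the notes; the values of ε, w, and the trial count are arbitrary). It uses the fact that, conditioned on the projected gap |⟨p − q, v⟩| = x, the random shift t separates the pair with probability min(1, x/w):

    import math
    import random

    def collision_prob(dist, w, trials=200_000):
        # Estimate Pr[h(p) = h(q)] for ||p - q|| = dist as the average of
        # max(0, 1 - gap/w) over gap = |<p - q, v>| ~ |N(0, dist^2)|.
        total = 0.0
        for _ in range(trials):
            gap = abs(random.gauss(0.0, dist))
            total += max(0.0, 1.0 - gap / w)
        return total / trials

    eps, w = 0.5, 3.0
    p1 = collision_prob(1.0, w)
    p2 = collision_prob(1.0 + eps, w)
    rho_w = math.log(1 / p1) / math.log(1 / p2)
    print(rho_w, 1 / (1 + eps))   # for this w, rho_w comes out below 1/(1+eps)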

References

[1] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008.

[2] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proc. of the 47th Annual ACM Symposium on Theory of Computing, STOC '15, pages 793–801, New York, NY, USA, 2015. ACM.

[3] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, SCG '04, pages 253–262, New York, NY, USA, 2004. ACM.

[4] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.

[5] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symp. on Theory of Computing, STOC '98, pages 604–613, 1998.
