Hash-based Indexing: Application, Impact, and Realization Alternatives

Similar documents
Fuzzy-Fingerprints for Text-Based Information Retrieval

Algorithms for Nearest Neighbors

Syntax versus Semantics:

Algorithms for Data Science: Lecture on Finding Similar Items

Linear Spectral Hashing

Lecture 14 October 16, 2014

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Topical Sequence Profiling

CSCE 561 Information Retrieval System Models

Ruslan Salakhutdinov Joint work with Geoff Hinton. University of Toronto, Machine Learning Group

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Similarity Search in High Dimensions II. Piotr Indyk MIT

Term Filtering with Bounded Error

Probabilistic Near-Duplicate. Detection Using Simhash

arxiv: v1 [cs.db] 2 Sep 2014

4 Locality-sensitive hashing using stable distributions

Locality Sensitive Hashing

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Probabilistic Hashing for Efficient Search and Learning

Spectral Hashing: Learning to Leverage 80 Million Images

Mining Data Streams. The Stream Model. Sliding Windows. Counting 1's

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Scalable Algorithms for Distribution Search

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2. IR&DM 13/14

Faster Algorithms for Sparse Fourier Transform

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Why duplicate detection?

Bloom Filters, Minhashes, and Other Random Stuff

Introduction to Information Retrieval

Putting Suffix-Tree-Stemming to Work

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Mining Data Streams. The Stream Model. Sliding Windows. Counting 1's

Efficient Data Reduction and Summarization

Faster Algorithms for Sparse Fourier Transform

CS145: INTRODUCTION TO DATA MINING

Chap 2: Classical models for information retrieval

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Fast similarity searching making the virtual real. Stephen Pickett, GSK

Ad Placement Strategies

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Fully Understanding the Hashing Trick

A Survey on Spatial-Keyword Search

Information Retrieval CS Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Wavelet decomposition of data streams. by Dragana Veljkovic

Machine Learning for natural language processing

Efficient identification of Tanimoto nearest neighbors

Variable Latent Semantic Indexing

Cuckoo Hashing and Cuckoo Filters

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

A Comparison of Extended Fingerprint Hashing and Locality Sensitive Hashing for Binary Audio Fingerprints

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Information Retrieval and Organisation

Applying the Science of Similarity to Computer Forensics. Jesse Kornblum

Distinguishing Topic from Genre

Nearest Neighbor Search with Keywords

Patent Searching using Bayesian Statistics

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze's, linked from

Information Retrieval and Web Search

CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Approximating the Covariance Matrix with Low-rank Perturbations

Jeffrey D. Ullman Stanford University

Lecture 17 03/21, 2017

Behavioral Data Mining. Lecture 2

Geometry of Similarity Search

Approximate Nearest Neighbor (ANN) Search in High Dimensions

B490 Mining the Big Data

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Neighborhood-Based Local Sensitivity

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

Introduction to Cryptography

The Power of Asymmetry in Binary Hashing

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Similarity Join Size Estimation using Locality Sensitive Hashing

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

CPSC 467: Cryptography and Computer Security

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

IR Models: The Probabilistic Model. Lecture 8

A Review: Geographic Information Systems & ArcGIS Basics

Composite Quantization for Approximate Nearest Neighbor Search

Querying. 1 o Semestre 2008/2009

CS5112: Algorithms and Data Structures for Applications

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan

Multi-Label Informed Latent Semantic Indexing

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Introduction to Randomized Algorithms III

Integer Sorting on the word-ram

Efficient Bandit Algorithms for Online Multiclass Prediction

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

Estimation of the Click Volume by Large Scale Regression Analysis May 15, / 50

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

Classification. Team Ravana. Team Members Cliffton Fernandes Nikhil Keswaney

m p -dissimilarity: A data dependent dissimilarity measure

has its own advantages and drawbacks, depending on the questions facing the drug discovery.

Lower bounds on Locality Sensitive Hashing

arxiv: v1 [cs.it] 16 Aug 2017

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

CS249: ADVANCED DATA MINING

Transcription:

Hash-based Indexing: Application, Impact, and Realization Alternatives. Benno Stein and Martin Potthast, Bauhaus University Weimar, Web Technology and Information Systems.

Text-based Information Retrieval (TIR): Motivation. Consider a set of documents D. Term query, the most common retrieval task: find all documents D′ ⊆ D containing a set of query terms. Best practice: index D using an inverted file, as implemented by well-known web search engines.

Text-based Information Retrieval (TIR): Motivation. Document query, given a document d: find all documents D′ ⊆ D with a high similarity to d. Use cases: plagiarism analysis, query by example. Naive approach: compare d with each d′ ∈ D. In detail: construct document models for D and d, obtaining the model sets D and d, and employ a similarity function ϕ : D × D → [0, 1]. Is it possible to be faster than the naive approach?
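
To make the naive approach concrete, here is a minimal sketch (not the authors' implementation); it assumes term-frequency vectors as document models and cosine similarity as ϕ:

```python
import math
from collections import Counter

def model(text):
    """Toy document model: a term-frequency vector."""
    return Counter(text.lower().split())

def phi(x, y):
    """Cosine similarity between two term vectors, in [0, 1]."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def naive_document_query(d, D, threshold=0.8):
    """Compare d against every document in D: O(|D|) similarity
    computations per query, which hash-based indexing avoids."""
    dm = model(d)
    return [dp for dp in D if phi(dm, model(dp)) >= threshold]
```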

Background: Nearest Neighbour Search. Given a set D of m-dimensional points and a point d: find the point d′ ∈ D which is nearest to d. Finding d′ cannot be done better than in O(|D|) time if m exceeds about 10. In our case: 1,000 < m < 100,000 [Weber et al. 1998].

Background: Approximate Nearest Neighbour Search. Given a set D of m-dimensional points and a point d: find some points D′ ⊆ D from a certain ε-neighbourhood of d. Finding D′ can be done in O(1) time with high probability by means of hashing [Indyk and Motwani 1998]. The dimensionality m does not affect the runtime of their algorithm.

Text-based Information Retrieval (TIR): Nearest Neighbour Search. Retrieval tasks and their use cases:
- Grouping: focused search, efficient search (cluster hypothesis), preparation of search results
- Categorization: classification, directory maintenance
- Index-based retrieval / similarity search: partial document similarity (plagiarism analysis), complete document similarity (query by example)
- Near-duplicate detection
Approximate retrieval results are often acceptable.

Similarity Hashing. With standard hash functions, collisions occur accidentally. In similarity hashing, collisions shall occur purposefully, the purpose being high similarity. Given a similarity function ϕ, a hash function hϕ : D → U with U ⊆ ℕ resembles ϕ if it has the following property [Stein 2005]:

hϕ(d) = hϕ(d′) ⇒ ϕ(d, d′) ≥ 1 − ε,  with d, d′ ∈ D, 0 < ε ≤ 1
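
The defining property can be checked empirically on a collection: among all colliding pairs, the lowest observed similarity bounds the ε that hϕ achieves. A minimal sketch, where h_phi and phi are hypothetical placeholders for a similarity hash function and a similarity function:

```python
from collections import defaultdict
from itertools import combinations

def empirical_epsilon(docs, h_phi, phi):
    """Smallest eps observed such that h_phi(d) == h_phi(d')
    implies phi(d, d') >= 1 - eps on this collection."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[h_phi(d)].append(d)  # colliding documents share a bucket
    min_sim = 1.0
    for bucket in buckets.values():
        for d, dp in combinations(bucket, 2):
            min_sim = min(min_sim, phi(d, dp))
    return 1.0 - min_sim
```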

Similarity Hashing: Index Construction. Given a similarity hash function hϕ, a hash index µh : D → 𝒟 with 𝒟 = P(D) is constructed using a hash table T and a standard hash function h : U → {1, ..., |T|}. To index a set of documents D, given their models D: compute for each d ∈ D its hash value hϕ(d), and store a reference to d in T at storage position h(hϕ(d)). To search for documents similar to d, given its model d: return the bucket in T at storage position h(hϕ(d)).
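
A minimal sketch of this construction, assuming a similarity hash function h_phi is supplied (a placeholder, e.g. FF or LSH below); a Python dict stands in for the hash table T together with the standard hash function h:

```python
class SimilarityIndex:
    """Hash index mu_h: maps a document model to the set of
    indexed documents sharing its similarity hash value."""

    def __init__(self, h_phi):
        self.h_phi = h_phi  # similarity hash function h_phi
        self.table = {}     # hash table T; dict hashing plays the role of h

    def index(self, docs):
        for d in docs:
            # Store a reference to d at storage position h(h_phi(d)).
            self.table.setdefault(self.h_phi(d), []).append(d)

    def query(self, d):
        # Return the bucket at storage position h(h_phi(d)).
        return self.table.get(self.h_phi(d), [])
```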

Similarity Hash Functions: Fuzzy-Fingerprinting (FF) [Stein 2005]. Pipeline: (1) a-priori probabilities of prefix classes in the BNC, where all words having the same prefix belong to the same prefix class; (2) distribution of prefix classes in the sample; (3) normalization and difference computation; (4) fuzzification. The result is a fingerprint, e.g. {213235632, 157234594}.
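
A heavily simplified sketch of the FF idea, not the authors' implementation: prefix classes are reduced to first letters, the expected distribution must be supplied (in the original it comes from BNC statistics), and the three-interval fuzzification threshold is an arbitrary assumption:

```python
def fuzzy_fingerprint(tokens, expected, num_classes=26):
    """Simplified Fuzzy-Fingerprinting: compare the sample's
    prefix-class distribution against expected probabilities,
    then fuzzify the deviations into one hash value."""
    # 1. Distribution of prefix classes in the sample
    #    (prefix class = first letter of the word).
    counts = [0] * num_classes
    total = 0
    for tok in tokens:
        c = ord(tok[0].lower()) - ord('a') if tok else -1
        if 0 <= c < num_classes:
            counts[c] += 1
            total += 1
    if total == 0:
        return 0
    dist = [cnt / total for cnt in counts]
    # 2. Normalization and difference computation.
    diffs = [dist[i] - expected[i] for i in range(num_classes)]
    # 3. Fuzzification: map each deviation to one of three
    #    intervals (below, near, above expectation).
    def fuzzify(delta, t=0.01):
        return 0 if delta < -t else (2 if delta > t else 1)
    # 4. Encode the fuzzified deviation vector as a single number.
    code = 0
    for delta in diffs:
        code = code * 3 + fuzzify(delta)
    return code
```

Documents whose prefix-class deviations fall into the same intervals receive the same hash value, so highly similar texts collide.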

Similarity Hash Functions: Locality-Sensitive Hashing (LSH) [Indyk and Motwani 1998, Datar et al. 2004]. Pipeline: (1) vector space with the sample document d and random vectors a₁, ..., aₖ; (2) computation of the dot products aᵢᵀd, mapped onto the real number line; (3) the results of the k dot products are summed. The result is a fingerprint, e.g. {213235632}.
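
A minimal sketch of this scheme, reading the slide literally (the k quantized dot products are summed into one hash value); the Gaussian components (a stable distribution) and the bucket width w are assumptions:

```python
import math
import random

def make_lsh(dim, k, w=1.0, seed=0):
    """LSH for dim-dimensional vectors: k random vectors a_1..a_k;
    the hash value sums the quantized dot products a_i . d."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def h_phi(d):
        total = 0
        for a in A:
            dot = sum(ai * di for ai, di in zip(a, d))
            total += math.floor(dot / w)  # project, then quantize
        return total
    return h_phi
```

Nearby vectors tend to fall into the same quantization cells of each projection and thus receive the same hash value with high probability.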

Similarity Hash Functions: Adjusting Recall and Precision. A set of hash values per document is called a fingerprint. Recall is adjusted via the number of hash codes per document: (FF) the number of fuzzy schemes; (LSH) the number of random vector sets. Precision is adjusted via: (FF) the number of prefix classes or the number of intervals per fuzzy scheme; (LSH) the number of random vectors.
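
The recall knob can be sketched by generalizing the index to several hash functions per document: the fingerprint is the set of hash values, and a query returns the union of the matching buckets. A minimal illustration under the same placeholder assumptions; documents are assumed to be hashable references (e.g. ids):

```python
from collections import defaultdict

class FingerprintIndex:
    """Index over several similarity hash functions: more hash
    functions per document -> larger fingerprints -> higher recall."""

    def __init__(self, hash_functions):
        self.hash_functions = hash_functions
        self.table = defaultdict(list)

    def index(self, docs):
        for d in docs:
            for h_phi in self.hash_functions:  # fingerprint of d
                self.table[h_phi(d)].append(d)

    def query(self, d):
        candidates = []
        for h_phi in self.hash_functions:
            candidates.extend(self.table[h_phi(d)])
        return set(candidates)  # union of all matching buckets
```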

Experimental Setting. Three test collections for three retrieval situations:
1. Web results: 100,000 documents from a focused search; documents as Web retrieval systems return them.
2. RCV1: 100,000 documents from the Reuters Corpus; documents as corporations organize them.
3. Plagiarism corpus: 3,000 documents with high similarity; documents as they appear in plagiarism analysis.
Retrieval tasks: similarity search, near-duplicate detection. Precision and recall were recorded for similarity thresholds ranging from 0 to 1.

Results: plots of recall and precision versus similarity (0 to 1) for FF and LSH on the Web results and RCV1 collections.

Results: plots of recall and precision versus similarity (0 to 1) for FF and LSH on the plagiarism corpus.

Summary. Near-duplicate detection in the plagiarism situation: FF outperforms LSH in terms of precision and recall. Retrieval tasks in standard corpora: FF outperforms LSH in terms of precision; FF and LSH perform equally well in terms of recall. Conclusions: both hash-based indexing methods may be applied in TIR. FF has an advantage in precision, which directly affects runtime performance. Neither hash-based indexing method is limited to TIR; the only prerequisite is a reasonable vector representation.

Thank you!