Compact Indexes for Flexible Top-k Retrieval
1 Compact Indexes for Flexible Top-k Retrieval
Simon Gog (Institute of Theoretical Informatics, Karlsruhe Institute of Technology) and Matthias Petri (Computing and Information Systems, The University of Melbourne)
July 2015
2 Top-k document retrieval
Given:
- A collection $D = \{d_1, \dots, d_{N-1}\}$. Each $d_i$ is a string over the alphabet $\Sigma = [2, \sigma]$, terminated by the sentinel character 1 (also written #).
- $\mathcal{D} = D \cup \{d_N\}$, with $d_N = 0$.
- A bag-of-words query $Q = \{q_1, q_2, \dots, q_m\}$ (an unordered set of size $m$).
Problem: Given a collection $\mathcal{D}$, a query $Q$ of length $m$, and a similarity measure $S : \mathcal{D} \times \mathcal{P}_{=m}(\Sigma^*) \to \mathbb{R}$, calculate the top-$k$ documents of $\mathcal{D}$ with regard to $Q$ and $S$. That is, a sorted list of document identifiers $T = \{\tau_1, \dots, \tau_k\}$ with $S(d_{\tau_i}, Q) \ge S(d_{\tau_{i+1}}, Q)$ for $i < k$ and $S(d_{\tau_k}, Q) \ge S(d_j, Q)$ for $j \notin T$.
3 Example
Fix a concatenation $C$ of $\mathcal{D}$ (documents numbered in order of appearance):
$C^{word}$ = LA O LA # O LA LA LA # O O LA #
$C = d_1 d_2 d_3 d_4$
$S^{sfreq}(d, q) := f_{d,q}$ (i.e. single-term frequency ranking)
$S^{sfreq}(d_1, LA) = 2$, $S^{sfreq}(d_2, LA) = 3$, $S^{sfreq}(d_3, LA) = 1$, $S^{sfreq}(d_4, LA) = 0$.
Top-2: $T = \{2, 1\}$
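The single-term frequency ranking of this example can be checked with a few lines of Python. This is only a brute-force sketch: the function names are made up for illustration, and numbering the documents by their order of appearance in the concatenation is an assumption.

```python
def split_documents(words, sentinel="#"):
    """Split a word sequence into documents, each ending at the sentinel."""
    docs, current = [], []
    for w in words:
        if w == sentinel:
            docs.append(current)
            current = []
        else:
            current.append(w)
    return docs


def top_k_sfreq(docs, q, k):
    """Return the 1-based IDs of the k documents with the highest f_{d,q}."""
    freq = {i: d.count(q) for i, d in enumerate(docs, start=1)}
    # Sort by decreasing frequency; break ties by smaller document ID.
    return sorted(freq, key=lambda i: (-freq[i], i))[:k]


C_word = "LA O LA # O LA LA LA # O O LA #".split()
docs = split_documents(C_word)
print(top_k_sfreq(docs, "LA", 2))
```

Scanning every document is of course exactly what the index structures discussed in the following slides are meant to avoid.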
4 Previous and related work
- Optimal time ($O(|q| + k)$) and space solution for single-term frequency ranking by Navarro & Nekrich (SODA 2012)
- Multi-term ranking for term frequency using linear space and time dependent on $n$ and $m$ by Hon et al. (J. ACM 2014)
- Larsen et al. (CPM 2014): reduction of Boolean matrix multiplication to the problem of finding elements that contain both terms of a two-term query
5 Previous and related work (continued)
Our goal: a practical solution for multi-term queries and a wide range of similarity measures, such as Okapi BM25.
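6 Okapi BM25 similarity measure
Successful IR similarity measure:
$$S^{BM25}_{Q,d} = \sum_{q \in Q} \underbrace{\frac{(k_1 + 1)\, f_{d,q}}{k_1 \left(1 - b + b\,\frac{n_d}{n_{avg}}\right) + f_{d,q}}}_{=w_{d,q}} \cdot \underbrace{f_{Q,q} \cdot \ln\left(\frac{N - F_{D,q} + 0.5}{F_{D,q} + 0.5}\right)}_{=w_{Q,q}}$$
The score depends on the following document-dependent factors:
- $f_{d,q}$: term frequency (number of occurrences of term $q$ in $d$)
- $F_{D,q}$: document frequency (number of distinct documents that contain $q$)
- $n_d$: length of document $d$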
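To make the roles of the individual factors concrete, the BM25 score can be written out directly in Python. This is a minimal sketch, not the authors' implementation: the function name and argument layout are hypothetical, and the defaults k1 = 1.2 and b = 0.75 are commonly used values assumed here, since the slides do not state the parameters.

```python
import math


def bm25(query, doc, N, doc_freq, n_avg, k1=1.2, b=0.75):
    """BM25 score of one document for a bag-of-words query.

    query, doc : lists of terms
    N          : number of documents in the collection
    doc_freq   : dict mapping a term q to F_{D,q}
    n_avg      : average document length
    k1, b      : assumed default parameters (not given in the slides)
    """
    n_d = len(doc)
    score = 0.0
    for q in set(query):
        f_dq = doc.count(q)          # term frequency in the document
        if f_dq == 0:
            continue                 # term contributes nothing
        f_Qq = query.count(q)        # term frequency in the query
        F = doc_freq[q]              # document frequency
        w_Q = f_Qq * math.log((N - F + 0.5) / (F + 0.5))
        w_d = (k1 + 1) * f_dq / (k1 * (1 - b + b * n_d / n_avg) + f_dq)
        score += w_d * w_Q
    return score


# One-word document of average length: w_d is exactly 1, so the score is w_Q.
print(bm25(["x"], ["x"], N=10, doc_freq={"x": 2}, n_avg=1.0))
```

Note that only w_d needs the document itself; w_Q can be computed once per query term, which is what makes upper-bound estimates on w_d useful during the search.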
10 Previous practical solution
The GREEDY framework of Culpepper et al. (ESA 2010) [1] consists of
- a Compressed Suffix Array (CSA) of the concatenation of $\mathcal{D}$
- a Wavelet Tree of the Document Array of $\mathcal{D}$
The CSA provides phrase search and snippet extraction. This functionality is missing in Inverted Indexes (II).
11 The GREEDY framework
[Figure: example text T over words ω with documents separated by #, its CSA, and the document array D.]
The interval of a query term q in D corresponds to the (multi)set of documents which contain q.
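The relation between the suffix array interval of a term and the document array can be illustrated with uncompressed arrays. The sketch below uses a naive suffix array over words instead of a CSA; the helper names are hypothetical, and the input is the word sequence from the earlier example rather than the text of this figure.

```python
def build_document_array(words, sentinel="#"):
    """Return (sa, D): a naive word-level suffix array and the document array.

    D[p] is the ID (1-based) of the document containing suffix sa[p].
    """
    # Document ID of every text position: 1 + number of sentinels before it.
    doc_of_pos, doc_id = [], 1
    for w in words:
        doc_of_pos.append(doc_id)
        if w == sentinel:
            doc_id += 1
    # Naive construction: sort suffix start positions lexicographically.
    sa = sorted(range(len(words)), key=lambda i: words[i:])
    return sa, [doc_of_pos[i] for i in sa]


def term_interval(words, sa, q):
    """Inclusive suffix array interval [l, r] of all suffixes starting with q."""
    hits = [p for p, i in enumerate(sa) if words[i] == q]
    return hits[0], hits[-1]


words = "LA O LA # O LA LA LA # O O LA #".split()
sa, D = build_document_array(words)
l, r = term_interval(words, sa, "LA")
print(sorted(D[l:r + 1]))  # multiset of documents containing LA
```

Each document appears in the interval once per occurrence of the term, so the term frequency of a document is its multiplicity in `D[l..r]`.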
12 The GREEDY framework
Represent the document array D as a wavelet tree WTD.
Example: search for the top documents containing a term ω (frequency-based ranking):
- push the initial state (the root of WTD together with the interval of ω)
- pop the state with the largest interval
- expand it (O(1) time) and push the resulting child states
- repeat pop, expand and push until a leaf is popped
- pop a leaf and report its document together with its frequency
- continue until the requested number of documents has been reported
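The pop, expand and push loop can be sketched with a priority queue. The following Python code imitates the traversal but not the data structure: instead of a succinct wavelet tree it keeps explicit lists of document IDs per node, so expanding a state costs linear time here rather than O(1). All names are illustrative assumptions.

```python
import heapq


def greedy_top_k(doc_array, l, r, k):
    """Top-k most frequent document IDs in doc_array[l..r] (inclusive).

    Each state is a node of an implicit binary tree over the ID range
    [a, b] together with the IDs of the interval that fall into it. The
    number of IDs in a state is an upper bound on any frequency below it.
    """
    items = doc_array[l:r + 1]
    if not items:
        return []
    # Max-heap via negated sizes; (a, b) is unique per node, so the lists
    # are never compared when sizes tie.
    heap = [(-len(items), min(items), max(items), items)]
    result = []
    while heap and len(result) < k:
        neg_size, a, b, items = heapq.heappop(heap)      # pop largest state
        if a == b:                                       # leaf: report
            result.append((a, -neg_size))
            continue
        mid = (a + b) // 2                               # expand
        left = [d for d in items if d <= mid]
        right = [d for d in items if d > mid]
        if left:
            heapq.heappush(heap, (-len(left), a, mid, left))
        if right:
            heapq.heappush(heap, (-len(right), mid + 1, b, right))
    return result


D = [1, 2, 1, 3, 2, 2, 1, 2]
print(greedy_top_k(D, 0, len(D) - 1, 2))
```

Because a popped leaf has at least as many entries as every state still in the queue, no unexplored document can have a higher frequency, which is why reporting leaves in pop order is correct.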
22 The GREEDY framework (conclusion)
- Elegant and simple algorithm
- Size: $|CSA(T)| + |WT(D)| = nH_k(T) + n \log |\mathcal{D}| + o(n \log |\mathcal{D}|)$ bits using a plain WT, where $|\mathcal{D}|$ is the number of documents
- Worst-case time depends on the number of distinct documents in the lexicographic range
- Culpepper et al. (ESA 2010) showed that it is practical for single-term frequency ranking; however, because of the large index size (character-based indexing), only for small collections
23 Generalize and improve GREEDY
- multi-term (a state consists of multiple intervals)
- ranked-AND or ranked-OR version
- more complex similarity measures: TF×IDF, BM25, LMDS
- Implementation: 64-bit, word alphabet
Three tricks to achieve a better score estimation:
- Overestimate $\max_{d \in D_v}\{f_{d,q}\}$ as: interval size - unique docs in interval + 1
- Use the repetition array to get the number of unique docs in $O(1)$
- $\min_{d \in D_v}\{n_d\}$ can be determined in $O(1)$ time: sort doc IDs according to length, so that the smallest ID in the range of $v$ belongs to the shortest document
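The first and third trick combine into an upper bound on the document-dependent BM25 weight of any document below a wavelet tree node. The sketch below assumes the BM25 form given earlier with hypothetical default parameters k1 = 1.2 and b = 0.75, a non-negative query weight, and illustrative function names; it is not taken from the authors' code.

```python
def tf_weight(f, n_d, n_avg, k1=1.2, b=0.75):
    """Document-dependent BM25 weight w_{d,q} for frequency f and length n_d."""
    return (k1 + 1) * f / (k1 * (1 - b + b * n_d / n_avg) + f)


def max_frequency_bound(interval_size, unique_docs):
    """If u distinct documents share s entries, each of the other u - 1
    documents takes at least one entry, so no frequency exceeds s - u + 1."""
    return interval_size - unique_docs + 1


def score_upper_bound(interval_size, unique_docs, min_doc_len, n_avg, w_query):
    """Upper bound on w_{d,q} * w_{Q,q} over all documents in the state.

    tf_weight grows with f and shrinks with n_d, so plugging in the largest
    possible frequency and the smallest length overestimates every document
    (assuming w_query >= 0).
    """
    f_max = max_frequency_bound(interval_size, unique_docs)
    return w_query * tf_weight(f_max, min_doc_len, n_avg)


# Interval with entries [1, 1, 1, 2, 3]: 5 entries, 3 distinct documents.
lengths = {1: 10, 2: 5, 3: 8}
n_avg = sum(lengths.values()) / len(lengths)
bound = score_upper_bound(5, 3, min(lengths.values()), n_avg, w_query=1.0)
print(bound)
```

Using the number of unique documents instead of the raw interval size tightens the estimate, so fewer states have to be expanded before the top-k result is certain.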
27 Document Frequency: Binary Suffix Tree (BST)
[Figure: binary suffix tree of the example collection with nodes v, edge labels LA, O and #, the sets of repeated document IDs stored at the inner nodes, and the resulting bitvector H.]
28 Document Frequency
Sadakane's [2] solution:
- H stores the number of repeated doc IDs per inner node in unary and is at most 2n bits
- add an o(n)-bit select structure
For [l, r] ← CSA.search(q):
document_frequency(H, [l, r])
    s ← r - l + 1
    y ← select(H, r, 1)
    if l = 0 then
        return s - (y - r + 1)
    else
        x ← select(H, l, 1)
        return s - (y - r + 1 - (x - l + 1))
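The counting argument behind this query can be tested on a small input. The sketch below builds everything naively: a sorted suffix list, an LCP array, and a binary suffix tree obtained by splitting each interval at its leftmost LCP minimum. The unary layout of H (h zeros followed by a 1 per inner node, in inorder) and all function names are assumptions made for illustration, and select is a linear scan instead of an o(n)-bit structure.

```python
def suffix_array(words):
    return sorted(range(len(words)), key=lambda i: words[i:])


def lcp_array(words, sa):
    """lcp[p] = length of the common prefix of suffixes sa[p-1] and sa[p]."""
    lcp = [0] * len(sa)
    for p in range(1, len(sa)):
        i, j, k = sa[p - 1], sa[p], 0
        while i + k < len(words) and j + k < len(words) and words[i + k] == words[j + k]:
            k += 1
        lcp[p] = k
    return lcp


def duplicate_counts(D, lcp):
    """h[i]: doc IDs shared by the left and right subtree of the inner node
    between leaves i and i+1 of the binary suffix tree."""
    h, stack = [0] * (len(D) - 1), [(0, len(D) - 1)]
    while stack:
        i, j = stack.pop()
        if i == j:
            continue
        m = min(range(i + 1, j + 1), key=lambda p: lcp[p])  # leftmost minimum
        h[m - 1] = len(set(D[i:m]) & set(D[m:j + 1]))
        stack += [(i, m - 1), (m, j)]
    return h


def select1(H, j):
    """0-based position of the j-th 1 in H (j >= 1), by linear scan."""
    seen = 0
    for pos, bit in enumerate(H):
        seen += bit
        if bit == 1 and seen == j:
            return pos
    raise ValueError("fewer than j ones in H")


def document_frequency(H, l, r):
    s = r - l + 1
    if l == r:                      # guard added for this sketch: single leaf
        return 1
    y = select1(H, r)
    if l == 0:
        return s - (y - r + 1)
    x = select1(H, l)
    return s - ((y - r + 1) - (x - l + 1))


def build(words, sentinel="#"):
    doc_of_pos, doc_id = [], 1
    for w in words:
        doc_of_pos.append(doc_id)
        doc_id += w == sentinel
    sa = suffix_array(words)
    D = [doc_of_pos[i] for i in sa]
    h = duplicate_counts(D, lcp_array(words, sa))
    H = [bit for c in h for bit in [0] * c + [1]]   # unary: 0^h 1 per node
    return sa, H


words = "a b a # b b # c a #".split()
sa, H = build(words)
hits = [p for p, i in enumerate(sa) if words[i] == "a"]
print(document_frequency(H, hits[0], hits[-1]))     # distinct docs containing "a"
```

The query works because a subtree with s leaves contains s minus the sum of its duplicate counts distinct documents, and y - r + 1 and x - l + 1 are the numbers of zeros before the r-th and l-th 1 in H.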
29 Document Frequency for Subsets $D_v$
Solution for document subsets $D_v$ represented in WTD:
- For each 0 in H, record the repeated doc ID in an array R (O(n log N) bits).
- Build a WT over R (WTR).
- Map [l, r] to WTR (via rank/select).
- Traverse WTD and WTR simultaneously.
30 Document Frequency for Subsets: Repetition Array
[Figure: the binary suffix tree of the example with bitvector H and the repetition array R, which lists the repeated document IDs of the inner nodes.]
31 Space Reduction
- $\hat{R}$: omit the entries in R which belong to the ST root.
- $\hat{R}_\ell$ and $\hat{D}_\ell$: if the phrase length is restricted to $\ell$, sorting intervals with a common prefix of length $\ell$ does not change the correctness of the method.
- Both WT-$\hat{R}_\ell$ and WT-$D_\ell$ contain frequency information. Given WT-$\hat{R}_\ell$, omit duplicates for intervals with a common prefix of length $\ell$ in WT-$D_\ell$.
- After sorting the intervals in the example, the document array has 10 entries instead of 14.
[Figure: the binary suffix tree of the example with H and the reduced repetition array.]
34 Implementation
- Based on SDSL components.
- Available at (link not preserved in this transcription).
- Includes an engineered II implementation (block-max WAND plus document reordering for better compression).
35 Collection Statistics
              GOV2           CLUEWEB09
n             ,468,78,575    4,579,89,95
N             5,5,79         5,,4
n_avg         9              88
n_min
n_max         7,49           7,44
σ             9,77,9         9,4,65
|C^raw|       46 GiB         . TiB
|C^word|      7 GiB          8 GiB
Note: the input is the word parsing C^word (generated by Indri). Queries are from the TREC 2005/2006 efficiency track.
36 Index Sizes
[Figure: index size in GiB on GOV2 and CLUEWEB09 for the self-index variants, broken down into CSA, DF, WT-$D_\ell$ and WT-$\hat{R}_\ell$.]
Details:
- Horizontal line: size of the word parsing
- PII: typically 5%-6% (+ original text); our II: 7 GiB for GOV2
37 Evaluated States
[Figure: explored search space (in %) versus k for three score estimation methods: range size ($f_{D_v,q}$) only; range size ($f_{D_v,q}$) and minimum document length; repeats ($\delta_{D_v,q}$) and minimum document length.]
38 Query Times (1)
[Figure: query time in ms versus k for the self-index variants and INVIDX-W (left); mean time per state in µs versus query length in words (right).]
Left: BM25 ranked-OR retrieval on GOV2. Right: time per state.
39 Query Times (2)
[Figure: query time in ms versus k for the self-index variants and INVIDX-E, with panels for ranked-AND, MWE ranked-AND, and the similarity measures TF×IDF, BM25 and LMDS.]
Ranked-AND BM25 runtime for unparsed and MWE-parsed queries (left) and ranked-OR runtime for different similarity measures and indexes (right).
40 Conclusion
- Extended the GREEDY approach to multi-term queries and complex scoring functions
- Conceptually very simple search engine
- Flexible: ranked-AND/OR, scoring function
- II is still better for bag-of-words queries (for precomputed scores, i.e. a fixed scoring function)
- But: the self-index based solution provides more functionality:
  - phrase search (experiment with multi-word expressions)
  - text extraction
  - query completion (without query log)
- First self-index based system at this scale
- Future work: WTs of higher arity, faster state processing
41 References
[1] J. Shane Culpepper, Gonzalo Navarro, Simon J. Puglisi, and Andrew Turpin. Top-k ranked document search in general text databases. In Proc. ESA, pages 194-205, 2010.
[2] Kunihiko Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Alg., 5(1):12-22, 2007.
Preview: Text Indexing
Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text
More informationarxiv: v1 [cs.ds] 22 Nov 2012
Faster Compact Top-k Document Retrieval Roberto Konow and Gonzalo Navarro Department of Computer Science, University of Chile {rkonow,gnavarro}@dcc.uchile.cl arxiv:1211.5353v1 [cs.ds] 22 Nov 2012 Abstract:
More informationText Indexing: Lecture 6
Simon Gog gog@kit.edu - 0 Simon Gog: KIT The Research University in the Helmholtz Association www.kit.edu Reviewing the last two lectures We have seen two top-k document retrieval frameworks. Question
More informationA Faster Grammar-Based Self-Index
A Faster Grammar-Based Self-Index Travis Gagie 1 Pawe l Gawrychowski 2 Juha Kärkkäinen 3 Yakov Nekrich 4 Simon Puglisi 5 Aalto University Max-Planck-Institute für Informatik University of Helsinki University
More informationTheoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts
Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with
More informationarxiv: v1 [cs.ds] 15 Feb 2012
Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl
More informationRank and Select Operations on Binary Strings (1974; Elias)
Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo
More informationIR Models: The Probabilistic Model. Lecture 8
IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index
More informationSmaller and Faster Lempel-Ziv Indices
Smaller and Faster Lempel-Ziv Indices Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, Universidad de Chile, Chile. {darroyue,gnavarro}@dcc.uchile.cl Abstract. Given a text T[1..u] over an
More informationA Space-Efficient Frameworks for Top-k String Retrieval
A Space-Efficient Frameworks for Top-k String Retrieval Wing-Kai Hon, National Tsing Hua University Rahul Shah, Louisiana State University Sharma V. Thankachan, Louisiana State University Jeffrey Scott
More informationPractical Indexing of Repetitive Collections using Relative Lempel-Ziv
Practical Indexing of Repetitive Collections using Relative Lempel-Ziv Gonzalo Navarro and Víctor Sepúlveda CeBiB Center for Biotechnology and Bioengineering, Chile Department of Computer Science, University
More informationarxiv: v1 [cs.ds] 19 Apr 2011
Fixed Block Compression Boosting in FM-Indexes Juha Kärkkäinen 1 and Simon J. Puglisi 2 1 Department of Computer Science, University of Helsinki, Finland juha.karkkainen@cs.helsinki.fi 2 Department of
More informationLecture 18 April 26, 2012
6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and
More informationForbidden Patterns. {vmakinen leena.salmela
Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu
More informationBoolean and Vector Space Retrieval Models
Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1
More informationarxiv: v1 [cs.ds] 25 Nov 2009
Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,
More informationIndexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile
Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:
More informationProbabilistic Information Retrieval
Probabilistic Information Retrieval Sumit Bhatia July 16, 2009 Sumit Bhatia Probabilistic Information Retrieval 1/23 Overview 1 Information Retrieval IR Models Probability Basics 2 Document Ranking Problem
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 11: Probabilistic Information Retrieval 1 Outline Basic Probability Theory Probability Ranking Principle Extensions 2 Basic Probability Theory For events A
More informationSelf-Indexed Grammar-Based Compression
Fundamenta Informaticae XXI (2001) 1001 1025 1001 IOS Press Self-Indexed Grammar-Based Compression Francisco Claude David R. Cheriton School of Computer Science University of Waterloo fclaude@cs.uwaterloo.ca
More informationNew Lower and Upper Bounds for Representing Sequences
New Lower and Upper Bounds for Representing Sequences Djamal Belazzougui 1 and Gonzalo Navarro 2 1 LIAFA, Univ. Paris Diderot - Paris 7, France. dbelaz@liafa.jussieu.fr 2 Department of Computer Science,
More informationSmall-Space Dictionary Matching (Dissertation Proposal)
Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length
More informationAlphabet Friendly FM Index
Alphabet Friendly FM Index Author: Rodrigo González Santiago, November 8 th, 2005 Departamento de Ciencias de la Computación Universidad de Chile Outline Motivations Basics Burrows Wheeler Transform FM
More informationA Simple Alphabet-Independent FM-Index
A Simple Alphabet-Independent -Index Szymon Grabowski 1, Veli Mäkinen 2, Gonzalo Navarro 3, Alejandro Salinger 3 1 Computer Engineering Dept., Tech. Univ. of Lódź, Poland. e-mail: sgrabow@zly.kis.p.lodz.pl
More informationOptimal Dynamic Sequence Representations
Optimal Dynamic Sequence Representations Gonzalo Navarro Yakov Nekrich Abstract We describe a data structure that supports access, rank and select queries, as well as symbol insertions and deletions, on
More informationRetrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1
Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)
More informationEfficient Accessing and Searching in a Sequence of Numbers
Regular Paper Journal of Computing Science and Engineering, Vol. 9, No. 1, March 2015, pp. 1-8 Efficient Accessing and Searching in a Sequence of Numbers Jungjoo Seo and Myoungji Han Department of Computer
More informationCS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope
More informationFast String Kernels. Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200
Fast String Kernels Alexander J. Smola Machine Learning Group, RSISE The Australian National University Canberra, ACT 0200 Alex.Smola@anu.edu.au joint work with S.V.N. Vishwanathan Slides (soon) available
More informationCompressed Index for Dynamic Text
Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution
More informationDynamic Entropy-Compressed Sequences and Full-Text Indexes
Dynamic Entropy-Compressed Sequences and Full-Text Indexes VELI MÄKINEN University of Helsinki and GONZALO NAVARRO University of Chile First author funded by the Academy of Finland under grant 108219.
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Università di Pisa GIOVANNI MANZINI Università del Piemonte Orientale VELI MÄKINEN University of Helsinki AND GONZALO NAVARRO
More informationBreaking a Time-and-Space Barrier in Constructing Full-Text Indices
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices Wing-Kai Hon Kunihiko Sadakane Wing-Kin Sung Abstract Suffix trees and suffix arrays are the most prominent full-text indices, and their
More informationOn Compressing and Indexing Repetitive Sequences
On Compressing and Indexing Repetitive Sequences Sebastian Kreft a,1,2, Gonzalo Navarro a,2 a Department of Computer Science, University of Chile Abstract We introduce LZ-End, a new member of the Lempel-Ziv
More informationSimple Compression Code Supporting Random Access and Fast String Matching
Simple Compression Code Supporting Random Access and Fast String Matching Kimmo Fredriksson and Fedor Nikitin Department of Computer Science and Statistics, University of Joensuu PO Box 111, FIN 80101
More informationReducing the Space Requirement of LZ-Index
Reducing the Space Requirement of LZ-Index Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile {darroyue, gnavarro}@dcc.uchile.cl 2 Dept. of
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationSuccinct Suffix Arrays based on Run-Length Encoding
Succinct Suffix Arrays based on Run-Length Encoding Veli Mäkinen Gonzalo Navarro Abstract A succinct full-text self-index is a data structure built on a text T = t 1 t 2...t n, which takes little space
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle
More informationCompressed Representations of Sequences and Full-Text Indexes
Compressed Representations of Sequences and Full-Text Indexes PAOLO FERRAGINA Dipartimento di Informatica, Università di Pisa, Italy GIOVANNI MANZINI Dipartimento di Informatica, Università del Piemonte
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationFinite State Transducers
Finite State Transducers Eric Gribkoff May 29, 2013 Original Slides by Thomas Hanneforth (Universitat Potsdam) Outline 1 Definition of Finite State Transducer 2 Examples of FSTs 3 Definition of Regular
More informationCAIM: Cerca i Anàlisi d Informació Massiva
1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationSpace-Efficient Re-Pair Compression
Space-Efficient Re-Pair Compression Philip Bille, Inge Li Gørtz, and Nicola Prezza Technical University of Denmark, DTU Compute {phbi,inge,npre}@dtu.dk Abstract Re-Pair [5] is an effective grammar-based
More informationRanking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval
Ranking-II Temporal Representation and Retrieval Models Temporal Information Retrieval Ranking in Information Retrieval Ranking documents important for information overload, quickly finding documents which
More informationQuerying. 1 o Semestre 2008/2009
Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the
More informationChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign
Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/homes/czhai
More informationFaster Compact On-Line Lempel-Ziv Factorization
Faster Compact On-Line Lempel-Ziv Factorization Jun ichi Yamamoto, Tomohiro I, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan
More information15 Text search. P.D. Dr. Alexander Souza. Winter term 11/12
Algorithms Theory 15 Text search P.D. Dr. Alexander Souza Text search Various scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases
More informationString Range Matching
String Range Matching Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi Department of Computer Science, University of Helsinki Helsinki, Finland firstname.lastname@cs.helsinki.fi Abstract. Given strings
More informationDefine M to be a binary n by m matrix such that:
The Shift-And Method Define M to be a binary n by m matrix such that: M(i,j) = iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = iff P[.. i] T[j-i+.. j]
More informationLRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations
LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations Jérémy Barbay 1, Johannes Fischer 2, and Gonzalo Navarro 1 1 Department of Computer Science, University of Chile, {jbarbay gnavarro}@dcc.uchile.cl
More informationSpace-Efficient Construction Algorithm for Circular Suffix Tree
Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes
More informationStronger Lempel-Ziv Based Compressed Text Indexing
Stronger Lempel-Ziv Based Compressed Text Indexing Diego Arroyuelo 1, Gonzalo Navarro 1, and Kunihiko Sadakane 2 1 Dept. of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile.
More informationSuffix Sorting Algorithms
Suffix Sorting Algorithms Timo Bingmann Text-Indexierung Vorlesung 2016-12-01 INSTITUTE OF THEORETICAL INFORMATICS ALGORITHMICS KIT University of the State of Baden-Wuerttemberg and National Research Center
More informationDealing with Text Databases
Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity
More informationJumbled String Matching: Motivations, Variants, Algorithms
Jumbled String Matching: Motivations, Variants, Algorithms Zsuzsanna Lipták University of Verona (Italy) Workshop Combinatorial structures for sequence analysis in bioinformatics Milano-Bicocca, 27 Nov
More informationBoolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval
More informationCompact Data Strutures
(To compress is to Conquer) Compact Data Strutures Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search in Big Linked Data 23 TH AUGUST 2017 Agenda
More informationSuccincter text indexing with wildcards
University of British Columbia CPM 2011 June 27, 2011 Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview Problem overview
More informationLanguage Models. Web Search. LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Slides based on the books: 13
Language Models LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing Web Search Slides based on the books: 13 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationBLAST: Basic Local Alignment Search Tool
.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].
More informationarxiv: v2 [cs.ds] 6 Jul 2015
Online Self-Indexed Grammar Compression Yoshimasa Takabatake 1, Yasuo Tabei 2, and Hiroshi Sakamoto 1 1 Kyushu Institute of Technology {takabatake,hiroshi}@donald.ai.kyutech.ac.jp 2 PRESTO, Japan Science
More informationBurrows-Wheeler Transforms in Linear Time and Linear Bits
Burrows-Wheeler Transforms in Linear Time and Linear Bits Russ Cox (following Hon, Sadakane, Sung, Farach, and others) 18.417 Final Project BWT in Linear Time and Linear Bits Three main parts to the result.
More informationUniversità degli studi di Udine
Università degli studi di Udine Computing LZ77 in Run-Compressed Space This is a pre print version of the following article: Original Computing LZ77 in Run-Compressed Space / Policriti, Alberto; Prezza,
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationNatural Language Processing. Topics in Information Retrieval. Updated 5/10
Natural Language Processing Topics in Information Retrieval Updated 5/10 Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing Background
More informationWithin-Document Term-Based Index Pruning with Statistical Hypothesis Testing
Within-Document Term-Based Index Pruning with Statistical Hypothesis Testing Sree Lekha Thota and Ben Carterette Department of Computer and Information Sciences University of Delaware, Newark, DE, USA
CSE182-L7. Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding CSE182
CSE182-L7 Protein Sequence Analysis Patterns (regular expressions) Profiles HMM Gene Finding 10-07 CSE182 Bell Labs Honors Pattern matching 10-07 CSE182 Just the Facts Consider the set of all substrings
Approximate String Matching with Lempel-Ziv Compressed Indexes
Approximate String Matching with Lempel-Ziv Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2 and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
PV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk
arxiv: v1 [cs.ds] 21 Nov 2012
The Rightmost Equal-Cost Position Problem arxiv:1211.5108v1 [cs.ds] 21 Nov 2012 Maxime Crochemore 1,3, Alessio Langiu 1 and Filippo Mignosi 2 1 King's College London, London, UK {Maxime.Crochemore,Alessio.Langiu}@kcl.ac.uk
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES
COMPRESSED INDEXING DATA STRUCTURES FOR BIOLOGICAL SEQUENCES DO HUY HOANG (B.C.S. (Hons), NUS) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE SCHOOL OF COMPUTING NATIONAL
Approximate String Matching with Ziv-Lempel Compressed Indexes
Approximate String Matching with Ziv-Lempel Compressed Indexes Luís M. S. Russo 1, Gonzalo Navarro 2, and Arlindo L. Oliveira 1 1 INESC-ID, R. Alves Redol 9, 1000 LISBOA, PORTUGAL lsr@algos.inesc-id.pt,
Modern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,
Succinct Data Structures for Text and Information Retrieval
Succinct Data Structures for Text and Information Retrieval Simon Gog 1 Matthias Petri 2 1 Institute of Theoretical Informatics Karlsruhe Institute of Technology 2 Computing and Information Systems The
Succinct Data Structures for NLP-at-Scale
Succinct Data Structures for NLP-at-Scale Matthias Petri Trevor Cohn Computing and Information Systems The University of Melbourne, Australia first.last@unimelb.edu.au November 20, 2016 Who are we? Trevor
CSE 202 Homework 4 Matthias Springer, A
CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and
Internal Pattern Matching Queries in a Text and Applications
Internal Pattern Matching Queries in a Text and Applications Tomasz Kociumaka Jakub Radoszewski Wojciech Rytter Tomasz Waleń Abstract We consider several types of internal queries: questions about subwords
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Algorithm Theory. 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore. Christian Schindelhauer
Algorithm Theory 13 Text Search - Knuth, Morris, Pratt, Boyer, Moore Institut für Informatik Wintersemester 2007/08 Text Search Scenarios Static texts Literature databases Library systems Gene databases
arxiv: v1 [cs.ds] 30 Nov 2018
Faster Attractor-Based Indexes Gonzalo Navarro 1,2 and Nicola Prezza 3 1 CeBiB Center for Biotechnology and Bioengineering 2 Dept. of Computer Science, University of Chile, Chile. gnavarro@dcc.uchile.cl
On-line String Matching in Highly Similar DNA Sequences
On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2, Thierry Lecroq 1, Mourad Elloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis
Chap 2: Classical models for information retrieval
Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic
Online Sorted Range Reporting and Approximating the Mode
Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted
EECS 229A Spring 2007 * * (a) By stationarity and the chain rule for entropy, we have
EECS 229A Spring 2007 * * Solutions to Homework 3 1. Problem 4.11 on pg. 93 of the text. Stationary processes (a) By stationarity and the chain rule for entropy, we have H(X 0 ) + H(X n X 0 ) = H(X 0,
String Searching with Ranking Constraints and Uncertainty
Louisiana State University LSU Digital Commons LSU Doctoral Dissertations Graduate School 2015 String Searching with Ranking Constraints and Uncertainty Sudip Biswas Louisiana State University and Agricultural
Information Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
Lecture 4 Ranking Search Results. Many thanks to Prabhakar Raghavan for sharing most content from the following slides
Lecture 4 Ranking Search Results Many thanks to Prabhakar Raghavan for sharing most content from the following slides Recap of the previous lecture Index construction Doing sorting with limited main memory
Information Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning, Pandu Nayak, and Prabhakar Raghavan Lecture 14: Learning to Rank Sec. 15.4 Machine learning for IR
Lecture 1 : Data Compression and Entropy
CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for
Efficient Enumeration of Regular Languages
Efficient Enumeration of Regular Languages Margareta Ackerman and Jeffrey Shallit University of Waterloo, Waterloo ON, Canada mackerma@uwaterloo.ca, shallit@graceland.uwaterloo.ca Abstract. The cross-section