Par$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini

Size: px
Start display at page:

Download "Par$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini"

Transcription

1 Par$$oned Elias- Fano indexes Giuseppe O)aviano Rossano Venturini

2 Inverted indexes Core data structure of Informa$on Retrieval Documents are sequences of terms 1: [it is what it is not] 2: [what is a] 3: [it is a banana] 1, 2, 3 are document iden$fiers (docids) We want to find the documents that match a query, that is a set of terms

3 Inverted indexes 1: [it is what it is not] 2: [what is a] 3: [it is a banana] Query: [what] Results: 1, 2 Query: [it AND is] Results: 1, 3 Query: [banana OR not] Results: 1, 3 Query: [what AND is] Results: 1, 2 Query: [what AND not] Results: 1

4 Inverted index Pos$ng list: list of docids containing a term Inverted index: pos$ng lists for each term 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3 banana 3 is 3, 1, 2 it 1, 3 not 1 what 2, 1

5 Inverted indexes 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3 banana 3 is 3, 1, 2 it 1, 3 not 1 what 2, 1 Query: it AND is a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, 2

6 Docid- sorted inverted indexes Efficient intersec$on One- pass scan of the pos$ng lists Efficient compression Reduces to storage of monotonically increasing lists of integers

7 Index compression Inverted indexes can be very large Example: ClueWeb09 Track B Collec$on 50M documents 15B pos$ngs No compression (32 bits per integer): 60GB Spoiler: with compression, can go as low as 11GB

8 Sequences in pos$ng lists Generally, a pos$ng lists is composed of Sequence of docids: strictly monotonic Sequence of frequencies/scores: strictly posi$ve Sequence of posi$ons: concatena$on of strictly monotonic lists We focus on docids and frequencies

9 Elias Fano! Monotonic Sequences a i + i Docids Strictly monotonic Interpola$ve coding i a i i=0 a i a i - 1 Non- nega$ve Gap coding a i - 1 Scores Strictly non- nega$ve (Posi$ve)

10 Gap compression Replace list of integers with gaps between adjacent integers Since list is increasing, all gaps are posi$ve a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, 2 a 2, 1 banana 3 is 1, 1, 1 it 1, 2 not 1 what 1, 1

11 Elias- Fano encoding Data structure from the 70s, mostly in the succinct data structures niche Natural encoding of monotonically increasing sequences of integers Recently successfully applied to the representa$on of inverted indexes Used by Facebook Graph Search!

12 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits 0, 0, 1, 1, 2, 3, 4 Extract the upper bits 0, 0, 1, 0, 1, 1, 1 Compute the gaps Encode gaps with unary code Elias- Fano representa$on of the sequence

13 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits Elias- Fano representa$on of the sequence Maximum upper value: [U / 2 l ] where U is largest sequence value Sequence length is n Example: [19 / 2 2 ] = 4 = 100 In upper bits there are [U / 2 l ] zeros and n ones Space [U / 2 l ] + n + nl bits

14 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits Can show that l = [log(u/n)] is op$mal Space [U / 2 l ] + n + nl bits (2 + log(u/n))n bits

15 Elias- Fano representa$on With addi$onal (small) data structures can support fast search and random access Almost- O(1)- $me skips for intersec$on queries (2 + log(u/n))n- bits space independent of values distribu1on! is this a good thing?

16 Term- document matrix Alterna$ve interpreta$on of inverted index a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, a X X banana X is X X X it X X not X what X X Gaps are distances between the Xs

17 Gaps are usually small Assume that documents from the same domain have similar docids unipi.it/a unipi.it/b unipi.it/c unipi.it/d livorno.it livorno.it/nemici pisa X X X X X Cluster of docids Pos$ng lists contain long runs of very close integers That is, long runs of very small gaps

18 Gap compression Gaps are mostly small integers Small integers can be compressed efficiently with universal codes Example: unary code 1: 1 2: 01 3: 001 4: 0001 Example: 2, 3, 5, 9 - > 2, 1, 2, 4 - >

19 Gap compression Gap compression historically the most common representa$on of inverted indexes Drawbacks Decoding is sequen$al, can be slow Bemer compression - > even slower decoding

20 Elias- Fano and clustering Consider the following two lists 1, 2, 3, 4,, n 1, U n random values between 1 and U Both have n elements and last value U Elias- Fano compresses both to the exact same number of bits! But first list is fare more compressible: it is sufficient to store n and U Elias- Fano doesn t exploit clustering

21 Par11oned Elias- Fano XXX XXXXX XXXXXXXXX X X X X XXXX X XXX Par$$on the sequence into chunks Add pointers to the beginning of each chunk Represent each chunk and the sequence of pointers with Elias- Fano If the chunks approximate the clusters, compression improves

22 Par$$on op$miza$on We want to find, among all the possible par$$ons, the one that takes up less space Exhaus$ve search is exponen1al Dynamic programming can be done quadra1c Our solu$on: (1 + ε)- approximate solu$on in linear $me O(n log(1/ε)/log(1 + ε)) Reduce to a shortest path in a sparsified DAG

23 Par$$on op$miza$on Nodes correspond to sequence elements Edges to poten$al chunks Paths = Sequence par$$ons

24 Par$$on op$miza$on Each edge weight is the cost of the chunk defined by the edge endpoints Shortest path = Minimum cost par$$on Edge costs can be computed in O(1) but number of edges is quadra$c!

25 Idea n i

26 Idea n.1 General DAG sparsifica1on technique Quan$ze edge costs in classes of cost between (1 + ε 1 ) i and (1 + ε 1 ) i + 1 For each node and each cost class, keep only one maximal edge O(log n / log (1 + ε 1 )) edges per node! Shortest path in sparsified DAG at most (1 + ε 1 ) $mes more expensive than in original DAG Sparsified DAG can be computed on the fly

27 Idea n.2 XXXXXXXXX X X X X If we split a chunk at an arbitrary posi$on New cost Old cost + cost of new pointer If chunk is big enough, loss is negligible We keep only edges with cost O(1 / ε 2 ) New DAG has O(n log (1 / ε 2 ) / log (1 + ε 1 )) edges! Approxima$on factor becomes (1 + ε 2 ) (1 + ε 1 )

28 Dependency on ε 1 No$ce the scale! Bits per posbng Time in seconds

29 Dependency on ε 2 Here we go from O(n log n) to O(n) Bits per posbng Time in seconds

30 Results on GOV2 and ClueWeb09 GOV2 ClueWeb09 Elias- Fano 5.6GB 14GB Par$$oned Elias- Fano 3.0GB 11GB

31 Results on GOV2 and ClueWeb09 Gov2 ClueWeb09 space doc freq space doc freq GB bpi bpi GB bpi bpi EF single 7.66 (+64.7%) 7.53 (+83.4%) 3.14 (+32.4%) (+23.1%) 7.46 (+27.7%) 2.44 (+11.0%) EF uniform 5.17 (+11.2%) 4.63 (+12.9%) 2.58 (+8.4%) (+11.5%) 6.58 (+12.6%) 2.39 (+8.8%) EF -optimal Interpolative 4.57 ( 1.8%) 4.03 ( 1.8%) 2.33 ( 1.8%) ( 8.3%) 5.33 ( 8.8%) 2.04 ( 7.1%) OptPFD 5.22 (+12.3%) 4.72 (+15.1%) 2.55 (+7.4%) (+11.6%) 6.42 (+9.8%) 2.56 (+16.4%) Varint-G8IU (+202.2%) (+158.2%) 8.98 (+278.3%) (+148.3%) (+88.1%) 8.98 (+308.8%) Gov2 ClueWeb09 TREC 05 TREC 06 TREC 05 TREC 06 Gov2 ClueWeb09 TREC 05 TREC 06 TREC 05 TREC 06 EF single 2.1 (+10%) 4.7 (+1%) 13.6 ( 5%) 15.8 ( 9%) EF uniform 2.1 (+9%) 5.1 (+10%) 15.5 (+8%) 18.9 (+9%) EF -optimal Interpolative 7.5 (+291%) 20.4 (+343%) 55.7 (+289%) 76.5 (+341%) OptPFD 2.2 (+14%) 5.7 (+24%) 16.6 (+16%) 21.9 (+26%) Varint-G8IU 1.5 ( 20%) 4.0 ( 13%) 11.1 ( 23%) 14.8 ( 15%) OR queries EF single 80.7 (+8%) (+10%) (+0%) ( 2%) EF uniform 72.1 ( 3%) ( 3%) ( 3%) ( 4%) EF -optimal Interpolative (+62%) (+62%) (+53%) (+51%) OptPFD 69.5 ( 7%) ( 7%) ( 10%) ( 12%) Varint-G8IU 67.4 ( 10%) ( 10%) ( 15%) ( 17%) AND queries

32 Thanks for your amen$on! Ques$ons?

Par$$oned Elias- Fano Indexes

Par$$oned Elias- Fano Indexes Par$$oned Elias- Fano Indexes Giuseppe O)aviano ISTI- CNR, Pisa Rossano Venturini Università di Pisa Inverted indexes Docid Document 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3

More information

Nearest Neighbor Search with Keywords: Compression

Nearest Neighbor Search with Keywords: Compression Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points

More information

Interes'ng- Phrase Mining for Ad- Hoc Text Analy'cs

Interes'ng- Phrase Mining for Ad- Hoc Text Analy'cs Interes'ng- Phrase Mining for Ad- Hoc Text Analy'cs Klaus Berberich (kberberi@mpi- inf.mpg.de) Joint work with: Srikanta Bedathur, Jens DiIrich, Nikos Mamoulis, and Gerhard Weikum (originally presented

More information

SKA machine learning perspec1ves for imaging, processing and analysis

SKA machine learning perspec1ves for imaging, processing and analysis 1 SKA machine learning perspec1ves for imaging, processing and analysis Slava Voloshynovskiy Stochas1c Informa1on Processing Group University of Geneva Switzerland with contribu,on of: D. Kostadinov, S.

More information

Space and Time-Efficient Data Structures for Massive Datasets

Space and Time-Efficient Data Structures for Massive Datasets Space and Time-Efficient Data Structures for Massive Datasets Giulio Ermanno Pibiri giulio.pibiri@di.unipi.it Supervisor Rossano Venturini Computer Science Department University of Pisa 17/10/2016 1 Evidence

More information

Querying. 1 o Semestre 2008/2009

Querying. 1 o Semestre 2008/2009 Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the

More information

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1

CSE P 501 Compilers. Value Numbering & Op;miza;ons Hal Perkins Winter UW CSE P 501 Winter 2016 S-1 CSE P 501 Compilers Value Numbering & Op;miza;ons Hal Perkins Winter 2016 UW CSE P 501 Winter 2016 S-1 Agenda Op;miza;on (Review) Goals Scope: local, superlocal, regional, global (intraprocedural), interprocedural

More information

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classifica(on III Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley The Perceptron, Again Start with zero weights Visit training instances one by one Try to classify If correct,

More information

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org.

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org. Streaming - 2 Bloom Filters, Distinct Item counting, Computing moments credits:www.mmds.org http://www.mmds.org Outline More algorithms for streams: 2 Outline More algorithms for streams: (1) Filtering

More information

Graphical Models. Lecture 3: Local Condi6onal Probability Distribu6ons. Andrew McCallum

Graphical Models. Lecture 3: Local Condi6onal Probability Distribu6ons. Andrew McCallum Graphical Models Lecture 3: Local Condi6onal Probability Distribu6ons Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. 1 Condi6onal Probability Distribu6ons

More information

Probability and Structure in Natural Language Processing

Probability and Structure in Natural Language Processing Probability and Structure in Natural Language Processing Noah Smith, Carnegie Mellon University 2012 Interna@onal Summer School in Language and Speech Technologies Quick Recap Yesterday: Bayesian networks

More information

Graphical Models. Lecture 10: Variable Elimina:on, con:nued. Andrew McCallum

Graphical Models. Lecture 10: Variable Elimina:on, con:nued. Andrew McCallum Graphical Models Lecture 10: Variable Elimina:on, con:nued Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. 1 Last Time Probabilis:c inference is

More information

Machine learning for Dynamic Social Network Analysis

Machine learning for Dynamic Social Network Analysis Machine learning for Dynamic Social Network Analysis Manuel Gomez Rodriguez Max Planck Ins7tute for So;ware Systems UC3M, MAY 2017 Interconnected World SOCIAL NETWORKS TRANSPORTATION NETWORKS WORLD WIDE

More information

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental

More information

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16 Logis@cs CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign

More information

Gradient Descent for High Dimensional Systems

Gradient Descent for High Dimensional Systems Gradient Descent for High Dimensional Systems Lab versus Lab 2 D Geometry Op>miza>on Poten>al Energy Methods: Implemented Equa3ons for op3mizer 3 2 4 Bond length High Dimensional Op>miza>on Applica3ons:

More information

Alphabet Friendly FM Index

Alphabet Friendly FM Index Alphabet Friendly FM Index Author: Rodrigo González Santiago, November 8 th, 2005 Departamento de Ciencias de la Computación Universidad de Chile Outline Motivations Basics Burrows Wheeler Transform FM

More information

Exponen'al growth and differen'al equa'ons

Exponen'al growth and differen'al equa'ons Exponen'al growth and differen'al equa'ons But first.. Thanks for the feedback! Feedback about M102 Which of the following do you find useful? 70 60 50 40 30 20 10 0 How many resources students typically

More information

Summary of Last Lectures

Summary of Last Lectures Lossless Coding IV a k p k b k a 0.16 111 b 0.04 0001 c 0.04 0000 d 0.16 110 e 0.23 01 f 0.07 1001 g 0.06 1000 h 0.09 001 i 0.15 101 100 root 1 60 1 0 0 1 40 0 32 28 23 e 17 1 0 1 0 1 0 16 a 16 d 15 i

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

Electricity & Magnetism Lecture 3: Electric Flux and Field Lines

Electricity & Magnetism Lecture 3: Electric Flux and Field Lines Electricity & Magnetism Lecture 3: Electric Flux and Field Lines Today s Concepts: A) Electric Flux B) Field Lines Gauss Law Electricity & Magne@sm Lecture 3, Slide 1 Your Comments What the heck is epsilon

More information

Lecture 18 April 26, 2012

Lecture 18 April 26, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and

More information

Algorithms: COMP3121/3821/9101/9801

Algorithms: COMP3121/3821/9101/9801 NEW SOUTH WALES Algorithms: COMP3121/3821/9101/9801 Aleks Ignjatović School of Computer Science and Engineering University of New South Wales TOPIC 4: THE GREEDY METHOD COMP3121/3821/9101/9801 1 / 23 The

More information

Register Alloca.on. CMPT 379: Compilers Instructor: Anoop Sarkar. anoopsarkar.github.io/compilers-class

Register Alloca.on. CMPT 379: Compilers Instructor: Anoop Sarkar. anoopsarkar.github.io/compilers-class Register Alloca.on CMPT 379: Compilers Instructor: Anoop Sarkar anoopsarkar.github.io/compilers-class 1 Register Alloca.on Intermediate code uses unlimited temporaries Simplifying code genera.on and op.miza.on

More information

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Language Modeling III Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Announcements Office hours on website but no OH for Taylor until next week. Efficient Hashing Closed address

More information

Compression. What. Why. Reduce the amount of information (bits) needed to represent image Video: 720 x 480 res, 30 fps, color

Compression. What. Why. Reduce the amount of information (bits) needed to represent image Video: 720 x 480 res, 30 fps, color Compression What Reduce the amount of information (bits) needed to represent image Video: 720 x 480 res, 30 fps, color Why 720x480x20x3 = 31,104,000 bytes/sec 30x60x120 = 216 Gigabytes for a 2 hour movie

More information

Classical Mechanics Lecture 21

Classical Mechanics Lecture 21 Classical Mechanics Lecture 21 Today s Concept: Simple Harmonic Mo7on: Mass on a Spring Mechanics Lecture 21, Slide 1 The Mechanical Universe, Episode 20: Harmonic Motion http://www.learner.org/vod/login.html?pid=565

More information

R ij = 2. Using all of these facts together, you can solve problem number 9.

R ij = 2. Using all of these facts together, you can solve problem number 9. Help for Homework Problem #9 Let G(V,E) be any undirected graph We want to calculate the travel time across the graph. Think of each edge as one resistor of 1 Ohm. Say we have two nodes: i and j Let the

More information

Wavelets & Mul,resolu,on Analysis

Wavelets & Mul,resolu,on Analysis Wavelets & Mul,resolu,on Analysis Square Wave by Steve Hanov More comics at http://gandolf.homelinux.org/~smhanov/comics/ Problem set #4 will be posted tonight 11/21/08 Comp 665 Wavelets & Mul8resolu8on

More information

Predicate abstrac,on and interpola,on. Many pictures and examples are borrowed from The So'ware Model Checker BLAST presenta,on.

Predicate abstrac,on and interpola,on. Many pictures and examples are borrowed from The So'ware Model Checker BLAST presenta,on. Predicate abstrac,on and interpola,on Many pictures and examples are borrowed from The So'ware Model Checker BLAST presenta,on. Outline. Predicate abstrac,on the idea in pictures 2. Counter- example guided

More information

Compact Indexes for Flexible Top-k Retrieval

Compact Indexes for Flexible Top-k Retrieval Compact Indexes for Flexible Top-k Retrieval Simon Gog Matthias Petri Institute of Theoretical Informatics, Karlsruhe Institute of Technology Computing and Information Systems, The University of Melbourne

More information

Succinct Data Structures for Approximating Convex Functions with Applications

Succinct Data Structures for Approximating Convex Functions with Applications Succinct Data Structures for Approximating Convex Functions with Applications Prosenjit Bose, 1 Luc Devroye and Pat Morin 1 1 School of Computer Science, Carleton University, Ottawa, Canada, K1S 5B6, {jit,morin}@cs.carleton.ca

More information

CS 161: Design and Analysis of Algorithms

CS 161: Design and Analysis of Algorithms CS 161: Design and Analysis of Algorithms Greedy Algorithms 3: Minimum Spanning Trees/Scheduling Disjoint Sets, continued Analysis of Kruskal s Algorithm Interval Scheduling Disjoint Sets, Continued Each

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2 CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #15: Mining Streams 2 Today s Lecture More algorithms for streams: (1) Filtering a data stream: Bloom filters Select elements with property

More information

MA/CS 109 Lecture 7. Back To Exponen:al Growth Popula:on Models

MA/CS 109 Lecture 7. Back To Exponen:al Growth Popula:on Models MA/CS 109 Lecture 7 Back To Exponen:al Growth Popula:on Models Homework this week 1. Due next Thursday (not Tuesday) 2. Do most of computa:ons in discussion next week 3. If possible, bring your laptop

More information

Objec&ves. Review. Data structure: Heaps Data structure: Graphs. What is a priority queue? What is a heap?

Objec&ves. Review. Data structure: Heaps Data structure: Graphs. What is a priority queue? What is a heap? Objec&ves Data structure: Heaps Data structure: Graphs Submit problem set 2 1 Review What is a priority queue? What is a heap? Ø Proper&es Ø Implementa&on What is the process for finding the smallest element

More information

Small-Space Dictionary Matching (Dissertation Proposal)

Small-Space Dictionary Matching (Dissertation Proposal) Small-Space Dictionary Matching (Dissertation Proposal) Graduate Center of CUNY 1/24/2012 Problem Definition Dictionary Matching Input: Dictionary D = P 1,P 2,...,P d containing d patterns. Text T of length

More information

CSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule.

CSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule. CSE 473: Ar+ficial Intelligence Markov Models - II Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188

More information

Part 2. Representation Learning Algorithms

Part 2. Representation Learning Algorithms 53 Part 2 Representation Learning Algorithms 54 A neural network = running several logistic regressions at the same time If we feed a vector of inputs through a bunch of logis;c regression func;ons, then

More information

Exact data mining from in- exact data Nick Freris

Exact data mining from in- exact data Nick Freris Exact data mining from in- exact data Nick Freris Qualcomm, San Diego October 10, 2013 Introduc=on (1) Informa=on retrieval is a large industry.. Biology, finance, engineering, marke=ng, vision/graphics,

More information

Q1 current best prac2ce

Q1 current best prac2ce Group- A Q1 current best prac2ce Star2ng from some common molecular representa2on with bond orders, configura2on on chiral centers (e.g. ChemDraw, SMILES) NEW! PDB should become resource for refinement

More information

arxiv: v1 [cs.ds] 15 Feb 2012

arxiv: v1 [cs.ds] 15 Feb 2012 Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl

More information

Data Compression Techniques (Spring 2012) Model Solutions for Exercise 2

Data Compression Techniques (Spring 2012) Model Solutions for Exercise 2 582487 Data Compression Techniques (Spring 22) Model Solutions for Exercise 2 If you have any feedback or corrections, please contact nvalimak at cs.helsinki.fi.. Problem: Construct a canonical prefix

More information

Lesson 76 Introduction to Complex Numbers

Lesson 76 Introduction to Complex Numbers Lesson 76 Introduction to Complex Numbers HL2 MATH - SANTOWSKI Lesson Objectives (1) Introduce the idea of imaginary and complex numbers (2) Prac?ce opera?ons with complex numbers (3) Use complex numbers

More information

Electricity & Magnetism Lecture 5: Electric Potential Energy

Electricity & Magnetism Lecture 5: Electric Potential Energy Electricity & Magnetism Lecture 5: Electric Potential Energy Today... Ø Ø Electric Poten1al Energy Unit 21 session Gravita1onal and Electrical PE Electricity & Magne/sm Lecture 5, Slide 1 Stuff you asked

More information

HW #4. (mostly by) Salim Sarımurat. 1) Insert 6 2) Insert 8 3) Insert 30. 4) Insert S a.

HW #4. (mostly by) Salim Sarımurat. 1) Insert 6 2) Insert 8 3) Insert 30. 4) Insert S a. HW #4 (mostly by) Salim Sarımurat 04.12.2009 S. 1. 1. a. 1) Insert 6 2) Insert 8 3) Insert 30 4) Insert 40 2 5) Insert 50 6) Insert 61 7) Insert 70 1. b. 1) Insert 12 2) Insert 29 3) Insert 30 4) Insert

More information

Seman&cs with Dense Vectors. Dorota Glowacka

Seman&cs with Dense Vectors. Dorota Glowacka Semancs with Dense Vectors Dorota Glowacka dorota.glowacka@ed.ac.uk Previous lectures: - how to represent a word as a sparse vector with dimensions corresponding to the words in the vocabulary - the values

More information

Bellman s Curse of Dimensionality

Bellman s Curse of Dimensionality Bellman s Curse of Dimensionality n- dimensional state space Number of states grows exponen

More information

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classifica(on III. Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classifica(on III Taylor Berg- Kirkpatrick CMU Slides: Dan Klein UC Berkeley Objec(ve Func(ons What do we want from our weights? Depends! So far: minimize (training) errors: This is

More information

PSPACE, NPSPACE, L, NL, Savitch's Theorem. More new problems that are representa=ve of space bounded complexity classes

PSPACE, NPSPACE, L, NL, Savitch's Theorem. More new problems that are representa=ve of space bounded complexity classes PSPACE, NPSPACE, L, NL, Savitch's Theorem More new problems that are representa=ve of space bounded complexity classes Outline for today How we'll count space usage Space bounded complexity classes New

More information

The Boolean Model ~1955

The Boolean Model ~1955 The Boolean Model ~1955 The boolean model is the first, most criticized, and (until a few years ago) commercially more widespread, model of IR. Its functionalities can often be found in the Advanced Search

More information

Lecture 3 : Algorithms for source coding. September 30, 2016

Lecture 3 : Algorithms for source coding. September 30, 2016 Lecture 3 : Algorithms for source coding September 30, 2016 Outline 1. Huffman code ; proof of optimality ; 2. Coding with intervals : Shannon-Fano-Elias code and Shannon code ; 3. Arithmetic coding. 1/39

More information

Author to whom correspondence should be addressed;

Author to whom correspondence should be addressed; Algorithms 2009, 2, 1031-1044; doi:10.3390/a2031031 Article OPEN ACCESS algorithms ISSN 1999-4893 www.mdpi.com/journal/algorithms Graph Compression by BFS Alberto Apostolico 1,2 and Guido Drovandi 3,4,

More information

COMP 562: Introduction to Machine Learning

COMP 562: Introduction to Machine Learning COMP 562: Introduction to Machine Learning Lecture 20 : Support Vector Machines, Kernels Mahmoud Mostapha 1 Department of Computer Science University of North Carolina at Chapel Hill mahmoudm@cs.unc.edu

More information

Pseudospectral Methods For Op2mal Control. Jus2n Ruths March 27, 2009

Pseudospectral Methods For Op2mal Control. Jus2n Ruths March 27, 2009 Pseudospectral Methods For Op2mal Control Jus2n Ruths March 27, 2009 Introduc2on Pseudospectral methods arose to find solu2ons to Par2al Differen2al Equa2ons Recently adapted for Op2mal Control Key Ideas

More information

Divide-and-Conquer Algorithms Part Two

Divide-and-Conquer Algorithms Part Two Divide-and-Conquer Algorithms Part Two Recap from Last Time Divide-and-Conquer Algorithms A divide-and-conquer algorithm is one that works as follows: (Divide) Split the input apart into multiple smaller

More information

boolean queries Inverted index query processing Query optimization boolean model January 15, / 35

boolean queries Inverted index query processing Query optimization boolean model January 15, / 35 boolean model January 15, 2017 1 / 35 Outline 1 boolean queries 2 3 4 2 / 35 taxonomy of IR models Set theoretic fuzzy extended boolean set-based IR models Boolean vector probalistic algebraic generalized

More information

Module 4: Linear Equa1ons Topic D: Systems of Linear Equa1ons and Their Solu1ons. Lesson 4-24: Introduc1on to Simultaneous Equa1ons

Module 4: Linear Equa1ons Topic D: Systems of Linear Equa1ons and Their Solu1ons. Lesson 4-24: Introduc1on to Simultaneous Equa1ons Module 4: Linear Equa1ons Topic D: Systems of Linear Equa1ons and Their Solu1ons Lesson 4-24: Introduc1on to Simultaneous Equa1ons Lesson 4-24: Introduc1on to Simultaneous Equa1ons Purpose for Learning:

More information

Data Structures in Java

Data Structures in Java Data Structures in Java Lecture 20: Algorithm Design Techniques 12/2/2015 Daniel Bauer 1 Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways of

More information

In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond

In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond Paolo Boldi, Sebastiano Vigna Laboratory for Web Algorithmics Università degli

More information

Lecture 16. Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code

Lecture 16. Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code Lecture 16 Agenda for the lecture Error-free variable length schemes (contd.): Shannon-Fano-Elias code, Huffman code Variable-length source codes with error 16.1 Error-free coding schemes 16.1.1 The Shannon-Fano-Elias

More information

Rela%ons and Their Proper%es. Slides by A. Bloomfield

Rela%ons and Their Proper%es. Slides by A. Bloomfield Rela%ons and Their Proper%es Slides by A. Bloomfield What is a rela%on Let A and B be sets. A binary rela%on R is a subset of A B Example Let A be the students in a the CS major A = {Alice, Bob, Claire,

More information

Con$nuous Op$miza$on: The "Right" Language for Fast Graph Algorithms? Aleksander Mądry

Con$nuous Op$miza$on: The Right Language for Fast Graph Algorithms? Aleksander Mądry Con$nuous Op$miza$on: The "Right" Language for Fast Graph Algorithms? Aleksander Mądry We have been studying graph algorithms for a while now Tradi$onal view: Graphs are combinatorial objects graph algorithms

More information

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1

More information

Exponen'al func'ons and exponen'al growth. UBC Math 102

Exponen'al func'ons and exponen'al growth. UBC Math 102 Exponen'al func'ons and exponen'al growth Course Calendar: OSH 4 due by 12:30pm in MX 1111 You are here Coming up (next week) Group version of Quiz 3 distributed by email Group version of Quiz 3 due in

More information

Computer Vision. Pa0ern Recogni4on Concepts. Luis F. Teixeira MAP- i 2014/15

Computer Vision. Pa0ern Recogni4on Concepts. Luis F. Teixeira MAP- i 2014/15 Computer Vision Pa0ern Recogni4on Concepts Luis F. Teixeira MAP- i 2014/15 Outline General pa0ern recogni4on concepts Classifica4on Classifiers Decision Trees Instance- Based Learning Bayesian Learning

More information

Parameter Es*ma*on: Cracking Incomplete Data

Parameter Es*ma*on: Cracking Incomplete Data Parameter Es*ma*on: Cracking Incomplete Data Khaled S. Refaat Collaborators: Arthur Choi and Adnan Darwiche Agenda Learning Graphical Models Complete vs. Incomplete Data Exploi*ng Data for Decomposi*on

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

LOWELL. MICHIGAN. WEDNESDAY, FEBRUARY NUMllEE 33, Chicago. >::»«ad 0:30am, " 16.n«l w 00 ptn Jaekten,.'''4snd4:4(>a tii, ijilwopa

LOWELL. MICHIGAN. WEDNESDAY, FEBRUARY NUMllEE 33, Chicago. >::»«ad 0:30am,  16.n«l w 00 ptn Jaekten,.'''4snd4:4(>a tii, ijilwopa 4/X6 X 896 & # 98 # 4 $2 q $ 8 8 $ 8 6 8 2 8 8 2 2 4 2 4 X q q!< Q 48 8 8 X 4 # 8 & q 4 ) / X & & & Q!! & & )! 2 ) & / / ;) Q & & 8 )

More information

Last Lecture Recap UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 3: Linear Regression

Last Lecture Recap UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 3: Linear Regression UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 3: Linear Regression Yanjun Qi / Jane University of Virginia Department of Computer Science 1 Last Lecture Recap q Data

More information

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity

Lecture 8: Channel and source-channel coding theorems; BEC & linear codes. 1 Intuitive justification for upper bound on channel capacity 5-859: Information Theory and Applications in TCS CMU: Spring 23 Lecture 8: Channel and source-channel coding theorems; BEC & linear codes February 7, 23 Lecturer: Venkatesan Guruswami Scribe: Dan Stahlke

More information

Introduc)on to Ar)ficial Intelligence

Introduc)on to Ar)ficial Intelligence Introduc)on to Ar)ficial Intelligence Lecture 13 Approximate Inference CS/CNS/EE 154 Andreas Krause Bayesian networks! Compact representa)on of distribu)ons over large number of variables! (OQen) allows

More information

CMPSCI611: The Matroid Theorem Lecture 5

CMPSCI611: The Matroid Theorem Lecture 5 CMPSCI611: The Matroid Theorem Lecture 5 We first review our definitions: A subset system is a set E together with a set of subsets of E, called I, such that I is closed under inclusion. This means that

More information

Graphical Models. Lecture 1: Mo4va4on and Founda4ons. Andrew McCallum

Graphical Models. Lecture 1: Mo4va4on and Founda4ons. Andrew McCallum Graphical Models Lecture 1: Mo4va4on and Founda4ons Andrew McCallum mccallum@cs.umass.edu Thanks to Noah Smith and Carlos Guestrin for some slide materials. Board work Expert systems the desire for probability

More information

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma IS4200/CS6200 Informa0on Retrieval PageRank Con+nued with slides from Hinrich Schütze and Chris6na Lioma Exercise: Assump0ons underlying PageRank Assump0on 1: A link on the web is a quality signal the

More information

Rank and Select Operations on Binary Strings (1974; Elias)

Rank and Select Operations on Binary Strings (1974; Elias) Rank and Select Operations on Binary Strings (1974; Elias) Naila Rahman, University of Leicester, www.cs.le.ac.uk/ nyr1 Rajeev Raman, University of Leicester, www.cs.le.ac.uk/ rraman entry editor: Paolo

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees

QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto HPC Lab, ISTI-CNR, Pisa, Italy & Tiscali

More information

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Compression Motivation Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet) Storage: Store large & complex 3D models (e.g. 3D scanner

More information

Lecture 3. 1 Polynomial-time algorithms for the maximum flow problem

Lecture 3. 1 Polynomial-time algorithms for the maximum flow problem ORIE 633 Network Flows August 30, 2007 Lecturer: David P. Williamson Lecture 3 Scribe: Gema Plaza-Martínez 1 Polynomial-time algorithms for the maximum flow problem 1.1 Introduction Let s turn now to considering

More information

CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on. Instructor: Wei-Min Shen

CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on. Instructor: Wei-Min Shen CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on Instructor: Wei-Min Shen Today s Lecture Search Techniques (review & con/nue) Op/miza/on Techniques Home Work 1: descrip/on

More information

Lec 03 Entropy and Coding II Hoffman and Golomb Coding

Lec 03 Entropy and Coding II Hoffman and Golomb Coding CS/EE 5590 / ENG 40 Special Topics Multimedia Communication, Spring 207 Lec 03 Entropy and Coding II Hoffman and Golomb Coding Zhu Li Z. Li Multimedia Communciation, 207 Spring p. Outline Lecture 02 ReCap

More information

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1

An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if. 2 l i. i=1 Kraft s inequality An instantaneous code (prefix code, tree code) with the codeword lengths l 1,..., l N exists if and only if N 2 l i 1 Proof: Suppose that we have a tree code. Let l max = max{l 1,...,

More information

Methodological Foundations of Biomedical Informatics (BMSC-GA 4449) Optimization

Methodological Foundations of Biomedical Informatics (BMSC-GA 4449) Optimization Methodological Foundations of Biomedical Informatics (BMSCGA 4449) Optimization Op#miza#on A set of techniques for finding the values of variables at which the objec#ve func#on amains its cri#cal (minimal

More information

11 The Max-Product Algorithm

11 The Max-Product Algorithm Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 11 The Max-Product Algorithm In the previous lecture, we introduced

More information

Algorithms, Graph Theory, and the Solu7on of Laplacian Linear Equa7ons. Daniel A. Spielman Yale University

Algorithms, Graph Theory, and the Solu7on of Laplacian Linear Equa7ons. Daniel A. Spielman Yale University Algorithms, Graph Theory, and the Solu7on of Laplacian Linear Equa7ons Daniel A. Spielman Yale University Rutgers, Dec 6, 2011 Outline Linear Systems in Laplacian Matrices What? Why? Classic ways to solve

More information

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits

CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits CSE 200 Lecture Notes Turing machine vs. RAM machine vs. circuits Chris Calabro January 13, 2016 1 RAM model There are many possible, roughly equivalent RAM models. Below we will define one in the fashion

More information

Query CS347. Term-document incidence. Incidence vectors. Which plays of Shakespeare contain the words Brutus ANDCaesar but NOT Calpurnia?

Query CS347. Term-document incidence. Incidence vectors. Which plays of Shakespeare contain the words Brutus ANDCaesar but NOT Calpurnia? Query CS347 Which plays of Shakespeare contain the words Brutus ANDCaesar but NOT Calpurnia? Lecture 1 April 4, 2001 Prabhakar Raghavan Term-document incidence Incidence vectors Antony and Cleopatra Julius

More information

Database Theory VU , SS Ehrenfeucht-Fraïssé Games. Reinhard Pichler

Database Theory VU , SS Ehrenfeucht-Fraïssé Games. Reinhard Pichler Database Theory Database Theory VU 181.140, SS 2018 7. Ehrenfeucht-Fraïssé Games Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Pichler 15

More information

Optimisation and Operations Research

Optimisation and Operations Research Optimisation and Operations Research Lecture 12: Algorithm Analysis and Complexity Matthew Roughan http://www.maths.adelaide.edu.au/matthew.roughan/ Lecture_notes/OORII/

More information

Topic 4 Randomized algorithms

Topic 4 Randomized algorithms CSE 103: Probability and statistics Winter 010 Topic 4 Randomized algorithms 4.1 Finding percentiles 4.1.1 The mean as a summary statistic Suppose UCSD tracks this year s graduating class in computer science

More information

Image Data Compression. Dirty-paper codes Alexey Pak, Lehrstuhl für Interak<ve Echtzeitsysteme, Fakultät für Informa<k, KIT

Image Data Compression. Dirty-paper codes Alexey Pak, Lehrstuhl für Interak<ve Echtzeitsysteme, Fakultät für Informa<k, KIT Image Data Compression Dirty-paper codes 1 Reminder: watermarking with side informa8on Watermark embedder Noise n Input message m Auxiliary informa

More information

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts

Theoretical Computer Science. Dynamic rank/select structures with applications to run-length encoded texts Theoretical Computer Science 410 (2009) 4402 4413 Contents lists available at ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Dynamic rank/select structures with

More information

Dictionary: an abstract data type

Dictionary: an abstract data type 2-3 Trees 1 Dictionary: an abstract data type A container that maps keys to values Dictionary operations Insert Search Delete Several possible implementations Balanced search trees Hash tables 2 2-3 trees

More information

Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang

Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval. Lecture #3 Machine Learning. Edward Chang Founda'ons of Large- Scale Mul'media Informa'on Management and Retrieval Lecture #3 Machine Learning Edward Y. Chang Edward Chang Founda'ons of LSMM 1 Edward Chang Foundations of LSMM 2 Machine Learning

More information

BIOINF 4371 Drug Design 1 Oliver Kohlbacher & Jens Krüger

BIOINF 4371 Drug Design 1 Oliver Kohlbacher & Jens Krüger BIOINF 4371 Drug Design 1 Oliver Kohlbacher & Jens Krüger Winter 2013/2014 11. Docking Part IV: Receptor Flexibility Overview Receptor flexibility Types of flexibility Implica5ons for docking Examples

More information

Algorithms, Graph Theory, and Linear Equa9ons in Laplacians. Daniel A. Spielman Yale University

Algorithms, Graph Theory, and Linear Equa9ons in Laplacians. Daniel A. Spielman Yale University Algorithms, Graph Theory, and Linear Equa9ons in Laplacians Daniel A. Spielman Yale University Shang- Hua Teng Carter Fussell TPS David Harbater UPenn Pavel Cur9s Todd Knoblock CTY Serge Lang 1927-2005

More information

Electronic Structure Calcula/ons: Density Func/onal Theory. and. Predic/ng Cataly/c Ac/vity

Electronic Structure Calcula/ons: Density Func/onal Theory. and. Predic/ng Cataly/c Ac/vity Electronic Structure Calcula/ons: Density Func/onal Theory and Predic/ng Cataly/c Ac/vity Review Examples of experiments that do not agree with classical physics: Photoelectric effect Photons and the quan/za/on

More information

Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer

Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer Associative Memory Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer Storage Analysis Sparse Coding Implementation on a

More information