Complete Faceting. code4lib, Feb Toke Eskildsen Mikkel Kamstrup Erlandsen

Similar documents
Prioritized Garbage Collection Using the Garbage Collector to Support Caching

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

Problem Decomposition: One Professor s Approach to Coding

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Information Retrieval Using Boolean Model SEEM5680

CS425: Algorithms for Web Scale Data

Section 3.3: Discrete-Event Simulation Examples

Impulse. Observations

Using R for Iterative and Incremental Processing

Behavioral Simulations in MapReduce

Time. Today. l Physical clocks l Logical clocks

Lecture 14: Nov. 11 & 13

Course Announcements. Bacon is due next Monday. Next lab is about drawing UIs. Today s lecture will help thinking about your DB interface.

Dictionary: an abstract data type

Hashing Data Structures. Ananda Gunawardena

Large-Scale Behavioral Targeting

APPEND AND MERGE. Colorado average snowfall amounts - Aggregated by County and Month

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Lecture 13: Sequential Circuits, FSM

ECE 571 Advanced Microprocessor-Based Design Lecture 9

DM507 - Algoritmer og Datastrukturer Project 1

COMP 250 Fall Midterm examination

Data Structures in Java

Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics

CSCE 561 Information Retrieval System Models

Run-length & Entropy Coding. Redundancy Removal. Sampling. Quantization. Perform inverse operations at the receiver EEE

THE ZCACHE: DECOUPLING WAYS AND ASSOCIATIVITY. Daniel Sanchez and Christos Kozyrakis Stanford University

Deterministic Finite Automaton (DFA)

Arup Nanda Starwood Hotels

An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets

Lecture 13: Sequential Circuits, FSM

The Boolean Model ~1955

Chuck Cartledge, PhD. 21 January 2018

Farewell, PipelinePilot Migrating the Exquiron cheminformatics platform to KNIME and the ChemAxon technology

Data-Intensive Distributed Computing

ST-Links. SpatialKit. Version 3.0.x. For ArcMap. ArcMap Extension for Directly Connecting to Spatial Databases. ST-Links Corporation.

Asymptotic Running Time of Algorithms

IR: Information Retrieval

Graphical User Interfaces for Emittance and Correlation Plot. Henrik Loos

Phrase Finding; Stream-and-Sort vs Request-and-Answer. William W. Cohen

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Administering your Enterprise Geodatabase using Python. Jill Penney

COMPUTER SCIENCE TRIPOS

Parallel Transposition of Sparse Data Structures

High Performance Computing

EGR 111 Heat Transfer

Computational Complexity. This lecture. Notes. Lecture 02 - Basic Complexity Analysis. Tom Kelsey & Susmit Sarkar. Notes

FACULTY OF SCIENCE ACADEMY OF COMPUTER SCIENCE AND SOFTWARE ENGINEERING OBJECT ORIENTED PROGRAMMING DATE 07/2014 SESSION 8:00-10:00

Theoretical aspects of ERa, the fastest practical suffix tree construction algorithm

Become a Microprobe Power User Part 2: Qualitative & Quantitative Analysis

SOLUTION: SOLUTION: SOLUTION:

Running Time. Overview. Case Study: Sorting. Sorting problem: Analysis of algorithms: framework for comparing algorithms and predicting performance.

Foundations Of Mathematical Logic (Dover Books On Mathematics) By Haskell B. Curry

Measuring Goodness of an Algorithm. Asymptotic Analysis of Algorithms. Measuring Efficiency of an Algorithm. Algorithm and Data Structure

P Q1 Q2 Q3 Q4 Q5 Tot (60) (20) (20) (20) (60) (20) (200) You are allotted a maximum of 4 hours to complete this exam.

Acceptably inaccurate. Probabilistic data structures

Huffman Coding. C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University

CSE613: Parallel Programming, Spring 2012 Date: May 11. Final Exam. ( 11:15 AM 1:45 PM : 150 Minutes )

CS173 Running Time and Big-O. Tandy Warnow

Comp 11 Lectures. Mike Shah. June 20, Tufts University. Mike Shah (Tufts University) Comp 11 Lectures June 20, / 38

Design at the Register Transfer Level

Time. To do. q Physical clocks q Logical clocks

django in the real world

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

The Fragment Network: A Chemistry Recommendation Engine Built Using a Graph Database

Computation of Large Sparse Aggregated Areas for Analytic Database Queries

Factorized Relational Databases Olteanu and Závodný, University of Oxford

ECE 571 Advanced Microprocessor-Based Design Lecture 10

AstroPortal: A Science Gateway for Large-scale Astronomy Data Analysis

Scripting Languages Fast development, extensible programs

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

Material Covered on the Final

Data Structures. Outline. Introduction. Andres Mendez-Vazquez. December 3, Data Manipulation Examples

University of New Mexico Department of Computer Science. Midterm Examination. CS 361 Data Structures and Algorithms Spring, 2003

Finite-State Machines (Automata) lecture 12

13 Concurrent Programming using Tasks and Services

Lecture: Pipelining Basics

Multi-Approximate-Keyword Routing Query

CSE373: Data Structures and Algorithms Lecture 3: Math Review; Algorithm Analysis. Catie Baker Spring 2015

Maschinelle Sprachverarbeitung

Bloom Filters, Minhashes, and Other Random Stuff

An Efficient Evolutionary Algorithm for Solving Incrementally Structured Problems

Dictionary: an abstract data type

Harvard Center for Geographic Analysis Geospatial on the MOC

Latches. October 13, 2003 Latches 1

Spatial Analytics Workshop

Analytical Modeling of Parallel Systems

Frequency Estimators

arxiv: v2 [cs.ds] 9 Nov 2017

Searching, mainly via Hash tables

Window-aware Load Shedding for Aggregation Queries over Data Streams

Quiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions

Efficient algorithms for symmetric tensor contractions

COMP 355 Advanced Algorithms

NEC PerforCache. Influence on M-Series Disk Array Behavior and Performance. Version 1.0

Photoluminescence Spectrometer (FLS980)

CS6901: review of Theory of Computation and Algorithms

CS 347 Parallel and Distributed Data Processing

4th year Project demo presentation

Bloom Filters, general theory and variants

Transcription:

Complete Faceting code4lib, Feb. 2009 Toke Eskildsen te@statsbiblioteket.dk Mikkel Kamstrup Erlandsen mke@statsbiblioteket.dk

Battle Plan Terminology (geek level: 1) History (geek level: 3) Data Structures (geek level: 6) Indexing (geek level: 4) Searching or actually Counting (geek level: 5) Scaling or How we Cheat (geek level: 3) Free Bacon! (geek level: )

Wait, What is Summa? Keywords Search Engine Designed for Libraries Open Source (LGPL) Integrated Search Cold Facts 100% Java Lucene Index(es) Developed Since Winter 2005 In Production Since Nov. 2006 Lightning Talk Later

Terminology Documents contain Fields ti: Applied Quantum Mechanics gen_subj: physics subj: quantum mechanics ti: Smooth Manifolds in Physics gen_subj: mathematics gen_subj: physics subj: smooth manifolds

Terminology Documents contain Fields ti: Applied Quantum Mechanics gen_subj: physics subj: quantum mechanics ti: Smooth Manifolds in Physics gen_subj: mathematics gen_subj: physics subj: smooth manifolds Facets contain Tags The title facet Applied Quantum Mechanics Smooth Manifolds in Physics The subject facet physics quantum mechanics mathematics smooth manifolds

Terminology Documents contain Fields ti: Applied Quantum Mechanics gen_subj: physics subj: quantum mechanics ti: Smooth Manifolds in Physics gen_subj: mathematics gen_subj: physics subj: smooth manifolds Facets contain Tags The title facet Applied Quantum Mechanics Smooth Manifolds in Physics The subject facet physics quantum mechanics The spaghetti is called References mathematics smooth manifolds

Diving In Iterate Lucene hits, collect field content Use clean OO facet/tag structure Create cache map in memory Collect tag counts with nice HashMap Logical path onwards? Use field cache or similar BitSets

Stop! What Do We Want? Scale Up and Down Iterative Updates Decoupling from Text Search Engine

Facet Mapping Index DocID References offset 0 0 1 3 2 3 3 4 End (4) 5 References Offset FacetID TagID 0 1 1 1 7 3141593 2 8 87 3 2 12 4 1 1 Resolve the tag string for docs 0 and 3 Facet 1 (sorted list of Tags) TagID Offset in tag file 0 42 1 6 2 2718282 Tag file bacon\npuppies\n... 6

Persistence All references are arrays Just dump them directly to the file system Two strategies for updating tag files Append tags on the fly Store as a full dump at a point in time Two strategies for resolving tags on search Get them from the file system (SSDs rules) Load them fully into memory (ouch)

Persistence All references are arrays Just dump them directly to the file system Two strategies for updating tag files Append tags on the fly Store as a full dump at a point in time Two strategies for resolving tags on search Get them from the file system (SSDs rules) Load them fully into memory (ouch) Put those SSDs to work!

Facet structure building Summa Manipulators: Analyze payload Write to a private sub-index Attach additional info and pass the payload on The Facet Manipulator: Receives Document ID and Fields from the Document manipulator Performs an iterative update of the facet structure

Searching The Job of a Search Node: Search private sub-index, or other private source Add response to collector The Job of the Facet Node: Receive a BitSet from a previous Document Search Node Generate Facet response, add it to the collector

Tag Counting Index DocID References offset 0 0 1 3 2 3 3 4 End (4) 5 References Offset FacetID TagID 0 1 1 1 7 3141593 2 8 87 3 2 12 4 1 1 Increment doc count for tag 1 TagCounter 1 coupled to Facet 1 TagID Counter (int) 0 0 1 0 + 1 + 1 2 0

Distributed Search Summa is equipped to take full advantage of a distributed setup Each sub search node produces a part of the full answer A special search node aggregates results from a set of sub search nodes

Distribution is tricky Merge the top three tags from two nodes: Node 1 Node 2 Result Tag A 2 Tag E 2 Tag A 4 + = Tag B 1 Tag A 2 Tag D 2 Tag C 1 Tag D 2 Tag E 2

Distribution is tricky Merge the top three tags from two nodes: Node 1 Node 2 Result Tag A 2 Tag E 2 Tag A 4 + = Tag B 1 Tag A 2 Tag D 2 Tag C 1 Tag D 2 Tag E 2 FAIL

Distribution is tricky Merge the top three tags from two nodes: Node 1 Node 2 Result Tag A 2 Tag E 2 Tag A 4 + = Tag B 1 Tag A 2 Tag D 2 Tag C 1 Tag D 2 Tag E 2 Node 1 Node 2 Result Tag A 2 Tag E 2 Tag A 4 Tag B 1 Tag A 2 Tag B 3 + = Tag C 1 Tag D 2 Tag D 3 Tag D 1 Tag B 2 Tag E 1 Tag C 1 Tag F 1 Tag F 1

Real Life in Numbers 10M docs, 10M tags, 100M refs, 1 machine A few 1000 hits < 100 ms A few 100.000 hits < 200 ms 10M hits ~3 sec 100M docs, 1G tags, 1G refs, 3 machines A few 1000 hits ~1 sec (ouch) A few 100.000 hits ~1 sec 10M hits < 3 sec 100M hits ~15 sec

Bonus Level! Persistent sorted Tags Index lookup (alphabetic listings) Localized range queries Sort without warm-up and memory overhead

Level Completed! Dead Troll (CC) BY-NC-SA by Kim Smith (Squid@Flickr)

Questions? wiki.statsbiblioteket.dk/summa Summa, Integrated Search Document/Field, Facet/Tag FieldCache, BitSet Iterative updates Lucene decoupling Structure Persistence, Memory Overhead Indexing WeakHashMap, MergeSort Tag Counting Distributed searching Cheating Scalability numbers Collator Order