FROM QUERIES TO TOP-K RESULTS
Dr. Gjergji Kasneci
Introduction to Information Retrieval, WS 2012-13

Outline
- Intro
- Basics of probability and information theory
- Retrieval models
- Retrieval evaluation
- Link analysis
- From queries to top-k results
  - Query processing
  - Indexing
  - Top-k search
- Social search

Query processing overview
A query passes through the following steps (a document undergoes the same process for indexing):
- Accents, spacing, capitalization, acronym expansion
- Structure analysis, tokenization
- Stop words
- Noun groups, entities
- Stemming, spelling correction
- Controlled vocabulary

Query normalization: clean query
- Remove punctuation, commas, semicolons, unnecessary spaces
- Upper-case vs. lower-case spellings (language-dependent)
- Normalize and expand acronyms (e.g.: N.Y. → NY → New York)
- Normalize language-dependent characters (e.g.: ü → ue)
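A minimal Python sketch of such a cleaning step (the acronym table and the example query are illustrative assumptions, not part of any real system):

```python
import re

# Hypothetical acronym map; a real system would use a curated table.
ACRONYMS = {"n.y.": "new york"}

def clean_query(query: str) -> str:
    q = query.lower()                          # case folding (language-dependent)
    q = q.replace("ü", "ue")                   # normalize language-dependent characters
    for acro, expansion in ACRONYMS.items():
        q = q.replace(acro, expansion)         # expand acronyms before stripping dots
    q = re.sub(r"[^\w\s]", " ", q)             # remove punctuation, commas, semicolons
    q = re.sub(r"\s+", " ", q).strip()         # collapse unnecessary spaces
    return q

print(clean_query("Flights  to N.Y.;"))        # -> "flights to new york"
```

Note that the order matters: acronyms must be expanded before their punctuation is stripped.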

Query normalization: remove stop words
Stop words are typically maintained in so-called stop word lists, e.g.:
a able about across after all almost also am among an and any are as at be because but by can cannot do does either else ever every for from get has have he her hers him his how however i if in into is it its just least let like likely may me most my neither no nor not of off often on only or other our own rather she should since so some than that the their them then there these they this to too us we what when where which while who whom why will with would yet you your

Task: named-entity recognition in the query
- Identify named entities such as persons, locations, organizations, dates, etc. in the query
- Example: flights to John F. Kennedy airport

Solutions
- Look-up in a dictionary or knowledge base (mapping is still difficult)
- Shallow parsing (exploit the internal structure of names and the local context in which they appear)
- Shallow parsing + probabilistic graphical models for sequential tagging (e.g.: Hidden Markov Models (HMMs), Conditional Random Fields (CRFs))

Example: HMM with two states {entity, non-entity}; find $\max_X P(X, Y)$ over a chain of hidden states $X_1, \dots, X_6$ for the observations $Y_1 =$ noun, $Y_2 =$ prep, $Y_3 = Y_4 = Y_5 =$ name, $Y_6 =$ noun.
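A minimal Viterbi sketch for such a two-state HMM; all transition and emission probabilities below are made-up illustration values, not trained parameters:

```python
import math

states = ["entity", "non-entity"]
start = {"entity": 0.3, "non-entity": 0.7}
trans = {"entity":     {"entity": 0.7, "non-entity": 0.3},
         "non-entity": {"entity": 0.2, "non-entity": 0.8}}
emit  = {"entity":     {"noun": 0.3, "prep": 0.05, "name": 0.65},
         "non-entity": {"noun": 0.5, "prep": 0.35, "name": 0.15}}

def viterbi(obs):
    # delta[s] = log-probability of the best state sequence ending in state s
    delta = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, step, delta = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            step[s] = best
            delta[s] = prev[best] + math.log(trans[best][s] * emit[s][o])
        back.append(step)
    # backtrack from the best final state
    path = [max(states, key=delta.get)]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path))

print(viterbi(["noun", "prep", "name", "name", "name", "noun"]))
```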

Query normalization: stemming
- Morphological reduction of a term to its stem (i.e., the ground form)
- Lemmatization algorithms determine the part of speech (e.g., noun, verb, adjective) of a term and apply stemming rules to map terms to their grammatical lemma (≈ ground form)
- There are different stemming rules for adjectives, verbs, nouns, ...

Simple rules for stemming (see the sketch below)
- [.]{3,}ies → [.]{3,}y (countries → country, ladies → lady, ...)
- [.]{4,}ing → [.]{4,} (fishing → fish, sing → sing, applying → apply, ...)
- [.]{2,}sses → [.]{2,}ss, [.]{2,}ss → [.]{2,}ss (press → press, guesses → guess, less → less, chess → chess, ...)
- [.]{3,}[^s]s → [.]{3,}[^s] (books → book, symbols → symbol, ...)

Notes
- The order in which the stemming rules are applied is important
- Indexed documents undergo the same stemming process as the query
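A sketch of the rules above as ordered regular expressions (illustrative only; real stemmers need many more rules):

```python
import re

# Ordered rules: the first matching rule is applied.
RULES = [
    (re.compile(r"^(.{3,})ies$"), r"\1y"),    # countries -> country
    (re.compile(r"^(.{2,})sses$"), r"\1ss"),  # guesses -> guess
    (re.compile(r"^(.{4,})ing$"), r"\1"),     # fishing -> fish, but sing stays sing
    (re.compile(r"^(.{3,}[^s])s$"), r"\1"),   # books -> book, but press stays press
]

def stem(term: str) -> str:
    for pattern, repl in RULES:               # order of application matters
        if pattern.match(term):
            return pattern.sub(repl, term)
    return term

for t in ["countries", "fishing", "sing", "books", "press", "guesses"]:
    print(t, "->", stem(t))
```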

The Porter Stemmer
- Stemming algorithm proposed by Martin Porter in 1980
- Standard algorithm for English stemming
- Uses different steps for:
  - Mapping plural to singular form, e.g.: sses → ss, ies → i, ss → ss, s → ε
  - Mapping past and progressive tense to simple present tense, e.g.: eed → ee, ed → ε, ing → ε
  - Derivational morphology, e.g.: ational → ate, ization → ize, biliti → ble
  - Clean-up and handling of endings (e.g.: y → i)
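The full Porter algorithm is widely implemented; a short usage sketch with NLTK's PorterStemmer (assuming the nltk package is installed):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for term in ["caresses", "ponies", "relational", "digitization"]:
    print(term, "->", stemmer.stem(term))
# e.g. relational -> relat (the suffix rules are applied step by step)
```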

Query reformulation on Google
(screenshot)

Query normalization: spelling correction
- Check spelling, e.g., by using similarity measures and occurrence frequencies on entries from dictionaries, query logs, or a web corpus
- Propose corrections for misspelled words, e.g.:
  - recieve → receive
  - dissapoiting → disappointing
  - acomodation → accommodation
  - mulitplayer → multiplayer
  - Playstaton → Playstation
  - Schwarznegger → Schwarzenegger
- Google's instant autocorrection (screenshot)

Example: misspellings for Britney Spears on Google
Source: http://www.cse.unl.edu/~lksoh/classes/csce410_810_spring06/sup1.html
- Observation: the most-used spelling is typically the correct one ("Wisdom of the Crowds" effect)
- Can this observation be used for spellchecking and auto-correction?

Google's spelling correction approach (1)
Preprocessing
- Goal: generate triples of the form ⟨intended_str, observed_str, obs_freq⟩
- Build a term list of frequent terms occurring on the web; remove non-words (e.g., too much punctuation, too short or too long)
- For each term in the list, find all other terms in the list that are close to it and create pairs ⟨str1, str2⟩, where str1 and str2 are close enough to each other
- From all such pairs ⟨str, str′⟩, keep only those for which freq(str) ≥ 10 · freq(str′)
- Return the list of remaining triples ⟨str, str′, freq(str′)⟩
- Note: the computation can be done in parallel and is easy to distribute

Source: Whitelaw et al. Using the Web for Language Independent Spellchecking and Autocorrection. EMNLP 2009

Google's spelling correction approach (2)
Input
- Observed word $w$; candidate corrections $\{c \mid c \text{ is close to } w\}$
- Data given by the set of triples $\langle c, w, freq(w) \rangle$, where $w$ is the observed word

Output
- Candidate corrections, ranked decreasingly by $P(c \mid w) \propto P(w \mid c) \cdot P(c)$
  - $P(w \mid c)$: error model (word likelihood)
  - $P(c)$: N-gram language model derived from the web
- Estimation of the error model: over adjacent-substring partitions $R$ of $c$ and $T$ of $w$, estimate
  $P(w \mid c) \approx \max_{R,T:\,|R| = |T|} \prod_{i=1}^{|R|} P(T_i \mid R_i)$,
  e.g., with $|R_i|, |T_i| \leq 3$
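A toy sketch of the noisy-channel ranking $P(c \mid w) \propto P(w \mid c) \cdot P(c)$; the prior and error probabilities below are illustrative assumptions, not web-derived estimates:

```python
import math

prior = {"receive": 1e-4, "recipe": 3e-4}      # language model P(c), assumed values
error = {("recieve", "receive"): 0.01,         # error model P(w|c), assumed values
         ("recieve", "recipe"): 0.0001}

def rank_corrections(w, candidates):
    # score each candidate by log P(w|c) + log P(c), highest first
    scored = [(c, math.log(error[(w, c)]) + math.log(prior[c]))
              for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_corrections("recieve", ["receive", "recipe"]))
```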

Google's spelling correction approach (3)
- The context of a word can also be taken into account: generate triples of the form $\langle w_l\, c\, w_r,\ w_l\, w\, w_r,\ freq(w) \rangle$
- The language model can be down-weighted (relative to the error model) in case errors are common, e.g.: $P(c \mid w) \propto P(w \mid c) \cdot P(c)^\lambda$ with $\lambda \geq 0$
- Reported error rates < 4% when the error model is trained on a corpus of ~10 million terms

How to find close strings?
- Use similarity measures on strings (e.g., based on edit distance, Jaccard similarity on substring partitions, ...)

String distance measures: edit distance (1)
Minimal number of editing operations needed to turn a string $s_1$ into another string $s_2$

Levenshtein distance (edit distance)
- Uses replacement, deletion, and insertion of a character as editing operations
- Input: $s_1[1..i]$ and $s_2[1..j]$
- Initial conditions: $edit(0, 0) = 0$, $edit(i, 0) = i$, $edit(0, j) = j$
- Recurrence:
  $edit(i, j) = \min\{\, edit(i-1, j) + 1$ (delete), $\; edit(i, j-1) + 1$ (insert), $\; edit(i-1, j-1) + diff(i, j)$ (replace) $\,\}$
  with $diff(i, j) = 1$ if $s_1[i] \neq s_2[j]$, and $0$ otherwise
- Efficient computation by dynamic programming (see the sketch below)
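A straightforward dynamic-programming implementation of this recurrence:

```python
def edit_distance(s1: str, s2: str) -> int:
    # Fill a (len(s1)+1) x (len(s2)+1) table of prefix distances.
    m, n = len(s1), len(s2)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i                          # i deletions
    for j in range(n + 1):
        edit[0][j] = j                          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if s1[i - 1] == s2[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,         # delete
                             edit[i][j - 1] + 1,         # insert
                             edit[i - 1][j - 1] + diff)  # replace
    return edit[m][n]

print(edit_distance("sitting", "knitting"))     # -> 2
```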

String distance measures: edit distance (2)
Levenshtein distance between SITTING and KNITTING (bottom-right cell = distance 2):

        K  N  I  T  T  I  N  G
    S   1  2  3  4  5  6  7  8
    I   2  2  2  3  4  5  6  7
    T   3  3  3  2  3  4  5  6
    T   4  4  4  3  2  3  4  5
    I   5  5  4  4  3  2  3  4
    N   6  5  5  5  4  3  2  3
    G   7  6  6  6  5  4  3  2

Approximate string containment with edit distance
- Levenshtein distance can also test approximate string containment
- Slightly different starting conditions: $edit(0, j) = 0$ for all $j$, so a match may start anywhere in the text; the result is the minimum over the last row, so a match may also end anywhere
- Example: colour is contained in kolorama with 2 errors

Source: Modern Information Retrieval, Baeza-Yates, Ribeiro-Neto
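A minimal sketch of the containment variant, assuming the standard modification (zero-initialized first row, minimum over the last row):

```python
def containment_distance(pattern: str, text: str) -> int:
    # Same recurrence as edit_distance, but edit[0][j] = 0 (a match may
    # start anywhere in the text) and the result is min over the last row.
    m, n = len(pattern), len(text)
    edit = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        edit[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if pattern[i - 1] == text[j - 1] else 1
            edit[i][j] = min(edit[i - 1][j] + 1,
                             edit[i][j - 1] + 1,
                             edit[i - 1][j - 1] + diff)
    return min(edit[m])

print(containment_distance("colour", "kolorama"))   # -> 2
```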

String distance measures: edit distance (3)
Damerau-Levenshtein distance
- Uses replacement, deletion, insertion, and transposition of characters as editing operations
- Input: $s_1[1..i]$ and $s_2[1..j]$; initial conditions as for the Levenshtein distance
- Recurrence:
  $edit(i, j) = \min\{\, edit(i-1, j) + 1, \; edit(i, j-1) + 1, \; edit(i-1, j-1) + diff(i, j), \; edit(i-2, j-2) + 1$ (transpose, if $s_1[i] = s_2[j-1]$ and $s_1[i-1] = s_2[j]$) $\,\}$
  with $diff(i, j) = 1$ if $s_1[i] \neq s_2[j]$, and $0$ otherwise

Other useful string distance measures
Hamming distance
- $d_H(s_1, s_2) = \#\{i \mid s_1[i] \neq s_2[i]\}$, for $|s_1| = |s_2|$

Jaccard distance
- $G_N(s) :=$ the set of substrings of $s$ of length $N$, i.e., its N-grams
- Example: $G_3(\text{"schwarzenegger"}) = \{sch, chw, hwa, war, arz, rze, zen, ene, \dots\}$
- $d_J(s_1, s_2) = 1 - \frac{|G_N(s_1) \cap G_N(s_2)|}{|G_N(s_1) \cup G_N(s_2)|}$

Simple N-gram-based distance
- $d(s_1, s_2) = |G_N(s_1)| + |G_N(s_2)| - 2 \cdot |G_N(s_1) \cap G_N(s_2)|$

Theorem 1: for a string $s_1$ and a target string $s_2$:
$|G_N(s_1) \cap G_N(s_2)| < |s_1| - (N - 1) - d \cdot N \;\Rightarrow\; d_{edit}(s_1, s_2) > d$
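A short sketch of the N-gram-based measures:

```python
def ngrams(s: str, n: int = 3) -> set:
    # all substrings of length n
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_distance(s1: str, s2: str, n: int = 3) -> float:
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return 1 - len(g1 & g2) / len(g1 | g2)

def ngram_distance(s1: str, s2: str, n: int = 3) -> int:
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1) + len(g2) - 2 * len(g1 & g2)

print(sorted(ngrams("schwarzenegger"))[:4])
print(jaccard_distance("schwarznegger", "schwarzenegger"))
print(ngram_distance("schwarznegger", "schwarzenegger"))
```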

Phonetic similarities: the Soundex algorithm
- Idea: map words onto 4-letter codes, such that words with similar pronunciation get the same code
- The first letter of the word becomes the first code letter
- Then map:
  b, p, f, v → 1
  c, s, g, j, k, q, x, z → 2
  d, t → 3
  l → 4
  m, n → 5
  r → 6
- For letters with the same Soundex number that are immediately next to each other, only one is mapped
- a, e, i, o, u, y, h, w are ignored (except for the first character)
- If the code length is > 4, keep only the first four characters of the code; shorter codes are padded with zeros
- Examples: Penny → P500, Ponny → P500, Powers → P620, Perez → P620
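A sketch of the simplified Soundex variant described above:

```python
CODES = {c: d for letters, d in
         [("bpfv", "1"), ("csgjkqxz", "2"), ("dt", "3"),
          ("l", "4"), ("mn", "5"), ("r", "6")]
         for c in letters}

def soundex(word: str) -> str:
    word = word.lower()
    code = word[0].upper()                     # first letter kept verbatim
    prev = CODES.get(word[0], "")
    for c in word[1:]:
        digit = CODES.get(c, "")               # vowels, y, h, w map to ""
        if digit and digit != prev:            # collapse adjacent equal codes
            code += digit
        prev = digit
    return (code + "000")[:4]                  # pad/truncate to length 4

for name in ["Penny", "Ponny", "Powers", "Perez"]:
    print(name, "->", soundex(name))           # P500 P500 P620 P620
```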

Query reformulation (1)
- A specific search need can be expressed in different ways; some formulations lead to better results than others
- We have already discussed some strategies for (implicit) query reformulation: integration of relevance feedback (e.g., the Rocchio algorithm), implicit feedback (using clicks and similar queries from the query log), and pseudo-relevance feedback (assuming the top-k results are relevant)
- Example: estimate the probability of a word $w$ given a query word $w_q$ from the query log:
  $P(w \mid w_q) \approx \sum_{d \in RelDocs(w_q)} P(w \mid d) \cdot P(d \mid RelDocs(w_q))$
  where $RelDocs(q)$ and $RelDocs(w)$ are the documents that were (implicitly) rated as relevant for the query $q$ and the keyword $w$, respectively

Query reformulation (2)
Linguistic techniques that use
- stemming/lemmatization and spelling correction (through edit distance)
- thesauri or dictionaries for term expansion or replacement by synonyms, hypernyms, hyponyms, meronyms, holonyms, ...

Naive reformulation (see the sketch below)
- Substitute (or expand) keyword $w$ with $S(w) = \{w_1, \dots, w_k\}$ such that for each $w_i \in S(w)$: $sim(w, w_i) \geq \tau$ (with threshold $\tau$)
- $Score(q, d) = \sum_{w \in q} \sum_{w_i \in S(w)} sim(w, w_i) \cdot Sc(w_i, d)$
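A toy sketch of this expansion-based scoring; the sim and Sc values below are illustrative assumptions:

```python
# sim(w, w_i): word-word similarities; Sc(w_i, d): term-document scores.
sim = {("car", "car"): 1.0, ("car", "auto"): 0.8, ("car", "vehicle"): 0.6}
Sc = {("car", "d1"): 0.2, ("auto", "d1"): 0.5, ("vehicle", "d1"): 0.1}

def expand(w, tau=0.5):
    # S(w): all expansion terms with similarity at least tau
    return [w2 for (w1, w2), s in sim.items() if w1 == w and s >= tau]

def score(query, doc, tau=0.5):
    return sum(sim[(w, wi)] * Sc.get((wi, doc), 0.0)
               for w in query for wi in expand(w, tau))

print(expand("car"))            # -> ['car', 'auto', 'vehicle']
print(score(["car"], "d1"))     # 1.0*0.2 + 0.8*0.5 + 0.6*0.1 = 0.66
```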

Query reformulation: example (1)
(screenshot)

Query reformulation: example (2)
Instant query reformulation (thesaurus-based?) (screenshot)

Careful thesaurus-based query reformulation
- Find important phrases in the query (or from the best initial query results)
- Try to map the found phrases onto synonyms, hyponyms, and hypernyms from some thesaurus
- If a phrase is mapped to one concept, expand it with synonyms and hyponyms
- Compute the score as
  $Score(q, d, Th) = \sum_{w \in q} \max_{c \in Th} sim(w, c) \cdot Sc(c, d)$
  where $Th$ is the thesaurus; taking the maximum avoids unfair expansion of terms that match many concepts

Thesaurus-based query reformulation
(WordNet screenshot)
Source: http://wordnetweb.princeton.edu/perl/webwn

Possible similarity measures
Dice
- $sim(w_i, w_j) = \frac{2 \cdot |docs(w_i) \cap docs(w_j)|}{|docs(w_i)| + |docs(w_j)|}$

Jaccard
- $sim(w_i, w_j) = \frac{|docs(w_i) \cap docs(w_j)|}{|docs(w_i) \cup docs(w_j)|}$

Point-wise mutual information (PMI)
- $sim(w_i, w_j) = \log \frac{freq(w_i \text{ and } w_j)}{freq(w_i) \cdot freq(w_j)}$

Similarity on ontology graphs
- $sim(w, w') = \max_{p:\, p \text{ is a path from } w \text{ to } w'} \prod_{(c_i, c_j) \in p} sim(c_i, c_j)$
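A short sketch of Dice, Jaccard, and PMI on toy data (the document sets and frequencies are made up for illustration):

```python
import math

# Document sets containing two words w_i and w_j
docs_wi = {1, 2, 3, 4}
docs_wj = {3, 4, 5}

inter = len(docs_wi & docs_wj)
dice = 2 * inter / (len(docs_wi) + len(docs_wj))
jaccard = inter / len(docs_wi | docs_wj)
print(dice, jaccard)                    # 0.571..., 0.4

# PMI from corpus frequencies (relative frequencies as probability estimates)
N = 1000                                # assumed corpus size
freq_wi, freq_wj, freq_both = 50, 40, 10
pmi = math.log((freq_both / N) / ((freq_wi / N) * (freq_wj / N)))
print(pmi)                              # log(0.01 / 0.002) = log 5
```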

Summary
Query analysis
- Parsing, tokenization
- Query cleaning (remove punctuation, commas, stop words)
- Named-entity/noun-group recognition (dictionaries, shallow parsing, HMMs)
- Stemming
- Spelling correction
- Similarity/distance measures (edit distance, N-gram-based, Dice, Jaccard, ...)
- Query reformulation (linguistic, thesaurus-based)
- Thesaurus-based similarity