Text Analytics: Models for Information Retrieval. Ulf Leser
1 Text Analytics: Models for Information Retrieval. Ulf Leser
2 Processing Pipeline
- IR side: User -> Interface -> Parsing -> Normalization -> Term Weighting -> IR Database; results flow back as Document IDs via Filtering, Ranking, Feedback [Frakes et al.]
- The corpus enters the IR database through the same Parsing and Normalization steps
- Text mining side: Document Retrieval -> Text Preprocessing -> Linguistic Annotation -> Named Entity Recognition -> Named Entity Normalization -> Relationship Extraction
Ulf Leser: Text Analytics, Vorlesung, Sommersemester
3 Format Conversion
- Transform text into ASCII / Unicode
- Formats: PDF, XML, DOC, ODP, images, ...
- Problems: formatting instructions, special characters, formulas, figures and captions, tables, section headers, footnotes, page numbers, margin text, ...
- Diplomacy: to what extent can one reconstruct the original document from the normalized representation?
4 Assumptions
- A document is a sequence of sentences
- A sentence is a sequence of tokens
- A token is the smallest unit of text, e.g. words, numbers
- A term is a token or a set of tokens with a semantic meaning, e.g. single tokens or proper names: "San" is a token, but not a term; "San Francisco" is two tokens but one term; dictionaries usually contain terms, not tokens
- Whether a word is a token or a term depends on the context
- A concept is the mental representation of a real-world thing; concepts are represented by terms/words
- Terms may represent multiple concepts: homonyms
- Concepts may be represented by multiple terms: synonyms
5 Case: A Difficult Case
- Should text be converted to lower-case letters?
- Advantages: makes queries simpler; decreases index size a lot
- Disadvantages: no abbreviations; loses important hints for sentence splitting (sentences start with upper-case letters); loses important hints for tokenization, NER, ...
- Often: convert to lower case after all other steps
6 Sentence Splitting
- Most linguistic analysis works on the sentence level
- Sentences group together objects and statements
- But mind references: "What is this white thing? It is a house."
- Naive approach: reg-exp search for "[.?!;] " (note the blank)
- Problems:
  - Abbreviations: "C. Elegans is a worm which ..."; this does not hold for the U.S.
  - Errors due to previous normalization steps: "is not clear.conclusions.we reported on"
  - Proper names: "DOT.NET is a technique for ..."
  - Direct speech: "By saying 'It is the economy, stupid!', B. Clinton meant that ..."
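The naive approach above can be sketched in a few lines; the regular expression and the failing example are taken from the slide, the function name is made up for illustration.

```python
import re

def naive_split(text):
    """Naively split after ., ?, ! followed by whitespace (the slide's reg-exp idea)."""
    return [s.strip() for s in re.split(r'(?<=[.?!])\s+', text) if s.strip()]

# Works for simple prose:
print(naive_split("It rained. What is this white thing? It is a house."))
# Fails on abbreviations, exactly as the slide warns ("C." becomes a sentence):
print(naive_split("C. Elegans is a worm. It lives in soil."))
```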
7 Tokenization
- Split a sentence into tokens, but in a way that allows recognition of multi-token terms later
- Simple approach: search for blanks
- Hard cases:
  - "In San Francisco it rained a lot this year."
  - "A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."
  - "SQL commands comprise SELECT FROM WHERE clauses; the latter may contain functions such as leftstr(string, INT)."
  - "This LCD-TV-Screen cost 3.100,99 USD."
  - "[Bis[1,2-cyclohexanedionedioximato(1-)-O]-[1,2-cyclohexanedione dioximato(2-)-o]methyl-borato(2-)-N,N0,N00,N000,N0000,N00000)-chlorotechnetium) belongs to a family of ..."
- Typical approach: treat hyphens/parentheses as blanks; remove the "." of abbreviations (after sentence splitting); hope that errors are rare and don't matter much
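The "typical approach" above, sketched as a minimal tokenizer (hypothetical function name; real tokenizers handle far more cases):

```python
import re

def tokenize(sentence):
    """Whitespace split, after treating hyphens/parentheses as blanks and
    stripping trailing sentence punctuation, as in the typical approach above."""
    sentence = re.sub(r'[-()]', ' ', sentence)          # hyphens/parentheses -> blanks
    tokens = [t.strip('.,;') for t in sentence.split()] # strip surrounding punctuation
    return [t for t in tokens if t]

print(tokenize("A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."))
```

Note how "state-of-the-art" falls apart into four tokens while the date survives; the errors-are-rare hope is doing real work here.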
8 Stems versus Lemmas
- Common idea: normalize words to a basal, normal form: car, cars -> car; gives, gave, give -> give
- Stemming reduces words to a base form
  - The base form often is not a word of the language
  - Ignores detached prepositions ("Ich kaufe ein" -> "einkaufen" or "kauf")
  - Rather simple
- Lemmatization reduces words to their lemma
  - Finds the linguistic root of a word, which must be a proper word of the language
  - Rather difficult, requires morphological analysis
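A deliberately tiny suffix-stripping stemmer illustrates both points above: stemming is simple, and the result ("giv") need not be a word of the language. This is a toy sketch, not Porter's algorithm.

```python
def toy_stem(word):
    """Strip one common English suffix; the stem need not be a real word."""
    for suffix in ("ing", "es", "s", "e"):
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["cars", "gives", "giving", "car"]])
# 'gives' and 'giving' both map to the non-word stem 'giv'
```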
9 Stop Words
- Words so frequent that their removal (hopefully) does not change the meaning of a doc for retrieval/mining purposes
- English: top-2: 10% of all tokens; top-6: 20%; top-50: 50%
- English (top-10, LOB corpus): the, of, and, to, a, in, that, is, was, it
- German: aber, als, am, an, auch, auf, aus, bei, bin, ...
- Removing stop words reduces a positional token index by ~40%
- Search engines do remove stop words (but which?)
- Hope: increase in precision due to fewer spurious hits; clearly, this also depends on the usage of stop words in the query
- Variations: remove the top 10, 100, 1000, ... words; use a language-specific or corpus-specific stop word list
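Stop word removal itself is a one-liner; the stop word list here is the slide's English top-10 (LOB corpus):

```python
# Top-10 English stop words (LOB corpus), as listed above
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "it"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the house in italy is big".split()))
```

The interesting design decisions are all in the list, not the code: which words, how many, and whether the list is language- or corpus-specific.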
10 Zipf's Law
- Let f be the frequency of a word and r its rank in the list of all words sorted by frequency
- Zipf's law: f ~ k/r for some constant k
- Example: word ranks in Moby Dick show a good fit to Zipf's law, with some domain dependency ("whale")
- Zipf's law is a fairly good approximation in most corpora
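The law f(r) = k/r is easy to check empirically: count word frequencies, rank them, and take k as the frequency of the top word. The toy corpus below is constructed to follow Zipf exactly; real corpora only fit approximately.

```python
from collections import Counter

def zipf_fit(tokens):
    """Compare observed frequencies with Zipf's prediction f(r) = k / r,
    where k is estimated as the frequency of the rank-1 word."""
    freqs = [f for _, f in Counter(tokens).most_common()]
    k = freqs[0]
    return [(r, f, k / r) for r, f in enumerate(freqs, start=1)]

# Toy data with frequencies 12, 6, 4, 3 -- i.e. 12/r for ranks 1..4
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["whale"] * 3
for rank, observed, predicted in zipf_fit(tokens):
    print(rank, observed, predicted)
```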
11 Proper Names: Named Entity Recognition
- Proper names are different from ordinary words: they very often comprise more than one token; are often not contained in lexica; appear and disappear all the time; often contain special characters
- They are very important for information retrieval and text mining: search for "Bill Clinton" (not "Clinton ordered the bill"); extract all relationships between a disease and the DMD gene
- Recognizing proper names is difficult (more later)
- Named entity recognition: "dark" is a proper name; "broken wing" is no proper name, "broken-wing gene" possibly is one
- Disambiguation: "dark" is a gene of the fruitfly (and a common English word)
12 Document Summarization
- Preprocessing may go far beyond what we discussed
- Statistical summarization: remove all tokens that are not representative for the text; only keep words which are more frequent than expected by chance; "chance" means over all documents in the collection (corpus view) or in the language (naive view); makes the amount of text much smaller and hopefully doesn't hurt IR performance
- Semantic summarization: understand the text and create a summary of its content
- Annotation / classification: have a human understand the text and categorize it / describe it with words from a controlled dictionary; that's what libraries are crazy about and Yahoo is famous for
13 Content of this Lecture
- IR Models: Boolean Model, Vector Space Model, Relevance Feedback in the VSM, Probabilistic Model, Latent Semantic Indexing, Other IR Models
14 Information Retrieval Core
- The core question in IR: which of a given set of (normalized) documents is relevant for a given query?
- Ranking: how relevant is each document of a given set for a given query?
- Pipeline: the query and the document base are both normalized, then matched / relevance-scored to produce the result
15 IR Models
- Modeling: how is relevance judged? What model of relevancy is assumed?
- Classic models: Boolean, Vector-Space, Probabilistic
- Set-theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext
- User tasks: ad-hoc retrieval, filtering, browsing [BYRN99]
16 Notation
- All of the models we discuss use the bag-of-words view
- Definition: let K be the set of all terms, k ∈ K a term; let D be the set of all documents, d ∈ D a document
- Let v_d be a vector for d with v_d[i] = 0 if the i-th term k_i is not contained in d, and v_d[i] = 1 if it is
- Let v_q be a vector for q in the same sense
- Often, we shall use weights instead of 0/1 memberships: let w_ij ≥ 0 be the weight of term k_i in document d_j (to what extent does k_i contribute to the content of d_j?); w_ij = 0 if k_i ∉ d_j; v_d and v_q are defined accordingly
17 Content of this Lecture
- IR Models: Boolean Model, Vector Space Model, Relevance Feedback in the VSM, Probabilistic Model, Latent Semantic Indexing, Other IR Models
18 Boolean Model
- Simple model based on set theory; queries are specified as Boolean expressions: terms are atoms, connected by AND, OR, NOT (XOR, ...)
- There are no real weights: w_ij ∈ {0,1}; all terms contribute equally to the content of a doc
- Relevance of a document is either 0 or 1; computed by Boolean operations on the doc and query vectors
- Represent q as a set S of vectors over all terms; example: q = k1 AND (k2 OR NOT k3) => S = {(1,1,1), (1,1,0), (1,0,0)}
- S describes all possible combinations of keywords that make a doc relevant; document d is relevant for q iff v_d ∈ S
19 Properties
- Simple, clear semantics; widely used in (early) systems
- Disadvantages:
  - No partial matching (the expression must evaluate to true): for the query k1 AND k2 AND ... AND k9, a doc with k1 ... k8 is as much refused as one with none of the terms
  - No ranking (extensions exist, but are not satisfactory)
  - Query terms cannot be weighted
  - Average users don't like Boolean expressions; search engines: +bill +clinton lewinsky tax
- Results are often unsatisfactory: too many documents (too few restrictions) without ranking, or too few documents (too many restrictions)
20 A Note on Implementation
- Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementation: the set S of vectors can be extremely large (suppose q = k1 and |K| = 1000 => S has 2^999 elements)
- Improvement: evaluate only those terms that are used in the query
- Usual way of implementation: indexing; assume an index with a fast operation find: K -> P(D)
- Search each term k_i of the query, resulting in a set D_i ⊆ D of docs
- Evaluate the query in the usual order using set operations on the D_i: k_i AND k_j: D_i ∩ D_j; k_i OR k_j: D_i ∪ D_j; NOT k_i: D \ D_i
- Use cost-based optimization: evaluate those subexpressions first which hopefully produce small results
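The set-operation evaluation above maps directly onto Python sets. The toy inverted index is made up (it loosely mirrors the German example documents used later); the operator translation is the one from the slide.

```python
# Inverted index: term -> set of document ids (hypothetical toy corpus)
index = {
    "haus":    {1, 2, 3, 5},
    "italien": {1, 3, 4, 5},
    "gart":    {2, 4, 5},
}
all_docs = {1, 2, 3, 4, 5}

# Evaluate q = haus AND (gart OR NOT italien) with set operations:
# AND -> intersection, OR -> union, NOT -> complement against D
result = index["haus"] & (index["gart"] | (all_docs - index["italien"]))
print(sorted(result))
```

Note that `all_docs - index["italien"]` materializes the expensive complement the next slide warns about; real systems avoid exactly this step.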
21 Negation in the Boolean Model
- Evaluating NOT k_i as D \ D_i is expensive: the result often is very, very large (|D \ D_i| ≈ |D|), because almost all terms appear in almost no documents (recall Zipf's law: the tail of the distribution), assuming k_i is not a stop word (which we removed anyway)
- Solution 1: disallow negation; this is what web search engines do (exceptions?)
- Solution 2: allow negation only in the form k_i AND NOT k_j; evaluation: compute D_i and remove the docs which contain k_j
22 Content of this Lecture
- IR Models: Boolean Model, Vector Space Model, Relevance Feedback in the VSM, Probabilistic Model, Latent Semantic Indexing, Other IR Models
23 Vector Space Model
- Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11)
- Historically a breakthrough in IR; still the most popular model today
- General idea: fix a vocabulary K (typically the set of all different terms in D); view each doc and each query as a point in a |K|-dimensional space; rank docs according to their distance from the query
- Main advantages: ability to rank docs (according to relevance); natural capability for partial matching
24 Vector Space
- Each term in K is one dimension
- Different suggestions exist for determining the coordinates (= weights) and for determining the distance between two points
- The closest docs are the most relevant ones
- Rationale: vectors correspond to themes which are loosely related to sets of terms; distance between vectors ~ distance between themes
- (Figure: example space with dimensions such as "star" and "diet", in which astronomy, movie-star, and mammal documents cluster by theme)
25 Preliminaries 1
- Definition: let D be a document collection, K the set of all terms in D, d ∈ D and k ∈ K
  - The term frequency tf_dk is the frequency of k in d
  - The document frequency df_k is the number of docs in D containing k (or sometimes the number of occurrences of k in D)
  - The inverse document frequency idf_k = |D| / df_k
- Actually, it should rather be called the inverse relative document frequency; in practice, one usually uses idf_k = log(|D| / df_k)
- Remarks: idf is a popular weighting term, hence the proper definition; clearly, tf_dk is a natural measure for the weight w_dk, but not the only or best one
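The practical (logarithmic) idf variant from the definition above, sketched directly; the df values in the example are hypothetical.

```python
import math

def idf(df_k, n_docs):
    """Inverse document frequency as defined above: log(|D| / df_k)."""
    return math.log(n_docs / df_k)

# A term occurring in 1 of 5 docs scores much higher than one in 4 of 5;
# a term occurring in every doc gets idf 0 and contributes nothing
print(idf(1, 5))
print(idf(4, 5))
print(idf(5, 5))
```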
26 Preliminaries 2
- Recall the scalar product between vectors v and w:
  v ∘ w = |v| * |w| * cos(v, w), with |v| = sqrt(Σ_i v_i²) and v ∘ w = Σ_{i=1..n} v_i * w_i
27 The Simplest Approach
- Coordinates are set to term frequencies
- Distance is measured as the cosine of the angle between doc d and query q (wait for Euclidean distance)
  sim(d, q) = cos(v_d, v_q) = (v_d ∘ v_q) / (|v_d| * |v_q|) = Σ (v_q[i] * v_d[i]) / (|v_q| * sqrt(Σ v_d[i]²))
- |v_q| is the same for all docs and can be dropped for ranking
- Properties: including term weights increases retrieval performance; computes a (seemingly reasonable) ranking; also returns partial matches
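The cosine formula above in executable form (a straightforward sketch; no sparse-vector tricks):

```python
import math

def cosine(v, w):
    """cos(v, w) = (v . w) / (|v| * |w|), as in the formula above."""
    dot = sum(a * b for a, b in zip(v, w))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w))
    return dot / norm

# Same direction -> similarity 1; no shared terms -> similarity 0
print(cosine([1, 1, 0], [2, 2, 0]))
print(cosine([1, 0], [0, 1]))
```

Because the cosine only depends on the angle, a document repeated twice ([2, 2, 0] vs [1, 1, 0]) scores exactly like the original, which is the built-in length normalization.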
28 Example Data
- Assume stop word removal (wir, in, ...) and stemming (verkaufen -> verkauf, wollen -> woll, ...)
- Terms after preprocessing: verkauf, haus, italien, gart, miet, blüh, woll
- d1: "Wir verkaufen Häuser in Italien"
- d2: "Häuser mit Gärten zu vermieten"
- d3: "Häuser: In Italien, um Italien, um Italien herum"
- d4: "Die italienschen Gärtner sind im Garten"
- d5: "Der Garten in unserem italienschen Haus blüht"
- Q: "Wir wollen ein Haus mit Garten in Italien mieten"
29 Ranking without Length Normalization
- sim(d, q) = v_d ∘ v_q = Σ_{i=1..n} v_d[i] * v_q[i]
- sim(d1,q) = 1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1 = 2
- sim(d2,q) = 1 + 1 + 1 = 3
- sim(d3,q) = 1 + 3 = 4
- sim(d4,q) = 1 + 2 = 3
- sim(d5,q) = 1 + 1 + 1 = 3
- Resulting ranking for Q ("Wir wollen ein Haus mit Garten in Italien mieten"):
  1. d3: Häuser: In Italien, um Italien, um Italien herum
  2. d5: Der Garten in unserem italienschen Haus blüht (tied)
  2. d4: Die italienschen Gärtner sind im Garten (tied)
  2. d2: Häuser mit Gärten zu vermieten (tied)
  5. d1: Wir verkaufen Häuser in Italien
- No good?
30 Ranking with Length Normalization
- sim(d, q) = Σ (v_q[i] * v_d[i]) / sqrt(Σ v_d[i]²)
- sim(d1,q) = (1*0 + 1*1 + 1*1 + 0*1 + 0*1 + 0*0 + 0*1) / √3 ~ 1.15
- sim(d2,q) = (1 + 1 + 1) / √3 ~ 1.73
- sim(d3,q) = (1 + 3) / √10 ~ 1.26
- sim(d4,q) = (1 + 2) / √5 ~ 1.34
- sim(d5,q) = (1 + 1 + 1) / √4 = 1.5
- Resulting ranking for Q ("Wir wollen ein Haus mit Garten in Italien mieten"):
  1. d2: Häuser mit Gärten zu vermieten
  2. d5: Der Garten in unserem italienschen Haus blüht
  3. d4: Die italienschen Gärtner sind im Garten
  4. d3: Häuser: In Italien, um Italien, um Italien herum
  5. d1: Wir verkaufen Häuser in Italien
- No good?
31 Improved Scoring: TF*IDF
- One obvious problem: the longer the document, the higher the chances of being ranked high
- Solution: normalize by document length (this also yields 0 ≤ w_ij ≤ 1):
  w_ij = tf_ij / Σ_{k=1..|d_j|} tf_kj
- Second obvious problem: some terms are just everywhere in D and should be scored less
- Solution: use IDF scores:
  w_ij = (tf_ij / Σ_k tf_kj) * idf_i
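Both fixes combine into one weighting function. The sketch below uses the logarithmic idf variant "used in practice" from the Preliminaries slide; the df values are hypothetical, chosen so the rare term "miet" outweighs the ubiquitous "haus".

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, df, n_docs):
    """w_ij = (tf_ij / |d_j|) * log(|D| / df_i): term frequency normalized
    by document length, times the (log) inverse document frequency."""
    tf = Counter(doc_tokens)
    length = len(doc_tokens)
    return {t: (c / length) * math.log(n_docs / df[t]) for t, c in tf.items()}

# Hypothetical document frequencies over a 5-document collection
df = {"haus": 4, "italien": 4, "miet": 1}
w = tfidf_weights(["haus", "italien", "miet"], df, 5)
print(w)
```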
32 TF*IDF Short Form
- Give terms in a doc d high weights which are frequent in d and infrequent in D
- IDF deals with the consequences of Zipf's law: the few very frequent (and unspecific) terms get low scores; the many infrequent (and specific) terms get higher scores
33 Example TF*IDF
- w_ij = (tf_ij / |d_j|) * idf_i with idf_i = |D| / df_i; sim(d, q) = Σ (v_q[i] * w_d[i]) / sqrt(Σ tf_d[i]²)
- IDF values: verkauf 5, haus 5/4, italien 5/4, gart 5/3, miet 5, blüh 5
- tf values: d1: verkauf 1/3, haus 1/3, italien 1/3; d2: haus 1/3, gart 1/3, miet 1/3; d3: haus 1/4, italien 3/4; d4: italien 1/3, gart 2/3; d5: haus 1/4, italien 1/4, gart 1/4, blüh 1/4
- sim(d1,q) = (5/4*1/3 + 5/4*1/3) / √0.3 ~ 1.51
- sim(d2,q) = (5/4*1/3 + 5/3*1/3 + 5*1/3) / √0.3 ~ 4.80
- sim(d3,q) = (5/4*1/4 + 5/4*3/4) / √0.63 ~ 1.57
- sim(d4,q) = (5/4*1/3 + 5/3*2/3) / √0.56 ~ 2.08
- sim(d5,q) = (5/4*1/4 + 5/4*1/4 + 5/3*1/4) / √0.25 ~ 2.08
- Resulting ranking for Q ("Wir wollen ein Haus mit Garten in Italien mieten"):
  1. d2: Häuser mit Gärten zu vermieten
  2. d5: Der Garten in unserem italienschen Haus blüht (tied)
  2. d4: Die italienschen Gärtner sind im Garten (tied)
  4. d3: Häuser: In Italien, um Italien, um Italien herum
  5. d1: Wir verkaufen Häuser in Italien
34 Further Improvements
- Score the query in the same way as the documents; note: repeating words in a query then makes a difference
- Empirically estimated best scoring function [BYRN99]:
- Different length normalization: w_ij = tf_ij / max_k tf_kj
- Use the log of IDF with a slightly different measure (n_j: number of docs containing k_j); rare terms then no longer totally distort the score:
  w_ij = (tf_ij / max_k tf_kj) * log(|D| / n_j)
35 Why not Simpler?
- Why not use simple Euclidean distance?
- The length of the vectors would be much more important; since queries usually are very short, very short documents would always win
- The cosine measure normalizes by the length of both vectors
36 Metaphor: Comparison to Clustering
- Clustering: find groups of objects that are close to each other and far apart from all others
- Rephrasing the IR problem: a query partitions D into two clusters: relevant (R) and not relevant (N)
- q should be in the heart of R: similarity
- Docs in R should be close to each other: large w_ij * q_i values
- Docs in R should be far apart from other points: idf
- Goal of scoring: balance between intra- and inter-cluster similarity
37 Shortcomings
- We assumed that all terms are independent, i.e., that their vectors are orthogonal
  - Clearly wrong: terms are semantically close or not
  - The appearance of "red" in a doc with "wine" doesn't mean much, but "wine" is a non-zero match for "drinks"
  - Extension: topic-based vector space model
- No treatment of homonyms (as in most other methods); we would need to split words into their senses: see word sense disambiguation (later)
- Order independence: but order carries semantic meaning (object? subject?) and is important for disambiguation
38 A Note on Implementation
- Again: assume we have an index with the operation find: K -> P(D), and that we want to retrieve the top-r docs
- Core algorithm:
  - Look up all terms k_i of the query
  - Build the union of all documents which contain at least one k_i; hold them in a list sorted by score (initialized with 0)
  - Walk through the terms k_i in order of decreasing IDF weights; for each document d_j, add w_ji * IDF_i to its current score s_j
  - Go through the docs in order of current score
- Early stop: look at s_r - s_{r+1}: can the doc at rank r+1 still catch up with the doc at rank r? This can be estimated from the distribution of TF and IDF values
- Early stopping might produce the wrong order of the top-r docs, but not wrong docs. Why?
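The core algorithm above is a term-at-a-time accumulator scheme. The sketch below implements it without the early-stopping test; index and idf values are made-up toy data.

```python
import heapq
from collections import defaultdict

def top_r(query_terms, index, idf, r):
    """Term-at-a-time evaluation: accumulate w * idf per document,
    processing terms in order of decreasing idf (no early stopping)."""
    scores = defaultdict(float)
    for term in sorted(query_terms, key=lambda t: -idf.get(t, 0.0)):
        for doc, weight in index.get(term, {}).items():
            scores[doc] += weight * idf.get(term, 0.0)
    # keep only the r best-scoring documents
    return heapq.nlargest(r, scores.items(), key=lambda kv: kv[1])

# Toy index: term -> {doc id: tf weight}; hypothetical idf values
index = {"gart": {2: 1.0, 4: 2.0}, "miet": {2: 1.0}}
idf = {"gart": 1.2, "miet": 2.3}
print(top_r(["gart", "miet"], index, idf, r=2))
```

Processing high-idf terms first is what makes early stopping possible: the remaining terms can only add small increments, which bounds how much a lower-ranked doc can still gain.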
39 A Different View on the VSM
- Query evaluation actually searches for the top-r nearest neighbors
- This can be approached using techniques from multidimensional indexing: kd-trees, grid files, etc.
- Hope: no sequential scan of (all, many) docs, but a directed search according to distances
- But a severe problem: high dimensionality
40 Content of this Lecture
- IR Models: Boolean Model, Vector Space Model, Relevance Feedback in the VSM, Probabilistic Model, Latent Semantic Indexing
41 Interactive IR
- Recall: IR is the process of finding the right information; this includes more than a single query
- Relevance feedback loop:
  - User poses a query; the system answers
  - User judges the relevance of the top-k returned docs
  - System considers the feedback and generates a new (improved) answer
  - Loop until satisfaction
- The user never needs to pose another query: the new query is generated by the system
42 Relevance Feedback
- The user judges the current answers
- Basic assumptions: relevant docs are somewhat similar to each other, so the common core should be emphasized; irrelevant docs are different from relevant docs, so the differences should be deemphasized
- The system adapts to the feedback by
  - Query expansion: add new terms to the query, taken from the relevant documents
  - Term re-weighting: assign new weights to terms, derived from the relevant documents
43 Rationale
- Assume we knew the set R of all relevant docs; how would a good query for R look?
  v_q = (1/|R|) * Σ_{d∈R} v_d − (1/|D\R|) * Σ_{d∈D\R} v_d
- Idea: weight terms high that are in R; weight terms low that are not in R, i.e., that are in N = D\R
- Remark: it is not clear that this is the best of all queries; relevance feedback is a heuristic
44 Rocchio Algorithm
- Usually we do not know the real R; best bet: let R (N) be the set of docs marked as relevant (irrelevant)
- Adapt the query vector as follows:
  v_q_new = α * v(q) + β * (1/|R|) * Σ_{d∈R} v_d − γ * (1/|N|) * Σ_{d∈N} v_d
- This implies query expansion and term re-weighting
- How to choose α, β, γ?
- Alternative: subtract only the top-ranked non-relevant doc:
  v_q_new = α * v(q) + β * (1/|R|) * Σ_{d∈R} v_d − γ * v_d' with d' = argmax_{d∈N} sim(q, d)
- Intuition: non-relevant docs are heterogeneous and tear in every direction; better to only take the worst one
45 Example (simplified)
- α = 0.5, β = 0.5, γ = 0
- d1 = "information science" = (0.2, 0.8)
- d2 = "retrieval systems" = (0.9, 0.1)
- q = "retrieval of information" = (0.7, 0.3)
- If d1 is marked relevant: q' = ½q + ½d1 = (0.45, 0.55)
- If d2 is marked relevant: q' = ½q + ½d2 = (0.80, 0.20)
- Quelle: A. Nürnberger: IR
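The Rocchio update and the slide's example as a small sketch; the function reproduces q' = (0.45, 0.55) for d1 marked relevant.

```python
def rocchio(q, relevant, nonrelevant, alpha, beta, gamma):
    """v_q_new = a*q + (b/|R|)*sum(R) - (g/|N|)*sum(N), as defined above."""
    dims = len(q)
    def centroid(docs):
        if not docs:
            return [0.0] * dims
        return [sum(d[i] for d in docs) / len(docs) for i in range(dims)]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    return [alpha * q[i] + beta * cr[i] - gamma * cn[i] for i in range(dims)]

# The slide's example: alpha = beta = 0.5, gamma = 0
q  = [0.7, 0.3]   # "retrieval of information"
d1 = [0.2, 0.8]   # "information science"
print(rocchio(q, [d1], [], 0.5, 0.5, 0))   # approximately (0.45, 0.55)
```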
46 Results
- Advantages:
  - Improved results (many studies) compared to single queries
  - Users need not generate new queries themselves; an iterative process converging to the best possible answer
  - Especially helpful for increasing recall (due to query expansion)
- Disadvantages:
  - Requires some work by the user; Excite: only 4% used relevance feedback (the "more of this" button)
  - Writing a new query based on the returned results might be faster (and easier and more successful) than classifying results
  - Based on the assumption that relevant docs are similar: what if a user searches for all meanings of "jaguar"?
  - The query gets very long already after one iteration, which makes retrieval slow
47 Collaborative Filtering
- What other information could we use for improving retrieval performance?
- Collaborative filtering: give the user what other, similar users liked ("Customers who bought this book also bought ...")
- In IR: find users posing similar queries and look at which documents they eventually looked at
- In e-commerce, the buying is the signal itself (very reliable); in IR, we need to approximate: documents a searcher clicked on (if recordable); did the user look at the second page? (low credit for the first results); did the user pose a refinement query next? All of these are not very reliable
- Problems: what is a similar user? What is a similar query?
48 Thesaurus-based Query Expansion [M07, CS276]
- Expand the query with synonyms and hyponyms of each term: feline -> feline cat
- One may weight added terms less than the original query terms
- Often used in scientific IR systems (Medline); requires a high-quality thesaurus
- General observation: increases recall, but may significantly decrease precision: interest rate -> interest rate fascinate evaluate
- Do synonyms really exist?
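A minimal sketch of the expansion scheme above, using the slide's own examples as a hypothetical toy thesaurus; added terms get a lower weight than the original query terms, and the "interest rate" case shows the precision problem directly.

```python
# Hypothetical toy thesaurus: term -> synonyms/hyponyms (slide's examples)
THESAURUS = {"feline": ["cat"], "interest": ["fascinate"], "rate": ["evaluate"]}

def expand(query_terms, weight=0.5):
    """Expand each query term with thesaurus entries; added terms are
    weighted lower than the original terms, as suggested above."""
    expanded = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for syn in THESAURUS.get(t, []):
            expanded.setdefault(syn, weight)  # never downgrade an original term
    return expanded

print(expand(["interest", "rate"]))
# 'fascinate' and 'evaluate' sneak in: recall up, precision likely down
```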
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More informationMotivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models
3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical
More informationWhat is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.
What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization
More informationMATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson
MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Latent
More informationInformation Retrieval
Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationNotes on Latent Semantic Analysis
Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationLeverage Sparse Information in Predictive Modeling
Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationDISTRIBUTIONAL SEMANTICS
COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.
More informationManning & Schuetze, FSNLP (c) 1999,2000
558 15 Topics in Information Retrieval (15.10) y 4 3 2 1 0 0 1 2 3 4 5 6 7 8 Figure 15.7 An example of linear regression. The line y = 0.25x + 1 is the best least-squares fit for the four points (1,1),
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationMapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium
Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium Introduction Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure
More informationIR Models: The Probabilistic Model. Lecture 8
IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index
More informationCS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya
CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents
More informationTopic Models and Applications to Short Documents
Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text
More informationCS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine
CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,
More informationCLRG Biocreative V
CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationLanguage as a Stochastic Process
CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any
More information1. Ignoring case, extract all unique words from the entire set of documents.
CS 378 Introduction to Data Mining Spring 29 Lecture 2 Lecturer: Inderjit Dhillon Date: Jan. 27th, 29 Keywords: Vector space model, Latent Semantic Indexing(LSI), SVD 1 Vector Space Model The basic idea
More informationExtended IR Models. Johan Bollen Old Dominion University Department of Computer Science
Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen January 20, 2004 Page 1 UserTask Retrieval Classic Model Boolean
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database
ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch
More informationGaussian Models
Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density
More informationLatent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationhow to *do* computationally assisted research
how to *do* computationally assisted research digital literacy @ comwell Kristoffer L Nielbo knielbo@sdu.dk knielbo.github.io/ March 22, 2018 1/30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 class Person(object):
More informationWolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig
Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationRecommendation Systems
Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume
More informationTowards Collaborative Information Retrieval
Towards Collaborative Information Retrieval Markus Junker, Armin Hust, and Stefan Klink German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 28, 6768 Kaiserslautern, Germany {markus.junker,
More informationSemantic Similarity from Corpora - Latent Semantic Analysis
Semantic Similarity from Corpora - Latent Semantic Analysis Carlo Strapparava FBK-Irst Istituto per la ricerca scientifica e tecnologica I-385 Povo, Trento, ITALY strappa@fbk.eu Overview Latent Semantic
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationInformation Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)
Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision
More informationLecture 1b: Text, terms, and bags of words
Lecture 1b: Text, terms, and bags of words Trevor Cohn (based on slides by William Webber) COMP90042, 2015, Semester 1 Corpus, document, term Body of text referred to as corpus Corpus regarded as a collection
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More informationLatent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationChapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze
Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics
More informationText mining and natural language analysis. Jefrey Lijffijt
Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably
More informationInformation Retrieval Basic IR models. Luca Bondi
Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to
More informationVector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics
More informationCSCE 561 Information Retrieval System Models
CSCE 561 Information Retrieval System Models Satya Katragadda 26 August 2015 Agenda Introduction to Information Retrieval Inverted Index IR System Models Boolean Retrieval Model 2 Introduction Information
More informationLanguage Models. CS6200: Information Retrieval. Slides by: Jesse Anderton
Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they
More informationA Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR)
A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) Chirag Shah Dept. of CSE IIT Bombay, Powai Mumbai - 400 076, Maharashtra, India. Email: chirag@cse.iitb.ac.in
More informationQuerying. 1 o Semestre 2008/2009
Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 12: Latent Semantic Indexing and Relevance Feedback Paul Ginsparg Cornell
More informationPROBABILISTIC LATENT SEMANTIC ANALYSIS
PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications
More informationMore Smoothing, Tuning, and Evaluation
More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More information