Text Analytics. Models for Information Retrieval 1. Ulf Leser


1 Text Analytics Models for Information Retrieval 1 Ulf Leser

2 Processing Pipeline
[Figure, after Frakes et al.: corpus documents pass through parsing, normalization, and term weighting into the IR database; the user's query passes through parsing and normalization at the interface; returned document IDs are filtered and ranked, with user feedback looping back.]
Steps beyond document retrieval: text preprocessing, linguistic annotation, named entity recognition, named entity normalization, relationship extraction
Ulf Leser: Text Analytics, Vorlesung, Sommersemester

3 Format Conversion
Transform text into ASCII / Unicode
Formats: PDF, XML, DOC, ODP, images, ...
Problems: formatting instructions, special characters, formulas, figures and captions, tables, section headers, footnotes, page numbers, margin text, ...
[Figure: excerpt of a two-column paper whose columns interleave under naive extraction: "New generation of e-commerce applications require data schemas ..." / "In relational database systems, data objects are conventionally stored using a horizontal scheme ..."]
Diplomacy: to what extent can one reconstruct the original document from the normalized representation?

4 Assumptions
A document is a sequence of sentences
A sentence is a sequence of tokens
A token is the smallest unit of text, e.g. words, numbers
A term is a token or a set of tokens with a semantic meaning
  E.g. single tokens, proper names: "San" is a token, but not a term; "San Francisco" is two tokens but one term
  Dictionaries usually contain terms, not tokens
A word is either a token or a term, depending on the context
A concept is the mental representation of a real-world thing
  Concepts are represented by terms/words
  Terms may represent multiple concepts: homonyms
  Concepts may be represented by multiple terms: synonyms

5 Case: A Difficult Case
Should text be converted to lower case letters?
Advantages:
  Makes queries simpler
  Decreases index size a lot
Disadvantages:
  No abbreviations
  Loses important hints for sentence splitting (sentences start with upper case letters)
  Loses important hints for tokenization, NER, ...
Often: convert to lower case after all other steps

6 Sentence Splitting
Most linguistic analysis works on sentence level
Sentences group together objects and statements
  But mind references: "What is this white thing? It is a house."
Naive approach: reg-exp search for [.?!;] followed by a blank
Problems:
  Abbreviations: "C. elegans is a worm which ..."; "This does not hold for the U.S."
  Errors due to previous normalization steps: "is not clear.conclusions.we reported on"
  Proper names: "DOT.NET is a technique for ..."
  Direct speech: "By saying 'It is the economy, stupid!', B. Clinton meant that ..."
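
The naive reg-exp approach can be sketched in a few lines. This is a minimal illustration of the slide's idea, not a production splitter; the example text shows exactly the abbreviation problem described above.

```python
import re

# Naive sentence splitter: break after ., ?, !, or ; when followed by whitespace.
def split_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.?!;])\s+', text) if s.strip()]

text = "C. elegans is a worm. It is studied in the U.S. and elsewhere."
for s in split_sentences(text):
    print(s)
```

Note how the abbreviation "C." wrongly starts its own "sentence" and "U.S." wrongly ends one; this is why real splitters need abbreviation lists or learned models.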

7 Tokenization
Split a sentence into tokens, but in a way that allows for recognition of multi-token terms later
Simple approach: search for blanks
  "In San Francisco it rained a lot this year."
  "A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."
  "SQL commands comprise SELECT FROM WHERE clauses; the latter may contain functions such as leftstr(string, INT)."
  "This LCD-TV-Screen cost 3.100,99 USD."
  "[Bis[1,2-cyclohexanedionedioximato(1-)-O]-[1,2-cyclohexanedione dioximato(2-)-O]methyl-borato(2-)-N,N′,N″,N‴,...]-chlorotechnetium belongs to a family of ..."
Typical approach:
  Treat hyphens/parentheses as blanks
  Remove "." of abbreviations (after sentence splitting)
  Hope that errors are rare and don't matter much
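
The "typical approach" above can be sketched as a small function; the exact character classes are an assumption of this sketch (the slide names only hyphens and parentheses).

```python
import re

# Whitespace tokenizer following the slide's typical approach:
# hyphens and parentheses are treated as blanks, then split on whitespace.
def tokenize(sentence):
    cleaned = re.sub(r"[-()]", " ", sentence)
    return cleaned.split()

print(tokenize("A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."))
# -> ['A', 'state', 'of', 'the', 'art', 'Z', '9', 'Firebird', 'was',
#     'purchased', 'on', '3/12/1995.']
```

The multi-token term "state-of-the-art" and the product name "Z-9" are torn apart, and the trailing period survives unless abbreviation handling ran first; both illustrate why tokenization errors must be tolerated downstream.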

8 Stems versus Lemmas
Common idea: normalize words to a basal, normal form
  car, cars -> car; gives, gave, give -> give
Stemming: reduce words to a base form
  The base form often is not a word of the language
  Ignores detached prepositions (German "Ich kaufe ein" -> "einkaufen" or "kauf")
  Rather simple
Lemmatization: reduce words to their lemma
  Finds the linguistic root of a word
  Must be a proper word of the language
  Rather difficult, requires morphological analysis
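
A toy suffix-stripping stemmer makes the contrast concrete: the suffix list and minimum stem length below are arbitrary choices for this sketch, not Porter's rule set, but they show that stems need not be words of the language.

```python
# Toy suffix-stripping stemmer (illustration only, not Porter's algorithm).
SUFFIXES = ["ing", "ies", "es", "s", "ed"]

def stem(word):
    for suf in SUFFIXES:
        # strip the first matching suffix, but keep at least 3 characters
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["cars", "gives", "parties", "walked"]])
# -> ['car', 'giv', 'part', 'walk']
```

"giv" is not an English word; a lemmatizer would have to produce "give", which requires morphological knowledge rather than string surgery.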

9 Stop Words
Words that are so frequent that their removal (hopefully) does not change the meaning of a doc for retrieval/mining purposes
  English: top-2: 10% of all tokens; top-6: 20%; top-50: 50%
  English (top-10, LOB corpus): the, of, and, to, a, in, that, is, was, it
  German: aber, als, am, an, auch, auf, aus, bei, bin, ...
Removing stop words reduces a positional token index by ~40%
Search engines do remove stop words (but which?)
Hope: increase in precision due to fewer spurious hits
  Clearly, this also depends on the usage of stop words in the query
Variations:
  Remove the top 10, 100, 1000, ... words
  Use a language-specific or corpus-specific stop word list
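
Stop word removal itself is a one-liner over a fixed list; here with the LOB top-10 list from the slide.

```python
# Stop word filtering with the slide's LOB top-10 English list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("it is the economy stupid".split()))
# -> ['economy', 'stupid']
```

Note the precision/recall trade-off: a query like "to be or not to be" is destroyed entirely by this filter.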

10 Zipf's Law
Let f be the frequency of a word and r its rank in the list of all words sorted by frequency
Zipf's law: f ~ k/r for some constant k
Example: word ranks in Moby Dick show a good fit to Zipf's law, with some domain dependency (e.g. "whale")
Zipf's law is a fairly good approximation in most corpora
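
Zipf's law predicts that the product f · r is roughly constant. The check below computes rank/frequency pairs from a token list; the tiny corpus is illustrative only and far too small to actually exhibit the law.

```python
from collections import Counter

# Rank/frequency pairs: under Zipf's law, r * f should be roughly constant.
def rank_frequency(tokens):
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    return [(r, f, r * f) for r, f in enumerate(ranked, start=1)]

tokens = "the cat and the dog and the bird".split()
for r, f, prod in rank_frequency(tokens):
    print(r, f, prod)
```

On a real corpus (e.g. Moby Dick), plotting log f against log r yields an approximately straight line of slope -1.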

11 Proper Names: Named Entity Recognition
Proper names are different from ordinary words
  Very often comprise more than one token
  Often not contained in lexica
  Appear and disappear all the time
  Often contain special characters
They are very important for information retrieval and text mining
  Search for "Bill Clinton" (not "Clinton ordered the bill")
  Extract all relationships between a disease and the DMD gene
Recognizing proper names is difficult (more later)
  Named entity recognition: "dark" is a proper name; "broken wing" is no proper name, but "broken-wing gene" possibly is one
  Disambiguation: "dark" is a gene of the fruitfly (and a common English word)

12 Document Summarization
Preprocessing may go far beyond what we discussed
Statistical summarization
  Remove all tokens that are not representative for the text
  Only keep words which are more frequent than expected by chance
    "Chance": over all documents in the collection (corpus view) or in the language (naive view)
  Makes the amount of text much smaller; hopefully doesn't influence IR performance
Semantic summarization
  Understand the text and create a summary of its content
Annotation / classification
  Have a human understand the text and categorize it / describe it with words from a controlled dictionary
  That's what libraries are crazy about and Yahoo is famous for

13 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

14 Information Retrieval Core
The core question in IR: which of a given set of (normalized) documents is relevant for a given query?
Ranking: how relevant is each document of a given set for a given query?
[Figure: the query and the document base each pass through normalization; match / relevance scoring produces the result]

15 IR Models
Modeling: how is relevance judged? What model of relevancy is assumed?
[Taxonomy of IR models, after [BYRN99]:
  User task: retrieval (ad hoc, filtering) or browsing
  Classic models: Boolean, vector-space, probabilistic
  Set theoretic: fuzzy, extended Boolean
  Algebraic: generalized vector, latent semantic indexing, neural networks
  Probabilistic: inference network, belief network
  Structured models: non-overlapping lists, proximal nodes
  Browsing: flat, structure guided, hypertext]

16 Notation
All of the models we discuss use the bag-of-words view
Definition
  Let K be the set of all terms; k ∈ K is a term
  Let D be the set of all documents; d ∈ D is a document
  Let v_d be a vector for d with
    v_d[i] = 0 if the i-th term k_i is not contained in d
    v_d[i] = 1 if the i-th term k_i is contained in d
  Let v_q be a vector for q in the same sense
Often, we shall use weights instead of 0/1 memberships
  Let w_ij ≥ 0 be the weight of term k_i in document d_j
    To what extent does k_i contribute to the content of d_j?
    w_ij = 0 if k_i ∉ d_j
  v_d and v_q are defined accordingly

17 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

18 Boolean Model
Simple model based on set theory
Queries are specified as Boolean expressions
  Terms are atoms
  Terms are connected by AND, OR, NOT (XOR, ...)
There are no real weights; w_ij ∈ {0,1}
  All terms contribute equally to the content of a doc
Relevance of a document is either 0 or 1
Computation by Boolean operations on the doc and query vectors
  Represent q as a set S of vectors over all terms
  Example: q = k_1 ∧ (k_2 ∨ ¬k_3) => S = {(1,1,1), (1,1,0), (1,0,0)}
  S describes all possible combinations of keywords that make a doc relevant
  Document d is relevant for q iff v_d ∈ S
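
The relevance set S can be enumerated by testing the query predicate against every 0/1 vector; this directly mirrors the definition above (and already hints at why it is no basis for an implementation).

```python
from itertools import product

# Enumerate the relevance set S for a Boolean query given as a predicate
# over a 0/1 term vector (v[0] = k1, v[1] = k2, v[2] = k3).
def relevance_set(predicate, n_terms):
    return {v for v in product([1, 0], repeat=n_terms) if predicate(v)}

# q = k1 AND (k2 OR NOT k3)
q = lambda v: v[0] and (v[1] or not v[2])
S = relevance_set(q, 3)
print(sorted(S, reverse=True))  # -> [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

A document is relevant iff its incidence vector is a member of S.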

19 Properties
Simple, clear semantics, widely used in (early) systems
Disadvantages
  No partial matching (the expression must evaluate to true)
    Suppose the query k_1 ∧ k_2 ∧ ... ∧ k_9
    A doc d with k_1, k_2, ..., k_8 is as much refused as one with none of the terms
  No ranking (extensions exist, but are not satisfactory)
  Query terms cannot be weighted
  Average users don't like Boolean expressions
    Search engines: +bill +clinton lewinsky tax
Results are often unsatisfactory
  Too many documents (too few restrictions), without ranking
  Too few documents (too many restrictions)

20 A Note on Implementation
Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementations
  The set S of vectors can be extremely large
    Suppose q = k_1 and |K| = 1000 => S has 2^999 vectors
  Improvement: evaluate only those terms that are used in the query
Usual way of implementation: indexing
  Assume we have an index with a fast operation find: K -> P(D)
  Search each term k_i of the query, resulting in a set D_i ⊆ D of docs
  Evaluate the query in the usual order using set operations on the D_i's
    k_i ∧ k_j: D_i ∩ D_j
    k_i ∨ k_j: D_i ∪ D_j
    NOT k_i: D \ D_i
  Use cost-based optimization: evaluate those subexpressions first which hopefully result in small results
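
With Python sets, the index-based evaluation above is almost literal; the inverted index and term names below are made up for illustration.

```python
# Sketch of Boolean evaluation over an inverted index: find maps each term
# to the set of IDs of documents containing it (toy data).
index = {
    "house":  {1, 2, 3, 5},
    "garden": {2, 4, 5},
    "italy":  {1, 3, 4, 5},
}
all_docs = {1, 2, 3, 4, 5}

def find(term):
    return index.get(term, set())

# q = house AND (garden OR NOT italy), evaluated via set operations
result = find("house") & (find("garden") | (all_docs - find("italy")))
print(sorted(result))  # -> [2, 5]
```

A cost-based optimizer would evaluate the smallest posting lists first so that the intersections stay small.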

21 Negation in the Boolean Model
Evaluating NOT k_i as D \ D_i is expensive
  The result often is very, very large (|D \ D_i| ≈ |D|)
  Because almost all terms appear in almost no documents
    Recall Zipf's law: the tail of the distribution
    If k_i is not a stop word, which we removed anyway
Solution 1: disallow negation
  This is what web search engines do (exceptions?)
Solution 2: allow it only in the form k_i AND NOT k_j
  Evaluation: compute D_i and remove the docs which do contain k_j

22 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

23 Vector Space Model
Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11).
Historically, a breakthrough in IR; still the most popular model today
General idea
  Fix a vocabulary K, typically the set of all different terms in D
  View each doc and each query as a point in a |K|-dimensional space
  Rank docs according to their distance from the query
Main advantages
  Ability to rank docs (according to relevance)
  Natural capability for partial matching

24 Vector Space
[Figure: documents as points in a space with axes such as "star" and "diet"; astronomy and movie-star docs score high on "star", mammal docs high on "diet"]
Each term in K is one dimension
Different suggestions for determining the co-ordinates (= weights)
Different suggestions for determining the distance between two points
The closest docs are the most relevant ones
Rationale: vectors correspond to themes, which are loosely related to sets of terms
  Distance between vectors ~ distance between themes

25 Preliminaries 1
Definition: Let D be a document collection, K the set of all terms in D, d ∈ D and k ∈ K.
  The term frequency tf_dk is the frequency of k in d
  The document frequency df_k is the number of docs in D containing k
    Or sometimes the number of occurrences of k in D
  The inverse document frequency is idf_k = |D| / df_k
    Actually, it should rather be called the inverse relative document frequency
    In practice, one usually uses idf_k = log(|D| / df_k)
Remarks
  idf is a popular weighting term, hence the proper definition
  Clearly, tf_dk is a natural measure for the weight w_dk, but not the only or best one
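
The three quantities can be computed directly from tokenized documents; the toy collection below is made up for illustration.

```python
import math
from collections import Counter

# tf, df, and idf as defined above, over a toy collection of tokenized docs.
docs = [
    ["house", "italy"],
    ["house", "garden"],
    ["house", "italy", "italy"],
]

def tf(d, k):
    return Counter(d)[k]

def df(k):
    return sum(1 for d in docs if k in d)

def idf(k):
    # the practical, logarithmic variant: log(|D| / df_k)
    return math.log(len(docs) / df(k))

print(tf(docs[2], "italy"), df("italy"), round(idf("italy"), 3))  # -> 2 2 0.405
```

Note that "house" occurs in every doc, so its idf is log(3/3) = 0: ubiquitous terms carry no discriminating weight.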

26 Preliminaries 2
Recall the scalar product between vectors v and w:
  v ∘ w = |v| · |w| · cos(v, w)
  with |v| = √(Σ v_i²) and v ∘ w = Σ_{i=1..n} v_i · w_i

27 The Simplest Approach
Co-ordinates are set as term frequencies
Distance is measured as the cosine of the angle between doc d and query q (why not Euclidean distance? wait)

  sim(d, q) = cos(v_d, v_q) = (v_d ∘ v_q) / (|v_d| · |v_q|) = Σ(v_q[i] · v_d[i]) / (√(Σ v_d[i]²) · |v_q|)

  |v_q| is the same for all docs and can be dropped for ranking
Properties
  Including term weights increases retrieval performance
  Computes a (seemingly reasonable) ranking
  Also returns partial matches
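
The cosine formula translates directly into code; the two vectors below are illustrative.

```python
import math

# Cosine similarity between a document and a query vector over the same
# term dimensions.
def cosine(v_d, v_q):
    dot = sum(a * b for a, b in zip(v_d, v_q))
    norm_d = math.sqrt(sum(a * a for a in v_d))
    norm_q = math.sqrt(sum(b * b for b in v_q))
    return dot / (norm_d * norm_q)

v_d = [1, 1, 1, 0]  # doc contains terms 1-3
v_q = [0, 1, 1, 1]  # query contains terms 2-4
print(round(cosine(v_d, v_q), 4))  # -> 0.6667
```

The doc matches only two of three query terms, yet gets a non-zero score: exactly the partial matching the Boolean model lacks.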

28 Example Data
Assume stop word removal (wir, in, ...) and stemming (verkaufen -> kauf, wollen -> woll, ...)

      Text                                              verkauf haus italien gart miet blüh woll
  d1  Wir verkaufen Häuser in Italien                      1     1     1      0    0    0    0
  d2  Häuser mit Gärten zu vermieten                       0     1     0      1    1    0    0
  d3  Häuser: In Italien, um Italien, um Italien herum     0     1     3      0    0    0    0
  d4  Die italienschen Gärtner sind im Garten              0     0     1      2    0    0    0
  d5  Der Garten in unserem italienschen Haus blüht        0     1     1      1    0    1    0
  Q   Wir wollen ein Haus mit Garten in Italien mieten     0     1     1      1    1    0    1

(Glosses: d1 "We sell houses in Italy"; d2 "Houses with gardens for rent"; d3 "Houses: in Italy, around Italy, around Italy"; d4 "The Italian gardeners are in the garden"; d5 "The garden of our Italian house is blooming"; Q "We want to rent a house with a garden in Italy")

29 Ranking without Length Normalization

  sim(d, q) = v_d ∘ v_q = Σ_{i=1..n} v_d[i] · v_q[i]

sim(d1,q) = 1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·0 + 0·1 = 2
sim(d2,q) = 1+1+1 = 3
sim(d3,q) = 1+3 = 4
sim(d4,q) = 1+2 = 3
sim(d5,q) = 1+1+1 = 3
No good?
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d3: Häuser: In Italien, um Italien, um Italien herum
  2  d5: Der Garten in unserem italienschen Haus blüht
  2  d4: Die italienschen Gärtner sind im Garten
  2  d2: Häuser mit Gärten zu vermieten
  5  d1: Wir verkaufen Häuser in Italien

30 Ranking with Length Normalization

  sim(d, q) = Σ(v_q[i] · v_d[i]) / √(Σ v_d[i]²)

sim(d1,q) = (1·0+1·1+1·1+0·1+0·1+0·0+0·1) / √3 ≈ 1.15
sim(d2,q) = (1+1+1) / √3 ≈ 1.73
sim(d3,q) = (1+3) / √10 ≈ 1.26
sim(d4,q) = (1+2) / √5 ≈ 1.34
sim(d5,q) = (1+1+1) / √4 = 1.5
No good?
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d2: Häuser mit Gärten zu vermieten
  2  d5: Der Garten in unserem italienschen Haus blüht
  3  d4: Die italienschen Gärtner sind im Garten
  4  d3: Häuser: In Italien, um Italien, um Italien herum
  5  d1: Wir verkaufen Häuser in Italien
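
The length-normalized ranking above can be reproduced in a few lines; the vectors are transcribed from the example data (term order: verkauf, haus, italien, gart, miet, blüh, woll).

```python
import math

# Length-normalized scoring: sim(d, q) = sum(q_i * d_i) / sqrt(sum(d_i^2)).
docs = {
    "d1": [1, 1, 1, 0, 0, 0, 0],
    "d2": [0, 1, 0, 1, 1, 0, 0],
    "d3": [0, 1, 3, 0, 0, 0, 0],
    "d4": [0, 0, 1, 2, 0, 0, 0],
    "d5": [0, 1, 1, 1, 0, 1, 0],
}
q = [0, 1, 1, 1, 1, 0, 1]

def score(d):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / math.sqrt(sum(a * a for a in d))

ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
print(ranking)  # -> ['d2', 'd5', 'd4', 'd3', 'd1']
```

Normalization demotes d3, whose high raw score came only from repeating "Italien" in a relatively long vector.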

31 Improved Scoring: TF*IDF
One obvious problem: the longer the document, the higher the chances of being ranked high
Solution: normalize on document length (also yields 0 ≤ w_ij ≤ 1):

  w_ij = tf_ij / |d_j|   with |d_j| = Σ_{k=1..|d_j|} tf_kj

Second obvious problem: some terms are just everywhere in D and should be scored less
Solution: use IDF scores:

  w_ij = (tf_ij / |d_j|) · idf_i
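
Both fixes combine into one weight per term and document. The sketch below uses the slide's raw idf = |D| / df_i (without the log, as defined in Preliminaries 1); the toy docs are made up.

```python
from collections import Counter

# TF*IDF weight as on the slide: w_ij = (tf_ij / |d_j|) * (|D| / df_i).
docs = [
    ["haus", "italien", "verkauf"],
    ["haus", "gart", "miet"],
    ["haus", "italien", "italien", "italien"],
]

def weight(term, doc):
    tf = Counter(doc)[term] / len(doc)           # length-normalized tf
    df = sum(1 for d in docs if term in d)       # document frequency
    return tf * len(docs) / df                   # multiply by raw idf

print(round(weight("italien", docs[2]), 3))  # -> 1.125
```

"haus" occurs in all three docs, so its idf factor is 1 and only its relative frequency counts, while the rarer "italien" is boosted.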

32 TF*IDF Short Form
Give terms in a doc d high weights which are frequent in d and infrequent in D
IDF deals with the consequences of Zipf's law
  The few very frequent (and unspecific) terms get low scores
  The many infrequent (and specific) terms get higher scores

33 Example TF*IDF

  w_ij = (tf_ij / |d_j|) · idf_i,  idf_i = |D| / df_i
  sim(d, q) = Σ(v_q[i] · v_d[i]) / √(Σ v_d[i]²)

           verkauf  haus  italien  gart  miet  blüh
  IDF         5     5/4    5/4     5/3    5     5
  d1 (tf)    1/3    1/3    1/3      -     -     -
  d2 (tf)     -     1/3     -      1/3   1/3    -
  d3 (tf)     -     1/4    3/4      -     -     -
  d4 (tf)     -      -     1/3     2/3    -     -
  d5 (tf)     -     1/4    1/4     1/4    -    1/4

sim(d1,q) = (5/4·1/3 + 5/4·1/3) / √0.3 ≈ 1.51
sim(d2,q) = (5/4·1/3 + 5/3·1/3 + 5·1/3) / √0.3 ≈ 4.80
sim(d3,q) = (5/4·1/4 + 5/4·3/4) / √0.63 ≈ 1.57
sim(d4,q) = (5/4·1/3 + 5/3·2/3) / √0.56 ≈ 2.08
sim(d5,q) = (5/4·1/4 + 5/4·1/4 + 5/3·1/4) / √0.25 ≈ 2.08
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d2: Häuser mit Gärten zu vermieten
  2  d5: Der Garten in unserem italienschen Haus blüht
  2  d4: Die italienschen Gärtner sind im Garten
  4  d3: Häuser: In Italien, um Italien, um Italien herum
  5  d1: Wir verkaufen Häuser in Italien

34 Further Improvements
Score the query in the same way as the documents
  Note: repeating words in a query then makes a difference
Empirically estimated best scoring function [BYRN99]
  Different length normalization:

    w_ij = tf_ij / max_k tf_kj

  Use the log of a slightly different IDF measure (n_i: number of docs containing k_i); rare terms then no longer totally distort the score:

    w_ij = (tf_ij / max_k tf_kj) · log(|D| / n_i)

35 Why not Simpler?
Why not use the simple Euclidean distance?
  The length of the vectors would be much more important
  Since queries usually are very short, very short documents would always win
The cosine measure normalizes by the length of both vectors

36 Metaphor: Comparison to Clustering
Clustering: find groups of objects that are close to each other and far apart from all others
Rephrasing the IR problem
  A query partitions D into two clusters: relevant (R) / not relevant (N)
  q should be in the heart of R: similarity
  Docs in R should be close to each other: large w_ij·q_i values
  Docs in R should be far apart from other points: idf
Goal of scoring: balance between intra- and inter-cluster similarity

37 Shortcomings
We assumed that all terms are independent, i.e., that their vectors are orthogonal
  Clearly wrong: terms are semantically close or not
    The appearance of "red" in a doc with "wine" doesn't mean much, but "wine" is a non-zero match for "drinks"
  Extension: topic-based vector space model
No treatment of homonyms (as in most other methods)
  We would need to split words into their senses
  See word sense disambiguation (later)
Order independence
  But order carries semantic meaning (object? subject?)
  Order is important for disambiguation

38 A Note on Implementation
Again: assume we have an index with the operation find: K -> P(D)
Assume we want to retrieve the top-r docs
Core algorithm
  Look up all terms k_i of the query
  Build the union of all documents which contain at least one k_i
    Hold them in a list sorted by score (initialized with 0)
  Walk through the terms k_i in order of decreasing IDF weights
    For each document d_j: add w_ji·IDF_i to the current score s_j
    Go through the docs in order of current score
  Early stop: compare the scores at ranks r and r+1; can the doc at rank r+1 still overtake the doc at rank r?
    Can be estimated from the distribution of TF and IDF values
    An early stop might produce the wrong order of the top-r docs, but not wrong docs. Why?
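
The core algorithm (without the early-stop bound) is term-at-a-time scoring with accumulators. The postings and idf values below are made up for illustration; real systems would stream posting lists from disk.

```python
from collections import defaultdict

# Term-at-a-time scoring: each posting is a (doc, tf_weight) pair (toy data).
postings = {
    "italien": [("d1", 0.33), ("d3", 0.75), ("d4", 0.33), ("d5", 0.25)],
    "gart":    [("d2", 0.33), ("d4", 0.67), ("d5", 0.25)],
    "miet":    [("d2", 0.33)],
}
idf = {"italien": 1.25, "gart": 1.67, "miet": 5.0}

def top_r(query_terms, r):
    scores = defaultdict(float)
    # process terms in order of decreasing IDF (rarest, most informative first)
    for term in sorted(query_terms, key=lambda t: -idf[t]):
        for doc, w in postings.get(term, []):
            scores[doc] += w * idf[term]
    return sorted(scores, key=scores.get, reverse=True)[:r]

print(top_r(["italien", "gart", "miet"], 2))  # -> ['d2', 'd4']
```

Processing high-IDF terms first is what makes early stopping possible: the remaining terms can only add small score increments, so a bound on the maximum remaining contribution can prove that no doc outside the current top-r can still climb into it.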

39 A Different View on VSM
[Figure: docs as points in a term space with axes such as "star" and "diet"; the query's nearest neighbors are circled]
Query evaluation actually searches for the top-r nearest neighbors
This can be approached using techniques from multidimensional indexing
  kd-trees, grid files, etc.
Hope: no sequential scan of (all, many) docs, but a directed search according to distances
But a severe problem: high dimensionality

40 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing

41 Interactive IR
Recall: IR is the process of finding the right information
  This includes more than a single query
Relevance feedback
  User poses a query
  System answers
  User judges the relevance of the top-k returned docs
  System considers the feedback and generates a new (improved) answer
  Loop until satisfaction
The user never needs to pose another query; the new query is generated by the system
[Figure: user and IR system exchanging query and (ranked) results]

42 Relevance Feedback
User judges the current answers
Basic assumptions
  Relevant docs are somewhat similar to each other: the common core should be emphasized
  Irrelevant docs are different from relevant docs: the differences should be de-emphasized
System adapts to the feedback by
  Query expansion: add new terms to the query (from the relevant documents)
  Term re-weighting: assign new weights to terms (from the relevant documents)

43 Rationale
Assume we knew the set R of all relevant docs. How would a good query for R look?

  v_q = (1/|R|) · Σ_{d∈R} v_d - (1/|D\R|) · Σ_{d∈D\R} v_d

Idea
  Weight terms high that are in R
  Weight terms low that are not in R, i.e., that are in N = D\R
Remark
  It is not clear that this is the best of all queries
  Relevance feedback is a heuristic

44 Rocchio Algorithm
Usually we do not know the real R
Best bet: let R (N) be the set of docs marked as relevant (irrelevant)
Adapt the query vector as follows:

  v_q,new = α·v(q) + β·(1/|R|)·Σ_{d∈R} v_d - γ·(1/|N|)·Σ_{d∈N} v_d

This implies query expansion and term re-weighting
How to choose α, β, γ?
Alternative:

  v_q,new = α·v(q) + β·(1/|R|)·Σ_{d∈R} v_d - γ·v_d'  with d' = arg max_{d∈N} sim(q, d)

Intuition: non-relevant docs are heterogeneous and tear in every direction; better to take only the worst one

45 Example (simplified)
α = 0.5, β = 0.5, γ = 0
d1 = "information science" = (0.2, 0.8)
d2 = "retrieval systems" = (0.9, 0.1)
q = "retrieval of information" = (0.7, 0.3)
If d1 is marked relevant: q' = ½q + ½d1 = (0.45, 0.55)
If d2 is marked relevant: q' = ½q + ½d2 = (0.80, 0.20)
[Figure: query and document vectors in the plane; the updated query moves toward the marked document. Source: A. Nürnberger, IR lecture]
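
The Rocchio update from this example translates into a short function; the centroid helper generalizes it to more than one marked document.

```python
# Rocchio update: q_new = alpha*q + beta*centroid(R) - gamma*centroid(N),
# with the slide's parameters alpha = beta = 0.5, gamma = 0 as defaults.
def rocchio(q, relevant, irrelevant, alpha=0.5, beta=0.5, gamma=0.0):
    n = len(q)
    def centroid(vecs):
        if not vecs:
            return [0.0] * n
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(n)]
    rel, irr = centroid(relevant), centroid(irrelevant)
    return [alpha * q[i] + beta * rel[i] - gamma * irr[i] for i in range(n)]

q = [0.7, 0.3]
d1 = [0.2, 0.8]
print(rocchio(q, [d1], []))  # close to (0.45, 0.55), as on the slide
```

Marking d2 = (0.9, 0.1) relevant instead pulls the query the other way, to (0.80, 0.20).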

46 Results
Advantages
  Improved results compared to single queries (many studies)
  Users need not generate new queries themselves
  An iterative process converging to the best possible answer
  Especially helpful for increasing recall (due to query expansion)
Disadvantages
  Requires some work by the user
    Excite: only 4% used relevance feedback (the "more of this" button)
  Writing a new query based on the returned results might be faster (and easier and more successful) than classifying results
  Based on the assumption that relevant docs are similar
    What if the user searches for all meanings of "jaguar"?
  The query is very long already after one iteration; this makes retrieval slow

47 Collaborative Filtering
What other information could we use for improving retrieval performance?
Collaborative filtering: give the user what other, similar users liked
  "Customers who bought this book also bought ..."
In IR: find users posing similar queries and look at which documents they eventually looked at
  In e-commerce, it is the buying itself (very reliable)
  In IR, we need to approximate
    Documents a searcher clicked on (if recordable)
    Did the user look at the second page? (Low credit for the first results)
    Did the user pose a refinement query next?
  All of these are not very reliable
Problems: what is a similar user? What is a similar query?

48 Thesaurus-based Query Expansion [M07, CS276]
Expand the query with synonyms and hyponyms of each term
  feline -> feline cat
One may weight added terms less than the original query terms
Often used in scientific IR systems (Medline)
Requires a high-quality thesaurus
General observation
  Increases recall
  May significantly decrease precision
    interest rate -> interest rate fascinate evaluate
  Do synonyms really exist?


More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying

More information

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors? CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This

More information

Lecture 3: Probabilistic Retrieval Models

Lecture 3: Probabilistic Retrieval Models Probabilistic Retrieval Models Information Retrieval and Web Search Engines Lecture 3: Probabilistic Retrieval Models November 5 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme

More information

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides CSE 494/598 Lecture-4: Correlation Analysis LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Project-1 Due: February 12 th 2016 Analysis report:

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models 3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical

More information

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization

More information

MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson

MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Latent

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Ranked Retrieval (2)

Ranked Retrieval (2) Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

Manning & Schuetze, FSNLP (c) 1999,2000

Manning & Schuetze, FSNLP (c) 1999,2000 558 15 Topics in Information Retrieval (15.10) y 4 3 2 1 0 0 1 2 3 4 5 6 7 8 Figure 15.7 An example of linear regression. The line y = 0.25x + 1 is the best least-squares fit for the four points (1,1),

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium

Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium Introduction Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure

More information

IR Models: The Probabilistic Model. Lecture 8

IR Models: The Probabilistic Model. Lecture 8 IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index

More information

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

1. Ignoring case, extract all unique words from the entire set of documents.

1. Ignoring case, extract all unique words from the entire set of documents. CS 378 Introduction to Data Mining Spring 29 Lecture 2 Lecturer: Inderjit Dhillon Date: Jan. 27th, 29 Keywords: Vector space model, Latent Semantic Indexing(LSI), SVD 1 Vector Space Model The basic idea

More information

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen January 20, 2004 Page 1 UserTask Retrieval Classic Model Boolean

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;

More information

how to *do* computationally assisted research

how to *do* computationally assisted research how to *do* computationally assisted research digital literacy @ comwell Kristoffer L Nielbo knielbo@sdu.dk knielbo.github.io/ March 22, 2018 1/30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 class Person(object):

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume

More information

Towards Collaborative Information Retrieval

Towards Collaborative Information Retrieval Towards Collaborative Information Retrieval Markus Junker, Armin Hust, and Stefan Klink German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 28, 6768 Kaiserslautern, Germany {markus.junker,

More information

Semantic Similarity from Corpora - Latent Semantic Analysis

Semantic Similarity from Corpora - Latent Semantic Analysis Semantic Similarity from Corpora - Latent Semantic Analysis Carlo Strapparava FBK-Irst Istituto per la ricerca scientifica e tecnologica I-385 Povo, Trento, ITALY strappa@fbk.eu Overview Latent Semantic

More information

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare

More information

Behavioral Data Mining. Lecture 2

Behavioral Data Mining. Lecture 2 Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Lecture 1b: Text, terms, and bags of words

Lecture 1b: Text, terms, and bags of words Lecture 1b: Text, terms, and bags of words Trevor Cohn (based on slides by William Webber) COMP90042, 2015, Semester 1 Corpus, document, term Body of text referred to as corpus Corpus regarded as a collection

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics

More information

Text mining and natural language analysis. Jefrey Lijffijt

Text mining and natural language analysis. Jefrey Lijffijt Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably

More information

Information Retrieval Basic IR models. Luca Bondi

Information Retrieval Basic IR models. Luca Bondi Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to

More information

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics

More information

CSCE 561 Information Retrieval System Models

CSCE 561 Information Retrieval System Models CSCE 561 Information Retrieval System Models Satya Katragadda 26 August 2015 Agenda Introduction to Information Retrieval Inverted Index IR System Models Boolean Retrieval Model 2 Introduction Information

More information

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they

More information

A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR)

A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) Chirag Shah Dept. of CSE IIT Bombay, Powai Mumbai - 400 076, Maharashtra, India. Email: chirag@cse.iitb.ac.in

More information

Querying. 1 o Semestre 2008/2009

Querying. 1 o Semestre 2008/2009 Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 12: Latent Semantic Indexing and Relevance Feedback Paul Ginsparg Cornell

More information

PROBABILISTIC LATENT SEMANTIC ANALYSIS

PROBABILISTIC LATENT SEMANTIC ANALYSIS PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications

More information

More Smoothing, Tuning, and Evaluation

More Smoothing, Tuning, and Evaluation More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w

More information

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding

More information