Text Analytics. Models for Information Retrieval 1. Ulf Leser


1 Text Analytics Models for Information Retrieval 1 Ulf Leser

2 Processing Pipeline
[Figure, after Frakes et al.: corpus documents pass through parsing, normalization, and term weighting into the IR database; the user's query passes through parsing and normalization at the interface; returned document IDs are filtered and ranked, with user feedback looping back.]
Steps beyond document retrieval: text preprocessing, linguistic annotation, named entity recognition, named entity normalization, relationship extraction
Ulf Leser: Text Analytics, Vorlesung, Sommersemester

3 Format Conversion
Transform text into ASCII / Unicode
Formats: PDF, XML, DOC, ODP, images, ...
Problems: formatting instructions, special characters, formulas, figures and captions, tables, section headers, footnotes, page numbers, margin text, ...
[Figure: excerpt of a two-column paper whose columns interleave under naive extraction: "New generation of e-commerce applications require data schemas ..." / "In relational database systems, data objects are conventionally stored using a horizontal scheme ..."]
Diplomacy: to what extent can one reconstruct the original document from the normalized representation?

4 Assumptions
A document is a sequence of sentences
A sentence is a sequence of tokens
A token is the smallest unit of text, e.g. words, numbers
A term is a token or a set of tokens with a semantic meaning
  E.g. single tokens, proper names: "San" is a token, but not a term; "San Francisco" is two tokens but one term
  Dictionaries usually contain terms, not tokens
A word is either a token or a term, depending on the context
A concept is the mental representation of a real-world thing
  Concepts are represented by terms/words
  Terms may represent multiple concepts: homonyms
  Concepts may be represented by multiple terms: synonyms

5 Case: A Difficult Case
Should text be converted to lower case letters?
Advantages:
  Makes queries simpler
  Decreases index size a lot
Disadvantages:
  No abbreviations
  Loses important hints for sentence splitting (sentences start with upper case letters)
  Loses important hints for tokenization, NER, ...
Often: convert to lower case after all other steps

6 Sentence Splitting
Most linguistic analysis works on sentence level
Sentences group together objects and statements
  But mind references: "What is this white thing? It is a house."
Naive approach: reg-exp search for [.?!;] followed by a blank
Problems:
  Abbreviations: "C. elegans is a worm which ..."; "This does not hold for the U.S."
  Errors due to previous normalization steps: "is not clear.conclusions.we reported on"
  Proper names: "DOT.NET is a technique for ..."
  Direct speech: "By saying 'It is the economy, stupid!', B. Clinton meant that ..."
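
The naive reg-exp approach can be sketched in a few lines. This is a minimal illustration of the slide's idea, not a production splitter; the example text shows exactly the abbreviation problem described above.

```python
import re

# Naive sentence splitter: break after ., ?, !, or ; when followed by whitespace.
def split_sentences(text):
    return [s.strip() for s in re.split(r'(?<=[.?!;])\s+', text) if s.strip()]

text = "C. elegans is a worm. It is studied in the U.S. and elsewhere."
for s in split_sentences(text):
    print(s)
```

Note how the abbreviation "C." wrongly starts its own "sentence" and "U.S." wrongly ends one; this is why real splitters need abbreviation lists or learned models.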

7 Tokenization
Split a sentence into tokens, but in a way that allows for recognition of multi-token terms later
Simple approach: search for blanks
  "In San Francisco it rained a lot this year."
  "A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."
  "SQL commands comprise SELECT FROM WHERE clauses; the latter may contain functions such as leftstr(string, INT)."
  "This LCD-TV-Screen cost 3.100,99 USD."
  "[Bis[1,2-cyclohexanedionedioximato(1-)-O]-[1,2-cyclohexanedione dioximato(2-)-O]methyl-borato(2-)-N,N′,N″,N‴,...]-chlorotechnetium belongs to a family of ..."
Typical approach:
  Treat hyphens/parentheses as blanks
  Remove "." of abbreviations (after sentence splitting)
  Hope that errors are rare and don't matter much
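
The "typical approach" above can be sketched as a small function; the exact character classes are an assumption of this sketch (the slide names only hyphens and parentheses).

```python
import re

# Whitespace tokenizer following the slide's typical approach:
# hyphens and parentheses are treated as blanks, then split on whitespace.
def tokenize(sentence):
    cleaned = re.sub(r"[-()]", " ", sentence)
    return cleaned.split()

print(tokenize("A state-of-the-art Z-9 Firebird was purchased on 3/12/1995."))
# -> ['A', 'state', 'of', 'the', 'art', 'Z', '9', 'Firebird', 'was',
#     'purchased', 'on', '3/12/1995.']
```

The multi-token term "state-of-the-art" and the product name "Z-9" are torn apart, and the trailing period survives unless abbreviation handling ran first; both illustrate why tokenization errors must be tolerated downstream.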

8 Stems versus Lemmas
Common idea: normalize words to a basal, normal form
  car, cars -> car; gives, gave, give -> give
Stemming: reduce words to a base form
  The base form often is not a word of the language
  Ignores detached prepositions (German "Ich kaufe ein" -> "einkaufen" or "kauf")
  Rather simple
Lemmatization: reduce words to their lemma
  Finds the linguistic root of a word
  Must be a proper word of the language
  Rather difficult, requires morphological analysis
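
A toy suffix-stripping stemmer makes the contrast concrete: the suffix list and minimum stem length below are arbitrary choices for this sketch, not Porter's rule set, but they show that stems need not be words of the language.

```python
# Toy suffix-stripping stemmer (illustration only, not Porter's algorithm).
SUFFIXES = ["ing", "ies", "es", "s", "ed"]

def stem(word):
    for suf in SUFFIXES:
        # strip the first matching suffix, but keep at least 3 characters
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["cars", "gives", "parties", "walked"]])
# -> ['car', 'giv', 'part', 'walk']
```

"giv" is not an English word; a lemmatizer would have to produce "give", which requires morphological knowledge rather than string surgery.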

9 Stop Words
Words that are so frequent that their removal (hopefully) does not change the meaning of a doc for retrieval/mining purposes
  English: top-2: 10% of all tokens; top-6: 20%; top-50: 50%
  English (top-10, LOB corpus): the, of, and, to, a, in, that, is, was, it
  German: aber, als, am, an, auch, auf, aus, bei, bin, ...
Removing stop words reduces a positional token index by ~40%
Search engines do remove stop words (but which?)
Hope: increase in precision due to fewer spurious hits
  Clearly, this also depends on the usage of stop words in the query
Variations:
  Remove the top 10, 100, 1000, ... words
  Use a language-specific or corpus-specific stop word list
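
Stop word removal itself is a one-liner over a fixed list; here with the LOB top-10 list from the slide.

```python
# Stop word filtering with the slide's LOB top-10 English list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "is", "was", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("it is the economy stupid".split()))
# -> ['economy', 'stupid']
```

Note the precision/recall trade-off: a query like "to be or not to be" is destroyed entirely by this filter.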

10 Zipf's Law
Let f be the frequency of a word and r its rank in the list of all words sorted by frequency
Zipf's law: f ~ k/r for some constant k
Example: word ranks in Moby Dick show a good fit to Zipf's law, with some domain dependency (e.g. "whale")
Zipf's law is a fairly good approximation in most corpora
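
Zipf's law predicts that the product f · r is roughly constant. The check below computes rank/frequency pairs from a token list; the tiny corpus is illustrative only and far too small to actually exhibit the law.

```python
from collections import Counter

# Rank/frequency pairs: under Zipf's law, r * f should be roughly constant.
def rank_frequency(tokens):
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    return [(r, f, r * f) for r, f in enumerate(ranked, start=1)]

tokens = "the cat and the dog and the bird".split()
for r, f, prod in rank_frequency(tokens):
    print(r, f, prod)
```

On a real corpus (e.g. Moby Dick), plotting log f against log r yields an approximately straight line of slope -1.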

11 Proper Names: Named Entity Recognition
Proper names are different from ordinary words
  Very often comprise more than one token
  Often not contained in lexica
  Appear and disappear all the time
  Often contain special characters
They are very important for information retrieval and text mining
  Search for "Bill Clinton" (not "Clinton ordered the bill")
  Extract all relationships between a disease and the DMD gene
Recognizing proper names is difficult (more later)
  Named entity recognition: "dark" is a proper name; "broken wing" is no proper name, but "broken-wing gene" possibly is one
  Disambiguation: "dark" is a gene of the fruitfly (and a common English word)

12 Document Summarization
Preprocessing may go far beyond what we discussed
Statistical summarization
  Remove all tokens that are not representative for the text
  Only keep words which are more frequent than expected by chance
    "Chance": over all documents in the collection (corpus view) or in the language (naive view)
  Makes the amount of text much smaller; hopefully doesn't influence IR performance
Semantic summarization
  Understand the text and create a summary of its content
Annotation / classification
  Have a human understand the text and categorize it / describe it with words from a controlled dictionary
  That's what libraries are crazy about and Yahoo is famous for

13 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

14 Information Retrieval Core
The core question in IR: which of a given set of (normalized) documents is relevant for a given query?
Ranking: how relevant is each document of a given set for a given query?
[Figure: the query and the document base each pass through normalization; match / relevance scoring produces the result]

15 IR Models
Modeling: how is relevance judged? What model of relevancy is assumed?
[Taxonomy of IR models, after [BYRN99]:
  User task: retrieval (ad hoc, filtering) or browsing
  Classic models: Boolean, vector-space, probabilistic
  Set theoretic: fuzzy, extended Boolean
  Algebraic: generalized vector, latent semantic indexing, neural networks
  Probabilistic: inference network, belief network
  Structured models: non-overlapping lists, proximal nodes
  Browsing: flat, structure guided, hypertext]

16 Notation
All of the models we discuss use the bag-of-words view
Definition
  Let K be the set of all terms; k ∈ K is a term
  Let D be the set of all documents; d ∈ D is a document
  Let v_d be a vector for d with
    v_d[i] = 0 if the i-th term k_i is not contained in d
    v_d[i] = 1 if the i-th term k_i is contained in d
  Let v_q be a vector for q in the same sense
Often, we shall use weights instead of 0/1 memberships
  Let w_ij ≥ 0 be the weight of term k_i in document d_j
    To what extent does k_i contribute to the content of d_j?
    w_ij = 0 if k_i ∉ d_j
  v_d and v_q are defined accordingly

17 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

18 Boolean Model
Simple model based on set theory
Queries are specified as Boolean expressions
  Terms are atoms
  Terms are connected by AND, OR, NOT (XOR, ...)
There are no real weights; w_ij ∈ {0,1}
  All terms contribute equally to the content of a doc
Relevance of a document is either 0 or 1
Computation by Boolean operations on the doc and query vectors
  Represent q as a set S of vectors over all terms
  Example: q = k_1 ∧ (k_2 ∨ ¬k_3) => S = {(1,1,1), (1,1,0), (1,0,0)}
  S describes all possible combinations of keywords that make a doc relevant
  Document d is relevant for q iff v_d ∈ S
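
The relevance set S can be enumerated by testing the query predicate against every 0/1 vector; this directly mirrors the definition above (and already hints at why it is no basis for an implementation).

```python
from itertools import product

# Enumerate the relevance set S for a Boolean query given as a predicate
# over a 0/1 term vector (v[0] = k1, v[1] = k2, v[2] = k3).
def relevance_set(predicate, n_terms):
    return {v for v in product([1, 0], repeat=n_terms) if predicate(v)}

# q = k1 AND (k2 OR NOT k3)
q = lambda v: v[0] and (v[1] or not v[2])
S = relevance_set(q, 3)
print(sorted(S, reverse=True))  # -> [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

A document is relevant iff its incidence vector is a member of S.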

19 Properties
Simple, clear semantics, widely used in (early) systems
Disadvantages
  No partial matching (the expression must evaluate to true)
    Suppose the query k_1 ∧ k_2 ∧ ... ∧ k_9
    A doc d with k_1, k_2, ..., k_8 is as much refused as one with none of the terms
  No ranking (extensions exist, but are not satisfactory)
  Query terms cannot be weighted
  Average users don't like Boolean expressions
    Search engines: +bill +clinton lewinsky tax
Results are often unsatisfactory
  Too many documents (too few restrictions), without ranking
  Too few documents (too many restrictions)

20 A Note on Implementation
Our definition only gave one way of defining the semantics of a Boolean query; it should not be used for implementations
  The set S of vectors can be extremely large
    Suppose q = k_1 and |K| = 1000 => S has 2^999 vectors
  Improvement: evaluate only those terms that are used in the query
Usual way of implementation: indexing
  Assume we have an index with a fast operation find: K -> P(D)
  Search each term k_i of the query, resulting in a set D_i ⊆ D of docs
  Evaluate the query in the usual order using set operations on the D_i's
    k_i ∧ k_j: D_i ∩ D_j
    k_i ∨ k_j: D_i ∪ D_j
    NOT k_i: D \ D_i
  Use cost-based optimization: evaluate those subexpressions first which hopefully result in small results
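
With Python sets, the index-based evaluation above is almost literal; the inverted index and term names below are made up for illustration.

```python
# Sketch of Boolean evaluation over an inverted index: find maps each term
# to the set of IDs of documents containing it (toy data).
index = {
    "house":  {1, 2, 3, 5},
    "garden": {2, 4, 5},
    "italy":  {1, 3, 4, 5},
}
all_docs = {1, 2, 3, 4, 5}

def find(term):
    return index.get(term, set())

# q = house AND (garden OR NOT italy), evaluated via set operations
result = find("house") & (find("garden") | (all_docs - find("italy")))
print(sorted(result))  # -> [2, 5]
```

A cost-based optimizer would evaluate the smallest posting lists first so that the intersections stay small.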

21 Negation in the Boolean Model
Evaluating NOT k_i as D \ D_i is expensive
  The result often is very, very large (|D \ D_i| ≈ |D|)
  Because almost all terms appear in almost no documents
    Recall Zipf's law: the tail of the distribution
    If k_i is not a stop word, which we removed anyway
Solution 1: disallow negation
  This is what web search engines do (exceptions?)
Solution 2: allow it only in the form k_i AND NOT k_j
  Evaluation: compute D_i and remove the docs which do contain k_j

22 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing
  Other IR Models

23 Vector Space Model
Salton, G., Wong, A. and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." Communications of the ACM 18(11).
Historically, a breakthrough in IR; still the most popular model today
General idea
  Fix a vocabulary K, typically the set of all different terms in D
  View each doc and each query as a point in a |K|-dimensional space
  Rank docs according to their distance from the query
Main advantages
  Ability to rank docs (according to relevance)
  Natural capability for partial matching

24 Vector Space
[Figure: documents as points in a space with axes such as "star" and "diet"; astronomy and movie-star docs score high on "star", mammal docs high on "diet"]
Each term in K is one dimension
Different suggestions for determining the co-ordinates (= weights)
Different suggestions for determining the distance between two points
The closest docs are the most relevant ones
Rationale: vectors correspond to themes, which are loosely related to sets of terms
  Distance between vectors ~ distance between themes

25 Preliminaries 1
Definition: Let D be a document collection, K the set of all terms in D, d ∈ D and k ∈ K.
  The term frequency tf_dk is the frequency of k in d
  The document frequency df_k is the number of docs in D containing k
    Or sometimes the number of occurrences of k in D
  The inverse document frequency is idf_k = |D| / df_k
    Actually, it should rather be called the inverse relative document frequency
    In practice, one usually uses idf_k = log(|D| / df_k)
Remarks
  idf is a popular weighting term, hence the proper definition
  Clearly, tf_dk is a natural measure for the weight w_dk, but not the only or best one
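
The three quantities can be computed directly from tokenized documents; the toy collection below is made up for illustration.

```python
import math
from collections import Counter

# tf, df, and idf as defined above, over a toy collection of tokenized docs.
docs = [
    ["house", "italy"],
    ["house", "garden"],
    ["house", "italy", "italy"],
]

def tf(d, k):
    return Counter(d)[k]

def df(k):
    return sum(1 for d in docs if k in d)

def idf(k):
    # the practical, logarithmic variant: log(|D| / df_k)
    return math.log(len(docs) / df(k))

print(tf(docs[2], "italy"), df("italy"), round(idf("italy"), 3))  # -> 2 2 0.405
```

Note that "house" occurs in every doc, so its idf is log(3/3) = 0: ubiquitous terms carry no discriminating weight.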

26 Preliminaries 2
Recall the scalar product between vectors v and w:
  v ∘ w = |v| · |w| · cos(v, w)
  with |v| = √(Σ v_i²) and v ∘ w = Σ_{i=1..n} v_i · w_i

27 The Simplest Approach
Co-ordinates are set as term frequencies
Distance is measured as the cosine of the angle between doc d and query q (why not Euclidean distance? wait)

  sim(d, q) = cos(v_d, v_q) = (v_d ∘ v_q) / (|v_d| · |v_q|) = Σ(v_q[i] · v_d[i]) / (√(Σ v_d[i]²) · |v_q|)

  |v_q| is the same for all docs and can be dropped for ranking
Properties
  Including term weights increases retrieval performance
  Computes a (seemingly reasonable) ranking
  Also returns partial matches
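
The cosine formula translates directly into code; the two vectors below are illustrative.

```python
import math

# Cosine similarity between a document and a query vector over the same
# term dimensions.
def cosine(v_d, v_q):
    dot = sum(a * b for a, b in zip(v_d, v_q))
    norm_d = math.sqrt(sum(a * a for a in v_d))
    norm_q = math.sqrt(sum(b * b for b in v_q))
    return dot / (norm_d * norm_q)

v_d = [1, 1, 1, 0]  # doc contains terms 1-3
v_q = [0, 1, 1, 1]  # query contains terms 2-4
print(round(cosine(v_d, v_q), 4))  # -> 0.6667
```

The doc matches only two of three query terms, yet gets a non-zero score: exactly the partial matching the Boolean model lacks.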

28 Example Data
Assume stop word removal (wir, in, ...) and stemming (verkaufen -> kauf, wollen -> woll, ...)

      Text                                              verkauf haus italien gart miet blüh woll
  d1  Wir verkaufen Häuser in Italien                      1     1     1      0    0    0    0
  d2  Häuser mit Gärten zu vermieten                       0     1     0      1    1    0    0
  d3  Häuser: In Italien, um Italien, um Italien herum     0     1     3      0    0    0    0
  d4  Die italienschen Gärtner sind im Garten              0     0     1      2    0    0    0
  d5  Der Garten in unserem italienschen Haus blüht        0     1     1      1    0    1    0
  Q   Wir wollen ein Haus mit Garten in Italien mieten     0     1     1      1    1    0    1

(Glosses: d1 "We sell houses in Italy"; d2 "Houses with gardens for rent"; d3 "Houses: in Italy, around Italy, around Italy"; d4 "The Italian gardeners are in the garden"; d5 "The garden of our Italian house is blooming"; Q "We want to rent a house with a garden in Italy")

29 Ranking without Length Normalization

  sim(d, q) = v_d ∘ v_q = Σ_{i=1..n} v_d[i] · v_q[i]

sim(d1,q) = 1·0 + 1·1 + 1·1 + 0·1 + 0·1 + 0·0 + 0·1 = 2
sim(d2,q) = 1+1+1 = 3
sim(d3,q) = 1+3 = 4
sim(d4,q) = 1+2 = 3
sim(d5,q) = 1+1+1 = 3
No good?
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d3: Häuser: In Italien, um Italien, um Italien herum
  2  d5: Der Garten in unserem italienschen Haus blüht
  2  d4: Die italienschen Gärtner sind im Garten
  2  d2: Häuser mit Gärten zu vermieten
  5  d1: Wir verkaufen Häuser in Italien

30 Ranking with Length Normalization

  sim(d, q) = Σ(v_q[i] · v_d[i]) / √(Σ v_d[i]²)

sim(d1,q) = (1·0+1·1+1·1+0·1+0·1+0·0+0·1) / √3 ≈ 1.15
sim(d2,q) = (1+1+1) / √3 ≈ 1.73
sim(d3,q) = (1+3) / √10 ≈ 1.26
sim(d4,q) = (1+2) / √5 ≈ 1.34
sim(d5,q) = (1+1+1) / √4 = 1.5
No good?
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d2: Häuser mit Gärten zu vermieten
  2  d5: Der Garten in unserem italienschen Haus blüht
  3  d4: Die italienschen Gärtner sind im Garten
  4  d3: Häuser: In Italien, um Italien, um Italien herum
  5  d1: Wir verkaufen Häuser in Italien
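
The length-normalized ranking above can be reproduced in a few lines; the vectors are transcribed from the example data (term order: verkauf, haus, italien, gart, miet, blüh, woll).

```python
import math

# Length-normalized scoring: sim(d, q) = sum(q_i * d_i) / sqrt(sum(d_i^2)).
docs = {
    "d1": [1, 1, 1, 0, 0, 0, 0],
    "d2": [0, 1, 0, 1, 1, 0, 0],
    "d3": [0, 1, 3, 0, 0, 0, 0],
    "d4": [0, 0, 1, 2, 0, 0, 0],
    "d5": [0, 1, 1, 1, 0, 1, 0],
}
q = [0, 1, 1, 1, 1, 0, 1]

def score(d):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / math.sqrt(sum(a * a for a in d))

ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
print(ranking)  # -> ['d2', 'd5', 'd4', 'd3', 'd1']
```

Normalization demotes d3, whose high raw score came only from repeating "Italien" in a relatively long vector.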

31 Improved Scoring: TF*IDF
One obvious problem: the longer the document, the higher the chances of being ranked high
Solution: normalize on document length (also yields 0 ≤ w_ij ≤ 1):

  w_ij = tf_ij / |d_j|   with |d_j| = Σ_{k=1..|d_j|} tf_kj

Second obvious problem: some terms are just everywhere in D and should be scored less
Solution: use IDF scores:

  w_ij = (tf_ij / |d_j|) · idf_i
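
Both fixes combine into one weight per term and document. The sketch below uses the slide's raw idf = |D| / df_i (without the log, as defined in Preliminaries 1); the toy docs are made up.

```python
from collections import Counter

# TF*IDF weight as on the slide: w_ij = (tf_ij / |d_j|) * (|D| / df_i).
docs = [
    ["haus", "italien", "verkauf"],
    ["haus", "gart", "miet"],
    ["haus", "italien", "italien", "italien"],
]

def weight(term, doc):
    tf = Counter(doc)[term] / len(doc)           # length-normalized tf
    df = sum(1 for d in docs if term in d)       # document frequency
    return tf * len(docs) / df                   # multiply by raw idf

print(round(weight("italien", docs[2]), 3))  # -> 1.125
```

"haus" occurs in all three docs, so its idf factor is 1 and only its relative frequency counts, while the rarer "italien" is boosted.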

32 TF*IDF Short Form
Give terms in a doc d high weights which are frequent in d and infrequent in D
IDF deals with the consequences of Zipf's law
  The few very frequent (and unspecific) terms get low scores
  The many infrequent (and specific) terms get higher scores

33 Example TF*IDF

  w_ij = (tf_ij / |d_j|) · idf_i,  idf_i = |D| / df_i
  sim(d, q) = Σ(v_q[i] · v_d[i]) / √(Σ v_d[i]²)

           verkauf  haus  italien  gart  miet  blüh
  IDF         5     5/4    5/4     5/3    5     5
  d1 (tf)    1/3    1/3    1/3      -     -     -
  d2 (tf)     -     1/3     -      1/3   1/3    -
  d3 (tf)     -     1/4    3/4      -     -     -
  d4 (tf)     -      -     1/3     2/3    -     -
  d5 (tf)     -     1/4    1/4     1/4    -    1/4

sim(d1,q) = (5/4·1/3 + 5/4·1/3) / √0.3 ≈ 1.51
sim(d2,q) = (5/4·1/3 + 5/3·1/3 + 5·1/3) / √0.3 ≈ 4.80
sim(d3,q) = (5/4·1/4 + 5/4·3/4) / √0.63 ≈ 1.57
sim(d4,q) = (5/4·1/3 + 5/3·2/3) / √0.56 ≈ 2.08
sim(d5,q) = (5/4·1/4 + 5/4·1/4 + 5/3·1/4) / √0.25 ≈ 2.08
  Rank  Q: Wir wollen ein Haus mit Garten in Italien mieten
  1  d2: Häuser mit Gärten zu vermieten
  2  d5: Der Garten in unserem italienschen Haus blüht
  2  d4: Die italienschen Gärtner sind im Garten
  4  d3: Häuser: In Italien, um Italien, um Italien herum
  5  d1: Wir verkaufen Häuser in Italien

34 Further Improvements
Score the query in the same way as the documents
  Note: repeating words in a query then makes a difference
Empirically estimated best scoring function [BYRN99]
  Different length normalization:

    w_ij = tf_ij / max_k tf_kj

  Use the log of a slightly different IDF measure (n_i: number of docs containing k_i); rare terms then no longer totally distort the score:

    w_ij = (tf_ij / max_k tf_kj) · log(|D| / n_i)

35 Why not Simpler?
Why not use the simple Euclidean distance?
  The length of the vectors would be much more important
  Since queries usually are very short, very short documents would always win
The cosine measure normalizes by the length of both vectors

36 Metaphor: Comparison to Clustering
Clustering: find groups of objects that are close to each other and far apart from all others
Rephrasing the IR problem
  A query partitions D into two clusters: relevant (R) / not relevant (N)
  q should be in the heart of R: similarity
  Docs in R should be close to each other: large w_ij·q_i values
  Docs in R should be far apart from other points: idf
Goal of scoring: balance between intra- and inter-cluster similarity

37 Shortcomings
We assumed that all terms are independent, i.e., that their vectors are orthogonal
  Clearly wrong: terms are semantically close or not
    The appearance of "red" in a doc with "wine" doesn't mean much, but "wine" is a non-zero match for "drinks"
  Extension: topic-based vector space model
No treatment of homonyms (as in most other methods)
  We would need to split words into their senses
  See word sense disambiguation (later)
Order independence
  But order carries semantic meaning (object? subject?)
  Order is important for disambiguation

38 A Note on Implementation
Again: assume we have an index with the operation find: K -> P(D)
Assume we want to retrieve the top-r docs
Core algorithm
  Look up all terms k_i of the query
  Build the union of all documents which contain at least one k_i
    Hold them in a list sorted by score (initialized with 0)
  Walk through the terms k_i in order of decreasing IDF weights
    For each document d_j: add w_ji·IDF_i to the current score s_j
    Go through the docs in order of current score
  Early stop: compare the scores at ranks r and r+1; can the doc at rank r+1 still overtake the doc at rank r?
    Can be estimated from the distribution of TF and IDF values
    An early stop might produce the wrong order of the top-r docs, but not wrong docs. Why?
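
The core algorithm (without the early-stop bound) is term-at-a-time scoring with accumulators. The postings and idf values below are made up for illustration; real systems would stream posting lists from disk.

```python
from collections import defaultdict

# Term-at-a-time scoring: each posting is a (doc, tf_weight) pair (toy data).
postings = {
    "italien": [("d1", 0.33), ("d3", 0.75), ("d4", 0.33), ("d5", 0.25)],
    "gart":    [("d2", 0.33), ("d4", 0.67), ("d5", 0.25)],
    "miet":    [("d2", 0.33)],
}
idf = {"italien": 1.25, "gart": 1.67, "miet": 5.0}

def top_r(query_terms, r):
    scores = defaultdict(float)
    # process terms in order of decreasing IDF (rarest, most informative first)
    for term in sorted(query_terms, key=lambda t: -idf[t]):
        for doc, w in postings.get(term, []):
            scores[doc] += w * idf[term]
    return sorted(scores, key=scores.get, reverse=True)[:r]

print(top_r(["italien", "gart", "miet"], 2))  # -> ['d2', 'd4']
```

Processing high-IDF terms first is what makes early stopping possible: the remaining terms can only add small score increments, so a bound on the maximum remaining contribution can prove that no doc outside the current top-r can still climb into it.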

39 A Different View on VSM
[Figure: docs as points in a term space with axes such as "star" and "diet"; the query's nearest neighbors are circled]
Query evaluation actually searches for the top-r nearest neighbors
This can be approached using techniques from multidimensional indexing
  kd-trees, grid files, etc.
Hope: no sequential scan of (all, many) docs, but a directed search according to distances
But a severe problem: high dimensionality

40 Content of this Lecture
IR Models
  Boolean Model
  Vector Space Model
  Relevance Feedback in the VSM
  Probabilistic Model
  Latent Semantic Indexing

41 Interactive IR
Recall: IR is the process of finding the right information
  This includes more than a single query
Relevance feedback
  User poses a query
  System answers
  User judges the relevance of the top-k returned docs
  System considers the feedback and generates a new (improved) answer
  Loop until satisfaction
The user never needs to pose another query; the new query is generated by the system
[Figure: user and IR system exchanging query and (ranked) results]

42 Relevance Feedback
User judges the current answers
Basic assumptions
  Relevant docs are somewhat similar to each other: the common core should be emphasized
  Irrelevant docs are different from relevant docs: the differences should be de-emphasized
System adapts to the feedback by
  Query expansion: add new terms to the query (from the relevant documents)
  Term re-weighting: assign new weights to terms (from the relevant documents)

43 Rationale
Assume we knew the set R of all relevant docs. How would a good query for R look?

  v_q = (1/|R|) · Σ_{d∈R} v_d - (1/|D\R|) · Σ_{d∈D\R} v_d

Idea
  Weight terms high that are in R
  Weight terms low that are not in R, i.e., that are in N = D\R
Remark
  It is not clear that this is the best of all queries
  Relevance feedback is a heuristic

44 Rocchio Algorithm
Usually we do not know the real R
Best bet: let R (N) be the set of docs marked as relevant (irrelevant)
Adapt the query vector as follows:

  v_q,new = α·v(q) + β·(1/|R|)·Σ_{d∈R} v_d - γ·(1/|N|)·Σ_{d∈N} v_d

This implies query expansion and term re-weighting
How to choose α, β, γ?
Alternative:

  v_q,new = α·v(q) + β·(1/|R|)·Σ_{d∈R} v_d - γ·v_d'  with d' = arg max_{d∈N} sim(q, d)

Intuition: non-relevant docs are heterogeneous and tear in every direction; better to take only the worst one

45 Example (simplified)
α = 0.5, β = 0.5, γ = 0
d1 = "information science" = (0.2, 0.8)
d2 = "retrieval systems" = (0.9, 0.1)
q = "retrieval of information" = (0.7, 0.3)
If d1 is marked relevant: q' = ½q + ½d1 = (0.45, 0.55)
If d2 is marked relevant: q' = ½q + ½d2 = (0.80, 0.20)
[Figure: query and document vectors in the plane; the updated query moves toward the marked document. Source: A. Nürnberger, IR lecture]
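
The Rocchio update from this example translates into a short function; the centroid helper generalizes it to more than one marked document.

```python
# Rocchio update: q_new = alpha*q + beta*centroid(R) - gamma*centroid(N),
# with the slide's parameters alpha = beta = 0.5, gamma = 0 as defaults.
def rocchio(q, relevant, irrelevant, alpha=0.5, beta=0.5, gamma=0.0):
    n = len(q)
    def centroid(vecs):
        if not vecs:
            return [0.0] * n
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(n)]
    rel, irr = centroid(relevant), centroid(irrelevant)
    return [alpha * q[i] + beta * rel[i] - gamma * irr[i] for i in range(n)]

q = [0.7, 0.3]
d1 = [0.2, 0.8]
print(rocchio(q, [d1], []))  # close to (0.45, 0.55), as on the slide
```

Marking d2 = (0.9, 0.1) relevant instead pulls the query the other way, to (0.80, 0.20).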

46 Results
Advantages
  Improved results compared to single queries (many studies)
  Users need not generate new queries themselves
  An iterative process converging to the best possible answer
  Especially helpful for increasing recall (due to query expansion)
Disadvantages
  Requires some work by the user
    Excite: only 4% used relevance feedback (the "more of this" button)
  Writing a new query based on the returned results might be faster (and easier and more successful) than classifying results
  Based on the assumption that relevant docs are similar
    What if the user searches for all meanings of "jaguar"?
  The query is very long already after one iteration; this makes retrieval slow

47 Collaborative Filtering
What other information could we use for improving retrieval performance?
Collaborative filtering: give the user what other, similar users liked
  "Customers who bought this book also bought ..."
In IR: find users posing similar queries and look at which documents they eventually looked at
  In e-commerce, it is the buying itself (very reliable)
  In IR, we need to approximate
    Documents a searcher clicked on (if recordable)
    Did the user look at the second page? (Low credit for the first results)
    Did the user pose a refinement query next?
  All of these are not very reliable
Problems: what is a similar user? What is a similar query?

48 Thesaurus-based Query Expansion [M07, CS276]
Expand the query with synonyms and hyponyms of each term
  feline -> feline cat
One may weight added terms less than the original query terms
Often used in scientific IR systems (Medline)
Requires a high-quality thesaurus
General observation
  Increases recall
  May significantly decrease precision
    interest rate -> interest rate fascinate evaluate
  Do synonyms really exist?


More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying

More information

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors? CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This

More information

Lecture 3: Probabilistic Retrieval Models

Lecture 3: Probabilistic Retrieval Models Probabilistic Retrieval Models Information Retrieval and Web Search Engines Lecture 3: Probabilistic Retrieval Models November 5 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme

More information

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides

CSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides CSE 494/598 Lecture-4: Correlation Analysis LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Project-1 Due: February 12 th 2016 Analysis report:

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a

More information

Latent Semantic Analysis. Hongning Wang

Latent Semantic Analysis. Hongning Wang Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models 3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical

More information

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization

More information

MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson

MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Latent

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Notes on Latent Semantic Analysis

Notes on Latent Semantic Analysis Notes on Latent Semantic Analysis Costas Boulis 1 Introduction One of the most fundamental problems of information retrieval (IR) is to find all documents (and nothing but those) that are semantically

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Ranked Retrieval (2)

Ranked Retrieval (2) Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

Manning & Schuetze, FSNLP (c) 1999,2000

Manning & Schuetze, FSNLP (c) 1999,2000 558 15 Topics in Information Retrieval (15.10) y 4 3 2 1 0 0 1 2 3 4 5 6 7 8 Figure 15.7 An example of linear regression. The line y = 0.25x + 1 is the best least-squares fit for the four points (1,1),

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium

Mapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium Introduction Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure

More information

IR Models: The Probabilistic Model. Lecture 8

IR Models: The Probabilistic Model. Lecture 8 IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index

More information

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

CLRG Biocreative V

CLRG Biocreative V CLRG ChemTMiner @ Biocreative V Sobha Lalitha Devi., Sindhuja Gopalan., Vijay Sundar Ram R., Malarkodi C.S., Lakshmi S., Pattabhi RK Rao Computational Linguistics Research Group, AU-KBC Research Centre

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

1. Ignoring case, extract all unique words from the entire set of documents.

1. Ignoring case, extract all unique words from the entire set of documents. CS 378 Introduction to Data Mining Spring 29 Lecture 2 Lecturer: Inderjit Dhillon Date: Jan. 27th, 29 Keywords: Vector space model, Latent Semantic Indexing(LSI), SVD 1 Vector Space Model The basic idea

More information

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science

Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen January 20, 2004 Page 1 UserTask Retrieval Classic Model Boolean

More information

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;

More information

how to *do* computationally assisted research

how to *do* computationally assisted research how to *do* computationally assisted research digital literacy @ comwell Kristoffer L Nielbo knielbo@sdu.dk knielbo.github.io/ March 22, 2018 1/30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 class Person(object):

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume

More information

Towards Collaborative Information Retrieval

Towards Collaborative Information Retrieval Towards Collaborative Information Retrieval Markus Junker, Armin Hust, and Stefan Klink German Research Center for Artificial Intelligence (DFKI GmbH), P.O. Box 28, 6768 Kaiserslautern, Germany {markus.junker,

More information

Semantic Similarity from Corpora - Latent Semantic Analysis

Semantic Similarity from Corpora - Latent Semantic Analysis Semantic Similarity from Corpora - Latent Semantic Analysis Carlo Strapparava FBK-Irst Istituto per la ricerca scientifica e tecnologica I-385 Povo, Trento, ITALY strappa@fbk.eu Overview Latent Semantic

More information

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287

Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare

More information

Behavioral Data Mining. Lecture 2

Behavioral Data Mining. Lecture 2 Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision

More information

Lecture 1b: Text, terms, and bags of words

Lecture 1b: Text, terms, and bags of words Lecture 1b: Text, terms, and bags of words Trevor Cohn (based on slides by William Webber) COMP90042, 2015, Semester 1 Corpus, document, term Body of text referred to as corpus Corpus regarded as a collection

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Parsing with Probabilistic Context-Free Grammar Ulf Leser Content of this Lecture Phrase-Structure Parse Trees Probabilistic Context-Free Grammars Parsing with PCFG Other

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,

More information

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics

More information

Text mining and natural language analysis. Jefrey Lijffijt

Text mining and natural language analysis. Jefrey Lijffijt Text mining and natural language analysis Jefrey Lijffijt PART I: Introduction to Text Mining Why text mining The amount of text published on paper, on the web, and even within companies is inconceivably

More information

Information Retrieval Basic IR models. Luca Bondi

Information Retrieval Basic IR models. Luca Bondi Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to

More information

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics

More information

CSCE 561 Information Retrieval System Models

CSCE 561 Information Retrieval System Models CSCE 561 Information Retrieval System Models Satya Katragadda 26 August 2015 Agenda Introduction to Information Retrieval Inverted Index IR System Models Boolean Retrieval Model 2 Introduction Information

More information

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton

Language Models. CS6200: Information Retrieval. Slides by: Jesse Anderton Language Models CS6200: Information Retrieval Slides by: Jesse Anderton What s wrong with VSMs? Vector Space Models work reasonably well, but have a few problems: They are based on bag-of-words, so they

More information

A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR)

A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) Chirag Shah Dept. of CSE IIT Bombay, Powai Mumbai - 400 076, Maharashtra, India. Email: chirag@cse.iitb.ac.in

More information

Querying. 1 o Semestre 2008/2009

Querying. 1 o Semestre 2008/2009 Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 12: Latent Semantic Indexing and Relevance Feedback Paul Ginsparg Cornell

More information

PROBABILISTIC LATENT SEMANTIC ANALYSIS

PROBABILISTIC LATENT SEMANTIC ANALYSIS PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications

More information

More Smoothing, Tuning, and Evaluation

More Smoothing, Tuning, and Evaluation More Smoothing, Tuning, and Evaluation Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.) ENLP 21 September 2016 1 Review: 2 Naïve Bayes Classifier w

More information

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing

Semantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding

More information