Dealing with Text Databases


Colin Perkins

Topics: unstructured data; Boolean queries; sparse matrix representation; inverted index; counts vs. frequencies; term frequency; tf x idf term weights; documents as vectors; cosine similarity; dimensionality reduction; vectors and Boolean queries.

http://www-csli.stanford.edu/~schuetze/information-retrieval-book.


Christopher Manning, Prabhakar Raghavan, Hinrich Schütze (University of Stuttgart)

Unstructured data
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? (Calpurnia: third and last wife of Julius Caesar.)
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial

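Term-document incidence
[Table: term-document incidence matrix. Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. An entry is 1 if the play contains the word, 0 otherwise. Cell values not recoverable.]
Query: Brutus AND Caesar but NOT Calpurnia

Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
[Worked bit-vector computation not recoverable.]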
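The bitwise AND over incidence vectors can be sketched in Python. The play order is taken from the table header. The 0/1 rows for Brutus, Caesar and Calpurnia are assumed values that follow the usual textbook version of this example (they are not present in the text above), and the helper names are illustrative only.

```python
# Column order of the incidence matrix (from the table header).
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# ASSUMPTION: these 0/1 rows are not in the slide text; they follow the
# commonly cited textbook version of the example.
INCIDENCE = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def to_bits(row):
    """Pack a 0/1 row into an integer; the first play is the leftmost bit."""
    return int("".join(str(bit) for bit in row), 2)

def brutus_and_caesar_not_calpurnia():
    """Evaluate Brutus AND Caesar AND NOT Calpurnia with bitwise operators."""
    mask = 2 ** len(PLAYS) - 1                      # keep only len(PLAYS) bits
    brutus = to_bits(INCIDENCE["Brutus"])
    caesar = to_bits(INCIDENCE["Caesar"])
    not_calpurnia = ~to_bits(INCIDENCE["Calpurnia"]) & mask  # complement
    result = brutus & caesar & not_calpurnia
    bits = format(result, "0{}b".format(len(PLAYS)))
    return [play for play, bit in zip(PLAYS, bits) if bit == "1"]

answers = brutus_and_caesar_not_calpurnia()
```

With the assumed rows, `answers` holds the two plays named on the "Answers to query" slide.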

Answers to query
The AND of the three bit vectors selects: Antony and Cleopatra, Hamlet.
[Bit vectors and repeated incidence table not recoverable.]

Sparse matrix representation
For real data the matrix becomes very big.
The matrix has many more zeros than ones: it is extremely sparse.
Why? Not every term (word) is present in every document.
What's a better representation? We only record the 1 positions.

Inverted index
For each term T, we must store a list of all documents that contain T.
Do we use an array or a list for this?
[Figure: postings for Brutus, Calpurnia and Caesar; document numbers not recoverable.]
What happens if the word Caesar is added to document 14?

Inverted index
Linked lists are generally preferred to arrays:
  Dynamic space allocation
  Insertion of terms into documents is easy
  Space overhead of pointers
[Figure: linked postings lists for Brutus, Calpurnia and Caesar; document numbers not recoverable.]
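To make the "Caesar is added to document 14" question concrete, here is a minimal Python sketch of inserting a document number into a sorted postings list. The postings values and the `add_posting` helper are hypothetical. Python lists are arrays underneath, so the insert shifts later entries, which is exactly the cost that motivates linked lists on the slide.

```python
import bisect

def add_posting(index, term, doc_id):
    """Insert doc_id into the sorted postings list of term, without duplicates."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        # list.insert shifts every later entry: O(n) for an array-backed list.
        postings.insert(pos, doc_id)

# Hypothetical postings, for illustration only.
index = {"caesar": [1, 2, 16]}
add_posting(index, "caesar", 14)
add_posting(index, "caesar", 14)   # second call is a no-op
add_posting(index, "brutus", 3)    # a new term gets a fresh list
```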

Inverted index construction
Documents to be indexed: "Friends, Romans, countrymen."
  → Tokenizer → token stream: Friends Romans Countrymen
  → Linguistic modules → modified tokens: friend roman countryman
  → Indexer → inverted index: friend: 2, 4; roman: 1, 2; countryman: [postings not recoverable]

Boolean queries: Exact match
The Boolean retrieval model lets us ask a query that is a Boolean expression:
  Boolean queries use AND, OR and NOT to join query terms
  Views each document as a set of words (terms)
  Is precise: a document matches the condition or not
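The tokenizer, linguistic modules and indexer stages can be sketched as a small Python pipeline, followed by an exact-match AND query over the resulting postings. This is a simplified illustration: the document texts and function names are made up, and lowercasing stands in for the linguistic modules (no stemming, so "friends" is not reduced to "friend").

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokenizer: split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize_token(token):
    """Stand-in for the linguistic modules: lowercasing only (ASSUMPTION)."""
    return token.lower()

def build_index(docs):
    """Indexer: map each term to a sorted list of document numbers."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize_token(t) for t in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)

def intersect(p1, p2):
    """Boolean AND: merge two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical documents, for illustration only.
docs = {1: "Friends, Romans, countrymen.",
        2: "Romans and friends",
        4: "Friends of Caesar"}
index = build_index(docs)
both = intersect(index["friends"], index["romans"])
```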

Exact match
Primary commercial retrieval tool for 3 decades.
Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

Scoring
Our queries have all been Boolean.
  Good for expert users with a precise understanding of their needs and the corpus
  Not good for (the majority of) users with poor Boolean formulation of their needs

Scoring
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the docs in the corpus with respect to a query?
Assign a score, say in [0,1], for each doc on each query.

Incidence matrices
Recall: a document (or a zone in it) is a binary vector X in {0,1}^v.
The query is a vector Y.
Score: overlap measure $|X \cap Y|$
[Table: term-document incidence matrix for the same plays and terms; cell values not recoverable.]

Example
On the query "ides of march", Shakespeare's Julius Caesar has a score of 3.
All other Shakespeare plays have a score of 2 (because they contain "march") or 1.
Thus in a rank order, Julius Caesar would come out tops.

Overlap matching
What's wrong with the overlap measure? It doesn't consider:
  Term frequency in document
  Term scarcity in collection (document mention frequency): "of" is more common than "ides" or "march"
  Length of documents
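The overlap measure is simply the size of the intersection of the document's and the query's term sets. A minimal Python sketch follows; the two document strings are invented for illustration and `overlap_score` is a hypothetical helper name.

```python
def overlap_score(doc_text, query_text):
    """Overlap measure: number of distinct query terms present in the document."""
    doc_terms = set(doc_text.lower().split())
    query_terms = set(query_text.lower().split())
    return len(doc_terms & query_terms)

query = "ides of march"
# Hypothetical snippets standing in for whole plays.
doc_a = "beware the ides of march"
doc_b = "the march of time of of of"

score_a = overlap_score(doc_a, query)
score_b = overlap_score(doc_b, query)
```

Note that `doc_b` gets no extra credit for repeating "of", and neither document is penalized for its length, which is the weakness listed above.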

Scoring: density-based
Obvious next idea: if a document talks about a topic more, then it is a better match.
This applies even when we only have a single query term.
A document is relevant if it has a lot of the terms.
This leads to the idea of term weighting.

Term-document count matrices
Consider the number of occurrences of a term in a document:
  Bag of words model
  Document is a vector in ℕ^v: a column below
[Table: term-document count matrix for the same plays and terms; counts not recoverable.]

Counts vs. frequencies
Consider again the "ides of march" query.
  Julius Caesar has 5 occurrences of "ides"
  No other play has "ides"
  "march" occurs in over a dozen
  All the plays contain "of"
By this scoring measure, the top-scoring play is likely to be the one with the most "of"s.

Digression: terminology
WARNING: In a lot of IR literature, "frequency" is used to mean "count".
Thus term frequency in IR literature is used to mean the number of occurrences in a doc, not divided by document length (which would actually make it a frequency).
We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.

Term frequency tf
Long docs are favored because they're more likely to contain query terms.
Can fix this to some extent by normalizing for document length.
But is raw tf the right measure?

Weighting term frequency: tf
What is the relative importance of
  0 vs. 1 occurrence of a term in a doc
  1 vs. 2 occurrences
  2 vs. 3 occurrences
Unclear: while it seems that more is better, a lot isn't proportionally better than a few.
Can just use raw tf.
Another option commonly used in practice (t = term, d = document):
$wf_{t,d} = 0$ if $tf_{t,d} = 0$, and $wf_{t,d} = 1 + \log tf_{t,d}$ otherwise
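The log-scaled weight wf can be written in a few lines of Python. The slide does not state the base of the logarithm; the natural log is assumed here, and a different base would only rescale the nonzero weights' log part.

```python
import math

def wf(tf):
    """Log-scaled term frequency: 0 if tf == 0, else 1 + log(tf).

    ASSUMPTION: natural logarithm; the source does not specify the base.
    """
    if tf == 0:
        return 0.0
    return 1.0 + math.log(tf)

# Raw counts 1, 2, 3, 10 and their damped weights.
weights = [wf(tf) for tf in (1, 2, 3, 10)]
```

Going from 1 to 10 occurrences raises the weight from 1.0 to roughly 3.3 rather than tenfold, which matches the remark that a lot is not proportionally better than a few.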

Score computation
Score for a query q (several terms) = sum over terms t in q:
$\sum_{t \in q} tf_{t,d}$
[Note: 0 if no query terms in document]
This score can be zone-combined.
Can use wf instead of tf in the above.
Still doesn't consider term scarcity in collection ("ides" is rarer than "of").

Weighting should depend on the term overall
Which of these tells you more about a doc?
  10 occurrences of "hernia"?
  10 occurrences of "the"?
Would like to attenuate the weight of a common term.
  But what is "common"?
Suggest looking at collection frequency (cf): the total number of occurrences of the term in the entire collection of documents.
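The score computation (sum of tf, or of wf, over the query terms) might look like the following Python sketch. The sample document string, the whitespace tokenization and the function names are assumptions made for illustration; the natural log in `wf` is likewise assumed.

```python
import math
from collections import Counter

def term_counts(text):
    """Bag of words: term -> number of occurrences (whitespace tokens)."""
    return Counter(text.lower().split())

def wf(tf):
    """Log-scaled tf (natural log assumed)."""
    return 0.0 if tf == 0 else 1.0 + math.log(tf)

def score(query, counts, weight=lambda tf: tf):
    """Sum the (weighted) term frequency over the distinct query terms."""
    return sum(weight(counts[t]) for t in set(query.lower().split()))

# Hypothetical document text.
counts = term_counts("the ides of march are come of of")
raw_score = score("ides of march", counts)         # uses raw tf
damped_score = score("ides of march", counts, wf)  # uses wf instead
```

A `Counter` returns 0 for absent terms, so a document with no query terms scores 0, as the note on the slide says.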

Document frequency
But document frequency (df) may be better: df = number of docs in the corpus containing the term.
[Table: cf and df for the words "alfa" and "insurance"; values not recoverable.]
Document/collection frequency weighting is only possible in a known (static) collection.
The number of documents in the entire collection of documents
So how do we make use of df?

tf x idf term weights
The tf x idf measure combines:
  term frequency (tf) or wf: some measure of term density in a doc
  inverse document frequency (idf): a measure of the informativeness of a term, its rarity across the whole corpus
    could just be the raw count of the number of documents the term occurs in ($idf_i = 1/df_i$)
    but by far the most commonly used version is $idf_i = \log(N / df_i)$, with N the total number of documents
See Kishore Papineni, NAACL 2, 2002 for theoretical justification.

Summary: tf x idf (or tf.idf)
Assign a tf.idf weight to each term i in each document d:
$w_{i,d} = tf_{i,d} \times \log(N / df_i)$
  $tf_{i,d}$ = frequency of term i in document d
  N = total number of documents
  $df_i$ = the number of documents that contain term i
What is the weight of a term that occurs in all of the docs?
The weight increases with the number of occurrences within a doc.
The weight increases with the rarity of the term across the whole corpus.
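A short Python sketch of the tf.idf weight also answers the question about a term that occurs in every document. The natural logarithm is an assumption (the slide does not give a base), and the numbers passed in are invented.

```python
import math

def tf_idf(tf, df, n_docs):
    """w = tf * log(N / df); natural log assumed, df must be at least 1."""
    return tf * math.log(n_docs / df)

# A term present in all 10 documents: log(10 / 10) = 0, so its weight vanishes.
everywhere = tf_idf(tf=3, df=10, n_docs=10)

# A term present in 1 of 100 documents gets a large weight.
rare = tf_idf(tf=2, df=1, n_docs=100)
```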


Real-valued term-document matrices
Function (scaling) of the count of a word in a document:
  Bag of words model
  Each is a vector in ℝ^v
  Here log-scaled tf.idf
Note: entries can exceed 1!
[Table: log-scaled tf.idf weights for the same plays and terms; values not recoverable.]

Documents as vectors
Each doc j can now be viewed as a vector of wf × idf values, one component for each term.
So we have a vector space:
  terms are axes
  docs live in this space
  even with stemming, may have 20,000+ dimensions

Why turn docs into vectors?
First application: query-by-example.
  Given a doc d, find others "like" it.
Now that d is a vector, find vectors (docs) "near" it.

Intuition
[Figure: document vectors d1 to d5 in a space with term axes t1, t2, t3, with angles θ and φ marked between vectors.]
Postulate: documents that are "close together" in the vector space talk about the same things.

Desiderata for proximity
If d1 is near d2, then d2 is near d1.
If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
No doc is closer to d than d itself.

First cut
Idea: the distance between d1 and d2 is the length of the vector d1 − d2 (Euclidean distance).
Why is this not a great idea?
We still haven't dealt with the issue of length normalization.
  Short documents would be more similar to each other by virtue of length, not topic
However, we can implicitly normalize by looking at angles instead.

Cosine similarity
The distance between vectors d1 and d2 is captured by the cosine of the angle x between them.
Note: this is similarity, not distance.
  No triangle inequality for similarity.
[Figure: vectors d1 and d2 separated by angle θ, with term axes t1, t2, t3.]
Varies from 0 to 1 (the weights are non-negative).

Cosine similarity
A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm:
$\|x\|_2 = \sqrt{\sum_i x_i^2}$
This maps vectors onto the unit sphere. Then
$|\vec{d}_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^2} = 1$
Longer documents don't get more weight.
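L2 normalization and cosine similarity can be sketched in plain Python as follows. The vectors are arbitrary illustrative values and the function names are not from the source.

```python
import math

def l2_norm(v):
    """Length of v under the L2 norm."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    length = l2_norm(v)
    return list(v) if length == 0 else [x / length for x in v]

def cosine(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times longer
same_direction = cosine(short_doc, long_doc)
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

The long document points in the same direction as the short one, so their cosine is 1 up to rounding: length alone adds no weight.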

Normalized vectors
For normalized vectors, the cosine is simply the dot product:
$\cos(\vec{d}_j, \vec{d}_k) = \vec{d}_j \cdot \vec{d}_k$
Varies from 0 to 1.

Example
Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Bronte's Wuthering Heights (WH).
[Tables: tf weights, then normalized weights, for the terms affection, jealous, gossip in SaS, PaP, WH; values not recoverable.]
cos(SaS, PaP) = .996 × … + … × … + … × 0.0 = …
cos(SaS, WH) = .996 × … + … × … + … × .254 = …

Euclidean distance between vectors:
$|\vec{d}_j - \vec{d}_k| = \sqrt{\sum_{i=1}^{n} (d_{i,j} - d_{i,k})^2}$
For normalized vectors, Euclidean distance gives the same proximity ordering as the cosine measure.

Queries in the vector space model
Central idea: the query as a vector.
We regard the query as a short document.
We return the documents ranked by the closeness of their vectors to the query, also represented as a vector:
$sim(\vec{d}_j, \vec{d}_q) = \frac{\vec{d}_j \cdot \vec{d}_q}{|\vec{d}_j| \, |\vec{d}_q|} = \frac{\sum_{i=1}^{n} w_{i,j} w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$
Note that d_q is very sparse!
Varies from 0 to 1.
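For unit vectors a and b, $|a - b|^2 = 2 - 2\,a \cdot b$, so a larger cosine always means a smaller Euclidean distance. The Python sketch below ranks a few documents against a query both ways and shows the orderings agree. The three document vectors and the query are invented, and the function names are illustrative.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_cosine(query, docs):
    """Document names, most similar first (dot product of unit vectors)."""
    q = normalize(query)
    sims = {name: dot(q, normalize(vec)) for name, vec in docs.items()}
    return sorted(sims, key=sims.get, reverse=True)

def rank_by_distance(query, docs):
    """Document names, nearest first (Euclidean distance of unit vectors)."""
    q = normalize(query)
    dists = {name: euclidean(q, normalize(vec)) for name, vec in docs.items()}
    return sorted(dists, key=dists.get)

# Hypothetical term-weight vectors over three terms.
docs = {"d1": [3.0, 0.0, 1.0], "d2": [1.0, 2.0, 0.0], "d3": [0.0, 0.0, 5.0]}
query = [1.0, 0.0, 0.0]   # a sparse, one-term query
by_cosine = rank_by_cosine(query, docs)
by_distance = rank_by_distance(query, docs)
```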


Relation

Dimensionality reduction
What if we could take our vectors and "pack" them into fewer dimensions (say 50, ) while preserving distances? (Well, almost.)
Speeds up cosine computations.


Measures for Results
All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise.
The key measure: user happiness. What is this?
  Speed of response/size of index are factors
  But blindingly fast, useless answers won't make a user happy
Need a way of quantifying user happiness.

Measuring user happiness
Issue: who is the user we are trying to make happy? Depends on the setting.
Web engine: the user finds what they want and returns to the engine
  Can measure rate of return users
eCommerce site: the user finds what they want and makes a purchase
  Is it the end-user, or the eCommerce site, whose happiness we measure?
  Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness
Enterprise (company/govt/academic): care about "user productivity"
  How much time do my users save when looking for information?
  Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure
Commonest proxy: relevance of search results.
But how do you measure relevance?
We will detail a methodology here, then examine its issues.
Relevance measurement requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Some work on more-than-binary, but not the standard.

Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has those words.

Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                 Relevant   Not Relevant
Retrieved        tp         fp
Not Retrieved    fn         tn

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)
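Precision and recall follow directly from the contingency table. A minimal Python sketch, using made-up document numbers for the retrieved and relevant sets:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical judgments for one query.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
p, r = precision_recall(retrieved, relevant)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero; the slides do not prescribe one.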

Accuracy
Given a query, an engine classifies each doc as "Relevant" or "Irrelevant".
Accuracy of an engine: the fraction of these classifications that is correct.
Why is this not a very useful evaluation measure in IR?
No result, 100% accuracy...

Precision/Recall
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved.
In a good system, precision decreases as either the number of docs retrieved or recall increases.
  A fact with strong empirical confirmation
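A small numeric illustration of the accuracy problem, in Python. The collection size and the number of relevant documents are invented: because almost every document is irrelevant to any given query, an engine that returns nothing is still almost always "correct".

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of relevant docs that were retrieved (0.0 if none exist)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical collection: 1,000,000 docs, 10 relevant, engine returns nothing.
empty_engine_accuracy = accuracy(tp=0, fp=0, fn=10, tn=999_990)
empty_engine_recall = recall(tp=0, fn=10)
```

The empty result list scores an accuracy of about 0.99999 while finding none of the relevant documents, which is why precision and recall are used instead.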

Difficulties in using precision/recall
Should average over large corpus/query ensembles.
Need human relevance assessments.
  People aren't reliable assessors
Assessments have to be binary.
  Nuanced assessments?
Heavily skewed by corpus/authorship.
  Results may not translate from one domain to another

A combined measure: F
The combined measure that assesses this tradeoff is the F measure (weighted harmonic mean):
$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
People usually use the balanced F1 measure, i.e., with β = 1 or α = ½.
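Both forms of the F measure can be checked against each other in Python. The two forms agree when α = 1/(β² + 1), which gives α = ½ for β = 1. The precision and recall values below are arbitrary, and returning 0.0 when P or R is zero is a convention assumed here.

```python
def f_beta(p, r, beta=1.0):
    """F = (beta^2 + 1) P R / (beta^2 P + R); 0.0 if P or R is zero."""
    if p == 0 or r == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

def f_alpha(p, r, alpha=0.5):
    """F = 1 / (alpha / P + (1 - alpha) / R); 0.0 if P or R is zero."""
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

p, r = 0.5, 2 / 3
f1 = f_beta(p, r)                    # balanced: beta = 1
f2 = f_beta(p, r, beta=2.0)          # weights recall more heavily
f2_alpha = f_alpha(p, r, alpha=0.2)  # alpha = 1 / (2^2 + 1)
```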

Evaluating ranked results
Evaluation of ranked results:
  The system can return any number of results
  By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

A precision-recall curve
[Figure: precision-recall curve; axes labeled Precision and Recall.]
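The points of a precision-recall curve can be computed by cutting the ranked list after each position. A minimal Python sketch with an invented ranking and relevance set:

```python
def pr_points(ranked, relevant):
    """(recall, precision) after each of the top-k results, k = 1..len(ranked)."""
    points = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical ranked output and relevance judgments.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
curve = pr_points(ranked, relevant)
```

Recall never decreases along the list, while precision drops each time a non-relevant document is added, giving the curve its characteristic sawtooth shape.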
