
1 Fall 2016 CS646: Information Retrieval. Lecture 6: Boolean Search and Vector Space Model. Jiepu Jiang, University of Massachusetts Amherst. 2016/09/26

2 Outline Today Boolean Retrieval Vector Space Model Latent semantic indexing

3 Boolean Retrieval Search by exact match. Query terms are combined with logic operators such as AND, OR, NOT, which can be nested, e.g., x AND ( y OR ( a AND b ) OR c ). Conceptually returns a set of results, without ranking, although most systems implement certain rankings. Boolean queries can also be used to filter results in ranked retrieval. [Venn diagrams illustrating x AND y, x OR y, and NOT x.]
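
To make the set semantics concrete, here is a minimal sketch of Boolean retrieval over an in-memory inverted index. The index contents and document IDs are made up for illustration; real systems operate on sorted, compressed posting lists rather than Python sets.

```python
# Illustrative inverted index: term -> set of IDs of documents containing it.
index = {
    "cat":  {1, 2, 3},
    "dog":  {2, 3, 4},
    "lion": {3},
}
all_docs = {1, 2, 3, 4}

def AND(x, y): return x & y          # set intersection
def OR(x, y):  return x | y          # set union
def NOT(x):    return all_docs - x   # set complement

# A nested query: cat AND (lion OR (dog AND NOT cat))
result = AND(index["cat"], OR(index["lion"], AND(index["dog"], NOT(index["cat"]))))
print(result)  # {3} -- an unranked set of matching documents
```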

4 Boolean Retrieval: Advantages Precise. Transparent, controllable, predictable (for trained users). Can be implemented very efficiently (ignores term frequency). Works well when you know what the collection contains and what you're looking for. Works well if the corpus is small enough for a human to handle, so one has the chance to get familiar with the corpus. Still widely used today, especially in specialized search systems: Westlaw law search, ISI Web of Knowledge science citation index search, and almost all library catalog systems. Many professionals prefer Boolean retrieval (librarians, lawyers, physicians). Modified from James Allan's CS646 slides.

5 Boolean Retrieval: Disadvantages Lacks a ranking mechanism, which is especially important. Effectiveness depends highly on the user's ability to formulate good queries: users need sufficient training and knowledge about the corpus to formulate good queries, general users do not have that knowledge, and sometimes it is impossible to be familiar with the corpus, e.g., the web. People are lazy: formulating Boolean queries has a much higher cost. Not really that controllable: AND gives too few results; OR gives too many. Modified from James Allan's CS646 slides.

6 Ranked Retrieval (Best Match Search) Returns a ranked list of results; necessary for a large corpus such as the web. Free text search: necessary and much easier for ordinary users. But one can also use Boolean queries to filter search results and just rank the filtered results, e.g., index search NOT database. Requires some ranking model (the core of best-match search). The basis for processing a free text query is Boolean search: web search engines use Boolean AND (fast and usually enough; it may not work for bad queries, so drop one or a few terms), while IR experiments and research use Boolean OR (more accurate experiment results). Modified from James Allan's CS646 slides.

7 Ranked Retrieval (Best Match Search) Vector space model (today!). Probabilistic models, e.g., BM25 (Wed). Language modeling approaches (next week). Document representation (two weeks later). Query representation (two weeks later). Learning-to-rank (three weeks later). Midterm (11/3, 7-9pm), right after we finish evaluation: retrieval models (about 50%), evaluation (about 30%), other (about 20%).

8 Outline Today Boolean Search Vector Space Model Latent semantic indexing [Photo of Gerard Salton (Gerard Salton Award, 1983).]

9 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension: V = {t_1, t_2, ..., t_k}; the vocabulary has k unique terms. Example vocabulary: {cat, dog, lion}, with t_1 = cat, t_2 = dog, t_3 = lion. [Figure: a 3-dimensional space with axes t_1: cat, t_2: dog, t_3: lion.]

10 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_1: cat cat cat, so D_1 = (3, 0, 0). Notation: t is an index term; w is a term's weight (in this example, the term's frequency).

11 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_2: cat dog cat, so D_2 = (2, 1, 0). We ignore word sequence in this simple VSM example; this is also called the bag-of-words model.

12 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_3: cat dog lion dog, so D_3 = (1, 2, 1). We ignore word sequence in this simple VSM example; this is also called the bag-of-words model.
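
The three document vectors above can be produced mechanically. A small sketch of the bag-of-words counting step, using the vocabulary and documents from the slides:

```python
from collections import Counter

vocab = ["cat", "dog", "lion"]  # t1, t2, t3

def to_vector(text):
    """Bag-of-words: count each vocabulary term, ignoring word order."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

print(to_vector("cat cat cat"))       # [3, 0, 0] = D1
print(to_vector("cat dog cat"))       # [2, 1, 0] = D2
print(to_vector("cat dog lion dog"))  # [1, 2, 1] = D3
```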

13 Vector Space Model: A Simple Example VSM makes it easy to measure similarity, which implies relevance. Suppose we have four documents: D_1: cat cat cat (discusses only cat); D_2: cat cat dog (discusses both, but more on cat); D_3: cat dog dog (discusses both, but more on dog); D_4: dog dog dog (discusses only dog). To what extent do the four documents relate to each other? Which is the most related to D_4? Probably D_3 > D_2 > D_1. We need a model to capture such relatedness/relevance.

14 Recall Luhn's idea (Lecture 2): similarity implies relevance. "The more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information." H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), 1957.

15 Vector Space Model: A Simple Example The direction of a vector indicates the distribution of words. Comparing two documents' word distributions is equivalent to measuring the size of the angle between the two documents' vectors: a smaller angle indicates a higher degree of similarity. This applies to any k-dimensional space. Angle(D_4, D_3) < Angle(D_4, D_2) < Angle(D_4, D_1), so Similarity(D_4, D_3) > Similarity(D_4, D_2) > Similarity(D_4, D_1). [Figure: vectors in the cat-dog plane: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), D_4 = (0, 3, 0), D_5 = (3, 6, 0).]

16 Vector Space Model: A Simple Example Computationally it is easier to use the cosine as a surrogate: y = cos(x) is monotonically decreasing for x in [0, π/2], so a higher cosine value indicates a higher level of similarity. Angle(D_4, D_3) < Angle(D_4, D_2) < Angle(D_4, D_1) implies cos(D_4, D_3) > cos(D_4, D_2) > cos(D_4, D_1). [Figure: plot of y = cos(x) on [0, π/2].]

17 Cosine Similarity: Computation For x = [x_1 x_2 ... x_k] and y = [y_1 y_2 ... y_k]: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2} \, \sqrt{\sum_{i=1}^{k} y_i^2}}$, where $x \cdot y$ is the dot product and $\|x\|$ is the Euclidean length.
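
A direct sketch of this formula, checked against the D4-versus-D1/D2/D3 comparison on the previous slide:

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1, d2, d3, d4 = [3, 0, 0], [2, 1, 0], [1, 2, 0], [0, 3, 0]
print(cosine(d4, d3), cosine(d4, d2), cosine(d4, d1))
# ~0.894 > ~0.447 > 0.0, i.e., Similarity(D4, D3) > (D4, D2) > (D4, D1)
```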

18 Cosine similarity ignores vector length For x = [x_1 x_2 ... x_k] and y = [y_1 y_2 ... y_k], the cosine is the dot product of two unit vectors: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = \sum_{i=1}^{k} \frac{x_i}{\|x\|} \frac{y_i}{\|y\|}$.

19 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: dog, so q = (0, 1, 0). Similarity ranking: D_3: cat dog dog > D_2: cat cat dog > D_1: cat cat cat. [Figure: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), and q = (0, 1, 0) in the cat-dog plane.]

20 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: cat dog cat, so q = (2, 1, 0). Similarity ranking: D_2: cat cat dog > D_1: cat cat cat > D_3: cat dog dog. Note that D_2 = (2, 1, 0) has exactly the same direction as q. [Figure: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), and q = (2, 1, 0) in the cat-dog plane.]
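
Putting the query vector and the cosine together, a sketch of ranked retrieval for the query "cat dog cat"; it reproduces the ordering shown above:

```python
import numpy as np

def cosine(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

docs = {"D1": [3, 0, 0], "D2": [2, 1, 0], "D3": [1, 2, 0]}
q = [2, 1, 0]  # "cat dog cat" as a raw term-frequency vector

for d in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
    print(d, round(cosine(q, docs[d]), 3))
# D2 1.0, D1 0.894, D3 0.8
```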

21 Vector Space Model: A Simple Example Hmmm, seems problematic. We'll talk about some extensions very soon. No best-match search model so far is perfect, but they are reasonably good and useful. Query: cat dog cat. Similarity ranking: D_2: cat cat dog > D_1: cat cat cat > D_3: cat dog dog. Real relevance: D_2: cat cat dog > D_3: cat dog dog > D_1: cat cat cat.

22 Cosine Similarity: IR Computation For q = [q_1 ... q_k] and d = [d_1 ... d_k]: $\cos(q, d) = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2} \, \sqrt{\sum_{i=1}^{k} d_i^2}} \propto \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} d_i^2}}$, because $\|q\|$ is the same for every document and thus independent of the ranking. It is faster if the index stores each document's Euclidean length (different from document length).
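
A sketch of the rank-equivalent computation: since ||q|| is shared by all documents, we can rank by the dot product divided by a document norm precomputed at indexing time (the documents here are the illustrative ones from the earlier slides):

```python
import numpy as np

docs = {"D1": np.array([3.0, 0.0, 0.0]),
        "D2": np.array([2.0, 1.0, 0.0]),
        "D3": np.array([1.0, 2.0, 0.0])}
# Euclidean lengths computed once and stored alongside the index.
doc_norm = {d: np.linalg.norm(v) for d, v in docs.items()}

def score(q, d):
    """Rank-equivalent to cosine: ||q|| is dropped since it is constant."""
    return (q @ docs[d]) / doc_norm[d]

q = np.array([2.0, 1.0, 0.0])
print(sorted(docs, key=lambda d: score(q, d), reverse=True))  # ['D2', 'D1', 'D3']
```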

23 Vector Space Model The framework is generic: define a k-dimensional space; represent each document as a vector in the k-dimensional space; represent a query as a vector in the same space; measure relevance by the similarity of the query and the document. The simple example is just a particular implementation: each term is a unique dimension, a document is a within-document term frequency vector, a query is a query term frequency vector, and relevance is measured by cos(q, D).

24 VSM: Dimension In most cases we just consider each unique word as a dimension. It looks very simple, but it works reasonably well. Many limitations: Terms (dimensions) are assumed independent of each other, ignoring synonyms (retrieve, search, seek, ...) and words related to the same topic (retrieval, index, precision, ...); we will discuss an extension (LSI) very soon. Word sequence is ignored (as all bag-of-words models do): "the woman was shot by the suspect" is equivalent to "the suspect was shot by the woman"; term proximity addresses this (Lecture 9, next week). But it is not easy to solve these limitations effectively; many solutions do not outperform the simple approach.

25 VSM: Document Vector Consider each indexed term as a unique dimension: V = {t_1, t_2, ..., t_k}, D = [w_1, w_2, ..., w_k], where w_i indicates how important the term t_i is for representing D's information. How to compute w_i? Usually problem dependent. Generally two types of factors to consider: document-dependent (Does t_i appear in D? How many times? Where, e.g., title or heading?) and document-independent (Is t_i an important word? Is it a noun/verb/adjective? Is it a number/name/emoticon?).

26 VSM: TF-IDF Weighting D = [w_1, w_2, ..., w_k], where w_i indicates how important the term t_i is for representing D's information. How to compute w_i? A popular approach is TF-IDF weighting: $w_i = TF(t_i, D) \times IDF(t_i)$, where TF is the within-document term frequency (document-dependent) and IDF is the inverse document frequency (document-independent).

27 Choices of TF: Binary Only consider whether or not a term appears in a document; ignore repeated occurrences of the same term. $TF_{binary}(t_i, D) = 1$ if $c(t_i, D) > 0$, and $0$ otherwise. When to use? Very short documents, where repeated occurrence of the same term is rare and unstable (due to the small text sample), e.g., Twitter search, passage/sentence retrieval. Notation: c(t, D) is the number of times t appears in D (term frequency).

28 Choices of TF: Raw Frequency Frequent terms in a document are more important (for representing that document's information); the importance of a term is proportional to its frequency in D: $TF_{raw}(t_i, D) = c(t_i, D)$. When to use? When you believe this is true or have evidence that it is. A caveat: frequent terms may have too strong an influence. Notation: c(t, D) is the number of times t appears in D (term frequency).

29 Choices of TF: Log Frequency Frequent terms in a document are more important, but repeated occurrences of the term are penalized: the first occurrence of a term is the most important, and repeated occurrences are less and less important. A greater log base b penalizes repetition to a greater extent. $TF_{log}(t_i, D) = 1 + \log_b c(t_i, D)$ if $c(t_i, D) > 0$, and $0$ otherwise. Notation: c(t, D) is the number of times t appears in D (term frequency).
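
The three TF choices from the last few slides as one small sketch; c stands for the raw count c(t, D):

```python
import math

def tf_binary(c):
    return 1 if c > 0 else 0

def tf_raw(c):
    return c

def tf_log(c, b=2):
    """1 + log_b(c) for c > 0; a greater base b penalizes repetition more."""
    return 1 + math.log(c, b) if c > 0 else 0

for c in [0, 1, 2, 3, 10]:
    print(c, tf_binary(c), tf_raw(c), round(tf_log(c), 2))
```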

30 [Plot comparing y = x with y = log_2 x, y = log x, and y = log_10 x.]

31 Choices of TF: Log Frequency Log TF prefers matching more unique terms: c(t_1, d) = 2, c(t_2, d) = 1 is better than c(t_1, d) = 3, c(t_2, d) = 0. Query: cat dog cat, with natural-log TF: q = (1 + ln 2, 1, 0) ≈ (1.69, 1, 0). Document vectors: D_1 ≈ (2.10, 0, 0), D_2 ≈ (1.69, 1, 0) = q, D_3 ≈ (1, 1.69, 0). Similarity ranking: D_2: cat cat dog > D_3: cat dog dog > D_1: cat cat cat.

32 Choices of TF: Others Other TF variants exist, but raw frequency, log frequency, and binary are the most popular ones. Table from the CDM textbook.

33 Choices of IDF: Uniform Every term is equally important: $IDF_{uniform}(t_i) = 1$. Almost always a bad idea. When to use? I can't remember any time it worked.

34 Choices of IDF: KSJ The original IDF by Karen Spärck Jones (KSJ): $IDF_{KSJ}(t) = \log \frac{N}{n_t}$. The log base does not affect ranking (in most retrieval models). The total frequency of the term in the corpus does not directly influence IDF (although it is almost always highly correlated with n_t). Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.

35 Choices of IDF: BM25 The IDF used in BM25 (Wed) has a probabilistic interpretation, P(w | NR) (Wed): $IDF_{BM25}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$ if $n_t < \frac{N}{2}$, and $0$ if $n_t \ge \frac{N}{2}$. The +0.5 is used for smoothing zero values and has no influence in a large corpus. This gives a greater discounting: very frequent terms have zero weight. Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.
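
A sketch of the two IDF formulas, plus the TF-IDF composition from slide 26; N and n_t are corpus statistics, and the numbers below are made up for illustration:

```python
import math

def idf_ksj(N, n_t):
    """Sparck Jones IDF: log(N / n_t)."""
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    """log((N - n_t + 0.5) / (n_t + 0.5)); zero once n_t >= N/2."""
    return max(0.0, math.log((N - n_t + 0.5) / (n_t + 0.5)))

N = 1_000_000                       # illustrative corpus size
for n_t in [10, 10_000, 600_000]:   # rare, common, very frequent term
    print(n_t, round(idf_ksj(N, n_t), 2), round(idf_bm25(N, n_t), 2))

# TF-IDF composition, w_i = TF(t_i, D) * IDF(t_i), here with log TF:
c = 3
w = (1 + math.log(c, 2)) * idf_ksj(N, 10_000)
print(round(w, 2))
```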

36 VSM: Query Vector Can use the same approach as for the document vector, but it can be different as well: SMART supports using different approaches for queries and documents, though this requires a lot of hand-tuning, and one usually avoids very different approaches. Some considerations: repeated occurrences of a query term may not mean the term is more important to the user's information need; in any case, queries are short and term repetition is very rare.

37 VSM: Similarity Measure Cosine is the most frequently used measure; it normalizes vectors by Euclidean length. Euclidean distance is another option, but it does not fit most IR applications: it is strongly influenced by document length, requiring documents to have not only a similar distribution but also a similar length. A few areas where it may apply: plagiarism detection, finding similar documents.

38 Standard VSM Summary Very simple: map everything to a vector, then compare using the angle between vectors. Challenge: finding a good weighting scheme. Variants of TF-IDF are the most common; the Okapi TF function is popular, particularly in research systems; the VSM model provides no guidance. Another challenge: the comparison/similarity function. Cosine is the most common; the generic inner product (without unit vectors) also occurs; the VSM model provides no guidance.

39 Outline Today Boolean Search Vector Space Model Latent semantic indexing [Photos of Scott Deerwester and Susan T. Dumais (Gerard Salton Award, 2009; Athena Lecturer Award, 2014).]

40 Terms are not independent dimensions Some terms tend to co-occur more or less often than others. P(t_1, t_2) = P(t_1) P(t_2): t_1 and t_2 co-occur randomly (independent). P(t_1, t_2) > P(t_1) P(t_2): t_1 and t_2 often co-occur together (dependent). P(t_1, t_2) < P(t_1) P(t_2): t_1 and t_2 often do not co-occur (dependent). [Term-document matrix over documents D1-D4 for the terms index, retrieval, search, information, data, computer, science.]


42 Terms are not independent dimensions We don't need as many as |V| dimensions; we can have a more compact representation of the corpus. The latent dimensions may also capture certain semantics (such as a group of synonyms or words related to the same topic). [Figure: the terms index, search, retrieval, information versus data, science, computer grouped along latent dimensions 1 and 2.]

43 Singular Value Decomposition (SVD) It requires at most m dimensions to fully represent a corpus with m documents, because $rank(C_{k \times m}) \le m$. We simply assume $rank(C_{k \times m}) = m$ and m < k in this example. SVD is a dimension-reduction technique that transforms the corpus from the original k-dimensional space to an m-dimensional space: $C_{k \times m} = U_{k \times m} \, S_{m \times m} \, V^T_{m \times m}$. If you are not familiar with the definition of rank, please check Wikipedia.

44 [Term-document matrix $C_{k \times m}$ over documents D1-D4 for the terms index, retrieval, search, information, data, computer, science.] To do SVD in MatLab: [U, S, V] = svd(C, 0);
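
A NumPy counterpart to the MatLab call, under an assumed term-document count matrix: the slide's actual numbers are not transcribed, so the entries below are invented to match the two topical term groups from the earlier figure.

```python
import numpy as np

# Invented k x m term-document counts; rows follow the slide's term order.
C = np.array([[1, 1, 0, 0],   # index
              [1, 1, 0, 0],   # retrieval
              [1, 1, 0, 0],   # search
              [1, 1, 1, 1],   # information
              [0, 0, 1, 1],   # data
              [0, 0, 1, 1],   # computer
              [0, 0, 1, 1]], dtype=float)

# Economy-size SVD, like [U, S, V] = svd(C, 0) in MatLab.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
S = np.diag(s)
print(np.allclose(C, U @ S @ Vt))  # True: U S V^T restores C exactly
```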

45 [The resulting factors: $U_{k \times m}$ with latent columns H1-H4 over the seven terms; the diagonal matrix $S_{m \times m}$ of singular values; and $V^T_{m \times m}$ with rows H1-H4 over documents D1-D4.]

46 U defines the directions of the m axes of the new m-dimensional space in the original k-dimensional space. [U_{k x m}: columns H1-H4 over the terms index, retrieval, search, information, data, computer, science.] Each H_i is a latent dimension, and each H_i is a unit vector. The latent dimensions (H1, ...) are orthogonal to each other, e.g., cos(H1, H2) = 0, cos(H2, H4) = 0, ...

47 S is the diagonal matrix of singular values. [S_{m x m}: diagonal entries for H1-H4.] You can consider singular values as the scaling factors between the original and the new space. We can also consider singular values as the importance or informativeness of the latent dimensions. By convention, S is sorted in decreasing order and should have only positive values. If you are not familiar with the definition of singular value, please check Wikipedia.

48 V represents the documents using the new dimensions. [V^T_{m x m}: rows H1-H4 over documents D1-D4.] Each column in V is also a unit-length vector. You can consider V as the documents' vectors in the new m-dimensional space (after transformation and scaling). $U S V^T$ (equivalently $U S V^{-1}$, since V is orthogonal) restores the coordinates of the documents in the original k-dimensional space.

49 Latent Semantic Indexing (LSI) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6). We want an even more compact representation, so we use only the most important n dimensions (n << rank(C)). The transformed representation is an approximation of the original one: $C_{k \times m} \approx U_{k \times n} \, S_{n \times n} \, V^T_{m \times n}$.
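
A sketch of the rank-n truncation, reusing the invented matrix C from the SVD sketch above:

```python
import numpy as np

def lsi_factors(C, n):
    """Keep only the top-n latent dimensions: C ~= U_n S_n V_n^T."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :n], np.diag(s[:n]), Vt[:n, :]

C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
Un, Sn, Vtn = lsi_factors(C, n=2)
print(np.round(Un @ Sn @ Vtn, 2))  # rank-2 approximation of C
```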

50 n = 3 [Truncated factors: the first three columns H1-H3 of U over the seven terms; the top 3 x 3 block of S; and the first three rows of V^T over documents D1-D4.]

51 LSI-Restored Document Representation Using the top three hidden dimensions: [side-by-side comparison of $C_{k \times m}$ (original) and $U_{k \times n} S_{n \times n} V^T_{m \times n}$ (restored) over documents D1-D4 and the terms index, retrieval, search, information, data, computer, science; shading marks entries that differ by more than 1.]

52 n = 2 [Truncated factors as above, keeping only the top two latent dimensions H1-H2.]

53 LSI-Restored Document Representation Using the top two hidden dimensions: [comparison of the original C and the restored $U S V^T$; shading marks entries that differ by more than 1.]

54 n = 1 [Truncated factors as above, keeping only the top latent dimension H1.]

55 LSI-Restored Document Representation Using only the top hidden dimension: [comparison of the original C and the restored $U S V^T$; shading marks entries that differ by more than 1.]

56 LSI: Retrieval We can use U and S to encode a query or a new document into the n-dimensional space as well. For example, for q = information retrieval index, we transform the query q to u by $u^T_{1 \times n} = q^T_{1 \times k} \, U_{k \times n} \, S^{-1}_{n \times n}$, where q = (1, 1, 0, 1, 0, 0, 0) over the terms (index, retrieval, search, information, data, computer, science).

57 LSI: Retrieval We retrieve search results by comparing the transformed query u with the transformed representations of the documents. For the original query q = (1, 1, 0, 1, 0, 0, 0), $u^T_{1 \times n} = q^T_{1 \times k} \, U_{k \times n} \, S^{-1}_{n \times n}$ is the projection of the query into the new space; the comparison with documents D1-D4 is then made in the new dimensions.
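
A sketch of the folding-in step and the latent-space comparison, again under the invented matrix C from the earlier sketches; q is the slide's query vector:

```python
import numpy as np

# Invented C as in the SVD sketches above.
C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
n = 2
Un, Sn, Vtn = U[:, :n], np.diag(s[:n]), Vt[:n, :]

# Fold in the query "information retrieval index": u^T = q^T U_n S_n^{-1}.
q = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
u = q @ Un @ np.linalg.inv(Sn)

# Compare with the documents (rows of Vtn^T) by cosine in the latent space.
docs = Vtn.T
scores = (docs @ u) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(u))
print(np.argsort(-scores))   # document indices ranked by latent similarity
```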

58 LSI: Retrieval What does the transformed query mean? We can restore u to the original k-dimensional space: $\hat{q}_{k \times 1} = U_{k \times n} \, S_{n \times n} \, u_{n \times 1}$. For q = (1, 1, 0, 1, 0, 0, 0), the restored query is approximately (0.68 index, 0.66 retrieval, 0.72 search, 0.91 information, 0.02 data, 0.04 computer, ... science). LSI helps expand the original query to include the term search, which seems helpful in this example.
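
And the restoration step, which maps u back into the term space ($\hat{q} = U_n S_n u$, i.e., the projection of q onto the latent subspace); on the invented C, terms absent from the query, like search, pick up nonzero weight:

```python
import numpy as np

# Invented C as in the earlier sketches.
C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
U, s, _ = np.linalg.svd(C, full_matrices=False)
Un, Sn = U[:, :2], np.diag(s[:2])

q = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # information retrieval index
u = q @ Un @ np.linalg.inv(Sn)
q_hat = Un @ Sn @ u
print(np.round(q_hat, 2))  # "search" gets nonzero weight: query expansion
```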

59 Results on the MED Dataset (1033 medical abstracts, 30 queries): LSI seems not helpful for improving the precision of the top-ranked results. Figure from Deerwester et al. (1990).

60 Results on the CISI Dataset (1460 information science abstracts, 35 queries): in general, LSI seems not helpful. Figure from Deerwester et al. (1990).

61 Is LSI any good? It decomposes language into basis vectors; in a sense, it is looking for core concepts. In theory, this means the system will retrieve documents using synonyms or related words of your query words. An appealing technique! We should improve bag-of-words! The original paper has been cited over 10,000 times! Yet the improvements are somewhat limited, especially in terms of precision: in many cases, LSI improves recall by sacrificing precision. The human query is more specific (although it may sometimes miss a few important words), while the LSI query becomes fuzzy. Why is bag-of-words so strong? Probably because human language and its vocabulary have already been evolving for several thousand years. Modified from James Allan's CS646 slides.

62 VSM Summary Standard vector space: each dimension corresponds to a term in the vocabulary; vector elements are real-valued, reflecting term importance; any vector (document, query, ...) can be compared to any other; cosine correlation is the similarity metric used most often; still widely used today! Latent Semantic Indexing (LSI): each dimension corresponds to a basic concept; documents and queries are mapped into basic concepts; the same as standard vector space after that; whether it's good depends on what you want. Modified from James Allan's CS646 slides.

63 VSM Disadvantages Assumes an independence relationship among terms (though this is a very common retrieval-model assumption). Lacks justification for some vector operations, e.g., the choice of similarity function and the choice of term weights. Barely a retrieval model: it doesn't explicitly model relevance, a person's information need, language models, etc., and it assumes a query and a document can be treated the same (symmetric). Lacks a cognitive (or other) justification. Modified from James Allan's CS646 slides.

64 VSM Advantages Simplicity. Ability to incorporate term weights: any type of term weight can be added, and there is no model that has to justify the use of a weight. Ability to handle distributed term representations, e.g., LSI. Can measure similarities between almost anything: documents and queries, documents and documents, queries and queries, sentences and sentences, etc. Modified from James Allan's CS646 slides.

65 Wed (9/28): Probabilistic retrieval models. HW1 is due Wed at 11:59pm!
