
1 Fall 2016 CS646: Information Retrieval. Lecture 6: Boolean Search and Vector Space Model. Jiepu Jiang, University of Massachusetts Amherst. 2016/09/26

2 Outline Today Boolean Retrieval Vector Space Model Latent semantic indexing

3 Boolean Retrieval Search by exact match. Query terms are combined with logic operators such as AND, OR, NOT, which can be nested, e.g., x AND ( y OR ( a AND b ) OR c ). Conceptually returns a set of results, without ranking, although most systems implement certain rankings. Boolean queries can also be used to filter results in ranked retrieval. [Venn diagrams illustrating x AND y, x OR y, and NOT x.]
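
To make the set semantics concrete, here is a minimal sketch of Boolean retrieval over an in-memory inverted index. The index contents and document IDs are made up for illustration; real systems operate on sorted, compressed posting lists rather than Python sets.

```python
# Illustrative inverted index: term -> set of IDs of documents containing it.
index = {
    "cat":  {1, 2, 3},
    "dog":  {2, 3, 4},
    "lion": {3},
}
all_docs = {1, 2, 3, 4}

def AND(x, y): return x & y          # set intersection
def OR(x, y):  return x | y          # set union
def NOT(x):    return all_docs - x   # set complement

# A nested query: cat AND (lion OR (dog AND NOT cat))
result = AND(index["cat"], OR(index["lion"], AND(index["dog"], NOT(index["cat"]))))
print(result)  # {3} -- an unranked set of matching documents
```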

4 Boolean Retrieval: Advantages Precise. Transparent, controllable, predictable (for trained users). Can be implemented very efficiently (ignores term frequency). Works well when you know what the collection contains and what you're looking for. Works well if the corpus is small enough for a human to handle, so one has the chance to get familiar with the corpus. Still widely used today, especially in specialized search systems: Westlaw law search, ISI Web of Knowledge science citation index search, and almost all library catalog systems. Many professionals prefer Boolean retrieval (librarians, lawyers, physicians). Modified from James Allan's CS646 slides.

5 Boolean Retrieval: Disadvantages Lacks a ranking mechanism, which is especially important. Effectiveness depends highly on the user's ability to formulate good queries: users need sufficient training and knowledge about the corpus to formulate good queries, general users do not have that knowledge, and sometimes it is impossible to be familiar with the corpus, e.g., the web. People are lazy: formulating Boolean queries has a much higher cost. Not really that controllable: AND gives too few results; OR gives too many. Modified from James Allan's CS646 slides.

6 Ranked Retrieval (Best Match Search) Returns a ranked list of results; necessary for a large corpus such as the web. Free text search: necessary and much easier for ordinary users. But one can also use Boolean queries to filter search results and just rank the filtered results, e.g., index search NOT database. Requires some ranking model (the core of best-match search). The basis for processing a free text query is Boolean search: web search engines use Boolean AND (fast and usually enough; it may not work for bad queries, so drop one or a few terms), while IR experiments and research use Boolean OR (more accurate experiment results). Modified from James Allan's CS646 slides.

7 Ranked Retrieval (Best Match Search) Vector space model (today!). Probabilistic models, e.g., BM25 (Wed). Language modeling approaches (next week). Document representation (two weeks later). Query representation (two weeks later). Learning-to-rank (three weeks later). Midterm (11/3, 7-9pm), right after we finish evaluation: retrieval models (about 50%), evaluation (about 30%), other (about 20%).

8 Outline Today Boolean Search Vector Space Model Latent semantic indexing [Photo of Gerard Salton (Gerard Salton Award, 1983).]

9 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension: V = {t_1, t_2, ..., t_k}; the vocabulary has k unique terms. Example vocabulary: {cat, dog, lion}, with t_1 = cat, t_2 = dog, t_3 = lion. [Figure: a 3-dimensional space with axes t_1: cat, t_2: dog, t_3: lion.]

10 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_1: cat cat cat, so D_1 = (3, 0, 0). Notation: t is an index term; w is a term's weight (in this example, the term's frequency).

11 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_2: cat dog cat, so D_2 = (2, 1, 0). We ignore word sequence in this simple VSM example; this is also called the bag-of-words model.

12 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension, and represent each document as a vector in the k-dimensional space: D = [w_1, w_2, ..., w_k]; for example, w_i can be the frequency of t_i in D. D_3: cat dog lion dog, so D_3 = (1, 2, 1). We ignore word sequence in this simple VSM example; this is also called the bag-of-words model.
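
The three document vectors above can be produced mechanically. A small sketch of the bag-of-words counting step, using the vocabulary and documents from the slides:

```python
from collections import Counter

vocab = ["cat", "dog", "lion"]  # t1, t2, t3

def to_vector(text):
    """Bag-of-words: count each vocabulary term, ignoring word order."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

print(to_vector("cat cat cat"))       # [3, 0, 0] = D1
print(to_vector("cat dog cat"))       # [2, 1, 0] = D2
print(to_vector("cat dog lion dog"))  # [1, 2, 1] = D3
```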

13 Vector Space Model: A Simple Example VSM makes it easy to measure similarity, which implies relevance. Suppose we have four documents: D_1: cat cat cat (discusses only cat); D_2: cat cat dog (discusses both, but more on cat); D_3: cat dog dog (discusses both, but more on dog); D_4: dog dog dog (discusses only dog). To what extent do the four documents relate to each other? Which is the most related to D_4? Probably D_3 > D_2 > D_1. We need a model to capture such relatedness/relevance.

14 Recall Luhn's idea (Lecture 2): similarity implies relevance. "The more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information." H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), 1957.

15 Vector Space Model: A Simple Example The direction of a vector indicates the distribution of words. Comparing two documents' word distributions is equivalent to measuring the size of the angle between the two documents' vectors: a smaller angle indicates a higher degree of similarity. This applies to any k-dimensional space. Angle(D_4, D_3) < Angle(D_4, D_2) < Angle(D_4, D_1), so Similarity(D_4, D_3) > Similarity(D_4, D_2) > Similarity(D_4, D_1). [Figure: vectors in the cat-dog plane: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), D_4 = (0, 3, 0), D_5 = (3, 6, 0).]

16 Vector Space Model: A Simple Example Computationally it is easier to use the cosine as a surrogate: y = cos(x) is monotonically decreasing for x in [0, π/2], so a higher cosine value indicates a higher level of similarity. Angle(D_4, D_3) < Angle(D_4, D_2) < Angle(D_4, D_1) implies cos(D_4, D_3) > cos(D_4, D_2) > cos(D_4, D_1). [Figure: plot of y = cos(x) on [0, π/2].]

17 Cosine Similarity: Computation For x = [x_1 x_2 ... x_k] and y = [y_1 y_2 ... y_k]: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{\sum_{i=1}^{k} x_i y_i}{\sqrt{\sum_{i=1}^{k} x_i^2} \, \sqrt{\sum_{i=1}^{k} y_i^2}}$, where $x \cdot y$ is the dot product and $\|x\|$ is the Euclidean length.
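
A direct sketch of this formula, checked against the D4-versus-D1/D2/D3 comparison on the previous slide:

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1, d2, d3, d4 = [3, 0, 0], [2, 1, 0], [1, 2, 0], [0, 3, 0]
print(cosine(d4, d3), cosine(d4, d2), cosine(d4, d1))
# ~0.894 > ~0.447 > 0.0, i.e., Similarity(D4, D3) > (D4, D2) > (D4, D1)
```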

18 Cosine similarity ignores vector length For x = [x_1 x_2 ... x_k] and y = [y_1 y_2 ... y_k], the cosine is the dot product of two unit vectors: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{x}{\|x\|} \cdot \frac{y}{\|y\|} = \sum_{i=1}^{k} \frac{x_i}{\|x\|} \frac{y_i}{\|y\|}$.

19 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: dog, so q = (0, 1, 0). Similarity ranking: D_3: cat dog dog > D_2: cat cat dog > D_1: cat cat cat. [Figure: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), and q = (0, 1, 0) in the cat-dog plane.]

20 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: cat dog cat, so q = (2, 1, 0). Similarity ranking: D_2: cat cat dog > D_1: cat cat cat > D_3: cat dog dog. Note that D_2 = (2, 1, 0) has exactly the same direction as q. [Figure: D_1 = (3, 0, 0), D_2 = (2, 1, 0), D_3 = (1, 2, 0), and q = (2, 1, 0) in the cat-dog plane.]
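
Putting the query vector and the cosine together, a sketch of ranked retrieval for the query "cat dog cat"; it reproduces the ordering shown above:

```python
import numpy as np

def cosine(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

docs = {"D1": [3, 0, 0], "D2": [2, 1, 0], "D3": [1, 2, 0]}
q = [2, 1, 0]  # "cat dog cat" as a raw term-frequency vector

for d in sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True):
    print(d, round(cosine(q, docs[d]), 3))
# D2 1.0, D1 0.894, D3 0.8
```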

21 Vector Space Model: A Simple Example Hmmm, seems problematic. We'll talk about some extensions very soon. No best-match search model so far is perfect, but they are reasonably good and useful. Query: cat dog cat. Similarity ranking: D_2: cat cat dog > D_1: cat cat cat > D_3: cat dog dog. Real relevance: D_2: cat cat dog > D_3: cat dog dog > D_1: cat cat cat.

22 Cosine Similarity: IR Computation For q = [q_1 ... q_k] and d = [d_1 ... d_k]: $\cos(q, d) = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2} \, \sqrt{\sum_{i=1}^{k} d_i^2}} \propto \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} d_i^2}}$, because $\|q\|$ is the same for every document and thus independent of the ranking. It is faster if the index stores each document's Euclidean length (different from document length).
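
A sketch of the rank-equivalent computation: since ||q|| is shared by all documents, we can rank by the dot product divided by a document norm precomputed at indexing time (the documents here are the illustrative ones from the earlier slides):

```python
import numpy as np

docs = {"D1": np.array([3.0, 0.0, 0.0]),
        "D2": np.array([2.0, 1.0, 0.0]),
        "D3": np.array([1.0, 2.0, 0.0])}
# Euclidean lengths computed once and stored alongside the index.
doc_norm = {d: np.linalg.norm(v) for d, v in docs.items()}

def score(q, d):
    """Rank-equivalent to cosine: ||q|| is dropped since it is constant."""
    return (q @ docs[d]) / doc_norm[d]

q = np.array([2.0, 1.0, 0.0])
print(sorted(docs, key=lambda d: score(q, d), reverse=True))  # ['D2', 'D1', 'D3']
```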

23 Vector Space Model The framework is generic: define a k-dimensional space; represent each document as a vector in the k-dimensional space; represent a query as a vector in the same space; measure relevance by the similarity of the query and the document. The simple example is just a particular implementation: each term is a unique dimension, a document is a within-document term frequency vector, a query is a query term frequency vector, and relevance is measured by cos(q, D).

24 VSM: Dimension In most cases we just consider each unique word as a dimension. It looks very simple, but it works reasonably well. Many limitations: Terms (dimensions) are assumed independent of each other, ignoring synonyms (retrieve, search, seek, ...) and words related to the same topic (retrieval, index, precision, ...); we will discuss an extension (LSI) very soon. Word sequence is ignored (as all bag-of-words models do): "the woman was shot by the suspect" is equivalent to "the suspect was shot by the woman"; term proximity addresses this (Lecture 9, next week). But it is not easy to solve these limitations effectively; many solutions do not outperform the simple approach.

25 VSM: Document Vector Consider each indexed term as a unique dimension: V = {t_1, t_2, ..., t_k}, D = [w_1, w_2, ..., w_k], where w_i indicates how important the term t_i is for representing D's information. How to compute w_i? Usually problem dependent. Generally two types of factors to consider: document-dependent (Does t_i appear in D? How many times? Where, e.g., title or heading?) and document-independent (Is t_i an important word? Is it a noun/verb/adjective? Is it a number/name/emoticon?).

26 VSM: TF-IDF Weighting D = [w_1, w_2, ..., w_k], where w_i indicates how important the term t_i is for representing D's information. How to compute w_i? A popular approach is TF-IDF weighting: $w_i = TF(t_i, D) \times IDF(t_i)$, where TF is the within-document term frequency (document-dependent) and IDF is the inverse document frequency (document-independent).

27 Choices of TF: Binary Only consider whether or not a term appears in a document; ignore repeated occurrences of the same term. $TF_{binary}(t_i, D) = 1$ if $c(t_i, D) > 0$, and $0$ otherwise. When to use? Very short documents, where repeated occurrence of the same term is rare and unstable (due to the small text sample), e.g., Twitter search, passage/sentence retrieval. Notation: c(t, D) is the number of times t appears in D (term frequency).

28 Choices of TF: Raw Frequency Frequent terms in a document are more important (for representing that document's information); the importance of a term is proportional to its frequency in D: $TF_{raw}(t_i, D) = c(t_i, D)$. When to use? When you believe this is true or have evidence that it is. A caveat: frequent terms may have too strong an influence. Notation: c(t, D) is the number of times t appears in D (term frequency).

29 Choices of TF: Log Frequency Frequent terms in a document are more important, but repeated occurrences of the term are penalized: the first occurrence of a term is the most important, and repeated occurrences are less and less important. A greater log base b penalizes repetition to a greater extent. $TF_{log}(t_i, D) = 1 + \log_b c(t_i, D)$ if $c(t_i, D) > 0$, and $0$ otherwise. Notation: c(t, D) is the number of times t appears in D (term frequency).
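
The three TF choices from the last few slides as one small sketch; c stands for the raw count c(t, D):

```python
import math

def tf_binary(c):
    return 1 if c > 0 else 0

def tf_raw(c):
    return c

def tf_log(c, b=2):
    """1 + log_b(c) for c > 0; a greater base b penalizes repetition more."""
    return 1 + math.log(c, b) if c > 0 else 0

for c in [0, 1, 2, 3, 10]:
    print(c, tf_binary(c), tf_raw(c), round(tf_log(c), 2))
```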

30 [Plot comparing y = x with y = log_2 x, y = log x, and y = log_10 x.]

31 Choices of TF: Log Frequency Log TF prefers matching more unique terms: c(t_1, d) = 2, c(t_2, d) = 1 is better than c(t_1, d) = 3, c(t_2, d) = 0. Query: cat dog cat, with natural-log TF: q = (1 + ln 2, 1, 0) ≈ (1.69, 1, 0). Document vectors: D_1 ≈ (2.10, 0, 0), D_2 ≈ (1.69, 1, 0) = q, D_3 ≈ (1, 1.69, 0). Similarity ranking: D_2: cat cat dog > D_3: cat dog dog > D_1: cat cat cat.

32 Choices of TF: Others Other TF variants exist, but raw frequency, log frequency, and binary are the most popular ones. Table from the CDM textbook.

33 Choices of IDF: Uniform Every term is equally important: $IDF_{uniform}(t_i) = 1$. Almost always a bad idea. When to use? I can't remember any time it worked.

34 Choices of IDF: KSJ The original IDF by Karen Spärck Jones (KSJ): $IDF_{KSJ}(t) = \log \frac{N}{n_t}$. The log base does not affect ranking (in most retrieval models). The total frequency of the term in the corpus does not directly influence IDF (although it is almost always highly correlated with n_t). Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.

35 Choices of IDF: BM25 The IDF used in BM25 (Wed) has a probabilistic interpretation, P(w | NR) (Wed): $IDF_{BM25}(t) = \log \frac{N - n_t + 0.5}{n_t + 0.5}$ if $n_t < \frac{N}{2}$, and $0$ if $n_t \ge \frac{N}{2}$. The +0.5 is used for smoothing zero values and has no influence in a large corpus. This gives a greater discounting: very frequent terms have zero weight. Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.
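
A sketch of the two IDF formulas, plus the TF-IDF composition from slide 26; N and n_t are corpus statistics, and the numbers below are made up for illustration:

```python
import math

def idf_ksj(N, n_t):
    """Sparck Jones IDF: log(N / n_t)."""
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    """log((N - n_t + 0.5) / (n_t + 0.5)); zero once n_t >= N/2."""
    return max(0.0, math.log((N - n_t + 0.5) / (n_t + 0.5)))

N = 1_000_000                       # illustrative corpus size
for n_t in [10, 10_000, 600_000]:   # rare, common, very frequent term
    print(n_t, round(idf_ksj(N, n_t), 2), round(idf_bm25(N, n_t), 2))

# TF-IDF composition, w_i = TF(t_i, D) * IDF(t_i), here with log TF:
c = 3
w = (1 + math.log(c, 2)) * idf_ksj(N, 10_000)
print(round(w, 2))
```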

36 VSM: Query Vector Can use the same approach as for the document vector, but it can be different as well: SMART supports using different approaches for queries and documents, though this requires a lot of hand-tuning, and one usually avoids very different approaches. Some considerations: repeated occurrences of a query term may not mean the term is more important to the user's information need; in any case, queries are short and term repetition is very rare.

37 VSM: Similarity Measure Cosine is the most frequently used measure; it normalizes vectors by Euclidean length. Euclidean distance is another option, but it does not fit most IR applications: it is strongly influenced by document length, requiring documents to have not only a similar distribution but also a similar length. A few areas where it may apply: plagiarism detection, finding similar documents.

38 Standard VSM Summary Very simple: map everything to a vector, then compare using the angle between vectors. Challenge: finding a good weighting scheme. Variants of TF-IDF are the most common; the Okapi TF function is popular, particularly in research systems; the VSM model provides no guidance. Another challenge: the comparison/similarity function. Cosine is the most common; the generic inner product (without unit vectors) also occurs; the VSM model provides no guidance.

39 Outline Today Boolean Search Vector Space Model Latent semantic indexing [Photos of Scott Deerwester and Susan T. Dumais (Gerard Salton Award, 2009; Athena Lecturer Award, 2014).]

40 Terms are not independent dimensions Some terms tend to co-occur more or less often than others. P(t_1, t_2) = P(t_1) P(t_2): t_1 and t_2 co-occur randomly (independent). P(t_1, t_2) > P(t_1) P(t_2): t_1 and t_2 often co-occur together (dependent). P(t_1, t_2) < P(t_1) P(t_2): t_1 and t_2 often do not co-occur (dependent). [Term-document matrix over documents D1-D4 for the terms index, retrieval, search, information, data, computer, science.]


42 Terms are not independent dimensions We don't need as many as |V| dimensions; we can have a more compact representation of the corpus. The latent dimensions may also capture certain semantics (such as a group of synonyms or words related to the same topic). [Figure: the terms index, search, retrieval, information versus data, science, computer grouped along latent dimensions 1 and 2.]

43 Singular Value Decomposition (SVD) It requires at most m dimensions to fully represent a corpus with m documents, because $rank(C_{k \times m}) \le m$. We simply assume $rank(C_{k \times m}) = m$ and m < k in this example. SVD is a dimension-reduction technique that transforms the corpus from the original k-dimensional space to an m-dimensional space: $C_{k \times m} = U_{k \times m} \, S_{m \times m} \, V^T_{m \times m}$. If you are not familiar with the definition of rank, please check Wikipedia.

44 [Term-document matrix $C_{k \times m}$ over documents D1-D4 for the terms index, retrieval, search, information, data, computer, science.] To do SVD in MatLab: [U, S, V] = svd(C, 0);
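
A NumPy counterpart to the MatLab call, under an assumed term-document count matrix: the slide's actual numbers are not transcribed, so the entries below are invented to match the two topical term groups from the earlier figure.

```python
import numpy as np

# Invented k x m term-document counts; rows follow the slide's term order.
C = np.array([[1, 1, 0, 0],   # index
              [1, 1, 0, 0],   # retrieval
              [1, 1, 0, 0],   # search
              [1, 1, 1, 1],   # information
              [0, 0, 1, 1],   # data
              [0, 0, 1, 1],   # computer
              [0, 0, 1, 1]], dtype=float)

# Economy-size SVD, like [U, S, V] = svd(C, 0) in MatLab.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
S = np.diag(s)
print(np.allclose(C, U @ S @ Vt))  # True: U S V^T restores C exactly
```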

45 [The resulting factors: $U_{k \times m}$ with latent columns H1-H4 over the seven terms; the diagonal matrix $S_{m \times m}$ of singular values; and $V^T_{m \times m}$ with rows H1-H4 over documents D1-D4.]

46 U defines the directions of the m axes of the new m-dimensional space in the original k-dimensional space. [U_{k x m}: columns H1-H4 over the terms index, retrieval, search, information, data, computer, science.] Each H_i is a latent dimension, and each H_i is a unit vector. The latent dimensions (H1, ...) are orthogonal to each other, e.g., cos(H1, H2) = 0, cos(H2, H4) = 0, ...

47 S is the diagonal matrix of singular values. [S_{m x m}: diagonal entries for H1-H4.] You can consider singular values as the scaling factors between the original and the new space. We can also consider singular values as the importance or informativeness of the latent dimensions. By convention, S is sorted in decreasing order and should have only positive values. If you are not familiar with the definition of singular value, please check Wikipedia.

48 V represents the documents using the new dimensions. [V^T_{m x m}: rows H1-H4 over documents D1-D4.] Each column in V is also a unit-length vector. You can consider V as the documents' vectors in the new m-dimensional space (after transformation and scaling). $U S V^T$ (equivalently $U S V^{-1}$, since V is orthogonal) restores the coordinates of the documents in the original k-dimensional space.

49 Latent Semantic Indexing (LSI) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6). We want an even more compact representation, so we use only the most important n dimensions (n << rank(C)). The transformed representation is an approximation of the original one: $C_{k \times m} \approx U_{k \times n} \, S_{n \times n} \, V^T_{m \times n}$.
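
A sketch of the rank-n truncation, reusing the invented matrix C from the SVD sketch above:

```python
import numpy as np

def lsi_factors(C, n):
    """Keep only the top-n latent dimensions: C ~= U_n S_n V_n^T."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :n], np.diag(s[:n]), Vt[:n, :]

C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
Un, Sn, Vtn = lsi_factors(C, n=2)
print(np.round(Un @ Sn @ Vtn, 2))  # rank-2 approximation of C
```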

50 n = 3 [Truncated factors: the first three columns H1-H3 of U over the seven terms; the top 3 x 3 block of S; and the first three rows of V^T over documents D1-D4.]

51 LSI-Restored Document Representation Using the top three hidden dimensions: [side-by-side comparison of $C_{k \times m}$ (original) and $U_{k \times n} S_{n \times n} V^T_{m \times n}$ (restored) over documents D1-D4 and the terms index, retrieval, search, information, data, computer, science; shading marks entries that differ by more than 1.]

52 n = 2 [Truncated factors as above, keeping only the top two latent dimensions H1-H2.]

53 LSI-Restored Document Representation Using the top two hidden dimensions: [comparison of the original C and the restored $U S V^T$; shading marks entries that differ by more than 1.]

54 n = 1 [Truncated factors as above, keeping only the top latent dimension H1.]

55 LSI-Restored Document Representation Using only the top hidden dimension: [comparison of the original C and the restored $U S V^T$; shading marks entries that differ by more than 1.]

56 LSI: Retrieval We can use U and S to encode a query or a new document into the n-dimensional space as well. For example, for q = information retrieval index, we transform the query q to u by $u^T_{1 \times n} = q^T_{1 \times k} \, U_{k \times n} \, S^{-1}_{n \times n}$, where q = (1, 1, 0, 1, 0, 0, 0) over the terms (index, retrieval, search, information, data, computer, science).

57 LSI: Retrieval We retrieve search results by comparing the transformed query u with the transformed representations of the documents. For the original query q = (1, 1, 0, 1, 0, 0, 0), $u^T_{1 \times n} = q^T_{1 \times k} \, U_{k \times n} \, S^{-1}_{n \times n}$ is the projection of the query into the new space; the comparison with documents D1-D4 is then made in the new dimensions.
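
A sketch of the folding-in step and the latent-space comparison, again under the invented matrix C from the earlier sketches; q is the slide's query vector:

```python
import numpy as np

# Invented C as in the SVD sketches above.
C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
n = 2
Un, Sn, Vtn = U[:, :n], np.diag(s[:n]), Vt[:n, :]

# Fold in the query "information retrieval index": u^T = q^T U_n S_n^{-1}.
q = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])
u = q @ Un @ np.linalg.inv(Sn)

# Compare with the documents (rows of Vtn^T) by cosine in the latent space.
docs = Vtn.T
scores = (docs @ u) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(u))
print(np.argsort(-scores))   # document indices ranked by latent similarity
```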

58 LSI: Retrieval What does the transformed query mean? We can restore u to the original k-dimensional space: $\hat{q}_{k \times 1} = U_{k \times n} \, S_{n \times n} \, u_{n \times 1}$. For q = (1, 1, 0, 1, 0, 0, 0), the restored query is approximately (0.68 index, 0.66 retrieval, 0.72 search, 0.91 information, 0.02 data, 0.04 computer, ... science). LSI helps expand the original query to include the term search, which seems helpful in this example.
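
And the restoration step, which maps u back into the term space ($\hat{q} = U_n S_n u$, i.e., the projection of q onto the latent subspace); on the invented C, terms absent from the query, like search, pick up nonzero weight:

```python
import numpy as np

# Invented C as in the earlier sketches.
C = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1],
              [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
U, s, _ = np.linalg.svd(C, full_matrices=False)
Un, Sn = U[:, :2], np.diag(s[:2])

q = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # information retrieval index
u = q @ Un @ np.linalg.inv(Sn)
q_hat = Un @ Sn @ u
print(np.round(q_hat, 2))  # "search" gets nonzero weight: query expansion
```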

59 Results on the MED Dataset (1033 medical abstracts, 30 queries): LSI seems not helpful for improving the precision of the top-ranked results. Figure from Deerwester et al. (1990).

60 Results on the CISI Dataset (1460 information science abstracts, 35 queries): in general, LSI seems not helpful. Figure from Deerwester et al. (1990).

61 Is LSI any good? It decomposes language into basis vectors; in a sense, it is looking for core concepts. In theory, this means the system will retrieve documents using synonyms or related words of your query words. An appealing technique! We should improve bag-of-words! The original paper has been cited over 10,000 times! Yet the improvements are somewhat limited, especially in terms of precision: in many cases, LSI improves recall by sacrificing precision. The human query is more specific (although it may sometimes miss a few important words), while the LSI query becomes fuzzy. Why is bag-of-words so strong? Probably because human language and its vocabulary have already been evolving for several thousand years. Modified from James Allan's CS646 slides.

62 VSM Summary Standard vector space: each dimension corresponds to a term in the vocabulary; vector elements are real-valued, reflecting term importance; any vector (document, query, ...) can be compared to any other; cosine correlation is the similarity metric used most often; still widely used today! Latent Semantic Indexing (LSI): each dimension corresponds to a basic concept; documents and queries are mapped into basic concepts; the same as standard vector space after that; whether it's good depends on what you want. Modified from James Allan's CS646 slides.

63 VSM Disadvantages Assumes an independence relationship among terms (though this is a very common retrieval-model assumption). Lacks justification for some vector operations, e.g., the choice of similarity function and the choice of term weights. Barely a retrieval model: it doesn't explicitly model relevance, a person's information need, language models, etc., and it assumes a query and a document can be treated the same (symmetric). Lacks a cognitive (or other) justification. Modified from James Allan's CS646 slides.

64 VSM Advantages Simplicity. Ability to incorporate term weights: any type of term weight can be added, and there is no model that has to justify the use of a weight. Ability to handle distributed term representations, e.g., LSI. Can measure similarities between almost anything: documents and queries, documents and documents, queries and queries, sentences and sentences, etc. Modified from James Allan's CS646 slides.

65 Wed (9/28): Probabilistic retrieval models. HW1 is due Wed at 11:59pm!
