Fall 2016 CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26
1 Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26
2 Outline Today Boolean Retrieval Vector Space Model Latent semantic indexing
3 Boolean Retrieval Search by exact match. Queries are combined with logic operators such as AND, OR, NOT; these can be nested, e.g., x AND ( y OR ( a AND b ) OR c ). Conceptually returns a set of results, without ranking, although most systems implement certain rankings. Can also filter results using Boolean queries in ranked retrieval. [Figure: Venn diagrams for x AND y, x OR y, and NOT x]
4 Boolean Retrieval: Advantages Precise. Transparent, controllable, predictable (for trained users). Can be very efficiently implemented (ignores term frequency). Works well when you know what the collection contains and what you're looking for. Works well if the size of the corpus can be handled by humans: one has the chance to get familiar with the corpus. Still widely used today, especially in specialized search systems: Westlaw law search, ISI Web of Knowledge science citation index search, almost all library catalog systems. Many prefer Boolean retrieval (librarians, lawyers, physicians). Modified from James Allan's CS646 slides.
5 Ch. 6 Boolean Retrieval: Disadvantages Lacks a ranking mechanism, which is especially important. Effectiveness highly depends on the user's ability to formulate good queries: users need sufficient training and knowledge about the corpus to formulate good queries; general users do not have the knowledge, and sometimes it is impossible to be familiar with the corpus, e.g., the web. People are lazy: a much higher cost to formulate Boolean queries. Not really that controllable: AND gives too few results; OR gives too many. Modified from James Allan's CS646 slides.
6 Ch. 6 Ranked Retrieval (Best Match Search) Returns a ranked list of results; necessary for a large corpus such as the web. Free text search: necessary and much easier for ordinary users. But can also use Boolean queries to filter search results and just rank the filtered results, e.g., index search NOT database. Requires some ranking model (the core of best-match search). The basis for processing free text queries is Boolean search. Web search engines: Boolean AND (fast & usually enough); may not work for bad queries, so drop one or a few terms. IR experiment & research: Boolean OR (more accurate experiment results). Modified from James Allan's CS646 slides.
7 Ch. 6 Ranked Retrieval (Best Match Search) Vector space model (today!). Probabilistic models, e.g., BM25 (Wed). Language modeling approaches (next week). Document representation (two weeks later). Query representation (two weeks later). Learning-to-rank (three weeks later). Midterm (11/3, 7-9pm), right after we finish evaluation: retrieval models (about 50%), evaluation (about 30%), other (about 20%).
8 Outline Today Boolean Search Vector Space Model Latent semantic indexing Gerard Salton (Gerard Salton Award, 1983)
9 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension: V = {t1, t2, ..., tk}; the vocabulary has k unique terms. Example vocabulary: {cat, dog, lion}, so t1 = cat, t2 = dog, t3 = lion. [Figure: three axes labeled t1 (cat), t2 (dog), t3 (lion)]
10 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension. Represent each document as a vector in the k-dimensional space: D = [w1, w2, ..., wk]; for example, wi can be the frequency of ti in D. D1 = cat cat cat, so D1 = (3, 0, 0). Notation: t is an index term; w is a term's weight (in this example, the term's frequency). [Figure: D1 plotted along the cat axis]
11 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension. Represent each document as a vector in the k-dimensional space: D = [w1, w2, ..., wk]; for example, wi can be the frequency of ti in D. D2 = cat dog cat, so D2 = (2, 1, 0). We ignore word sequence in this simple VSM example, also called the bag-of-words model.
12 Vector Space Model: A Simple Example Consider each indexed term as a unique dimension. Represent each document as a vector in the k-dimensional space: D = [w1, w2, ..., wk]; for example, wi can be the frequency of ti in D. D3 = cat dog lion dog, so D3 = (1, 2, 1). We ignore word sequence in this simple VSM example, also called the bag-of-words model.
13 Vector Space Model: A Simple Example VSM makes it easy to measure similarity, which implies relevance. Suppose we have four documents as follows. To what extent do the four documents relate to each other? Which is the most related to D4? Probably D3 > D2 > D1. We need a model to capture such relatedness/relevance. D1: cat cat cat (discusses only cat). D2: cat cat dog (discusses both, but more on cat). D3: cat dog dog (discusses both, but more on dog). D4: dog dog dog (discusses only dog).
14 Recall Luhn's idea (Lecture 2) Similarity implies relevance: "The more two representations agreed in given elements and their distribution, the higher would be the probability of their representing similar information." (Hans Peter Luhn) H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4), 1957.
15 Vector Space Model: A Simple Example The direction of a vector indicates the distribution of words. Comparing two documents' word distributions is equivalent to measuring the size of the angle between the two documents' vectors; a smaller angle indicates a higher degree of similarity. This applies to any k-dimensional space. Angle(D4, D3) < Angle(D4, D2) < Angle(D4, D1), so Similarity(D4, D3) > Similarity(D4, D2) > Similarity(D4, D1). [Figure: vectors in the cat-dog plane: D1 = (3,0,0), D2 = (2,1,0), D3 = (1,2,0), D4 = (0,3,0), D5 = (3,6,0)]
16 Vector Space Model: A Simple Example Computationally it is easier to use cosine as a surrogate: y = cos(x) is monotonically decreasing for x in [0, pi/2], so a higher cosine value indicates a higher level of similarity. Angle(D4, D3) < Angle(D4, D2) < Angle(D4, D1) implies cos(D4, D3) > cos(D4, D2) > cos(D4, D1). [Figure: plot of y = cos(x) on [0, pi/2]]
17 Cosine Similarity: Computation For x = [x1 x2 ... xk] and y = [y1 y2 ... yk]: cos(x, y) = (x · y) / (|x| |y|) = Σi xi yi / (sqrt(Σi xi²) · sqrt(Σi yi²)), where x · y = Σi xi yi is the dot product and |x| = sqrt(Σi xi²) is the Euclidean length.
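The computation above can be sketched in a few lines of Python, using the cat/dog/lion document vectors from the earlier slides:

```python
import math

def cosine(x, y):
    # Dot product divided by the product of the Euclidean lengths.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    len_x = math.sqrt(sum(xi * xi for xi in x))
    len_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (len_x * len_y)

# Raw-frequency vectors over the vocabulary (cat, dog, lion).
D1, D2, D3, D4 = [3, 0, 0], [2, 1, 0], [1, 2, 0], [0, 3, 0]
```

For example, cosine(D4, D3) > cosine(D4, D2) > cosine(D4, D1), matching the angle ordering on the previous slide.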
18 Cosine similarity ignores vector length For x = [x1 x2 ... xk] and y = [y1 y2 ... yk]: cos(x, y) = (x · y) / (|x| |y|) = (x / |x|) · (y / |y|), i.e., the dot product of the two unit vectors.
19 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: dog, so q = (0, 1, 0). Ranking by similarity: D3 (cat dog dog) > D2 (cat cat dog) > D1 (cat cat cat). [Figure: q = (0,1,0) plotted with D1 = (3,0,0), D2 = (2,1,0), D3 = (1,2,0)]
20 Vector Space Model: A Simple Example VSM has many applications, such as text clustering (Lecture 10). For the purpose of retrieval, we can simply represent a query as a vector using the same approach and rank results by cosine(q, D). Query: cat dog cat, so q = (2, 1, 0). Ranking by similarity: D2 (cat cat dog) > D1 (cat cat cat) > D3 (cat dog dog). [Figure: q = (2,1,0) coincides with D2 = (2,1,0); D1 = (3,0,0), D3 = (1,2,0)]
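Ranking by cosine(q, D) as on this slide can be sketched directly; the vectors are the raw term frequencies from the example:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Raw-frequency document vectors over (cat, dog, lion).
docs = {"D1": [3, 0, 0], "D2": [2, 1, 0], "D3": [1, 2, 0]}
q = [2, 1, 0]  # query "cat dog cat"

# Sort document ids by descending cosine with the query.
ranking = sorted(docs, key=lambda d: cosine(q, docs[d]), reverse=True)
```

This reproduces the slide's ordering: D2 first (its direction coincides with q), then D1, then D3.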
21 Vector Space Model: A Simple Example Hmmm, seems problematic. We'll talk about some extensions very soon. No best-match search model so far is perfect, but this one is reasonably good and useful. Query: cat dog cat. Ranking by similarity: D2 (cat cat dog) > D1 (cat cat cat) > D3 (cat dog dog). Real relevance: D2 (cat cat dog) > D3 (cat dog dog) > D1 (cat cat cat).
22 Cosine Similarity: IR Computation For q = [q1 ... qk] and d = [d1 ... dk]: cos(q, d) = Σi qi di / (sqrt(Σi qi²) · sqrt(Σi di²)). The query length sqrt(Σi qi²) is the same for every document, so it is independent of the ranking; it suffices to rank by Σi qi di / sqrt(Σi di²). It is faster if the index stores each document's Euclidean length (different from document length).
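A minimal sketch of this optimization, with a hypothetical toy index in which the Euclidean lengths are precomputed at indexing time rather than at query time:

```python
import math

# Toy index: document vectors plus their precomputed Euclidean lengths.
doc_vectors = {"D1": [3, 0, 0], "D2": [2, 1, 0], "D3": [1, 2, 1]}
doc_norms = {d: math.sqrt(sum(w * w for w in v)) for d, v in doc_vectors.items()}

def score(q, d):
    # |q| is constant across documents, so dividing by it cannot change
    # the ranking; only dot(q, d) / ||d|| is needed at query time.
    dot = sum(qi * wi for qi, wi in zip(q, doc_vectors[d]))
    return dot / doc_norms[d]
```

The scores differ from true cosine by the constant factor 1/|q|, so the induced ranking is identical.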
23 Vector Space Model The framework is generic: define a k-dimensional space; represent each document as a vector in the k-dimensional space; represent a query as a vector in the k-dimensional space; measure relevance by the similarity of the query and the document. The simple example is just a particular implementation: each term as a unique dimension; document as a within-document term frequency vector; query as a query term frequency vector; measuring relevance by cos(q, D).
24 VSM: Dimension In most cases we just consider each unique word as a dimension. It looks very simple, but it works reasonably well. Many limitations: Terms (dimensions) are assumed independent of each other, ignoring synonyms (retrieve, search, seek) and words related to the same topic (retrieval, index, precision); we will discuss an extension (LSI) very soon. Word sequence is ignored (all bag-of-words models do): "the woman was shot by the suspect" is treated as equivalent to "the suspect was shot by the woman"; see term proximity (Lecture 9, next week). But it's not easy to solve these limitations effectively: many solutions do not outperform the simple approach.
25 VSM: Document Vector Consider each indexed term as a unique dimension: V = {t1, t2, ..., tk}, D = [w1, w2, ..., wk]. wi: how important the term ti is for representing D's information. How to compute wi? Usually problem dependent. Generally two types of factors to consider: document-dependent (Does ti appear in D? How many times? Where, e.g., title, heading?) and document-independent (Is ti an important word? Is it a noun/verb/adj/...? Is it a number/name/emoticon?).
26 VSM: TF-IDF Weighting D = [w1, w2, ..., wk]. wi: how important the term ti is for representing D's information. How to compute wi? A popular approach is to use a TF-IDF weighting, where TF is the within-document term frequency (document-dependent) and IDF is the inverse document frequency (document-independent): wi = TF(ti, D) × IDF(ti).
27 Choices of TF: Binary Only consider whether or not a term appears in a document; ignore repeated occurrences of the same term. When to use? Very short documents, where repeated occurrence of the same term is rare and unstable (due to the small text sample), e.g., twitter search, passage/sentence retrieval. TF_binary(ti, D) = 1 if c(ti, D) > 0, and 0 if c(ti, D) = 0. Notation: c(t, D) is the number of times t appears in D (term frequency).
28 Choices of TF: Raw Frequency Frequent terms in a document are more important (for representing that document's information); the importance of a term is proportional to its frequency in D. When to use? When you believe it is true or you have the evidence that it is true. Frequent terms may have too strong an influence. TF_raw(ti, D) = c(ti, D). Notation: c(t, D) is the number of times t appears in D (term frequency).
29 Choices of TF: Log Frequency Frequent terms in a document are more important (for representing that document's information), but repeated occurrences of the term are penalized: the first occurrence of a term is the most important, and repeated occurrences are less and less important. A greater log base (b) penalizes the value by a greater extent. TF_log(ti, D) = 1 + log_b c(ti, D) if c(ti, D) > 0, and 0 if c(ti, D) = 0. Notation: c(t, D) is the number of times t appears in D (term frequency).
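The three TF choices so far can be sketched as small functions of the raw count c(t, D):

```python
import math

def tf_binary(c):
    # 1 if the term appears at all, 0 otherwise.
    return 1 if c > 0 else 0

def tf_raw(c):
    # Importance proportional to the raw frequency.
    return c

def tf_log(c, base=math.e):
    # First occurrence counts most; repeats are damped by the log.
    return 1 + math.log(c, base) if c > 0 else 0
```

A larger base penalizes repeats more: tf_log(8, base=2) gives 4, while tf_log(8, base=10) gives roughly 1.9.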
30 [Figure: growth of y = x compared with y = 1 + log_b(x) for log bases 2, e, and 10]
31 Choices of TF: Log Frequency Prefers matching more unique terms: c(t1,d) = 2, c(t2,d) = 1 is better than c(t1,d) = 3, c(t2,d) = 0. Query: cat dog cat; with natural-log TF, q = (1 + ln 2, 1, 0) ≈ (1.69, 1, 0). Ranking by similarity: D2 (cat cat dog) > D3 (cat dog dog) > D1 (cat cat cat). [Figure: D1 ≈ (2.10, 0, 0), D2 = q ≈ (1.69, 1, 0), D3 ≈ (1, 1.69, 0)]
32 Choices of TF: Others But raw frequency, log frequency, and binary are the most popular ones. Table from the CDM textbook.
33 Choices of IDF: Uniform Every term is equally important. Almost always a bad idea. When to use? I can't remember any time it worked. IDF_uniform(ti) = 1.
34 Choices of IDF: KSJ The original IDF by Karen Spärck Jones (KSJ): IDF_KSJ(t) = log(N / n_t). The log base does not affect ranking (in most retrieval models). The total frequency of the term in the corpus does not directly influence IDF (although it is almost always highly correlated with n_t). Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.
35 Choices of IDF: BM25 The IDF used in BM25 (Wed). It has some probabilistic interpretations: P(w|NR) (Wed). The +0.5 is used for smoothing zero values and has little influence in a large corpus. A greater discounting: very frequent terms have zero weight. IDF_BM25(t) = log((N − n_t + 0.5) / (n_t + 0.5)) if n_t < N/2, and 0 if n_t ≥ N/2. Notation: N is the total number of documents in the corpus; n_t is the number of documents containing t.
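The two non-trivial IDF choices can be sketched side by side; both take the corpus size N and the document frequency n_t:

```python
import math

def idf_ksj(N, n_t):
    # Karen Spärck Jones' original IDF.
    return math.log(N / n_t)

def idf_bm25(N, n_t):
    # BM25-style IDF: terms in more than half the corpus get zero weight.
    if n_t < N / 2:
        return math.log((N - n_t + 0.5) / (n_t + 0.5))
    return 0.0
```

Both decrease as n_t grows, but only the BM25 variant cuts very frequent terms off at exactly zero.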
36 VSM: Query Vector Can use the same approach as for the document vector, but it can be different as well: SMART supports using different approaches for queries and documents, but it requires a lot of hand-tuning, so usually avoid using very different approaches. Some considerations: repeated occurrences of a query term may not mean the term is more important to the user's information need; anyhow, queries are short and term repetition is very rare.
37 VSM: Similarity Measure Cosine is the most frequently used one; it normalizes vectors by Euclidean length. Euclidean distance is another option, but it does not fit most IR applications: it is strongly influenced by document length, requiring documents to have not only a similar distribution but also a similar length. A few areas where it may apply: plagiarism detection, finding similar documents.
38 Standard VSM Summary Very simple: map everything to a vector and compare using the angle between vectors. Challenge: finding a good weighting scheme. Variants of TF-IDF are the most common; the Okapi TF function is popular, particularly in research systems. The VSM model provides no guidance. Another challenge: the comparison/similarity function. Cosine is the most common; the generic inner product (without unit vectors) also occurs. The VSM model provides no guidance.
39 Outline Today Boolean Search Vector Space Model Latent semantic indexing Scott Deerwester; Susan T. Dumais (Gerard Salton Award, 2009; Athena Lecturer Award, 2014)
40 Terms are not independent dimensions Some terms tend to co-occur more/less often than others. P(t1, t2) = P(t1) P(t2): t1 and t2 co-occur randomly (independent). P(t1, t2) > P(t1) P(t2): t1 and t2 often co-occur together (dependent). P(t1, t2) < P(t1) P(t2): t1 and t2 often do not co-occur (dependent). [Example term-document matrix over terms index, retrieval, search, information, data, computer, science and documents D1-D4]
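The independence test above can be checked on a toy corpus; the documents below are made up, since the slide's matrix values are not recoverable:

```python
# Toy corpus: each document is its set of terms.
docs = [
    {"index", "retrieval", "search", "information"},
    {"index", "retrieval", "information"},
    {"information", "data", "computer", "science"},
    {"data", "computer", "science"},
]
N = len(docs)

def p(*terms):
    # Estimated probability that a document contains all the given terms.
    return sum(all(t in d for t in terms) for d in docs) / N

# "index" and "retrieval" co-occur more often than independence predicts;
# "index" and "data" co-occur less often than independence predicts.
positively_dependent = p("index", "retrieval") > p("index") * p("retrieval")
negatively_dependent = p("index", "data") < p("index") * p("data")
```

Both flags come out True on this corpus, matching the two "dependent" cases on the slide.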
42 Terms are not independent dimensions We don't need as many as |V| dimensions; we can have a more compact representation of the corpus. The latent dimensions may also capture certain semantics (such as a group of synonyms or words related to the same topic). [Figure: latent dimension 1 groups index, search, retrieval, information; latent dimension 2 groups data, science, computer]
43 Singular Value Decomposition (SVD) It requires at most m dimensions to fully represent a corpus with m documents, because rank(C_k×m) ≤ m. We simply assume rank(C_k×m) = m and m < k in this example. SVD is a dimension reduction technique that transforms the corpus from the original k-dimensional space to an m-dimensional space. If you are not familiar with the definition of rank, please check Wikipedia. Decomposition: C (k×m, original) = U (k×m) · S (m×m, diagonal) · V^T (m×m) (transformed).
44 [Example term-document count matrix C over terms index, retrieval, search, information, data, computer, science and documents D1-D4; values omitted] To do SVD in MATLAB: [U, S, V] = svd(C, 0);
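A Python/NumPy equivalent of the MATLAB call; the count matrix here is made up, since the slide's values are not recoverable:

```python
import numpy as np

# Hypothetical 7-term x 4-document count matrix C (k x m).
C = np.array([
    [1, 1, 0, 0],  # index
    [1, 1, 0, 0],  # retrieval
    [1, 0, 0, 0],  # search
    [1, 1, 1, 0],  # information
    [0, 0, 1, 1],  # data
    [0, 0, 1, 1],  # computer
    [0, 0, 1, 1],  # science
], dtype=float)

# Economy-size SVD, the analogue of MATLAB's svd(C, 0): U is k x m,
# s holds the m singular values (sorted descending), Vt is V transposed.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
```

Multiplying the three factors back together reconstructs C exactly, since no dimensions have been dropped yet.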
45 [The resulting factors: U (terms × latent dimensions H1-H4), diagonal S (H1-H4), and V^T (latent dimensions × documents D1-D4); values omitted]
46 U defines the directions of the m axes of the new m-dimensional space in the original k-dimensional space. Each column Hi of U (k×m) is a latent dimension and a unit vector. The latent dimensions (H1, ...) are orthogonal to each other, e.g., cos(H1, H2) = 0, cos(H2, H4) = 0, ...
47 S is the diagonal matrix of singular values. If you are not familiar with the definition of singular value, please check Wikipedia. You can consider singular values as the scaling factors between the original and the new space. We can also consider singular values as the importance or informativeness of the latent dimensions. By convention, S is sorted (descending) and should have only positive values.
48 V represents the documents using the new dimensions. Each column of V is a unit-length vector. You can consider the rows of V (the columns of V^T) as the documents' vectors in the new m-dimensional space (after transformation and scaling). U S V^T restores the coordinates of the documents in the original k-dimensional space.
49 Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6). Latent Semantic Indexing (LSI) We want an even more compact representation, so we use only the most important n dimensions (n << rank(C)). The transformed representation is an approximation of the original one: C (k×m, original) ≈ U (k×n) · S (n×n, diagonal) · V^T (n×m) (transformed).
50 n = 3: [Truncated factors: U (k×3) over the terms, S (3×3), and V^T (3×m) over documents D1-D4; values omitted]
51 LSI-Restored Document Representation Using the top three hidden dimensions: C (k×m, original) ≈ U (k×n) S (n×n) V^T (n×m) (restored). [Tables comparing the original and restored term-document matrices over terms index, retrieval, search, information, data, computer, science and documents D1-D4; shading marks entries differing by more than 1]
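The restored matrices on this and the following slides can be reproduced by keeping only the top-n dimensions; as before, the count matrix is made up since the slide's values are not recoverable:

```python
import numpy as np

# Hypothetical 7-term x 4-document count matrix (same as earlier sketch).
C = np.array([
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1],
], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

def restore(n):
    # Approximate C using only the n largest singular values/dimensions.
    return U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]
```

Keeping all dimensions restores C exactly; the fewer dimensions kept, the larger the approximation error, which is why more cells get shaded as n shrinks on the following slides.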
52 n = 2: [Truncated factors: U (k×2) over the terms, S (2×2), and V^T (2×m) over documents D1-D4; values omitted]
53 LSI-Restored Document Representation Using the top two hidden dimensions: C (k×m, original) ≈ U (k×n) S (n×n) V^T (n×m) (restored). [Tables comparing the original and restored term-document matrices; shading marks entries differing by more than 1]
54 n = 1: [Truncated factors: U (k×1) over the terms, S (1×1), and V^T (1×m) over documents D1-D4; values omitted]
55 LSI-Restored Document Representation Using the top hidden dimension: C (k×m, original) ≈ U (k×n) S (n×n) V^T (n×m) (restored). [Tables comparing the original and restored term-document matrices; shading marks entries differing by more than 1]
56 LSI: Retrieval We can use U and S to encode a query or a new document into the n-dimensional space as well. For example, for q = information retrieval index, we transform the query q to u: u^T (1×n) = q^T (1×k) · U (k×n) · S^-1 (n×n). Here q = (index: 1, retrieval: 1, search: 0, information: 1, data: 0, computer: 0, science: 0). [Values of U, S, and u omitted]
57 LSI: Retrieval We retrieve search results by comparing the transformed query u with the transformed representations of the documents. u^T (1×n) = q^T (1×k) · U (k×n) · S^-1 (n×n) is the projection of the query into the new space; we then compare u with the documents' vectors (the columns of V^T) in the new dimensions.
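A sketch of this folding-in step under the same made-up count matrix as before; the query is the slide's "information retrieval index":

```python
import numpy as np

# Hypothetical count matrix over the vocabulary
# (index, retrieval, search, information, data, computer, science).
C = np.array([
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1],
], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

n = 2  # number of latent dimensions kept
q = np.array([1, 1, 0, 1, 0, 0, 0], dtype=float)  # "information retrieval index"

# Fold the query into the latent space: u = S^-1 U^T q.
u = np.diag(1.0 / s[:n]) @ U[:, :n].T @ q

# Score documents by comparing u with their latent representations
# (the columns of Vt, one per document).
scores = Vt[:n, :].T @ u
```

On this toy corpus, the IR-themed documents D1 and D2 score well above the data-themed D3 and D4, which is the intended behavior of the latent space.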
58 LSI: Retrieval What does the transformed query mean? We can restore u to the original k-dimensional space: q̂ (k×1) = U (k×n) · S (n×n) · u (n×1). LSI helps expand the original query to include the term search, which seems helpful in this example. Original q = (index: 1, retrieval: 1, search: 0, information: 1, data: 0, computer: 0, science: 0); restored q̂ ≈ (index: 0.68, retrieval: 0.66, search: 0.72, information: 0.91, data: 0.02, computer: 0.04, science: ...).
59 Results on the MED dataset (1033 medical abstracts, 30 queries): LSI seems not helpful for improving the precision of the top-ranked results. Figure from Deerwester et al. (1990).
60 Results on the CISI dataset (1460 information science abstracts, 35 queries): in general, LSI seems not helpful. Figure from Deerwester et al. (1990).
61 Is LSI any good? Decomposes language into basis vectors; in a sense, it is looking for core concepts. In theory, this means the system will retrieve documents using synonyms or related words of your query words. An appealing technique! We should improve bag-of-words! The original paper has been cited over 10,000 times! Yet the improvements are somewhat limited, especially in terms of precision. In many cases, LSI improves recall by sacrificing precision: the human query is more specific (although it sometimes may miss a few important words), while the LSI query becomes fuzzy. Why is bag-of-words so strong? Probably because human language and its vocabulary have already been evolving for several thousand years. Modified from James Allan's CS646 slides.
62 VSM Summary Standard vector space: each dimension corresponds to a term in the vocabulary; vector elements are real-valued, reflecting term importance; any vector (document, query, ...) can be compared to any other; cosine correlation is the similarity metric used most often. Still widely used today! Latent Semantic Indexing (LSI): each dimension corresponds to a basic concept; documents and queries are mapped into basic concepts; same as standard vector space after that. Whether it's good depends on what you want. Modified from James Allan's CS646 slides.
63 VSM Disadvantages Assumes independence among terms (though this is a very common retrieval model assumption). Lack of justification for some vector operations, e.g., the choice of similarity function and the choice of term weights. Barely a retrieval model: doesn't explicitly model relevance, a person's information need, language models, etc. Assumes a query and a document can be treated the same (symmetric). Lack of a cognitive (or other) justification. Modified from James Allan's CS646 slides.
64 VSM Advantages Simplicity Ability to incorporate term weights Any type of term weights can be added No model that has to justify the use of a weight Ability to handle distributed term representations e.g., LSI Can measure similarities between almost anything: documents and queries documents and documents queries and queries sentences and sentences etc. Modified from James Allan s CS646 slides.
65 Wed (9/28): Probabilistic retrieval models. HW1 is due Wed 11:59pm!
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationHow Latent Semantic Indexing Solves the Pachyderm Problem
How Latent Semantic Indexing Solves the Pachyderm Problem Michael A. Covington Institute for Artificial Intelligence The University of Georgia 2011 1 Introduction Here I present a brief mathematical demonstration
More informationInformation Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)
Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More informationLecture 5: Web Searching using the SVD
Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially
More informationMachine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component
More information9 Searching the Internet with the SVD
9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More information13 Searching the Web with the SVD
13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 11: Probabilistic Information Retrieval 1 Outline Basic Probability Theory Probability Ranking Principle Extensions 2 Basic Probability Theory For events A
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationCAIM: Cerca i Anàlisi d Informació Massiva
1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim
More informationLearning Features from Co-occurrences: A Theoretical Analysis
Learning Features from Co-occurrences: A Theoretical Analysis Yanpeng Li IBM T. J. Watson Research Center Yorktown Heights, New York 10598 liyanpeng.lyp@gmail.com Abstract Representing a word by its co-occurrences
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 13: Query Expansion and Probabilistic Retrieval Paul Ginsparg Cornell University,
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationEmbeddings Learned By Matrix Factorization
Embeddings Learned By Matrix Factorization Benjamin Roth; Folien von Hinrich Schütze Center for Information and Language Processing, LMU Munich Overview WordSpace limitations LinAlgebra review Input matrix
More informationLecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25
Lecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25 Trevor Cohn (Slide credits: William Webber) COMP90042, 2015, Semester 1 What we ll learn in this lecture Probabilistic models for
More informationMotivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models
3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk
More informationCan Vector Space Bases Model Context?
Can Vector Space Bases Model Context? Massimo Melucci University of Padua Department of Information Engineering Via Gradenigo, 6/a 35031 Padova Italy melo@dei.unipd.it Abstract Current Information Retrieval
More informationLatent semantic indexing
Latent semantic indexing Relationship between concepts and words is many-to-many. Solve problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms. For retrieval,
More information1. Ignoring case, extract all unique words from the entire set of documents.
CS 378 Introduction to Data Mining Spring 29 Lecture 2 Lecturer: Inderjit Dhillon Date: Jan. 27th, 29 Keywords: Vector space model, Latent Semantic Indexing(LSI), SVD 1 Vector Space Model The basic idea
More informationPart A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )
Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying
More informationMatrices, Vector Spaces, and Information Retrieval
Matrices, Vector Spaces, and Information Authors: M. W. Berry and Z. Drmac and E. R. Jessup SIAM 1999: Society for Industrial and Applied Mathematics Speaker: Mattia Parigiani 1 Introduction Large volumes
More informationMATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson
MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Latent
More informationCSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides
CSE 494/598 Lecture-4: Correlation Analysis LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Project-1 Due: February 12 th 2016 Analysis report:
More informationLatent Semantic Models. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Latent Semantic Models Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Vector Space Model: Pros Automatic selection of index terms Partial matching of queries
More informationChapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze
Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationLatent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationAssignment 3. Latent Semantic Indexing
Assignment 3 Gagan Bansal 2003CS10162 Group 2 Pawan Jain 2003CS10177 Group 1 Latent Semantic Indexing OVERVIEW LATENT SEMANTIC INDEXING (LSI) considers documents that have many words in common to be semantically
More informationLinear Algebra Background
CS76A Text Retrieval and Mining Lecture 5 Recap: Clustering Hierarchical clustering Agglomerative clustering techniques Evaluation Term vs. document space clustering Multi-lingual docs Feature selection
More informationDealing with Text Databases
Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationCS630 Representing and Accessing Digital Information Lecture 6: Feb 14, 2006
Scribes: Gilly Leshed, N. Sadat Shami Outline. Review. Mixture of Poissons ( Poisson) model 3. BM5/Okapi method 4. Relevance feedback. Review In discussing probabilistic models for information retrieval
More informationLatent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa VS model in practice Document and query are represented by term vectors Terms are not necessarily orthogonal to each other Synonymy: car v.s. automobile Polysemy:
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 9: Collaborative Filtering, SVD, and Linear Algebra Review Paul Ginsparg
More informationInformation Retrieval Basic IR models. Luca Bondi
Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to
More informationRanking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval
Ranking-II Temporal Representation and Retrieval Models Temporal Information Retrieval Ranking in Information Retrieval Ranking documents important for information overload, quickly finding documents which
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationPROBABILISTIC LATENT SEMANTIC ANALYSIS
PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications
More informationCS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya
CS 375 Advanced Machine Learning Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya Outline SVD and LSI Kleinberg s Algorithm PageRank Algorithm Vector Space Model Vector space model represents
More informationCS221 / Autumn 2017 / Liang & Ermon. Lecture 15: Bayesian networks III
CS221 / Autumn 2017 / Liang & Ermon Lecture 15: Bayesian networks III cs221.stanford.edu/q Question Which is computationally more expensive for Bayesian networks? probabilistic inference given the parameters
More informationCSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides
CSE 494/598 Lecture-6: Latent Semantic Indexing LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Homework-1 and Quiz-1 Project part-2 released
More informationDeep Learning Basics Lecture 10: Neural Language Models. Princeton University COS 495 Instructor: Yingyu Liang
Deep Learning Basics Lecture 10: Neural Language Models Princeton University COS 495 Instructor: Yingyu Liang Natural language Processing (NLP) The processing of the human languages by computers One of
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationLatent Semantic Tensor Indexing for Community-based Question Answering
Latent Semantic Tensor Indexing for Community-based Question Answering Xipeng Qiu, Le Tian, Xuanjing Huang Fudan University, 825 Zhangheng Road, Shanghai, China xpqiu@fudan.edu.cn, tianlefdu@gmail.com,
More informationCME323 Distributed Algorithms and Optimization. GloVe on Spark. Alex Adamson SUNet ID: aadamson. June 6, 2016
GloVe on Spark Alex Adamson SUNet ID: aadamson June 6, 2016 Introduction Pennington et al. proposes a novel word representation algorithm called GloVe (Global Vectors for Word Representation) that synthesizes
More informationA Note on the Effect of Term Weighting on Selecting Intrinsic Dimensionality of Data
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 9, No 1 Sofia 2009 A Note on the Effect of Term Weighting on Selecting Intrinsic Dimensionality of Data Ch. Aswani Kumar 1,
More informationChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign
Axiomatic Analysis and Optimization of Information Retrieval Models ChengXiang ( Cheng ) Zhai Department of Computer Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/homes/czhai
More informationSocial Data Mining Trainer: Enrico De Santis, PhD
Social Data Mining Trainer: Enrico De Santis, PhD enrico.desantis@uniroma1.it CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Outlines Vector Semantics From plain text to
More informationLeverage Sparse Information in Predictive Modeling
Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from
More informationRecap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?
CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This
More informationExtended IR Models. Johan Bollen Old Dominion University Department of Computer Science
Extended IR Models. Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen January 20, 2004 Page 1 UserTask Retrieval Classic Model Boolean
More information16 The Information Retrieval "Data Model"
16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues
More information