Vector Space Model Yufei Tao KAIST March 5, 2013

In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let $S$ be a set of data documents $D_1, D_2, \ldots, D_t$, where $t = |S|$. Let $Q$ be a query document. Each (data/query) document is a sequence of terms. We want to rank the data documents in descending order of their relevance to $Q$. Namely, the most relevant document should be placed first, the second most relevant placed next, and so on. We want to design a way for a computer to do this for us. Let us refer to the problem as the relevancy problem.

Example: Consider that each document $D_i \in S$ ($1 \le i \le t$) is a news article. Now, given a query article $Q$ about basketball, we would like to find the documents that are similar to $Q$. In other words, ideally, we should rank the articles about basketball before those about, say, soccer, and sports articles before non-sports articles.

How is this related to web search? Let each $D_i$ be a webpage on the Internet. Let $Q$, on the other hand, be the (super-short) document that is the sequence of terms a user types into Google's search box. By ranking the data documents, we are essentially listing them in descending order of how relevant they are to $Q$. This is the order in which we will present the documents to the user. Think: Some search engines provide an option "see more results like this". How can we implement this functionality if we can solve the relevancy problem?

WARNING: The relevancy problem does not have an unambiguous correct answer (relevancy judgment is a subjective matter). Thus, the quality of a method can often be evaluated only empirically, i.e., by seeing how well it actually works in practice. There are a few methods that have been empirically shown to be quite effective. Among them, we will discuss an approach based on the vector space model, because it is acknowledged by many scientists as being highly effective. It is also the approach taken by today's search engines.

The vector space model is based on the following rationale: Rationale: If two documents have similar terms, then they are relevant. You can easily challenge the above rationale by giving many counter-examples. Even though you are absolutely right, chances are that every solution would fail when confronted with a carefully crafted counter-example. Indeed, it is not our goal to find a perfect approach that guarantees a correct answer in all scenarios; remember that even correctness is hard to define. We will be satisfied if our heuristic approach works sufficiently well empirically, and by that standard the above rationale is really not a bad one.

Let DICT be a dictionary which contains all the terms we want to consider in evaluating text relevancy. Think: DICT should not include all the English words. Why? Let $d = |DICT|$, namely, the number of terms in DICT. Let us denote those terms as $w_1, w_2, \ldots, w_d$, respectively. Remark: In practice, words like "study", "studies", "studying" and "studied" are all regarded as the same. They can all be converted to the same form by a process commonly known as stemming.
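
The lecture does not prescribe a particular stemmer. As a minimal sketch, the widely used Porter stemmer (here via NLTK, which is assumed to be installed) collapses the variants above to one form:

```python
# Minimal sketch: stemming with NLTK's Porter stemmer (assumed installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["study", "studies", "studying", "studied"]
print({w: stemmer.stem(w) for w in words})
# All four words map to the same stem ("studi"), so they are treated
# as a single term in DICT.
```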

Vector Space Model. We will convert each document $D_i \in S$ ($1 \le i \le t$) into a point $p_i$ of dimensionality $d$. More specifically, let the coordinates of $p_i$ be $p_i[1], p_i[2], \ldots, p_i[d]$, respectively. Then, for each $j \in [1, d]$, $p_i[j]$ gives the relevance of term $w_j$ to document $D_i$. Next, we will clarify how $p_i[j]$ is computed.

Computation of $p_i[j]$. Heuristic 1: If $w_j$ appears more often in $D_i$, then $w_j$ is more relevant to $D_i$. Definition (Term Frequency): The term frequency (TF) $f_{ij}$ of $w_j$ in $D_i$ equals the number of occurrences of $w_j$ in $D_i$.
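
Assuming a document has already been tokenized and stemmed into a list of terms (a toy example, not from the lecture), the term frequencies can be counted directly:

```python
from collections import Counter

# Term frequency f_ij: the number of occurrences of term w_j in document D_i.
# Toy document, assumed already tokenized and stemmed.
D_i = ["basketbal", "game", "basketbal", "score", "game", "basketbal"]
tf = Counter(D_i)
print(tf["basketbal"])  # 3
```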

Computation of $p_i[j]$. Heuristic 2: If $w_j$ appears in many documents, then it has low relevance to all of them. Definition (Inverse Document Frequency): The inverse document frequency (IDF) $idf_j$ of $w_j$ equals
$$idf_j = \log_2 \frac{|S|}{\lambda}$$
where $\lambda$ is the number of documents in $S$ containing $w_j$.
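
A direct translation of the definition, on a hypothetical corpus of tokenized documents:

```python
import math

def idf(S, w):
    # idf_j = log2(|S| / lambda), where lambda is the number of documents
    # in S that contain term w (assumed to be at least one).
    lam = sum(1 for doc in S if w in doc)
    return math.log2(len(S) / lam)

# Toy corpus of three documents, each given as a set of terms.
S = [{"basketbal", "game"}, {"soccer", "game"}, {"election", "vote"}]
print(idf(S, "game"))       # log2(3/2) ~ 0.58: appears in many documents
print(idf(S, "basketbal"))  # log2(3/1) ~ 1.58: appears in few documents
```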

Computation of $p_i[j]$. Now we can compute $p_i[j]$ as:
$$p_i[j] = \log_2(1 + f_{ij}) \cdot idf_j$$
The above equation is well known as the tf-idf formula. Think: How does this formula reflect Heuristics 1 and 2? Also, why the logs? Why the +1?
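
Putting the two definitions together, a document can be turned into its point $p_i$ as follows (a sketch; the dictionary and idf values below are assumed toy inputs):

```python
import math
from collections import Counter

def doc_to_point(D_i, DICT, idf):
    # p_i[j] = log2(1 + f_ij) * idf_j for every term w_j in DICT.
    tf = Counter(D_i)
    return [math.log2(1 + tf[w]) * idf[w] for w in DICT]

# Toy 3-term dictionary with assumed idf values.
DICT = ["basketbal", "game", "soccer"]
idf  = {"basketbal": 1.585, "game": 0.585, "soccer": 1.585}
print(doc_to_point(["basketbal", "game", "basketbal"], DICT, idf))
# [1.585 * log2(3), 0.585 * log2(2), 0.0]
```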

At this point, we have represented each document $D_i \in S$ ($1 \le i \le t$) as a $d$-dimensional point $p_i$, where $d = |DICT|$. Let $P$ be the set $\{p_1, \ldots, p_t\}$. In a similar manner, we can also convert the query document to a $d$-dimensional point $q = (q[1], \ldots, q[d])$, where
$$q[j] = \log_2(1 + f_j) \cdot idf_j$$
where $f_j$ is the term frequency of term $w_j$ (i.e., the $j$-th term in the dictionary DICT) in $Q$. We will evaluate the relevance of $D_i$ to $Q$, denoted as $score(D_i, Q)$, by looking at only $p_i$ and $q$. Think: It is a bad idea to set $score(D_i, Q)$ simply to the Euclidean distance (i.e., the "line segment distance") between $p_i$ and $q$. Why?
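
A small numeric illustration (not from the lecture) of why raw Euclidean distance is problematic: a long document on the query's topic has large coordinates and thus lies far from a short query, even though it points in the same direction.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# q: a short query; p1: a long document on the same topic (same direction,
# larger norm); p2: a short document on a different topic. Toy vectors.
q  = [1.0, 1.0, 0.0]
p1 = [5.0, 5.0, 0.0]
p2 = [0.0, 1.0, 1.0]
print(euclidean(p1, q))  # ~5.66
print(euclidean(p2, q))  # ~1.41: the off-topic document looks "closer",
                         # which motivates comparing directions instead.
```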

Cosine Metric. Let us view $p_i$ as a vector: from the origin to $p_i$. Similarly, let us also view $q$ as a vector. The norms of $p_i$ and $q$ are calculated as:
$$\|p_i\| = \sqrt{\sum_{j=1}^{d} p_i[j]^2} \qquad \|q\| = \sqrt{\sum_{j=1}^{d} q[j]^2}$$
Their dot product is:
$$p_i \cdot q = \sum_{j=1}^{d} \big(p_i[j] \cdot q[j]\big)$$
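
In code, assuming the vectors are plain Python lists (a sketch):

```python
import math

def norm(v):
    # ||v|| = sqrt( sum_j v[j]^2 )
    return math.sqrt(sum(x * x for x in v))

def dot(u, v):
    # u . v = sum_j u[j] * v[j]
    return sum(x * y for x, y in zip(u, v))
```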

Cosine Metric. Let $\theta$ be the angle between the vectors $p_i$ and $q$. We know:
$$\cos(\theta) = \frac{p_i \cdot q}{\|p_i\| \, \|q\|}$$
We then take the above to be $score(D_i, Q)$, namely:
$$score(D_i, Q) = \cos(\theta)$$
Hence, the larger the cosine value is, the more relevant $D_i$ is to $Q$. The above formula is well known as the cosine metric.
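
A self-contained sketch of the score, evaluated on the toy vectors used above; note that the long on-topic document now receives the highest score.

```python
import math

def cosine_score(p_i, q):
    # score(D_i, Q) = cos(theta) = (p_i . q) / (||p_i|| * ||q||)
    dot = sum(x * y for x, y in zip(p_i, q))
    return dot / (math.sqrt(sum(x * x for x in p_i)) *
                  math.sqrt(sum(x * x for x in q)))

print(cosine_score([5.0, 5.0, 0.0], [1.0, 1.0, 0.0]))  # 1.0: same direction
print(cosine_score([0.0, 1.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5: different topics
```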

Putting Everything Together. We have introduced a full set of procedures required to tackle the relevancy problem. In a nutshell, the procedures are:
1. Convert each document $D_i \in S$ to a point $p_i$.
2. Convert the query document to a point $q$.
3. Calculate $score(D_i, Q)$ for all $i \in [1, |S|]$.
4. Sort the documents of $S$ in descending order of their scores.
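
An end-to-end sketch of the four steps on a toy corpus (assumed already tokenized and stemmed; not the lecture's own code):

```python
import math
from collections import Counter

def rank(docs, query, DICT):
    # Inverse document frequencies: idf_j = log2(|S| / lambda).
    idf = {w: math.log2(len(docs) / sum(1 for D in docs if w in D))
           for w in DICT if any(w in D for D in docs)}

    def to_point(terms):
        # Steps 1 and 2: tf-idf point with p[j] = log2(1 + f_j) * idf_j.
        tf = Counter(terms)
        return [math.log2(1 + tf[w]) * idf.get(w, 0.0) for w in DICT]

    def cosine(u, v):
        # Step 3: score(D_i, Q) = (p_i . q) / (||p_i|| * ||q||).
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return sum(x * y for x, y in zip(u, v)) / (nu * nv) if nu and nv else 0.0

    q = to_point(query)
    scores = [(cosine(to_point(D), q), i) for i, D in enumerate(docs)]
    # Step 4: documents in descending order of score.
    return sorted(scores, reverse=True)

# Toy corpus (hypothetical news articles, pre-tokenized and stemmed).
docs = [["basketbal", "game", "basketbal"],
        ["soccer", "game"],
        ["elect", "vote"]]
DICT = ["basketbal", "game", "soccer", "elect", "vote"]
print(rank(docs, ["basketbal", "game"], DICT))
# The basketball article scores highest and is ranked first.
```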

Think: What are the disadvantages of the vector space model?