Variable Latent Semantic Indexing

Variable Latent Semantic Indexing. Prabhakar Raghavan, Yahoo! Research, Sunnyvale, CA. November 2005. Joint work with A. Dasgupta, R. Kumar, and A. Tomkins (Yahoo! Research).

Outline 1 Introduction 2 Background 3 Variable Latent Semantic Indexing 4 Experiments

Searching Text Corpora

  Word     Count
  Apple       10
  ...        ...
  Drivers     12
  Oranges      0
  ...        ...
  Tiger       20
  Widget       5

Term-Document Matrices. Each term is a dimension, and each document is a vector over terms; the query is also a vector over terms. Weighting schemes: Boolean, Okapi, TF-IDF, etc. Document similarity is closeness in term space. [Figure: query and document vectors plotted over term axes t1-t4.]

Document Similarity. Given the term-document matrix A and a query vector q, a document's relevance to the query is the (weighted) number of terms it has in common with the query. The relevance scores of all documents are given by qᵀA. [Figure: Boolean term-document matrix A, one column per document.]
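The scoring step above can be sketched in a few lines of numpy; the tiny Boolean matrix and query below are illustrative, not data from the talk.

```python
import numpy as np

# Toy Boolean term-document matrix A: rows are terms, columns are documents.
A = np.array([
    [1, 0, 1],   # term "apple"
    [0, 1, 1],   # term "tiger"
    [1, 1, 0],   # term "widget"
])

# Boolean query vector over the same vocabulary: "apple widget".
q = np.array([1, 0, 1])

# q^T A scores every document at once: here, the number of shared terms.
scores = q @ A
print(scores)  # [2 1 1]: document 0 matches both query terms
```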

LSI at a High Level. Start from the term-document matrix A. We want a representation à that preserves semantic associations while using fewer resources. Goal: measuring query-document similarity using à is efficient and gives better results than using A. The basic intuition behind SVD/LSI is also used in clustering and collaborative filtering.

Singular Value Decomposition. The SVD of a 3×3 matrix A factors it as

A = U Σ Vᵀ, with Σ = diag(σ₁, σ₂, σ₃),

where U and V are orthogonal and σ₁ ≥ σ₂ ≥ σ₃ ≥ 0 are the singular values.
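As a quick sanity check, numpy's SVD reproduces this factorization exactly (the random 3×3 matrix is just an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

# Full SVD: A = U @ diag(sigma) @ Vt, singular values in descending order.
U, sigma, Vt = np.linalg.svd(A)

assert np.allclose(U @ np.diag(sigma) @ Vt, A)              # exact factorization
assert np.all(sigma[:-1] >= sigma[1:]) and sigma[-1] >= 0   # sorted, nonnegative
```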

Latent Semantic Indexing. Suppose the data contains only k topics. Keep the top k singular values and vectors:

A(k) = U_k Σ_k V_kᵀ, with Σ_k = diag(σ₁, ..., σ_k).

A(k) is the closest rank-k matrix to A, in both the Frobenius and L₂ norms.
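A numpy sketch of the truncation, verifying the closest-rank-k (Eckart-Young) property on a random matrix; the sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A(k): keep only the top-k singular values and vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# Eckart-Young: the Frobenius error of A(k) equals the root of the sum of
# the discarded squared singular values -- no rank-k matrix does better.
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```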

LSI in Answering Queries. Propose A(k) as the representation Ã. Space is reduced when k is small relative to the average number of terms per document. The dimensionality of the corpus becomes the number of topics in it, giving a cleaner representation of the structure by ignoring irrelevant axes. LSI is believed to identify synonymous terms (e.g., car and automobile) and to disambiguate terms based on context.

Latent Semantic Indexing. LSI finds the best rank-k subspace fitting the documents: Ã = A(k). [Figure: documents in A and in à over term axes t1-t4.]

Motivating Variable LSI. LSI finds the best rank-k subspace fitting the documents, Ã = A(k). But weren't we trying to answer queries? The choice of subspace ignores the queries entirely. [Figure: a query vector alongside the documents in A and Ã.]

Query Distribution. Query vectors have a skewed distribution over terms: in a given corpus, queries may arrive only for a small subset of terms (e.g., queries about sport and politics). There is also co-occurrence between query terms (e.g., data + mining). An ad-hoc solution is to delete the irrelevant terms; is there a principled approach?

Variable Latent Semantic Indexing. Let Q be a probability distribution over the set of terms, and let the query vector q be chosen according to Q. We want à such that, for most such vectors q, qᵀA ≈ qᵀÃ. First cut: minimize the expectation of ‖qᵀ(A − Ã)‖.
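This expectation can be estimated by sampling queries from Q. The sketch below uses single-term queries (so qᵀ(A − Ã) is just a row of the difference matrix) and compares a Monte Carlo estimate against the exact expectation; the sizes and the distribution are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_terms, n_docs, k = 8, 5, 2
A = rng.random((n_terms, n_docs))

# Rank-k LSI approximation A(k) of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Skewed query distribution Q over single-term queries: term t w.p. p[t].
p = rng.random(n_terms)
p /= p.sum()

# For q = e_t, ||q^T (A - A_k)|| is the norm of row t of the difference.
diff = A - A_k
row_norms = np.linalg.norm(diff, axis=1)

# Monte Carlo estimate of E_{q~Q} ||q^T (A - A_k)|| vs. the exact value.
samples = rng.choice(n_terms, size=10_000, p=p)
est = row_norms[samples].mean()
exact = float(p @ row_norms)
assert abs(est - exact) < 0.1
```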

Isotropic Query Distribution. If the query distribution has a uniformly random direction, the expected ‖qᵀ(A − Ã)‖ is minimized at à = A(k). [Figure: documents and a random query vector over term axes t1-t4.]

Skewed Query Distribution. When the query distribution is skewed, the rank-k approximation must be skewed to match it. [Figure: documents and a skewed random query vector over term axes t1-t4.]

Variable Latent Semantic Indexing. Recall that A is the term-document matrix and Q the query distribution. Let C = E_{q∼Q}[qqᵀ] be the term co-occurrence matrix. Form X = C^{1/2} A, find the rank-k approximation X(k) of X, and return à = C^{−1/2} X(k).
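The three steps translate directly into numpy; the helper name `vlsi`, the random matrix, and the choice C = I below are illustrative. With an isotropic query distribution (C = I) the procedure should reduce to plain LSI, which the final check confirms.

```python
import numpy as np

def vlsi(A, C, k):
    """Rank-k VLSI approximation of term-document matrix A under query
    co-occurrence matrix C = E_{q~Q}[q q^T] (assumed positive definite)."""
    # Symmetric square root C^{1/2} and its inverse via eigendecomposition.
    w, V = np.linalg.eigh(C)
    C_half = V @ np.diag(np.sqrt(w)) @ V.T
    C_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    X = C_half @ A                                  # transformed term space
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k fit of X
    return C_half_inv @ X_k                         # back to original space

rng = np.random.default_rng(3)
A = rng.random((6, 4))

# Isotropic queries: C = I, so VLSI coincides with ordinary LSI.
A_tilde = vlsi(A, np.eye(6), k=2)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
assert np.allclose(A_tilde, A_k)
```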

Proof Intuition. Start in the original term-document space; map to the transformed term space via X = C^{1/2} A; take the best rank-k approximation X(k) in the transformed space; return to the original space via C^{−1/2} X(k).

Experimental Setup. Reuters data (1987): 21k documents in five categories, 112k terms, 134 terms per document on average. Preprocessing: Porter-stemmed, case-folded, and stop-worded. Term-document matrices were built with Boolean and Okapi weighting; SVDPACKC was used to compute the SVDs.

Experimental Setup: Query Distribution. Single-word queries: terms distributed according to their frequency in the corpus; a power law ordered by corpus frequency; a power law on a random ordering; terms from two topics (money, commodities). Double-word queries: a power law on ranked bigrams.
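One way to realize the power-law query distributions above is a Zipf-like law over the term ranking; the vocabulary size and exponent below are illustrative, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
n_terms = 1000

# Zipf-like power law: P(term of rank r) proportional to 1 / r**alpha,
# with terms ordered by their frequency in the corpus.
alpha = 1.0
ranks = np.arange(1, n_terms + 1)
p = 1.0 / ranks ** alpha
p /= p.sum()

# Draw single-word queries from this distribution.
queries = rng.choice(n_terms, size=5, p=p)
assert np.isclose(p.sum(), 1.0)   # valid probability distribution
assert p[0] == p.max()            # top-ranked term is the most likely
```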

VLSI Results: L₂ Error. VLSI typically saves an order of magnitude in the number of dimensions needed to reach a given L₂ error.

Results: Competitive Error. Competitive error = 1 − competitive precision. VLSI gives substantial improvements in competitive error as well.

Summary. LSI does effective dimension reduction, and VLSI can fine-tune it to the query distribution. The space requirements are the same as those of LSI, but the co-occurrence matrix has to be estimated.

Future Work. Personalized versions? Analysis of retrieval under stochastic data models? Computational issues: using sampling for efficiency, updating from query streams. Application to other domains?

Thanks!