CS 572: Information Retrieval
Lecture 11: Topic Models

Acknowledgments: Some slides were adapted from Chris Manning and from Thomas Hofmann.

Plan for next few weeks
Project 1: done (submit by Friday).
Project 2: (topic) language models: TBA tomorrow.
Monday 2/22: no class. Watch the LDA Google talk by David Blei: https://www.youtube.com/watch?v=7bmsuybpx90
Wednesday 2/24: guest lecture: Prof. Joyce Ho
Monday 2/29: Semantics (conclusion); NLP for IR
Wednesday 3/2: NLP for IR + guest lecture
Wednesday 3/2: Midterm (take-home) assigned. Due by 5pm Thursday 3/3.

Recall: Term-document matrix

             Anthony and    Julius    The        Hamlet   Othello   Macbeth
             Cleopatra      Caesar    Tempest
  anthony    5.25           3.18      0.0        0.0      0.0       0.35
  brutus     1.21           6.10      0.0        1.0      0.0       0.0
  caesar     8.59           2.54      0.0        1.51     0.25      0.0
  calpurnia  0.0            1.54      0.0        0.0      0.0       0.0
  cleopatra  2.85           0.0       0.0        0.0      0.0       0.0
  mercy      1.51           0.0       1.90       0.12     5.25      0.88
  worser     1.37           0.0       0.11       4.15     0.25      1.95
  ...

Today: Can we transform this matrix to identify the "meaning" or topic of the documents, and use that for retrieval, classification, etc.?
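
For concreteness, here is a minimal sketch (in Python with NumPy, an assumption; the slides do not prescribe a language) that stores this term-document matrix as an array and computes baseline document-document cosine similarities in the raw term space. The later SVD sketches reuse this matrix C.

```python
import numpy as np

# Term-document matrix from the slide: rows = terms, columns = plays.
terms = ["anthony", "brutus", "caesar", "calpurnia", "cleopatra", "mercy", "worser"]
docs = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

C = np.array([
    [5.25, 3.18, 0.00, 0.00, 0.00, 0.35],
    [1.21, 6.10, 0.00, 1.00, 0.00, 0.00],
    [8.59, 2.54, 0.00, 1.51, 0.25, 0.00],
    [0.00, 1.54, 0.00, 0.00, 0.00, 0.00],
    [2.85, 0.00, 0.00, 0.00, 0.00, 0.00],
    [1.51, 0.00, 1.90, 0.12, 5.25, 0.88],
    [1.37, 0.00, 0.11, 4.15, 0.25, 1.95],
])  # shape: (7 terms, 6 documents)

# Document-document cosine similarities in the original term space,
# for comparison with the latent-space similarities computed later.
norms = np.linalg.norm(C, axis=0, keepdims=True)
cosine = (C / norms).T @ (C / norms)
print(np.round(cosine, 2))
```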

Problems with Lexical Semantics
Ambiguity and association in natural language.
Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections). The word-based retrieval model is unable to discriminate between different meanings of the same word.

Problems with Lexical Semantics
Synonymy: different terms may have identical or similar meanings (weaker: words indicating the same topic). No associations between words are made in the matrix or vector-space representation.

Polysemy and Context
Document similarity at the single-word level is confounded by polysemy and context. Example: "saturn" contributes to similarity with "planet", "ring", "jupiter", "space", "voyager" when used in its first meaning (the planet), but not when used in its second meaning (the car), where it relates to "car", "company", "dodge", "ford".

Solution: Topic Models
Idea: model words in context (e.g., document).
Examples:
Topic models in science: http://topics.cs.princeton.edu/science/browser/
Topic models in JavaScript (by David Mimno): http://mimno.infosci.cornell.edu/jslda/

Application: Model Evolution of Topics (figure)

Progression of Topic Models
Latent Semantic Analysis / Indexing (LSA / LSI)
Probabilistic LSI (pLSI)
Probabilistic LSI with Dirichlet priors (LDA): Mon 2/22, Google tech talk by David Blei
Scalable topic models (SVD/NMF, Bayesian MF): Wed 2/24, Prof. Joyce Ho
Word2Vec and other extensions: Mon 2/29

Latent Semantic Indexing (LSI)
Perform a low-rank approximation of the document-term matrix (typical rank 100-300).
General idea:
Map documents (and terms) to a low-dimensional representation.
Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
Compute document similarity based on the inner product in this latent semantic space.

Goals of LSI
Similar terms map to similar locations in the low-dimensional space.
Noise reduction by dimension reduction.

Latent Semantic Analysis
Latent semantic space: illustrative example (figure courtesy of Susan Dumais).

Latent semantic indexing: Overview
Decompose the term-document matrix into a product of matrices. The decomposition used is the singular value decomposition (SVD): C = UΣVᵀ, where C is the term-document matrix.
Then use the SVD to compute a new, improved (low-rank) term-document matrix C_k.
Hope: get better similarity values out of C_k (compared to C).
Using the SVD for this purpose is called latent semantic indexing, or LSI.

Singular Value Decomposition
For an M × N matrix A of rank r there exists a factorization (singular value decomposition = SVD):

    A = U Σ Vᵀ

where U is M × M, Σ is M × N, and V is N × N.
The columns of U are orthogonal eigenvectors of AAᵀ.
The columns of V are orthogonal eigenvectors of AᵀA.
The eigenvalues λ₁ … λ_r of AAᵀ are also the eigenvalues of AᵀA.

    σᵢ = √λᵢ,    Σ = diag(σ₁, …, σ_r)

The σᵢ are called the singular values.

Singular Value Decomposition
Illustration of SVD dimensions and sparseness (figure).

SVD example
Let

    A = [ 1  1 ]
        [ 0  1 ]
        [ 1  0 ]

so M = 3, N = 2. Its SVD A = U Σ Vᵀ (U is M × M, Σ is M × N, Vᵀ is N × N) is

    U = [ 2/√6    0     1/√3 ]
        [ 1/√6  -1/√2  -1/√3 ]
        [ 1/√6   1/√2  -1/√3 ]

    Σ = [ √3  0 ]
        [  0  1 ]
        [  0  0 ]

    Vᵀ = [ 1/√2   1/√2 ]
         [ 1/√2  -1/√2 ]

Typically, the singular values are arranged in decreasing order.
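
As a quick check, here is a small NumPy sketch (not part of the original slides) that computes the SVD of this example matrix and verifies the reconstruction; note that numpy.linalg.svd may return columns with flipped signs, which is still a valid SVD.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

# Full SVD: U is 3x3, s holds the two singular values, Vt is 2x2.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

print(np.round(s, 4))          # [1.7321 1.0]  i.e. sqrt(3) and 1
Sigma = np.zeros_like(A)       # 3x2 matrix holding the singular values
Sigma[:2, :2] = np.diag(s)

# Reconstruction should match A up to floating-point error.
print(np.allclose(U @ Sigma @ Vt, A))   # True
```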

Low-rank Approximation
The SVD can be used to compute optimal low-rank approximations.
Approximation problem: find a matrix A_k of rank k such that

    A_k = argmin_{X : rank(X) ≤ k} ‖A − X‖_F

where ‖·‖_F is the Frobenius norm. A_k and X are both M × N matrices. Typically we want k ≪ r.

Low-rank Approximation
Solution via SVD:

    A_k = U diag(σ₁, …, σ_k, 0, …, 0) Vᵀ

i.e., set the smallest r − k singular values to zero. In column notation, A_k is a sum of rank-1 matrices:

    A_k = Σ_{i=1}^{k} σᵢ uᵢ vᵢᵀ

Reduced SVD
If we retain only k singular values and set the rest to 0, then we don't need the corresponding parts of U and Vᵀ (shown in red in the figure).
Then Σ is k × k, U is M × k, Vᵀ is k × N, and A_k is M × N.
This is referred to as the reduced SVD.
It is the convenient (space-saving) and usual form for computational applications.

Approximation error
How good (or bad) is this approximation? It is the best possible, measured by the Frobenius norm of the error:

    min_{X : rank(X) ≤ k} ‖A − X‖_F = ‖A − A_k‖_F = √(σ_{k+1}² + … + σ_r²)

(in the spectral norm, the corresponding error is σ_{k+1}), where the σᵢ are ordered such that σᵢ ≥ σᵢ₊₁.
This shows why the Frobenius error drops as k increases.
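
A minimal sketch of this construction (NumPy; the matrix size is illustrative, not from the slides): build A_k by keeping the k largest singular values and confirm the two error identities above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))   # illustrative M x N matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD

# Rank-k approximation: keep only the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: best rank-k Frobenius error is the norm of the discarded spectrum.
fro_err = np.linalg.norm(A - A_k, ord="fro")
print(np.isclose(fro_err, np.sqrt(np.sum(s[k:] ** 2))))   # True

# In the spectral (2-) norm, the error is the next singular value, sigma_{k+1}.
spec_err = np.linalg.norm(A - A_k, ord=2)
print(np.isclose(spec_err, s[k]))                          # True
```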

SVD Low-rank approximation
Whereas the term-document matrix A may have M = 50,000 and N = 10 million (and rank close to 50,000), we can construct an approximation A₁₀₀ with rank 100. Of all rank-100 matrices, it has the lowest Frobenius error.
C. Eckart, G. Young. The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.

Connection to the Vector Space Model
Intuition: dimension reduction through LSI brings together "related" axes in the vector space.

Intuition from block matrices
(Figure: an m-terms × n-documents matrix that is block diagonal, with blocks Block 1, Block 2, …, Block k of non-zero entries on the diagonal and 0's elsewhere.)
What's the rank of this matrix?

Intuition from block matrices
(Same block-diagonal figure.) Interpretation: the vocabulary is partitioned into k topics (clusters); each document discusses only one topic.

Intuition from block matrices
(Same block-diagonal figure.) What's the best rank-k approximation to this matrix?

Intuition from block matrices
(Figure: the same matrix, but now the off-diagonal regions contain a few non-zero entries; one block's terms include "wiper", "tire", "V6", another's include "car", "automobile".) Likely there's still a good rank-k approximation to this matrix.

Assumption/Hope
(Figure: the term-document matrix is approximately organized into blocks corresponding to Topic 1, Topic 2, Topic 3.)

Latent Semantic Indexing by SVD (figure)

Performing the maps
Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
Claim: this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval.
A query q is also mapped into this space:

    q_k = qᵀ U_k Σ_k⁻¹

Note the mapped query is NOT a sparse vector.

Performing the maps (Sec. 18.4)
AᵀA contains the dot products of all pairs of documents. Approximating A by A_k:

    AᵀA ≈ A_kᵀ A_k = (U_k Σ_k V_kᵀ)ᵀ (U_k Σ_k V_kᵀ) = V_k Σ_k U_kᵀ U_k Σ_k V_kᵀ = (V_k Σ_k)(V_k Σ_k)ᵀ

Since V_k = A_kᵀ U_k Σ_k⁻¹, we should transform a query q to q_k in the same way:

    q_k = qᵀ U_k Σ_k⁻¹
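
Putting the pieces together, here is a minimal LSI retrieval sketch (NumPy; it reuses the toy matrix C from the term-document slide, and the query term weights are illustrative): documents are represented by the rows of V_kΣ_k, the query is folded in via q_k = qᵀU_kΣ_k⁻¹, and documents are ranked by cosine similarity in the latent space.

```python
import numpy as np

def lsi_rank(C, q, k):
    """Rank documents (columns of term-document matrix C) against query vector q
    using a rank-k latent semantic space."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    doc_vecs = (Vt_k * s_k[:, None]).T      # rows of V_k Sigma_k, one per document
    q_k = q @ U_k @ np.diag(1.0 / s_k)      # fold the query into the latent space

    # Cosine similarity between the folded-in query and each document.
    sims = doc_vecs @ q_k / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims), sims

# Example: query containing the terms "brutus" and "caesar"
# (term order as in the earlier matrix C).
q = np.array([0, 1, 1, 0, 0, 0, 0], dtype=float)
# order, sims = lsi_rank(C, q, k=2)
```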

Empirical evidence
Experiments on TREC 1/2/3 (Dumais).
Lanczos SVD code (available on netlib), due to Berry, was used in these experiments. Running times of ~ one day on tens of thousands of docs [still an obstacle to use].
Dimensions: various values in the 250-350 range reported (under 200 reported unsatisfactory). Reducing k improves recall.
Generally we expect recall to improve; what about precision?

Empirical evidence: Conclusion
Precision at or above median TREC precision.
Top scorer on almost 20% of TREC topics.
Slightly better on average than straight vector spaces.
Effect of dimensionality:

    Dimensions   Precision
    250          0.367
    300          0.371
    346          0.374

Failure modes
Negated phrases: TREC topics sometimes negate certain query terms/phrases; this is lost in the automatic conversion of topics to queries.
Boolean queries: as usual, the freetext/vector-space syntax of LSI queries precludes, say, "Find any doc having to do with the following 5 companies".
See Berry and Dumais for more (resources slide).

LSI has many other applications
In many settings in pattern recognition and retrieval, we have a feature-object matrix.
For text, the terms are features and the docs are objects.
The entries could also be opinions and users (recommender systems).
This matrix may be redundant in dimensionality; we can work with a low-rank approximation.
If entries are missing (e.g., users' opinions), they can be recovered if the dimensionality is low.

Resources
http://www.cs.utk.edu/~berry/lsi++/
http://lsi.argreenhouse.com/lsi/lsipapers.html
Dumais (1993). LSI meets TREC: A status report.
Dumais (1994). Latent Semantic Indexing (LSI) and TREC-2.
Dumais (1995). Using LSI for information filtering: TREC-3 experiments.
M. Berry, S. Dumais and G. O'Brien (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595.

Probabilistic View: Topic Language Models
(Figure: an information need generates a query Q; query generation is modeled as P(Q | M_C, M_T) using the collection model M_C and topic models M_T1 … M_Tm, or as P(Q | M_C, M_T, M_d) when also using the document models M_d1 … M_dn of documents d1 … dn in the document collection.)

Latent Aspects: Example (figure)

(Probabilistic) LSI: pLSI

Aspect Model
Generation process:
Choose a doc d with probability P(d). There are N d's.
Choose a latent class z with probability P(z|d). There are K z's, and K << N; K is chosen in advance (how many topics in the collection?).
Generate a word w with probability P(w|z).
This creates the pair (d, w), without direct concern for z.
Joining the probabilities:

    P(d, w) = P(d) Σ_z P(z|d) P(w|z)

Remember: P(z|d) means the probability of z, given d.
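
A tiny sketch of this generative story (NumPy; the probability tables are illustrative placeholders, not learned from any data): sample d ~ P(d), then z ~ P(z|d), then w ~ P(w|z), which yields observed (d, w) pairs with z hidden.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative distributions for N=3 docs, K=2 latent classes, V=4 words.
P_d = np.array([0.5, 0.3, 0.2])                    # P(d)
P_z_given_d = np.array([[0.9, 0.1],                # P(z|d), one row per doc
                        [0.5, 0.5],
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.4, 0.4, 0.1, 0.1],      # P(w|z), one row per class
                        [0.1, 0.1, 0.4, 0.4]])

def sample_pair():
    d = rng.choice(3, p=P_d)                       # choose a document
    z = rng.choice(2, p=P_z_given_d[d])            # choose a latent class
    w = rng.choice(4, p=P_w_given_z[z])            # generate a word
    return d, w                                    # (d, w) is observed; z stays hidden

print([sample_pair() for _ in range(5)])
```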

Aspect Model (2)
Log-likelihood of the observed (d, w) pairs, with n(d, w) the count of word w in doc d:

    L = Σ_d Σ_w n(d, w) log P(d, w) = Σ_d Σ_w n(d, w) log [ P(d) Σ_z P(z|d) P(w|z) ]

Maximize this to find P(d), P(z|d), P(w|z).
Applying Bayes' theorem, we end up with the symmetric form P(d, w) = Σ_z P(z) P(d|z) P(w|z).
What is modeled? Doc-specific word distributions P(w|d) are based on a combination of specific classes/factors/aspects P(w|z), not just an assignment to the nearest cluster.

pLSI Learning (figure)

pLSI Generative Model (figure)

Approach: Expectation Maximization (EM)
EM is a popular technique for maximum likelihood estimation with latent variables. It alternates between:
E-step: calculate the posterior probabilities of z based on the current parameter estimates.
M-step: update the parameter estimates based on the calculated probabilities.

Simple example
Data: points on the real line, roughly between -4 and 5.
Objective: fit a mixture-of-Gaussians model with C = 2 components.
Model:

    P(x | θ) = Σ_{c=1}^{2} π_c N(x; μ_c, σ_c²)

Parameters: θ; part of θ is kept fixed, i.e., only the remaining parameters are estimated.

Likelihood function
The likelihood is a function of the parameters θ; the probability is a function of the random variable x. (Different from the previous plot.)

Probabilistic model
Imagine the model generating the data. We need to introduce a label z for each data point. The label is called a latent variable (also called hidden, unobserved, or missing).
This simplifies the problem: if we knew the labels, we could decouple the components and estimate the parameters separately for each one.

Intuition of EM
E-step: compute a distribution over the labels of the points, using the current parameters.
M-step: update the parameters using the current guess of the label distribution.
Alternate: E, M, E, M, E, ...
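
A minimal sketch of these two steps for the two-Gaussian example (NumPy; it assumes fixed, equal variances and mixing weights and updates only the means — an illustrative choice, since the slide does not spell out which parameters are held fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two clusters, roughly spanning [-4, 5] as in the slide.
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu = np.array([-1.0, 1.0])               # initial means of the C=2 components
sigma, pi = 1.0, np.array([0.5, 0.5])    # kept fixed in this sketch

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) * pi
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update the means given the responsibilities.
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.round(mu, 2))   # converges to approximately [-2, 3]
```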

EM for pLSI (update equations on slide figure)
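
The update equations on this slide are not reproduced in the transcript; the following is a minimal sketch of the standard pLSI EM updates (NumPy; n is the matrix of counts n(d, w), and the function name plsa_em is mine), not the exact code used in the course.

```python
import numpy as np

def plsa_em(n, K, iters=100, seed=0):
    """Fit pLSI by EM. n: (D, V) matrix of counts n(d, w). Returns P(z|d), P(w|z)."""
    rng = np.random.default_rng(seed)
    D, V = n.shape
    P_z_d = rng.random((D, K)); P_z_d /= P_z_d.sum(axis=1, keepdims=True)   # P(z|d)
    P_w_z = rng.random((K, V)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)   # P(w|z)

    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z), for every (d, w) cell.
        post = P_z_d[:, :, None] * P_w_z[None, :, :]          # (D, K, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate the factors from expected counts n(d,w) * P(z|d,w).
        nz = n[:, None, :] * post                              # (D, K, V)
        P_w_z = nz.sum(axis=0)
        P_w_z /= P_w_z.sum(axis=1, keepdims=True) + 1e-12
        P_z_d = nz.sum(axis=2)
        P_z_d /= P_z_d.sum(axis=1, keepdims=True) + 1e-12

    return P_z_d, P_w_z
```

The E-step above materializes the full (D, K, V) posterior for clarity; a production implementation would instead iterate over only the non-zero entries of n.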

pLSA for IR: T. Hofmann 2000
Collections: MED (1033 docs), CRAN (1400 docs), CACM (3204 docs), CISI (1460 docs).
Reporting best results with K varying over 32, 48, 64, 80, 128.
The pLSA* model takes the average across all models at the different K values.

Example of topics found from a Science Magazine papers collection (figure)

Using Aspects for Query Expansion (figure)

Relevance Results
Cosine similarity is the baseline.
In LSI, the query vector q is multiplied by U_k Σ_k⁻¹ to get its reduced-space vector (as above).
In pLSI, documents and queries are represented by P(z|d) and P(z|q); in the (folding-in) EM iterations, only P(z|q) is adapted, with P(w|z) held fixed.
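
A minimal sketch of this folding-in step (NumPy; it assumes the P(w|z) factor produced by the plsa_em sketch above, and the function name fold_in_query is mine): run EM on the query's word counts with P(w|z) held fixed, so only P(z|q) is updated.

```python
import numpy as np

def fold_in_query(q_counts, P_w_z, iters=50, seed=0):
    """Estimate P(z|q) for a query's word-count vector q_counts (length V),
    keeping the trained P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    K = P_w_z.shape[0]
    P_z_q = rng.random(K); P_z_q /= P_z_q.sum()

    for _ in range(iters):
        # E-step: P(z|q,w) proportional to P(z|q) P(w|z).
        post = P_z_q[:, None] * P_w_z                 # (K, V)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step (query side only): re-estimate P(z|q).
        P_z_q = (post * q_counts[None, :]).sum(axis=1)
        P_z_q /= P_z_q.sum() + 1e-12

    return P_z_q   # compare to each document's P(z|d), e.g., by cosine similarity
```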

Precision-Recall results (4/4) (figure)

Experiment: pLSI with 128-factor decomposition (figure)

Extension: Document Priors
Model the document prior.
LDA: an extension of pLSI that better models the document generation process [David Blei].
Video: https://www.youtube.com/watch?v=7bmsuybpx90
Lecture slides: http://www.cs.columbia.edu/~blei/talks/blei_mlss_2012.pdf