CS47300: Web Information Search and Management

Similar documents
DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Semantic Analysis. Hongning Wang

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Latent semantic indexing

Information Retrieval

Manning & Schuetze, FSNLP (c) 1999,2000

Dimensionality Reduction

Let A an n n real nonsymmetric matrix. The eigenvalue problem: λ 1 = 1 with eigenvector u 1 = ( ) λ 2 = 2 with eigenvector u 2 = ( 1

Latent Semantic Analysis. Hongning Wang

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

EIGENVALE PROBLEMS AND THE SVD. [5.1 TO 5.3 & 7.4]

Singular Value Decompsition

CS 572: Information Retrieval

Assignment 3. Latent Semantic Indexing

Fast LSI-based techniques for query expansion in text retrieval systems

CS47300: Web Information Search and Management

Machine learning for pervasive systems Classification in high-dimensional spaces

Generic Text Summarization

Latent Semantic Models. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Latent Semantic Analysis (Tutorial)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Linear Algebra - Part II

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Notes on Latent Semantic Analysis

Linear Algebra Background

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

CS 3750 Advanced Machine Learning. Applications of SVD and PCA (LSA and Link analysis) Cem Akkaya

Problems. Looks for literal term matches. Problems:

Introduction to Information Retrieval

Variable Latent Semantic Indexing

Parallel Singular Value Decomposition. Jiaxing Tan

An Empirical Study on Dimensionality Optimization in Text Mining for Linguistic Knowledge Acquisition

Manning & Schuetze, FSNLP, (c)

Matrix Decomposition and Latent Semantic Indexing (LSI) Introduction to Information Retrieval INF 141/ CS 121 Donald J. Patterson

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Probabilistic Latent Semantic Analysis

Multi-Label Informed Latent Semantic Indexing

Linear Algebra Review. Vectors

A few applications of the SVD

The Singular Value Decomposition

Matrices, Vector Spaces, and Information Retrieval

Data Mining and Matrices

USING SINGULAR VALUE DECOMPOSITION (SVD) AS A SOLUTION FOR SEARCH RESULT CLUSTERING

Jordan Normal Form and Singular Decomposition

Background Mathematics (2/2) 1. David Barber

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig


Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

CS 143 Linear Algebra Review

Matrix decompositions and latent semantic indexing

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

Linear Algebra & Geometry why is linear algebra useful in computer vision?

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

PV211: Introduction to Information Retrieval

Matrix Decomposition in Privacy-Preserving Data Mining JUN ZHANG DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF KENTUCKY

UNIT 6: The singular value decomposition.

Linear Algebra & Geometry why is linear algebra useful in computer vision?

Assignment 2 (Sol.) Introduction to Machine Learning Prof. B. Ravindran

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Singular Value Decomposition

vector space retrieval many slides courtesy James Amherst

Singular Value Decomposition (SVD) and Polar Form

1 Singular Value Decomposition and Principal Component

Lecture II: Linear Algebra Revisited

1 Feature Vectors and Time Series

LEC 3: Fisher Discriminant Analysis (FDA)

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Computational Methods. Eigenvalues and Singular Values

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Lecture: Face Recognition and Feature Reduction

Applied Linear Algebra in Geoscience Using MATLAB

Investigation of Latent Semantic Analysis for Clustering of Czech News Articles

INF 141 IR METRICS LATENT SEMANTIC ANALYSIS AND INDEXING. Crista Lopes

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Basic Calculus Review

Foundations of Computer Vision

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review

Deep Learning Book Notes Chapter 2: Linear Algebra

Bruce Hendrickson Discrete Algorithms & Math Dept. Sandia National Labs Albuquerque, New Mexico Also, CS Department, UNM

Lecture: Face Recognition and Feature Reduction

Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

Data Mining Techniques

Text Analytics (Text Mining)

STA141C: Big Data & High Performance Statistical Computing

Principal components analysis COMS 4771

EIGENVALUES AND SINGULAR VALUE DECOMPOSITION

Boolean and Vector Space Retrieval Models

Knowledge Discovery and Data Mining 1 (VO) ( )

Lecture 5: Web Searching using the SVD

Functional Analysis Review

Conceptual Questions for Review

Collaborative Filtering. Radek Pelánek

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26

13 Searching the Web with the SVD

Transcription:

CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages: Hard to choose the dimension of the vector ( basic concept ) Terms may not be the best choice Assume independent relationship among terms Heuristic for choosing vector operations Choose of term weights Choose of similarity function Assume a query and a document can be treated in the same way Jan-17 20 Christopher W. Clifton 1

Vector Space Model What is a good vector representation? Orthogonal: the dimensions are linearly independent ( no overlapping ) No ambiguity (e.g., Java) Wide coverage and good granularity Good interpretation (e.g., representation of semantic meaning) Many possibilities: words, stemmed words, latent concepts. Dual space of terms and documents C1 C2 C3 C4 B1 B2 B3 information 1 1 0 0 0 0 0 retrieval 1 1 0 0 0 0 0 machine 1 1 1 1 0 0 0 learning 0 1 1 1 0 0 0 system 1 0 1 0 1 0 0 protein 0 0 0 0 0 1 1 gene 0 0 0 0 1 1 0 mutation 0 0 1 0 0 1 1 expression 0 0 0 0 1 0 1 Jan-17 20 Christopher W. Clifton 2

(LSI): Explore correlation between terms and documents Two terms are correlated (may share similar semantic concepts) if they often co-occur Two documents are correlated (share similar topics) if they have many common words Associate each term and document with a small number of semantic concepts/topics Use singular value decomposition (SVD) to find a small set of concepts/topics m: number of concepts/topics Representation of concept in document space; V T V=I m Representation of concept in term space; U T U=I m Diagonal matrix: concept space 7 Jan-17 20 Christopher W. Clifton 3

Use singular value decomposition (SVD) to find a small set of concepts/topics m: number of concepts/topics Representation of document in concept space Representation of term in concept space Diagonal matrix: concept space Properties of Diagonal elements of S as S k in descending order, the larger the more important x k = σ i k u k S k v k is the rank-k matrix that best approximates, where U k and V k are the column vector of U and V 9 Jan-17 20 Christopher W. Clifton 4

Other properties of The columns of U are eigenvectors of T The columns of V are eigenvectors of T The singular values on the diagonal of S, are the positive square roots of the nonzero eigenvalues of both AA T and A T A 10 11 Jan-17 20 Christopher W. Clifton 5

12 13 Jan-17 20 Christopher W. Clifton 6

14 Importance of Concepts Importance of Concept Reflects Error of Approximating with small S Size of S k 16 Jan-17 20 Christopher W. Clifton 7

SVD representation Reduce high dimensional representation of document or query into low dimensional concept space SVD tries to preserve the Euclidean distance of document/term vector C1 C2 Concept 1 Concept 2 17 SVD Representation B C Representation of the documents in two dimensional concept space 18 Jan-17 20 Christopher W. Clifton 8

SVD Representation B C Representation of the terms in two dimensional concept space 19 Retrieval with respect to a query Map (fold-in) a query into the representation of the concept space Use the new representation of the query to calculate the similarity between query and all documents Cosine Similarity 20 Jan-17 20 Christopher W. Clifton 9

Query: Machine Learning Protein Representation of the query in the term vector space: [0 0 1 1 0 1 0 0 0] T Representation of the query in the latent semantic space (2 concepts): =[-0.3571 0.1635] T B Query C 22 Jan-17 20 Christopher W. Clifton 10

CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 23 Comparison of Retrieval Results in term space and concept space Query: Machine Learning Protein Jan-17 20 Christopher W. Clifton 11

Problems with latent semantic indexing Difficult to decide the number of concepts There is no probabilistic interpolation for the results The complexity of the LSI model obtained from SVD is costly Retrieval Models Outline Exact-match retrieval method Unranked Boolean retrieval method Ranked Boolean retrieval method Best-match retrieval Vector space retrieval method Latent semantic indexing Jan-17 20 Christopher W. Clifton 12