CS47300: Web Information Search and Management

Prof. Chris Clifton
6 September 2017
Material adapted from course created by Dr. Luo Si, now leading Alibaba research group

Vector Space Model

Disadvantages:
- Hard to choose the dimensions of the vector (the "basic concepts")
  - Terms may not be the best choice
  - Assumes the terms are independent of one another
- Vector operations are chosen heuristically
  - Choice of term weights
  - Choice of similarity function
- Assumes a query and a document can be treated in the same way

Jan Christopher W. Clifton

Vector Space Model

What is a good vector representation?
- Orthogonal: the dimensions are linearly independent ("no overlapping")
- No ambiguity (e.g., Java)
- Wide coverage and good granularity
- Good interpretation (e.g., representation of semantic meaning)
- Many possibilities: words, stemmed words, "latent concepts"

Dual space of terms and documents

[Table: term-document matrix; cell values not recoverable. Documents: C1, C2, C3, C4, B1, B2, B3. Terms: information, retrieval, machine, learning, system, protein, gene, mutation, expression.]

Latent Semantic Indexing (LSI)

Explore the correlation between terms and documents:
- Two terms are correlated (may share similar semantic concepts) if they often co-occur
- Two documents are correlated (share similar topics) if they have many common words

Associate each term and document with a small number of semantic concepts/topics.

Use singular value decomposition (SVD) of the term-document matrix, $X = U S V^T$, to find a small set of concepts/topics (m: number of concepts/topics):
- V: representation of the concepts in document space; $V^T V = I_m$
- U: representation of the concepts in term space; $U^T U = I_m$
- S: diagonal matrix; concept space
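The decomposition described above can be sketched with NumPy. This is a minimal illustration, not the course's own code: the term-document counts in `X` are invented for the example (the slide's values did not survive), and `m = 2` is an assumed choice for the number of concepts.

```python
import numpy as np

# Hypothetical term-document counts, invented for illustration.
# Rows: information, retrieval, machine, learning, system,
#       protein, gene, mutation, expression
# Columns: C1, C2, C3, C4, B1, B2, B3
X = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

m = 2                 # number of concepts to keep (an assumed choice)
U_m = U[:, :m]        # concepts in term space, shape (9, m)
S_m = np.diag(s[:m])  # concept strengths, shape (m, m)
V_m = Vt[:m, :].T     # concepts in document space, shape (7, m)

# Both factor matrices have orthonormal columns.
print(np.allclose(U_m.T @ U_m, np.eye(m)))
print(np.allclose(V_m.T @ V_m, np.eye(m)))
```

Each row of `U_m` gives one term's coordinates in the concept space, and each row of `V_m` gives one document's coordinates.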

Latent Semantic Indexing

Use singular value decomposition (SVD) to find a small set of concepts/topics (m: number of concepts/topics):
- V: representation of the documents in concept space
- U: representation of the terms in concept space
- S: diagonal matrix; concept space

Properties of Latent Semantic Indexing:
- The diagonal elements of S are the singular values $s_1 \ge s_2 \ge \dots$, in descending order; the larger the value, the more important the concept
- $X_k = \sum_{i \le k} s_i u_i v_i^T$ is the rank-k matrix that best approximates the term-document matrix X, where $u_i$ and $v_i$ are the i-th column vectors of U and V
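The rank-k property can be checked numerically. The sketch below is an illustration under stated assumptions: `rank_k_approximation` is a hypothetical helper name, and the small matrix `X` is made up purely to exercise it.

```python
import numpy as np

def rank_k_approximation(X, k):
    """Return X_k, the sum of the k largest singular triplets of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale the first k columns of U by their singular values,
    # then multiply by the first k rows of V^T.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Small made-up matrix, used only to exercise the function.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])

X_1 = rank_k_approximation(X, 1)
X_2 = rank_k_approximation(X, 2)

# The Frobenius-norm error shrinks as more concepts are kept.
print(np.linalg.norm(X - X_1), np.linalg.norm(X - X_2))
```

Keeping all singular values reproduces `X` exactly; dropping the smallest ones discards the least important concepts first.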

Other properties of Latent Semantic Indexing, where $X = U S V^T$ is the SVD of the term-document matrix:
- The columns of U are eigenvectors of $X X^T$
- The columns of V are eigenvectors of $X^T X$
- The singular values on the diagonal of S are the positive square roots of the nonzero eigenvalues of both $X X^T$ and $X^T X$
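These eigenvector and eigenvalue relations are easy to verify on a small example. The matrix below is invented for illustration and is named `X` here as an assumption about notation; the checks use only standard NumPy routines.

```python
import numpy as np

# Small made-up matrix with full column rank.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Eigenvalues of the symmetric matrix X^T X, reordered to descending.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]
roots_match = np.allclose(np.sqrt(np.clip(eigvals, 0.0, None)), s)

# Each column u_i of U satisfies (X X^T) u_i = s_i^2 u_i,
# and each column v_i of V satisfies (X^T X) v_i = s_i^2 v_i.
left_ok = np.allclose((X @ X.T) @ U, U * s**2)
right_ok = np.allclose((X.T @ X) @ V, V * s**2)

print(roots_match, left_ok, right_ok)
```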

Importance of Concepts

[Figure: importance of each concept plotted against the size of $S_k$. The importance of a concept reflects the error of approximating the original matrix with a small S.]
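One way to make the link between concept importance and approximation error concrete is to compute the error for every possible number of kept concepts. The sketch below assumes the error is measured in the Frobenius norm (the slide does not say which norm it uses), in which case the error of keeping k concepts is the square root of the sum of the squared singular values that were dropped. `approximation_errors` is a hypothetical helper and `X` is made-up data.

```python
import numpy as np

def approximation_errors(X):
    """errors[k] = Frobenius norm of X - X_k, for k = 0, 1, ..., len(s)."""
    s = np.linalg.svd(X, compute_uv=False)
    # Energy left in the dropped concepts: sum of s_i^2 over the
    # singular values that come after the first k.
    tail_energy = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
    return np.sqrt(tail_energy)

# Small made-up matrix, used only to exercise the function.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 2.0]])

errors = approximation_errors(X)
for k, err in enumerate(errors):
    print(f"k={k}: error={err:.3f}")
```

A sharp drop in the error after the first few values of k suggests that a small number of concepts captures most of the matrix.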

SVD Representation

- Reduces the high-dimensional representation of a document or query to a low-dimensional concept space
- SVD tries to preserve the Euclidean distance between document/term vectors

[Figure: plot with labels C1, C2, Concept 1, Concept 2.]

[Figure: representation of the documents in the two-dimensional concept space; labels B, C.]

SVD Representation

[Figure: representation of the terms in the two-dimensional concept space; labels B, C.]

Retrieval with respect to a query
- Map (fold in) the query into the concept space
- Use the new representation of the query to calculate the similarity between the query and all documents
  - Cosine similarity
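The retrieval steps above can be sketched as follows. This is a hedged illustration: the term-document counts are invented, and the fold-in formula $\hat{q} = S_k^{-1} U_k^T q$ with documents represented by the rows of $V_k$ is one common LSI convention, assumed here because the slide's own formula did not survive. `fold_in` and `cosine` are hypothetical helper names.

```python
import numpy as np

terms = ["information", "retrieval", "machine", "learning", "system",
         "protein", "gene", "mutation", "expression"]
docs = ["C1", "C2", "C3", "C4", "B1", "B2", "B3"]

# Hypothetical term-document counts (rows: terms, columns: docs).
X = np.array([
    [1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
], dtype=float)

k = 2  # number of concepts (an assumed choice)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q):
    """Map a term-space vector into concept space: S_k^{-1} U_k^T q."""
    return (U_k.T @ q) / s_k

def cosine(a, B):
    """Cosine similarity between vector a and each row of B."""
    norms = np.linalg.norm(a) * np.linalg.norm(B, axis=1)
    return (B @ a) / np.maximum(norms, 1e-12)  # guard against zero vectors

# Query "machine learning protein" as a binary term vector.
query_terms = {"machine", "learning", "protein"}
q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
q_hat = fold_in(q)

scores = cosine(q_hat, V_k)  # one score per document
for name, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```

With this convention, folding in a column of `X` returns that document's own row of `V_k`, so queries and documents end up in the same space and can be compared directly.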

Query: Machine Learning Protein

Representation of the query in the term vector space: [ ]^T
Representation of the query in the latent semantic space (2 concepts): [ ]^T

[Figure: the query plotted in the concept space; labels B, Query, C.]

Comparison of Retrieval Results in term space and concept space
Query: Machine Learning Protein

Problems with latent semantic indexing
- Difficult to decide the number of concepts
- There is no probabilistic interpretation for the results
- Computing the LSI model with SVD is costly

Retrieval Models: Outline
- Exact-match retrieval methods
  - Unranked Boolean retrieval method
  - Ranked Boolean retrieval method
- Best-match retrieval
  - Vector space retrieval method
  - Latent semantic indexing
