Intelligent Data Analysis: Mining Textual Data
Peter Tiňo
School of Computer Science, University of Birmingham

Representing documents as numerical vectors

Use a special set of terms $T = \{t_1, t_2, \ldots, t_{|T|}\}$. Represent documents from $D = \{d_1, d_2, \ldots, d_N\}$ by checking, for each document $d_j \in D$, the occurrence of each term $t_k \in T$ in $d_j$. The intensity of occurrence of a term $t_k$ in document $d_j$ is quantified by a number (weight) $x_{k,j} \in \mathbb{R}$. Each document $d_j$, $j = 1, 2, \ldots, N$, is then represented by a document vector $d_j$ summarizing the contributions of all terms:

$$d_j = (x_{1,j}, x_{2,j}, \ldots, x_{|T|,j}) \in \mathbb{R}^{|T|}.$$
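
The following minimal Python sketch (an illustration added here, not part of the original slides) builds such a term-document matrix for an invented toy corpus, using raw word counts as a simple first choice of weights $x_{k,j}$.

```python
import numpy as np

# Toy corpus: N = 3 documents (invented for illustration).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Term set T: here simply all unique lower-cased words in the corpus.
terms = sorted({w for d in docs for w in d.lower().split()})

# |T| x N matrix X of raw counts N_{k,j}.
X = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.lower().split():
        X[terms.index(w), j] += 1

print(terms)
print(X)   # column j is the document vector d_j
```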

What are the terms $t_k$?

1. Intelligently picked words that are useful to check in a given application (Bag of Words). The bag of words can vary from application to application. More sophisticated representations are often not necessary; bag of words works nicely in many applications. Can you think why?

2. Phrases (rather than individual words). Results are not very encouraging. Can you think why?

3. A combination of the two. A promising direction.

Calculating the weights $x_{k,j}$

Usually $x_{k,j} \in [0, 1]$.

1. Binary weights: $x_{k,j} = 1$ if term $t_k$ is present in document $d_j$; $x_{k,j} = 0$ otherwise.

2. Real-valued weights: the more frequent $t_k$ is in $d_j$, the more representative it is of its content; the more documents $t_k$ occurs in, the less discriminating it is.

$$x_{k,j} = \mathrm{tfidf}(t_k, d_j) = N_{k,j} \log \frac{N}{N_k},$$

where

$N$ - the number of documents in $D$,
$N_k$ - the number of documents in $D$ in which term $t_k$ occurs,
$N_{k,j}$ - the number of times $t_k$ occurs in document $d_j$.

The weights are often normalized so that each document vector has unit length, $x_{k,j} \leftarrow x_{k,j} / \sqrt{\sum_{k'} x_{k',j}^2}$. Now $x_{k,j} \in [0, 1]$.
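
As an illustration (not from the slides), here is a small numpy sketch of the tf-idf weighting and unit-length normalization above, assuming X holds the raw counts $N_{k,j}$ from the earlier example.

```python
import numpy as np

def tfidf_weights(X_counts):
    """X_counts: |T| x N matrix of raw counts N_{k,j}."""
    N = X_counts.shape[1]                 # number of documents in D
    N_k = (X_counts > 0).sum(axis=1)      # number of documents containing term t_k
    idf = np.log(N / N_k)                 # log(N / N_k); every term occurs somewhere, so N_k > 0
    X = X_counts * idf[:, None]           # x_{k,j} = N_{k,j} * log(N / N_k)
    norms = np.linalg.norm(X, axis=0)     # normalize each document vector (column) to unit length
    norms[norms == 0] = 1.0               # guard against all-zero columns
    return X / norms

# e.g. X_w = tfidf_weights(X)
```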

Document retrieval/clustering/classification

Work directly in the document representation space!

Document retrieval: a query of terms $q$ - view it as a mini document $q \in \mathbb{R}^{|T|}$. Compare it with the documents $d_j$ in $D$ along the term directions using e.g. the similarity measure

$$\cos \alpha = \frac{q \cdot d_j}{\|q\| \, \|d_j\|}$$

Output the best matching documents from $D$.
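
A minimal sketch of this retrieval step (illustrative only, with invented function and variable names), reusing X and terms from the earlier examples: the query is turned into a mini document vector and the columns of X are ranked by cosine similarity.

```python
import numpy as np

def retrieve(X, terms, query, top=3):
    """Rank documents (columns of X) by cosine similarity to a term query."""
    q = np.zeros(X.shape[0])
    for w in query.lower().split():
        if w in terms:
            q[terms.index(w)] += 1.0
    # cos(q, d_j) for every document; small epsilon avoids division by zero
    sims = (X.T @ q) / (np.linalg.norm(X, axis=0) * np.linalg.norm(q) + 1e-12)
    ranking = np.argsort(-sims)[:top]
    return ranking, sims[ranking]

# e.g. retrieve(X, terms, "cat on a mat")
```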

Think of document clustering and classification...

This is all nice, but... the data is extremely sparse! Typically we can expect 1000s of terms and 10,000s of documents. Think of the situation in terms of a set of $N$ document representations in a $|T|$-dimensional space. How can we hope for reasonable statistics? In addition, there is yet another problem...

Synonymy and Polysemy

Synonymy - different authors describe the same idea using different words (terms).
Polysemy - the same word (term) can have multiple meanings.

But maybe we do not need all the terms! We need to detect a smaller set of dominant abstract contexts in which the terms tend to co-occur in the document collection $D$, and then concentrate only on these... sounds like Principal Component Analysis and dimensionality reduction :-)

Latent Semantic Analysis (LSA)

As usual, collect all the document representations $d_j$, $j = 1, 2, \ldots, N$, in a $|T| \times N$ (design) matrix $X = (x_{k,j})$. Documents $d_j \in \mathbb{R}^{|T|}$ (data points) are stored as columns.

Scaled covariance matrix of documents (may not be zero-mean!) in the term space: $C = X X^\top$.

Eigenvectors and eigenvalues of $C$:
$u_1, u_2, \ldots, u_{|T|} \in \mathbb{R}^{|T|}$,
$\lambda_1, \lambda_2, \ldots, \lambda_{|T|} \in \mathbb{R}$.

Pick only the $l$ most significant principal (prototypical) directions (documents) $u_1, u_2, \ldots, u_l$ (concepts), corresponding to the highest eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_l$. Form a concept matrix $U = [u_1, u_2, \ldots, u_l]$ (principal documents $u_k$ are columns).

Project each document representation $d_j \in \mathbb{R}^{|T|}$ onto its low ($l$-dimensional) representation in the concept space:

$$\tilde{d}_j = U^\top d_j \in \mathbb{R}^l$$

The amount of variation (scale) along projection direction $u_k$ is proportional to $\lambda_k$. To make up for the different scalings, we rescale the projection along $u_k$ by $\lambda_k^{-1}$:

$$\tilde{d}_j = \Lambda^{-1} U^\top d_j, \quad \text{where } \Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_l\}.$$
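
A compact sketch of this projection (an illustration added here, not the lecturer's code; the choice $l = 2$ is arbitrary): form $C = XX^\top$, keep the top-$l$ eigenvectors, and apply the $\Lambda^{-1}$ rescaling from the slide.

```python
import numpy as np

def lsa_project(X, l=2):
    """Project document vectors (columns of X) into an l-dimensional concept space."""
    C = X @ X.T                              # scaled covariance of documents, |T| x |T|
    lam, U = np.linalg.eigh(C)               # eigenvalues ascending, orthonormal eigenvectors
    order = np.argsort(lam)[::-1][:l]        # indices of the l largest eigenvalues
    U_l = U[:, order]                        # concept matrix U, |T| x l
    Lam_inv = np.diag(1.0 / lam[order])      # Lambda^{-1}; assumes the top-l eigenvalues are nonzero
    D_tilde = Lam_inv @ U_l.T @ X            # l x N matrix whose column j is the projected document
    return U_l, lam[order], D_tilde

# e.g. U_l, lam_l, D_tilde = lsa_project(X_w, l=2)
```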

Document retrieval/clustering/classification through LSA

Instead of the $|T|$-dimensional document representation space spanned by the original terms, represent documents in the $l$-dimensional concept space!

Document retrieval: a query of terms $q$ - view it as a mini document $q \in \mathbb{R}^{|T|}$. Its representation in the concept space is $\tilde{q} = \Lambda^{-1} U^\top q$. Compare it with the documents $\tilde{d}_j$ along the concept directions using e.g. the similarity measure

$$\cos \alpha = \frac{\tilde{q} \cdot \tilde{d}_j}{\|\tilde{q}\| \, \|\tilde{d}_j\|}$$

Output the best matching documents from $D$. Think of document clustering and classification...
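
Continuing the sketch (illustrative code with invented names, not from the slides): the query is mapped into the same concept space and compared with the projected documents returned by lsa_project above.

```python
import numpy as np

def lsa_retrieve(U_l, lam_l, D_tilde, terms, query, top=3):
    """Rank documents by cosine similarity to the query in the l-dim concept space."""
    q = np.zeros(U_l.shape[0])
    for w in query.lower().split():
        if w in terms:
            q[terms.index(w)] += 1.0
    q_tilde = np.diag(1.0 / lam_l) @ U_l.T @ q      # q~ = Lambda^{-1} U^T q
    sims = (D_tilde.T @ q_tilde) / (
        np.linalg.norm(D_tilde, axis=0) * np.linalg.norm(q_tilde) + 1e-12)
    return np.argsort(-sims)[:top]

# e.g. lsa_retrieve(U_l, lam_l, D_tilde, terms, "cat on a mat")
```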

Reason about terms

An alternative view: given a term $t_k$, its meaning is captured by the document collection $D$ through the term vector $t_k$ summarizing the contributions from all documents:

$$t_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,N}) \in \mathbb{R}^N.$$

Collect all the term representations $t_k$, $k = 1, 2, \ldots, |T|$, in an $N \times |T|$ (design) matrix $X' = (x_{j,k}) = X^\top$. Terms $t_k \in \mathbb{R}^N$ (data points) are stored as columns of $X'$.

Scaled covariance matrix of terms (in the document space): $C' = X' X'^\top$.

Eigenvectors and eigenvalues of $C'$:
$v_1, v_2, \ldots, v_N \in \mathbb{R}^N$,
$\lambda'_1, \lambda'_2, \ldots, \lambda'_N \in \mathbb{R}$.

Note: because $|T| < N$,
$\lambda'_j = \lambda_j$ for $j = 1, 2, \ldots, |T|$, and
$\lambda'_j = 0$ for $j = |T| + 1, |T| + 2, \ldots, N$.

Pick only the $l$ most significant principal (prototypical) directions (term meanings) $v_1, v_2, \ldots, v_l$ (term concepts), corresponding to the highest eigenvalues $\lambda'_1 \geq \lambda'_2 \geq \ldots \geq \lambda'_l$. Form a term concept matrix $V = [v_1, v_2, \ldots, v_l]$ (principal term meanings $v_j$ are columns).

Project each term meaning representation $t_k \in \mathbb{R}^N$ onto its low ($l$-dimensional) representation in the term concept space:

$$\tilde{t}_k = \Lambda^{-1} V^\top t_k \in \mathbb{R}^l$$

It is now possible, e.g., to cluster the terms according to their underlying meaning, using e.g. the similarity measure

$$\cos \alpha = \frac{\tilde{t}_r \cdot \tilde{t}_k}{\|\tilde{t}_r\| \, \|\tilde{t}_k\|}$$
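
A sketch of this dual, term-space view (again an added illustration, not part of the slides): the columns of $X^\top$ are the term vectors, and the pairwise cosine similarities of their concept-space projections can feed any clustering procedure.

```python
import numpy as np

def term_concept_similarities(X, l=2):
    """Pairwise cosine similarities of terms in the l-dim term concept space."""
    Xp = X.T                                   # N x |T|: term vectors t_k are the columns
    Cp = Xp @ Xp.T                             # scaled covariance of terms, N x N
    lam, V = np.linalg.eigh(Cp)
    order = np.argsort(lam)[::-1][:l]          # top-l directions (their eigenvalues match the document view)
    T_tilde = np.diag(1.0 / lam[order]) @ V[:, order].T @ Xp   # l x |T|: column k is the projected term
    T_unit = T_tilde / (np.linalg.norm(T_tilde, axis=0) + 1e-12)
    return T_unit.T @ T_unit                   # entry (r, k) is cos between projected terms r and k

# e.g. S = term_concept_similarities(X_w, l=2); terms r, k with large S[r, k] share a meaning
```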