Intelligent Data Analysis
Mining Textual Data
Peter Tiňo
School of Computer Science, University of Birmingham
Representing documents as numerical vectors

Use a special set of terms T = {t_1, t_2, ..., t_T}. Represent documents from D = {d_1, d_2, ..., d_N} by checking, for each document d_j ∈ D, the occurrence of each term t_k ∈ T in d_j. The intensity of occurrence of a term t_k in document d_j is quantified by a number (weight) x_{k,j} ∈ R. Each document d_j, j = 1, 2, ..., N, is then represented by a document vector d_j summarizing the contributions of all terms:

d_j = (x_{1,j}, x_{2,j}, ..., x_{T,j}) ∈ R^T.
What are the terms t_k?

1. Intelligently picked words that are useful to check in a given application (Bag of Words). The bag of words can vary from application to application. More sophisticated representations are often not necessary; bag of words works nicely in many applications. Can you think why?

2. Phrases (rather than individual words). Results not very encouraging. Can you think why?

3. A combination of the two. A promising direction.
Calculating the weights x_{k,j}

Usually x_{k,j} ∈ [0, 1].

1. Binary weights: x_{k,j} = 1 if term t_k is present in document d_j; x_{k,j} = 0 otherwise.

2. Real-valued weights: The more frequent t_k is in d_j, the more representative it is of the document's content. The more documents t_k occurs in, the less discriminating it is.

x_{k,j} = tfidf(t_k, d_j) = N_{k,j} · log(N / N_k),
where
N - number of documents in D,
N_k - number of documents in D in which term t_k occurs,
N_{k,j} - number of times t_k occurs in document d_j.

The weights are often normalized so that each document vector has unit length:

x_{k,j} ← x_{k,j} / ||d_j||, where ||d_j|| = (Σ_{r=1}^{T} x_{r,j}^2)^{1/2}.

Now x_{k,j} ∈ [0, 1].
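The tf-idf weighting and unit-length normalization above can be sketched as follows (a minimal sketch; the toy term counts are made up purely for illustration):

```python
import numpy as np

# Toy term-document count matrix: entry (k, j) is N_{k,j}, the number of
# times term t_k occurs in document d_j. Rows = T terms, columns = N documents.
counts = np.array([[2.0, 0.0, 1.0],
                   [0.0, 3.0, 0.0],
                   [1.0, 1.0, 1.0]])

T, N = counts.shape
N_k = (counts > 0).sum(axis=1)           # number of documents containing each term
X = counts * np.log(N / N_k)[:, None]    # x_{k,j} = N_{k,j} * log(N / N_k)

# Normalize each document vector (column) to unit length.
X = X / np.linalg.norm(X, axis=0)
```

Note that a term occurring in every document (the third row here) gets weight zero: it does not discriminate between documents at all.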
Document retrieval/clustering/classification

Work directly in the document representation space!

Document retrieval: Given a query of terms q, view it as a mini document q ∈ R^T. Compare it with the documents d_j in D along the term directions using e.g. the cosine similarity measure

cos α = (q · d_j) / (||q|| ||d_j||).

Output the best matching documents from D.
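Retrieval by cosine similarity can be sketched like this (a minimal sketch; the tf-idf matrix and query vector are hypothetical stand-ins):

```python
import numpy as np

# Hypothetical tf-idf document matrix: T = 4 terms, N = 3 documents (columns).
X = np.array([[0.8, 0.0, 0.1 ],
              [0.0, 0.9, 0.2 ],
              [0.6, 0.1, 0.0 ],
              [0.0, 0.4, 0.97]])

q = np.array([1.0, 0.0, 1.0, 0.0])  # query, viewed as a mini document in R^T

# cos(alpha) = (q . d_j) / (||q|| ||d_j||) for each document column d_j
sims = (q @ X) / (np.linalg.norm(q) * np.linalg.norm(X, axis=0))

ranking = np.argsort(-sims)  # best matching documents first
```

Here the query shares terms 1 and 3 with the first document, so document 0 ranks first.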
Think of document clustering and classification...
This is all nice, but... the data is extremely sparse! Typically we can expect 1000s of terms and 10000s of documents. Think of the situation in terms of a set of N document representations in a T-dimensional space. How can we hope for reasonable statistics?

In addition, there is yet another problem...
Synonymy and Polysemy

Synonymy - different authors describe the same idea using different words (terms).
Polysemy - the same word (term) can have multiple meanings.

But maybe we do not need all the terms! We need to detect a smaller set of dominant abstract contexts in which the terms tend to co-occur in the document collection D, and then concentrate only on these... sounds like Principal Component Analysis and dimensionality reduction :-)
Latent Semantic Analysis (LSA)

As usual, collect all the document representations d_j, j = 1, 2, ..., N, in a T × N (design) matrix X = (x_{k,j}). Documents d_j ∈ R^T (data points) are stored as columns.

Scaled covariance matrix of documents in the term space (the data may not be zero-mean!):

C = X X^T.

Eigenvectors + eigenvalues of C:
u_1, u_2, ..., u_T ∈ R^T,
λ_1, λ_2, ..., λ_T ∈ R.
Pick only the l most significant principal (prototypical) directions (documents) u_1, u_2, ..., u_l (concepts), corresponding to the highest eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_l. Form a concept matrix U = [u_1, u_2, ..., u_l] (principal documents u_k are columns).

Project each document representation d_j ∈ R^T onto the low (l-dimensional) representation of d_j in the concept space:

d̂_j = U^T d_j ∈ R^l.
The amount of variation (scale) along projection direction u_k is proportional to λ_k. To make up for the different scalings we rescale the projection along u_k by λ_k^{-1}:

d̂_j = Λ^{-1} U^T d_j, where Λ = diag{λ_1, λ_2, ..., λ_l}.
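The LSA steps above (covariance, eigendecomposition, projection, rescaling) can be sketched as follows (a minimal sketch on a random stand-in matrix; in practice X would hold tf-idf weights):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, l = 6, 10, 2           # number of terms, documents, and concepts
X = rng.random((T, N))       # stand-in for a T x N tf-idf document matrix

C = X @ X.T                                # scaled covariance of documents (term space)
eigvals, eigvecs = np.linalg.eigh(C)       # eigh: symmetric C, ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # sort descending: lambda_1 >= lambda_2 >= ...
lam = eigvals[order][:l]                   # top-l eigenvalues
U = eigvecs[:, order][:, :l]               # concept matrix (columns u_1, ..., u_l)

D_hat = U.T @ X                            # l x N: documents in the concept space
D_hat_scaled = np.diag(1.0 / lam) @ D_hat  # rescale along each u_k by 1/lambda_k
```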
Document retrieval/clustering/classification through LSA

Instead of the T-dim document representation space spanned by the original terms, represent documents in the l-dim concept space!

Document retrieval: Given a query of terms q, view it as a mini document q ∈ R^T. Its representation in the concept space is q̂ = Λ^{-1} U^T q. Compare it with the documents d̂_j along the concept directions using e.g. the cosine similarity measure

cos α = (q̂ · d̂_j) / (||q̂|| ||d̂_j||).
Output the best matching documents from D.

Think of document clustering and classification...
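Query matching in the concept space can be sketched like this (a minimal sketch; U, Λ and the document projections come from the LSA step, here rebuilt on random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, l = 6, 10, 2
X = rng.random((T, N))                  # stand-in T x N document matrix

# LSA step: top-l eigenvectors/eigenvalues of C = X X^T
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1][:l]
lam, U = eigvals[order], eigvecs[:, order]
Lam_inv = np.diag(1.0 / lam)

D_hat = Lam_inv @ U.T @ X               # documents in the concept space
q = rng.random(T)                       # a query, viewed as a mini document
q_hat = Lam_inv @ U.T @ q               # its concept-space representation

# Cosine similarity along the concept directions; best match first.
sims = (q_hat @ D_hat) / (np.linalg.norm(q_hat) * np.linalg.norm(D_hat, axis=0))
best = int(np.argmax(sims))
```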
Reason about terms

An alternative view: Given a term t_k, its meaning is captured by the document collection D through the term vector t_k, summarizing the contributions from all documents:

t_k = (x_{k,1}, x_{k,2}, ..., x_{k,N}) ∈ R^N.

Collect all the term representations t_k, k = 1, 2, ..., T, in an N × T (design) matrix X′ = (x_{j,k}) = X^T. Terms t_k ∈ R^N (data points) are stored as columns of X′.

Scaled covariance matrix of terms (in the document space): C′ = X′ X′^T.
Eigenvectors + eigenvalues of C′:
v_1, v_2, ..., v_N ∈ R^N,
λ′_1, λ′_2, ..., λ′_N ∈ R.

Note: because T < N,
λ′_j = λ_j for j = 1, 2, ..., T,
λ′_j = 0 for j = T+1, T+2, ..., N.

Pick only the l most significant principal (prototypical) directions (term meanings) v_1, v_2, ..., v_l (term concepts), corresponding to the highest λ′_1 ≥ λ′_2 ≥ ... ≥ λ′_l. Form a term concept matrix V = [v_1, v_2, ..., v_l] (principal term meanings v_j are columns).
Project each term meaning representation t_k ∈ R^N onto the low (l-dimensional) representation of t_k in the term concept space:

t̂_k = Λ^{-1} V^T t_k ∈ R^l.

It is now possible, e.g., to cluster the terms according to their underlying meaning, using e.g. the cosine similarity measure

cos α = (t̂_k · t̂_r) / (||t̂_k|| ||t̂_r||).
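The dual, term-side computation can be sketched analogously (a minimal sketch on random stand-in data; `term_sim` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, l = 5, 8, 2
Xp = rng.random((N, T))                       # X': N x T, term vectors t_k as columns

# C' = X' X'^T is N x N with rank <= T, so at most T nonzero eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Xp @ Xp.T)
order = np.argsort(eigvals)[::-1][:l]
lam, V = eigvals[order], eigvecs[:, order]    # top-l term concepts

T_hat = np.diag(1.0 / lam) @ V.T @ Xp         # l x T: terms in the term concept space

# Cosine similarity between two projected terms, usable for clustering terms.
def term_sim(k, r):
    a, b = T_hat[:, k], T_hat[:, r]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Terms with high pairwise similarity in this space tend to co-occur in the same dominant contexts of D, even if they never appear in the same document.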