Text Document Clustering Using Global Term Context Vectors

Knowledge and Information Systems, 31:3, pp. 455-474, 2012. (This is a recompilation of the published material.)

Text Document Clustering Using Global Term Context Vectors

Argyris Kalogeratos, Aristidis Likas

Abstract Despite the advantages of the traditional Vector Space Model (VSM) representation, there are known deficiencies concerning the term independence assumption. The high dimensionality and sparsity of the text feature space, and phenomena such as polysemy and synonymy, can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the Global Term Context Vector-VSM (GTCV-VSM) method for text document representation. It is an extension to VSM that: i) captures local contextual information for each term occurrence in the term sequences of documents; ii) combines the local contexts of the occurrences of a term to define the global context of that term; iii) constructs a proper semantic matrix using the global context of all terms; iv) uses this matrix to linearly map traditional VSM (Bag of Words - BOW) document vectors onto a semantically smoothed feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared to traditional VSM-based approaches.

Keywords Text Mining, Document Clustering, Semantic Matrix, Data Projection

Argyris Kalogeratos, Department of Computer Science, University of Ioannina, GR 4511, Ioannina, Greece. E-mail: akaloger@cs.uoi.gr
Aristidis Likas, Department of Computer Science, University of Ioannina, GR 4511, Ioannina, Greece. E-mail: arly@cs.uoi.gr, Tel.: +3 2651 881, Fax: +3 2651 8882

1 Introduction

The text document clustering procedure aims at automatically partitioning a given collection of unlabeled text documents into a (usually predefined) number

of groups, called clusters, such that similar documents are assigned to the same cluster while dissimilar documents are assigned to different clusters. This is a task that discovers the underlying structure in a set of data objects and enables efficient organization and navigation in large text collections. The challenging characteristics of the text document clustering problem are related to the complexity of natural language. Text documents are represented in high dimensional and sparse (HDS) feature spaces, due to their large term vocabularies (the number of different terms of a document collection, or text features in general). In an HDS feature space, the difference between the distance of two similar objects and the distance of two dissimilar objects is relatively small [4]. This phenomenon prevents clustering methods from achieving good data partitions. Moreover, the text semantics, e.g. term correlations, are mostly implicit and non-trivial, hence difficult to extract without prior knowledge for a specific problem. The traditional document representation is the Vector Space Model (VSM) [28], where each document is represented by a vector of weights corresponding to text features. Many variations of VSM have been proposed [17] that differ in what they consider as a feature or term. The most common approach is to consider different words as distinct terms, which is the widely known Bag Of Words (BOW) model. An extension is the Bag Of Phrases (BOP) model [23] that extracts a set of informative phrases or word n-grams (n consecutive words). Especially for noisy document collections, e.g. those containing many spelling errors, or collections whose language is not known in advance, it is often better to use VSM to model the distribution of character n-grams in documents. Herein, we consider word features and refer to them as terms; however, the procedures we describe can be directly extended to more complex features. Despite the simplicity of the popular word-based VSM version, there are common language phenomena that it cannot handle. More specifically, it cannot distinguish the different senses of a polysemous word in different contexts, or recognize the common sense shared by synonyms. It also fails to recognize multi-word expressions (e.g. "Olympic Games"). These deficiencies are in part due to the over-simplistic assumption of term independence, where each dimension of the HDS feature space is considered to be orthogonal to the others, and they make the classic VSM model incapable of capturing the complex language semantics. The VSM representations of documents can be improved by examining the relations between terms either at a low level, such as term co-occurrence frequency, or at a higher semantic similarity level. Among the popular approaches is Latent Semantic Indexing (LSI) [7], which solves an eigenproblem using Singular Value Decomposition (SVD) to determine a proper feature space onto which to project the data. Concept Indexing [16] computes a k-partition by clustering the documents, and then uses the centroid vectors of the clusters as the axes of the reduced space. Similarly, Concept Decomposition [8] approximates the term-by-document data matrix in a least-squares fashion using centroid vectors. A simpler but quite efficient method is the Generalized Vector Space Model (GVSM) [31]. GVSM represents documents in the document similarity space, i.e.
each document is represented as a vector containing its similarities to the rest of the documents in the collection. The Context Vector Model (CVM-VSM ) [5] is a VSM-extension that describes the semantics of each term by introducing a term context vector that stores its similarities to the other terms. The similarity

between terms is based on a document-wise term co-occurrence frequency. The term context vectors are then used to map document vectors into a feature space of equal size to the original, but less sparse. The ontology-based VSM approaches [13, 15] map the terms of the original space onto a feature space defined by a hierarchically structured thesaurus, called an ontology. Ontologies provide information about the words of a language and their possible semantic relations, thus an efficient mapping can disambiguate the word senses in the context of each document. The main disadvantage is that, in most cases, the ontologies are static and rather generic knowledge bases, which may cause heavy semantic smoothing of the data. A special text representation problem is related to very short texts [14, 25]. In this work, we present the Global Term Context Vector-VSM (GTCV-VSM) representation, an entirely corpus-based extension to the traditional VSM that incorporates contextual information for each vocabulary term. First, the local context for each term occurrence in the term sequences of documents is captured and represented in vector space by exploiting the idea of the Locally Weighted Bag of Words [18]. Then all the local contexts of a term are combined to form its global context vector. The global context vectors constitute a semantic matrix which efficiently maps the traditional VSM document vectors onto a semantically richer feature space of the same dimensionality as the original. As indicated by our experimental study, in the new space, superior clustering solutions are achieved using well-known clustering algorithms such as spherical k-means [8] or spectral clustering [24]. The rest of this paper is organized as follows. Sect. 2 provides some background on document representation using the Vector Space Model. In Sect. 3, we describe recent approaches for representing a text document using histograms that describe the local context at each location of the document term sequence. In Sect. 4, we present our proposed approach for document representation. The experimental results are presented in Sect. 5, and finally, in Sect. 6, we provide conclusions and directions for future work.

2 Document Representation in Vector Space

In order to apply any clustering algorithm, the raw collection of N text documents must first be preprocessed and represented in a suitable feature space. A standard approach is to eliminate trivial words (e.g. stopwords) and words that appear in a small number of documents. Then, stemming [26] is applied, which aims to replace each word by its corresponding word stem. The V derived word stems constitute the collection's term vocabulary, denoted as V = {ν_1, ..., ν_V}. Thus, a text document, which is a finite term sequence of T vocabulary terms, is denoted as d^seq = ⟨d^seq(1), ..., d^seq(T)⟩, with d^seq(i) ∈ V. For example, the phrase "The dog ate a cat and a mouse!" is a sequence d^seq = ⟨dog, ate, cat, mouse⟩.

2.1 The Bag of Words Model

According to the typical VSM approach, the Bag of Words (BOW) model, a document is represented by a vector d ∈ R^V, where each word term ν_i of the vocabulary is associated with a single vector dimension. The most popular weighting scheme

is the normalized tf·idf, which introduces the inverse document frequency as an external weight to emphasize the terms that have discrimination power and appear in a small number of documents. For the vocabulary term ν_i, it is computed as idf_i = log(N/df_i), where N denotes the total number of documents and df_i denotes the document frequency, i.e. the number of documents that contain term ν_i. Thus, the normalized tf·idf BOW vector is a mapping of the term sequence d^seq defined as follows

\Phi_{bow}: d^{seq} \mapsto d = h\, (tf_1\, idf_1, \ldots, tf_V\, idf_V) \in \mathbb{R}^V,  (1)

where normalization is performed with respect to the Euclidean norm using the coefficient h. The document collection can then be represented using the N document vectors as rows in the Document-Term matrix D, which is an N × V matrix whose rows and columns are indexed by the documents and the vocabulary terms, respectively. In the VSM there are several alternatives to quantify the semantic similarity between document pairs. Among them, Cosine similarity has been shown to be an effective measure [11], and for a pair of document vectors d_i and d_j it is given by

sim_{cos}(d_i, d_j) = \frac{d_i^\top d_j}{\|d_i\|_2\, \|d_j\|_2} \in [0, 1].  (2)

Unit similarity value implies that the two documents are described by identical distributions of term frequencies. Note that this is equal to the dot product d_i^\top d_j if the document vectors are normalized in the unit positive V-dimensional hypersphere.

2.2 Extensions to VSM

The BOW model, despite having a series of advantages, such as generality and simplicity, cannot model efficiently the rich semantic content of text. The Bag Of Phrases model uses phrases of two or three consecutive words as features. Its disadvantage is the fact that, as phrases become longer, they obtain superior semantic value, but at the same time they become statistically inferior with respect to single-word representations [19]. A category of methods developed to tackle this difficulty recognizes the frequent wordsets (unordered itemsets) in a document collection [3, 1], while the method proposed in [2] exploits the frequent word subsequences (ordered) that are stored in a Generalized Suffix Tree (GST) for each document. Modern variations of VSM are used to tackle the difficulties occurring due to HDS spaces by projecting the document vectors onto a new feature space called concept space. Each concept is represented as a concept vector of relations between the concept and the vocabulary terms. Generally, this approach of document mapping can be expressed as

\Phi_{VSM}: d \mapsto d' = S d \in \mathbb{R}^{V'}, \quad V' \leq V,  (3)

where the V' × V matrix S stores the concept vectors as rows. This projection matrix is also known as a semantic matrix. The Cosine similarity between two normalized document images in the concept space can be computed as a dot product

sim^{(cos)}_{sem}(d_i, d_j) = (S d_i)^\top (S d_j) = (h^S_i\, S d_i)^\top (h^S_j\, S d_j) = h^S_i h^S_j\, (d_i^\top S^\top S d_j),  (4)

where the scalar normalization coefficient for each document is h^S_i = 1/‖S d_i‖_2. The similarity defined in Eq. 4 can be interpreted in two ways: i) as a dot product of the document images (S d_i)^\top (S d_j), which both belong to the new space R^{V'}, and ii) as a composite measure that takes into account the pairwise correlations between the original features, expressed by the matrix S^\top S. There is a variety of methods proposing alternative ways to define the semantic matrix, though many of them are based on the above linear mapping. The widely used Latent Semantic Indexing (LSI) [7] projects the document vectors onto a space spanned by the eigenvectors corresponding to the V' largest eigenvalues of the matrix D^\top D. The eigenvectors are extracted by means of Singular Value Decomposition (SVD) on matrix D and they capture the latent semantic information of the feature space. In this case, each eigenvector is a different concept vector and V' is a user parameter much smaller than V, while there is also a considerable computational cost to perform the SVD. In Concept Indexing [16], the concept vectors are the centroids of a V'-partition obtained by applying document clustering. In [9], statistical information such as the covariance matrix is combined with traditional mappings into a latent space (LSI, PCA) to compose a hybrid vector mapping. A computationally simpler alternative that utilizes the Document-Term matrix D as a semantic matrix is the Generalized Vector Space Model (GVSM) [31], i.e. S_gvsm = D and the image of a document is given by d' = D d. By examining the product D d ∈ R^{N×1}, we can conclude that a GVSM-projected document vector d' has lower dimensionality if N < V. Moreover, if both d and D are properly normalized, then the image vector d' consists of the N Cosine similarities between the document vector d and the rest of the N−1 documents in the collection. This observation implies that the GVSM works in the document similarity space by considering each document as a different concept. On the other hand, the respective product S_gvsm^\top S_gvsm = D^\top D (used in Eq. 4) is a V × V Term Similarity Matrix whose r-th row contains the dot-product similarities between term ν_r and the rest of the V−1 vocabulary terms. Note that terms become more similar as their corresponding normalized frequency distributions over the N documents are more alike. Based on the GVSM model, it is proposed in [1] to build local semantic matrices for each cluster during document clustering. A rather different approach, proposed in [5] for information retrieval, is the Context Vector Model (CVM-VSM) where, instead of a few concise concept vectors, the context in which each of the V vocabulary terms appears in the dataset is computed, called the term context vector (tcv). This model computes a V × V matrix S_cvm containing the term context vectors as rows. Each tcv_i vector aims to capture the V pairwise similarities of term ν_i to the rest of the vocabulary terms. Such similarity is computed using a co-occurrence frequency measure. Each matrix element [S_cvm]_{ij} stores the similarity between terms ν_i and ν_j, computed as

[S_{cvm}]_{ij} = \begin{cases} 1, & i = j \\ \dfrac{\sum_{r=1}^{N} tf_{ri}\, tf_{rj}}{\sum_{r=1}^{N} \left( tf_{ri} \sum_{q=1, q \neq i}^{V} tf_{rq} \right)}, & i \neq j. \end{cases}  (5)

Note that this measure is not symmetric, generally [S_cvm]_{ij} ≠ [S_cvm]_{ji}, due to the denominator that normalizes the pairwise similarity to [0, 1] with respect to the total amount of similarity between term ν_i and the other vocabulary terms. The

rows of matrix S_cvm can be normalized with respect to the Euclidean norm, and each document image is then computed as the centroid of the normalized context vectors of all terms appearing in that document

\Phi_{cvm}: d \mapsto d' = \sum_{i=1}^{V} tf_i \cdot tcv_i,  (6)

where tf_i is the frequency of term ν_i. The motivation for using term context vectors is to capture the semantic content of a document based on the co-occurrence frequency of terms in the same document, averaged over the whole corpus. The CVM-VSM representation is less sparse than BOW. Moreover, weights such as idf can be incorporated into the transformed document vectors computed using Eq. 6. In [5], several more complicated weighting alternatives have been tested in the context of information retrieval that, in our text document clustering experiments, did not perform better than the standard idf weights. At a higher semantic level than term co-occurrences, additional information for vocabulary terms provided by ontologies has also been exploited to compute the term similarities and to construct a proper semantic matrix. WordNet [22] and Wikipedia [3] have been used for this purpose in [6, 15, 29], respectively.

2.3 Discussion

Summarizing the properties of the above mentioned vector-based document representations: in the traditional BOW approach, the dimensions of the term feature space are considered to be independent of each other. Such an assumption is very simplistic, since there exist semantic relations among terms that are ignored. The VSM-extensions aim to achieve semantic smoothing, a process that redistributes the term weights of a vector model, or maps the data into a new feature space, by taking into account the correlations between terms. For instance, if the term "child" appears in a document, then it could be assumed that the term "kid" is also related to the specific document, or even terms like "boy", "girl", "toy". The resulting representation model is also a VSM, but the document vectors become less sparse and the independence of features is mitigated in an indirect way. The smoothing is usually achieved by a linear mapping of the data vectors to a new feature space using a semantic matrix S. It is convenient to think that the new document vector d' = S d contains the dot-product similarities between the original BOW vector d and the rows of the semantic matrix S. A basic difference between the various semantic smoothing methods is related to the dimension of the new feature space, which is determined by the number V' of row vectors of matrix S. In case their number is less than the size V of the vocabulary, such vectors are called concept vectors and are usually produced using the LSI method. Each concept vector has a distribution of weights associated with the V original terms that defines their contribution to the corresponding concept. Of course, the resulting representation of the smoothed vector d' is less interpretable than the original, and there is always the problem of determining the proper number of concept vectors. An alternative approach for semantic smoothing assumes that each row vector of matrix S is associated with one vocabulary term. Unlike a concept vector that

describes abstract semantics of a higher level, here the elements of each vector describe the relation of this term to the other terms. These relations constitute the so-called term context, thus the respective vector is called a term context vector. Each element of the mapped vector d' will contain the dot-product similarity between document d and the corresponding term context vector, i.e. for each term ν_i the element d'_i provides the degree to which the original document d contains the term ν_i and its context, instead of just computing its frequency as happens in the BOW representation. Note also that in the BOW representation, a dot product would give zero similarity for two documents that do not have common terms. On the contrary, the dot product between a document vector and the term context vector of a term ν_i that does not appear in that document may give a non-zero similarity. This happens if the document contains at least one term ν_j with nonzero weight in the context of term ν_i. For this reason, the smoothed representation d' is usually less sparse than d and retains the interpretability of its dimensions. Moreover, concept-based methods may be applied on the new representations. The motivation of our work is to establish the importance of term context vectors and to define an efficient way to compute them. The CVM-VSM method computes the term context based on term co-occurrence frequency at the document level. It does not take into account the sequential nature of text and thus ignores the local distance of terms when computing term context. On the other hand, the GTCV-VSM proposed in this work extends the previous approach by considering term context at three levels: i) It uses the notion of the local term context vector (ltcv) to model the context around the location in the text sequence where a term appears. These vectors are computed using a local smoothing kernel as suggested in the LoWBOW approach [18], which is described in the next section. The kernel takes into account the distance at which other terms appear around the sequence location under consideration. ii) It computes the document term context vector (dtcv) for each term, which summarizes the term context at the document level, and iii) it computes the final global term context vector (gtcv) for each term, representing the overall term context at the corpus level. The gtcv vectors constitute the rows of the semantic matrix S. Thus, the intuition behind the GTCV-VSM approach is to capture the local term context from term sequences and then to construct a representation for the global term context by averaging ltcvs at the document and corpus levels.

3 Utilizing Local Contextual Information

A text document can be considered as a finite term sequence of its T consecutive terms, denoted as d^seq = ⟨d^seq(1), ..., d^seq(T)⟩, but, except for Bag of Phrases, the previously mentioned VSM-extensions ignore this property. A category of methods has been proposed aiming to capture local information directly from the term sequence of a document. The representation proposed in [27] first considers a segmentation of the sequence that is done by dragging a window of n terms along the sequence and computing a local BOW vector for each of the overlapping segments. All these local BOW vectors constitute the document representation, called Local Word Bag (LWB). To compute the similarity between a pair of documents, the authors introduce a variant of the VG-Pyramid

Matching Kernel [12] that maps the two sets of local BOW vectors to a multiresolution histogram and computes a weighted histogram intersection. Another approach for text representation, presented in [18], is the Locally Weighted Bag of Words (LoWBOW), which preserves local contextual information of text documents by effectively modeling the sequential structure of text. At first, a number L of equally distant locations are defined in the term sequence. Each sequence location l_i, i = 1, ..., L, is then associated with a local histogram, which is a point in the multinomial simplex P^{V−1}, where V is the number of vocabulary terms. More specifically, P^{V−1} is the (V−1)-dimensional subset of R^V that contains all probability vectors (histograms) over V objects (for a discussion on the multinomial simplex see the Appendix of [18])

P^{V-1} = \left\{ H \in \mathbb{R}^V : H_i \geq 0,\; i = 1, \ldots, V \;\text{ and }\; \sum_{i=1}^{V} H_i = 1 \right\}.  (7)

Contrary to LWB, in LoWBOW the local histogram is computed using a smoothing kernel to weight the contribution of terms appearing around the referenced location in the term sequence, and to assign more importance to closely neighboring terms. Denoting as H_{δ(d^seq(t))} the trivial term histogram of V terms whose probability mass is concentrated only at the term that occurs at location t in d^seq,

\left[ H_{\delta(d^{seq}(t))} \right]_i = \begin{cases} 1, & \nu_i = d^{seq}(t) \\ 0, & \nu_i \neq d^{seq}(t) \end{cases}, \quad i = 1, \ldots, V,  (8)

then the locally smoothed histogram at a location l in the term sequence d^seq is computed as in [18]

lowbow(d^{seq}, l) = \sum_{t=1}^{T} H_{\delta(d^{seq}(t))}\, K_{l,\sigma}(t),  (9)

where T is the length of d^seq. K_{l,σ}(t) denotes the weight for location t in the sequence, given by a discrete Gaussian weighting kernel function of mean value l and standard deviation σ. Specifically, the weighting function is a Gaussian probability density function restricted to [1, T] and renormalized so that \sum_{t=1}^{T} K_{l,\sigma}(t) = 1. It is easy to verify that the result of the histogram smoothing of Eq. 9 is also a histogram. It must be noted that for σ=0 the lowbow histogram (Eq. 9) coincides with the trivial histogram H_{δ(d^seq(l))}, where all the probability mass is concentrated at the term at location l. As σ grows, part of the probability mass is transferred to the terms occurring near location l. In this way, the lowbow histogram at location l is enriched with information about the terms occurring in the neighborhood of l. The smoothing parameter σ adjusts the locality of term semantics that is taken into account by the model. Thus, instead of mining unordered local vectors as in [27], the LoWBOW approach embeds the term sequence of a document in the P^{V−1} simplex. The sequence of the L locally smoothed histograms (denoted as lowbow histograms) forms a curve in the (V−1)-dimensional simplex (denoted as the LoWBOW curve). Fig. 1 illustrates the LoWBOW curves generated for a toy example and describes the role of parameter σ. In this figure we aim to illustrate i) the LoWBOW curve representation, i.e. the curve that corresponds to a sequence

of histograms (local context vectors), where each local context vector is computed at a specific location of the sequence and corresponds to a point in the (V−1)-dimensional simplex; ii) the impact of the smoothing coefficient σ on the computed local context vectors. The figure illustrates that the increase of smoothing makes the lowbow histograms (the points of the curve) more similar. This can also be verified by observing that, as smoothing increases, the curve becomes more concentrated around a central location of the simplex. For σ=∞ all histograms become similar to the BOW representation and the curve reduces to a single point. On the contrary, for σ=0 the histograms correspond to simplex corners.

[Fig. 1 shows LoWBOW curves in the 2-dimensional simplex with corners ν_1, ν_2, ν_3 for (a) σ=1.2, (b) σ=2, (c) σ=3, and (d) the dtcv histograms together with the BOW point (σ=∞).]

Fig. 1 A toy example considering the sequence ⟨ν_1, ν_2, ν_2, ν_2, ν_1, ν_3, ν_3, ν_1, ν_1, ν_1, ν_2, ν_2, ν_3⟩, which uses three different terms ν_1, ν_2, ν_3 (vocabulary size: V=3). The subfigures present LoWBOW curves in the (V−1)-dimensional simplex for increasing values of the parameter σ, which induce more smoothing to the curve. Each point of the curve corresponds to a local histogram computed at a sequence location. The more a term affects the local context at a location in the sequence, the more the curve point (the lowbow histogram related to that location) moves towards the respective corner of the simplex. For σ=0 the local histograms correspond to simplex corners, thus the curve moves from corner to corner of the simplex. Two different sampling rates for the LoWBOW representation are illustrated: sampling at every term location in the sequence (dashed line), which is our strategy to collect contextual information for each term, and sampling every two terms (solid line). (d) For σ=∞, the LoWBOW curve reduces to a single point that coincides with the BOW histogram of the sequence. In (d) we also present as stars the average ltcv histograms for each term (dtcv histograms) for the three different values of σ and α=0 for all terms. As the value of σ increases, the dtcv histograms of all terms become more similar, tending to coincide with the BOW representation.

A similarity measure between LoWBOW curves has been proposed in [18] that assumes a sequential correspondence between two documents and computes the sum of the similarities between the L pairs of LoWBOW histograms. Obviously, it is expected for this similarity measure to underestimate the thematic similarity between documents that follow a different order in the presentation of similar semantic content.

4 A Semantic Matrix based on Global Term Context Vectors

In this section we present the Global Term Context Vector-VSM (GTCV-VSM) approach for capturing the semantics of the original term feature space of a document collection. The method computes the contextual information of each vocabulary term, which is subsequently utilized in order to create a semantic matrix. In analogy with CVM-VSM, our approach reduces data sparsity but not dimensionality. The interpretability of the derived vector dimensions remains as strong as in the BOW model, as the value of each dimension of the mapped vector corresponds to one vocabulary term. Methods that reduce data dimensionality could also be applied on the new representations at a subsequent phase. Compared to CVM-VSM, GTCV-VSM generalizes the way the term context is computed by taking into account the distance between terms in the term sequence of each document. This is achieved by exploiting the idea of LoWBOW to describe the local contextual information at a certain location in a term sequence. It must be noted that our method borrows from the LoWBOW approach only the way the local histogram is computed at each location of the term sequence and does not make use of the LoWBOW curve representation. More specifically, we define the local term context vector (ltcv) as a histogram associated with the exact occurrence of term d^seq(l) at location l in a sequence d^seq. Hence, one ltcv vector is computed at every location in the term sequence, i.e. l = 1, ..., T. Note that GTCV-VSM does not preserve any curve representation. This means that we are not interested in the temporal order of the local term context vectors. The ltcv(d^seq, l) is a modified lowbow(d^seq, l) probability vector that represents contextual information around location l, while explicitly adjusting the self-weight α_{d^seq(l)} of the reference term appearing at location l

\left[ ltcv(d^{seq}, l) \right]_i = \begin{cases} \alpha_{d^{seq}(l)}, & \nu_i = d^{seq}(l) \\ (1 - \alpha_{d^{seq}(l)})\, \dfrac{idf_i\, [lowbow(d^{seq}, l)]_i}{\sum_{j=1, j \neq i}^{V} idf_j\, [lowbow(d^{seq}, l)]_j}, & \nu_i \neq d^{seq}(l). \end{cases}  (10)

The self-weight (0 ≤ α_{d^seq(l)} ≤ 1) adjusts the relative importance between the contextual information (computed using the lowbow histogram) and the self-representation of each term. Fig. 2 illustrates an example of how the value of parameter α affects the local term weighting around a reference term in a sequence. When the parameter σ of the Gaussian smoothing kernel is set to zero, or α=1, the ltcv(d^seq, l) reduces to the trivial histogram H_{δ(d^seq(l))} (see Eq. 8). The other extreme is the infinite σ value, where for small α values all the ltcv computed in a document d become similar to the tf histogram for that document.
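To make Eqs. 8-10 concrete, the following is a minimal NumPy sketch of the local histogram and the ltcv computation as reconstructed above. It is not the authors' code; the function and variable names (lowbow, ltcv, term_ids, alpha) are ours, and it assumes 0-based integer term ids, a per-term idf array, and per-term self-weights α (e.g. from Eq. 11 below).

```python
import numpy as np

def lowbow(term_ids, V, l, sigma):
    """Locally smoothed histogram at location l of a term sequence (Eq. 9)."""
    T = len(term_ids)
    t = np.arange(T)
    if sigma == 0:
        K = (t == l).astype(float)                 # delta kernel: trivial histogram (Eq. 8)
    else:
        K = np.exp(-0.5 * ((t - l) / sigma) ** 2)  # discrete Gaussian restricted to [0, T-1]
        K /= K.sum()                               # renormalize the kernel weights to sum to 1
    H = np.zeros(V)
    np.add.at(H, np.asarray(term_ids), K)          # accumulate kernel mass per vocabulary term
    return H

def ltcv(term_ids, V, l, sigma, idf, alpha):
    """Local term context vector at location l (Eq. 10)."""
    ref = term_ids[l]                              # reference term at location l
    a = alpha[ref]                                 # its self-weight
    w = idf * lowbow(term_ids, V, l, sigma)        # idf-weighted local histogram
    denom = np.maximum(w.sum() - w, 1e-12)         # per-entry sum over j != i (Eq. 10 denominator)
    vec = (1.0 - a) * w / denom                    # contextual part
    vec[ref] = a                                   # the reference term keeps its self-weight
    return vec
```

The per-entry denominator follows the reconstruction of Eq. 10 literally; a simpler variant would normalize once by the total idf-weighted mass excluding the reference term.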

[Fig. 2 shows three weight distributions (y-axis: local term weighting, x-axis: position in the term sequence) around a reference term in the middle of the sequence.]

Fig. 2 Various weight distributions for the neighboring terms around a reference term occurring in the middle of a term sequence of length 5. The distributions are obtained by varying the value of parameter α in Eq. 10. This distribution defines the contribution of each term to the context of the specific reference term. The scale value of the local kernel is set to σ=5, while the self-weight α is set to 0.5 (left), 0.1 (middle), and 0 (right).

The latter behavior for large σ is the reason for considering an explicit self-weight in Eq. 10: a flat smoothing kernel obtained for a large σ value can make a lowbow vector have an improperly low self-weight for the reference term. For example, if a term appears once in a document, then the lowbow vector with σ=∞ at that location would contain a very low weight for that term. Generally, the value of α_ν determines how much the context vector of term ν should be dominated by the self-weight of term ν. In our method we set this parameter independently for each individual term as a function of its idf_ν component

\alpha_\nu = \lambda + (1 - \lambda) \left( 1 - \frac{idf_\nu}{\log N} \right), \quad \lambda \in [0, 1],  (11)

where λ is a lower bound for all α_ν, ν = 1, ..., V (in our experiments we used λ=0). The rationale for the above equation is that for terms with high document frequency (i.e. low idf_ν), we assign high α_ν values that suppress the local context in the respective context vectors. In other words, the context is considered more important for terms that occur in fewer documents. In Fig. 3a, we present an example illustrating the ltcv vectors of the two term sequences presented in Fig. 3c. We further define the document term context vector (dtcv) as a probability vector that summarizes the context of a specific term at the document level by averaging the ltcv histograms corresponding to the occurrences of this term in the document. More specifically, suppose that a term ν appears no_{i,ν} > 0 times in the term sequence d^seq_i (i.e. in the i-th document), which is of length T_i. Then the dtcv of this term ν for document i is computed as:

dtcv(d^{seq}_i, \nu) = \frac{1}{no_{i,\nu}} \sum_{j=1}^{no_{i,\nu}} ltcv(d^{seq}_i, l_{i,\nu}(j)),  (12)

where l_{i,ν}(j) is an integer value in [1, ..., T_i] denoting the location of the j-th occurrence of ν in d^seq_i.
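Continuing the earlier sketch (same hypothetical helpers and assumptions), Eq. 12 amounts to grouping the occurrence locations of each term in a document and averaging the corresponding ltcv vectors:

```python
def dtcv_all(term_ids, V, sigma, idf, alpha):
    """Document-level term context vectors (Eq. 12) for every term occurring in
    the document, returned as a dict {vocabulary index: averaged ltcv vector}."""
    locations = {}
    for l, v in enumerate(term_ids):            # group locations by the term occurring there
        locations.setdefault(v, []).append(l)
    return {
        v: np.mean([ltcv(term_ids, V, l, sigma, idf, alpha) for l in locs], axis=0)
        for v, locs in locations.items()
    }
```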

[Fig. 3 consists of: (a) the local term context histograms (grey-scaled columns) along the term sequences of two documents A and B; (b) the transposed semantic matrix S^T, whose grey-scaled columns are the averaged (global) term context histograms; (c) the two term sequences. The vocabulary stems are advanc (v1), electron (v2), commun (v3), help (v4), conduct (v5), busi (v6), interoper (v7), problem (v8), applic (v9), profession (v10), product (v11), commerc (v12), secur (v13).]

Fig. 3 An example of how ltcv histograms are used to summarize the overall context in which a term appears in the two term sequences of (c) using Eq. 14. a) The term sequences (x-axis) of documents A, B are presented and the corresponding local term context vectors are illustrated as grey-scaled columns. Those vectors are computed at every location in the sequence using a Gaussian smoothing kernel with σ=1 and α=0 for all terms. Brighter intensity at cell i, j indicates a higher contribution of the term ν_i to the local context of the term appearing at location j in the sequence. b) The resulting transposed semantic matrix (S^T), where the grey-scaled columns illustrate the global contextual information for each vocabulary term computed by averaging the respective local context histograms (Eq. 13). c) The two initial term sequences (the stem of each non-trivial term is emphasized). Assuming the same idf weight for each vocabulary term, the table presents the BOW vector, the transformed vector d' using Eq. 14, as well as the effect of semantic smoothing (diff = BOW − d') on the document vectors. The redistribution of term weights that results from the proposed mapping is done in such a way that low-frequency terms gain weight against the more frequent ones. Note also that the similarity between the two documents is 0.756 for the BOW model and 0.96 for the GTCV-VSM.

Next, the global term context vector (gtcv) is defined for a vocabulary term ν so as to represent the overall contextual information for all appearances of ν in the corpus of all N term sequences (documents):

gtcv(\nu) = h_{gtcv(\nu)} \left( \sum_{i=1}^{N} tf_{i,\nu}\, dtcv(d^{seq}_i, \nu) \right).  (13)

The coefficient h_{gtcv(ν)} normalizes the vector gtcv(ν) with respect to the Euclidean norm, and tf_{i,ν} is the frequency of the term ν in the i-th document. Thus, the gtcv(ν) of term ν is computed using a weighted average of the document context vectors dtcv(d^seq_i, ν) obtained for each document i in which term ν appears. Thus, in contrast to the LoWBOW curve approach, which focuses on the sequence of local histograms that describe the writing structure of a document, our method focuses on the extraction of the global semantic context of a term by averaging the local contextual information at all the corpus locations where this term appears. Finally, the extracted global contextual information is used to construct the V × V semantic matrix S_gtcv, where each row ν is the gtcv(ν) vector of the corresponding vocabulary term ν. Fig. 1d provides an example illustrating the dtcv(d^seq_i, ν) vectors for each document (the points denoted as stars). Fig. 3b illustrates the final gtcv vectors obtained by averaging the document-level contexts for each vocabulary term. To map a document using the proposed Global Term Context Vector-VSM approach, we compute the vector d' where each element ν is the Cosine similarity between the BOW representation d of the document and the global term context vector gtcv(ν):

\Phi_{gtcv}: d \mapsto d' = S_{gtcv}\, d, \quad d' \in \mathbb{R}^V.  (14)

Note that the transformed document vector d' is V-dimensional and retains interpretability, since each dimension still corresponds to a unique vocabulary term. Moreover, if σ=0 and α>0, then S_gtcv d = d. Looking at Eq. 4, the product S_gtcv^T S_gtcv essentially computes a Term Similarity Matrix where the similarity between two terms is based on the distribution of term weights in their respective global term context vectors, i.e., on the similarity of their global context histograms. The table of Fig. 3c illustrates the effect of the redistribution (compared to BOW) of the term weights (semantic smoothing) in the transformed document vectors achieved by the proposed mapping. The procedure of representing the input documents using GTCV-VSM takes place in the preprocessing phase. Let T_i be the length of the i-th document and V_i its vocabulary. Let also V be the size of the whole corpus vocabulary. Then the cost to compute one ltcv vector at a location of the term sequence using Eq. 10, and to add its V_i non-zero dimensions to the respective dtcv, is O(T_i + V_i). This is done T_i times, and the final dtcv of each different term of the document is added to the respective gtcv rows. Thus, denoting the average document length and the average document vocabulary size in the corpus as \bar{T}_i and \bar{V}_i, the cost of constructing the semantic matrix can be expressed as O(N \bar{T}_i (\bar{T}_i + 2 \bar{V}_i)). However, since \bar{V}_i ≤ \bar{T}_i ≪ V, the overall computational cost of GTCV-VSM is determined by the O(N V^2) cost of the matrix multiplication of the mapping of Eq. 14.
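Putting the pieces together, here is a hedged sketch of Eqs. 13-14 that reuses the hypothetical dtcv_all helper above: it accumulates the tf-weighted document-level context vectors into the rows of S_gtcv, normalizes each row in the Euclidean norm, and then maps a matrix of BOW rows onto the smoothed space so that dot products of the normalized images give the similarities of Eq. 4.

```python
def gtcv_semantic_matrix(docs, V, sigma, idf, alpha):
    """Build the V x V semantic matrix S_gtcv (Eq. 13); docs is a list of
    term-id sequences, one per document."""
    S = np.zeros((V, V))
    for term_ids in docs:
        tf = np.bincount(np.asarray(term_ids), minlength=V)   # term frequencies tf_{i,v}
        for v, vec in dtcv_all(term_ids, V, sigma, idf, alpha).items():
            S[v] += tf[v] * vec                               # tf-weighted sum of dtcv vectors
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    return S / np.maximum(norms, 1e-12)                       # Euclidean-normalize each gtcv row

def map_documents(D_bow, S_gtcv):
    """Map BOW document vectors onto the smoothed space (Eq. 14) and renormalize,
    so that row dot products give Cosine similarities."""
    D_new = D_bow @ S_gtcv.T                                  # d' = S_gtcv d for every row d
    norms = np.linalg.norm(D_new, axis=1, keepdims=True)
    return D_new / np.maximum(norms, 1e-12)
```

A clustering algorithm such as spherical k-means can then be run directly on the rows returned by map_documents.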

14 Argyris Kalogeratos, Aristidis Likas Table 1 Characteristics of text document collections. N denotes the number of documents, V is the size of the global vocabulary and V i the average document vocabulary, Balance is the ratio of the smallest to the largest class and T i is the average length of the term sequences of documents. Name Topics Classes N Balance V V i T i D 1 2 NGs: graphics, windows.x, motor, baseball, 6 2 2/4 4343 48.8 11 space, mideast D 2 2 NGs: atheism, autos, baseball, electronics, 7 35 5/5 6442 52.6 18 med, mac, motor, politics.misc D 3 2 NGs: atheism, christian, guns, mideast 4 16 4/4 48 62 131 D 4 2 NGs: forsale, autos, baseball, motor, hockey 5 125 25/25 4762 44.1 14 D 5 Reuters 21578: acq, corn, crude, earn, grain, 1 9979 237/3964 5613 39.1 76 interest, money-fx, ship, trade, wheat 5 Clustering Experiments Our experimental setup was based on five different datasets: D 1 -D 4 are subsets of the 2-Newsgroups 1, while D 5 is the Mod Apte split [2] version of the Reuters- 21578 2 benchmark document collection where the 1 classes with larger number of training examples are kept. The characteristics of these datasets are presented in Table 1. The preprocessing of datasets included the removal of all tags, headers and metadata from the documents, while applied word stemming and discarded terms appearing in less than five documents. It is worth mentioning how we preprocessed the term sequences of documents. We considered a dummy term that replaced in the sequences all the low-frequency terms that were discarded so as to maintain the relative distance between the terms that remained in each sequence. For similar reasons, two dummy terms were considered at the end of every sentence denoted by characters as (e.g..,?,! ). The dummy term is ignored when constructing the final data vectors. For each dataset, we have considered several data mappings Φ and after each mapping the spherical k-means (spk-means) [8] and spectral clustering (spectral-c) [24] algorithms were applied to cluster the mapped documents vectors into the k predefined number of clusters corresponding to the different topics (classes) in a collection. In contrast to k-means that is based on the Euclidean distance [21], spk-means uses the Cosine similarity and maximizes the Cohesion of the clusters C={c 1,..., c k } k Cohesion(C) = u j d i, (15) j=1 d i c j where u j is the normalized centroid of cluster c j with respect to the Euclidean norm. Spectral clustering projects the document vectors in a subspace which is spanned by the k largest eigenvectors of the Laplacian matrix L computed from the similarity matrix A (N N) of pairwise Cosine similarities between documents. More specifically, the Laplacian matrix is computed as L=D 1/2 AD 1/2, where D is a 1 http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-2/www/data/news2.tar.gz. 2 http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz

Text Document Clustering Using Global Term Context Vectors 15 Table 2 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spk-means algorithm. D 1 D 2 D 3 D 4 D 5 Method σ avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% BOW.722.821.594.748.829.638.537.548.379.625.779.55.552.562.535 GTCV 1.749.854.61.767.845.638.544.564.372.667.793.515.57.578.561 2.756.871.631.765.852.657.563.574.396.67.832.539.572.58.561 5.773.881.687.777.864.662.577.62.4.688.851.539.589.633.578 1.777.886.685.781.873.672.59.621.424.684.849.54.59.63.58 3.761.879.659.776.863.653.579.59.369.683.842.518.576.612.568 inf.76.862.631.772.862.639.574.586.366.681.84.521.576.61.566 GVSM.752.832.611.747.822.637.556.576.419.67.827.547.575.58.573 CVM.75.841.612.754.851.659.547.64.4.672.824.541.578.581.575 diagonal matrix. Each diagonal element contains the sum of the i-th row of similarities D ii = N j=1 A ij. The next step is the construction of a matrix X (N k) = {x i : i=1,..., k} whose columns correspond to the k largest eigenvectors of L. The standard k-means algorithm is then used to cluster the rows of matrix X after being normalized to unit length in Euclidean space, where the i-th row is the vector representation of the i-th document in the new feature space. Clustering evaluation was based on the supervised measure Normalized Mutual Information (NMI ) and the F 1 -measure. We denote as n gt i the number of documents of class i, n j the size of cluster j, n ij the number of documents belonging to class i that are clustered in cluster j, C gt the grouping based on ground truth labels of documents c gt 1,..., cgt k (true classes). Let us further denote p(cgt i )=ngt i /N and p(c j )=n j /N the probability of selecting arbitrarily a document from the dataset and that belongs to class c gt i and cluster c j, respectively, and p(c gt i, c j)=n ij /N the joint of arbitrarily selecting a document from dataset and that belongs to cluster c j and is of class c gt i. Then, the [,1] Normalized MI measure is computed by dividing the Mutual Information by the maximum between the cluster and class entropy: NMI(C gt, C) = c gt i Cgt c j C p(c gt i, c j) log p(cgt i, c j) p(c gt i )p(c j) / max{h(c gt ), H(C)}. (16) When C and C gt are independent, the value of NMI equals to zero, while it equals to one if these partitions contain identical clusters. The F 1 -measure is the harmonic mean of the precision and recall measures of the clustering solution: F 1 = precision recall precision + recall. (17) Higher values of F 1 in [,1] indicate better clustering solutions. Tables 2, 3, 5, and 6 present the results from the experiments conducted for each collection. Specifically, we compared the classic BOW representation, the GVSM, the proposed GTCV-VSM method (with λ= in Eq. 11), that represents the documents as described in Eq. 14 and the CVM-VSM as proposed in [5], where document vectors are computed based on Eq. 6 with idf weights. More specifically,

16 Argyris Kalogeratos, Aristidis Likas Table 3 F 1-measure values of the spk-means clustering solution for the different representation methods. D 1 D 2 D 3 D 4 D 5 Method σ avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% BOW.779.92.685.78.91.645.73.76.57.735.918.558.675.697.646 GTCV 1.86.94.688.79.921.65.79.713.576.755.92.561.691.695.677 2.814.946.688.792.924.674.721.728.58.764.938.598.698.714.672 5.828.953.722.817.929.665.736.737.597.773.948.611.712.751.681 1.832.954.733.82.936.63.737.739.63.773.947.581.712.749.681 3.814.95.747.794.929.657.725.727.576.766.944.579.698.746.666 inf.813.942.689.792.926.651.722.728.576.765.944.581.698.744.666 GVSM.79.923.75.783.93.64.76.71.576.75.943.591.687.72.672 CVM.765.941.672.79.93.672.78.725.576.751.934.64.685.716.669 Table 4 The p and t values of the statistical significance t-test of the difference in k-means performance using GTCV-VSM (σ=1) and the compared representation methods, with respect to the two evaluation measures. Values of p smaller than the significance level of.5 (5%) indicate significant superiority of GTCV-VSM. GTCV D 1 D 2 D 3 D 4 D 5 (σ=1) vs p-val t-val p-val t-val p-val t-val p-val t-val p-val t-val BOW NMI.11 1 6 5.98.75 1 3 4.5.25 1 6 5.81.8 1 8 6.45. 12.8 GVSM NMI.8 2.68.81 1 3 4.2.5 1 3 4.15.85 1.73.56 1 5 5.17 CVM NMI.51 2.83.1 3.33.52 1 4 4.65.1659 1.39.77 1 3 4.4 BOW F1.2 1 5 5.39.5 1 2 3.54.46 1 2 3.56.1 3.32. 12.8 GVSM F1.37 1 3 4.22.21 3.11.67 1 2 3.45.329 2.15. 9.6 CVM F1.81 1 3 4.2.6 1 8 6.5.27 3.4.314 2.18. 9.31 Table 5 NMI values of the clustering solution for VSM (BOW), GVSM, CVM-VSM and the proposed GTCV-VSM (for several values of σ) document representations using the spectral clustering algorithm. D 1 D 2 D 3 D 4 D 5 Method σ avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% BOW.753.761.75.781.788.737.569.585.555.718.78.631.558.559.56 GTCV 1.77.774.769.79.795.75.614.626.6.735.779.642.56.561.516 2.781.785.76.79.794.757.625.632.61.752.789.649.562.564.523 5.794.84.79.833.853.763.639.64.619.768.827.669.579.6.557 1.87.814.81.833.853.761.645.648.62.758.819.661.581.589.558 3.791.796.769.87.832.743.613.613.69.755.797.647.567.582.535 inf.774.782.767.794.794.722.619.619.61.749.793.637.56.568.53 GVSM.756.77.72.794.83.747.593.595.586.722.78.637.548.554.513 CVM.761.768.751.81.823.76.65.66.59.728.794.642.557.566.519 for each collection, each representation method was tested for 1 runs of spkmeans (Tables 2, 3) and spectral-c (Tables 5, 6). To provide fair comparative results, for each document collection all methods were initialized using the same random document seeds. The average of all runs (avg), the average of the worst 1% of the clustering solutions (avg 1% ), and the best values are reported for each performance measure. The worst 1% concerns the 1% of the solutions with the lowest Cohesion, while the best clustering solution is that having the maximum Cohesion in the 1 runs (for spectral-c the sum of squared distances is considered for this purpose). Moreover, in Fig. 4 we present the average clustering performance of spk-means with respect to the value of λ parameter of Eq. 11 where, although

Text Document Clustering Using Global Term Context Vectors 17 Table 6 F 1-measure values of the spectral clustering solution for the different representation methods. D 1 D 2 D 3 D 4 D 5 Method σ avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% avg best avg 1% BOW.81.811.78.819.822.767.71.723.71.88.911.697.666.669.654 GTCV 1.811.819.89.822.832.772.729.741.728.834.915.722.694.73.663 2.818.823.86.837.841.779.733.746.732.865.922.725.689.73.652 5.837.84.818.887.927.792.744.756.737 87.93.74.716.727.647 1.84.842.826.89.925.788.754.759.742.865.929.736.71.725.654 3.823.826.89.856.886.769.726.735.725.864.925.75.74.71.642 inf.814.817.86.826.832.734.728.735.729.859.922.73.692.686.653 GVSM.756.77.72.826.91.78.79.714.724.823.916.75.642.657.654 CVM.761.768.779.831.897.791.725.725.723.825.916.713.673.678.654 Table 7 The p and t values of the statistical significance t-test of the difference in spectral clustering performance using GTCV-VSM (σ=1) and the compared representation methods, with respect to the two evaluation measures. Values of p smaller than the significance level of.5 (5%) indicate significant superiority of GTCV-VSM. GTCV D 1 D 2 D 3 D 4 D 5 (σ=1) vs p-val t-val p-val t-val p-val t-val p-val t-val p-val t-val BOW NMI. 27.3. 13.8. 62..26 1 4 4.85. 8.3 GVSM NMI. 16.7. 7.51. 13..129 1 5 4.99. 12.1 CVM NMI. 19.3.15 1 8 6.35. 138..316 1 3 3.67. 8.83 BOW F1. 24.1. 11.4. 875..123 1 4 4.48. 19.1 GVSM F1. 15.1. 7.53. 41..113 1 2 3.31. 3.7 CVM F1. 18.7. 7.11. 268..115 1 3 3.94. 14.1 not best for all cases, the value we used seems to be a reasonable choice for all the datasets we have considered. Note that similar effect was observed for spectral-c method. In order to illustrate the statistical significance of the obtained results, the well-known t-test was applied for each dataset to determine the significance of the performance difference between our methods and the compared methods. We have considered the case where σ=1 for the Gaussian kernel for all datasets. Within a confidence interval of 95% and for the value of degrees of freedom equal to 198 (for two sets of 1 experiments each), the critical value for t is t c =1.972 (p c =5% for p value). This means that if the computed t t c, then the null hypothesis is rejected (p 5%, respectively), i.e. our method is superior, otherwise the null hypothesis is accepted. As it can be observed from the results of the statistical tests for spk-means presented in Table 4, the performance superiority of GTCV- VSM is clearly significant in four out of five datasets with respect to all other methods. For dataset D 4 the tests indicate that GTCV-VSM, although still better than BOW, has less significant difference in performance compared to GVSM and CVM-VSM. Table 4 provides the respective t-test results for the spectral-c method where, also due to the lower standard deviation of the results using all document representation methods, the GTCV-VSM demonstrates significantly better results than the compared representations. The experimental results indicate that our method outperforms the traditional BOW approach in all cases, even for small values of smoothing parameter σ (e.g. σ=1 or 2). This substantiates our rationale that the clustering procedure is assisted