An Efficient Algorithm for Large-Scale Text Categorization

CHANG-RUI YU 1, YAN LUO 2
1 School of Information Management and Engineering, Shanghai University of Finance and Economics, No.777, Guoding Rd., Shanghai, P.R. China, yucr@sjtu.edu.cn
2 School of Management, Shanghai Jiao Tong University, No.535, Fahuazhen Rd., Shanghai, P.R. China, yanluo@sjtu.edu.cn

Abstract: - Text categorization is a core technique in the knowledge mining field. Most current categorization methods are based on the VSM, among which knn is the most widely used. However, most of them are computationally expensive and can hardly be used to classify large-scale samples. Moreover, their classifiers must be rebuilt whenever training samples are added or deleted, which makes them scale poorly. In this paper, a new categorization method based on Mutual Dependence and Equivalent Radius (called MDER) is proposed. MDER can classify large-scale samples and has good scalability. A series of experiments on classifying Chinese texts shows that MDER outperforms the knn and CCC methods and can be used online to classify large-scale samples while keeping high precision and recall.

Key-Words: - Text categorization, Mutual dependence, Equivalent radius, VSM

1 Introduction
In recent years, with the increasing interest in automatic text categorization, several algorithms have been presented, such as Bayes [1], SVM (support vector machine) [2], Boosting [3], and knn (the k-Nearest Neighbors) [4]. Most of them are based on the VSM, which has been applied to text categorization with some success [5]. However, these algorithms are computationally expensive and can hardly be used to classify large-scale samples. Moreover, their classifiers must be rebuilt when the corpus of training samples grows. As a result, they scale poorly. Therefore, it is essential to find a new method that classifies large-scale samples with high precision and recall.

This paper presents an efficient algorithm for large-scale text categorization based on mutual dependence and equivalent radius (MDER for short). The MDER algorithm has low computational complexity, enabling quick classification responses and good scalability. Moreover, unlike VSM-based algorithms, MDER uses the equivalent radius to construct its judging function, which returns a relative value according to the distribution of training samples rather than an absolute value based on the number of training samples. As a result, misjudgments caused by large disparities among categories in the number of samples or in the scope of their distributions can be avoided. MDER therefore makes it possible to classify large-scale online samples while keeping high precision and recall.

2 The MDER Algorithm

2.1 Attribute Selection Based on Mutual Dependence
In information processing, samples are always pre-processed to improve their representation [6,7]. An important step in this pre-processing is attribute selection. Generally, attribute selection algorithms fall into two categories: statistics-based algorithms and data-dictionary-based algorithms [8,9]. Both have their own advantages and disadvantages. The former are domain-independent and require no data dictionary, but they are less precise. The latter need the support of a data dictionary but offer higher precision.

As we know, it is impossible to express the whole real world at the semantic level with a single data dictionary, so data-dictionary-based algorithms must be domain-specific. Moreover, building a data dictionary is time-consuming, so statistics-based algorithms such as N-gram are widely used for attribute selection in practice. However, the N-gram algorithm is still not satisfactory, often producing nonsensical word segmentations. Considering this problem, this paper compares the advantages and disadvantages of the mutual information model and the dependent value model, and constructs a mutual dependence (MD) model that combines their advantages and discards their disadvantages. The MD model can determine to what degree two morphemes depend on each other, so it can be used to select important attributes for text classification. The MD-based attribute selection algorithm not only helps select the right attributes but also reduces the dimension, and it can ignore attributes that contribute little to text classification. Based on the Apriori property (i.e., any non-empty subset of a frequent item set must be frequent), this paper proposes a new segmentation rule (called Rule 1 below), by which words of maximal length can be obtained to reduce the dimension. In Algorithm 1, the MD model is used to process words in a way similar to Rule 1.

Definition 1 (Mutual Dependence, MD for short): the mutual dependence between two variables χ and η is defined as

MD(χ, η) = (1/L) * [F(s)*L - F(χ)*F(η)] * Log2( F(s)*L / (F(χ)*F(η)) )    (1)

where F(χ) and F(η) are the frequencies with which χ and η occur respectively, F(s) is the frequency with which χ and η occur together, and L is the length of the samples. All these frequencies are obtained directly from the samples to be processed. Thus, no data dictionary is required, and the words segmented from the samples are closer to the domain knowledge.

Rule 1 (Eliminating Redundant Words): If there exist two N-gram items t_i and t_j satisfying t_j ⊆ t_i and score(t_i) = score(t_j), then one of them is redundant; t_i, the longer item, is kept by default. Here, score(*) is the score of an N-gram item, equal to the item's occurrence frequency.

Theorem 1 (Range of Mutual Dependence): The mutual dependence between χ and η ranges within [0, (1/4)*Log L), where L is the length of the samples.

Theorem 1 can be proved easily, so the proof is omitted here. Under Theorem 1, we set an upper limit μ_U and a lower limit μ_L (0 < μ_L < μ_U < (1/4)*Log L) and select the attributes whose mutual dependence lies between μ_L and μ_U; μ_U may be set close to (1/4)*Log L and μ_L close to 0. The basic idea of attribute selection is as follows:
(1) Use the N-gram algorithm to segment words, discard the words with too high or too low frequency, and obtain a word set U(x) = { x | x ∈ U_j-gram, 1 ≤ j ≤ n }, where U_j-gram is the set of words of length j in the original text.
(2) Judge the correlation between the words in U(x) using the MD model: first process the n-gram words, then the (n-1)-gram words, and so on down to the 1-gram words. When processing an i-gram word x_i ∈ U(x) (2 ≤ i ≤ n), segment x_i into x_i^j of length j (1 ≤ j < i) and x_i^k of length k (k = i - j). If x_i^j ∈ U(x) and x_i^k ∈ U(x), then compute MD(x_i^j, x_i^k).
When selecting attributes, use μ_U and μ_L as follows:
1) If MD(x_i^j, x_i^k) > μ_U, then subtract the frequency of x_i from the frequencies of x_i^j and x_i^k respectively;
2) If MD(x_i^j, x_i^k) < μ_L, then delete x_i from U(x);
3) If MD(x_i^j, x_i^k) lies within [μ_L, μ_U], then U(x) does not change.
(3) Finally, delete the words with too high or too low frequency from U(x).
When the i-gram words are processed, the j-gram words (0 < j < i) have already been processed, so the computational complexity decreases and the remaining words are more precise. Since most Chinese words consist of two characters [10], n in the N-gram is often set to 4.
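As an illustration of the selection procedure above, the following Python sketch computes MD for each candidate split of a longer word and applies the μ_U/μ_L thresholds of steps 1)-3). It is a sketch under assumptions rather than the authors' code: the function names mutual_dependence and prune_attributes are invented here, Formula (1) is used in the reconstructed form given above, and word frequencies are assumed to be raw counts over a sample of length L.

```python
# Illustrative sketch of MD-based pruning of N-gram attributes; names are assumptions.
import math

def mutual_dependence(f_xy, f_x, f_y, L):
    """MD(x, y) as read from Formula (1): a covariance-like term scaled by 1/L
    times a log-ratio term; returns 0 when the pair or either part never occurs."""
    if f_xy == 0 or f_x == 0 or f_y == 0:
        return 0.0
    return (f_xy * L - f_x * f_y) / L * math.log2(f_xy * L / (f_x * f_y))

def prune_attributes(freq, L, mu_low, mu_high):
    """freq: dict mapping a candidate word to its frequency (U(x) with its counts).
    Longer words are processed first, mirroring step (2) above."""
    words = dict(freq)
    for w in sorted(freq, key=len, reverse=True):
        if len(w) < 2 or w not in words:
            continue
        for j in range(1, len(w)):                 # every split into a prefix and a suffix
            left, right = w[:j], w[j:]
            if left not in words or right not in words:
                continue
            md = mutual_dependence(words[w], words[left], words[right], L)
            if md > mu_high:                       # parts strongly depend on each other:
                words[left] = max(words[left] - words[w], 0)    # discount the parts, keep w
                words[right] = max(words[right] - words[w], 0)
            elif md < mu_low:                      # parts weakly depend on each other: drop w
                del words[w]
                break
            # otherwise U(x) is left unchanged for this split
    return words
```

Algorithm 1 below adds the frequency thresholds δ_U and δ_L and Rule 1 on top of this core loop.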

Algorithm 1: Attribute Selection Algorithm Based on Mutual Dependence
Input: D: texts whose attributes are to be selected; μ_U: upper limit of MD; μ_L: lower limit of MD; δ_U: upper-limit word frequency; δ_L: lower-limit word frequency; n: the N of the N-gram
Output: U_f: attribute set; W_f: adjusted frequency set corresponding to U_f
Steps:
(1) Segment words using the N-gram algorithm, then delete the words with frequency higher than δ_U or lower than δ_L, obtaining a word set U_f and a word frequency set W_f.
(2) For I = n To 2 Step -1 Do
    Pick an unprocessed I-gram word from U_f, and try to segment it into words x_i and x_j that both exist in U_f. According to the frequencies in W_f and Formula (1), compute MD(x_i, x_j).
    If MD(x_i, x_j) > μ_U then subtract the frequency of the I-gram word from the frequencies of x_i and x_j in W_f respectively;
    Else if MD(x_i, x_j) < μ_L then delete the current I-gram word from U_f, and delete its frequency from W_f.
    End if
    Repeat until all I-gram words in U_f are processed.
    Next I
(3) Delete from U_f those words whose frequency is larger than δ_U or less than δ_L, and delete their frequencies from W_f.
(4) Use Rule 1 to reduce the dimension, yielding the attribute set U_f and the corresponding frequency set W_f.

2.2 Classification Model
Definition 2 (Equivalent Radius, ER for short): Let the number of training samples for category ω_i (i=1,...,c) be n_i, and let the dimension of the vector space be d. Project all samples of ω_i on every dimension d_j (j=1,...,d) respectively to get the projection range of ω_i on d_j, denoted as Range_ij = [R-_ij, R+_ij]. Compute the center of gravity of ω_i on d_j, denoted as Center_ij. Here, R-_ij is the radius from Center_ij toward the origin O, and R+_ij is the radius from Center_ij in the positive direction; generally, R-_ij ≠ R+_ij. n-_ij is the number of samples whose projection lies between O and Center_ij, and n+_ij is the number of samples whose projection lies in the positive direction from Center_ij. The equivalent radius of ω_i projected on d_j is defined as

R_ij = a_ij * R-_ij + (1 - a_ij) * R+_ij    (2)

where 1 ≤ i ≤ c, 1 ≤ j ≤ d, and a_ij is a distribution coefficient. In the following, an algorithm is given for computing R_ij and a_ij.

Algorithm 2: Classification Model
Input: d_ti: training sample set, 1 ≤ t ≤ n_i, 1 ≤ i ≤ c, where n_i is the number of training samples for category ω_i (i=1,...,c) and c is the number of categories.
Output: R_ij: equivalent radius of category i on dimension j; Center_ij: center of gravity of category i on dimension j; a_ij: distribution coefficient of category i on dimension j, 1 ≤ i ≤ c, 1 ≤ j ≤ d.
Phase 1: Compute Center_ij, R-_ij, and R+_ij
(1) Assign sequence numbers to categories and attributes respectively, where the category number ranges over [1, c] and the attribute dimension ranges over [1, d].
(2) Let i = 1.
(3) For each category ω_i, do the following: normalize all the training sample vectors of ω_i; project all the training samples of ω_i on each dimension d_j (j=1,...,d) and get the projection range Range_ij = [R-_ij, R+_ij]; compute Center_ij, n-_ij, n+_ij, R-_ij, R+_ij.
(4) If all the categories have been processed, then enter Phase 2. Else, let i = i + 1 and go to Step (3) to process the next category.
Phase 2: Determine the Equivalent Radius
(1) Let i = 1. For each category ω_i, do the following operations:
(2) Let j = 1.

(3) For each dimension d_j, compute

a_ij = n+_ij / (n-_ij + n+_ij),    R_ij = a_ij * R-_ij + (1 - a_ij) * R+_ij

(4) If all the dimensions have been processed, then enter Step (5); else, let j = j + 1 and go to Step (3) to process the next dimension.
(5) If all categories have been processed, then stop; else, let i = i + 1 and go to Phase 2 to process the next category.

The coefficient a_ij can be determined by the golden section or other methods. Algorithm 2 sets a_ij = n+_ij / (n-_ij + n+_ij) according to the sample distribution. The reason is that, if the distribution of samples is denser in the area toward the origin O from Center_ij, then R-_ij < R+_ij and n-_ij < n+_ij, so R-_ij should have a greater weight; with a_ij = n+_ij / (n-_ij + n+_ij) and R_ij = a_ij * R-_ij + (1 - a_ij) * R+_ij, this is the case, and vice versa.

2.3 MDER Algorithm
First, construct a judging function for each category ω_i, where (x_j - Center_ij)^2 / (R_ij)^2 is defined as the relative distance between a sample x = {x_1, x_2,..., x_d} and ω_i on dimension j. The judging function is thus based on relative distance rather than absolute distance, so as to avoid the misjudgment caused by great disparities among categories in the number of training samples or in the scope of their distributions. Under Formula (2), if the projection of ω_i on d_j is equal to 0, then R_ij = 0 and a division overflow arises; therefore, a distance coefficient denoted 1/β is introduced for such dimensions. The judging function is written as

g_i(x) = Σ_{j=1..k} (x_j - Center_ij)^2 / (R_ij)^2 + Σ_{j=k+1..d} x_j^2 / β^2,    i = 1, 2,..., c    (3)

where the first sum runs over the dimensions on which ω_i has a non-zero projection and the second over the remaining dimensions. Among all categories, if ω_i is the closest to the testing sample x in relative distance, i.e. g_i(x) is the smallest, then we say x belongs to ω_i. So the rule is: if g_j(x) = Min_i g_i(x), i = 1, 2,..., c, then x ∈ ω_j.

Next, it can be shown that the classification result is not highly sensitive to 1/β^2: if the projection of one category on a dimension is equal to 0, then that dimension is generally specific to other categories, and a small change in 1/β^2 cannot greatly influence the final result.

Algorithm 3: MDER Algorithm
Input: d_ti: training sample set, where 1 ≤ t ≤ n_i, 1 ≤ i ≤ c, n_i is the number of texts to be trained for category ω_i (i=1,2,...,c), and c is the number of categories; t_j: testing sample set, where 1 ≤ j ≤ m, and m is the number of texts to be tested.
Output: the categories of the testing samples.
Steps:
(1) According to Algorithm 1, extract attributes from the training sample set.
(2) According to Algorithm 2, compute the center of gravity and the equivalent radius of the classifier's projection on every dimension.
(3) From the testing sample set, pick a sample x that has not been classified yet.
(4) Get the attribute values of x.
(5) Compute g_i(x) according to Formula (3).
(6) If g_j(x) = Min_i g_i(x), i = 1, 2,..., c, then x ∈ ω_j.
(7) If all the testing samples have been classified, then stop; else, go to Step (3).

Some notation: c is the number of categories, d is the number of dimensions, n is the average number of testing texts per category, and m is the average number of training texts per category. The performance of the MDER algorithm is analyzed as follows:
(1) Classification efficiency: from Algorithm 3, the worst-case complexity of MDER is (c * n * (log2 c)!), while it is (c^2 * n * m + c * n * (log2(c * m))!) for knn. Generally n << m, so the complexity of MDER is less than that of knn.
(2) Updating performance: MDER is better at updating, because when training samples are added or deleted, the trained classifier only has to be adjusted according to the attributes of the added or deleted samples, rather than retrained on all samples.
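To illustrate how Algorithm 2 and the judging function of Formula (3) fit together, the following Python sketch builds the per-category centers and equivalent radii and then classifies a vector by the smallest relative distance. It is a reconstruction under stated assumptions, not the authors' implementation: the class name EquivalentRadiusClassifier and the parameter inv_beta_sq are invented here, samples are assumed to be dense non-negative attribute vectors produced by the attribute-selection step, and a_ij is taken as n+_ij/(n-_ij + n+_ij) as read from Algorithm 2.

```python
# Illustrative sketch of the equivalent-radius classifier (Algorithm 2 + Formula 3).
# Names and data layout are assumptions made for this example only.
import numpy as np

class EquivalentRadiusClassifier:
    def __init__(self, inv_beta_sq=40.0):
        # 1/beta^2: coefficient applied on dimensions where a category's radius is zero
        self.inv_beta_sq = inv_beta_sq

    def fit(self, samples_by_category):
        """samples_by_category: list of (n_i x d) arrays, one per category omega_i."""
        self.center, self.radius = [], []
        for X in samples_by_category:
            X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)  # normalize
            lo, hi = X.min(axis=0), X.max(axis=0)          # projection range on each dimension
            center = X.mean(axis=0)                        # center of gravity
            r_minus, r_plus = center - lo, hi - center     # radii toward the origin / positive side
            n_minus = (X < center).sum(axis=0)             # samples projected below the center
            n_plus = (X > center).sum(axis=0)              # samples projected above the center
            a = n_plus / np.maximum(n_minus + n_plus, 1)   # distribution coefficient a_ij
            self.center.append(center)
            self.radius.append(a * r_minus + (1.0 - a) * r_plus)   # Formula (2)
        return self

    def judge(self, x, i):
        """Relative distance g_i(x) of Formula (3)."""
        c, r = self.center[i], self.radius[i]
        nonzero = r > 0
        g = np.sum(((x[nonzero] - c[nonzero]) / r[nonzero]) ** 2)
        g += self.inv_beta_sq * np.sum(x[~nonzero] ** 2)   # zero-radius dimensions use 1/beta^2
        return g

    def predict(self, x):
        """Assign x to the category with the smallest relative distance."""
        return int(np.argmin([self.judge(x, i) for i in range(len(self.center))]))
```

The default of 40 for the coefficient in the sketch mirrors the experimental setting of 1/β^2 reported in Section 3.2.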
Next, we introduce how to adjust the classifier when adding a training sample; deletion is handled in a similar way. When adding a new sample x = (x_1, x_2,..., x_d) to category ω_i, it is only required to adjust Center_ij, the center of gravity of ω_i, and R_ij. The relationship between the adjusted center of gravity, written as Center_ij, and the original center of gravity, written as Center_ij^(0), is

Center_ij = (n_i * Center_ij^(0) + x_j) / (n_i + 1)    (4)
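As a small illustration of Formula (4), the sketch below folds one new sample into the stored center of gravity. The function name update_center is invented for this example; the corresponding adjustments of the equivalent radius and of n-_ij, n+_ij (Formula (5) and Algorithm 4 below) would be applied alongside it.

```python
# Minimal sketch of the incremental center-of-gravity update in Formula (4).
import numpy as np

def update_center(center_old, n_old, x):
    """Running-mean update: fold one new sample x (a d-dimensional vector)
    into the stored center of gravity of its category."""
    return (n_old * center_old + x) / (n_old + 1)

# usage, assuming per-category arrays center[i] and counts n_samples[i]:
# center[i] = update_center(center[i], n_samples[i], x_new)
# n_samples[i] += 1
```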

The relationship between the adjusted ER, written as R_ij, and the original ER, written as R_ij^(0), is

R_ij = R_ij^(0) + (n-_ij - n+_ij) * (Center_ij^(0) - x_j) / (n_i + 1) + 2 * (Center_ij^(0) - x_j) / (n_i + 1)    (5)

After the adjustment of Center_ij and R_ij, n-_ij and n+_ij also need to be revised, according to Algorithm 4. When more samples are to be added, Formulas (4) and (5) are applied repeatedly to get a new classifier.

Algorithm 4: Updating MDER Classifier
Input: (1) Center_ij^(0), R_ij^(0), n-_ij^(0), n+_ij^(0) (for the existing classifier); (2) U: the set of training samples to be added.
Output: Center_ij, R_ij (for the new classifier).
Steps:
(1) Pick a sample x = (x_1, x_2,..., x_d) from U.
(2) According to Formula (4) and Formula (5), compute Center_ij and R_ij on the category of x for the new classifier.
(3) If Center_ij^(0) > x_j Then { Ω = (Center_ij^(0) - Center_ij) * n_i / R_ij^(0); n-_ij = n-_ij^(0) - Ω + 1; n+_ij = n+_ij^(0) + Ω }
    Else { Ω = (Center_ij - Center_ij^(0)) * n_i / R_ij^(0); n+_ij = n+_ij^(0) - Ω + 1; n-_ij = n-_ij^(0) + Ω }
(4) If all the training samples to be added have been processed in the above way, then the algorithm stops. Else, let Center_ij^(0) = Center_ij, R_ij^(0) = R_ij, and go to Step (1).

2.4 Updating Performance
As shown by the performance analysis above, the MDER classifier can be updated incrementally: when training samples are added or deleted, Algorithm 4 adjusts the existing classifier according to the new samples instead of retraining on the whole corpus.

3 Experimental Result and Analysis

3.1 Experiment Content and Performance
Next, the MDER algorithm is applied to the classification of Chinese texts, and its experimental results are compared with those of knn and CCC (Classification based on the Center of Classes). The experimental data are stored in two corpora, G1 and G2, which come from the 2005 People's Daily and the SOHU website respectively, as shown in Table 1 and Table 2.

Table 1 The testing corpora (G1)
Categories       Number of Texts
Politics         743
Sports           321
Economy          293
Agriculture      86
Environment      104
Astronautics     135
Art              156
Education        131
Medicine         99
Transportation   102
Energy           42
Computer         51
Mining           68
History          102
Military         122
Total            2555

Table 2 The testing corpora (G2)
Categories       Number of Texts
Mining           599
Military         321
Computer         246
Electronics      101
Communications   89
Energy           129
Philosophy       132
History          103
Law              124
Literature       99
Total            1943

Definition 3 (Precision and Recall): In classification, a is the number of texts belonging to category ω_i that are correctly identified as ω_i; b is the number of texts not belonging to ω_i that are mistakenly identified as ω_i; c is the number of texts belonging to ω_i that fail to be identified as ω_i; and d is the number of texts not belonging to ω_i that are not identified as ω_i. Then Recall = a/(a + c) and Precision = a/(a + b).

Definition 4 (F-value): In multi-class classification, the F-value is a function of the precision ratio P and the recall ratio R that measures classification performance:

F = 2 * P * R / (P + R)    (6)

The experiment proceeds as follows:
(1) Let the distance coefficient be 1, 6.5, 12.5, 25, 50, 100, and 200 respectively. Taking different dimensions of the vector space, we use the MDER algorithm to test the texts in G1 and G2 in order to evaluate the influence of the distance coefficient on the classification results.
(2) Pick 70%, 80%, and 90% of the samples from G1 and G2 at random for training, and use the remaining 30%, 20%, and 10% for testing. Taking different dimensions of the vector space, we compare the MDER algorithm with knn and CCC in classification performance for the open test.
(3) Pick 10%, 25%, and 50% of the samples of every category in G1 and G2 at random for testing, while 100% take part in training. Taking different dimensions of the vector space, we compare the MDER algorithm with knn and CCC in classification performance for the closed test.
(4) Use Algorithm 4 to add 100 texts belonging to the 15 categories to G1, then use the MDER algorithm to test, in order to evaluate the incremental learning ability.

3.2 Experimental Results
Fig. 1 shows the F-value versus the distance coefficient and the dimension. With the dimension fixed, the F-value improves greatly at the beginning as the distance coefficient increases; after reaching its maximum, the F-value decreases slowly and levels off (Fig. 1(b)). As mentioned above, the improvement is no longer noticeable after the F-value reaches its peak, so beyond that point the classification result is not sensitive to the distance coefficient. The reason is that, for a category, if the projection on an attribute is equal to 0, then that attribute is generally distinctive of other categories. The distance coefficient emphasizes the distance between the sample being classified and the category, but the degree of emphasis does not affect the testing result, so a small change of the distance coefficient within a wide range cannot greatly influence the classification performance. When the distance coefficient is larger than 40, the performance at different dimensions decreases smoothly, so in the experiments 1/β^2 is set to 40. Fig. 1(a) shows the influence of the dimension on classification performance. Too few dimensions cannot express the characteristics of the categories, while too many dimensions introduce disturbance. In this work's experiments, the best performance is obtained when the dimension is 300.

Fig. 1. The relationship of F-value to distance coefficient and dimension: (a) F-value versus distance coefficient and dimension; (b) F-value versus distance coefficient when the dimension is 300.
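Before turning to the overall comparison, the following sketch spells out the measures of Definitions 3 and 4. The function name precision_recall_f and the (true_labels, predicted_labels) interface are choices made for this example, not taken from the paper.

```python
# Small sketch of the evaluation measures of Definitions 3 and 4; names are assumptions.
def precision_recall_f(true_labels, predicted_labels, category):
    a = sum(1 for t, p in zip(true_labels, predicted_labels)
            if t == category and p == category)   # correctly assigned to omega_i
    b = sum(1 for t, p in zip(true_labels, predicted_labels)
            if t != category and p == category)   # wrongly assigned to omega_i
    c = sum(1 for t, p in zip(true_labels, predicted_labels)
            if t == category and p != category)   # texts of omega_i that were missed
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_value
```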

When classifying large-scale text, speed should be considered in addition to recall and precision. In this paper, we therefore take the average F-value and the response time as the factors for evaluating the MDER algorithm comprehensively: the larger the F-value and the shorter the response time, the higher the overall performance. Fig. 2 and Fig. 3 illustrate the relationship of F-value and response time to the dimension for the closed and open tests respectively. The experiments show that the MDER algorithm is better than knn and CCC with respect to recall and precision. With respect to response time, MDER is better than knn and equivalent to CCC; however, the F-value of CCC is far less than that of MDER. So we can conclude that MDER outperforms knn and CCC in overall performance. Moreover, after adding 100 texts to G1 by Algorithm 4, both precision and recall increase by 1%, so the performance of the classifier is still guaranteed when the corpora are updated. Therefore, MDER not only offers higher classification precision and speed, but also supports incremental learning.

Fig. 2. The relationship of F-value and response time to dimension for the closed test (knn, MDER, CCC): (a) F-value versus dimension; (b) response time versus dimension.

Fig. 3. The relationship of F-value and response time to dimension for the open test (knn, MDER, CCC): (a) F-value versus dimension; (b) response time versus dimension.

4 Conclusion
Most current text categorization algorithms are based on the vector space model, among which knn is the most widely used. However, these algorithms do not adapt well to large-scale text because of their high computational complexity. Moreover, when the corpus of training samples grows, the classifier must be rebuilt, so they scale poorly. This paper presents two concepts, mutual dependence (MD) and equivalent radius (ER), and proposes a new MD- and ER-based algorithm that suits classifying large-scale text and has good scalability. Compared with knn and CCC, the new algorithm improves not only precision and recall but also response speed, which makes it possible to classify large-scale online text.

References:
[1] Aas K and Eikvil L, Text categorization: A survey, Technical Report, NR 941, Oslo: Norwegian Computing Center.
[2] Joachims T, Text categorization with support vector machines: Learning with many relevant features, Proc. of the 10th European Conf. on Machine Learning (ECML-98), Chemnitz: Springer-Verlag, 1998.

[3] Schapire R E and Singer Y, BoosTexter: A boosting-based system for text categorization, Machine Learning, Vol. 39, No. 2-3, 2000.
[4] Tan S, Neighbor-weighted k-nearest neighbor for unbalanced text corpus, Expert Systems with Applications, Vol. 28, No. 4, 2005.
[5] Sebastiani F, Machine learning in automated text categorization, ACM Computing Surveys, Vol. 34, No. 1, 2002, pp. 1-47.
[6] Chien-I Lee and Hsiu-Min Chuang, An effective soft clustering approach to mining gene expressions from multi-source databases, Proceedings of the 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases, 2007.
[7] Sen-chi Yu and Yuan-horng Lin, Comparisons of possibility- and probability-based classification: An example of depression severity clustering, Proceedings of the 8th WSEAS International Conference on Fuzzy Systems, 2007.
[8] Michael K, Andreas K and Dave R, Stochastic Language Models (N-Gram) Specification, W3C Working Draft, 3 January 2001.
[9] Rogati M and Yang Y, High-performing feature selection for text classification, Proc. of the 11th ACM Int'l Conf. on Information and Knowledge Management (CIKM-02), McLean: ACM Press, 2002.
[10] Thomas E, Segmenting Chinese in Unicode, Proc. of the 16th International Unicode Conference, 2000.
