IJCSI International Journal of Computer Science Issues, Volume 12, Issue 1, January 2015
ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784
www.ijcsi.org

Keyword Reduction for Text Categorization using Neighborhood Rough Sets

Si-Yuan Jing

Sichuan Province University Key Laboratory of Internet Natural Language Intelligent Processing, Leshan Normal University, Leshan, 614000, China

Abstract
Keyword reduction is a technique that removes some less important keywords from the original dataset. Its aim is to decrease the training time of a learning machine and to improve the performance of text categorization. Some researchers have applied rough sets, a popular computational intelligence tool, to keyword reduction. However, the classical rough set model, which is usually adopted, can only deal with nominal values. In this work, we apply neighborhood rough sets to the keyword reduction problem. A heuristic algorithm is proposed and compared with some classical methods, such as Information Gain, Mutual Information, and the χ² statistic. The experimental results show that the proposed method outperforms the other methods.
Keywords: Text Categorization; Keyword Reduction; Neighborhood Rough Sets; Heuristic Algorithm.

1. Introduction

Automatic text categorization has long been a hot issue in pattern classification and web data mining. It involves several key steps, including text pre-processing, keyword reduction (also called feature selection), and learning machine training. Keyword reduction removes less important keywords from the original keyword set, improving both the result of text categorization and the training time. Statistical methods are the most commonly used for keyword reduction, such as the χ² statistic, information gain, and mutual information [1]. Some other techniques can also perform well in keyword reduction, such as rough sets [2, 13]. Rough set theory is a powerful computational intelligence tool proposed by Z. Pawlak in 1982 [3]. It has been proven efficient at handling imprecise, uncertain, and vague knowledge.
Attribute reduction is one of the most important applications of rough sets. In the last decade, some researchers have applied rough sets to the keyword reduction problem and obtained promising results. For example, A. Chouchoulas and Q. Shen first introduced a common method for applying rough sets to keyword reduction in text categorization [4]. Miao et al. proposed a hybrid algorithm for text categorization which combined traditional classifiers with variable precision rough sets [2]. Y. Li et al. presented a novel rough set-based case-based reasoner for use in text categorization [5]. However, all of this research adopts the classical rough set model (i.e. the Pawlak rough set model), which can only deal with nominal values. Keyword values are continuous, so the data must be discretized before computation. This loses some important information and impacts the final result. To address this problem, we apply neighborhood rough sets to the keyword reduction problem in this paper. A new heuristic algorithm for keyword reduction is proposed. We use two famous classifiers, i.e. KNN and SVM, to test the performance of the new algorithm. The experimental datasets are Reuters-21578 and 20-newsgroups. Experimental results show that the proposed method achieves good performance and outperforms some well-known statistical methods. The rest of the paper is organized as follows: section 2 recalls some basic concepts of neighborhood rough sets as well as keyword reduction techniques. Section 3 introduces a new heuristic algorithm based on neighborhood rough sets for keyword reduction. Section 4 gives the experimental results and discussion. Finally, section 5 concludes.

2. Preliminaries of Neighborhood Rough Sets

Neighborhood theory was proposed by Lin et al. in 1990 [9]. It has become an important tool for many artificial intelligence tasks, such as machine learning, pattern classification, data mining, natural language understanding, and information retrieval [8].
Neighborhood rough sets is a model which combines neighborhood theory and rough set theory, and it can be used to deal with numeric or hybrid values. In this section, we briefly recall some basic concepts and theory of neighborhood rough sets from [6, 10].

2015 International Journal of Computer Science Issues
Definition 2.1: Let U be a non-empty finite set of objects and Δ: U × U → [0, ∞) a given function. We say NAS = ⟨U, Δ⟩ is a neighborhood approximation space if, for all x1, x2, x3 ∈ U:
1) Δ(x1, x2) ≥ 0, and Δ(x1, x2) = 0 if and only if x1 = x2;
2) Δ(x1, x2) = Δ(x2, x1);
3) Δ(x1, x3) ≤ Δ(x1, x2) + Δ(x2, x3).
We say Δ is the distance function of this neighborhood space. Many useful distance functions can be used, such as the Euclidean distance, the Minkowski distance, etc. To construct a neighborhood rough set model for universe granulation on numerical attributes, Hu et al. proposed the notion of a neighborhood [7].

Definition 2.2: Given a neighborhood space ⟨U, Δ⟩, x ∈ U and δ ≥ 0, we say δ(x) is the neighborhood with centre x and radius δ, where:

δ(x) = { y ∈ U | Δ(x, y) ≤ δ }    (2.1)

It is easy to see that a set of neighborhoods induces a family of information granules which can be used to approximate any concept in the universe.

Definition 2.3: A neighborhood information system is a triple NIS = ⟨U, A, Δ⟩, where U is a non-empty finite set of objects, A is a non-empty finite set of attributes, and Δ is a distance function; A and Δ form a family of neighborhood relations on U.

We give an example to illustrate how to compute neighborhoods in a given information system. Table 1 shows an information table with six objects, each having two attributes. We use an improved Euclidean distance proposed in [10] as the distance function. The results are shown in Table 2. If we set the threshold δ to 0.3, we get: δ(x1) = {x1, x5}, δ(x2) = {x2, x4}, δ(x3) = {x3, x5}, δ(x4) = {x2, x4}, δ(x5) = {x1, x3, x5}, and δ(x6) = {x6}.

Table 2. The results of the distance computation
      x1    x2    x3    x4    x5    x6
x1    0     1.11  0.34  1.04  0.14  0.92
x2    1.11  0     0.97  0.26  1.12  0.68
x3    0.34  0.97  0     0.97  0.2   1
x4    1.04  0.26  0.97  0     1.09  0.43
x5    0.14  1.12  0.2   1.09  0     1.04
x6    0.92  0.68  1     0.43  1.04  0

Based on the above definitions, we can construct the lower approximation and upper approximation in neighborhood rough sets.
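As a sketch of the neighborhood computation above, the following Python snippet (our own illustration, not code from the paper) finds the δ-neighborhood of an object under plain Euclidean distance; the data values are made up and are not those of Table 1.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def neighborhood(X, i, delta):
    """delta-neighborhood of object i (Definition 2.2):
    indices of all objects within distance delta of X[i]."""
    return {j for j, y in enumerate(X) if dist(X[i], y) <= delta}

# Illustrative data: four objects, two numeric attributes
X = [(0.1, 0.2), (0.9, 0.8), (0.15, 0.25), (0.85, 0.9)]
print(neighborhood(X, 0, 0.3))  # x_0 is close only to itself and x_2
```

Note that every object belongs to its own neighborhood, since Δ(x, x) = 0 ≤ δ.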
Definition 2.4: Given a neighborhood approximation space NAS = ⟨U, Δ⟩ and X ⊆ U, the lower approximation and upper approximation of X are defined as:

N̲X = { x ∈ U | δ(x) ⊆ X }
N̄X = { x ∈ U | δ(x) ∩ X ≠ ∅ }    (2.2)

For any X ⊆ U, N̲X ⊆ X ⊆ N̄X. Moreover, we can define the boundary region, positive region, and negative region as follows:

BN(X) = N̄X − N̲X    (2.3)
POS(X) = N̲X    (2.4)
NEG(X) = U − N̄X    (2.5)

The positive region POS(X) represents the granules contained in X; NEG(X) represents the granules not contained in X; BN(X) represents the granules partially contained in X. If N̲X = N̄X, we say X is definable in this neighborhood approximation space; otherwise it is indefinable, namely rough.

In practical environments, some data are noisy. The model defined above is crisp and has no tolerance for noise. Therefore, we adopt variable precision rough sets to improve formula 2.2:

N̲^β(X) = { x ∈ U | card(δ(x) ∩ X) / card(δ(x)) ≥ β }
N̄^β(X) = { x ∈ U | card(δ(x) ∩ X) / card(δ(x)) > 1 − β }    (2.6)

The function card(·) counts the number of objects in a set. The parameter β is usually set between 0.5 and 1.

Table 1. An illustrative example for neighborhood computation
obj.   attribute1   attribute2       obj.   attribute1   attribute2
x1     2            4                x4     10           8
x2     12           7                x5     3            3
x3     5            3                x6     6            9
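Formula 2.6 can be sketched as follows (our own illustration; `neigh` maps each object to its precomputed δ-neighborhood). With β = 1 it reduces to the crisp approximations of Definition 2.4.

```python
def lower_upper(neigh, X, beta=1.0):
    """Variable-precision lower/upper approximations of a concept X (formula 2.6).
    neigh: dict mapping each object to the set of objects in its neighborhood."""
    lower = {x for x, nb in neigh.items() if len(nb & X) / len(nb) >= beta}
    upper = {x for x, nb in neigh.items() if len(nb & X) / len(nb) > 1 - beta}
    return lower, upper

# Neighborhoods of a toy 4-object universe, and a concept X
neigh = {0: {0, 1}, 1: {0, 1}, 2: {2}, 3: {2, 3}}
X = {0, 1, 2}
print(lower_upper(neigh, X))  # crisp case: object 3 is only partially in X
```

Here object 3 falls in the boundary region: half of its neighborhood lies outside X, so it leaves the crisp lower approximation but re-enters once β is relaxed to 0.5.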
Data in many pattern classification problems, such as the keyword reduction problem, can be represented by a decision table. Therefore, we introduce some concepts and a neighborhood rough set model to handle decision tables.

Definition 2.5: Given a neighborhood information system NIS = ⟨U, A, Δ⟩, we say it is a neighborhood decision table NDT = ⟨U, C ∪ D, Δ⟩ if A = C ∪ D, where C represents the condition attribute set, D represents the decision attribute, and C forms a family of neighborhood relations on U.

Definition 2.6: Given a neighborhood decision table NDT = ⟨U, C ∪ D, Δ⟩ and B ⊆ C, the decision attribute D divides the universe U into equivalence classes X1, X2, ..., Xn. The lower approximation and upper approximation of D relative to B are defined as:

N̲_B(D) = N̲_B(X1) ∪ N̲_B(X2) ∪ ... ∪ N̲_B(Xn)
N̄_B(D) = N̄_B(X1) ∪ N̄_B(X2) ∪ ... ∪ N̄_B(Xn)    (2.7)

Obviously, POS_B(D) = N̲_B(D), NEG_B(D) = U − N̄_B(D), and BN_B(D) = N̄_B(D) − N̲_B(D).

3. Algorithm

Generally speaking, there are two steps in a text categorization task, i.e. text preprocessing and classifier building. Text preprocessing has five sub-steps: word extraction, stop word removal, word stemming, keyword reduction, and keyword weighting. The first three sub-steps extract words from the raw text, delete useless words that appear in a given stop word list, and merge the words with the same stem. After that, we commonly use the VSM (Vector Space Model) to represent the texts. The last two sub-steps, i.e. keyword reduction and weighting, remove keywords that contribute little to classification and assign each remaining keyword a new value that emphasizes its importance. Classifier building in text categorization is the same as in other machine learning tasks, so its explanation is omitted here. Keyword reduction is a key step in text categorization. Its aim is to remove less important keywords. A good keyword reduction algorithm keeps the most important keywords, which greatly improves the classifier's performance as well as decreasing the training time.
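The first three preprocessing sub-steps can be sketched as follows (a minimal illustration: the stop word list is a tiny stand-in for the 887-word list used in the paper, and stemming is reduced to a trivial suffix rule rather than the full Porter algorithm):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}  # tiny stand-in list

def preprocess(text):
    """Word extraction + stop word removal + toy 's'-stripping 'stemmer'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w.rstrip("s") for w in words if w not in STOP_WORDS]

def tf_vector(words, vocabulary):
    """VSM representation: raw term frequencies over a fixed keyword list."""
    counts = Counter(words)
    return [counts[t] for t in vocabulary]

tokens = preprocess("The cats and the dog play in the garden")
print(tokens)                                     # ['cat', 'dog', 'play', 'garden']
print(tf_vector(tokens, ["cat", "dog", "fish"]))  # [1, 1, 0]
```

After this stage each text is a numeric vector over the keyword set, which is exactly the continuous-valued input the neighborhood model of section 2 is designed for.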
In this section, we first give some important concepts and theorems to measure the importance degree of keywords based on neighborhood rough sets, and then we propose a heuristic algorithm for keyword reduction in text categorization. To apply neighborhood rough sets to text categorization, we must represent the problem as a neighborhood decision table NDT = ⟨U, C ∪ D, Δ⟩, where U represents a set of samples (i.e. texts), C represents a set of keywords, D represents the label of the documents, and Δ is a pre-defined distance function. Then, for B ⊆ C, we can define the dependency relation between B and D as:

γ_B(D) = card(POS_B(D)) / card(U)    (3.1)

Obviously, 0 ≤ γ_B(D) ≤ 1. The dependency γ_B(D) reflects the degree to which D depends on B. This definition is in accordance with classical rough set theory.

Theorem 3.1: Given a neighborhood information system NIS = ⟨U, A, Δ⟩ and a threshold δ, if B1 ⊆ B2 ⊆ A, then ∀x ∈ U, δ_B2(x) ⊆ δ_B1(x).
Proof. It is obvious, thus the proof is omitted here.

Theorem 3.2: Given a neighborhood decision table NDT = ⟨U, C ∪ D, Δ⟩, if B1 ⊆ B2 ⊆ C, then POS_B1(D) ⊆ POS_B2(D).
Proof. Without loss of generality, let x ∈ POS_B1(D), i.e. δ_B1(x) ⊆ X_j, where X_j represents the j-th equivalence class divided by decision D. According to theorem 3.1, δ_B2(x) ⊆ δ_B1(x), so δ_B2(x) ⊆ X_j and thus x ∈ POS_B2(D). This completes the proof.

Theorem 3.3: γ_B(D) is monotone, that is to say, if B1 ⊆ B2 ⊆ C, then γ_B1(D) ≤ γ_B2(D) ≤ γ_C(D).
Proof. According to theorem 3.2, if B1 ⊆ B2 ⊆ C, then POS_B1(D) ⊆ POS_B2(D) ⊆ POS_C(D). According to formula 3.1, we get γ_B1(D) ≤ γ_B2(D) ≤ γ_C(D). This completes the proof.

Theorem 3.3 shows an important property of the dependency function. Based on it, we can define the significance of each keyword a ∈ C − B relative to the decision D (i.e. the labels), as follows:

sig(a, B, D) = γ_B∪{a}(D) − γ_B(D)    (3.2)

Based on this significance metric, we can design a heuristic keyword reduction algorithm. The algorithm begins with a
relative core set searched by an algorithm proposed in [11]. In each iteration, it computes the significance of all remaining keywords, selects the keyword with the highest significance, and adds it to the final keyword set. The algorithm ends when the pre-defined keyword number is reached. The pseudo-code is shown below.

NRS-KR Algorithm:
Input: NDT = ⟨U, C ∪ D, Δ⟩, a predefined threshold δ and keyword number N
Output: A keyword reduct, denoted as red
1.  red ← core;
2.  for i = 1 : N − |core| do
3.    m = 0, candidate = null;
4.    for each a ∈ C − red do
5.      if sig(a, red, D) > m do
6.        candidate = a; m = sig(a, red, D);
7.      end if
8.    end for
9.    red ← red ∪ {candidate};
10. end for

4. Experiments

In this section, we perform experiments to evaluate the proposed algorithm. First, we introduce the experimental datasets, the methods selected for comparison, and the performance metrics used in the experiments. Then, we explain how the experiments were performed and show the results.

4.1 Selected Datasets

We select two datasets to evaluate the proposed algorithm. The first is Reuters-21578, a standard text categorization benchmark collected by the Carnegie group from the Reuters newswire in 1987. We only choose the ten most populous categories from this corpus: earn, acq, coffee, sugar, trade, ship, crude, interest, money-fx, and gold. The second is 20-newsgroups, which contains approximately 20,000 newsgroup texts divided nearly evenly among 20 different newsgroups.

4.2 Methods for Comparison

We choose four methods to compare with the proposed method. The first is a classical rough sets based keyword reduction algorithm, and the other three are the χ² statistic (CHI), Information Gain (IG), and Mutual Information (MI), respectively. The χ² statistic measures the lack of independence between a keyword and a label and can be compared to the χ² distribution with one degree of freedom to judge extremeness.
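Returning to the NRS-KR pseudo-code of section 3, the dependency function (formula 3.1) and the greedy selection loop can be sketched in Python as follows. This is entirely our own illustration: the sample data, the crisp β = 1 positive region, and the brute-force recomputation of neighborhoods for every candidate subset are stand-ins, not the paper's implementation.

```python
from math import dist

def gamma(samples, labels, B, delta):
    """Dependency of the label D on keyword subset B (formula 3.1, beta = 1):
    fraction of samples whose neighborhood under B contains only its own class.
    samples: list of dicts keyword -> value; B must be non-empty."""
    proj = [tuple(s[a] for a in sorted(B)) for s in samples]
    pos = 0
    for i, p in enumerate(proj):
        nb = [j for j, q in enumerate(proj) if dist(p, q) <= delta]
        if all(labels[j] == labels[i] for j in nb):
            pos += 1
    return pos / len(samples)

def nrs_kr(samples, labels, keywords, core, delta, N):
    """Greedy NRS-KR sketch: start from the core, repeatedly add the keyword
    with the highest significance sig(a, red, D) until N keywords are chosen."""
    red = set(core)
    while len(red) < N:
        base = gamma(samples, labels, red, delta) if red else 0.0
        best, best_sig = None, 0.0
        for a in keywords - red:
            sig = gamma(samples, labels, red | {a}, delta) - base
            if sig > best_sig:
                best, best_sig = a, sig
        if best is None:  # no remaining keyword increases the dependency
            break
        red.add(best)
    return red

# Toy decision table: keyword t1 separates the classes, t2 does not
samples = [{"t1": 0.0, "t2": 0.3}, {"t1": 0.1, "t2": 0.9},
           {"t1": 0.9, "t2": 0.35}, {"t1": 1.0, "t2": 0.95}]
labels = ["pos", "pos", "neg", "neg"]
print(nrs_kr(samples, labels, {"t1", "t2"}, set(), 0.3, 1))
```

A real implementation would cache neighborhoods instead of recomputing them per candidate, but the greedy structure is exactly that of the pseudo-code above.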
The χ² statistic is defined as formula 4.1, where t means term (i.e. keyword) and c means category (i.e. label), A is the number of times t and c occur together, B is the number of times t occurs without c, C is the number of times c occurs without t, and D is the number of times neither t nor c occurs. N is the number of texts in the dataset.

χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D))    (4.1)

The information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a text. The information gain of a term t is defined as follows:

IG(t) = −Σ_{i=1..m} P(c_i) log P(c_i) + P(t) Σ_{i=1..m} P(c_i | t) log P(c_i | t) + P(t̄) Σ_{i=1..m} P(c_i | t̄) log P(c_i | t̄)    (4.2)

Mutual information is a criterion commonly used in statistical modeling. The basic idea behind this metric is that the larger the mutual information, the higher the co-occurrence probability between term t and category c. MI can be defined as follows; the symbols N, A, B, C are the same as in formula 4.1.

MI(t, c) = log( P(t, c) / (P(t) P(c)) ) ≈ log( A · N / ((A + C)(A + B)) )    (4.3)

4.3 Performance Metrics

For different purposes, several performance metrics can be chosen for text categorization evaluation, including recall, precision, F-measure, micro-average of F-measure, and macro-average of F-measure. In the experiments, we choose the micro-average of the F1-measure (abbreviated F1-Micro) and the macro-average of the F1-measure (abbreviated F1-Macro) as the performance metrics. Since the two metrics are based on other basic metrics, we briefly explain them here. First, we give the definitions of recall and precision, as follows.
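The three statistical criteria of formulas 4.1–4.3 can be sketched from the contingency counts A, B, C, D as follows (our own illustration; the IG variant shown scores a term against a single two-way category split rather than summing over all m categories, and it assumes the count sums it divides by are non-zero):

```python
from math import log, log2

def chi_square(A, B, C, D):
    """CHI statistic (formula 4.1) for one term/category pair."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

def mutual_information(A, B, C, D):
    """MI estimate (formula 4.3): log of co-occurrence over independence."""
    N = A + B + C + D
    return log(A * N / ((A + C) * (A + B)))

def entropy(p):
    """Binary entropy in bits; H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(A, B, C, D):
    """Two-category information gain of a term (a special case of formula 4.2)."""
    N = A + B + C + D
    h_prior = entropy((A + C) / N)                   # H(c)
    h_cond = ((A + B) / N) * entropy(A / (A + B)) \
           + ((C + D) / N) * entropy(C / (C + D))   # P(t)H(c|t) + P(~t)H(c|~t)
    return h_prior - h_cond

# Independent term and category: every criterion scores zero
print(chi_square(25, 25, 25, 25), mutual_information(25, 25, 25, 25))
```

A term occurring in every text of the category and nowhere else (A = D = 50, B = C = 0) gets the maximum CHI score of N and one full bit of information gain.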
recall = a / (a + c) × 100%    (4.4)
precision = a / (a + b) × 100%    (4.5)

where a is the number of texts correctly classified to the class, b is the number of texts incorrectly classified to the class, and c is the number of texts incorrectly rejected from the class. Based on these two definitions, we further give the definition of the F1-measure as formula 4.6:

F1 = 2 × precision × recall / (precision + recall)    (4.6)

Finally, F1-Macro and F1-Micro are defined as follows:

F1-Macro = (1/N) Σ_{i=1..N} F1_i = (1/N) Σ_{i=1..N} 2 × precision_i × recall_i / (precision_i + recall_i)    (4.7)
F1-Micro = 2 × precision_g × recall_g / (precision_g + recall_g)    (4.8)

where N is the number of categories, and the global precision_g and recall_g in formula 4.8 are computed from the numbers a, b, c summed over all categories.

4.4 Experimental Results

We choose two popular classifiers for evaluation, SVM and KNN. SVM is implemented with libsvm, a widely used open-source library for SVM [12]. A linear kernel is adopted. For category ranking in KNN, we select the 10 nearest neighbors of each text and adopt the majority category. Cosine similarity is used for neighbor selection. Before keyword reduction, we need to preprocess the raw datasets. First, we extract words from the raw texts. These keywords are not used to construct the VSM immediately. We remove useless words based on a stop word list obtained from the Internet, which contains 887 stop words. After that, we use the Porter stemming algorithm to merge words with the same stem. For the proposed algorithm, the threshold δ is set to 0.2, and Euclidean distance is adopted as the distance function. We compare the proposed algorithm (NRS) with the four other methods, i.e. classical rough sets (RS), the χ² statistic (CHI), information gain (IG), and mutual information (MI). The results are shown in figures 1 to 4. In each experiment, we choose a dataset, apply the different methods to select a number of keywords, and evaluate with a specific classifier. The keyword number varies from 1000 to 10000.
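The evaluation metrics of formulas 4.4–4.8 can be sketched as follows (our own illustration; each category contributes its counts a, b, c as defined above):

```python
def f1(p, r):
    """F1-measure (formula 4.6)."""
    return 2 * p * r / (p + r)

def f1_macro(per_class):
    """Macro-average (formula 4.7): mean of the per-category F1 scores.
    per_class: list of (a, b, c) = (correct, false positives, false negatives)."""
    return sum(f1(a / (a + b), a / (a + c)) for a, b, c in per_class) / len(per_class)

def f1_micro(per_class):
    """Micro-average (formula 4.8): F1 of the counts pooled over all categories."""
    A = sum(a for a, _, _ in per_class)
    B = sum(b for _, b, _ in per_class)
    C = sum(c for _, _, c in per_class)
    return f1(A / (A + B), A / (A + C))

counts = [(8, 2, 2), (1, 1, 3)]  # two toy categories, the second much rarer
print(f1_macro(counts))
print(f1_micro(counts))
```

On this toy input the macro score is pulled down by the poorly-handled rare category, while the micro score is dominated by the frequent one, which is exactly why the paper reports both.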
From figures 1 to 4, we can easily see that the proposed algorithm performs very well and selects much better keywords than the other four methods. Furthermore, we list the average results of each algorithm in tables 3 and 4, which show that the proposed algorithm also outperforms the other methods in this respect. All experiments are performed in Matlab R2012b.

Table 3: Performance of different algorithms on Reuters-21578
Methods   SVM                     KNN
          F1-Micro   F1-Macro     F1-Micro   F1-Macro
NRS       0.9590     0.9135       0.906      0.868
RS        0.951      0.891        0.889      0.858
CHI       0.9550     0.9068       0.8687     0.8501
IG        0.9489     0.8756       0.8633     0.8316
MI        0.8969     0.7311       0.793      0.664

Table 4: Performance of different algorithms on 20-newsgroups
Methods   SVM                     KNN
          F1-Micro   F1-Macro     F1-Micro   F1-Macro
NRS       0.7074     0.7070       0.6495     0.6447
RS        0.653      0.6183       0.5904     0.5835
CHI       0.6374     0.6301       0.598      0.5914
IG        0.5713     0.5669       0.5309     0.55
MI        0.4308     0.440        0.3834     0.4104
Fig. 1 Comparison of different algorithms on Reuters-21578 using the SVM classifier: (a) F1-Micro; (b) F1-Macro
Fig. 2 Comparison of different algorithms on Reuters-21578 using the KNN classifier: (a) F1-Micro; (b) F1-Macro
Fig. 3 Comparison of different algorithms on 20-newsgroups using the SVM classifier: (a) F1-Micro; (b) F1-Macro
Fig. 4 Comparison of different algorithms on 20-newsgroups using the KNN classifier: (a) F1-Micro; (b) F1-Macro
5. Conclusions

In this paper, we propose a novel approach to keyword reduction in text categorization using neighborhood rough sets. To our knowledge, neighborhood rough sets is a model which combines classical rough sets and neighborhood theory, and it is efficient at handling numeric or even hybrid values. We apply some important concepts and theorems of neighborhood rough sets to the keyword reduction problem and design a heuristic algorithm. The experimental results show that the proposed algorithm performs well on the keyword reduction problem and outperforms some classical methods on Reuters-21578 and 20-newsgroups.

Acknowledgments

This work is supported by the Scientific Research Fund of Sichuan Provincial Education Department (Grant No. 14ZB047), the Scientific Research Fund of Leshan Normal University (Grant No. Z135), and a Sichuan Provincial Department of Science and Technology Project (No. 2014JY0036).

References
[1] Y. Yang, "A comparative study on feature selection in text categorization", in Proceedings of the 14th International Conference on Machine Learning (ICML '97), 1997, pp. 412-420.
[2] D. Q. Miao, Q. G. Duan, H. Y. Zhang et al., "Rough set based hybrid algorithm for text classification", Expert Systems with Applications, Vol. 36, 2009, pp. 9168-9174.
[3] Z. Pawlak, "Rough sets", International Journal of Computer and Information Sciences, Vol. 11, 1982, pp. 341-356.
[4] A. Chouchoulas, Q. Shen, "Rough Set-Aided Keyword Reduction for Text Categorization", Applied Artificial Intelligence, Vol. 15, 2001, pp. 843-873.
[5] Y. Li, S. C. K. Shiu, S. K. Pal et al., "A rough set-based case-based reasoner for text categorization", International Journal of Approximate Reasoning, Vol. 41, 2006, pp. 229-255.
[6] Q. H. Hu, D. R. Yu, Z. X.
Xie, "Numerical attribute reduction based on neighborhood granulation and rough approximation", Chinese Journal of Software, Vol. 19, 2008, pp. 640-649.
[7] Q. H. Hu, D. R. Yu, J. F. Liu et al., "Neighborhood rough set based heterogeneous feature subset selection", Information Sciences, Vol. 178, 2008, pp. 3577-3594.
[8] H. Wang, "Nearest neighbors by neighborhood counting", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, 2006, pp. 942-953.
[9] T. Y. Lin, Q. Liu, K. J. Huang, "Rough sets, neighborhood systems and approximation", in Proceedings of the 5th International Symposium on Methodologies for Intelligent Systems, 1990, pp. 130-141.
[10] S. Y. Jing, K. She, S. Ali, "A universal neighbourhood rough sets model for knowledge discovering from incomplete heterogeneous data", Expert Systems, Vol. 30, 2013, pp. 89-96.
[11] T. F. Zhang, J. M. Xiao, X. H. Wang, "Algorithms of attribute relative reduction in rough set theory", Chinese Journal of Electronics, Vol. 33, 2005, pp. 2080-2083.
[12] C. C. Chang, C. J. Lin, "LIBSVM: a library for support vector machines", ACM Transactions on Intelligent Systems and Technology, Vol. 2, 2011, pp. 27:1-27:27.
[13] S. Ali, S. Y. Jing, K. She, "Profit-aware DVFS Enabled Resource Management of IaaS Cloud", International Journal of Computer Science Issues, Vol. 10, 2013, pp. 37-47.

Si-Yuan Jing received the Ph.D. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2013. He is currently an associate professor with Leshan Normal University. He is a member of ACM and CCF. His research interests include computational intelligence, big data mining, etc.