Keyword Reduction for Text Categorization using Neighborhood Rough Sets

IJCSI International Journal of Computer Science Issues, Volume 12, Issue 1, No 2, January 2015
ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784
www.IJCSI.org

Si-Yuan Jing

Sichuan Province University Key Laboratory of Internet Natural Language Intelligent Processing, Leshan Normal University, Leshan, 614000, China

Abstract
Keyword reduction is a technique that removes less important keywords from the original dataset. Its aim is to decrease the training time of a learning machine and to improve the performance of text categorization. Some researchers have applied rough sets, a popular computational intelligence tool, to keyword reduction. However, the classical rough set model that is usually adopted can only deal with nominal values. In this work, we apply neighborhood rough sets to the keyword reduction problem. A heuristic algorithm is proposed and compared with some classical methods, such as Information Gain, Mutual Information and the CHI square statistic. The experimental results show that the proposed method outperforms the other methods.

Keywords: Text Categorization; Keyword Reduction; Neighborhood Rough Sets; Heuristic Algorithm.

1. Introduction

Automatic text categorization is a perennial topic in pattern classification and web data mining. It involves several key steps, including text pre-processing, keyword reduction (also called feature selection) and learning machine training. Keyword reduction removes less important keywords from the original keyword set, improving both the result of text categorization and the training time. Statistical methods, such as the $\chi^2$ statistic, information gain and mutual information [1], are the most commonly used for keyword reduction. Other techniques, such as rough sets [2, 13], also perform well on this task.

Rough set theory is a powerful computational intelligence tool proposed by Z. Pawlak in 1982 [3]. It has been proven efficient at handling imprecise, uncertain and vague knowledge, and attribute reduction is one of its most important applications. In the last decade, researchers have applied rough sets to the keyword reduction problem and obtained promising results. For example, A. Chouchoulas and Q. Shen first introduced a general method for applying rough sets to keyword reduction in text categorization [4]. Miao et al. proposed a hybrid algorithm for text categorization which combined traditional classifiers with variable precision rough sets [2]. Y. Li et al. presented a novel rough set-based case-based reasoner for text categorization [5]. However, all of this research adopts the classical rough set model (i.e., the Pawlak model), which can only deal with nominal values. Keyword counts are continuous, so the data must be discretized before computation; this loses important information and affects the final result.

To address this problem, we apply neighborhood rough sets to the keyword reduction problem in this paper. A new heuristic algorithm for keyword reduction is proposed. We use two well-known classifiers, kNN and SVM, to test the performance of the new algorithm. The experimental datasets are Reuters-21578 and 20-newsgroups. Experimental results show that the proposed method achieves good performance and outperforms some well-known statistical methods.

The rest of the paper is organized as follows: Section 2 recalls some basic concepts of neighborhood rough sets as well as keyword reduction techniques. Section 3 introduces a new heuristic algorithm based on neighborhood rough sets for keyword reduction.
Section 4 gives the experimental results and a discussion. Finally, Section 5 concludes the paper.

2. Preliminaries of Neighborhood Rough Sets

Neighborhood theory was proposed by Lin et al. in 1990 [9]. It has become an important tool for many artificial intelligence tasks, such as machine learning, pattern classification, data mining, natural language understanding and information retrieval [8]. The neighborhood rough set is a model which combines neighborhood theory and rough set theory, and it can deal with numeric or even hybrid values. In this section, we briefly recall some basic concepts and results of neighborhood rough sets from [6, 10].

Definition 2.1: Let $U$ be a non-empty finite set of objects and $\Delta: U \times U \to \mathbb{R}$ a given function. We say $NAS = \langle U, \Delta \rangle$ is a neighborhood approximation space if, for all $x_1, x_2, x_3 \in U$:

1) $\Delta(x_1, x_2) \geq 0$, and $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$;

2) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$;

3) $\Delta(x_1, x_3) \leq \Delta(x_1, x_2) + \Delta(x_2, x_3)$.

We say $\Delta$ is the distance function of this neighborhood space. Many useful distance functions are available, such as the Euclidean distance, the Minkowski distance, etc. To construct a neighborhood rough set model that granulates the universe over numerical attributes, Hu et al. proposed the notion of a neighborhood [7].

Definition 2.2: Given a neighborhood space $\langle U, \Delta \rangle$, $x_i \in U$ and $\delta \geq 0$, the $\delta$-neighborhood of $x_i$, with centre $x_i$ and radius $\delta$, is defined as

$$\delta(x_i) = \{ y \mid \Delta(x_i, y) \leq \delta,\ y \in U \} \qquad (2.1)$$

It is easy to see that a set of neighborhoods induces a family of information granules, and these granules can be used to approximate any concept in the universe.

Definition 2.3: A neighborhood information system is a triple $NIS = \langle U, A, \Delta \rangle$, where $U$ is a non-empty finite set of objects, $A$ is a non-empty finite set of attributes, and $\Delta$ is a distance function; each attribute subset $B \subseteq A$ together with $\Delta$ forms a family of neighborhood relations on $U$.

We give an example to illustrate how neighborhoods are computed in a given information system. Table 1 shows an information table with six objects, each described by two attributes. We use the improved Euclidean distance proposed in [10] as the distance function; the resulting distances are shown in Table 2. If we set the threshold $\delta$ to 0.3, we get $\delta(x_1) = \{x_1, x_5\}$, $\delta(x_2) = \{x_2, x_4\}$, $\delta(x_3) = \{x_3, x_5\}$, $\delta(x_4) = \{x_2, x_4\}$, $\delta(x_5) = \{x_1, x_3, x_5\}$ and $\delta(x_6) = \{x_6\}$.

Table 1. An illustrative example for neighborhood computation

| obj. | attribute1 | attribute2 |
|------|------------|------------|
| x1   | 2          | 4          |
| x2   | 1          | 7          |
| x3   | 5          | 3          |
| x4   | 10         | 8          |
| x5   | 3          | 3          |
| x6   | 6          | 9          |

Table 2. The results of the distance computation

|      | x1   | x2   | x3   | x4   | x5   | x6   |
|------|------|------|------|------|------|------|
| x1   | 0    | 1.11 | 0.34 | 1.04 | 0.14 | 0.9  |
| x2   | 1.11 | 0    | 0.97 | 0.26 | 1.1  | 0.68 |
| x3   | 0.34 | 0.97 | 0    | 0.97 | 0.2  | 1    |
| x4   | 1.04 | 0.26 | 0.97 | 0    | 1.09 | 0.43 |
| x5   | 0.14 | 1.1  | 0.2  | 1.09 | 0    | 1.04 |
| x6   | 0.9  | 0.68 | 1    | 0.43 | 1.04 | 0    |

Based on the above definitions, we can construct the lower and upper approximations in neighborhood rough sets.

Definition 2.4: Given a neighborhood approximation space $NAS = \langle U, \Delta \rangle$ and $X \subseteq U$, the lower and upper approximations of $X$ are defined as

$$\underline{N}X = \{ x_i \mid \delta(x_i) \subseteq X,\ x_i \in U \}, \qquad \overline{N}X = \{ x_i \mid \delta(x_i) \cap X \neq \emptyset,\ x_i \in U \} \qquad (2.2)$$

and for any $X \subseteq U$, $\underline{N}X \subseteq X \subseteq \overline{N}X$. Moreover, the boundary, positive and negative regions are respectively defined as

$$BN(X) = \overline{N}X - \underline{N}X \qquad (2.3)$$

$$POS(X) = \underline{N}X \qquad (2.4)$$

$$NEG(X) = U - \overline{N}X \qquad (2.5)$$

The positive region $POS(X)$ contains the granules fully contained in $X$; $NEG(X)$ contains the granules not contained in $X$; $BN(X)$ contains the granules only partially contained in $X$. If $\underline{N}X = \overline{N}X$, we say $X$ is definable in this neighborhood approximation space; otherwise it is indefinable, namely rough.

In practical environments, some data are noisy. The model defined above is crisp and has no tolerance for noise. Therefore, we adopt variable precision rough sets to improve formula 2.2:

$$\underline{N}^{\beta}X = \left\{ x_i \,\middle|\, \frac{card(\delta(x_i) \cap X)}{card(\delta(x_i))} \geq \beta,\ x_i \in U \right\}, \qquad \overline{N}^{\beta}X = \left\{ x_i \,\middle|\, \frac{card(\delta(x_i) \cap X)}{card(\delta(x_i))} \geq 1 - \beta,\ x_i \in U \right\} \qquad (2.6)$$

The function $card(\cdot)$ counts the number of objects in a set. The parameter $\beta$ is usually set between 0.5 and 1.
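To make Definitions 2.2 and 2.4 and formula 2.6 concrete, here is a minimal Python sketch (ours, not the paper's Matlab code). A min-max normalised Euclidean distance stands in for the improved Euclidean distance of [10], so the computed values differ slightly from Table 2:

```python
import numpy as np

# Objects of Table 1 (rows x1..x6; columns attribute1, attribute2).
X = np.array([[2, 4], [1, 7], [5, 3], [10, 8], [3, 3], [6, 9]], dtype=float)

def neighborhoods(X, delta):
    """delta-neighborhood of every object (Definition 2.2)."""
    # Min-max normalised Euclidean distance: a stand-in for the
    # improved distance of [10], so values differ from Table 2.
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return [set(np.where(row <= delta)[0]) for row in d]

def lower_approximation(nbrs, target, beta=1.0):
    """Variable-precision lower approximation (formula 2.6);
    beta = 1.0 gives the crisp model of Definition 2.4."""
    return {i for i, nb in enumerate(nbrs)
            if len(nb & target) / len(nb) >= beta}

nbrs = neighborhoods(X, delta=0.3)
print(nbrs)                                  # the information granules
print(lower_approximation(nbrs, {0, 2, 4}))  # lower approx. of {x1, x3, x5}
```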

Data in many pattern classification problems, including the keyword reduction problem, can be represented by a decision table. Therefore, we introduce some concepts of a neighborhood rough set model for decision tables.

Definition 2.5: Given a neighborhood information system $NIS = \langle U, A, \Delta \rangle$, we call it a neighborhood decision table $NDT = \langle U, C \cup D, \Delta \rangle$ if $A = C \cup D$, where $C$ is the condition attribute set, $D$ is the decision attribute, and each $B \subseteq C$ forms a family of neighborhood relations on $U$.

Definition 2.6: Given a neighborhood decision table $NDT = \langle U, C \cup D, \Delta \rangle$ and $B \subseteq C$, the decision attribute $D$ divides the universe $U$ into equivalence classes $X_1, X_2, \ldots, X_n$. The lower and upper approximations of $D$ relative to $B$ are defined as

$$\underline{N}_B D = \underline{N}_B X_1 \cup \underline{N}_B X_2 \cup \cdots \cup \underline{N}_B X_n, \qquad \overline{N}_B D = \overline{N}_B X_1 \cup \overline{N}_B X_2 \cup \cdots \cup \overline{N}_B X_n \qquad (2.7)$$

Obviously, $POS_B(D) = \underline{N}_B D$, $NEG_B(D) = U - \overline{N}_B D$ and $BN_B(D) = \overline{N}_B D - \underline{N}_B D$.

3. Algorithm

Generally speaking, there are two steps in a text categorization task: text preprocessing and classifier building. Text preprocessing has five sub-steps: word extraction, stop word removal, word stemming, keyword reduction and keyword weighting. The first three sub-steps extract words from the raw text, delete useless words that appear in a given stop word list, and merge words with the same stem. After that, the VSM (Vector Space Model) is commonly used to represent the texts. The last two sub-steps, keyword reduction and weighting, remove keywords that contribute little to classification and assign each remaining keyword a value that emphasizes its importance. Classifier building in text categorization is the same as in other machine learning tasks, so its explanation is omitted here.

Keyword reduction is a key step in text categorization. Its aim is to remove the less important keywords. A good keyword reduction algorithm keeps the most important keywords, which greatly improves classifier performance as well as decreasing training time. In this section, we first give some concepts and theorems that measure the importance degree of keywords based on neighborhood rough sets, and then propose a heuristic algorithm for keyword reduction in text categorization.

To apply neighborhood rough sets to text categorization, we represent the problem by a neighborhood decision table $NDT = \langle U, C \cup D, \Delta \rangle$, where $U$ is a set of samples (i.e., texts), $C$ is a set of keywords, $D$ is the document label, and $\Delta$ is a pre-defined distance function. Then, for $B \subseteq C$, we define the dependency of $D$ on $B$ as

$$\gamma_B(D) = \frac{card(POS_B(D))}{card(U)} \qquad (3.1)$$

Obviously, $0 \leq \gamma_B(D) \leq 1$. The dependency $\gamma_B(D)$ reflects the degree to which $D$ depends on $B$; this definition is in accordance with classical rough set theory.

Theorem 3.1: Given a neighborhood information system $NIS = \langle U, A, \Delta \rangle$ and a threshold $\delta$, if $B_1 \subseteq B_2 \subseteq A$, then $\forall x \in U$, $\delta_{B_2}(x) \subseteq \delta_{B_1}(x)$.

Proof. Obvious; the proof is omitted here.

Theorem 3.2: Given a neighborhood decision table $NDT = \langle U, C \cup D, \Delta \rangle$, if $B_1 \subseteq B_2 \subseteq C$, then $POS_{B_1}(D) \subseteq POS_{B_2}(D)$.

Proof. Without loss of generality, let $x \in POS_{B_1}(D)$; then $\delta_{B_1}(x) \subseteq D_j$, where $D_j$ is the $j$-th equivalence class induced by the decision $D$. According to Theorem 3.1, $\delta_{B_2}(x) \subseteq \delta_{B_1}(x) \subseteq D_j$, hence $x \in POS_{B_2}(D)$. This completes the proof.

Theorem 3.3: The dependency function is monotone, that is, if $B_1 \subseteq B_2 \subseteq C$, then $\gamma_{B_1}(D) \leq \gamma_{B_2}(D) \leq \gamma_C(D)$.

Proof. According to Theorem 3.2, if $B_1 \subseteq B_2 \subseteq C$, then $POS_{B_1}(D) \subseteq POS_{B_2}(D) \subseteq POS_C(D)$. According to formula 3.1, we then get $\gamma_{B_1}(D) \leq \gamma_{B_2}(D) \leq \gamma_C(D)$. This completes the proof.

Theorem 3.3 shows an important property of the dependency function. Based on it, we can define the significance of a keyword $a$ relative to a keyword subset $B$ and the decision $D$ (i.e., the labels) as

$$sig(a, B, D) = \gamma_{B \cup \{a\}}(D) - \gamma_B(D) \qquad (3.2)$$
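To make formulas 3.1 and 3.2 concrete, here is a minimal Python sketch (ours, with hypothetical helper names, not the paper's code). It again uses a min-max normalised Euclidean distance as a stand-in, and it tests each sample's neighborhood against the sample's own class: for $\beta = 1$ this coincides exactly with membership in $POS_B(D)$, while for $\beta < 1$ it is a simplification of formula 2.6, which tests every decision class:

```python
import numpy as np

def dependency(X, y, B, delta, beta=1.0):
    """gamma_B(D) of formula 3.1: the fraction of samples falling in the
    positive region POS_B(D), with neighborhoods computed on the attribute
    subset B only. For beta = 1 the test below is exactly "delta(x_i) is
    contained in x_i's own decision class"; for beta < 1 it is a
    simplification of formula 2.6 that only checks x_i's class."""
    Z = X[:, sorted(B)]
    rng = Z.max(axis=0) - Z.min(axis=0)
    Z = (Z - Z.min(axis=0)) / np.where(rng > 0, rng, 1)
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    pos = 0
    for i in range(len(X)):
        nb = np.where(d[i] <= delta)[0]        # delta-neighborhood of x_i
        # np.mean below is card(delta(x_i) & X_j) / card(delta(x_i)).
        pos += np.mean(y[nb] == y[i]) >= beta
    return pos / len(X)

def significance(X, y, a, B, delta, beta=1.0):
    """sig(a, B, D) of formula 3.2."""
    return (dependency(X, y, B | {a}, delta, beta)
            - dependency(X, y, B, delta, beta))
```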
Based on the significance metric, we can design a heuristic keyword reduction algorithm.

The algorithm begins with the relative core set found by an algorithm proposed in [11]. In each iteration it computes the significance of all remaining keywords, selects the keyword with the highest significance, and adds it to the final keyword set. The algorithm ends when the pre-defined keyword number is reached. The pseudo-code is shown below.

NRS-KR Algorithm:
Input: $NDT = \langle U, C \cup D, \Delta \rangle$, a predefined threshold $\delta$ and keyword number $N$
Output: a keyword reduct, denoted $red$
1. $red \leftarrow core$;
2. for $i = 1 : N - |core|$ do
3.   $m \leftarrow 0$; $candidate \leftarrow$ null;
4.   for each $a \in C - red$ do
5.     if $sig(a, red, D) > m$ then
6.       $candidate \leftarrow a$; $m \leftarrow sig(a, red, D)$;
7.     end if
8.   end for
9.   $red \leftarrow red \cup \{candidate\}$;
10. end for
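Under the same assumptions, the greedy loop might look as follows in Python (a sketch, not the published implementation; it relies on the `dependency` helper above, and the relative-core computation of [11] is not reproduced, so `core` is passed in):

```python
def nrs_kr(X, y, core, n_keywords, delta=0.2, beta=1.0):
    """Greedy NRS-KR sketch: start from the relative core and repeatedly
    add the keyword with the highest significance until n_keywords
    keywords (columns of X) have been selected."""
    red = set(core)
    remaining = set(range(X.shape[1])) - red
    while len(red) < n_keywords and remaining:
        # argmax of sig(a, red, D) equals argmax of gamma_{red+{a}}(D),
        # since the gamma_red(D) term is the same for every candidate a.
        best = max(remaining,
                   key=lambda a: dependency(X, y, red | {a}, delta, beta))
        red.add(best)
        remaining.remove(best)
    return red
```

The paper's pseudo-code runs the loop exactly $N - |core|$ times; the while-condition above is equivalent as long as candidates remain.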

4. Experiments

In this section, we perform experiments to evaluate the proposed algorithm. We first introduce the experimental datasets, the methods selected for comparison, and the performance metrics; we then explain how the experiments were performed and show the results.

4.1 Selected Datasets

We select two datasets to evaluate the proposed algorithm. The first is Reuters-21578, a standard text categorization benchmark collected by the Carnegie Group from the Reuters newswire in 1987. We only choose the ten most populous categories from this corpus: earn, acq, coffee, sugar, trade, ship, crude, interest, money-fx and gold. The second is 20-newsgroups, which contains approximately 20,000 newsgroup texts divided nearly evenly among 20 different newsgroups.

4.2 Methods for Comparison

We choose four methods to compare with the proposed method. The first is the classical rough set based keyword reduction algorithm; the other three are the $\chi^2$ statistic (CHI), Information Gain (IG) and Mutual Information (MI), respectively.

The $\chi^2$ statistic measures the lack of independence between a keyword and a label, and can be compared to the $\chi^2$ distribution with one degree of freedom to judge extremeness. It is defined as formula 4.1, where $t$ denotes a term (i.e., keyword) and $c$ a category (i.e., label); $A$ is the number of times $t$ and $c$ occur together, $B$ is the number of times $t$ occurs without $c$, $C$ is the number of times $c$ occurs without $t$, $D$ is the number of times neither $t$ nor $c$ occurs, and $N$ is the number of texts in the dataset.

$$\chi^2(t, c) = \frac{N(AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \qquad (4.1)$$

Information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a text. The information gain of a term $t$ is defined as follows:

$$IG(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t}) \qquad (4.2)$$

Mutual information is a criterion commonly used in statistical modeling. The basic idea behind this metric is that the larger the mutual information, the higher the co-occurrence probability of term $t$ and category $c$. MI can be defined as follows, where the symbols $N$, $A$, $B$, $C$ are the same as in formula 4.1:

$$MI(t, c) = \log \frac{\Pr(t \wedge c)}{\Pr(t) \Pr(c)} \approx \log \frac{A \times N}{(A + C)(A + B)} \qquad (4.3)$$
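As an illustration of formulas 4.1 and 4.3, the following Python sketch scores one term-category pair from boolean document-term indicators (the function name is ours; non-degenerate contingency counts are assumed):

```python
import numpy as np

def chi2_mi(term_present, in_category):
    """chi-square (4.1) and mutual information (4.3) for one
    term-category pair; the arguments are boolean vectors over the
    N texts. Assumes all margins (A+B, A+C, B+D, C+D) are positive."""
    N = len(term_present)
    A = np.sum(term_present & in_category)    # t and c occur together
    B = np.sum(term_present & ~in_category)   # t occurs without c
    C = np.sum(~term_present & in_category)   # c occurs without t
    D = np.sum(~term_present & ~in_category)  # neither occurs
    chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
    mi = np.log(A * N / ((A + C) * (A + B)))  # needs A > 0
    return chi2, mi
```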

4.3 Performance Metrics

Depending on the purpose, several performance metrics can be chosen for text categorization evaluation, including recall, precision, F-measure, and the micro-average and macro-average of the F-measure. In the experiments, we choose the micro-average of the F1-measure (abbreviated F1-Micro) and the macro-average of the F1-measure (abbreviated F1-Macro) as the performance metrics. Since the two metrics are built on other basic metrics, we briefly explain them here. First, recall and precision are defined for each class $i$ as

$$recall_i = \frac{a_i}{a_i + c_i} \times 100\% \qquad (4.4)$$

$$precision_i = \frac{a_i}{a_i + b_i} \times 100\% \qquad (4.5)$$

where $a_i$ is the number of texts correctly classified into class $i$, $b_i$ is the number of texts incorrectly classified into class $i$, and $c_i$ is the number of texts incorrectly rejected from class $i$. Based on these two definitions, the F1-measure is given by formula 4.6:

$$F1_i = \frac{2 \times precision_i \times recall_i}{precision_i + recall_i} \qquad (4.6)$$

Finally, F1-Macro and F1-Micro are defined as

$$F1\text{-}Macro = \frac{1}{n} \sum_{i=1}^{n} F1_i \qquad (4.7)$$

$$F1\text{-}Micro = \frac{2 \times \overline{precision} \times \overline{recall}}{\overline{precision} + \overline{recall}} \qquad (4.8)$$

where $\overline{precision}$ and $\overline{recall}$ are computed from the counts $a_i$, $b_i$, $c_i$ pooled over all classes.
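A small Python sketch of formulas 4.4 to 4.8 (our helper name; per-class count arrays with non-zero denominators are assumed):

```python
import numpy as np

def f1_micro_macro(a, b, c):
    """F1-Micro (4.8) and F1-Macro (4.7) from per-class counts:
    a[i] texts correctly assigned to class i, b[i] incorrectly
    assigned to it, c[i] incorrectly rejected from it."""
    precision = a / (a + b)                    # formula 4.5
    recall = a / (a + c)                       # formula 4.4
    macro = np.mean(2 * precision * recall / (precision + recall))
    # Micro-averaging pools the counts over all classes first.
    P = a.sum() / (a.sum() + b.sum())
    R = a.sum() / (a.sum() + c.sum())
    micro = 2 * P * R / (P + R)
    return micro, macro
```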

4.4 Experimental Results

The experiments are performed in Matlab R2012b. We choose two popular classifiers for evaluation: SVM and kNN. SVM is implemented with libsvm, a widely used open-source library for SVM [12]; a linear kernel is adopted. For category ranking in kNN, we select the 10 nearest neighbors of each text and adopt the most frequent category among them; cosine similarity is used for neighbor selection.

Before keyword reduction, we preprocess the raw datasets. First, we extract words from the raw texts; these words are not used to construct the VSM immediately. We remove useless words based on a stop word list obtained from the Internet, which contains 887 stop words. After that, we use the Porter Stemming Algorithm to merge words that have the same stem. For the proposed algorithm, the threshold $\delta$ is set to 0.2 and the Euclidean distance is adopted as the distance function.

We compare the proposed algorithm (NRS) with the four other methods, i.e., classical rough sets (RS), the $\chi^2$ statistic (CHI), information gain (IG) and mutual information (MI). In each experiment, we choose a dataset, apply the different methods to select a number of keywords, and evaluate with a specific classifier; the keyword number varies from 1000 to 10000. The results are shown in Figures 1 to 4, from which we can easily see that the proposed algorithm performs very well and selects much better keywords than the other four methods. Furthermore, we list the average results of each algorithm in Tables 3 and 4, which show that the proposed algorithm also outperforms the other methods in this respect.

Table 3: Performance of different algorithms on Reuters-21578

| Methods | SVM F1-Micro | SVM F1-Macro | kNN F1-Micro | kNN F1-Macro |
|---------|--------------|--------------|--------------|--------------|
| NRS     | 0.9590       | 0.9135       | 0.906        | 0.868        |
| RS      | 0.951        | 0.891        | 0.889        | 0.858        |
| CHI     | 0.9550       | 0.9068       | 0.8687       | 0.8501       |
| IG      | 0.9489       | 0.8756       | 0.8633       | 0.8316       |
| MI      | 0.8969       | 0.7311       | 0.793        | 0.664        |

Table 4: Performance of different algorithms on 20-newsgroups

| Methods | SVM F1-Micro | SVM F1-Macro | kNN F1-Micro | kNN F1-Macro |
|---------|--------------|--------------|--------------|--------------|
| NRS     | 0.7074       | 0.7070       | 0.6495       | 0.6447       |
| RS      | 0.653        | 0.6183       | 0.5904       | 0.5835       |
| CHI     | 0.6374       | 0.6301       | 0.598        | 0.5914       |
| IG      | 0.5713       | 0.5669       | 0.5309       | 0.55         |
| MI      | 0.4308       | 0.440        | 0.3834       | 0.4104       |

Fig. 1 Comparison of different algorithms on Reuters-21578 using the SVM classifier: (a) F1-Micro; (b) F1-Macro

Fig. 2 Comparison of different algorithms on Reuters-21578 using the kNN classifier: (a) F1-Micro; (b) F1-Macro

Fig. 3 Comparison of different algorithms on 20-newsgroups using the SVM classifier: (a) F1-Micro; (b) F1-Macro

Fig. 4 Comparison of different algorithms on 20-newsgroups using the kNN classifier: (a) F1-Micro; (b) F1-Macro

5. Conclusions

In this paper, we propose a novel approach to keyword reduction in text categorization using neighborhood rough sets. As is known, the neighborhood rough set model combines classical rough sets with neighborhood theory, and it handles numeric or even hybrid values efficiently. We apply important concepts and theorems of neighborhood rough sets to the keyword reduction problem and design a heuristic algorithm. The experimental results show that the proposed algorithm performs well on the keyword reduction problem and outperforms several classical methods on Reuters-21578 and 20-newsgroups.

Acknowledgments

This work is supported by the Scientific Research Fund of the Sichuan Provincial Education Department (Grant No. 14ZB047), the Scientific Research Fund of Leshan Normal University (Grant No. Z135), and a Sichuan Provincial Department of Science and Technology Project (No. 2014JY0036).

References
[1] Y. Yang, "A comparative study on feature selection in text categorization", in Proceedings of the 14th International Conference on Machine Learning (ICML '97), 1997, pp. 412-420.
[2] D. Q. Miao, Q. G. Duan, H. Y. Zhang et al., "Rough set based hybrid algorithm for text classification", Expert Systems with Applications, Vol. 36, 2009, pp. 9168-9174.
[3] Z. Pawlak, "Rough sets", International Journal of Computer and Information Sciences, Vol. 11, 1982, pp. 341-356.
[4] A. Chouchoulas, Q. Shen, "Rough set-aided keyword reduction for text categorization", Applied Artificial Intelligence, Vol. 15, 2001, pp. 843-873.
[5] Y. Li, S. C. K. Shiu, S. K. Pal et al., "A rough set-based case-based reasoner for text categorization", International Journal of Approximate Reasoning, Vol. 41, 2006, pp. 229-255.
[6] Q. H. Hu, D. R. Yu, Z. X. Xie, "Numerical attribute reduction based on neighborhood granulation and rough approximation", Chinese Journal of Software, Vol. 19, 2008, pp. 640-649.
[7] Q. H. Hu, D. R. Yu, J. F. Liu et al., "Neighborhood rough set based heterogeneous feature subset selection", Information Sciences, Vol. 178, 2008, pp. 3577-3594.
[8] H. Wang, "Nearest neighbors by neighborhood counting", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 28, 2006, pp. 942-953.
[9] T. Y. Lin, Q. Liu, K. J. Huang, "Rough sets, neighborhood systems and approximation", in Proceedings of the 5th International Symposium on Methodologies for Intelligent Systems, 1990, pp. 130-141.
[10] S. Y. Jing, K. She, S. Ali, "A universal neighbourhood rough sets model for knowledge discovering from incomplete heterogeneous data", Expert Systems, Vol. 30, 2013, pp. 89-96.
[11] T. F. Zhang, J. M. Xiao, X. H. Wang, "Algorithms of attribute relative reduction in rough set theory", Chinese Journal of Electronics, Vol. 33, 2005, pp. 2080-2083.
[12] C. C. Chang, C. J. Lin, "LIBSVM: a library for support vector machines", ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 3, 2011, pp. 27:1-27:27.
[13] S. Ali, S. Y. Jing, K. She, "Profit-aware DVFS enabled resource management of IaaS cloud", International Journal of Computer Science Issues, Vol. 10, 2013, pp. 37-47.

Si-Yuan Jing received the Ph.D. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2013. He is currently an associate professor with Leshan Normal University. He is a member of ACM and CCF. His research interests include computational intelligence, big data mining, etc.