Naive Bayesian Text Classifier Based on Different Probability Model


LV Pin, Zhong Luo
College of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
Hubei Province Key Laboratory for Intelligent Robot, Wuhan Institute of Technology, Wuhan, China

Abstract

Text classification is a very general problem with many applications within and beyond information retrieval. How to construct a reasonable classifier, and how to reduce the term space of the constructed classifier in order to improve classification efficiency, remain open problems for many researchers. This paper shows that a learned text classifier can be constructed according to the multinomial model and the Bernoulli model, given a training sample set. We introduce two algorithms for learning from labeled documents, based on maximum likelihood estimation (MLE), to obtain a Naive Bayes classifier. The two algorithms work well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, so we present an extension to the algorithms that improves classification accuracy under this condition by means of mutual-information feature selection. Experimental results, obtained using text from Reuters-RCV1, show that, regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

Keywords: Text Classifier, Naive Bayes Learning, Probability Model, Maximum Likelihood Estimate

1. Introduction

Due to the proliferation of texts in digital form and the increasing need to access them in flexible ways, text classification has become an elementary and crucial task [1]. In the past several years, many methods based on machine learning and statistical theory have been applied to text classification.
Among these methods, decision trees, k-nearest neighbors, neural networks, Naive Bayes (NB) and support vector machines are all successful examples [2]. Naive Bayes is popular in text classification because of its computational efficiency and relatively good predictive performance, and in recent years there has been a large literature on applying it to text classification. Since Naive Bayes is very efficient and easy to implement compared with other learning methods, it is worthwhile to improve its performance in text classification tasks.

Against this background, text classifiers based on Naive Bayes have been studied extensively. The earliest Naive Bayes classifiers and their history are described in [3], [4]. An accuracy comparison on different document collections has been given for the multinomial model and the multivariate Bernoulli model [5]. However, that work does not present how to construct the Naive Bayes classifier according to these two kinds of probabilistic generative model. Some other Naive Bayes models have been investigated in [6]. Even though the parameter estimates of Naive Bayes are very poor, the performance of the text classifier is excellent [7], [8], [9]. Moreover, the text classifier is optimal when the assumptions of term independence and positional independence hold [10].

Despite its popularity, there has been some confusion in the text classification community about the Naive Bayes classifier, because there are two different generative models in common use, both of which make the Naive Bayes assumption and both of which are called Naive Bayes by their practitioners. One model specifies that a document is represented by a vector of binary attributes indicating which words occur and do not occur in the document. The number of times a term occurs in a document is not captured.
When calculating the probability of a document, one multiplies the probabilities of all the attribute values, including the probability of non-occurrence for terms that do not occur in the document. Here we can understand the document to be the event, and the absence or presence of terms to be attributes of the event. This describes a distribution based on a multivariate Bernoulli event model. This approach is more traditional in the field of Bayesian networks, and is appropriate for tasks that have a fixed number of attributes. The approach has been used for text classification by numerous people.

The second model specifies that a document is represented by the set of term occurrences from the document. As above, the order of the terms is lost; however, the number of occurrences of each term in the document is captured. When calculating the probability of a document, one multiplies the probabilities of the terms that occur. Here we can understand the individual term occurrences to be the events, and the document to be the collection of term events. We call this the multinomial event model. This approach is more traditional in statistical language modeling for speech recognition, where it would be called a unigram language model. It has also been used for text classification by numerous people.

This paper therefore aims to address Naive Bayes classifier construction by explaining both models in detail. We use maximum likelihood estimation (MLE) to find the most likely value of each parameter given the training data [6][7][8]. Both of the classification algorithms we study represent documents in high-dimensional spaces. To improve the efficiency of these two algorithms, it is generally desirable to reduce the dimensionality of these spaces; to this end, a feature selection technique based on expected mutual information is applied. Over the course of several experimental comparisons, we conclude that, regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 6, Number , July 2012, doi:10.4156/jdcta.vol6.issue

The remainder of the paper is organized as follows. Section 2 gives a general introduction to the text classification problem, including a formal definition. Section 3 presents the formal probabilistic framework for Bayesian classifier construction based on the multinomial model and the multivariate Bernoulli model.
In Section 4, we compare the process of text generation under the two models. In Section 5, feature selection issues are discussed. In Section 6, we carry out experiments that use expected mutual information for feature selection in order to improve classifier accuracy, and we describe a systematic experimental comparison of accuracy against feature-set size. Section 7 concludes.

2. The text classification definition

In text classification, we are given a description $d \in \mathbb{X}$ of a document, where $\mathbb{X}$ is the document space, and a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$. Generally speaking, the document space is some type of high-dimensional space, and the classes are human-defined for the needs of an application. We are also given a training set $D$ of labeled documents $\langle d, c \rangle$, where $\langle d, c \rangle \in \mathbb{X} \times C$. Using a learning method or learning algorithm, we then wish to learn a classification function $\gamma$ that maps documents to classes, $\gamma: \mathbb{X} \to C$. This type of learning is called supervised learning because a supervisor serves as a teacher directing the learning process.

Figure 1. Text classification

For example, Figure 1 shows an example of text classification from the Reuters-RCV1 collection, where the training set is omitted. There are six classes (UK, China, ..., sports), each with three training documents. We show a few mnemonic words for each document's content. The training set provides some typical examples for each class, so that we can learn the classification function $\gamma$. Once we have learned $\gamma$, we can apply it to the test set, for example to the new document "first private Chinese airline", whose class is unknown. In Figure 1, the classification function assigns the new document to class $\gamma(d) = \text{China}$, which is the correct assignment [10].

3. The probabilistic framework of Naive Bayes

We can assign a document $d$ to an appropriate class by calculating the maximum a posteriori (MAP) class $c_{map}$, which we compute as follows:

$$c_{map} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\,P(c) \tag{1}$$

where Bayes' rule is applied in the second step of Equation (1). Since $P(d)$ is the same for every class $c$, we drop the denominator in the last step. Next we mainly discuss how to evaluate $P(c \mid d)$.

3.1 Multinomial model

The first model discussed is the multinomial NB model, a probabilistic learning method. The probability of a document $d$ being in class $c$ is computed as:

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c) \tag{2}$$

where $P(t_k \mid c)$ is the conditional probability of term $t_k$ occurring in a document of class $c$. We regard $P(t_k \mid c)$ as a measure of how much evidence $t_k$ contributes that $c$ is the correct class. Let $P(c)$ be the prior probability of a document occurring in class $c$. If a document's terms do not provide clear evidence for one class versus another, we choose the one that has the higher prior probability. Let $\langle t_1, t_2, \ldots, t_{n_d} \rangle$ be the tokens in $d$ that are part of the vocabulary we use for classification, and let $n_d$ be the number of such tokens in $d$.

In text classification, the goal is to find the best class for the document. The best class in NB classification is the maximum a posteriori class $c_{map}$:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c) \tag{3}$$

We write $\hat{P}$ because we do not know the true values of the parameters $P(c)$ and $P(t_k \mid c)$, but we can estimate them from the training set. How should we estimate them? First, let us try the maximum likelihood estimate, which is simply the relative frequency and corresponds to the most likely value of each parameter given the training data. For the priors, this estimate is $\hat{P}(c) = N_c / N$, where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents.
Second, we estimate the conditional probability $\hat{P}(t \mid c)$ as the relative frequency of term $t$ in documents belonging to class $c$:

$$\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

where $T_{ct}$ is the number of occurrences of $t$ in training documents from class $c$, including multiple occurrences of a term in a document, and $V$ is the lexicon that consists of all terms. $T_{ct}$ is a count of occurrences in all positions in the documents in the training set.

However, the problem with the MLE is that it is zero for a term-class combination that did not occur in the training data. The estimate is sometimes 0 because of sparseness: the training data are never large enough to represent the frequency of rare events adequately. To eliminate zeros, we may use add-one or Laplace smoothing, which simply adds one to each count:

$$\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B} \tag{4}$$

where $B = |V|$ is the number of terms in the vocabulary. Add-one smoothing can be interpreted as a uniform prior (each term occurs once for each class) that is then updated as evidence from the training data comes in. Note that this is a prior probability for the occurrence of a term, as opposed to the prior probability of a class, which we estimate at the document level. The multinomial Naive Bayes classification algorithm is described in Figure 2.

Figure 2. Naive Bayes algorithm based on the multinomial model: training and test

3.2 Multivariate model

The multivariate model is equivalent to the binary independence model. Here, binary is equivalent to Boolean: a document is represented as a binary term event vector. That is, a document $d$ is represented by the vector $\langle e_1, e_2, \ldots, e_M \rangle$, where $e_t = 1$ if term $t$ is present in document $d$ and $e_t = 0$ if $t$ is not present in $d$. Independence means that terms are modeled as occurring in documents independently: the model assumes that there is no association between terms. In a sense this assumption is equivalent to an assumption of the vector space model, where each term is a dimension that is orthogonal to all other terms.

For the priors the estimate is again $\hat{P}(c) = N_c / N$, and the conditional probability is estimated as the relative frequency of documents of class $c$ that contain the term:

$$\hat{P}(t \mid c) = \frac{N_{ct}}{N_c} \tag{5}$$

In Equation (5) we may also apply add-one smoothing with $B = 2$, that is, $\hat{P}(t \mid c) = (N_{ct} + 1)/(N_c + B)$. Here $\hat{P}(c) = N_c / N$ is the fraction of training documents that belong to class $c$. In contrast, $N_{ct} / N_c$ is the fraction of documents of class $c$ that contain term $t$. The multivariate Bernoulli Naive Bayes classification algorithm is described in Figure 3.
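The two estimators of Section 3 can be sketched side by side in Python. This is a minimal illustration under stated assumptions, not the algorithm of Figures 2 and 3: the function names, the four-document toy corpus and the class labels "China" / "not-China" are hypothetical. Scores are summed in log space (a common implementation choice, not stated in the text) to avoid floating-point underflow.

```python
import math
from collections import Counter, defaultdict


def train_multinomial(docs, labels):
    """Priors N_c/N and add-one-smoothed P(t|c) from token counts T_ct (Equation (4))."""
    vocab = sorted({t for doc in docs for t in doc})
    n_c = Counter(labels)                    # N_c: number of documents per class
    t_ct = defaultdict(Counter)              # T_ct: token occurrences per class
    for doc, c in zip(docs, labels):
        t_ct[c].update(doc)
    prior = {c: n_c[c] / len(docs) for c in n_c}
    cond = {}
    for c in n_c:
        total = sum(t_ct[c].values())
        cond[c] = {t: (t_ct[c][t] + 1) / (total + len(vocab)) for t in vocab}
    return vocab, prior, cond


def apply_multinomial(model, doc):
    """Return the class with the largest log P(c) + sum_k log P(t_k|c)."""
    vocab, prior, cond = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in doc:                        # every token occurrence counts
            if t in cond[c]:                 # tokens outside the vocabulary are skipped
                score += math.log(cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)


def train_bernoulli(docs, labels):
    """Priors N_c/N and smoothed P(t|c) = (N_ct + 1) / (N_c + 2) (Equation (5), B = 2)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_c = Counter(labels)
    n_ct = defaultdict(Counter)              # N_ct: documents of class c containing t
    for doc, c in zip(docs, labels):
        n_ct[c].update(set(doc))
    prior = {c: n_c[c] / len(docs) for c in n_c}
    cond = {c: {t: (n_ct[c][t] + 1) / (n_c[c] + 2) for t in vocab} for c in n_c}
    return vocab, prior, cond


def apply_bernoulli(model, doc):
    """Score every vocabulary term, using 1 - P(t|c) for terms absent from the document."""
    vocab, prior, cond = model
    present = set(doc)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            p = cond[c][t]
            score += math.log(p if t in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus (not the Reuters data).
docs = [
    ["Chinese", "Beijing", "Chinese"],
    ["Chinese", "Chinese", "Shanghai"],
    ["Chinese", "Macao"],
    ["Tokyo", "Japan", "Chinese"],
]
labels = ["China", "China", "China", "not-China"]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

mn_model = train_multinomial(docs, labels)
bn_model = train_bernoulli(docs, labels)
print(apply_multinomial(mn_model, test_doc))
print(apply_bernoulli(bn_model, test_doc))
```

On this toy document the two models disagree: the multinomial model counts the three occurrences of "Chinese" and picks "China", while the Bernoulli model sees each term only once and also penalizes the absent terms, so it picks "not-China".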

Figure 3. Naive Bayes algorithm based on the Bernoulli model: training and test

4. Comparison of text generation based on two models

We can interpret Equation (1) as a description of the generative process that we assume in Bayesian text classification. To generate a document, we first choose class $c$ with probability $P(c)$. The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution $P(d \mid c)$. For the multinomial model:

$$P(d \mid c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) \tag{6}$$

For the multivariate Bernoulli model:

$$P(d \mid c) = P(\langle e_1, e_2, \ldots, e_M \rangle \mid c) \tag{7}$$

where $\langle t_1, t_2, \ldots, t_{n_d} \rangle$ is the sequence of terms as it occurs in $d$ and $\langle e_1, e_2, \ldots, e_M \rangle$ is a binary vector of dimensionality $M$ that indicates for each term whether it occurs in $d$ or not.

Equations (6) and (7) show that a critical step in a text classification problem is choosing the document representation. However, we cannot use Equations (6) and (7) for text classification directly. For the Bernoulli model, we would have to estimate $2^M |C|$ different parameters, one for each possible combination of $M$ values $e_i$ and a class. The number of parameters in the multinomial case has the same order of magnitude. It is infeasible to estimate so many parameters reliably.

To reduce the number of parameters, we make the Naive Bayes conditional independence assumption. We assume that attribute values are independent of each other given the class, so Equations (6) and (7) can be rewritten as follows.

Multinomial model:

$$P(d \mid c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(X_k = t_k \mid c)$$

Bernoulli model:

$$P(d \mid c) = P(\langle e_1, e_2, \ldots, e_M \rangle \mid c) = \prod_{1 \le i \le M} P(U_i = e_i \mid c)$$

where $X_k$ is the random variable for the term at position $k$ of the document and $U_i$ is the random variable for the presence or absence of term $i$.

Even when assuming conditional independence, we still have too many parameters for the multinomial model if we assume a different probability distribution for each position $k$ in the document. For this reason, we make a second independence assumption for the multinomial model, positional independence: the conditional probabilities for a term are the same regardless of position in the document. That is, for all positions $k_1$ and $k_2$, terms $t$ and classes $c$, $P(X_{k_1} = t \mid c) = P(X_{k_2} = t \mid c)$ holds.

To summarize, with the conditional and positional independence assumptions, we only need to estimate $\Theta(M |C|)$ parameters $P(t_k \mid c)$ or $P(e_i \mid c)$, one for each term-class combination, rather than a number that is at least exponential in the size of the vocabulary. The independence assumptions reduce the number of parameters to be estimated by several orders of magnitude.

5. Feature selection issues

In feature selection, it has been recognized that combinations of individually good features do not necessarily lead to good classification performance. In other words, "the m best features are not the best m features" [10]. Before presenting the experimental results in Section 6, we discuss the implementation issues regarding the calculation of mutual information for discrete data and a binary classifier.

We consider mutual-information-based feature selection for discrete data [11]. Given two random variables $x$ and $y$, their mutual information is defined in terms of their probability density functions $p(x)$, $p(y)$ and $p(x, y)$:

$$I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \tag{8}$$

For categorical feature variables, the integral in (8) reduces to a summation. In this case, computing mutual information is straightforward, because both the joint and the marginal probability tables can be estimated by tallying the samples of categorical variables in the data. However, when at least one of the variables $x$ and $y$ is continuous, their mutual information $I(x; y)$ is hard to compute, because it is often difficult to compute the integral in the continuous space based on a limited number of samples. One solution is to incorporate data discretization as a preprocessing step.
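For the categorical case, the tallying procedure behind Equation (8) can be sketched as follows. This is an illustrative sketch only: the function name `mutual_information`, the toy samples, and the choice of base-2 logarithms (so the result is in bits) are assumptions, since the text does not fix a log base.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(x; y) in bits from paired samples of two categorical variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # tallies for the joint table p(x, y)
    n_x = Counter(xs)              # tallies for the marginal p(x)
    n_y = Counter(ys)              # tallies for the marginal p(y)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        # p(x,y) / (p(x) p(y)) = (n_xy / n) / ((n_x / n) (n_y / n)) = n_xy * n / (n_x * n_y)
        mi += (n_xy / n) * math.log2(n_xy * n / (n_x[x] * n_y[y]))
    return mi


# Hypothetical samples: whether a term occurs in a document (1/0) and the document's class.
classes = ["China", "China", "UK", "UK"]
informative_term = [1, 1, 0, 0]     # occurs exactly in the China documents
uninformative_term = [1, 0, 1, 0]   # occurs equally often in both classes

print(mutual_information(informative_term, classes))
print(mutual_information(uninformative_term, classes))
```

A term whose occurrence determines the class gets the maximum score for a balanced two-class problem (1 bit), while a term distributed identically across classes scores 0; ranking terms by this score and keeping the top ones is the selection step used in Section 6.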
For some applications where it is unclear how to properly discretize the continuous data, an alternative solution is to use a density estimation method to approximate $I(x; y)$, as suggested by earlier work in medical image registration and feature selection []. Given $N$ samples of a variable $x$, the approximate density function $\hat{p}(x)$ has the following form:

$$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)}, h)$$

where $\delta(\cdot)$ is the Parzen window function as explained below, $x^{(i)}$ is the $i$-th sample, and $h$ is the window width. Parzen has proven that, with properly chosen $\delta(\cdot)$ and $h$, the estimate $\hat{p}(x)$ converges to the true density $p(x)$ when $N$ goes to infinity []. Usually, $\delta(\cdot)$ is chosen as the Gaussian window:

$$\delta(z, h) = \exp\left(-\frac{z^{T} \Sigma^{-1} z}{2h^{2}}\right) \Big/ \left\{ (2\pi)^{d/2} \, h^{d} \, |\Sigma|^{1/2} \right\}$$

where $z = x - x^{(i)}$, $d$ is the dimension of the sample $x$, and $\Sigma$ is the covariance of $z$. When $d = 1$, $\hat{p}(x)$ returns the estimated marginal density; when $d = 2$, it can be used to estimate the density of the bivariate variable $(x, y)$, $p(x, y)$, which is actually the joint density of $x$ and $y$. For the sake of robust estimation, for $d \ge 2$, $\Sigma$ is often approximated by its diagonal components.

6. Experiments

To study the accuracy and effectiveness of the two models, we performed experiments on part of Reuters-RCV1. All experiments were conducted on a computer with an Intel Core Duo E GHz processor and 4GB of RAM.

6.1 Effectiveness

Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true. For example, the pairs Hong and Kong or London and English are highly dependent terms. In addition, the multinomial model makes an assumption of positional independence. The Bernoulli model ignores positions in documents altogether because it only cares about absence or presence. This bag-of-words model discards all information that is communicated by the order of words in natural language sentences, so it oversimplifies the model of natural language.

Even though the probability estimates of NB are of low quality, its classification decisions are surprisingly good []. For example, consider a document $d$ with true probabilities $P(c_1 \mid d)$ and $P(c_2 \mid d) = 0.3$, as shown in Table 1. Assume that $d$ contains many terms that are positive indicators for $c_1$ and many terms that are negative indicators for $c_2$. Thus, when using the multinomial model in Equation (2), $\hat{P}(c_1) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c_1)$ will be much larger than $\hat{P}(c_2) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c_2)$. The winning class in NB classification usually has a much larger probability than the other classes, and the estimates diverge very significantly from the true probabilities. As Table 1 shows, the classification decision is based on which class gets the highest score; it does not matter how accurate the estimates are. Despite the bad estimates, NB estimates a higher probability for $c_1$ and therefore assigns $d$ to the correct class in Table 1. Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation. NB classifiers estimate badly, but often classify well.

6.2 F measure

Precision and recall are the two most frequent and basic measures for information retrieval.
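Precision, recall and their weighted harmonic mean can be computed for one class from the counts of true positives, false positives and false negatives. The sketch below is illustrative only: the function name, the `beta` weighting parameter (with `beta = 1` weighting precision and recall equally) and the toy labels are assumptions, not details taken from the experiments.

```python
def precision_recall_f(true_labels, predicted_labels, positive, beta=1.0):
    """Return (precision, recall, F) for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall.
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical gold labels and classifier output for five documents.
gold = ["China", "China", "China", "UK", "UK"]
predicted = ["China", "China", "UK", "China", "UK"]
print(precision_recall_f(gold, predicted, positive="China"))
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is their harmonic mean.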
A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall. We obtain terms with high mutual information scores for the six classes (UK, China, Poultry, Coffee, Elections and Sports) based on Equation (8). Keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.

Figure 4. Effect of selected feature set size on accuracy (x-axis: the size of the selected feature set; y-axis: the value of F; curves: Bernoulli model, multinomial model)

Table 1. The relationship of correct estimation vs accurate prediction (columns: class $c_1$, class $c_2$, selected class; rows: real value of $P(c \mid d)$, evaluation value based on NB)

Such an accuracy increase can be observed in Figure 4, which shows F as a function of vocabulary size after feature selection for Reuters-RCV1. Comparing F at 3,776 features, corresponding to selection of all features, and at 0~00 features, we see that mutual information feature selection increases F by about 0. for the multinomial model and by more than 0. for the Bernoulli model. For the Bernoulli model, F peaks early, at ten features selected. At that point, the Bernoulli model is better than the multinomial model: when basing a classification decision on only a few features, it is more robust to consider binary occurrence only. For the multinomial model with mutual information feature selection, the peak occurs later, at 00 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial model takes the number of occurrences into account in parameter estimation and classification and therefore exploits a larger number of features better than the Bernoulli model. Regardless of the differences between the two methods, we conclude that using a carefully selected subset of the features results in better effectiveness than using all features.

7. Conclusions

In this paper, we propose naive Bayesian classification for classifying and predicting text based on two different probability models. Text classification is widely used in modern applications such as the World Wide Web, Internet news feeds, electronic mail, corporate databases and so on. This paper follows the paradigm of directly constructing an NB classifier based on the multinomial model and the Bernoulli model. We integrate the mutual information technique with Naive Bayes. Besides laying the theoretical foundations for enhancing naive Bayesian classification for text, we show how to put these concepts into practice. Our experimental evaluation demonstrates that classifiers for text classification can be constructed efficiently and can classify and predict effectively once a desirable feature set has been selected.

8. Acknowledgement

This work was jointly supported by the Open Foundation of HBIR

References

[1] Pin Lv, Yuntao Wu, Generalization Step Analysis for Privacy Preserving Data Publishing, JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 4, No. 6, pp. 6 ~ 7, 2010.
[2] Liu Hui, Cao Yonghui, The Research of machine learning algorithm for intrusion detection techniques, JDCTA: International Journal of Digital Content Technology and its Applications, Vol. 6, No. , pp. 343 ~ 347, 2012.
[3] Maron, M. E., and J. L. Kuhns. On relevance, probabilistic indexing, and information retrieval. JACM, Vol. 7, No. 3, pp. 216-244, 1960.
[4] Lewis, David D. Naive (Bayes) at forty: The independence assumption in information retrieval. In ECML, pp. 4-15, 1998.
[5] McCallum, Andrew, and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Working Notes of the 1998 AAAI/ICML Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[6] Eyheramendy, Susana, David Lewis, and David Madigan. On the Naive Bayes model for text categorization. In Proc. International Workshop on Artificial Intelligence and Statistics, 2003.
[7] Domingos, Pedro, and Michael J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, Vol. 29, No. 2, pp. 103-130, 1997.
[8] Friedman, Jerome H. On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, Vol. 1, No. 1, pp. 55-77, 1997.
[9] Hand, David J., and Keming Yu. Idiot's Bayes: Not so stupid after all. International Statistical Review, Vol. 69, No. 3, pp. 385-398, 2001.
[10] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[11] Hanchuan Peng, Fuhui Long, Chris Ding. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238,
