Automatic product categorization using Naïve Bayes classifier


Andres Viikmaa
Institute of Computer Science, University of Tartu

ABSTRACT

The backbone of product search engines and online shopping sites is an accurate product catalog and its taxonomy system (schema). However, because a large number of products are released to the market at increasing speed, selecting the correct location in the taxonomy tree for each product becomes a challenging task, and automated techniques are needed. Product categorization can be viewed as a text classification problem, which is a well-studied topic. The Naïve Bayes classifier was chosen to solve this task. Experiments were performed to tune the algorithm by selecting the best features for the given dataset: structured product data in the Estonian language. Experimental comparisons were made on a subset of the data to work around the limitations (e.g. skewness) of the Naïve Bayes classifier. The evaluation compares prediction performance in terms of precision and recall. The experiments confirm that the Naïve Bayes method is suitable for assigning a large number of products to the correct categories. We conclude that skewness is a problem for the Naïve Bayes classifier and that selecting the right features makes a difference.

1. INTRODUCTION

A comprehensive and accurate product catalog is essential to the success of product search engines and online shopping sites. To navigate (browse and filter) a huge product catalog, an accurate taxonomy system (schema) is needed. A large number of products enter the market every day. Given the limited information about them, the catalog owner must select a proper and unambiguous location in the taxonomy tree for each product, which becomes a challenging task; therefore automated techniques are needed. As new products are identified, the goal is to add them to the catalog at the correct place in the taxonomy tree with the highest possible accuracy.
New product information arrives as periodic data feeds from thousands of merchants, each containing tens of thousands of product descriptions. The target product taxonomy contains about 4000 categories in a hierarchical representation, with 21 categories at the top level. Although merchants have their own taxonomy systems, mapping those into the target tree is time-consuming when done manually and leads to inaccurate results, as there is no one-to-one mapping between these taxonomy trees. Moreover, in the majority of cases this information might not be available at all.

We propose using supervised machine learning techniques to guide this process. Depending on the accuracy required, this can be a fully automated process or a semi-automated one that provides guidance to a human. Product catalogs are text-intensive documents, so text classification techniques can be applied. There are many methods for solving this text classification task, such as Decision Trees, Support Vector Machines, k-Nearest Neighbour, Naïve Bayes and others. As Naïve Bayes is reported to outperform them for text classification [3][1], the focus here is only on this classifier. Experiments are conducted on a limited, non-skewed set of data that is more suitable for the plain Naïve Bayes classifier. Different feature (attribute) selection approaches were studied and compared in order to find the most meaningful features.

2. NAÏVE BAYES CLASSIFIER

A Naïve Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (1) with strong independence assumptions.

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (1)$$

When the Naïve Bayes classifier is used for flat text classification, each word is defined to be a feature of the classifier. Using Bayes' theorem, one can calculate the probability of each category $c_i \in C = \{c_1, c_2, \ldots, c_n\}$ given that the product text contains or does not contain the words $W = \{w_1, w_2, \ldots, w_n\}$, where

$$w_i = \begin{cases} 1, & \text{if word } i \text{ is present in the text} \\ 0, & \text{if word } i \text{ is not present in the text} \end{cases} \qquad (2)$$

using equation (3) below:

$$p(c_i \mid w_1, \ldots, w_n) = \frac{p(c_i)\, p(w_1, \ldots, w_n \mid c_i)}{p(w_1, \ldots, w_n)} \qquad (3)$$

In the classification task only the numerator is relevant, as the denominator does not depend on the category $c_i$ and is effectively constant. Therefore we can simplify equation (3) to

$$p(c_i \mid w_1, \ldots, w_n) \propto p(c_i)\, p(w_1, \ldots, w_n \mid c_i) \qquad (4)$$

By applying the chain rule and assuming that the features are conditionally independent given the category, this can be written as

$$p(c_i \mid w_1, \ldots, w_n) \propto p(c_i) \prod_j p(w_j \mid c_i) \qquad (5)$$

Finally, to obtain the predicted category $\hat{c}$, the most likely category is chosen:

$$\hat{c} = \operatorname*{argmax}_{c_i}\; p(c_i) \prod_j p(w_j \mid c_i) \qquad (6)$$

[Figure 1 category axis labels: Kunst ja meelelahutus, Sõidukid ja sõidukite osad, Kaamerad ja optika, Tööriistad ja ehitus, Pagas ja kotid, Mööbel, Mängud ja mänguasjad, Imik ja väikelaps, Meedia, Ilu ja tervis, Tarkvara, Riided ja aksessuaarid, Toit, joogid ja tubakas, Kontoritarbed, Kodu ja aed, Elektroonika, Spordikaubad]

3. DESCRIPTION OF DATA

The test and training data consisted of unique product descriptions gathered by web crawling from approximately 15 Estonian online shops. As different merchants make different amounts of data available on their web pages, the quality of the product data varies from only a product name to a full set of information. A full set of data is illustrated in Table 1.

product name         Objektiiv Canon EF 50 mm
product code         2514A011
merchant taxonomy    Foto, video, GPS > Objektiivid
description          Klassikaline 6 elementi...
parameter            Fookuskaugus: 50 mm
parameter            Kaal: 130 g

Table 1: Detailed product information

Only a fraction (about 2%) of this product data is correctly categorized in our target taxonomy tree, and this fraction is used as training and testing data in the experiments. Figure 1 shows the distribution of the correctly classified data.

3.1 Attribute selection

As described in section 2, the basis of text classification is to use words as features. That is, each word is a binary feature with value 1 if the word is present in the given document and value 0 if it is not.
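The binary word features described above, combined with the decision rule of equation (6), can be sketched as follows. This is a minimal illustration and not the Weka implementation used in the experiments: the function names `train` and `predict`, the toy products, and the add-one (Laplace) smoothing are all assumptions introduced here. The product is evaluated as a sum of logarithms to avoid numerical underflow, and absent words contribute a factor of 1 - p(w_j | c_i).

```python
import math
from collections import defaultdict


def train(docs):
    """docs: list of (set_of_words, category) pairs. Returns (priors, cond, vocab)."""
    vocab = set().union(*(words for words, _ in docs))
    cat_count = defaultdict(int)
    word_count = defaultdict(lambda: defaultdict(int))
    for words, cat in docs:
        cat_count[cat] += 1
        for w in words:
            word_count[cat][w] += 1
    priors = {c: n / len(docs) for c, n in cat_count.items()}
    # Add-one smoothing is an assumption; it keeps every probability in (0, 1).
    cond = {
        c: {w: (word_count[c][w] + 1) / (cat_count[c] + 2) for w in vocab}
        for c in cat_count
    }
    return priors, cond, vocab


def predict(words, priors, cond, vocab):
    """Return the category maximizing log p(c) + sum_j log p(w_j | c)."""
    best_cat, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[c][w]
            # Present words contribute p, absent words contribute 1 - p.
            score += math.log(p if w in words else 1.0 - p)
        if score > best_score:
            best_cat, best_score = c, score
    return best_cat


# Hypothetical toy training data, not taken from the paper's dataset.
docs = [
    ({"objektiiv", "canon", "ef"}, "Kaamerad ja optika"),
    ({"objektiiv", "nikon"}, "Kaamerad ja optika"),
    ({"jalgpall", "nike"}, "Spordikaubad"),
    ({"jooksujalats", "nike"}, "Spordikaubad"),
]
model = train(docs)
print(predict({"objektiiv", "sigma"}, *model))
```

Words that never occurred in training (such as `sigma` here) are simply ignored, because only vocabulary words enter the product.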
As product data is usually well structured (as it mostly is in our case), it is reasonable to give some parts of the product data higher priority, or perhaps to ignore some parts altogether. In the work of J. Lee et al. (2006), the Naïve Bayes classifier was extended to make use of the structural characteristics of e-catalogs. It was shown that the accuracy of classification can be improved when appropriate characteristics of e-catalogs are utilized [6]. It was also shown that using words from full product descriptions decreases categorization accuracy. To confirm this behaviour we generate multiple datasets with different levels of information about the products. The experiments section (section 4) describes these datasets in more detail.

Figure 1: Distribution of products per category

3.2 Word normalization

Product data contains words in mixed case and in different forms (e.g. Objektiiv, objektiivid). As the meaning of a word does not depend on its casing, all words are first converted to lower case. To normalize the words further, we can perform morphological analysis and convert each word into its base form, using either stemming or lemmatization. A freeware, open-source statistical lemmatizer exists (footnote 1), so lemmatization was used. This simple lemmatizer has performance similar to a commercial alternative [5].

4. EXPERIMENTS

From the manually categorized data, the two categories with the fewest products were excluded from the experiments. Words from the product data were extracted into the five groups shown in Table 2. Merchant taxonomy information was excluded from the full and limited datasets in order to see its role in classification performance.

Full        All textual product information (footnote 2)
Limited     Name, manufacturer, product parameters
Taxonomy    Only merchant taxonomy
F + T       Full + Taxonomy
L + T       Limited + Taxonomy

Table 2: Datasets with different features

Additionally, duplicate datasets were created in which each word carries a prefix indicating its role in the product data.
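The lower-casing step of section 3.2 and the field prefixing used for the duplicate datasets could be sketched as follows. The function names, the dictionary keys of the product record and the `_` separator between prefix and word are assumptions made for illustration; the paper does not state how the prefix was attached. Lemmatization is left out because it depends on an external tool.

```python
def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def prefix_tokens(field, text, sep="_"):
    """Tag every token with the product field it came from (separator is assumed)."""
    return [f"{field}{sep}{tok}" for tok in tokenize(text)]


def product_features(product, fields, prefixed=False):
    """Collect the set of word features from the selected fields of one product."""
    features = set()
    for field in fields:
        text = product.get(field, "")
        tokens = prefix_tokens(field, text) if prefixed else tokenize(text)
        features.update(tokens)
    return features


# Record modelled on Table 1; the key names are hypothetical.
product = {
    "name": "Objektiiv Canon EF 50 mm",
    "taxonomy": "Foto, video, GPS > Objektiivid",
}
print(prefix_tokens("name", product["name"]))
print(product_features(product, ["name"]))
```

Choosing which fields to pass in `fields` corresponds to the Full, Limited and Taxonomy groupings, and `prefixed=True` gives the prefixed variant of the same dataset.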
For example, the product name "objektiiv canon ef 50 mm" was transformed into "name objektiiv name canon name ef name 50 name mm". Finally, lemmatized versions were also created, so in total 20 (5 × 4) datasets were generated.

Experiments were conducted using the machine learning software Weka 3 (footnote 3). This software has a Naïve Bayes classifier implemented, with automatic test and training functionality. 10-fold cross-validation was used to test the performance on each dataset. True positives, false positives and the F1-measure were taken as comparison metrics.

Footnotes:
2. Product name, product description and all product parameter values (e.g. Fookuskaugus, Kaal).
3. ml/weka/

5. RESULTS

Tables 3-6 show the calculated performance metrics for all datasets.

Table 3: As Is dataset (columns: Full, Limited, Taxonomy, F + T, L + T)
Table 4: Lemmatized dataset (columns: Full, Limited, Taxonomy, F + T, L + T)
Table 5: Prefixed dataset (columns: Full, Limited, Taxonomy, F + T, L + T)
Table 6: Lemmatized prefixed dataset (columns: Full, Limited, Taxonomy, F + T, L + T)

From Table 3 we can see that the best results are achieved by using only the merchant taxonomy as the feature set. This seems to confirm the empirical observation that there exists a close to 1:1 mapping from the merchant taxonomy tree into our target tree. The results also reconfirm the claim made in section 3.1 that the free-text product description decreases prediction performance. One interesting observation is that, although using only the merchant taxonomy gave good results, adding limited product information (name, manufacturer, parameter names) increased the performance significantly.

Table 4 shows the lemmatized version of the same dataset. The performance has not changed much: for some datasets it has increased and for some it has decreased, but as the differences are small, it is hard to conclude whether they are systematic or due to randomization in cross-validation.

Table 5 shows the performance of the prefixed dataset. For the merchant taxonomy dataset, the words were effectively identical to the As Is dataset, as the same prefix was added to all words. From this we can see that 10-fold cross-validation has an error margin of about 1%. The performance has increased noticeably for the limited dataset but, surprisingly, not that much for the full dataset. It also seems that adding the merchant taxonomy nullifies the benefit of using prefixes.

From Table 6 we can see again that lemmatization does not significantly change the performance of our classifier. Overall, we can see that excluding the free-text product description gives the best results.

Although it is not visible from these tables, the skewness issue (the Naïve Bayes classifier tends to prefer classes that have more samples) did not pose a problem when the merchant taxonomy was included in the data. Appendix 1 shows the confusion matrix across all categories for the As Is dataset without the merchant taxonomy, where we can clearly see the effect of skewness (many products are misclassified into the Spordikaubad category), and the corresponding matrix with the taxonomy.

6. CONCLUSIONS

From the experiments we can conclude that product categorization using text classification techniques, namely the Naïve Bayes classifier, is well justified. Simply using the product data as is, without any preprocessing, does not give optimal performance. Using a limited set of product parameters (without the merchant taxonomy) increases classification performance by about 10%. Also using the merchant taxonomy (which might not always be available) boosts the performance by an additional 15%. Using lemmatization did not change the performance significantly. This might be due to the fact that there is less narrative text in product data than there is in books. Also, using prefixed text to distinguish the context of words did not work as well as we expected. It might be useful when merchant taxonomy information is missing for the majority of the products.

7. FUTURE WORK

As the dataset used was quite small and skewed, it would be wise to rerun the experiments when more data has been categorized. Although it seemed that using the merchant taxonomy as input eliminated the skewness problem, this still needs further investigation. A number of studies address this issue, and improved versions of the Naïve Bayes classifier exist (namely Complement Naïve Bayes [2] and Negation Naïve Bayes [4]). When the dataset gets larger, we should retest the effectiveness of lemmatization.

8. REFERENCES

[1] S. Hassan, M. Rafi, and M. S. Shaikh. Comparing SVM and naive Bayes classifiers for text categorization with Wikitology as knowledge enrichment. CoRR, abs/.
[2] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the poor assumptions of naive Bayes text classification. In ICML 2003.
[3] J. Huang, J. Lu, and C. X. Ling. Comparing naive Bayes, decision trees, and SVM with AUC and accuracy. In The Third IEEE International Conference on Data Mining, page 553.
[4] K. Komiya, N. Sato, K. Fujimoto, and Y. Kotani. Negation naive Bayes for categorization of product pages on the web. In Proceedings of Recent Advances in Natural Language Processing.
[5] A. Tkachenko, T. Petmanson, and S. Laur. Named entity recognition in Estonian. In 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, pages 78-83.
[6] Y.-g. Kim, T. Lee, J. Chun, and S.-g. Lee. Modified naive Bayes classifier for e-catalog classification. In DEECS 2006.

Appendix 1

Both confusion matrices use the same class labels (rows: actual class; columns a to p: class the product was classified as):

a = Mängud ja mänguasjad
b = Kontoritarbed
c = Kaamerad ja optika
d = Imik ja väikelaps
e = Tööriistad ja ehitus
f = Spordikaubad
g = Toit, joogid ja tubakas
h = Elektroonika
i = Mööbel
j = Kunst ja meelelahutus
k = Riided ja aksessuaarid
l = Tarkvara
m = Pagas ja kotid
n = Ilu ja tervis
o = Kodu ja aed
p = Meedia

Confusion matrix 1: Full product data without merchant taxonomy

Confusion matrix 2: Full product data with merchant taxonomy
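Per-class precision, recall and F1 can be derived from a confusion matrix laid out like the ones in this appendix. The sketch below assumes rows are actual classes and columns are predicted classes; the function name `per_class_metrics` and the three-class counts are hypothetical and are not the values from the experiments. The toy counts mimic the skewness effect: the largest class attracts misclassified products, so its recall is high while its precision suffers.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of products of actual class i classified as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                   # rest of row k
        fp = sum(row[k] for row in matrix) - tp    # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical counts for three classes, chosen only to illustrate skewness.
labels = ["Spordikaubad", "Mööbel", "Tarkvara"]
matrix = [
    [90, 5, 5],    # actual Spordikaubad (largest class)
    [30, 20, 0],   # actual Mööbel, many pulled into Spordikaubad
    [20, 0, 10],   # actual Tarkvara, many pulled into Spordikaubad
]
metrics = per_class_metrics(matrix, labels)
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```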
