Advanced Parametric Mixture Model for Multi-Label Text Categorization, by Tzu-Hsiang Kao


Advanced Parametric Mixture Model for Multi-Label Text Categorization

by Tzu-Hsiang Kao

A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science (Industrial Engineering) in National Taiwan University, 2006

ABSTRACT

This thesis studies Parametric Mixture Models (PMMs), efficient statistical models for the multi-label text categorization problem. Conventional machine learning approaches usually train a set of binary classifiers to make multi-label predictions. In contrast, PMMs handle multi-label text with a single statistical model. We propose an Advanced Parametric Mixture Model (APMM) based on PMMs. Its maximum likelihood estimation is a concave programming problem, and we design update rules whose iterations converge to a global maximum. Experiments on real-world yahoo.com datasets under three common multi-label classification measurements show that APMM is competitive.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER
I. Introduction
II. Advanced Parametric Mixture Model
    2.1 Parametric Mixture Model
        2.1.1 PMM1
        2.1.2 PMM2
    2.2 Advanced Parametric Mixture Model
    2.3 Training Update Formula
        2.3.1 PMM1
        2.3.2 APMM
    2.4 Prediction Method
        2.4.1 PMM1
        2.4.2 APMM
III. Experiments and Results
    3.1 Data Description
    3.2 Evaluation Criteria
        3.2.1 Exact Match Ratio
        3.2.2 Labeling F-measure
        3.2.3 Retrieval F-measure
    3.3 Experimental Setting
    3.4 Results and Discussions
IV. Discussion and Conclusions

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

3.1 The relation between stopping tolerances and performances
3.2 Number of iterations versus stopping tolerance
3.3 The three evaluation criteria versus (ξ − 1)
3.4 The three evaluation criteria for (ξ − 1) from 0.1 to 2

LIST OF TABLES

3.1 Details of the yahoo.com Web page datasets
3.2 Single-label document prediction performance
3.3 Training and testing time of the models under different stopping tolerances
3.4 Performance with stopping tolerance 0.01
3.5 Performance with a second stopping tolerance (legend as in Table 3.4)
3.6 Prediction accuracy for different label sizes

CHAPTER I

Introduction

Multi-label classification has become a popular topic in recent years. Previous work in classification often focuses on binary or multi-class problems, but in many practical problems one sample can belong to several classes, so an instance may carry several class labels. This situation increasingly occurs in today's text categorization (i.e., text classification) problems. Since the number of online documents and newspaper articles (electronic files) is growing rapidly, the need for automatic categorization methods is rising. In this research, we propose the Advanced Parametric Mixture Model (APMM) to cope with the multi-label problem, and we use yahoo.com Web documents in the experiments to compare its performance with the state-of-the-art parametric mixture models.

The goal of text categorization [15] is to classify documents into a set of predefined categories. Text categorization began in the early '60s [9], but it did not become a major subfield of information technology until the early '90s. It has been applied in several contexts, ranging from document indexing based on a controlled vocabulary [13] [20] to document filtering [1], automated metadata generation [8] [5], word sense disambiguation [17], population of hierarchical catalogues of Web resources, and in general any application requiring document organization.

Several learning models have been used for text categorization, and some of them have

shown good results, for example the Naive Bayes classifier [7] [12], the k-nearest-neighbor (k-NN) classifier [10] [21], and the C4.5 decision tree [14] [7]. The Naive Bayes classifier uses a probabilistic model of text to estimate the probability that a document belongs to a category. The k-NN classifier is based on the assumption that an example is likely to belong to the class of the majority of its k nearest examples. Decision tree learners construct models by branching on instances according to the values of their features. However, the above models assume that the data sets are multi-class. In practice, it is possible that a document is associated with more than one label. [2] discussed the following phenomenon: when two human experts decide whether to classify a document under a category, they may disagree, and this in fact happens very often. For example, a news article on Clinton attending Dizzy Gillespie's funeral could be related to Politics, Jazz, both, or even neither. Such scenarios are increasingly common on the Internet today. Multi-label text classification assigns each document at least one label as its category, whereas multi-class text classification assigns a document exactly one class label.

A variety of models have been developed for multi-label text categorization. For example, Support Vector Machines (SVMs) [6] use a set of binary SVM classifiers to handle the multi-label classification problem, and BoosTexter, proposed in [16], extends AdaBoost [4] to handle multi-label text categorization. However, these two models cope with the multi-label problem only with binary classifiers. It is reasonable to expect that models which extract more information from the data will also make better predictions for future cases. By attempting to solve the multi-label problem with only a set of binary classifiers, these models might lose certain information embedded in the datasets, such as the probability of label co-occurrence.

Mixture models are a popular way to handle multi-label text classification problems; they take more factors into account than a set of binary classifiers. In the text categorization area, [18] proposed parametric mixture models (PMMs) to categorize multi-label documents. They proposed two parametric mixture models: one can be considered a first-order approximation, PMM1, and the other is a second-order model, PMM2. They found that the difference in accuracy between the two models is very minor, and that the Exact Match ratio of PMMs is very low. To improve the prediction accuracy, we propose the Advanced Parametric Mixture Model (APMM). APMM extends the spirit of PMMs by introducing additional parameters into the model to account for dependencies among the labels. The experiments use the yahoo.com text data, and APMM is shown to achieve a higher Exact Match ratio than PMMs.

In this thesis, we first review the related work on PMMs and introduce their extension, the Advanced Parametric Mixture Model, including its objective function, training procedure, and prediction method, in Chapter II. Chapter III contains the dataset description and the experimental results obtained with three common multi-label classification measurements. Finally, we conclude this work in Chapter IV.

CHAPTER II

Advanced Parametric Mixture Model

For documents with multiple topics, such as Web pages, [18] proposes parametric mixture models (PMMs) and shows that they outperform some conventional methods. Traditionally, a multi-label classification problem is decomposed into many binary problems and solved by training several classifiers, each handling one binary problem. In contrast, PMMs handle multi-label classification directly with a single probability model. In this chapter, we first review the existing parametric mixture models, PMM1 and PMM2 [18]. After a detailed derivation of PMM1, an example is given to illustrate its training procedure. We then propose a new model called the Advanced Parametric Mixture Model (APMM).

2.1 Parametric Mixture Model

In the notation of PMMs, N denotes the total number of documents and d^n represents the nth document. V is the vocabulary set containing all words appearing in all documents; it can be written w_1, ..., w_V, where w_i is the ith word in the vocabulary V. The input data are represented by Bag-of-Words (BOW) [17], a frequency-based representation. It is assumed that each word is independent of the others; therefore, BOW ignores the order of the words and, no matter how the words are combined, considers only the word frequencies (a.k.a. term frequencies) of the documents.
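As a concrete illustration of the BOW representation just described, the sketch below builds word-frequency vectors from tokenized documents. It is a minimal sketch under stated assumptions (the tokenizer and vocabulary construction are not specified in the thesis), not the thesis's own code.

```python
import numpy as np

def build_bow(documents):
    """Turn tokenized documents into word-frequency (term-frequency) vectors.

    documents: list of token lists, e.g. [["jazz", "funeral", "jazz"], ...]
    Returns (X, vocab) where X[n, i] counts how often vocab[i] occurs in document n.
    """
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(documents), len(vocab)))
    for n, doc in enumerate(documents):
        for w in doc:              # word order is discarded; only counts are kept
            X[n, index[w]] += 1.0
    return X, vocab
```

Each row of X plays the role of the word-frequency vector x^n defined below.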

Although taking the combinations of words into account might be more accurate, a basic model that ignores word order is efficient and serves as a good starting point. Under the BOW framework, d^n can be represented as a word-frequency vector x^n = [x^n_1, ..., x^n_V]^T, where x^n_i denotes the frequency of word w_i in d^n. The labels of document d^n are represented by a category vector y^n = [y^n_1, ..., y^n_L]^T, where L is the total number of predefined categories. The given training data consist of N samples, denoted D = {(x^n, y^n)}_{n=1}^N. Each y^n_l is a Boolean value (0 or 1): y^n_l = 1 means that the nth document belongs to the lth category, and y^n_l = 0 means that it does not. In practice we categorize a document into one or more categories, so we add the constraint Σ_l y_l > 0 to ensure that every document has at least one category.

PMMs consider that each word in the vocabulary has a probability related to each class, so the class-dependent probabilities can be parameterized. In the multi-class, single-label classification problem, the probability of a document x in the lth category can be written as a multinomial distribution:

    P(x \mid l) \propto \prod_{i=1}^{V} (\theta_{l,i})^{x_i}, \quad \text{where } \theta_{l,i} \ge 0 \text{ and } \sum_{i=1}^{V} \theta_{l,i} = 1.    (2.1)

Here θ_{l,i} is the probability that the ith word w_i occurs in the lth class. The multi-class, multi-label classification problem is then generalized as

    P(x \mid y) \propto \prod_{i=1}^{V} (\varphi_i(y))^{x_i}, \quad \text{where } \varphi_i(y) \ge 0 \text{ and } \sum_{i=1}^{V} \varphi_i(y) = 1,    (2.2)

where φ_i(y) is a class-dependent probability that the ith word w_i occurs in class y.

If we consider every possible combination of y, the total number of label classes is 2^L − 1. Therefore, if L is large, the number of possible multi-label classes is huge and the model becomes inefficient. Thus we try to parameterize them efficiently, using a reasonable number of parameters.

Let Θ denote the vector of unknown parameters of φ_i(y) in (2.2). Bayes' rule states that

    P(A, B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B),    (2.3)

and

    P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)\,P(B \mid A)}{P(B)},    (2.4)

where P(A, B) (also written P(A ∩ B)) is the probability that both A and B occur. Based on (2.3) and (2.4), we can obtain the posterior P(Θ | D), i.e., P(Θ | x, y), as

    P(\Theta \mid x, y) = \frac{P(\Theta)\,P(x, y \mid \Theta)}{P(x, y)} = \frac{P(\Theta)\,P(y \mid \Theta)\,P(x \mid y, \Theta)}{P(x, y)}.    (2.5)

PMMs assume that y is independent of Θ, so P(y | Θ) equals P(y); this assumption simplifies the problem. Since P(y) and P(x, y) are not functions of the parameter set Θ of interest and serve only as normalizing constants, (2.5) can be further simplified to

    P(\Theta \mid x, y) \propto P(\Theta)\,P(x \mid y, \Theta),    (2.6)

where P(Θ) is the prior over the parameters. Given the posterior P(Θ | x, y), the Maximum A Posteriori (MAP) estimate is

    \hat\Theta_{\mathrm{map}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log P(x^n \mid y^n, \Theta) + \log P(\Theta).    (2.7)

The goal of the model is to iterate toward \hat\Theta_{\mathrm{map}}, the optimal parameter set for prediction.

The text mining field makes extensive use of the Dirichlet distribution to describe text datasets [11] [19]. PMMs use Dirichlet distributions as conjugate priors over the parameter vectors θ in Θ, since the Dirichlet distribution is the conjugate prior of the parameters of a multinomial distribution. The probability density function of the Dirichlet distribution for θ_m is

    p(\theta_m) = \mathrm{Dirichlet}(\theta_m; \xi) = \frac{1}{Z(\xi)} \prod_{i=1}^{V} \theta_{m,i}^{\xi_i - 1},    (2.8)

where the normalization constant is

    Z(\xi) = \frac{\prod_{i=1}^{V} \Gamma(\xi_i)}{\Gamma\bigl(\sum_{i=1}^{V} \xi_i\bigr)},    (2.9)

with the parameters satisfying \sum_{i=1}^{V} \theta_{m,i} = 1 and \xi_i - 1 > 0. PMMs assume every ξ_i = 2 and denote the common value by ξ, which is equivalent to Laplace smoothing; every normalization constant is then equal. Since our goal is to maximize (2.7), we can ignore Z(ξ). The Dirichlet prior p(θ_m) above is the prior of the mth parameter vector in Θ; consequently, the conjugate prior over the whole parameter set can be written as

    p(\Theta) \propto \prod_{m=1}^{M} \prod_{i=1}^{V} \theta_{m,i}^{\xi - 1},    (2.10)

where M is the number of parameter vectors in Θ.

Laplace smoothing gives each probability an equal initial value: every unobserved event is assigned an equal positive probability. Laplace smoothing prevents any probability estimate from being zero, which is necessary because zero probabilities may make the model too sensitive. Using the conjugate Dirichlet priors for Θ, the objective function of PMMs is given by

    J(\Theta; D) = L(\Theta; D) + (\xi - 1) \sum_{m=1}^{M} \sum_{i=1}^{V} \log \theta_{m,i},    (2.11)

where

    L(\Theta; D) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \varphi_i(y^n).    (2.12)
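To make the role of the (ξ − 1) term concrete, the following sketch evaluates an objective of the form (2.11)-(2.12) for a generic parameterization φ. It is an illustrative sketch, not the thesis's code; the function name and array conventions are assumptions.

```python
import numpy as np

def pmm_objective(X, Y, Theta, phi, xi=2.0):
    """J(Theta; D) = sum_n sum_i x_{n,i} log phi_i(y^n) + (xi - 1) sum_{m,i} log theta_{m,i}.

    X: (N, V) word-frequency matrix, Y: (N, L) 0/1 label matrix,
    Theta: (M, V) parameter vectors, each row strictly positive and summing to 1,
    phi:  callable mapping a label vector y^n to a length-V probability vector.
    """
    log_lik = sum(float(X[n] @ np.log(phi(Y[n]))) for n in range(X.shape[0]))
    log_prior = (xi - 1.0) * float(np.sum(np.log(Theta)))   # Dirichlet (Laplace) term
    return log_lik + log_prior
```

With ξ = 2, the prior term simply adds Σ log θ_{m,i}, which is why it acts like add-one (Laplace) smoothing in the closed-form updates derived later.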

2.1.1 PMM1

In [18], PMM1 is regarded as a special case of PMM2: PMM1 is a first-order approximation model, while PMM2 is a second-order one. PMM1 always converges to the global optimum by iteratively updating the parameter set, and it achieved similar, even slightly better, match-rate accuracy than PMM2. In the following, we briefly go through the interior of PMM1.

Let Θ = [θ_1, θ_2, ..., θ_L] with θ_l = [θ_{l,1}, ..., θ_{l,V}]^T. That is, the number of parameter vectors in Θ equals the number of categories L, and we therefore call the θ_l category-dependent. This assumption implies that φ(y) (= [φ_1(y), ..., φ_V(y)]^T) can be represented by the parametric mixture

    \varphi(y) = \sum_{l=1}^{L} h_l(y)\,\theta_l, \quad \text{where } h_l(y) = 0 \text{ for } l \text{ such that } y_l = 0.    (2.13)

Since y = [y_1, ..., y_L]^T, we only have to estimate the L parameter vectors θ_1, ..., θ_L instead of 2^L − 1 parameter vectors. Using h_l(y) as the weight of each θ_l, we obtain the multi-label parameter vector φ(y) via Σ_l h_l(y) θ_l. The mixing proportion h_l represents the degree to which x belongs to the lth category, with Σ_l h_l(y) = 1; clearly, h^n_l is a weight for combining the θ's. PMM1 assumes this degree is uniform over the categories, h_l(y) = y_l / Σ_{l'=1}^{L} y_{l'}, that is,

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.14)

For document n, the nonzero h^n_l are all equal.
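Under (2.13)-(2.14), φ(y) is simply the average of the θ_l whose labels are active; a minimal sketch (array names are assumptions, not the thesis's code):

```python
import numpy as np

def phi_pmm1(y, Theta):
    """phi(y) = sum_l h_l(y) * theta_l with h_l(y) = y_l / sum_l y_l  (2.13)-(2.14).

    y: (L,) 0/1 label vector with at least one 1, Theta: (L, V) with rows theta_l.
    """
    y = np.asarray(y, dtype=float)
    return (y @ Theta) / y.sum()      # average of the rows of Theta selected by y
```

For L = 4 and y = [1, 0, 1, 1], this returns (θ_1 + θ_3 + θ_4)/3, which is exactly the example worked out next.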

For example, in the case of L = 4, if y^n = [1, 0, 1, 1]^T then φ([1, 0, 1, 1]^T) = (θ_1 + θ_3 + θ_4)/3, and if y^n = [1, 0, 0, 1]^T then φ([1, 0, 0, 1]^T) = (θ_1 + θ_4)/2. Consequently, P(x | y, Θ) under PMM1 becomes

    P(x \mid y, \Theta) \propto \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L} h_l(y)\,\theta_{l,i} \Bigr)^{x_i} = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L} y_l\,\theta_{l,i}}{\sum_{l'=1}^{L} y_{l'}} \Bigr)^{x_i}.    (2.15)

Accordingly, the objective function of PMM1 is given by

    J(\Theta; D) = L(\Theta; D) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}
                 = \sum_{n=1}^{N}\sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L} h^n_l\,\theta_{l,i} + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i},    (2.16)

subject to the constraints \sum_{i=1}^{V} \theta_{l,i} = 1, l = 1, ..., L.

2.1.2 PMM2

PMM2 is a more flexible model in which duplicate-category parameter vectors θ_{l,m} are also used to approximate φ(y):

    \varphi(y) = \sum_{l=1}^{L} \sum_{m=1}^{L} h_l(y)\,h_m(y)\,\theta_{l,m},    (2.17)

where h_l(y) = 0 for l such that y_l = 0 and θ_{l,m} = α_{l,m} θ_l + α_{m,l} θ_m. Here α_{l,m} is a non-negative bias parameter satisfying α_{l,m} + α_{m,l} = 1 for all l, m; clearly α_{l,l} = 0.5 and θ_{l,m} = θ_{m,l}. Since y = [y_1, ..., y_L]^T, we now have to estimate the parameter vectors θ_1, ..., θ_L and the additional L(L − 1)/2 parameters α_{l,m}, instead of 2^L − 1 parameter vectors. As in PMM1, the mixing proportion h_l for PMM2 is assumed to be

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.18)

Using the product h_l(y) h_m(y) as the weight of each θ_{l,m}, we obtain the multi-label parameter vector φ(y) via (2.17). For example, in the case of L = 4, if

y^n = [1, 0, 1, 1]^T, then

    \varphi([1,0,1,1]^T) = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + \theta_{1,3} + \theta_{3,1} + \theta_{1,4} + \theta_{4,1} + \theta_{3,4} + \theta_{4,3}]/9
                        = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + 2\theta_{1,3} + 2\theta_{1,4} + 2\theta_{3,4}]/9
                        = [\{1 + 2(\alpha_{1,3} + \alpha_{1,4})\}\theta_1 + \{1 + 2(\alpha_{3,1} + \alpha_{3,4})\}\theta_3 + \{1 + 2(\alpha_{4,1} + \alpha_{4,3})\}\theta_4]/9
                        = [\{1 + 2(\alpha_{1,3} + \alpha_{1,4})\}\theta_1 + \{3 + 2(\alpha_{3,4} - \alpha_{1,3})\}\theta_3 + \{5 - 2(\alpha_{1,4} + \alpha_{3,4})\}\theta_4]/9.

In addition, for y^n = [1, 0, 0, 1]^T,

    \varphi([1,0,0,1]^T) = [\theta_{1,1} + \theta_{4,4} + 2\theta_{1,4}]/4 = [(1 + 2\alpha_{1,4})\theta_1 + (3 - 2\alpha_{1,4})\theta_4]/4.

Consequently, P(x | y, Θ) under PMM2 becomes

    P(x \mid y, \Theta) = \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L}\sum_{m=1}^{L} h_l(y)h_m(y)\,(\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i}) \Bigr)^{x_i}    (2.19)
                        \propto \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L}\sum_{m=1}^{L} y_l y_m (\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i})}{\sum_{l'=1}^{L} y_{l'} \sum_{m'=1}^{L} y_{m'}} \Bigr)^{x_i}.    (2.20)

Accordingly, the objective function of PMM2 is given by

    J(\Theta; D) = \sum_{n=1}^{N}\sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L}\sum_{m=1}^{L} h^n_l h^n_m (\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i})
                 + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L} \log \alpha_{l,m},    (2.21)

subject to the constraints \sum_{i=1}^{V} \theta_{l,i} = 1, l = 1, ..., L, and \alpha_{l,m} + \alpha_{m,l} = 1.

2.2 Advanced Parametric Mixture Model

As mentioned above, PMM1 is regarded as a special case of PMM2, where PMM1 is a first-order approximation model and PMM2 a second-order one. However, the seemingly second-order duplicate-category parameter

vectors of PMM2 are in fact formulated as linear combinations of the category-dependent parameter vectors. Here we extend PMM1 of Section 2.1.1 to allow free duplicate-category parameter vectors and thus produce a better second-order approximation model. The new parametric mixture model is called the Advanced Parametric Mixture Model, or APMM in short. In APMM, φ(y) is formulated as

    \varphi(y) = \sum_{l=1}^{L} \sum_{m=1}^{L} h_l(y)\,h_m(y)\,\theta_{l,m},    (2.22)

where h_l(y) = 0 for l satisfying y_l = 0. The main difference between APMM and PMM2 is that θ_{l,m} is restricted to be a weighted average of θ_l and θ_m in PMM2, whereas it is free in APMM. In addition, we assume the symmetry θ_{l,m} = θ_{m,l} and denote θ_{l,l} by θ_l. Since y = [y_1, ..., y_L]^T, we now have to estimate the L(L + 1)/2 parameter vectors {θ_l, θ_{l,m}}_{l,m = 1,...,L; l < m} instead of 2^L − 1 parameter vectors. The mixing proportion h_l for APMM is, as in PMM1 and PMM2, assumed to be

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.23)

Using the product h_l(y) h_m(y) as the weight of each θ_{l,m}, we obtain the multi-label parameter vector φ(y) via (2.22). For example, in the case of L = 4, if y^n = [1, 0, 1, 1]^T,

    \varphi([1,0,1,1]^T) = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + \theta_{1,3} + \theta_{3,1} + \theta_{1,4} + \theta_{4,1} + \theta_{3,4} + \theta_{4,3}]/9
                        = [\theta_1 + \theta_3 + \theta_4 + 2\theta_{1,3} + 2\theta_{1,4} + 2\theta_{3,4}]/9.
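The corresponding computation for APMM keeps every duplicate-category vector as a free parameter. Storing them in an (L, L, V) array with Theta2[l, m] = Theta2[m, l] and Theta2[l, l] = θ_l is a convention of this sketch, not something the thesis prescribes.

```python
import numpy as np

def phi_apmm(y, Theta2):
    """phi(y) = sum_{l,m} h_l(y) h_m(y) theta_{l,m}  (2.22), with h as in (2.23).

    y: (L,) 0/1 label vector with at least one 1,
    Theta2: (L, L, V), symmetric in its first two axes; Theta2[l, l] holds theta_l.
    """
    y = np.asarray(y, dtype=float)
    h = y / y.sum()                               # mixing proportions (2.23)
    return np.einsum("l,m,lmv->v", h, h, Theta2)
```

For the example above, y = [1, 0, 1, 1] gives (θ_1 + θ_3 + θ_4 + 2θ_{1,3} + 2θ_{1,4} + 2θ_{3,4})/9, matching the hand expansion just shown.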

Consequently, P(x | y, Θ) under APMM can be formulated as

    P(x \mid y, \Theta) = \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L}\sum_{m=1}^{L} h_l(y) h_m(y)\,\theta_{l,m,i} \Bigr)^{x_i}
                        = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L}\sum_{m=1}^{L} y_l y_m\,\theta_{l,m,i}}{\sum_{l'=1}^{L} y_{l'} \sum_{m'=1}^{L} y_{m'}} \Bigr)^{x_i}
                        = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l} y_l \theta_{l,i} + 2\sum_{l>m} y_l y_m \theta_{l,m,i}}{\sum_{l} y_l + 2\sum_{l>m} y_l y_m} \Bigr)^{x_i}.    (2.24)

APMM also uses Dirichlet distributions as priors, conjugate with P(x | y, Θ), over the parameter vectors θ in Θ:

    P(\Theta) \propto \Bigl( \prod_{l=1}^{L} \prod_{i=1}^{V} \theta_{l,i}^{\xi - 1} \Bigr) \Bigl( \prod_{l>m} \prod_{i=1}^{V} \theta_{l,m,i}^{\omega - 1} \Bigr),    (2.25)

where ξ and ω are the hyperparameters of the prior for the category-dependent parameter vectors θ_l and the duplicate-category ones θ_{l,m}, respectively. Combining the logarithms of (2.24) and (2.25), we obtain the objective function of APMM:

    J(\Theta; D) = \sum_{n=1}^{N} \log P(x^n \mid y^n, \Theta) + \log P(\Theta)
                 = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \frac{\sum_{l} y^n_l \theta_{l,i} + 2\sum_{l>m} y^n_l y^n_m \theta_{l,m,i}}{\sum_{l} y^n_l + 2\sum_{l>m} y^n_l y^n_m}
                   + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} + (\omega - 1) \sum_{l>m}\sum_{i=1}^{V} \log \theta_{l,m,i},    (2.26)

subject to \sum_{i=1}^{V} \theta_{l,i} = 1 and \sum_{i=1}^{V} \theta_{l,m,i} = 1. Here we only consider the case ξ = ω.

2.3 Training Update Formula

In the training procedure, the goal is to find a parameter set Θ that optimizes the objective function. If a parameter set Θ

maximizes the objective function J(Θ; D), it seems plausible that the same parameter set will also predict the testing data well. The training procedure of PMMs, including APMM, resembles the Expectation-Maximization (EM) algorithm [3] [12]: both iteratively maximize the objective function to find the best parameter set. In each iteration, the algorithm generates a new parameter set by maximizing the objective function, and the generated parameter set is used in the next iteration until the sequence becomes stable. In addition, Lagrange multipliers are used to deal with the constraints in the optimization problems.

Both PMM1 and PMM2 can be considered special cases of APMM in which the θ_{l,m,i} parameters are restricted as follows:

1. PMM1: θ_{l,m,i} = (θ_{l,i} + θ_{m,i})/2 = (1/2)θ_{l,i} + (1/2)θ_{m,i}.
2. PMM2: θ_{l,m,i} = α_{l,m} θ_{l,i} + α_{m,l} θ_{m,i}.

Comparing the φ(y) functions (2.13), (2.17), and (2.22) for PMM1, PMM2, and APMM, we find that the parameterizations of PMM2 and APMM are almost the same apart from the above restriction. When it comes to optimization, however, APMM is in fact closer to PMM1 than to PMM2, because the θ_{l,m,i} of PMM2 consists of products of unknown parameters, a formulation that complicates the estimation problem. The objective functions of PMM1 and APMM, on the other hand, are very similar, so the good optimality and convergence properties of the updating formula for the former carry over to the latter. Thus, in the following, we first detail the updating formula and convergence properties of the optimization algorithm for PMM1, followed by those for APMM.
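Operationally, training any of the three models is the same fixed-point iteration: apply the model-specific update rule until the parameters stop changing. The skeleton below is a sketch, not the thesis's implementation; the `update` callable and the max-absolute-change stopping rule (the criterion used in Chapter III) are assumptions about how one might organize it.

```python
import numpy as np

def train(Theta0, update, X, Y, tol=1e-2, max_iter=1000):
    """Iterate Theta <- update(Theta, X, Y) until max |Theta_new - Theta| <= tol.

    update: callable implementing the model-specific rule (e.g. (2.43) for PMM1).
    Returns the final parameters and the number of iterations performed.
    """
    Theta = np.asarray(Theta0, dtype=float)
    for t in range(1, max_iter + 1):
        Theta_new = update(Theta, X, Y)
        if np.max(np.abs(Theta_new - Theta)) <= tol:   # stopping tolerance
            return Theta_new, t
        Theta = Theta_new
    return Theta, max_iter
```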

2.3.1 PMM1

At each iteration, the training procedure updates the parameter set Θ. Let Θ^t denote the parameter set found at the tth iteration. To relate Θ and Θ^t, we define an auxiliary function J(Θ | Θ^t) that equals J(Θ; D) in (2.16). Since the weights h^n_l θ^t_{l,i} / Σ_{l'} h^n_{l'} θ^t_{l',i} sum to one over l, we have

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'=1}^{L} h^n_{l'} \theta^t_{l',i}} \log \Bigl( \sum_{l'=1}^{L} h^n_{l'} \theta_{l',i} \Bigr) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}.    (2.27)

Splitting the logarithm, (2.27) becomes

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \Bigl[ \log\bigl(h^n_l \theta_{l,i}\bigr) - \log \frac{h^n_l \theta_{l,i}}{\sum_{l'} h^n_{l'} \theta_{l',i}} \Bigr] + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}.    (2.28)

To simplify the objective function, we decompose J(Θ | Θ^t) into three parts, J(Θ | Θ^t) = U(Θ | Θ^t) − T(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i}. According to (2.14), h^n_l is equal for all nonzero h_l of the same document n, and h^n_l = 0 if y^n_l = 0. The second term of J(Θ | Θ^t) is

    T(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \log \frac{h^n_l \theta_{l,i}}{\sum_{l'} h^n_{l'} \theta_{l',i}},    (2.29)

and the first term is

    U(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \log\bigl(h^n_l \theta_{l,i}\bigr).    (2.30)

In (2.16), the likelihood term L(Θ; D) can thus be written as U(Θ | Θ^t) − T(Θ | Θ^t). We first show that T(Θ | Θ^t) ≤ T(Θ^t | Θ^t) for all Θ. The inequality log x − log y ≤ x/y − 1 always holds for x, y > 0. The term h^n_l θ_{l,i} / Σ_{l'} h^n_{l'} θ_{l',i} is the weight of θ_{l,i} for labels with y^n_l > 0.

Write w_l = h^n_l θ_{l,i} / Σ_{l'} h^n_{l'} θ_{l',i} for this weight and w^t_l for the same quantity evaluated at Θ^t, so that the summand of (2.29) has the form w^t_l log w_l. Applying the inequality with x = w_l and y = w^t_l gives

    \log w_l - \log w^t_l \le \frac{w_l}{w^t_l} - 1,    (2.31)

that is,

    \log w_l \le \log w^t_l + \frac{w_l}{w^t_l} - 1.    (2.32)

Multiplying both sides by the positive value w^t_l, the inequality still holds. Summing over the index l on both sides and using Σ_l w_l = Σ_l w^t_l = 1, we further obtain

    \sum_{l=1}^{L} w^t_l \log w_l \le \sum_{l=1}^{L} w^t_l \log w^t_l.

The above inequality implies that

    T(\Theta \mid \Theta^t) \le T(\Theta^t \mid \Theta^t), \quad \forall\, \Theta.    (2.33)

Substituting this property into the objective function yields

    U(\Theta \mid \Theta^t) - T(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i} \ \ge\ U(\Theta \mid \Theta^t) - T(\Theta^t \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i}, \quad \forall\, \Theta,    (2.34)

where Θ is an arbitrary parameter set and Θ^t is fixed. Thus we can simply maximize U(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i} with respect to Θ to derive the parameter update formula:

    \Theta^{t+1} = \arg\max_{\Theta} \Bigl\{ U(\Theta \mid \Theta^t) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} \Bigr\}.    (2.35)

Since (ξ − 1) Σ_l Σ_i log θ_{l,i} is strictly concave and U(Θ | Θ^t) is a concave function, (2.35) maximizes a strictly concave problem, which has a unique optimal solution.

Applying this property and (2.33), from (2.34) we conclude that

    \Theta^{t+1} \ne \Theta^t \quad \text{if and only if} \quad J(\Theta^{t+1}) > J(\Theta^t).    (2.36)

Since the objective function is subject to the constraints Σ_{i=1}^{V} θ_{l,i} = 1, we use Lagrange multipliers to find the optimal θ_{l̄,ī}. Let θ_{l̄,ī} be any entry of the L × V matrix Θ. Differentiating the Lagrangian gives

    \frac{\partial}{\partial \theta_{\bar l,\bar i}} \Bigl( U(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i} - \lambda \bigl(\textstyle\sum_{i=1}^{V}\theta_{\bar l,i} - 1\bigr) \Bigr)
      = \sum_{n:\, h^n_{\bar l} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l}\,\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} \cdot \frac{1}{\theta_{\bar l,\bar i}} + (\xi - 1)\frac{1}{\theta_{\bar l,\bar i}} - \lambda.    (2.37)

The optimality condition is achieved when (2.37) equals 0 for all l̄, ī. Multiplying each side by θ_{l̄,ī}, the optimality condition becomes

    \sum_{n} x_{n,\bar i}\, \frac{h^n_{\bar l}\,\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1) - \theta_{\bar l,\bar i}\,\lambda = 0, \quad \forall\, \bar l, \bar i.    (2.38)

Therefore, the θ_{l̄,ī} that optimizes U(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i} under the constraint Σ_i θ_{l̄,i} = 1 at each iteration is

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1)}{\lambda}.    (2.39)

Moreover, since Σ_{ī=1}^{V} θ_{l̄,ī} = 1, summing both sides over all V vocabulary words gives

    \sum_{\bar i=1}^{V} \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1)}{\lambda} = 1.    (2.40)

Hence the Lagrange multiplier is

    \lambda = \sum_{\bar i=1}^{V} \sum_{n} x_{n,\bar i}\, \frac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + V(\xi - 1).    (2.41)

Based on (2.39) and (2.41), the parameters θ^{t+1}_{l̄,ī} are

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} h^n_{l}\theta^t_{l,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n} x_{n,i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} h^n_{l}\theta^t_{l,i}} + V(\xi - 1)}.    (2.42)
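The update (2.42) can be implemented directly; the sketch below uses the equivalent form in terms of the 0/1 labels y (the simplification carried out next as (2.43)). It is an illustrative vectorized sketch, not the thesis's program.

```python
import numpy as np

def pmm1_update(Theta, X, Y, xi=2.0):
    """One multiplicative update of PMM1, eq. (2.42)/(2.43).

    Theta: (L, V) current parameters (rows strictly positive, summing to 1),
    X: (N, V) word counts, Y: (N, L) 0/1 label matrix with at least one 1 per row.
    Returns the updated (L, V) matrix, each row renormalized to sum to 1.
    """
    mix = Y @ Theta                                             # (N, V): sum_l y^n_l theta^t_{l,i}
    resp = Y[:, :, None] * Theta[None, :, :] / mix[:, None, :]  # (N, L, V) label responsibilities
    numer = np.einsum("nv,nlv->lv", X, resp) + (xi - 1.0)       # numerator of (2.42), Laplace term included
    return numer / numer.sum(axis=1, keepdims=True)             # division by lambda, eq. (2.41)
```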

Since h_l(y) in PMM1 is a uniform mixing degree within a document, the nonzero h^n_l are all equal. Via this property, we can translate the effect of h to y, and (2.42) becomes

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} > 0} x_{n,\bar i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} > 0} x_{n,i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} y^n_{l}\theta^t_{l,i}} + V(\xi - 1)}, \quad \forall\, \bar l, \bar i.    (2.43)

Next, we show that these parameter updates converge to the maximum of the objective; that is, any limit point of the sequence generated by (2.43) is a global maximum of (2.16). As a first step, note from (2.36) that J(Θ; D) is strictly increasing along the sequence. To prove that any limit point Θ* of the sequence is an optimum, assume to the contrary that

    \Theta^{*} \text{ is not an optimum of } \max_{\Theta} J(\Theta; D),    (2.44)

that is, Θ* is not optimal for

    \max_{\Theta}\ U(\Theta \mid \Theta^{*}) - T(\Theta \mid \Theta^{*}) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i}.    (2.45)

From (2.44), Θ* does not satisfy

    \frac{\partial J(\Theta; D)}{\partial \theta_{l,i}}\Big|_{\Theta = \Theta^{*}} = 0, \quad \forall\, l, i.    (2.46)

Thus there exists at least one θ_{l,i} whose partial derivative is nonzero. Therefore, (2.43) can be applied once more to find a parameter set Θ^{*+1} with

    \Theta^{*+1} \ne \Theta^{*},    (2.47)

such that

    J(\Theta^{*+1}; D) > J(\Theta^{*}; D).    (2.48)

Since the update rule is a continuous function, every Θ^t in the sequence is updated to Θ^{t+1}, and the limit point of the updated sequence is

    \lim_{t\to\infty} \Theta^{t+1} = \Theta^{*+1}.    (2.49)

Thus we obtain

    \lim_{t\to\infty} J(\Theta^{t+1}; D) = J(\Theta^{*+1}; D) > J(\Theta^{*}; D),    (2.50)

which contradicts the fact that J(Θ*; D) ≥ J(Θ^t; D) for all t. Hence Θ* is a stationary point. Note that a stationary point may be only a saddle point; however, if the maximization of (2.16) is a concave programming problem, the stationary point is a global maximum. To check whether that is the case here, we examine the characteristics of the objective function. The objective function (2.16) is a sum of logarithms of the θ_{l,i} and is therefore a concave function, because logarithm functions are concave. Maximization of a concave function f can be regarded equivalently as minimization of the convex function −f, the scenario defined in a concave programming problem. Moreover, the feasible region defined by the parameter constraints is a convex set. As a result, the maximization of (2.16) is in fact a concave programming problem, and the stationary point Θ* is not only a local optimum but a global one. We conclude that any limit point of the sequence {Θ^t} generated by (2.43) is a global maximum of (2.16).

Example

To make the above derivation of the updating rule more transparent, we provide an example to illustrate its properties. Consider a simple problem with sample size one (N = 1), two possible labels (L = 2), and two words in the vocabulary (V = 2):

    N = 1, \quad L = 2, \quad V = 2, \quad y = [1, 1]^T, \quad x_1 = 1, \quad x_2 = 1, \quad \theta_{11} + \theta_{12} = 1, \quad \theta_{21} + \theta_{22} = 1.

The objective function of PMM1 for this example is

    \max\ \log\Bigl(\frac{\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\theta_{12} + \theta_{22}}{2}\Bigr)
    \quad \text{s.t.}\ \theta_{11} + \theta_{12} = 1,\ \theta_{21} + \theta_{22} = 1,\ 0 \le \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22},    (2.51)

where the penalty term is omitted for simplicity. This problem has only two constraints and four parameters, in other words two free parameters. Suppose that θ̃_{l,i} updates θ_{l,i} in one step of the updating sequence. For l = 1,

    \tilde\theta_{11} = \frac{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}},    (2.52)

    \tilde\theta_{12} = \frac{\dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.53)

To make sure that the updating rule always increases the objective value as it iteratively updates Θ, we have to show that

    \log\Bigl(\frac{\tilde\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\tilde\theta_{12} + \theta_{22}}{2}\Bigr) > \log\Bigl(\frac{\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\theta_{12} + \theta_{22}}{2}\Bigr),    (2.54)

or, in other words,

    \log\frac{\tilde\theta_{11} + \theta_{21}}{\theta_{11} + \theta_{21}} - \log\frac{\theta_{12} + \theta_{22}}{\tilde\theta_{12} + \theta_{22}} > 0.    (2.55)

Because the logarithm inequality log(1 + u) ≥ u/(1 + u) holds for u > −1, the left-hand side of (2.55) satisfies

    \log\frac{\tilde\theta_{11} + \theta_{21}}{\theta_{11} + \theta_{21}} - \log\frac{\theta_{12} + \theta_{22}}{\tilde\theta_{12} + \theta_{22}}
      = \log\Bigl(1 + \frac{\tilde\theta_{11} - \theta_{11}}{\theta_{11} + \theta_{21}}\Bigr) + \log\Bigl(1 + \frac{\tilde\theta_{12} - \theta_{12}}{\theta_{12} + \theta_{22}}\Bigr)
      \ge \frac{\tilde\theta_{11} - \theta_{11}}{\tilde\theta_{11} + \theta_{21}} + \frac{\tilde\theta_{12} - \theta_{12}}{\tilde\theta_{12} + \theta_{22}}
      = \frac{(\tilde\theta_{11} - \theta_{11})(\tilde\theta_{12} + \theta_{22}) + (\tilde\theta_{12} - \theta_{12})(\tilde\theta_{11} + \theta_{21})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})}
      = \frac{(\tilde\theta_{11} - \theta_{11})(\tilde\theta_{12} - \tilde\theta_{11} + \theta_{22} - \theta_{21})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})}
      = \frac{2\,(\tilde\theta_{11} - \theta_{11})(\theta_{22} - \tilde\theta_{11})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})},    (2.56)

where the last two equalities use the constraints \sum_{i} \theta_{l,i} = 1 for each l, namely θ_{11} + θ_{12} = θ̃_{11} + θ̃_{12} = 1 and θ_{21} + θ_{22} = 1. Because the denominator of (2.56) is always positive, it is sufficient to check whether its numerator is also positive. Let Δ = θ̃_{11} − θ_{11}; from (2.52),

    \Delta = \frac{\dfrac{\theta_{11}\theta_{12}}{\theta_{11}+\theta_{21}} - \dfrac{\theta_{11}\theta_{12}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.57)

If Δ > 0, then

    \theta_{12} + \theta_{22} > \theta_{11} + \theta_{21},    (2.58)

and, by the constraints of (2.51), (2.58) becomes

    \theta_{22} > \theta_{11}.    (2.59)

Moreover, the second factor of the numerator of (2.56) can be written as

    \theta_{22} - \tilde\theta_{11} = \theta_{22} - \frac{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}
      = \frac{-\dfrac{\theta_{11}\theta_{21}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}\theta_{22}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.60)

Since the denominator of (2.60) is positive, we examine the sign of its numerator. Using θ_{12} = 1 − θ_{11} and θ_{22} = 1 − θ_{21},

    \text{numerator of (2.60)} = -\frac{\theta_{11}\theta_{21}}{\theta_{11}+\theta_{21}} + \frac{(1-\theta_{11})(1-\theta_{21})}{(1-\theta_{11}) + (1-\theta_{21})}.    (2.61)

Again, the denominators in (2.61) are positive, so we may consider only the numerator obtained after clearing them:

    -\theta_{11}\theta_{21}(1-\theta_{11}) - \theta_{11}\theta_{21}(1-\theta_{21}) + (\theta_{11}+\theta_{21})(1-\theta_{11})(1-\theta_{21})
      = \theta_{21}(1-\theta_{11})(1-\theta_{21}-\theta_{11}) + \theta_{11}(1-\theta_{11}-\theta_{21})(1-\theta_{21})
      = (1-\theta_{11}-\theta_{21})\,[\theta_{21}(1-\theta_{11}) + \theta_{11}(1-\theta_{21})]
      = \bigl(1-(\theta_{11}+\theta_{21})\bigr)\,(\theta_{11}\theta_{22} + \theta_{21}\theta_{12}).    (2.62)

To sum up, in the case Δ > 0, the factor 1 − (θ_{11} + θ_{21}) = θ_{22} − θ_{11} is positive according to (2.59), and consequently θ_{22} − θ̃_{11} is also positive, so the numerator of (2.56) is positive and (2.55) holds. In the case Δ < 0, (2.55) can similarly be shown to hold. Thus (2.54) has been proved, indicating that the updating rule always increases the objective function (2.51).

2.3.2 APMM

Given the derivations of the previous section, the updating formulae for APMM become straightforward due to its strong similarity to PMM1. The major, and perhaps only, change needed in the derivation is that the first-order category-dependent term h_l θ_{l,i} is replaced by the duplicate-category term h_l h_m θ_{l,m,i}, and the summations run over duplicate categories, that is, Σ_{l=1}^{L} Σ_{m=1}^{L} instead of Σ_{l=1}^{L} as in PMM1. To avoid redundancy, we write out only the key equations leading to the updating formula. The auxiliary function J(Θ | Θ^t), which again

equals J(Θ; D) in (2.26), is

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \Bigl[ \log\bigl(h^n_l h^n_m \theta_{l,m,i}\bigr) - \log \frac{h^n_l h^n_m \theta_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta_{l',m',i}} \Bigr] + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L}\sum_{i=1}^{V} \log\theta_{l,m,i},    (2.63)

so that J(Θ | Θ^t) = U(Θ | Θ^t) − T(Θ | Θ^t) + (ξ − 1) Σ_l Σ_m Σ_i log θ_{l,m,i} with

    U(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \log\bigl(h^n_l h^n_m \theta_{l,m,i}\bigr),
    \qquad
    T(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \log \frac{h^n_l h^n_m \theta_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta_{l',m',i}}.    (2.64)

As before, it can be shown that the T term does not decrease the value of the objective function, so we can simply maximize

    U(\Theta \mid \Theta^t) + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L}\sum_{i=1}^{V} \log\theta_{l,m,i}    (2.65)

with respect to Θ to derive the parameter update formula. Let θ_{l̄,m̄,ī} be an element of the (L(L+1)/2) × V parameter matrix. The partial derivative of the Lagrangian with respect to θ_{l̄,m̄,ī} is

    \frac{\partial}{\partial \theta_{\bar l,\bar m,\bar i}} \Bigl( U(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l}\sum_{m}\sum_{i} \log\theta_{l,m,i} - \lambda\bigl(\textstyle\sum_{i}\theta_{\bar l,\bar m,i} - 1\bigr) \Bigr)
      = \sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l} h^n_{\bar m}\,\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} \cdot \frac{1}{\theta_{\bar l,\bar m,\bar i}} + (\xi - 1)\frac{1}{\theta_{\bar l,\bar m,\bar i}} - \lambda.    (2.66)

Setting the above partial derivative to zero yields

    \theta_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + (\xi - 1)}{\lambda}.    (2.67)

Since Σ_{i=1}^{V} θ_{l̄,m̄,i} = 1, we have

    \lambda = \sum_{\bar i=1}^{V} \sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + V(\xi - 1).    (2.68)

Substituting (2.68) into (2.67), the update rule becomes

    \theta_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',i}} + V(\xi - 1)}.    (2.69)

In other words, the updating formulae for θ_{l,l,i} = θ_{l,i} and for θ_{l,m,i} (l ≠ m) are, respectively,

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} > 0} x_{n,\bar i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} > 0} x_{n,i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} y^n_{l}\theta^t_{l,i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,i}} + V(\xi - 1)}    (2.70)

and

    \theta^{t+1}_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} y^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{2 y^n_{\bar l} y^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,\bar i}} + (\omega - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} y^n_{\bar m} > 0} x_{n,i}\, \dfrac{2 y^n_{\bar l} y^n_{\bar m}\theta^t_{\bar l,\bar m,i}}{\sum_{l} y^n_{l}\theta^t_{l,i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,i}} + V(\omega - 1)}.    (2.71)

As with PMM1, it can be shown that the updating sequence converges to an optimum. The objective function J(Θ; D) of APMM is again a concave function; by the same argument, the maximization of J(Θ; D) is a concave programming problem, and consequently the sequence converges to a unique global solution.

2.4 Prediction Method

Let x denote a new document and Θ̂ the optimal parameter set obtained in the training procedure. Applying Bayes' rule (2.3), the optimal class label vector of x is defined as

    y^* = \arg\max_{y} P(y \mid x, \hat\Theta),    (2.72)

where

    P(y \mid x, \hat\Theta) = \frac{P(y \mid \hat\Theta)\, P(x \mid y, \hat\Theta)}{P(x \mid \hat\Theta)}.    (2.73)

Since P(x | Θ̂) is unrelated to y and PMMs assume P(y) is independent of Θ, (2.73) becomes P(y | x, Θ̂) ∝ P(y) P(x | y, Θ̂). Under the uniform class prior assumption,

    P(y \mid x, \hat\Theta) \propto P(x \mid y, \hat\Theta),    (2.74)

which brings us back to the original model (2.15).

2.4.1 PMM1

According to (2.72), (2.74), and (2.15), the class label vector is estimated by

    y^* = \arg\max_{y} \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L} y_l\,\hat\theta_{l,i}}{\sum_{l'=1}^{L} y_{l'}} \Bigr)^{x_i}.    (2.75)

The class label vector y is a 0/1 vector. PMMs predict the class label vector via a greedy search. The prediction is accomplished by the following simple procedure:

Algorithm 1

1. Start with y = 0 and P^0_l = 0, l = 1, ..., L.
2. Repeat for t = 0, 1, ...:
   (a) Consider l = (t mod L) + 1.
       i. Set y' = y.
       ii. If y'_l = 0, set y'_l = 1 and compute

           P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{l'=1}^{L} y'_{l'}\,\hat\theta_{l',i}}{\sum_{l'=1}^{L} y'_{l'}} \Bigr).    (2.76)

   (b) If max_l(P^{t+1}_l) > max_l(P^t_l), then set l* = arg max_l(P^{t+1}_l) and y_{l*} = 1,
   until max_l(P^{t+1}_l) ≤ max_l(P^t_l).

Here θ̂_{l,i} is obtained from the training procedure. The greedy algorithm is quite efficient, as it needs at most L(L + 1)/2 evaluations to predict a new document x; since the multi-label problem has 2^L − 1 possible label combinations, an exhaustive search would be time consuming. Equation (2.76) in Algorithm 1 is the logarithm of (2.75). Since the logarithm is an increasing function and we only need to rank the P^{t+1}_l to find the maximum, using the logarithm does not change the result. Furthermore, the logarithm simplifies the calculation and lessens numerical error, since the product over words in (2.75) can be very small. Hence the logarithm is useful and commonly used to cope with this problem.
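The following Python sketch is one way to realize the greedy search of Algorithm 1; the helper names are chosen for this sketch, and the loop structure (trying every unset label each round rather than cycling with t mod L) is an equivalent reformulation, not the thesis's code.

```python
import numpy as np

def label_score(x, y, Theta_hat):
    """Log of (2.75): sum_i x_i * log( sum_l y_l theta_{l,i} / sum_l y_l ), cf. (2.76)."""
    y = np.asarray(y, dtype=float)
    return float(x @ np.log((y @ Theta_hat) / y.sum()))

def greedy_predict(x, Theta_hat):
    """Greedily add the label that most increases the score; stop when nothing helps."""
    L = Theta_hat.shape[0]
    y = np.zeros(L)
    best = -np.inf
    while True:
        candidates = [label_score(x, np.where(np.arange(L) == l, 1.0, y), Theta_hat)
                      if y[l] == 0 else -np.inf
                      for l in range(L)]
        l_star = int(np.argmax(candidates))
        if candidates[l_star] <= best:      # no single addition improves the score
            return y
        y[l_star] = 1.0
        best = candidates[l_star]
```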

2.4.2 APMM

The prediction procedure of APMM is similar to Algorithm 1. Since Θ̂ of APMM contains (L(L + 1)/2) × V elements, the class label vector is mapped to an L(L + 1)/2-dimensional space. Define D = L(L + 1)/2 as the dimension of the new class label space. The prediction is accomplished by the following procedure:

Algorithm 2

1. Start with y = 0 and P^0_l = 0, l = 1, ..., L.
2. Repeat for t = 0, 1, ...:
   (a) Consider l = (t mod L) + 1.
       i. Set y' = y.
       ii. If y'_l = 0, set y'_l = 1.
       iii. Map y' to y'_d, d = 1, ..., D (see Appendix B for details), and compute

            P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{d=1}^{D} y'_d\,\hat\theta_{d,i}}{\sum_{d'=1}^{D} y'_{d'}} \Bigr).    (2.77)

   (b) If max_l(P^{t+1}_l) > max_l(P^t_l), then set l* = arg max_l(P^{t+1}_l) and y_{l*} = 1,
   until max_l(P^{t+1}_l) ≤ max_l(P^t_l).

Since the class label vector y is mapped to an L(L + 1)/2-dimensional space, the original decision function

    P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{l} y_l\,\hat\theta_{l,i} + 2\sum_{l>m} y_l y_m\,\hat\theta_{l,m,i}}{\sum_{l} y_l + 2\sum_{l>m} y_l y_m} \Bigr)

can be represented as (2.77), so APMM uses a similar formula to predict the class labels. At each iteration, at most L values P^{t+1}_l are updated; therefore the prediction procedure for a new document x also needs at most L(L + 1)/2 evaluations.
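The thesis leaves the exact y-to-y_d mapping to Appendix B, which is not reproduced here. A mapping consistent with the decision function above and with (2.24) assigns weight y_l to each θ_l and weight 2 y_l y_m to each θ_{l,m} with l > m; the sketch below encodes that assumption.

```python
import numpy as np

def expand_labels(y):
    """Map a 0/1 label vector y of length L to D = L(L+1)/2 weights y_d:
    y_l for the category vectors theta_l, and 2*y_l*y_m (l > m) for the
    duplicate-category vectors theta_{l,m}.  This mapping is an assumption
    consistent with (2.24); the thesis's own mapping is given in its Appendix B."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    weights = list(y) + [2.0 * y[l] * y[m] for l in range(L) for m in range(l)]
    return np.array(weights)
```

With the estimated parameter vectors stacked in the same d order as a D × V matrix Theta_d (a naming assumption), the score (2.77) is then x @ np.log((y_d @ Theta_d) / y_d.sum()), just as in the PMM1 sketch.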

CHAPTER III

Experiments and Results

This chapter first describes the datasets used in the experiments; the real-world yahoo.com datasets were generated in [18]. We then present the three evaluation criteria used in the experiments, which are common ways to measure multi-label classification models. Finally, we compare our new model, APMM, with PMMs and discuss the results. We redo the experiments in [18] and construct a more efficient program to reduce the training and prediction time.

3.1 Data Description

The yahoo.com datasets are generated from real-world Web pages. There are 14 top-level categories, and each top-level category has several sub-categories, which we consider as second-level categories. A Web page associated with one top-level category is further classified into one or more sub-categories. These datasets were originally generated in [18], using GNU Wget to automatically collect the Web pages from yahoo.com; the details are shown in Table 3.1. Although there are 14 top-level categories, only 11 of them are used in the experiments. We consider each top-level category an independent problem, so we have 11 independent datasets in the experiments. The numbers of labels range from 21 to ... .

The vocabulary size varies from 21,924 to 52,350. Furthermore, the proportion of multi-label documents is 30%-45% in these 11 problems. The data we use are word frequencies, the feature type introduced in Section 2.1.

Table 3.1: Details of the yahoo.com Web page datasets. #Text is the number of texts in the dataset, #Vocab is the number of vocabulary words (i.e., features), #Topic is the number of topics, #Label is the number of labels, and Label-size frequency (%) is the relative frequency of each label size.

Dataset   #Text    #Vocab    #Topic   #Label   Label-size frequency (%)
Ar         7,484   23,...     ...      ...      ...
Bu        11,214   21,...     ...      ...      ...
Co        12,444   21,...     ...      ...      ...
Ed        12,030   27,...     ...      ...      ...
En        12,730   32,...     ...      ...      ...
He         9,205   30,...     ...      ...      ...

3.2 Evaluation Criteria

In the experiments, we use three measurements to evaluate the performance of PMM and APMM. These evaluation criteria are common in the multi-label classification field. In single-label classification problems, accuracy (i.e., the Exact Match ratio) is often used; in multi-label classification, however, the Exact Match ratio may not be the most suitable measure. Thus we also apply the Labeling F-measure and the Retrieval F-measure to evaluate the models. For all of them, larger is better.

3.2.1 Exact Match Ratio

This criterion counts how many samples are predicted exactly correctly by the model. Its formula is

    E_X = \frac{1}{N} \sum_{n=1}^{N} B[\hat y^n = y^n].    (3.1)

Here, ŷ^n denotes the predicted label vector of the nth document and y^n its true class label vector. B denotes the Boolean function, which takes the value 0 or 1 according to falsity or truth:

    B[\hat y^n = y^n] = \begin{cases} 1 & \text{if } \hat y^n_l = y^n_l\ \forall\, l, \\ 0 & \text{otherwise.} \end{cases}    (3.2)

3.2.2 Labeling F-measure

This is a partial-match ratio. Since it is difficult to obtain a high Exact Match ratio in multi-label classification problems, we can use partial measures to check how the model performs. The formulation is

    F_L = \frac{1}{N} \sum_{n=1}^{N} \frac{2\sum_{l=1}^{L} \hat y^n_l\, y^n_l}{\sum_{l=1}^{L} y^n_l + \sum_{l=1}^{L} \hat y^n_l}.    (3.3)

Furthermore, the F-measure can be defined as a combination of Precision (P) and Recall (R). Let the precision and recall of the nth document be

    P_n = \frac{\sum_{l} y^n_l\, \hat y^n_l}{\sum_{l} \hat y^n_l},    (3.4)

    R_n = \frac{\sum_{l} y^n_l\, \hat y^n_l}{\sum_{l} y^n_l}.    (3.5)

Precision and Recall are widely used in many fields, such as data mining and machine learning. The Precision ratio shows how precisely the model detects the true class labels, and the Recall ratio shows what percentage of the true class labels is detected.
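The document-level criteria above translate directly into a few lines of code; this is an illustrative sketch (the 0/1 matrix conventions are assumptions), not an official implementation.

```python
import numpy as np

def exact_match_ratio(Y_true, Y_pred):
    """E_X of (3.1)-(3.2): fraction of documents whose whole label vector is correct."""
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

def labeling_f_measure(Y_true, Y_pred):
    """F_L of (3.3), i.e. the mean over documents of 2*P_n*R_n/(P_n+R_n).

    Assumes every document has at least one true and one predicted label,
    so the denominator is never zero.
    """
    inter = np.sum(Y_true * Y_pred, axis=1)
    denom = np.sum(Y_true, axis=1) + np.sum(Y_pred, axis=1)
    return float(np.mean(2.0 * inter / denom))
```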

We have

    F_L = \frac{1}{N} \sum_{n=1}^{N} \frac{2\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l + \sum_{l} \hat y^n_l}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2\bigl(\sum_{l} y^n_l \hat y^n_l\bigr)^2}{\bigl(\sum_{l} y^n_l \hat y^n_l\bigr)\bigl(\sum_{l} y^n_l + \sum_{l} \hat y^n_l\bigr)}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2\,\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} \hat y^n_l}\cdot\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l}}{\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} \hat y^n_l} + \dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l}}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2 P_n R_n}{P_n + R_n}.

Consider an instance with L = 4, true label vector y = [1, 1, 0, 1]^T, and predicted label vector ŷ = [1, 0, 0, 1]^T. Its Precision is P = 2/2, its Recall is 2/3, and its F_L is 4/5.

3.2.3 Retrieval F-measure

This is a partial measure that evaluates performance label-wise. In the text categorization community, the Retrieval F-measure is called the macro average of F-measures:

    F_R = \frac{1}{L} \sum_{l=1}^{L} \frac{2\sum_{n=1}^{N} \hat y^n_l\, y^n_l}{\sum_{n=1}^{N} y^n_l + \sum_{n=1}^{N} \hat y^n_l}.    (3.6)
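Its label-wise counterpart, under the same conventions as the sketch above:

```python
import numpy as np

def retrieval_f_measure(Y_true, Y_pred):
    """F_R of (3.6): macro average over labels of 2*TP_l / (|true_l| + |pred_l|).

    A label that never occurs in either the truth or the predictions would give a
    zero denominator; such labels are assumed not to arise here.
    """
    tp = np.sum(Y_true * Y_pred, axis=0)            # per-label true positives
    denom = np.sum(Y_true, axis=0) + np.sum(Y_pred, axis=0)
    return float(np.mean(2.0 * tp / denom))
```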

3.3 Experimental Setting

Six of the 11 datasets are used in the following experiments. The training size of each dataset is 2,000, and the testing size of each dataset is ... . These are the original datasets used in [18]. Since the performances of PMM2 and PMM1 are close and [18] observed that PMM1 was better than PMM2 in their experiments, we implement the program of PMM1 (Appendix A) to compare against the proposed model, APMM. Furthermore, we set the constant (ξ − 1) = 1 in the penalty term.

Table 3.2: Single-label document prediction performance. Pr_s is the number of documents predicted as single-label, Co_s is the number of single-label documents that have been predicted correctly, and Co ratio is the ratio of single-label documents correctly predicted. (Columns: Dataset, single, then Pr_s, Co_s, Co ratio for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

3.4 Results and Discussions

Most of the exactly matched documents are single-label documents; indeed, Table 3.1 shows that single-label documents have the largest proportion. Table 3.3 shows that the number of iterations increases as the stopping tolerance decreases. It also shows that the training and testing time of APMM is only slightly larger than that of PMM, even though the class label y is mapped to the L(L+1)/2-dimensional space. Our stopping condition is |θ^{t+1}_{l,i} − θ^t_{l,i}| ≤ tolerance for all l, i. Tables 3.4 and 3.5 present the performances under the stopping tolerances 0.01 and ..., respectively. The differences in performance between Table 3.4 and Table 3.5 are quite small. Furthermore, no matter what initial Θ is given, the same stopping tolerance leads to similar numbers of iterations and similar performance. Figure 3.1 presents the relation between stopping tolerances and performances.

Table 3.3: Training and testing time of the models with different stopping tolerances. Since the numbers in this table are averages over several problems, the numbers of iterations have decimal points. (Columns: Stop tol, then #Iter, T_tr, T_te for each of PMM1 and APMM.)

Table 3.4: Performance with stopping tolerance 0.01, under the three evaluation criteria presented in Section 3.2. The Exact Match ratio of APMM is better than that of PMM1, but its Retrieval F-measure is lower than PMM1's; the Labeling F-measures of the two models are quite similar. (Columns: Dataset, then E_X, F_L, F_R for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

Clearly the performance is not affected much by the stopping tolerance. The convergence speed of the models is shown in Figure 3.2. Considering the prior P(Θ) in (2.10), we tried several values of (ξ − 1) in the experiments; the constants we tried range from 10 to ... . Figure 3.3 shows the three evaluation criteria versus the different values of (ξ − 1). It seems reasonable to set (ξ − 1) = 1, since the accuracy rates perform well over the range of (ξ − 1) from 0.1 to 2. Details for (ξ − 1) in this range are shown in Figure 3.4.

Table 3.5: Performance with a second stopping tolerance; the legend is the same as in Table 3.4. (Columns: Dataset, then E_X, F_L, F_R for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

Figure 3.1: The relation between stopping tolerances and performances (panels (a) Exact Match ratio, (b) Labeling F-measure, (c) Retrieval F-measure, each plotted against the stopping tolerance).

Figure 3.2: Number of iterations versus stopping tolerance.

Figure 3.3: The three evaluation criteria (panels (a) Exact Match ratio, (b) Labeling F-measure, (c) Retrieval F-measure) versus (ξ − 1).

Figure 3.4: The same criteria for (ξ − 1) from 0.1 to 2.

Furthermore, in analyzing the performance of the models we find something interesting: the strict criterion, the Exact Match ratio, is better for APMM than for PMM1, but the partial-match measures, the Labeling F-measure and the Retrieval F-measure, are not as good. To explore the cause, we decompose the datasets by label size and calculate Precision and Recall. We find that the predictions of APMM fall mainly on label size one, while the predicted proportions of PMM1 are relatively even. However, even though its predicted proportions are less even, the Precision and Recall of APMM can be better than those of PMM1. For example,

Table 3.6 shows that the number of documents predicted with label size two is smaller for APMM than for PMM1, but the Precision and Recall of APMM at label size two are larger than those of PMM1.

Table 3.6: Prediction accuracy for different label sizes. #label is the label size, num is the total number of documents of that label size in the dataset, Pr is the number of documents predicted with that label size, Co is the number correctly predicted, and Co ratio is the ratio correctly predicted. Since Table 3.1 shows that the frequencies of label sizes larger than 4 are relatively small, we combine the correctly-predicted ratios for label sizes greater than 4. (Columns: #label, num, then Pr, Co, Precision, Recall for each of PMM1 and APMM; the last row aggregates label sizes > 4.)


More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Machine Learning: Assignment 1

Machine Learning: Assignment 1 10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Homework 6: Image Completion using Mixture of Bernoullis

Homework 6: Image Completion using Mixture of Bernoullis Homework 6: Image Completion using Mixture of Bernoullis Deadline: Wednesday, Nov. 21, at 11:59pm Submission: You must submit two files through MarkUs 1 : 1. a PDF file containing your writeup, titled

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Logistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48

Logistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48 Logistic Regression Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, 2017 1 / 48 Outline 1 Administration 2 Review of last lecture 3 Logistic regression

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2. CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)

More information

Machine Learning: Logistic Regression. Lecture 04

Machine Learning: Logistic Regression. Lecture 04 Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Supervised Learning Task = learn an (unkon function t : X T that maps input

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Expectation maximization

Expectation maximization Expectation maximization Subhransu Maji CMSCI 689: Machine Learning 14 April 2015 Motivation Suppose you are building a naive Bayes spam classifier. After your are done your boss tells you that there is

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Jordan Boyd-Graber University of Colorado Boulder LECTURE 7 Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Jordan Boyd-Graber Boulder Support Vector Machines 1 of

More information

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

CS446: Machine Learning Fall Final Exam. December 6 th, 2016 CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

More information