Advanced Parametric Mixture Model for Multi-Label Text Categorization, by Tzu-Hsiang Kao


Advanced Parametric Mixture Model for Multi-Label Text Categorization

by Tzu-Hsiang Kao

A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science (Industrial Engineering) in National Taiwan University, 2006

ABSTRACT

This thesis studies Parametric Mixture Models (PMMs), efficient statistical models for the multi-label text categorization problem. Conventional machine learning approaches usually train a set of binary classifiers to make multi-label predictions. In contrast, PMMs handle multi-label text with a single statistical model. We propose an Advanced Parametric Mixture Model (APMM) based on PMMs. Its maximum likelihood estimation is a concave programming problem, and we design update rules whose iterations converge to a global maximum. Experiments on real-world yahoo.com datasets under three common multi-label classification measurements show that APMM is competitive.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTER
I. Introduction
II. Advanced Parametric Mixture Model
    2.1 Parametric Mixture Model
        2.1.1 PMM1
        2.1.2 PMM2
    2.2 Advanced Parametric Mixture Model
    2.3 Training Update Formula
        2.3.1 PMM1
        2.3.2 APMM
    2.4 Prediction Method
        2.4.1 PMM1
        2.4.2 APMM
III. Experiments and Results
    3.1 Data Description
    3.2 Evaluation Criteria
        3.2.1 Exact Match Ratio
        3.2.2 Labeling F-measure
        3.2.3 Retrieval F-measure
    3.3 Experimental Setting
    3.4 Results and Discussions
IV. Discussion and Conclusions

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

3.1 The relation between stopping tolerances and performances
3.2 Number of iterations versus stopping tolerance
3.3 The three evaluation criteria versus (ξ − 1)
3.4 The three evaluation criteria for (ξ − 1) from 0.1 to 2

LIST OF TABLES

3.1 Details of the yahoo.com Web page datasets
3.2 Single-label document prediction performance
3.3 Training and testing time of the models under different stopping tolerances
3.4 Performance with stopping tolerance 0.01
3.5 Performance with a second stopping tolerance (legend as in Table 3.4)
3.6 Prediction accuracy for different label sizes

CHAPTER I

Introduction

Multi-label classification has become a popular topic in recent years. Previous work in classification often focuses on binary or multi-class problems, but in many practical problems one sample can belong to several classes, so an instance may carry several class labels. This situation increasingly occurs in today's text categorization (i.e., text classification) problems. Since the number of online documents and newspaper articles (electronic files) is growing rapidly, the need for automatic categorization methods is rising. In this research, we propose the Advanced Parametric Mixture Model (APMM) to cope with the multi-label problem, and we use yahoo.com Web documents in the experiments to compare its performance with the state-of-the-art parametric mixture models.

The goal of text categorization [15] is to classify documents into a set of predefined categories. Text categorization began in the early '60s [9], but it did not become a major subfield of information technology until the early '90s. It has been applied in several contexts, ranging from document indexing based on a controlled vocabulary [13] [20] to document filtering [1], automated metadata generation [8] [5], word sense disambiguation [17], population of hierarchical catalogues of Web resources, and in general any application requiring document organization.

Several learning models have been used for text categorization, and some of them have

shown good results, for example the Naive Bayes classifier [7] [12], the k-nearest-neighbor (k-NN) classifier [10] [21], and the C4.5 decision tree [14] [7]. The Naive Bayes classifier uses a probabilistic model of text to estimate the probability that a document belongs to a category. The k-NN classifier is based on the assumption that an example is likely to belong to the class of the majority of its k nearest examples. Decision tree learners construct models by branching on instances according to the values of their features. However, the above models assume that the data sets are multi-class. In practice, it is possible that a document is associated with more than one label. [2] discussed the following phenomenon: when two human experts decide whether to classify a document under a category, they may disagree, and this in fact happens very often. For example, a news article on Clinton attending Dizzy Gillespie's funeral could be related to Politics, Jazz, both, or even neither. Such scenarios are increasingly common on the Internet today. Multi-label text classification assigns each document at least one label as its category, whereas multi-class text classification assigns a document exactly one class label.

A variety of models have been developed for multi-label text categorization. For example, Support Vector Machines (SVMs) [6] use a set of binary SVM classifiers to handle the multi-label classification problem, and BoosTexter, proposed in [16], extends AdaBoost [4] to handle multi-label text categorization. However, these two models cope with the multi-label problem only with binary classifiers. It is reasonable to expect that models which extract more information from the data will also make better predictions for future cases. By attempting to solve the multi-label problem with only a set of binary classifiers, these models might lose certain information embedded in the datasets, such as the probability of label co-occurrence.

Mixture models are a popular way to handle multi-label text classification problems; they take more factors into account than a set of binary classifiers. In the text categorization area, [18] proposed parametric mixture models (PMMs) to categorize multi-label documents. They proposed two parametric mixture models: one can be considered a first-order approximation, PMM1, and the other is a second-order model, PMM2. They found that the difference in accuracy between the two models is very minor, and that the Exact Match ratio of PMMs is very low. To improve the prediction accuracy, we propose the Advanced Parametric Mixture Model (APMM). APMM extends the spirit of PMMs by introducing additional parameters into the model to account for dependencies among the labels. The experiments use the yahoo.com text data, and APMM is shown to achieve a higher Exact Match ratio than PMMs.

In this thesis, we first review the related work on PMMs and introduce their extension, the Advanced Parametric Mixture Model, including its objective function, training procedure, and prediction method, in Chapter II. Chapter III contains the dataset description and the experimental results obtained with three common multi-label classification measurements. Finally, we conclude this work in Chapter IV.

CHAPTER II

Advanced Parametric Mixture Model

For documents with multiple topics, such as Web pages, [18] proposes parametric mixture models (PMMs) and shows that they outperform some conventional methods. Traditionally, a multi-label classification problem is decomposed into many binary problems and solved by training several classifiers, each handling one binary problem. In contrast, PMMs handle multi-label classification directly with a single probability model. In this chapter, we first review the existing parametric mixture models, PMM1 and PMM2 [18]. After a detailed derivation of PMM1, an example is given to illustrate its training procedure. We then propose a new model called the Advanced Parametric Mixture Model (APMM).

2.1 Parametric Mixture Model

In the notation of PMMs, N denotes the total number of documents and d^n represents the nth document. V is the vocabulary set containing all words appearing in all documents; it can be written w_1, ..., w_V, where w_i is the ith word in the vocabulary V. The input data are represented by Bag-of-Words (BOW) [17], a frequency-based representation. It is assumed that each word is independent of the others; therefore, BOW ignores the order of the words and, no matter how the words are combined, considers only the word frequencies (a.k.a. term frequencies) of the documents.
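As a concrete illustration of the BOW representation just described, the sketch below builds word-frequency vectors from tokenized documents. It is a minimal sketch under stated assumptions (the tokenizer and vocabulary construction are not specified in the thesis), not the thesis's own code.

```python
import numpy as np

def build_bow(documents):
    """Turn tokenized documents into word-frequency (term-frequency) vectors.

    documents: list of token lists, e.g. [["jazz", "funeral", "jazz"], ...]
    Returns (X, vocab) where X[n, i] counts how often vocab[i] occurs in document n.
    """
    vocab = sorted({w for doc in documents for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(documents), len(vocab)))
    for n, doc in enumerate(documents):
        for w in doc:              # word order is discarded; only counts are kept
            X[n, index[w]] += 1.0
    return X, vocab
```

Each row of X plays the role of the word-frequency vector x^n defined below.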

Although taking the combinations of words into account might be more accurate, a basic model that ignores word order is efficient and serves as a good starting point. Under the BOW framework, d^n can be represented as a word-frequency vector x^n = [x^n_1, ..., x^n_V]^T, where x^n_i denotes the frequency of word w_i in d^n. The labels of document d^n are represented by a category vector y^n = [y^n_1, ..., y^n_L]^T, where L is the total number of predefined categories. The given training data consist of N samples, denoted D = {(x^n, y^n)}_{n=1}^N. Each y^n_l is a Boolean value (0 or 1): y^n_l = 1 means that the nth document belongs to the lth category, and y^n_l = 0 means that it does not. In practice we categorize a document into one or more categories, so we add the constraint Σ_l y_l > 0 to ensure that every document has at least one category.

PMMs consider that each word in the vocabulary has a probability related to each class, so the class-dependent probabilities can be parameterized. In the multi-class, single-label classification problem, the probability of a document x in the lth category can be written as a multinomial distribution:

    P(x \mid l) \propto \prod_{i=1}^{V} (\theta_{l,i})^{x_i}, \quad \text{where } \theta_{l,i} \ge 0 \text{ and } \sum_{i=1}^{V} \theta_{l,i} = 1.    (2.1)

Here θ_{l,i} is the probability that the ith word w_i occurs in the lth class. The multi-class, multi-label classification problem is then generalized as

    P(x \mid y) \propto \prod_{i=1}^{V} (\varphi_i(y))^{x_i}, \quad \text{where } \varphi_i(y) \ge 0 \text{ and } \sum_{i=1}^{V} \varphi_i(y) = 1,    (2.2)

where φ_i(y) is a class-dependent probability that the ith word w_i occurs in class y.

If we consider every possible combination of y, the total number of label classes is 2^L − 1. Therefore, if L is large, the number of possible multi-label classes is huge and the model becomes inefficient. Thus we try to parameterize them efficiently, using a reasonable number of parameters.

Let Θ denote the vector of unknown parameters of φ_i(y) in (2.2). Bayes' rule states that

    P(A, B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B),    (2.3)

and

    P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)\,P(B \mid A)}{P(B)},    (2.4)

where P(A, B) (also written P(A ∩ B)) is the probability that both A and B occur. Based on (2.3) and (2.4), we can obtain the posterior P(Θ | D), i.e., P(Θ | x, y), as

    P(\Theta \mid x, y) = \frac{P(\Theta)\,P(x, y \mid \Theta)}{P(x, y)} = \frac{P(\Theta)\,P(y \mid \Theta)\,P(x \mid y, \Theta)}{P(x, y)}.    (2.5)

PMMs assume that y is independent of Θ, so P(y | Θ) equals P(y); this assumption simplifies the problem. Since P(y) and P(x, y) are not functions of the parameter set Θ of interest and serve only as normalizing constants, (2.5) can be further simplified to

    P(\Theta \mid x, y) \propto P(\Theta)\,P(x \mid y, \Theta),    (2.6)

where P(Θ) is the prior over the parameters. Given the posterior P(Θ | x, y), the Maximum A Posteriori (MAP) estimate is

    \hat\Theta_{\mathrm{map}} = \arg\max_{\Theta} \sum_{n=1}^{N} \log P(x^n \mid y^n, \Theta) + \log P(\Theta).    (2.7)

The goal of the model is to iterate toward \hat\Theta_{\mathrm{map}}, the optimal parameter set for prediction.

The text mining field makes extensive use of the Dirichlet distribution to describe text datasets [11] [19]. PMMs use Dirichlet distributions as conjugate priors over the parameter vectors θ in Θ, since the Dirichlet distribution is the conjugate prior of the parameters of a multinomial distribution. The probability density function of the Dirichlet distribution for θ_m is

    p(\theta_m) = \mathrm{Dirichlet}(\theta_m; \xi) = \frac{1}{Z(\xi)} \prod_{i=1}^{V} \theta_{m,i}^{\xi_i - 1},    (2.8)

where the normalization constant is

    Z(\xi) = \frac{\prod_{i=1}^{V} \Gamma(\xi_i)}{\Gamma\bigl(\sum_{i=1}^{V} \xi_i\bigr)},    (2.9)

with the parameters satisfying \sum_{i=1}^{V} \theta_{m,i} = 1 and \xi_i - 1 > 0. PMMs assume every ξ_i = 2 and denote the common value by ξ, which is equivalent to Laplace smoothing; every normalization constant is then equal. Since our goal is to maximize (2.7), we can ignore Z(ξ). The Dirichlet prior p(θ_m) above is the prior of the mth parameter vector in Θ; consequently, the conjugate prior over the whole parameter set can be written as

    p(\Theta) \propto \prod_{m=1}^{M} \prod_{i=1}^{V} \theta_{m,i}^{\xi - 1},    (2.10)

where M is the number of parameter vectors in Θ.

Laplace smoothing gives each probability an equal initial value: every unobserved event is assigned an equal positive probability. Laplace smoothing prevents any probability estimate from being zero, which is necessary because zero probabilities may make the model too sensitive. Using the conjugate Dirichlet priors for Θ, the objective function of PMMs is given by

    J(\Theta; D) = L(\Theta; D) + (\xi - 1) \sum_{m=1}^{M} \sum_{i=1}^{V} \log \theta_{m,i},    (2.11)

where

    L(\Theta; D) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \varphi_i(y^n).    (2.12)
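To make the role of the (ξ − 1) term concrete, the following sketch evaluates an objective of the form (2.11)-(2.12) for a generic parameterization φ. It is an illustrative sketch, not the thesis's code; the function name and array conventions are assumptions.

```python
import numpy as np

def pmm_objective(X, Y, Theta, phi, xi=2.0):
    """J(Theta; D) = sum_n sum_i x_{n,i} log phi_i(y^n) + (xi - 1) sum_{m,i} log theta_{m,i}.

    X: (N, V) word-frequency matrix, Y: (N, L) 0/1 label matrix,
    Theta: (M, V) parameter vectors, each row strictly positive and summing to 1,
    phi:  callable mapping a label vector y^n to a length-V probability vector.
    """
    log_lik = sum(float(X[n] @ np.log(phi(Y[n]))) for n in range(X.shape[0]))
    log_prior = (xi - 1.0) * float(np.sum(np.log(Theta)))   # Dirichlet (Laplace) term
    return log_lik + log_prior
```

With ξ = 2, the prior term simply adds Σ log θ_{m,i}, which is why it acts like add-one (Laplace) smoothing in the closed-form updates derived later.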

2.1.1 PMM1

In [18], PMM1 is regarded as a special case of PMM2: PMM1 is a first-order approximation model, while PMM2 is a second-order one. PMM1 always converges to the global optimum by iteratively updating the parameter set, and it achieved similar, even slightly better, match-rate accuracy than PMM2. In the following, we briefly go through the interior of PMM1.

Let Θ = [θ_1, θ_2, ..., θ_L] with θ_l = [θ_{l,1}, ..., θ_{l,V}]^T. That is, the number of parameter vectors in Θ equals the number of categories L, and we therefore call the θ_l category-dependent. This assumption implies that φ(y) (= [φ_1(y), ..., φ_V(y)]^T) can be represented by the parametric mixture

    \varphi(y) = \sum_{l=1}^{L} h_l(y)\,\theta_l, \quad \text{where } h_l(y) = 0 \text{ for } l \text{ such that } y_l = 0.    (2.13)

Since y = [y_1, ..., y_L]^T, we only have to estimate the L parameter vectors θ_1, ..., θ_L instead of 2^L − 1 parameter vectors. Using h_l(y) as the weight of each θ_l, we obtain the multi-label parameter vector φ(y) via Σ_l h_l(y) θ_l. The mixing proportion h_l represents the degree to which x belongs to the lth category, with Σ_l h_l(y) = 1; clearly, h^n_l is a weight for combining the θ's. PMM1 assumes this degree is uniform over the categories, h_l(y) = y_l / Σ_{l'=1}^{L} y_{l'}, that is,

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.14)

For document n, the nonzero h^n_l are all equal.
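Under (2.13)-(2.14), φ(y) is simply the average of the θ_l whose labels are active; a minimal sketch (array names are assumptions, not the thesis's code):

```python
import numpy as np

def phi_pmm1(y, Theta):
    """phi(y) = sum_l h_l(y) * theta_l with h_l(y) = y_l / sum_l y_l  (2.13)-(2.14).

    y: (L,) 0/1 label vector with at least one 1, Theta: (L, V) with rows theta_l.
    """
    y = np.asarray(y, dtype=float)
    return (y @ Theta) / y.sum()      # average of the rows of Theta selected by y
```

For L = 4 and y = [1, 0, 1, 1], this returns (θ_1 + θ_3 + θ_4)/3, which is exactly the example worked out next.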

For example, in the case of L = 4, if y^n = [1, 0, 1, 1]^T then φ([1, 0, 1, 1]^T) = (θ_1 + θ_3 + θ_4)/3, and if y^n = [1, 0, 0, 1]^T then φ([1, 0, 0, 1]^T) = (θ_1 + θ_4)/2. Consequently, P(x | y, Θ) under PMM1 becomes

    P(x \mid y, \Theta) \propto \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L} h_l(y)\,\theta_{l,i} \Bigr)^{x_i} = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L} y_l\,\theta_{l,i}}{\sum_{l'=1}^{L} y_{l'}} \Bigr)^{x_i}.    (2.15)

Accordingly, the objective function of PMM1 is given by

    J(\Theta; D) = L(\Theta; D) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}
                 = \sum_{n=1}^{N}\sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L} h^n_l\,\theta_{l,i} + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i},    (2.16)

subject to the constraints \sum_{i=1}^{V} \theta_{l,i} = 1, l = 1, ..., L.

2.1.2 PMM2

PMM2 is a more flexible model in which duplicate-category parameter vectors θ_{l,m} are also used to approximate φ(y):

    \varphi(y) = \sum_{l=1}^{L} \sum_{m=1}^{L} h_l(y)\,h_m(y)\,\theta_{l,m},    (2.17)

where h_l(y) = 0 for l such that y_l = 0 and θ_{l,m} = α_{l,m} θ_l + α_{m,l} θ_m. Here α_{l,m} is a non-negative bias parameter satisfying α_{l,m} + α_{m,l} = 1 for all l, m; clearly α_{l,l} = 0.5 and θ_{l,m} = θ_{m,l}. Since y = [y_1, ..., y_L]^T, we now have to estimate the parameter vectors θ_1, ..., θ_L and the additional L(L − 1)/2 parameters α_{l,m}, instead of 2^L − 1 parameter vectors. As in PMM1, the mixing proportion h_l for PMM2 is assumed to be

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.18)

Using the product h_l(y) h_m(y) as the weight of each θ_{l,m}, we obtain the multi-label parameter vector φ(y) via (2.17). For example, in the case of L = 4, if

y^n = [1, 0, 1, 1]^T, then

    \varphi([1,0,1,1]^T) = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + \theta_{1,3} + \theta_{3,1} + \theta_{1,4} + \theta_{4,1} + \theta_{3,4} + \theta_{4,3}]/9
                        = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + 2\theta_{1,3} + 2\theta_{1,4} + 2\theta_{3,4}]/9
                        = [\{1 + 2(\alpha_{1,3} + \alpha_{1,4})\}\theta_1 + \{1 + 2(\alpha_{3,1} + \alpha_{3,4})\}\theta_3 + \{1 + 2(\alpha_{4,1} + \alpha_{4,3})\}\theta_4]/9
                        = [\{1 + 2(\alpha_{1,3} + \alpha_{1,4})\}\theta_1 + \{3 + 2(\alpha_{3,4} - \alpha_{1,3})\}\theta_3 + \{5 - 2(\alpha_{1,4} + \alpha_{3,4})\}\theta_4]/9.

In addition, for y^n = [1, 0, 0, 1]^T,

    \varphi([1,0,0,1]^T) = [\theta_{1,1} + \theta_{4,4} + 2\theta_{1,4}]/4 = [(1 + 2\alpha_{1,4})\theta_1 + (3 - 2\alpha_{1,4})\theta_4]/4.

Consequently, P(x | y, Θ) under PMM2 becomes

    P(x \mid y, \Theta) = \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L}\sum_{m=1}^{L} h_l(y)h_m(y)\,(\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i}) \Bigr)^{x_i}    (2.19)
                        \propto \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L}\sum_{m=1}^{L} y_l y_m (\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i})}{\sum_{l'=1}^{L} y_{l'} \sum_{m'=1}^{L} y_{m'}} \Bigr)^{x_i}.    (2.20)

Accordingly, the objective function of PMM2 is given by

    J(\Theta; D) = \sum_{n=1}^{N}\sum_{i=1}^{V} x_{n,i} \log \sum_{l=1}^{L}\sum_{m=1}^{L} h^n_l h^n_m (\alpha_{l,m}\theta_{l,i} + \alpha_{m,l}\theta_{m,i})
                 + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L} \log \alpha_{l,m},    (2.21)

subject to the constraints \sum_{i=1}^{V} \theta_{l,i} = 1, l = 1, ..., L, and \alpha_{l,m} + \alpha_{m,l} = 1.

2.2 Advanced Parametric Mixture Model

As mentioned above, PMM1 is regarded as a special case of PMM2, where PMM1 is a first-order approximation model and PMM2 a second-order one. However, the seemingly second-order duplicate-category parameter

vectors of PMM2 are in fact formulated as linear combinations of the category-dependent parameter vectors. Here we extend PMM1 of Section 2.1.1 to allow free duplicate-category parameter vectors and thus produce a better second-order approximation model. The new parametric mixture model is called the Advanced Parametric Mixture Model, or APMM in short. In APMM, φ(y) is formulated as

    \varphi(y) = \sum_{l=1}^{L} \sum_{m=1}^{L} h_l(y)\,h_m(y)\,\theta_{l,m},    (2.22)

where h_l(y) = 0 for l satisfying y_l = 0. The main difference between APMM and PMM2 is that θ_{l,m} is restricted to be a weighted average of θ_l and θ_m in PMM2, whereas it is free in APMM. In addition, we assume the symmetry θ_{l,m} = θ_{m,l} and denote θ_{l,l} by θ_l. Since y = [y_1, ..., y_L]^T, we now have to estimate the L(L + 1)/2 parameter vectors {θ_l, θ_{l,m}}_{l,m = 1,...,L; l < m} instead of 2^L − 1 parameter vectors. The mixing proportion h_l for APMM is, as in PMM1 and PMM2, assumed to be

    h^n_l = \begin{cases} 1 / \sum_{l'=1}^{L} y^n_{l'} & \text{if } y^n_l = 1, \\ 0 & \text{otherwise.} \end{cases}    (2.23)

Using the product h_l(y) h_m(y) as the weight of each θ_{l,m}, we obtain the multi-label parameter vector φ(y) via (2.22). For example, in the case of L = 4, if y^n = [1, 0, 1, 1]^T,

    \varphi([1,0,1,1]^T) = [\theta_{1,1} + \theta_{3,3} + \theta_{4,4} + \theta_{1,3} + \theta_{3,1} + \theta_{1,4} + \theta_{4,1} + \theta_{3,4} + \theta_{4,3}]/9
                        = [\theta_1 + \theta_3 + \theta_4 + 2\theta_{1,3} + 2\theta_{1,4} + 2\theta_{3,4}]/9.
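The corresponding computation for APMM keeps every duplicate-category vector as a free parameter. Storing them in an (L, L, V) array with Theta2[l, m] = Theta2[m, l] and Theta2[l, l] = θ_l is a convention of this sketch, not something the thesis prescribes.

```python
import numpy as np

def phi_apmm(y, Theta2):
    """phi(y) = sum_{l,m} h_l(y) h_m(y) theta_{l,m}  (2.22), with h as in (2.23).

    y: (L,) 0/1 label vector with at least one 1,
    Theta2: (L, L, V), symmetric in its first two axes; Theta2[l, l] holds theta_l.
    """
    y = np.asarray(y, dtype=float)
    h = y / y.sum()                               # mixing proportions (2.23)
    return np.einsum("l,m,lmv->v", h, h, Theta2)
```

For the example above, y = [1, 0, 1, 1] gives (θ_1 + θ_3 + θ_4 + 2θ_{1,3} + 2θ_{1,4} + 2θ_{3,4})/9, matching the hand expansion just shown.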

Consequently, P(x | y, Θ) under APMM can be formulated as

    P(x \mid y, \Theta) = \prod_{i=1}^{V} \Bigl( \sum_{l=1}^{L}\sum_{m=1}^{L} h_l(y) h_m(y)\,\theta_{l,m,i} \Bigr)^{x_i}
                        = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L}\sum_{m=1}^{L} y_l y_m\,\theta_{l,m,i}}{\sum_{l'=1}^{L} y_{l'} \sum_{m'=1}^{L} y_{m'}} \Bigr)^{x_i}
                        = \prod_{i=1}^{V} \Bigl( \frac{\sum_{l} y_l \theta_{l,i} + 2\sum_{l>m} y_l y_m \theta_{l,m,i}}{\sum_{l} y_l + 2\sum_{l>m} y_l y_m} \Bigr)^{x_i}.    (2.24)

APMM also uses Dirichlet distributions as priors, conjugate with P(x | y, Θ), over the parameter vectors θ in Θ:

    P(\Theta) \propto \Bigl( \prod_{l=1}^{L} \prod_{i=1}^{V} \theta_{l,i}^{\xi - 1} \Bigr) \Bigl( \prod_{l>m} \prod_{i=1}^{V} \theta_{l,m,i}^{\omega - 1} \Bigr),    (2.25)

where ξ and ω are the hyperparameters of the prior for the category-dependent parameter vectors θ_l and the duplicate-category ones θ_{l,m}, respectively. Combining the logarithms of (2.24) and (2.25), we obtain the objective function of APMM:

    J(\Theta; D) = \sum_{n=1}^{N} \log P(x^n \mid y^n, \Theta) + \log P(\Theta)
                 = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \log \frac{\sum_{l} y^n_l \theta_{l,i} + 2\sum_{l>m} y^n_l y^n_m \theta_{l,m,i}}{\sum_{l} y^n_l + 2\sum_{l>m} y^n_l y^n_m}
                   + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} + (\omega - 1) \sum_{l>m}\sum_{i=1}^{V} \log \theta_{l,m,i},    (2.26)

subject to \sum_{i=1}^{V} \theta_{l,i} = 1 and \sum_{i=1}^{V} \theta_{l,m,i} = 1. Here we only consider the case ξ = ω.

2.3 Training Update Formula

In the training procedure, the goal is to find a parameter set Θ that optimizes the objective function. If a parameter set Θ

maximizes the objective function J(Θ; D), it seems plausible that the same parameter set will also predict the testing data well. The training procedure of PMMs, including APMM, resembles the Expectation-Maximization (EM) algorithm [3] [12]: both iteratively maximize the objective function to find the best parameter set. In each iteration, the algorithm generates a new parameter set by maximizing the objective function, and the generated parameter set is used in the next iteration until the sequence becomes stable. In addition, Lagrange multipliers are used to deal with the constraints in the optimization problems.

Both PMM1 and PMM2 can be considered special cases of APMM in which the θ_{l,m,i} parameters are restricted as follows:

1. PMM1: θ_{l,m,i} = (θ_{l,i} + θ_{m,i})/2 = (1/2)θ_{l,i} + (1/2)θ_{m,i}.
2. PMM2: θ_{l,m,i} = α_{l,m} θ_{l,i} + α_{m,l} θ_{m,i}.

Comparing the φ(y) functions (2.13), (2.17), and (2.22) for PMM1, PMM2, and APMM, we find that the parameterizations of PMM2 and APMM are almost the same apart from the above restriction. When it comes to optimization, however, APMM is in fact closer to PMM1 than to PMM2, because the θ_{l,m,i} of PMM2 consists of products of unknown parameters, a formulation that complicates the estimation problem. The objective functions of PMM1 and APMM, on the other hand, are very similar, so the good optimality and convergence properties of the updating formula for the former carry over to the latter. Thus, in the following, we first detail the updating formula and convergence properties of the optimization algorithm for PMM1, followed by those for APMM.
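Operationally, training any of the three models is the same fixed-point iteration: apply the model-specific update rule until the parameters stop changing. The skeleton below is a sketch, not the thesis's implementation; the `update` callable and the max-absolute-change stopping rule (the criterion used in Chapter III) are assumptions about how one might organize it.

```python
import numpy as np

def train(Theta0, update, X, Y, tol=1e-2, max_iter=1000):
    """Iterate Theta <- update(Theta, X, Y) until max |Theta_new - Theta| <= tol.

    update: callable implementing the model-specific rule (e.g. (2.43) for PMM1).
    Returns the final parameters and the number of iterations performed.
    """
    Theta = np.asarray(Theta0, dtype=float)
    for t in range(1, max_iter + 1):
        Theta_new = update(Theta, X, Y)
        if np.max(np.abs(Theta_new - Theta)) <= tol:   # stopping tolerance
            return Theta_new, t
        Theta = Theta_new
    return Theta, max_iter
```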

2.3.1 PMM1

At each iteration, the training procedure updates the parameter set Θ. Let Θ^t denote the parameter set found at the tth iteration. To relate Θ and Θ^t, we define an auxiliary function J(Θ | Θ^t) that equals J(Θ; D) in (2.16). Since the weights h^n_l θ^t_{l,i} / Σ_{l'} h^n_{l'} θ^t_{l',i} sum to one over l, we have

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'=1}^{L} h^n_{l'} \theta^t_{l',i}} \log \Bigl( \sum_{l'=1}^{L} h^n_{l'} \theta_{l',i} \Bigr) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}.    (2.27)

Splitting the logarithm, (2.27) becomes

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \Bigl[ \log\bigl(h^n_l \theta_{l,i}\bigr) - \log \frac{h^n_l \theta_{l,i}}{\sum_{l'} h^n_{l'} \theta_{l',i}} \Bigr] + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i}.    (2.28)

To simplify the objective function, we decompose J(Θ | Θ^t) into three parts, J(Θ | Θ^t) = U(Θ | Θ^t) − T(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i}. According to (2.14), h^n_l is equal for all nonzero h_l of the same document n, and h^n_l = 0 if y^n_l = 0. The second term of J(Θ | Θ^t) is

    T(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \log \frac{h^n_l \theta_{l,i}}{\sum_{l'} h^n_{l'} \theta_{l',i}},    (2.29)

and the first term is

    U(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L} \frac{h^n_l \theta^t_{l,i}}{\sum_{l'} h^n_{l'} \theta^t_{l',i}} \log\bigl(h^n_l \theta_{l,i}\bigr).    (2.30)

In (2.16), the likelihood term L(Θ; D) can thus be written as U(Θ | Θ^t) − T(Θ | Θ^t). We first show that T(Θ | Θ^t) ≤ T(Θ^t | Θ^t) for all Θ. The inequality log x − log y ≤ x/y − 1 always holds for x, y > 0. The term h^n_l θ_{l,i} / Σ_{l'} h^n_{l'} θ_{l',i} is the weight of θ_{l,i} for labels with y^n_l > 0.

Write w_l = h^n_l θ_{l,i} / Σ_{l'} h^n_{l'} θ_{l',i} for this weight and w^t_l for the same quantity evaluated at Θ^t, so that the summand of (2.29) has the form w^t_l log w_l. Applying the inequality with x = w_l and y = w^t_l gives

    \log w_l - \log w^t_l \le \frac{w_l}{w^t_l} - 1,    (2.31)

that is,

    \log w_l \le \log w^t_l + \frac{w_l}{w^t_l} - 1.    (2.32)

Multiplying both sides by the positive value w^t_l, the inequality still holds. Summing over the index l on both sides and using Σ_l w_l = Σ_l w^t_l = 1, we further obtain

    \sum_{l=1}^{L} w^t_l \log w_l \le \sum_{l=1}^{L} w^t_l \log w^t_l.

The above inequality implies that

    T(\Theta \mid \Theta^t) \le T(\Theta^t \mid \Theta^t), \quad \forall\, \Theta.    (2.33)

Substituting this property into the objective function yields

    U(\Theta \mid \Theta^t) - T(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i} \ \ge\ U(\Theta \mid \Theta^t) - T(\Theta^t \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i}, \quad \forall\, \Theta,    (2.34)

where Θ is an arbitrary parameter set and Θ^t is fixed. Thus we can simply maximize U(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i} with respect to Θ to derive the parameter update formula:

    \Theta^{t+1} = \arg\max_{\Theta} \Bigl\{ U(\Theta \mid \Theta^t) + (\xi - 1) \sum_{l=1}^{L}\sum_{i=1}^{V} \log \theta_{l,i} \Bigr\}.    (2.35)

Since (ξ − 1) Σ_l Σ_i log θ_{l,i} is strictly concave and U(Θ | Θ^t) is a concave function, (2.35) maximizes a strictly concave problem, which has a unique optimal solution.

Applying this property and (2.33), from (2.34) we conclude that

    \Theta^{t+1} \ne \Theta^t \quad \text{if and only if} \quad J(\Theta^{t+1}) > J(\Theta^t).    (2.36)

Since the objective function is subject to the constraints Σ_{i=1}^{V} θ_{l,i} = 1, we use Lagrange multipliers to find the optimal θ_{l̄,ī}. Let θ_{l̄,ī} be any entry of the L × V matrix Θ. Differentiating the Lagrangian gives

    \frac{\partial}{\partial \theta_{\bar l,\bar i}} \Bigl( U(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i} - \lambda \bigl(\textstyle\sum_{i=1}^{V}\theta_{\bar l,i} - 1\bigr) \Bigr)
      = \sum_{n:\, h^n_{\bar l} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l}\,\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} \cdot \frac{1}{\theta_{\bar l,\bar i}} + (\xi - 1)\frac{1}{\theta_{\bar l,\bar i}} - \lambda.    (2.37)

The optimality condition is achieved when (2.37) equals 0 for all l̄, ī. Multiplying each side by θ_{l̄,ī}, the optimality condition becomes

    \sum_{n} x_{n,\bar i}\, \frac{h^n_{\bar l}\,\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1) - \theta_{\bar l,\bar i}\,\lambda = 0, \quad \forall\, \bar l, \bar i.    (2.38)

Therefore, the θ_{l̄,ī} that optimizes U(Θ | Θ^t) + (ξ − 1) Σ_l Σ_i log θ_{l,i} under the constraint Σ_i θ_{l̄,i} = 1 at each iteration is

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1)}{\lambda}.    (2.39)

Moreover, since Σ_{ī=1}^{V} θ_{l̄,ī} = 1, summing both sides over all V vocabulary words gives

    \sum_{\bar i=1}^{V} \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + (\xi - 1)}{\lambda} = 1.    (2.40)

Hence the Lagrange multiplier is

    \lambda = \sum_{\bar i=1}^{V} \sum_{n} x_{n,\bar i}\, \frac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l'} h^n_{l'}\theta^t_{l',\bar i}} + V(\xi - 1).    (2.41)

Based on (2.39) and (2.41), the parameters θ^{t+1}_{l̄,ī} are

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n} x_{n,\bar i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} h^n_{l}\theta^t_{l,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n} x_{n,i}\, \dfrac{h^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} h^n_{l}\theta^t_{l,i}} + V(\xi - 1)}.    (2.42)
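The update (2.42) can be implemented directly; the sketch below uses the equivalent form in terms of the 0/1 labels y (the simplification carried out next as (2.43)). It is an illustrative vectorized sketch, not the thesis's program.

```python
import numpy as np

def pmm1_update(Theta, X, Y, xi=2.0):
    """One multiplicative update of PMM1, eq. (2.42)/(2.43).

    Theta: (L, V) current parameters (rows strictly positive, summing to 1),
    X: (N, V) word counts, Y: (N, L) 0/1 label matrix with at least one 1 per row.
    Returns the updated (L, V) matrix, each row renormalized to sum to 1.
    """
    mix = Y @ Theta                                             # (N, V): sum_l y^n_l theta^t_{l,i}
    resp = Y[:, :, None] * Theta[None, :, :] / mix[:, None, :]  # (N, L, V) label responsibilities
    numer = np.einsum("nv,nlv->lv", X, resp) + (xi - 1.0)       # numerator of (2.42), Laplace term included
    return numer / numer.sum(axis=1, keepdims=True)             # division by lambda, eq. (2.41)
```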

Since h_l(y) in PMM1 is a uniform mixing degree within a document, the nonzero h^n_l are all equal. Via this property, we can translate the effect of h to y, and (2.42) becomes

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} > 0} x_{n,\bar i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} > 0} x_{n,i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} y^n_{l}\theta^t_{l,i}} + V(\xi - 1)}, \quad \forall\, \bar l, \bar i.    (2.43)

Next, we show that these parameter updates converge to the maximum of the objective; that is, any limit point of the sequence generated by (2.43) is a global maximum of (2.16). As a first step, note from (2.36) that J(Θ; D) is strictly increasing along the sequence. To prove that any limit point Θ* of the sequence is an optimum, assume to the contrary that

    \Theta^{*} \text{ is not an optimum of } \max_{\Theta} J(\Theta; D),    (2.44)

that is, Θ* is not optimal for

    \max_{\Theta}\ U(\Theta \mid \Theta^{*}) - T(\Theta \mid \Theta^{*}) + (\xi - 1)\sum_{l=1}^{L}\sum_{i=1}^{V} \log\theta_{l,i}.    (2.45)

From (2.44), Θ* does not satisfy

    \frac{\partial J(\Theta; D)}{\partial \theta_{l,i}}\Big|_{\Theta = \Theta^{*}} = 0, \quad \forall\, l, i.    (2.46)

Thus there exists at least one θ_{l,i} whose partial derivative is nonzero. Therefore, (2.43) can be applied once more to find a parameter set Θ^{*+1} with

    \Theta^{*+1} \ne \Theta^{*},    (2.47)

such that

    J(\Theta^{*+1}; D) > J(\Theta^{*}; D).    (2.48)

Since the update rule is a continuous function, every Θ^t in the sequence is updated to Θ^{t+1}, and the limit point of the updated sequence is

    \lim_{t\to\infty} \Theta^{t+1} = \Theta^{*+1}.    (2.49)

Thus we obtain

    \lim_{t\to\infty} J(\Theta^{t+1}; D) = J(\Theta^{*+1}; D) > J(\Theta^{*}; D),    (2.50)

which contradicts the fact that J(Θ*; D) ≥ J(Θ^t; D) for all t. Hence Θ* is a stationary point. Note that a stationary point may be only a saddle point; however, if the maximization of (2.16) is a concave programming problem, the stationary point is a global maximum. To check whether that is the case here, we examine the characteristics of the objective function. The objective function (2.16) is a sum of logarithms of the θ_{l,i} and is therefore a concave function, because logarithm functions are concave. Maximization of a concave function f can be regarded equivalently as minimization of the convex function −f, the scenario defined in a concave programming problem. Moreover, the feasible region defined by the parameter constraints is a convex set. As a result, the maximization of (2.16) is in fact a concave programming problem, and the stationary point Θ* is not only a local optimum but a global one. We conclude that any limit point of the sequence {Θ^t} generated by (2.43) is a global maximum of (2.16).

Example

To make the above derivation of the updating rule more transparent, we provide an example to illustrate its properties. Consider a simple problem with sample size one (N = 1), two possible labels (L = 2), and two words in the vocabulary (V = 2):

    N = 1, \quad L = 2, \quad V = 2, \quad y = [1, 1]^T, \quad x_1 = 1, \quad x_2 = 1, \quad \theta_{11} + \theta_{12} = 1, \quad \theta_{21} + \theta_{22} = 1.

The objective function of PMM1 for this example is

    \max\ \log\Bigl(\frac{\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\theta_{12} + \theta_{22}}{2}\Bigr)
    \quad \text{s.t.}\ \theta_{11} + \theta_{12} = 1,\ \theta_{21} + \theta_{22} = 1,\ 0 \le \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22},    (2.51)

where the penalty term is omitted for simplicity. This problem has only two constraints and four parameters, in other words two free parameters. Suppose that θ̃_{l,i} updates θ_{l,i} in one step of the updating sequence. For l = 1,

    \tilde\theta_{11} = \frac{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}},    (2.52)

    \tilde\theta_{12} = \frac{\dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.53)

To make sure that the updating rule always increases the objective value as it iteratively updates Θ, we have to show that

    \log\Bigl(\frac{\tilde\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\tilde\theta_{12} + \theta_{22}}{2}\Bigr) > \log\Bigl(\frac{\theta_{11} + \theta_{21}}{2}\Bigr) + \log\Bigl(\frac{\theta_{12} + \theta_{22}}{2}\Bigr),    (2.54)

or, in other words,

    \log\frac{\tilde\theta_{11} + \theta_{21}}{\theta_{11} + \theta_{21}} - \log\frac{\theta_{12} + \theta_{22}}{\tilde\theta_{12} + \theta_{22}} > 0.    (2.55)

Because the logarithm inequality log(1 + u) ≥ u/(1 + u) holds for u > −1, the left-hand side of (2.55) satisfies

    \log\frac{\tilde\theta_{11} + \theta_{21}}{\theta_{11} + \theta_{21}} - \log\frac{\theta_{12} + \theta_{22}}{\tilde\theta_{12} + \theta_{22}}
      = \log\Bigl(1 + \frac{\tilde\theta_{11} - \theta_{11}}{\theta_{11} + \theta_{21}}\Bigr) + \log\Bigl(1 + \frac{\tilde\theta_{12} - \theta_{12}}{\theta_{12} + \theta_{22}}\Bigr)
      \ge \frac{\tilde\theta_{11} - \theta_{11}}{\tilde\theta_{11} + \theta_{21}} + \frac{\tilde\theta_{12} - \theta_{12}}{\tilde\theta_{12} + \theta_{22}}
      = \frac{(\tilde\theta_{11} - \theta_{11})(\tilde\theta_{12} + \theta_{22}) + (\tilde\theta_{12} - \theta_{12})(\tilde\theta_{11} + \theta_{21})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})}
      = \frac{(\tilde\theta_{11} - \theta_{11})(\tilde\theta_{12} - \tilde\theta_{11} + \theta_{22} - \theta_{21})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})}
      = \frac{2\,(\tilde\theta_{11} - \theta_{11})(\theta_{22} - \tilde\theta_{11})}{(\tilde\theta_{11} + \theta_{21})(\tilde\theta_{12} + \theta_{22})},    (2.56)

where the last two equalities use the constraints \sum_{i} \theta_{l,i} = 1 for each l, namely θ_{11} + θ_{12} = θ̃_{11} + θ̃_{12} = 1 and θ_{21} + θ_{22} = 1. Because the denominator of (2.56) is always positive, it is sufficient to check whether its numerator is also positive. Let Δ = θ̃_{11} − θ_{11}; from (2.52),

    \Delta = \frac{\dfrac{\theta_{11}\theta_{12}}{\theta_{11}+\theta_{21}} - \dfrac{\theta_{11}\theta_{12}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.57)

If Δ > 0, then

    \theta_{12} + \theta_{22} > \theta_{11} + \theta_{21},    (2.58)

and, by the constraints of (2.51), (2.58) becomes

    \theta_{22} > \theta_{11}.    (2.59)

Moreover, the second factor of the numerator of (2.56) can be written as

    \theta_{22} - \tilde\theta_{11} = \theta_{22} - \frac{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}
      = \frac{-\dfrac{\theta_{11}\theta_{21}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}\theta_{22}}{\theta_{12}+\theta_{22}}}{\dfrac{\theta_{11}}{\theta_{11}+\theta_{21}} + \dfrac{\theta_{12}}{\theta_{12}+\theta_{22}}}.    (2.60)

Since the denominator of (2.60) is positive, we examine the sign of its numerator. Using θ_{12} = 1 − θ_{11} and θ_{22} = 1 − θ_{21},

    \text{numerator of (2.60)} = -\frac{\theta_{11}\theta_{21}}{\theta_{11}+\theta_{21}} + \frac{(1-\theta_{11})(1-\theta_{21})}{(1-\theta_{11}) + (1-\theta_{21})}.    (2.61)

Again, the denominators in (2.61) are positive, so we may consider only the numerator obtained after clearing them:

    -\theta_{11}\theta_{21}(1-\theta_{11}) - \theta_{11}\theta_{21}(1-\theta_{21}) + (\theta_{11}+\theta_{21})(1-\theta_{11})(1-\theta_{21})
      = \theta_{21}(1-\theta_{11})(1-\theta_{21}-\theta_{11}) + \theta_{11}(1-\theta_{11}-\theta_{21})(1-\theta_{21})
      = (1-\theta_{11}-\theta_{21})\,[\theta_{21}(1-\theta_{11}) + \theta_{11}(1-\theta_{21})]
      = \bigl(1-(\theta_{11}+\theta_{21})\bigr)\,(\theta_{11}\theta_{22} + \theta_{21}\theta_{12}).    (2.62)

To sum up, in the case Δ > 0, the factor 1 − (θ_{11} + θ_{21}) = θ_{22} − θ_{11} is positive according to (2.59), and consequently θ_{22} − θ̃_{11} is also positive, so the numerator of (2.56) is positive and (2.55) holds. In the case Δ < 0, (2.55) can similarly be shown to hold. Thus (2.54) has been proved, indicating that the updating rule always increases the objective function (2.51).

2.3.2 APMM

Given the derivations of the previous section, the updating formulae for APMM become straightforward due to its strong similarity to PMM1. The major, and perhaps only, change needed in the derivation is that the first-order category-dependent term h_l θ_{l,i} is replaced by the duplicate-category term h_l h_m θ_{l,m,i}, and the summations run over duplicate categories, that is, Σ_{l=1}^{L} Σ_{m=1}^{L} instead of Σ_{l=1}^{L} as in PMM1. To avoid redundancy, we write out only the key equations leading to the updating formula. The auxiliary function J(Θ | Θ^t), which again

equals J(Θ; D) in (2.26), is

    J(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \Bigl[ \log\bigl(h^n_l h^n_m \theta_{l,m,i}\bigr) - \log \frac{h^n_l h^n_m \theta_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta_{l',m',i}} \Bigr] + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L}\sum_{i=1}^{V} \log\theta_{l,m,i},    (2.63)

so that J(Θ | Θ^t) = U(Θ | Θ^t) − T(Θ | Θ^t) + (ξ − 1) Σ_l Σ_m Σ_i log θ_{l,m,i} with

    U(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \log\bigl(h^n_l h^n_m \theta_{l,m,i}\bigr),
    \qquad
    T(\Theta \mid \Theta^t) = \sum_{n=1}^{N} \sum_{i=1}^{V} x_{n,i} \sum_{l=1}^{L}\sum_{m=1}^{L} \frac{h^n_l h^n_m \theta^t_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta^t_{l',m',i}} \log \frac{h^n_l h^n_m \theta_{l,m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'} \theta_{l',m',i}}.    (2.64)

As before, it can be shown that the T term does not decrease the value of the objective function, so we can simply maximize

    U(\Theta \mid \Theta^t) + (\xi - 1) \sum_{l=1}^{L}\sum_{m=1}^{L}\sum_{i=1}^{V} \log\theta_{l,m,i}    (2.65)

with respect to Θ to derive the parameter update formula. Let θ_{l̄,m̄,ī} be an element of the (L(L+1)/2) × V parameter matrix. The partial derivative of the Lagrangian with respect to θ_{l̄,m̄,ī} is

    \frac{\partial}{\partial \theta_{\bar l,\bar m,\bar i}} \Bigl( U(\Theta \mid \Theta^t) + (\xi - 1)\sum_{l}\sum_{m}\sum_{i} \log\theta_{l,m,i} - \lambda\bigl(\textstyle\sum_{i}\theta_{\bar l,\bar m,i} - 1\bigr) \Bigr)
      = \sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l} h^n_{\bar m}\,\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} \cdot \frac{1}{\theta_{\bar l,\bar m,\bar i}} + (\xi - 1)\frac{1}{\theta_{\bar l,\bar m,\bar i}} - \lambda.    (2.66)

Setting the above partial derivative to zero yields

    \theta_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + (\xi - 1)}{\lambda}.    (2.67)

Since Σ_{i=1}^{V} θ_{l̄,m̄,i} = 1, we have

    \lambda = \sum_{\bar i=1}^{V} \sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \frac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + V(\xi - 1).    (2.68)

Substituting (2.68) into (2.67), the update rule becomes

    \theta_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, h^n_{\bar l} h^n_{\bar m} > 0} x_{n,i}\, \dfrac{h^n_{\bar l} h^n_{\bar m}\theta^t_{\bar l,\bar m,i}}{\sum_{l'}\sum_{m'} h^n_{l'} h^n_{m'}\theta^t_{l',m',i}} + V(\xi - 1)}.    (2.69)

In other words, the updating formulae for θ_{l,l,i} = θ_{l,i} and for θ_{l,m,i} (l ≠ m) are, respectively,

    \theta^{t+1}_{\bar l,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} > 0} x_{n,\bar i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,\bar i}} + (\xi - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} > 0} x_{n,i}\, \dfrac{y^n_{\bar l}\theta^t_{\bar l,i}}{\sum_{l} y^n_{l}\theta^t_{l,i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,i}} + V(\xi - 1)}    (2.70)

and

    \theta^{t+1}_{\bar l,\bar m,\bar i} = \frac{\sum_{n:\, y^n_{\bar l} y^n_{\bar m} > 0} x_{n,\bar i}\, \dfrac{2 y^n_{\bar l} y^n_{\bar m}\theta^t_{\bar l,\bar m,\bar i}}{\sum_{l} y^n_{l}\theta^t_{l,\bar i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,\bar i}} + (\omega - 1)}{\sum_{i=1}^{V}\sum_{n:\, y^n_{\bar l} y^n_{\bar m} > 0} x_{n,i}\, \dfrac{2 y^n_{\bar l} y^n_{\bar m}\theta^t_{\bar l,\bar m,i}}{\sum_{l} y^n_{l}\theta^t_{l,i} + 2\sum_{l>m} y^n_{l} y^n_{m}\theta^t_{l,m,i}} + V(\omega - 1)}.    (2.71)

As with PMM1, it can be shown that the updating sequence converges to an optimum. The objective function J(Θ; D) of APMM is again a concave function; by the same argument, the maximization of J(Θ; D) is a concave programming problem, and consequently the sequence converges to a unique global solution.

2.4 Prediction Method

Let x denote a new document and Θ̂ the optimal parameter set obtained in the training procedure. Applying Bayes' rule (2.3), the optimal class label vector of x is defined as

    y^* = \arg\max_{y} P(y \mid x, \hat\Theta),    (2.72)

where

    P(y \mid x, \hat\Theta) = \frac{P(y \mid \hat\Theta)\, P(x \mid y, \hat\Theta)}{P(x \mid \hat\Theta)}.    (2.73)

Since P(x | Θ̂) is unrelated to y and PMMs assume P(y) is independent of Θ, (2.73) becomes P(y | x, Θ̂) ∝ P(y) P(x | y, Θ̂). Under the uniform class prior assumption,

    P(y \mid x, \hat\Theta) \propto P(x \mid y, \hat\Theta),    (2.74)

which brings us back to the original model (2.15).

2.4.1 PMM1

According to (2.72), (2.74), and (2.15), the class label vector is estimated by

    y^* = \arg\max_{y} \prod_{i=1}^{V} \Bigl( \frac{\sum_{l=1}^{L} y_l\,\hat\theta_{l,i}}{\sum_{l'=1}^{L} y_{l'}} \Bigr)^{x_i}.    (2.75)

The class label vector y is a 0/1 vector. PMMs predict the class label vector via a greedy search. The prediction is accomplished by the following simple procedure:

Algorithm 1

1. Start with y = 0 and P^0_l = 0, l = 1, ..., L.
2. Repeat for t = 0, 1, ...:
   (a) Consider l = (t mod L) + 1.
       i. Set y' = y.
       ii. If y'_l = 0, set y'_l = 1 and compute

           P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{l'=1}^{L} y'_{l'}\,\hat\theta_{l',i}}{\sum_{l'=1}^{L} y'_{l'}} \Bigr).    (2.76)

   (b) If max_l(P^{t+1}_l) > max_l(P^t_l), then set l* = arg max_l(P^{t+1}_l) and y_{l*} = 1,
   until max_l(P^{t+1}_l) ≤ max_l(P^t_l).

Here θ̂_{l,i} is obtained from the training procedure. The greedy algorithm is quite efficient, as it needs at most L(L + 1)/2 evaluations to predict a new document x; since the multi-label problem has 2^L − 1 possible label combinations, an exhaustive search would be time consuming. Equation (2.76) in Algorithm 1 is the logarithm of (2.75). Since the logarithm is an increasing function and we only need to rank the P^{t+1}_l to find the maximum, using the logarithm does not change the result. Furthermore, the logarithm simplifies the calculation and lessens numerical error, since the product over words in (2.75) can be very small. Hence the logarithm is useful and commonly used to cope with this problem.
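The following Python sketch is one way to realize the greedy search of Algorithm 1; the helper names are chosen for this sketch, and the loop structure (trying every unset label each round rather than cycling with t mod L) is an equivalent reformulation, not the thesis's code.

```python
import numpy as np

def label_score(x, y, Theta_hat):
    """Log of (2.75): sum_i x_i * log( sum_l y_l theta_{l,i} / sum_l y_l ), cf. (2.76)."""
    y = np.asarray(y, dtype=float)
    return float(x @ np.log((y @ Theta_hat) / y.sum()))

def greedy_predict(x, Theta_hat):
    """Greedily add the label that most increases the score; stop when nothing helps."""
    L = Theta_hat.shape[0]
    y = np.zeros(L)
    best = -np.inf
    while True:
        candidates = [label_score(x, np.where(np.arange(L) == l, 1.0, y), Theta_hat)
                      if y[l] == 0 else -np.inf
                      for l in range(L)]
        l_star = int(np.argmax(candidates))
        if candidates[l_star] <= best:      # no single addition improves the score
            return y
        y[l_star] = 1.0
        best = candidates[l_star]
```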

2.4.2 APMM

The prediction procedure of APMM is similar to Algorithm 1. Since Θ̂ of APMM contains (L(L + 1)/2) × V elements, the class label vector is mapped to an L(L + 1)/2-dimensional space. Define D = L(L + 1)/2 as the dimension of the new class label space. The prediction is accomplished by the following procedure:

Algorithm 2

1. Start with y = 0 and P^0_l = 0, l = 1, ..., L.
2. Repeat for t = 0, 1, ...:
   (a) Consider l = (t mod L) + 1.
       i. Set y' = y.
       ii. If y'_l = 0, set y'_l = 1.
       iii. Map y' to y'_d, d = 1, ..., D (see Appendix B for details), and compute

            P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{d=1}^{D} y'_d\,\hat\theta_{d,i}}{\sum_{d'=1}^{D} y'_{d'}} \Bigr).    (2.77)

   (b) If max_l(P^{t+1}_l) > max_l(P^t_l), then set l* = arg max_l(P^{t+1}_l) and y_{l*} = 1,
   until max_l(P^{t+1}_l) ≤ max_l(P^t_l).

Since the class label vector y is mapped to an L(L + 1)/2-dimensional space, the original decision function

    P^{t+1}_l = \sum_{i=1}^{V} x_i \log \Bigl( \frac{\sum_{l} y_l\,\hat\theta_{l,i} + 2\sum_{l>m} y_l y_m\,\hat\theta_{l,m,i}}{\sum_{l} y_l + 2\sum_{l>m} y_l y_m} \Bigr)

can be represented as (2.77), so APMM uses a similar formula to predict the class labels. At each iteration, at most L values P^{t+1}_l are updated; therefore the prediction procedure for a new document x also needs at most L(L + 1)/2 evaluations.
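The thesis leaves the exact y-to-y_d mapping to Appendix B, which is not reproduced here. A mapping consistent with the decision function above and with (2.24) assigns weight y_l to each θ_l and weight 2 y_l y_m to each θ_{l,m} with l > m; the sketch below encodes that assumption.

```python
import numpy as np

def expand_labels(y):
    """Map a 0/1 label vector y of length L to D = L(L+1)/2 weights y_d:
    y_l for the category vectors theta_l, and 2*y_l*y_m (l > m) for the
    duplicate-category vectors theta_{l,m}.  This mapping is an assumption
    consistent with (2.24); the thesis's own mapping is given in its Appendix B."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    weights = list(y) + [2.0 * y[l] * y[m] for l in range(L) for m in range(l)]
    return np.array(weights)
```

With the estimated parameter vectors stacked in the same d order as a D × V matrix Theta_d (a naming assumption), the score (2.77) is then x @ np.log((y_d @ Theta_d) / y_d.sum()), just as in the PMM1 sketch.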

CHAPTER III

Experiments and Results

This chapter first describes the datasets used in the experiments; the real-world yahoo.com datasets were generated in [18]. We then present the three evaluation criteria used in the experiments, which are common ways to measure multi-label classification models. Finally, we compare our new model, APMM, with PMMs and discuss the results. We redo the experiments in [18] and construct a more efficient program to reduce the training and prediction time.

3.1 Data Description

The yahoo.com datasets are generated from real-world Web pages. There are 14 top-level categories, and each top-level category has several sub-categories, which we consider as second-level categories. A Web page associated with one top-level category is further classified into one or more sub-categories. These datasets were originally generated in [18], using GNU Wget to automatically collect the Web pages from yahoo.com; the details are shown in Table 3.1. Although there are 14 top-level categories, only 11 of them are used in the experiments. We consider each top-level category an independent problem, so we have 11 independent datasets in the experiments. The numbers of labels range from 21 to ... .

The vocabulary size varies from 21,924 to 52,350. Furthermore, the proportion of multi-label documents is 30%-45% in these 11 problems. The data we use are word frequencies, the feature type introduced in Section 2.1.

Table 3.1: Details of the yahoo.com Web page datasets. #Text is the number of texts in the dataset, #Vocab is the number of vocabulary words (i.e., features), #Topic is the number of topics, #Label is the number of labels, and Label-size frequency (%) is the relative frequency of each label size.

Dataset   #Text    #Vocab    #Topic   #Label   Label-size frequency (%)
Ar         7,484   23,...     ...      ...      ...
Bu        11,214   21,...     ...      ...      ...
Co        12,444   21,...     ...      ...      ...
Ed        12,030   27,...     ...      ...      ...
En        12,730   32,...     ...      ...      ...
He         9,205   30,...     ...      ...      ...

3.2 Evaluation Criteria

In the experiments, we use three measurements to evaluate the performance of PMM and APMM. These evaluation criteria are common in the multi-label classification field. In single-label classification problems, accuracy (i.e., the Exact Match ratio) is often used; in multi-label classification, however, the Exact Match ratio may not be the most suitable measure. Thus we also apply the Labeling F-measure and the Retrieval F-measure to evaluate the models. For all of them, larger is better.

3.2.1 Exact Match Ratio

This criterion counts how many samples are predicted exactly correctly by the model. Its formula is

    E_X = \frac{1}{N} \sum_{n=1}^{N} B[\hat y^n = y^n].    (3.1)

Here, ŷ^n denotes the predicted label vector of the nth document and y^n its true class label vector. B denotes the Boolean function, which takes the value 0 or 1 according to falsity or truth:

    B[\hat y^n = y^n] = \begin{cases} 1 & \text{if } \hat y^n_l = y^n_l\ \forall\, l, \\ 0 & \text{otherwise.} \end{cases}    (3.2)

3.2.2 Labeling F-measure

This is a partial-match ratio. Since it is difficult to obtain a high Exact Match ratio in multi-label classification problems, we can use partial measures to check how the model performs. The formulation is

    F_L = \frac{1}{N} \sum_{n=1}^{N} \frac{2\sum_{l=1}^{L} \hat y^n_l\, y^n_l}{\sum_{l=1}^{L} y^n_l + \sum_{l=1}^{L} \hat y^n_l}.    (3.3)

Furthermore, the F-measure can be defined as a combination of Precision (P) and Recall (R). Let the precision and recall of the nth document be

    P_n = \frac{\sum_{l} y^n_l\, \hat y^n_l}{\sum_{l} \hat y^n_l},    (3.4)

    R_n = \frac{\sum_{l} y^n_l\, \hat y^n_l}{\sum_{l} y^n_l}.    (3.5)

Precision and Recall are widely used in many fields, such as data mining and machine learning. The Precision ratio shows how precisely the model detects the true class labels, and the Recall ratio shows what percentage of the true class labels is detected.
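The document-level criteria above translate directly into a few lines of code; this is an illustrative sketch (the 0/1 matrix conventions are assumptions), not an official implementation.

```python
import numpy as np

def exact_match_ratio(Y_true, Y_pred):
    """E_X of (3.1)-(3.2): fraction of documents whose whole label vector is correct."""
    return float(np.mean(np.all(Y_true == Y_pred, axis=1)))

def labeling_f_measure(Y_true, Y_pred):
    """F_L of (3.3), i.e. the mean over documents of 2*P_n*R_n/(P_n+R_n).

    Assumes every document has at least one true and one predicted label,
    so the denominator is never zero.
    """
    inter = np.sum(Y_true * Y_pred, axis=1)
    denom = np.sum(Y_true, axis=1) + np.sum(Y_pred, axis=1)
    return float(np.mean(2.0 * inter / denom))
```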

We have

    F_L = \frac{1}{N} \sum_{n=1}^{N} \frac{2\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l + \sum_{l} \hat y^n_l}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2\bigl(\sum_{l} y^n_l \hat y^n_l\bigr)^2}{\bigl(\sum_{l} y^n_l \hat y^n_l\bigr)\bigl(\sum_{l} y^n_l + \sum_{l} \hat y^n_l\bigr)}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2\,\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} \hat y^n_l}\cdot\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l}}{\dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} \hat y^n_l} + \dfrac{\sum_{l} y^n_l \hat y^n_l}{\sum_{l} y^n_l}}
        = \frac{1}{N} \sum_{n=1}^{N} \frac{2 P_n R_n}{P_n + R_n}.

Consider an instance with L = 4, true label vector y = [1, 1, 0, 1]^T, and predicted label vector ŷ = [1, 0, 0, 1]^T. Its Precision is P = 2/2, its Recall is 2/3, and its F_L is 4/5.

3.2.3 Retrieval F-measure

This is a partial measure that evaluates performance label-wise. In the text categorization community, the Retrieval F-measure is called the macro average of F-measures:

    F_R = \frac{1}{L} \sum_{l=1}^{L} \frac{2\sum_{n=1}^{N} \hat y^n_l\, y^n_l}{\sum_{n=1}^{N} y^n_l + \sum_{n=1}^{N} \hat y^n_l}.    (3.6)
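Its label-wise counterpart, under the same conventions as the sketch above:

```python
import numpy as np

def retrieval_f_measure(Y_true, Y_pred):
    """F_R of (3.6): macro average over labels of 2*TP_l / (|true_l| + |pred_l|).

    A label that never occurs in either the truth or the predictions would give a
    zero denominator; such labels are assumed not to arise here.
    """
    tp = np.sum(Y_true * Y_pred, axis=0)            # per-label true positives
    denom = np.sum(Y_true, axis=0) + np.sum(Y_pred, axis=0)
    return float(np.mean(2.0 * tp / denom))
```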

3.3 Experimental Setting

Six of the 11 datasets are used in the following experiments. The training size of each dataset is 2,000, and the testing size of each dataset is ... . These are the original datasets used in [18]. Since the performances of PMM2 and PMM1 are close and [18] observed that PMM1 was better than PMM2 in their experiments, we implement the program of PMM1 (Appendix A) to compare against the proposed model, APMM. Furthermore, we set the constant (ξ − 1) = 1 in the penalty term.

Table 3.2: Single-label document prediction performance. Pr_s is the number of documents predicted as single-label, Co_s is the number of single-label documents that have been predicted correctly, and Co ratio is the ratio of single-label documents correctly predicted. (Columns: Dataset, single, then Pr_s, Co_s, Co ratio for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

3.4 Results and Discussions

Most of the exactly matched documents are single-label documents; indeed, Table 3.1 shows that single-label documents have the largest proportion. Table 3.3 shows that the number of iterations increases as the stopping tolerance decreases. It also shows that the training and testing time of APMM is only slightly larger than that of PMM, even though the class label y is mapped to the L(L+1)/2-dimensional space. Our stopping condition is |θ^{t+1}_{l,i} − θ^t_{l,i}| ≤ tolerance for all l, i. Tables 3.4 and 3.5 present the performances under the stopping tolerances 0.01 and ..., respectively. The differences in performance between Table 3.4 and Table 3.5 are quite small. Furthermore, no matter what initial Θ is given, the same stopping tolerance leads to similar numbers of iterations and similar performance. Figure 3.1 presents the relation between stopping tolerances and performances.

Table 3.3: Training and testing time of the models with different stopping tolerances. Since the numbers in this table are averages over several problems, the numbers of iterations have decimal points. (Columns: Stop tol, then #Iter, T_tr, T_te for each of PMM1 and APMM.)

Table 3.4: Performance with stopping tolerance 0.01, under the three evaluation criteria presented in Section 3.2. The Exact Match ratio of APMM is better than that of PMM1, but its Retrieval F-measure is lower than PMM1's; the Labeling F-measures of the two models are quite similar. (Columns: Dataset, then E_X, F_L, F_R for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

Clearly the performance is not affected much by the stopping tolerance. The convergence speed of the models is shown in Figure 3.2. Considering the prior P(Θ) in (2.10), we tried several values of (ξ − 1) in the experiments; the constants we tried range from 10 to ... . Figure 3.3 shows the three evaluation criteria versus the different values of (ξ − 1). It seems reasonable to set (ξ − 1) = 1, since the accuracy rates perform well over the range of (ξ − 1) from 0.1 to 2. Details for (ξ − 1) in this range are shown in Figure 3.4.

Table 3.5: Performance with a second stopping tolerance; the legend is the same as in Table 3.4. (Columns: Dataset, then E_X, F_L, F_R for each of PMM1 and APMM; rows: Ar, Bu, Co, Ed, En, He.)

Figure 3.1: The relation between stopping tolerances and performances (panels (a) Exact Match ratio, (b) Labeling F-measure, (c) Retrieval F-measure, each plotted against the stopping tolerance).

Figure 3.2: Number of iterations versus stopping tolerance.

Figure 3.3: The three evaluation criteria (panels (a) Exact Match ratio, (b) Labeling F-measure, (c) Retrieval F-measure) versus (ξ − 1).

Figure 3.4: The same criteria for (ξ − 1) from 0.1 to 2.

Furthermore, in analyzing the performance of the models we find something interesting: the strict criterion, the Exact Match ratio, is better for APMM than for PMM1, but the partial-match measures, the Labeling F-measure and the Retrieval F-measure, are not as good. To explore the cause, we decompose the datasets by label size and calculate Precision and Recall. We find that the predictions of APMM fall mainly on label size one, while the predicted proportions of PMM1 are relatively even. However, even though its predicted proportions are less even, the Precision and Recall of APMM can be better than those of PMM1. For example,

Table 3.6 shows that the number of documents predicted with label size two is smaller for APMM than for PMM1, but the Precision and Recall of APMM at label size two are larger than those of PMM1.

Table 3.6: Prediction accuracy for different label sizes. #label is the label size, num is the total number of documents of that label size in the dataset, Pr is the number of documents predicted with that label size, Co is the number correctly predicted, and Co ratio is the ratio correctly predicted. Since Table 3.1 shows that the frequencies of label sizes larger than 4 are relatively small, we combine the correctly-predicted ratios for label sizes greater than 4. (Columns: #label, num, then Pr, Co, Precision, Recall for each of PMM1 and APMM; the last row aggregates label sizes > 4.)


More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Machine Learning: Assignment 1

Machine Learning: Assignment 1 10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Stochastic Gradient Descent

Stochastic Gradient Descent Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Midterm Exam, Spring 2005

Midterm Exam, Spring 2005 10-701 Midterm Exam, Spring 2005 1. Write your name and your email address below. Name: Email address: 2. There should be 15 numbered pages in this exam (including this cover sheet). 3. Write your name

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 22 Dan Weld Learning Gaussians Naïve Bayes Last Time Gaussians Naïve Bayes Logistic Regression Today Some slides from Carlos Guestrin, Luke Zettlemoyer

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Homework 6: Image Completion using Mixture of Bernoullis

Homework 6: Image Completion using Mixture of Bernoullis Homework 6: Image Completion using Mixture of Bernoullis Deadline: Wednesday, Nov. 21, at 11:59pm Submission: You must submit two files through MarkUs 1 : 1. a PDF file containing your writeup, titled

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE March 28, 2012 The exam is closed book. You are allowed a double sided one page cheat sheet. Answer the questions in the spaces provided on

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri

CSE 151 Machine Learning. Instructor: Kamalika Chaudhuri CSE 151 Machine Learning Instructor: Kamalika Chaudhuri Ensemble Learning How to combine multiple classifiers into a single one Works well if the classifiers are complementary This class: two types of

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Logistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48

Logistic Regression. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, / 48 Logistic Regression Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms January 25, 2017 1 / 48 Outline 1 Administration 2 Review of last lecture 3 Logistic regression

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007 Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2. CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)

More information

Machine Learning: Logistic Regression. Lecture 04

Machine Learning: Logistic Regression. Lecture 04 Machine Learning: Logistic Regression Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Supervised Learning Task = learn an (unkon function t : X T that maps input

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines

More information

Support Vector Machine & Its Applications

Support Vector Machine & Its Applications Support Vector Machine & Its Applications A portion (1/3) of the slides are taken from Prof. Andrew Moore s SVM tutorial at http://www.cs.cmu.edu/~awm/tutorials Mingyue Tan The University of British Columbia

More information

Expectation maximization

Expectation maximization Expectation maximization Subhransu Maji CMSCI 689: Machine Learning 14 April 2015 Motivation Suppose you are building a naive Bayes spam classifier. After your are done your boss tells you that there is

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Jordan Boyd-Graber University of Colorado Boulder LECTURE 7 Slides adapted from Tom Mitchell, Eric Xing, and Lauren Hannah Jordan Boyd-Graber Boulder Support Vector Machines 1 of

More information

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

CS446: Machine Learning Fall Final Exam. December 6 th, 2016 CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

More information