Generative Maximum Entropy Learning for Multiclass Classification
A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram
Dept. of Computer Science and Automation
Indian Institute of Science, Bangalore
December 5, 2013
Outline

1 Introduction
    Generative vs Discriminative Classification
    Information Theoretic Learning
    Contributions
2 Maximum Entropy Models and Divergences
3 Why Maximum Discrimination? The MeMd Approach
4 The One vs. All Approach: MeMd Using JS Divergence
5 Experiments
6 Conclusions
Generative vs Discriminative Classification

Discriminative approaches
- Model the posterior distribution of the class labels given the data.
- Have smaller asymptotic error than generative approaches.
- May overfit when the training set is small.

Generative approaches
- Model the joint distribution of the data and class labels.
- Require less training data to achieve their asymptotic error.
- Easier to incorporate dependencies among data/features.
- Easier to incorporate latent variables.
- More intuitive to understand.
Information Theoretic Learning

- Maximum entropy methods make minimal assumptions about the data.
- They have been successful in natural language processing, where the data is high dimensional.
- However, most existing maximum entropy methods are discriminative in nature.
Contributions

- We propose a generative maximum entropy classification model.
- We incorporate feature selection into the model using a discriminative criterion based on the Jeffreys divergence.
- We extend the approach to the multiclass setting in a unique manner, by approximating the Jensen-Shannon divergence.
- We experimentally study the proposed approaches on large text datasets and gene expression datasets.
Notation

- X = X_1 x ... x X_d is the input space, and X = (X_1, ..., X_d) is a random vector taking values in X.
- x = (x_1, ..., x_d) denotes an input instance.
- {c_1, ..., c_M} denote the class labels.
- The class conditional density of the j-th class is denoted by P_{c_j}(.).
- Gamma denotes a set of feature functions.
Maximum Entropy Modelling

If the only information available about the random vector X is in the form of expected values of real-valued feature functions phi_r, 1 <= r <= l, then the distribution obtained by maximizing entropy is

    P(x) = exp( -lambda_0 - sum_{j=1}^{l} lambda_j phi_j(x) ),    (1)

where lambda_0, lambda_1, ..., lambda_l are the Lagrangian parameters.

- In maximum entropy modelling, the expected values of the feature functions are approximated from the observed data.
- The Lagrangian parameters can then be estimated by maximum likelihood on the training data.
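As a concrete illustration of Eq. (1), the sketch below fits a maximum entropy distribution on a small finite support under a single expected-value constraint by minimizing the convex dual (log-partition function plus lambda . target). The function names and the toy constraint are our own, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def maxent_fit(support, feats, targets):
    """Fit P(x) = exp(-lam_0 - sum_j lam_j phi_j(x)) on a finite support,
    matching the expected values `targets` of the feature functions `feats`.
    lam_0 is absorbed into the normalizing constant."""
    F = np.array([[f(x) for f in feats] for x in support])  # (|support|, l)
    m = np.asarray(targets, dtype=float)

    def dual(lam):
        # Convex dual of the entropy maximization: log Z(lam) + lam . m.
        # At its minimizer, E_P[phi_j] = m_j for every j.
        return np.log(np.sum(np.exp(-F @ lam))) + lam @ m

    lam = minimize(dual, np.zeros(len(feats))).x
    p = np.exp(-F @ lam)
    return p / p.sum()

support = np.arange(6)
p = maxent_fit(support, [lambda x: x], [2.0])  # constrain E[X] = 2
```

When the target expectations are the empirical ones, minimizing this dual coincides with the maximum likelihood estimation of the Lagrangian parameters mentioned above.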
Divergences

Jeffreys divergence: a symmetrized version of the KL divergence.

    J(P || Q) = KL(P || Q) + KL(Q || P) = \int_X (P(x) - Q(x)) ln( P(x) / Q(x) ) dx.    (2)

Jensen-Shannon divergence: a multi-distribution divergence.

    JS(P_1, ..., P_M) = sum_{i=1}^{M} pi_i KL(P_i || Pbar),    (3)

where Pbar is the pi-weighted arithmetic mean of the distributions P_1, ..., P_M.

- The JS divergence is non-negative, symmetric, and bounded.
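Both divergences in Eqs. (2) and (3) are straightforward to compute for discrete distributions; a minimal sketch (helper names are ours, natural logarithms throughout):

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    # Eq. (2): J(P || Q) = KL(P || Q) + KL(Q || P); symmetric by construction.
    return kl(p, q) + kl(q, p)

def js(dists, priors):
    # Eq. (3): JS = sum_i pi_i KL(P_i || Pbar), with Pbar the pi-weighted
    # arithmetic mean of the distributions.
    pbar = sum(pi * d for pi, d in zip(priors, dists))
    return sum(pi * kl(d, pbar) for pi, d in zip(priors, dists))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
```

With equal weights, the JS divergence of two distributions is bounded above by ln 2, matching the boundedness property noted above.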
Why Maximum Discrimination?

Let the classes be labelled +1 and -1. In Bayes classification, a point x is assigned the label +1 if

    pi_+ P_+(x) > pi_- P_-(x),    (4)

where pi_+ and pi_- denote the prior probabilities of the two classes.

Hence, the Bayes classification margin y log( pi_+ P_+(x) / (pi_- P_-(x)) ) must be greater than zero for a point to be classified correctly.
Why Maximum Discrimination? (contd.)

Hence, one can select features so as to maximize the Bayes classification margin over the training set:

    Gamma* = argmax_{S in 2^Gamma} sum_{i=1}^{N} y^{(i)} log( pi_+ P_+(x^{(i)}; S) / (pi_- P_-(x^{(i)}; S)) ).

When the class conditional distributions are obtained by maximum entropy, the above quantity corresponds to the J divergence between the two classes.
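A toy numerical check of the margin criterion, with made-up class conditionals on a single binary feature:

```python
import numpy as np

# Hypothetical two-class setup on one binary feature x in {0, 1}.
pi_pos, pi_neg = 0.6, 0.4
P_pos = {0: 0.8, 1: 0.2}   # class +1 conditional
P_neg = {0: 0.3, 1: 0.7}   # class -1 conditional

def bayes_margin(x, y):
    # y * log(pi_+ P_+(x) / (pi_- P_-(x))): positive exactly when the
    # Bayes rule of Eq. (4) assigns x its true label y.
    return y * np.log((pi_pos * P_pos[x]) / (pi_neg * P_neg[x]))
```

Here x = 0 is correctly labelled +1 and x = 1 is correctly labelled -1, so the margin is positive for the true labels and negative for the flipped one.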
The MeMd Approach (Dukkipati et al., 2010)

Let Gamma denote the set of all features.

Aim: find the feature subset Gamma* in 2^Gamma such that

    Gamma* = argmax_{S in 2^Gamma} J( P_{c_1}(x; S) || P_{c_2}(x; S) ).    (5)

- The problem is intractable for a large number of features.
- Since naive Bayes classifiers work well on text data, we assume class conditional independence among the features:

    P_{c_j}(x) = prod_{i=1}^{d} P^{(i)}_{c_j}(x_i).

A. Dukkipati, A. K. Yadav, and M. N. Murty, "Maximum entropy model based classification with feature selection," in Proceedings of IEEE International Conference on Pattern Recognition (ICPR). IEEE Press, 2010, pp. 565-568.
MeMd under Conditional Independence

- The assumption of class conditional independence allows one to compute Gamma* in time linear in the number of features.
- At the k-th step, the feature with the k-th highest J divergence is selected.
- Using only the top K features, the class conditional densities can be approximated as

    P_{c_j}(x) ~ prod_{i in S} P^{(i)}_{c_j}(x_i),  j = 1, 2.    (6)

- The Bayes decision rule is then used to assign a class to a test pattern; that is, a test pattern is assigned to class c_1 if

    P_{c_1}(x) P(c_1) > P_{c_2}(x) P(c_2).
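Under the independence assumption, the ranking step reduces to one J divergence per feature. A sketch for discrete features follows; the function name, the synthetic data, and the add-one smoothing are our own choices, not taken from the paper's implementation.

```python
import numpy as np

def per_feature_j(Xc1, Xc2, n_vals):
    """Jeffreys divergence between the per-feature class-conditional
    distributions of two classes; add-one smoothing avoids zeros."""
    d = Xc1.shape[1]
    scores = np.zeros(d)
    for i in range(d):
        p = np.bincount(Xc1[:, i], minlength=n_vals) + 1.0
        q = np.bincount(Xc2[:, i], minlength=n_vals) + 1.0
        p, q = p / p.sum(), q / q.sum()
        # J(P || Q) = sum_x (p - q) * log(p / q), as in Eq. (2).
        scores[i] = float(np.sum((p - q) * np.log(p / q)))
    return scores

rng = np.random.default_rng(0)
# Feature 0 discriminates the two classes; feature 1 is pure noise.
Xc1 = np.stack([rng.integers(0, 2, 500), rng.integers(0, 3, 500)], axis=1)
Xc2 = np.stack([rng.integers(1, 3, 500), rng.integers(0, 3, 500)], axis=1)
scores = per_feature_j(Xc1, Xc2, n_vals=3)
```

Keeping the K features with the largest scores and multiplying their smoothed conditionals gives the approximation of Eq. (6), after which the Bayes rule is applied as on the slide.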
The One vs. All Approach

- For M classes, 2M maximum entropy models are estimated: one for each class and one for the complement of each class.
- The J divergence between the models of each class and its complement is computed.
- The average of these J divergences, weighted by the class probabilities, is computed.
- At the k-th step, the feature with the k-th highest average J divergence is selected.
- With the top K features, the algorithm proceeds as before.
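The weighted average of one-vs-rest J divergences can be sketched as follows. Note one simplification: here the complement of a class is approximated by the prior-weighted mixture of the other class models, whereas the slide describes estimating a separate maximum entropy model for each complement.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def avg_one_vs_all_j(dists, priors):
    # For each class i: J between its model and a model of "everything
    # else", averaged with weights pi_i.
    total = 0.0
    for i, (p_i, pi_i) in enumerate(zip(dists, priors)):
        comp = sum(priors[j] * dists[j] for j in range(len(dists)) if j != i)
        comp = comp / (1.0 - pi_i)        # renormalize the complement model
        total += pi_i * (kl(p_i, comp) + kl(comp, p_i))
    return total

dists = [np.array([0.7, 0.2, 0.1]),
         np.array([0.1, 0.7, 0.2]),
         np.array([0.2, 0.1, 0.7])]
priors = [0.5, 0.3, 0.2]
```

Running this per feature and ranking features by the resulting score is the one-vs-all analogue of the binary MeMd ranking.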
Use of Multi-distribution Divergences

- The J divergence provides only pairwise discrimination between classes.
- The average J divergence requires estimating models for the complement of each class, which can be computationally expensive.
- The Jensen-Shannon (JS) divergence provides a discriminative measure among multiple class conditional probabilities.
- The JS divergence of the class models equals the mutual information between the data and its label (Grosse et al., 2002).
- The JS divergence is difficult to compute explicitly, so an approximation is required.

I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, and H. E. Stanley, "Analysis of symbolic sequences using the Jensen-Shannon divergence," Physical Review E, vol. 65, 2002.
MeMd with the JS_GM Divergence

- Replace the arithmetic mean in the JS divergence by a geometric mean probability mass function.
- JS_GM is an upper bound on the JS divergence.
- It can be expressed in terms of J divergences as

    JS_GM(P_1, ..., P_M) = (1/2) sum_{i=1}^{M} sum_{j != i} pi_i pi_j J(P_i || P_j).    (7)

The MeMd algorithm in this case:
- Select the top K features with the highest JS_GM divergence.
- Perform naive Bayes classification as before.
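Eq. (7) makes JS_GM computable from pairwise J divergences alone. A small numerical check of its stated properties (helper names are ours):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def j_div(p, q):
    return kl(p, q) + kl(q, p)

def js_gm(dists, priors):
    # Eq. (7): JS_GM = (1/2) sum_i sum_{j != i} pi_i pi_j J(P_i || P_j).
    M = len(dists)
    return 0.5 * sum(priors[i] * priors[j] * j_div(dists[i], dists[j])
                     for i in range(M) for j in range(M) if i != j)

def js_am(dists, priors):
    # Ordinary JS divergence (arithmetic-mean version), for comparison.
    pbar = sum(pi * d for pi, d in zip(priors, dists))
    return sum(pi * kl(d, pbar) for pi, d in zip(priors, dists))

dists = [np.array([0.6, 0.3, 0.1]),
         np.array([0.1, 0.6, 0.3]),
         np.array([0.3, 0.1, 0.6])]
priors = [1/3, 1/3, 1/3]
```

For two classes with priors (pi, 1 - pi), the double sum collapses to pi (1 - pi) J(P_1 || P_2), so the binary MeMd-J feature ranking is recovered as a special case.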
Comparison of Complexity of Algorithms

Algorithm                     Training: estimation   Training: feature ranking   Testing time per sample
MeMd one vs. all (MeMd-J)     O(MNd)                 O(Md + d log d)             O(MK)
MeMd JS_GM (MeMd-JS)          O(MNd)                 O(M^2 d + d log d)          O(MK)
Support Vector Machine [1]    #iterations x O(Md)    -                           O(M^2 Sd)
MaxEnt Discrimination [2]     #iterations x O(MNd)   -                           O(Md)

M = no. of classes, N = no. of training samples, d = no. of features, K = no. of selected features, S = no. of support vectors.

[1] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[2] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification," in IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999, pp. 61-67.
Experiments on Gene Expression Datasets (10-fold cross-validation accuracy)

Dataset        Classes  Samples  Features  SVM (linear)  MeMd-J (2-moment)  MeMd-JS (2-moment)
Colon cancer   2        62       2000      84.00         86.40              -
Leukemia       2        72       5147      96.89         98.97              -
CNS            2        60       7129      62.50         63.75              -
DLBCL          2        77       7070      97.74         86.77              -
Prostate       2        102      12533     89.51         89.75              -
SRBCT          4        83       2308      99.20         97.27              98.33
Lung           5        203      12600     93.21         93.52              92.60
GCM            14       190      16063     66.85         66.98              66.98

- Folds in the cross-validation were chosen randomly.
- The best accuracy is highlighted for each dataset.
- DME was not run, as it was developed only for text datasets.
Experiments on Text Datasets, Reuters (2-fold cross-validation accuracy)

No.  Classes  Samples  Features  SVM (RBF)  DME    MeMd-J (1-moment)  MeMd-JS (1-moment)
1    2        1588     7777      95.96      95.59  97.35              -
2    2        1227     8776      91.35      92.33  91.69              -
3    2        1973     9939      92.80      93.60  93.81              -
4    2        1945     6970      89.61      90.48  89.77              -
5    2        3581     13824     98.49      98.68  99.02              -
6    2        3952     17277     96.63      96.93  95.04              -
7    2        3918     13306     88.23      91.88  91.75              -
8    4        3253     17998     88.62      90.34  91.91              91.39
9    4        3952     17275     94.63      95.26  95.14              94.88
10   4        3581     13822     95.83      96.23  96.14              95.86
11   4        4891     15929     81.08      83.41  82.11              82.03

- Experiments were constructed by grouping the Reuters classes in different ways.
- The best accuracy is highlighted for each dataset.
Conclusions

- This is the first work on a generative maximum entropy approach to classification.
- We proposed a method of classification using maximum entropy with maximum discrimination (MeMd):
    Generative approach: modelling the class conditional densities.
    Discrimination: use of divergences to measure the discriminative abilities of features.
    Feature selection: selection of the most discriminative features.
- The use of multi-distribution divergences for the multiclass problem is a new contribution of this work.
- The algorithm has linear time complexity, making it suitable for large datasets with high dimensional features.
Thank you!!