Generative MaxEnt Learning for Multiclass Classification


Generative Maximum Entropy Learning for Multiclass Classification
A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram
Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore
December 5, 2013

Outline
1 Introduction: Generative vs Discriminative Classification; Information Theoretic Learning; Contributions
2 Maximum Entropy Models and Divergences
3 Why Maximum Discrimination? The MeMd Approach
4 The One vs. All Approach: MeMd Using JS Divergence
5 Experiments
6 Conclusions

Generative vs Discriminative Classification
Discriminative approaches:
- Model the posterior distribution of the class labels given the data.
- Have smaller asymptotic error than generative approaches.
- May overfit when the training set is small.
Generative approaches:
- Model the joint distribution of the data and class labels.
- Require less training data to reach their asymptotic error.
- Make it easier to incorporate dependencies among data/features.
- Make it easier to incorporate latent variables.
- Are more intuitive to understand.

Information Theoretic Learning
- Maximum entropy methods make minimal assumptions about the data.
- They have been successful in natural language processing, where the curse of dimensionality is severe.
- However, most maximum entropy methods studied so far have been discriminative in nature.

Contributions
- We propose a generative maximum entropy classification model.
- We incorporate feature selection into the model using a discriminative criterion based on the Jeffreys divergence.
- We extend the approach to the multi-class setting by approximating the Jensen-Shannon divergence.
- We experimentally study the proposed approaches on large text datasets and gene expression datasets.


Notation
- $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ is the input space, and $X = (X_1, \ldots, X_d)$ is a random vector taking values in $\mathcal{X}$.
- $x = (x_1, \ldots, x_d)$ denotes an input instance.
- $\{c_1, \ldots, c_M\}$ denotes the set of class labels.
- The class conditional density of the $j$-th class is denoted by $P_{c_j}(\cdot)$.
- $\Gamma$ denotes a set of feature functions.

Maximum Entropy Modelling
If the only information available about the random vector $X$ is in the form of expected values of real-valued feature functions $\phi_r$, $1 \le r \le l$, then the distribution obtained by maximizing entropy is
$$P(x) = \exp\Big(-\lambda_0 - \sum_{j=1}^{l} \lambda_j \phi_j(x)\Big), \qquad (1)$$
where $\lambda_0, \lambda_1, \ldots, \lambda_l$ are the Lagrangian parameters.
In maximum entropy modelling, the expected values of the feature functions are approximated from the observed data. The Lagrangian parameters can then be estimated by maximum likelihood on the training data.
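An illustrative note (not from the slides): with first- and second-moment constraints on a real-valued feature, the maximum entropy density of Eq. (1) is a Gaussian, so estimating the Lagrangian parameters reduces to matching the sample mean and variance. Below is a minimal Python sketch of such a per-feature "2-moment" model, the variant used later for the gene expression experiments; the names and implementation details are my own.

import numpy as np

def fit_two_moment_maxent(x):
    """Maximum entropy model for one real-valued feature under first- and
    second-moment constraints. Fixing E[X] and E[X^2], the maxent solution
    exp(-l0 - l1*x - l2*x^2) is a Gaussian, so the Lagrangian parameters are
    equivalent to the sample mean and variance."""
    mu = float(np.mean(x))
    var = float(np.var(x)) + 1e-9   # small floor to avoid a degenerate variance
    return mu, var

def log_density(x, mu, var):
    """Log of the fitted two-moment maxent (Gaussian) density at x."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mu) ** 2 / var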

Divergences
Jeffreys divergence: a symmetrized version of the KL divergence,
$$J(P \,\|\, Q) = \mathrm{KL}(P \,\|\, Q) + \mathrm{KL}(Q \,\|\, P) = \int_{\mathcal{X}} \big(P(x) - Q(x)\big) \ln \frac{P(x)}{Q(x)} \, dx. \qquad (2)$$
Jensen-Shannon divergence: a multi-distribution divergence,
$$\mathrm{JS}(P_1, \ldots, P_M) = \sum_{i=1}^{M} \pi_i \, \mathrm{KL}(P_i \,\|\, \bar{P}), \qquad (3)$$
where $\bar{P} = \sum_{i=1}^{M} \pi_i P_i$ is the (weighted) arithmetic mean of the distributions $P_1, \ldots, P_M$. The JS divergence is non-negative, symmetric and bounded.
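A small numpy sketch of Eqs. (2) and (3) for discrete distributions (illustrative only; the function names are my own):

import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (arrays that sum to 1)."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    """Jeffreys divergence, Eq. (2): symmetrized KL."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(dists, priors):
    """Jensen-Shannon divergence, Eq. (3), with prior weights pi_i."""
    dists = [np.asarray(p, float) for p in dists]
    mean = sum(w * p for w, p in zip(priors, dists))   # arithmetic-mean distribution
    return float(sum(w * kl(p, mean) for w, p in zip(priors, dists)))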


Why Maximum Discrimination?
Let the classes be labelled +1 and -1. In Bayes classification, a point $x$ is assigned the label +1 if
$$\pi_+ P_+(x) > \pi_- P_-(x), \qquad (4)$$
where $\pi_+$ and $\pi_-$ denote the prior probabilities of the two classes. Hence, the Bayes classification margin $y \log \frac{\pi_+ P_+(x)}{\pi_- P_-(x)}$ must be greater than zero for a point to be classified correctly.

Why Maximum Discrimination? (contd.)
Hence, one can select features so as to maximize the Bayes classification margin over the training set:
$$\Gamma^* = \arg\max_{S \in 2^{\Gamma}} \sum_{i=1}^{N} y^{(i)} \log \frac{\pi_+ P_+(x^{(i)}; S)}{\pi_- P_-(x^{(i)}; S)}.$$
When the class conditional distributions have been obtained using maximum entropy, this quantity corresponds to the J divergence between the two classes.
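To make the last step explicit, a short calculation of my own (assuming the expectation is taken over the joint distribution of $(x, y)$): the expected margin is
$$\mathbb{E}\!\left[ y \log \frac{\pi_+ P_+(x)}{\pi_- P_-(x)} \right]
= \pi_+ \,\mathrm{KL}(P_+ \,\|\, P_-) + \pi_- \,\mathrm{KL}(P_- \,\|\, P_+) + (\pi_+ - \pi_-) \log \frac{\pi_+}{\pi_-},$$
which equals $\tfrac{1}{2} J(P_+ \,\|\, P_-)$ when $\pi_+ = \pi_- = \tfrac{1}{2}$. So maximizing the empirical margin sum over the training set amounts to maximizing an estimate of the J divergence, up to prior-dependent constants.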

The MeMd Approach (Dukkipati et al., 2010)
Let $\Gamma$ denote the set of all features.
Aim: find the feature subset $\Gamma^* \subseteq \Gamma$ such that
$$\Gamma^* = \arg\max_{S \in 2^{\Gamma}} J\big(P_{c_1}(x; S) \,\|\, P_{c_2}(x; S)\big). \qquad (5)$$
The problem is intractable for a large number of features. Since naive Bayes classifiers work well for text data, we assume class conditional independence among the features:
$$P_{c_j}(x) = \prod_{i=1}^{d} P^{(i)}_{c_j}(x_i).$$
A. Dukkipati, A. K. Yadav, and M. N. Murty, "Maximum entropy model based classification with feature selection," in Proceedings of the IEEE International Conference on Pattern Recognition (ICPR). IEEE Press, 2010, pp. 565-568.

MeMd Under Conditional Independence
The assumption of class conditional independence allows one to compute $\Gamma^*$ in time linear in the number of features: at the $k$-th step, the feature with the $k$-th highest J divergence is selected.
Using only the top $K$ features, the class conditional densities are approximated as
$$P_{c_j}(x) \approx \prod_{i \in S} P^{(i)}_{c_j}(x_i), \qquad j = 1, 2. \qquad (6)$$
The Bayes decision rule is then used to assign a class to a test pattern; that is, a test pattern is assigned to class $c_1$ if
$$P_{c_1}(x) P(c_1) > P_{c_2}(x) P(c_2).$$
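A minimal end-to-end sketch of the binary MeMd pipeline under class-conditional independence, assuming two-moment (Gaussian) per-feature maxent models; this is an illustration under my own naming, not the authors' code:

import numpy as np

def fit_class(X):
    """Per-feature two-moment maxent (Gaussian) model for one class:
    arrays of means and variances, one entry per feature."""
    return X.mean(axis=0), X.var(axis=0) + 1e-9

def j_divergence(mu1, v1, mu2, v2):
    """Closed-form Jeffreys divergence between two univariate Gaussians,
    applied elementwise across features."""
    d2 = (mu1 - mu2) ** 2
    return 0.5 * ((v1 + d2) / v2 + (v2 + d2) / v1) - 1.0

def memd_select(X1, X2, K):
    """Rank features by per-feature J divergence between the two class models
    and return the indices of the top K, plus the fitted models."""
    (m1, v1), (m2, v2) = fit_class(X1), fit_class(X2)
    scores = j_divergence(m1, v1, m2, v2)
    return np.argsort(scores)[::-1][:K], (m1, v1), (m2, v2)

def classify(x, models, priors, selected):
    """Bayes decision rule restricted to the selected features, Eq. (6).
    Returns 0 for class c1 and 1 for class c2."""
    def log_lik(mu, var):
        return np.sum(-0.5 * np.log(2 * np.pi * var[selected])
                      - 0.5 * (x[selected] - mu[selected]) ** 2 / var[selected])
    scores = [np.log(p) + log_lik(mu, var) for (mu, var), p in zip(models, priors)]
    return int(np.argmax(scores))

Here X1 and X2 would be the training matrices (samples x features) of the two classes, x a test vector as a numpy array, models = [(m1, v1), (m2, v2)], and priors the empirical class frequencies.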


The One vs. All Approach
- For M classes, 2M maximum entropy models are estimated: one for each class and one for the complement of each class.
- The J divergence between the models of each class and its complement is computed.
- The average of these J divergences is computed, weighted by the class probabilities.
- At the k-th step, the feature with the k-th highest average J divergence is selected.
- With the top K features, the algorithm proceeds as before (see the sketch below).
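A hedged sketch of the per-feature scoring step just described, again using two-moment (Gaussian) maxent models for each class and its complement; the weighting by class priors follows the slide, the rest is my own illustration:

import numpy as np

def one_vs_all_scores(X, y, classes):
    """Per-feature score: prior-weighted average of the J divergence between
    each class and its complement (one vs. all), under Gaussian maxent models."""
    n, d = X.shape
    scores = np.zeros(d)
    for c in classes:
        in_c, out_c = X[y == c], X[y != c]
        prior = in_c.shape[0] / n
        m1, v1 = in_c.mean(axis=0), in_c.var(axis=0) + 1e-9
        m2, v2 = out_c.mean(axis=0), out_c.var(axis=0) + 1e-9
        d2 = (m1 - m2) ** 2
        j_div = 0.5 * ((v1 + d2) / v2 + (v2 + d2) / v1) - 1.0
        scores += prior * j_div
    return scores   # the K features with the largest scores are kept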

Use of Multi-Distribution Divergences
- The J divergence provides only pairwise discrimination between classes.
- The average J divergence requires estimating a model for the complement of each class, which can be computationally expensive.
- The Jensen-Shannon (JS) divergence provides a discriminative measure among multiple class conditional distributions.
- The JS divergence of the class-conditional models equals the mutual information between the data and its class label (Grosse et al., 2002).
- It is difficult to compute the JS divergence explicitly, so an approximation is required.
I. Grosse, P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J. Oliver, and H. E. Stanley, "Analysis of symbolic sequences using the Jensen-Shannon divergence," Physical Review E, vol. 65, 2002.

MeMd with the JS_GM Divergence
- JS_GM divergence: replace the arithmetic mean in the JS divergence by a geometric-mean probability mass function.
- JS_GM acts as an upper bound on the JS divergence.
- It can be expressed in terms of the J divergence as
$$\mathrm{JS}_{GM}(P_1, \ldots, P_M) = \frac{1}{2} \sum_{i=1}^{M} \sum_{j \neq i} \pi_i \pi_j \, J(P_i \,\|\, P_j). \qquad (7)$$
- MeMd algorithm in this case: select the top K features with the highest JS_GM divergence, then perform naive Bayes classification as before.
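Because Eq. (7) needs only pairwise J divergences between the class models, the per-feature JS_GM score can be accumulated directly from them. A short sketch under the same Gaussian-model assumption as above (illustrative naming, not the authors' code):

import numpy as np

def js_gm_scores(class_data, priors):
    """Per-feature JS_GM score, Eq. (7): 0.5 * sum over i != j of
    pi_i * pi_j * J(P_i || P_j), with pairwise Gaussian J divergences."""
    stats = [(X.mean(axis=0), X.var(axis=0) + 1e-9) for X in class_data]
    scores = np.zeros(class_data[0].shape[1])
    for i, (mi, vi) in enumerate(stats):
        for j, (mj, vj) in enumerate(stats):
            if i == j:
                continue
            d2 = (mi - mj) ** 2
            j_div = 0.5 * ((vi + d2) / vj + (vj + d2) / vi) - 1.0
            scores += 0.5 * priors[i] * priors[j] * j_div
    return scores   # select the K features with the largest scores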


Comparison of the complexity of the algorithms

Algorithm                     Training: estimation     Training: feature ranking   Testing time per sample
MeMd one vs. all (MeMd-J)     O(MNd)                   O(Md + d log d)             O(MK)
MeMd JS_GM (MeMd-JS)          O(MNd)                   O(M^2 d + d log d)          O(MK)
Support Vector Machine [1]    #iterations x O(Md)      --                          O(M^2 Sd)
MaxEnt Discrimination [2]     #iterations x O(MNd)     --                          O(Md)

M = number of classes, N = number of training samples, d = number of features, S = number of support vectors, K = number of selected features.

[1] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1-27:27, 2011.
[2] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification," in IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999, pp. 61-67.

Experiments on gene expression datasets (10-fold cross-validation accuracy)

Dataset        Classes   Samples   Features   SVM (linear)   MeMd-J (2-moment)   MeMd-JS (2-moment)
Colon cancer   2         62        2000       84.00          86.40               --
Leukemia       2         72        5147       96.89          98.97               --
CNS            2         60        7129       62.50          63.75               --
DLBCL          2         77        7070       97.74          86.77               --
Prostate       2         102       12533      89.51          89.75               --
SRBCT          4         83        2308       99.20          97.27               98.33
Lung           5         203       12600      93.21          93.52               92.60
GCM            14        190       16063      66.85          66.98               66.98

Folds in the cross-validation were chosen randomly. Best accuracies highlighted for each method. DME was not run, as it was developed only for text datasets.

Experiments on text datasets (Reuters), 2-fold cross-validation accuracy

Dataset   Classes   Samples   Features   SVM (RBF)   DME     MeMd-J (1-moment)   MeMd-JS (1-moment)
1         2         1588      7777       95.96       95.59   97.35               --
2         2         1227      8776       91.35       92.33   91.69               --
3         2         1973      9939       92.80       93.60   93.81               --
4         2         1945      6970       89.61       90.48   89.77               --
5         2         3581      13824      98.49       98.68   99.02               --
6         2         3952      17277      96.63       96.93   95.04               --
7         2         3918      13306      88.23       91.88   91.75               --
8         4         3253      17998      88.62       90.34   91.91               91.39
9         4         3952      17275      94.63       95.26   95.14               94.88
10        4         3581      13822      95.83       96.23   96.14               95.86
11        4         4891      15929      81.08       83.41   82.11               82.03

Experiments were constructed by grouping classes in different ways. Best accuracies highlighted for each method.


Conclusions
- This is the first work on a generative maximum entropy approach to classification.
- We proposed a method of classification using maximum entropy with maximum discrimination (MeMd):
  - Generative approach: modelling class conditional densities.
  - Discrimination: use of divergences to measure the discriminative ability of features.
  - Feature selection: selection of the most discriminative features.
- The use of a multi-distribution divergence for the multi-class problem is a new concept introduced in this work.
- Linear time complexity, making the method suitable for large datasets with high-dimensional features.

Thank you!!