Unsupervised Classifiers, Mutual Information and 'Phantom Targets'


John S. Bridle, Anthony J. R. Heading
Defence Research Agency, St. Andrew's Road, Malvern, Worcs. WR14 3PS, U.K.

David J. C. MacKay
California Institute of Technology 139-74, Pasadena, CA 91125, U.S.A.

Abstract

We derive criteria for training adaptive classifier networks to perform unsupervised data analysis. The first criterion turns a simple Gaussian classifier into a simple Gaussian mixture analyser. The second criterion, which is much more generally applicable, is based on mutual information. It simplifies to an intuitively reasonable difference between two entropy functions, one encouraging 'decisiveness,' the other 'fairness' to the alternative interpretations of the input. This 'firm but fair' criterion can be applied to any network that produces probability-type outputs, but it does not necessarily lead to useful behavior.

1 Unsupervised Classification

One of the main distinctions made in discussing neural network architectures, and pattern analysis algorithms generally, is between supervised and unsupervised data analysis. We should therefore be interested in any method of building bridges between techniques in these two categories. For instance, it is possible to use an unsupervised system such as a Boltzmann machine to learn the joint distribution of inputs and a teacher's classification labels. The particular type of bridge we seek is a method of taking a supervised pattern classifier and turning it into an unsupervised data analyser; that is, we are interested in methods of "bootstrapping" classifiers.

Consider a classifier system. Its input is a vector x, and its output is a probability vector y(x): the elements of y are positive and sum to 1. The elements of y, (y_i(x), i = 1, ..., N_c), are to be taken as the probabilities that x should be assigned to each of N_c classes. (Note that our definition of classifier does not include a decision process.)

To enforce the conditions we require for the output values, we recommend using a generalised logistic (normalised exponential, or SoftMax) output stage. We call the unnormalised log probabilities of the classes a_i, and the softmax performs

    y_i = e^{a_i} / Z,   with   Z = Σ_j e^{a_j}.   (1)

Normally the parameters of such a system would be adjusted using a training set comprising examples of inputs and corresponding classes, {(x_i, c_i)}. We assume that the system includes means to convert derivatives of a training criterion with respect to the outputs into a form suitable for adjusting the values of the parameters, for instance by "backpropagation". Imagine however that we have unlabelled data, x_m, m = 1, ..., N_ts, and wish to use it to 'improve' the classifier. We could think of this as self-supervised learning: to hone an already good system on lots of easily obtained unlabelled real-world data, or to adapt to a slowly changing environment, or as a way of turning a classifier into some sort of cluster analyser. (Just what kind depends on details of the classifier itself.) The ideal method would be theoretically well-founded, general-purpose (independent of the details of the classifier), and computationally tractable.

One well-known approach to unsupervised data analysis is to minimise a reconstruction error: for linear projections and squared Euclidean distance this leads to principal components analysis, while reference-point based classifiers lead to vector quantizer design methods, such as the LBG algorithm. Variants on VQ, such as Kohonen's feature maps, can be motivated by requiring robustness to distortions in the code space. Reconstruction error is only available as a training criterion if reconstruction is defined: in general we are only given class label probabilities.
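The normalised exponential (1) is a few lines of code; a minimal numpy sketch (the subtraction of the maximum is a standard numerical-stability device, not part of the text):

```python
import numpy as np

def softmax(a):
    """Normalised exponential (SoftMax): y_i = exp(a_i) / Z, Z = sum_j exp(a_j).

    `a` holds the unnormalised log class probabilities a_i of equation (1).
    Subtracting max(a) leaves the result unchanged (Z scales by the same
    factor) but avoids floating-point overflow for large activations.
    """
    z = np.exp(a - np.max(a))
    return z / z.sum()

# A probability vector: positive elements summing to 1
y = softmax(np.array([1.0, 2.0, 3.0]))
```

The output can be fed directly to any criterion that expects class probabilities, since the simplex constraints are enforced by construction.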
2 A Data Likelihood Criterion

For the special case of a Gaussian clustering of an unlabelled data set, it was demonstrated in [1] that gradient ascent on the likelihood of the data has an appealing interpretation in terms of backpropagation in an equivalent unit-Gaussian classifier network: for each input x presented to the network, the output y is doubled to give 'phantom targets' t = 2y; when the derivatives of the log likelihood criterion J = -Σ_i t_i log y_i relative to these targets are propagated back through the network, it turns out that the resulting gradient is identical to the gradient of the likelihood of the data given a Gaussian mixture model. For the unit-Gaussian classifier, the activations a_i in (1) are

    a_i = -|x - w_i|^2 / 2,   (2)

so the outputs of the network are

    y_i = P(class = i | x, w),   (3)

where we assume the inputs are drawn from equi-probable unit-Gaussian distributions with the mean of the distribution of the i-th class equal to w_i.

This result was only derived in a limited context, and it was speculated that it might be generalisable to arbitrary classification models. The above phantom target rule has been re-derived for a larger class of networks [4], but the conditions for strict applicability are quite severe. Briefly, there should be exponential density functions for each class, and the normalizing factors for these densities should be independent of the parameters. Thus Gaussians with fixed covariance matrices are acceptable, but variable covariances are not, and neither are linear transformations preceding the Gaussians. The next section introduces a new objective function which is independent of details of the classifier.

3 Mutual Information Criterion

Intuitively, an unsupervised adaptive classifier is doing a plausible job if its outputs usually give a fairly clear indication of the class of an input vector, and if there is also an even distribution of input patterns between the classes. We could label these desiderata 'decisive' and 'fair' respectively. Note that it is trivial to achieve either of them alone. For a poorly regularised model it may also be trivial to achieve both.

There are several ways to proceed. We could devise ad-hoc measures corresponding to our notions of decisiveness and fairness, or we could consider particular types of classifier and their unsupervised equivalents, seeking a general way of turning one into the other. Our approach is to return to the general idea that the class predictions should retain as much information about the input values as possible. We use a measure of the information about x which is conveyed by the output distribution, i.e. the mutual information between the inputs and the outputs. We interpret the outputs y as a probability distribution over a discrete random variable c (the class label), thus y = p(c|x). The mutual information between x and c is

    I(c; x) = ∫∫ dc dx p(c, x) log [ p(c, x) / ( p(c) p(x) ) ]   (4)
            = ∫ dx p(x) ∫ dc p(c|x) log [ p(c|x) / p(c) ]   (5)
            = ∫ dx p(x) ∫ dc p(c|x) log [ p(c|x) / ∫ dx p(x) p(c|x) ]   (6)

The elements of this expression are separately recognizable: ∫ dx p(x) (·) is equivalent to an average over a training set, (1/N_ts) Σ_ts (·); p(c|x) is simply the network output y_c; ∫ dc (·) is a sum over the class labels and corresponding network outputs. Hence:

    I(c; x) = (1/N_ts) Σ_ts Σ_{i=1}^{N_c} y_i log ( y_i / ȳ_i ),   (7)

where ȳ_i = (1/N_ts) Σ_ts y_i is the average of the i-th output over the training set.
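Given the outputs collected over a training set, the estimate (7) takes only a few lines; a sketch, assuming the outputs are stacked row-wise into a matrix (the random Dirichlet vectors here are placeholders, not the paper's data):

```python
import numpy as np

def mutual_information(Y):
    """Equation (7): I(c; x) ~= (1/N_ts) sum_m sum_i y_mi log(y_mi / ybar_i),

    where Y is an (N_ts, N_c) matrix whose m-th row is the output
    probability vector for training pattern m, and ybar is the average
    output over the training set.
    """
    ybar = Y.mean(axis=0)  # average activity of each output node
    return float(np.mean(np.sum(Y * np.log(Y / ybar), axis=1)))

# Example: 30 random probability vectors over 10 classes
rng = np.random.default_rng(0)
Y = rng.dirichlet(np.ones(10), size=30)
I = mutual_information(Y)  # non-negative, at most log(10)
```

Each inner sum is the Kullback-Leibler divergence of that pattern's output from the average output, which is why the estimate is non-negative and bounded above by log N_c.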

Rearranging (7),

    I(c; x) = -Σ_{i=1}^{N_c} ȳ_i log ȳ_i + (1/N_ts) Σ_ts Σ_{i=1}^{N_c} y_i log y_i   (8)
            = H(ȳ) - H̄(y),   (9)

where H(·) denotes entropy and H̄(y) is the average of H(y) over the training set. The objective function I is the difference between the entropy of the average of the outputs and the average of the entropy of the outputs, where both averages are over the training set. H(ȳ) has its maximum value when the average activities of the separate outputs are equal: this is 'fairness'. H̄(y) has its minimum value when one output is full on and the rest are off for every training case: this is 'firmness'. We now evaluate I for the training set and take the gradient of I.

4 Gradient descent

To use this criterion with back-propagation network training, we need its derivatives with respect to the network outputs. For the output vector y produced by a given training pattern,

    ∂I(c; x)/∂y_i = ∂/∂y_i [ (1/N_ts) Σ_ts Σ_j y_j log y_j - Σ_j ȳ_j log ȳ_j ]   (10)
                  = (1/N_ts) (log y_i + 1) - (1/N_ts) (log ȳ_i + 1)   (11)
                  = (1/N_ts) log ( y_i / ȳ_i ).   (12)

The resulting expression is quite simple, but note that the presence of a ȳ_i term means that two passes through the training set are required: the first to calculate the average output node activations, and the second to back-propagate the derivatives.

5 Illustrations

Figure 1 shows I (divided by its maximum possible value, log N_c) for a run of a particular unit-Gaussian classifier network. The 30 data points are drawn from a 2-d isotropic Gaussian. Figure 2 shows the fairness and firmness criteria separately. (The upper curve is 'fairness', H(ȳ)/log N_c, and the lower curve is 'firmness', 1 - H̄(y)/log N_c.) The ten reference points had starting values drawn from the same distribution as the data. Figure 3 shows their movement during training. From initial positions within the data cluster, they move outwards into a circle around the data. The resulting classification regions are shown in Figure 4. (The grey level is proportional to the value of the maximum response at each point, and since the outputs are positive and normalised this value drops to 0.5 or less at the decision boundaries.)
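The two-pass gradient recipe of Section 4 is easy to check against finite differences; a minimal sketch (random probability vectors stand in for the network outputs, and the entries of Y are perturbed as free variables):

```python
import numpy as np

def criterion(Y):
    """I = H(ybar) - mean H(y): equations (8)-(9), for an (N_ts, N_c) matrix Y."""
    ybar = Y.mean(axis=0)
    return -(ybar * np.log(ybar)).sum() + np.mean(np.sum(Y * np.log(Y), axis=1))

def analytic_grad(Y):
    """Equation (12): dI/dy_mi = (1/N_ts) log(y_mi / ybar_i).

    The ybar term is why two passes over the training set are needed:
    one to accumulate the average output activations, one to
    back-propagate the derivatives.
    """
    return np.log(Y / Y.mean(axis=0)) / Y.shape[0]

rng = np.random.default_rng(1)
Y = rng.dirichlet(np.ones(5), size=20)

# Central finite difference on a single output entry
eps = 1e-6
Yp, Ym = Y.copy(), Y.copy()
Yp[3, 2] += eps
Ym[3, 2] -= eps
numeric = (criterion(Yp) - criterion(Ym)) / (2 * eps)
```

In a full implementation the derivatives (12) would be passed back through the softmax and the rest of the network by ordinary backpropagation.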
[Figures 1-4: 1. The M.I. criterion; 2. Firm and fair separately; 3. Tracks of reference points; 4. Decision regions.]

We observe that the space is being partitioned into regions with roughly equal numbers of points. It might be surprising at first that the reference points do not end up near the data. However, it is only the transformation from data x to outputs y that is being trained, and the reference points are just parameters of that transformation. As the reference points move further away from one another the decision boundaries grow firmer. In this example the fairness criterion happens to decrease in favour of the firmness, and this usually happens. We could consider different weightings of the two components of the criterion.

6 Comments

The usefulness of this objective function will depend very much on the form of classifier that it is applied to. For a poorly regularised classifier, maximisation of the criterion alone will not necessarily lead to good solutions to unsupervised classification; it could be maximised by any implausible classification of the input that is completely hard (i.e. the output vector always has one 1 and all the other outputs 0) and that chops the training set into regions containing similar numbers of training points; such a solution would be one of many global maxima, regardless of whether it chopped the data into natural classes. The meaning of a 'natural' partition in this context is, of course, rather ill-defined. Simple models often do not have the capacity to break a pattern space into highly contorted regions; the decision boundaries shown in Figure 4 are an example of a model producing a reasonable result as a consequence of its inherent simplicity. When we use more complex models, however, we must ensure that we find simpler solutions in preference to more complex ones. Thus this criterion encourages us to pursue objective techniques for regularising classification networks [2, 3]; such techniques are probably long overdue.

Copyright Controller HMSO London 1992

References

[1] J. S. Bridle (1988). The phantom target cluster network: a peculiar relative of (unsupervised) maximum likelihood stochastic modelling and (supervised) error backpropagation. RSRE Research Note SP4: 66, DRA Malvern, UK.
[2] D. J. C. MacKay (1991). Bayesian interpolation. Submitted to Neural Computation.
[3] D. J. C. MacKay (1991). A practical Bayesian framework for backprop networks. Submitted to Neural Computation.
[4] J. S. Bridle and S. J. Cox. RecNorm: simultaneous normalisation and classification applied to speech recognition. In Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991.
[5] J. S. Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2. Morgan Kaufmann, 1990.