An Adaptive Bayesian Network for Low-Level Image Processing

S P Luttrell
Defence Research Agency, Malvern, Worcs, WR14 3PS, UK
Electronic address: luttrell%uk.mod.hermes@relay.mod.uk

Note: This appeared in Proceedings of the 3rd International Conference on Artificial Neural Networks, Brighton, 61-65, 1993. (c) British Crown Copyright 1993/DRA. Published with the permission of the Controller of Her Britannic Majesty's Stationery Office.

I. INTRODUCTION

Probability calculus, based on the axioms of inference (Cox [1]), is the only consistent scheme for performing inference; this is also known as Bayesian inference. The objects which this approach manipulates, namely probability density functions (PDFs), may be created in a variety of ways, but the focus of this paper is on the use of adaptive PDF networks. Adaptive mixture distribution (MD) networks are already widely used (Luttrell [2]). In this paper an extension of the standard MD approach is presented, called a partitioned mixture distribution (PMD). PMD networks are designed specifically to scale sensibly to high-dimensional problems, such as image processing. Several numerical simulations are performed which demonstrate that the emergent properties of PMD networks are similar to those of biological low-level vision processing systems.

II. THEORY

In this section the use of PDFs as a vehicle for solving inference problems is discussed, and the standard theory of MDs is summarised. The new theory of PMDs is then presented.

A. The Bayesian Approach

The axioms of inference (Cox [1]) lead to the use of probabilities as the unique scheme for performing consistent inference. This approach has three essential stages: (i) choose a state space x, (ii) assign a joint PDF Q(x), and (iii) draw inferences by computing conditional PDFs. This is generally called the Bayesian approach. Usually, x is split into two or more subspaces in order to distinguish between externally accessible components (e.g. data) and inaccessible components (e.g. model parameters).

To illustrate the use of the Bayesian approach, consider the joint PDF Q(x_+, x_-, s), where x_- is the past data, x_+ is the future data, and s are the model parameters. It allows predictions to be made by computing the conditional PDF of the future given the past:

    Q(x_+ | x_-) = \int ds \, Q(x_+ | s) \, Q(s | x_-)    (2.1)

If Q(s | x_-) has a single sharp peak (as a function of s) then the integration can be approximated as follows:

    Q(x_+ | x_-) \approx Q(x_+ | s(x_-)), \qquad
    s(x_-) = \arg\max_s Q(x_- | s) \ \text{(maximum likelihood)} \quad \text{or} \quad
    s(x_-) = \arg\max_s Q(s | x_-) \ \text{(maximum posterior)}    (2.2)

This approximation is accurate only in situations where the integration over model parameters is dominated by contributions in the vicinity of s = s(x_-).

The above approximations can be interpreted in neural network language as follows. The parameters s = s(x_-) correspond to the neural network weights (or whatever) that emerge from the training process. Subsequent computation of Q(x_+ | s) corresponds to testing the neural network in the field. This analogy holds for both supervised and unsupervised networks, although the details of the training algorithm and the correspondence between (x, s) and (network inputs/outputs, network weights) are different in each case. This correspondence between PDF models and neural networks will be tacitly assumed throughout this paper.
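As a toy illustration of the sharp-peak approximation in equation (2.2) (this example is not from the paper; the i.i.d. Gaussian model, the flat prior, and the function name are assumptions made here), a minimal Python sketch might look like this:

    import numpy as np

    def predictive_density_sharp_peak(x_future, x_past, sigma=1.0):
        """Approximate Q(x_+ | x_-) by Q(x_+ | s(x_-)) as in eq. (2.2).

        Toy model: data are i.i.d. N(s, sigma^2) with unknown mean s and a
        flat prior, so maximum likelihood and maximum posterior coincide at
        the sample mean of the past data.
        """
        s_hat = np.mean(x_past)                 # s(x_-): peak of Q(s | x_-)
        z = (x_future - s_hat) / sigma
        return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

Here the full integral over s in equation (2.1) is replaced by a single plug-in value, which is accurate only when Q(s | x_-) is sharply peaked.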
B. The Mixture Distribution

An MD can be written as (omitting s for clarity)

    Q(x) = \sum_{c=1}^{n} Q(x, c) = \sum_{c=1}^{n} Q(x | c) \, Q(c)    (2.3)

An additional variable c is introduced, and the required PDF Q(x) is recovered by summing the joint PDF Q(x, c) over the possible states of c. c can be interpreted as a class label, in which case the Q(x | c) are class likelihoods, and the Q(c) are class prior probabilities. In Figure 1 an MD is drawn as if it were a neural network. It is a type of radial basis function network in which the input layer is transformed through a set of nonlinear functions Q(x | c) (which do not necessarily have a purely radial dependence), and then transformed through a linear summation operation.
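To make equation (2.3) concrete, here is a minimal Python/NumPy sketch of evaluating an MD with Gaussian class likelihoods (the function name and array layout are assumptions of this example, not part of the paper):

    import numpy as np

    def mixture_density(x, means, covariances, priors):
        """Evaluate Q(x) = sum_c Q(x | c) Q(c) for Gaussian class likelihoods.

        x           : (d,) input vector
        means       : (n, d) class means m(c)
        covariances : (n, d, d) class covariances A(c)
        priors      : (n,) class priors Q(c), summing to 1
        """
        d = x.shape[0]
        density = 0.0
        for m, A, p in zip(means, covariances, priors):
            diff = x - m
            norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(A))
            likelihood = np.exp(-0.5 * diff @ np.linalg.solve(A, diff)) / norm
            density += p * likelihood           # Q(x | c) Q(c)
        return density

Each term of the sum corresponds to one hidden unit of the network in Figure 1, and the final accumulation is the linear summation layer.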

Figure 1: Mixture distribution network.

C. The Partitioned Mixture Distribution

A useful extension of the standard MD is a set of coupled MDs, called a partitioned mixture distribution (PMD), Luttrell [3]. The basic idea is to retain the additional variable c, and to construct a large number of MD models, each of which uses only a subset of the states that are available to c, and each of which sees only a subspace of the input data x.

For simplicity, the theory of a 1-dimensional PMD is presented; the extension to higher dimensions or alternative topologies is obvious. Assume that x is N-dimensional, c has N possible states, and each MD uses only n of these states (n <= N, n is odd-valued). Each MD is then defined as (assuming circular boundary conditions, so that the class label c is modulo N)

    Q_c(x) = \frac{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x | c') \, Q(c')}{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c')}, \qquad c = 1, 2, \ldots, N    (2.4)

where the integer part of n/2 is used. The set of all N MDs is the PMD. The summations used in this definition of Q_c(x) can be generalised to any symmetric weighted summation.

The network representation of a PMD is shown schematically in Figure 2, where multiple overlapping MDs are computed. The phrase "partitioned mixture distribution" derives from the fact that the state space of the class label c is partitioned into overlapping sets of n states. The network in Figure 2 can be extended laterally without encountering any scaling problems because its connectivity is local. However, it is not able to capture long-range statistical properties of the input vector x (e.g. the correlation between pixels on opposite sides of an image) because no partition simultaneously sees information from widely separated sources. A multilayer PMD network based on the Adaptive Cluster Expansion approach, Luttrell [4, 5], is needed to remove this limitation.

Figure 2: Partitioned mixture distribution network.

D. Optimising the Network Parameters

The PMD network is optimised by maximising the geometric mean of the likelihoods of the N MD models:

    s(x_-) = \arg\max_s \left[ \prod_{c=1}^{N} Q_c(x_- | s) \right]^{1/N}    (2.5)

In the limit of a large training set size this is equivalent to maximising the average logarithmic likelihood:

    s(x_-) = \arg\max_s \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)    (2.6)

The average over training data has been replaced by a shorthand notation in which the PDF P(x) is used to represent the density of training vectors, as follows:

    \frac{1}{T} \sum_{t=1}^{T} (\,\cdot\,) \big|_{x=x_t} \;\longrightarrow\; \int dx \, P(x) \, (\,\cdot\,)    (2.7)

The posterior probability for model c is (omitting the s variables)

    Q_c(c' | x) = \begin{cases} \dfrac{Q(x, c')}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c'')} & |c - c'| \le \lfloor n/2 \rfloor \\ 0 & |c - c'| > \lfloor n/2 \rfloor \end{cases}    (2.8)

which satisfies the normalisation condition \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1. The sum over model index of this posterior probability is

    \tilde{Q}(c | x) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_{c'}(c | x) = \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(x, c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(x, c'')}    (2.9)

and a similar expression for the sum of the prior probabilities is

    \tilde{Q}(c) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(c'')}    (2.10)

These definitions satisfy the normalisation conditions \sum_{c=1}^{N} \tilde{Q}(c | x) = \sum_{c=1}^{N} \tilde{Q}(c) = N. Normalisation to N, rather than 1, is used to ensure that \tilde{Q}(c | x) and \tilde{Q}(c) do not depend on the dimensionality N of the input data. Note that this tilde-transform can be used heuristically to normalise any array of numbers.
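As a concrete illustration of equations (2.4) and (2.9), here is a small Python/NumPy sketch (not from the paper; the function and variable names are assumptions) that evaluates all N partition densities Q_c(x) and the summed posteriors \tilde{Q}(c | x) for a 1-dimensional PMD with circular boundary conditions:

    import numpy as np

    def pmd_forward(likelihoods, priors, n):
        """Per-partition mixture densities Q_c(x) (eq. 2.4) and summed
        posteriors Q~(c | x) (eq. 2.9) of a 1-D PMD with circular boundaries.

        likelihoods : (N,) array of Q(x | c) evaluated at one input x
        priors      : (N,) array of Q(c)
        n           : odd number of class states seen by each partition
        """
        N = likelihoods.shape[0]
        half = n // 2                            # integer part of n/2
        joint = likelihoods * priors             # Q(x, c) = Q(x | c) Q(c)
        Qc = np.empty(N)                         # Q_c(x)
        joint_window = np.empty(N)               # sum of Q(x, c') over each window
        for c in range(N):
            window = [(c + k) % N for k in range(-half, half + 1)]
            joint_window[c] = joint[window].sum()
            Qc[c] = joint_window[c] / priors[window].sum()
        # Q~(c | x): sum over the n partitions whose window contains class c
        post = np.empty(N)
        for c in range(N):
            partitions = [(c + k) % N for k in range(-half, half + 1)]
            post[c] = (joint[c] / joint_window[partitions]).sum()
        return Qc, post

Note that the returned posteriors sum to N rather than 1, as required by the normalisation convention above.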

For clarity, two alternative prescriptions for optimising s are now presented without proof. The detailed derivations of these two prescriptions can be found in Appendix A.

1. Re-estimation prescription:

    \hat{s} = \arg\max_{s'} \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right]    (2.11)

2. Gradient ascent prescription:

    \Delta s \propto \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s) = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s} \right]    (2.12)

In both of these prescriptions, the second term derives from the denominator of the PMD expression, and in the case n = N the standard MD prescriptions emerge.

E. The Gaussian Partitioned Mixture Distribution

A popular model to use for the Q(x | c) is a set of Gaussian PDFs with means m(c) and covariances A(c). The prior probabilities Q(c) are their own parameters (i.e. non-parametric). Define some average moments as

    \begin{pmatrix} M_0(c) \\ M_1(c) \\ M_2(c) \end{pmatrix} \equiv \int dx \, P(x) \, \tilde{Q}(c | x) \begin{pmatrix} 1 \\ x \\ x x^T \end{pmatrix}    (2.13)

The re-estimation equations become

    \tilde{\hat{Q}}(c) = M_0(c), \qquad \hat{m}(c) = \frac{M_1(c)}{M_0(c)}, \qquad \hat{A}(c) = \frac{M_2(c)}{M_0(c)} - \hat{m}(c) \, \hat{m}(c)^T    (2.14)

The re-estimation equation for Q(c) is the same whether or not a Gaussian model is assumed, and it must be solved iteratively. The re-estimation equations for m(c) and \hat{A}(c) are straightforward to solve.

The gradient ascent equations become

    \Delta Q(c) \propto \frac{\tilde{Q}(c | x) - \tilde{Q}(c)}{Q(c)}, \qquad
    \Delta m(c) \propto \tilde{Q}(c | x) \, (x - m(c)), \qquad
    \Delta A^{-1}(c) \propto \tilde{Q}(c | x) \left( A(c) - (x - m(c)) (x - m(c))^T \right)    (2.15)

These results for a Gaussian PMD reduce to the standard results for the corresponding Gaussian MD when n = N.
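As an illustration of equations (2.13)-(2.14), here is a sketch of a single re-estimation pass for the Gaussian parameters (this is not from the paper; it assumes the summed posteriors \tilde{Q}(c | x_t) have already been computed, and the function name and array layout are assumptions):

    import numpy as np

    def reestimate_gaussian_pmd(X, resp):
        """One re-estimation pass for the Gaussian parameters of a PMD.

        X    : (T, d) array of training vectors
        resp : (T, N) array of summed posteriors Q~(c | x_t)

        Returns re-estimated means m(c) and covariances A(c).
        """
        T, d = X.shape
        M0 = resp.mean(axis=0)                             # (N,)      M_0(c)
        M1 = resp.T @ X / T                                # (N, d)    M_1(c)
        M2 = np.einsum('tc,ti,tj->cij', resp, X, X) / T    # (N, d, d) M_2(c)

        means = M1 / M0[:, None]                           # m(c) = M_1(c) / M_0(c)
        covs = M2 / M0[:, None, None] - np.einsum('ci,cj->cij', means, means)
        return means, covs                                 # A(c) = M_2/M_0 - m m^T

The prior re-estimation \tilde{\hat{Q}}(c) = M_0(c) is omitted from the sketch because, as noted above, it has to be solved iteratively.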

III. NUMERICAL SIMULATIONS

Numerical simulations were performed, using images as the input data, in order to demonstrate some of the emergent properties of a PMD network. The network architecture that was used in these simulations is shown in Figure 3 (which should be compared with Figure 2).

Figure 3: Partitioned mixture distribution network for image processing.

A. Completeness of Partitioned Mixture Distribution

This simulation was conducted to demonstrate that each partition of the PMD has sufficient resources to construct a full MD within its input window. This is a type of completeness property. The network was a 16 x 16 toroidal array, each input window size was 9 x 9, and each mixture window size was 5 x 5. The training data was normalised by applying the tilde-transform discussed earlier, the variance of the Gaussians was fixed at 0.5, and (for convenience only) the Gaussian means became topographically ordered by including Kohonen-style neighbourhood updates.

Figure 4: (a) Example of a training image. The superimposed square indicates the size of an input window. (b) Montage of Gaussian means after training. The superimposed square indicates the size of a mixture window.

Note how a complete repertoire of Gaussian means can be found within each 5 x 5 mixture window, wherever the window is located. The topographic ordering ensures that the Gaussian mean varies smoothly across the montage. Topographic ordering is not necessary for a PMD to function correctly; it is introduced to make the montage easier to interpret visually. However, in a multilayer PMD network (not studied here) topographic ordering is actually needed for non-cosmetic reasons. In more sophisticated simulations (e.g. pairs of input images) the striations that occur in the montage can be directly related to the dominance columns that are observed in the visual cortex.

B. Translation Invariant Partitions

This simulation was conducted to demonstrate that a PMD network can compute MD models in an approximately translation invariant fashion, despite being non-translation invariant at the level of its class probabilities. A toroidal 64 x 64 network was used.

Figure 5: Response patterns that occur as the input image is moved upwards past the network. (a) Class probabilities. (b) Mixture probabilities.

In Figure 5a note that the class probabilities fluctuate dramatically as the input data is moved past the network, but in Figure 5b the mixture probabilities move with the data and fluctuate only slightly. This invariance property is important in image processing applications, where all parts of the image are (initially) treated on an equal footing. For obvious reasons, it is convenient to refer to the images in Figure 5 as probability images.

C. Normalised Partitioned Mixture Distribution Inputs

This simulation was conducted to demonstrate what happens when the input to each Gaussian is normalised. This forces the PMD to concentrate on the variation of the input data (rather than its absolute value) within each input window. A toroidal 20 x 20 network was used.

Figure 6: (a) Montage of Gaussian means after training with normalised input vectors. (b) Vector field of displacements after training with normalised input vectors.

In Figure 6a the PMD network develops Gaussian means that resemble small blob or bar-like features with various displacements and orientations. For convenience, the displacement of the centre of mass of each Gaussian mean (with respect to its central pixel) is shown in Figure 6b. This resembles an orientation map, as observed in the visual cortex. Each partition of the PMD contains the full repertoire of Gaussian means that is needed to construct an MD using normalised inputs.
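As an illustrative sketch of the quantity plotted in Figure 6b (the paper does not give the exact computation, so the weighting by the magnitude of the mean and the function name below are assumptions made here), the centre-of-mass displacement of each Gaussian mean could be computed as follows:

    import numpy as np

    def centre_of_mass_displacements(means, window_shape):
        """Displacement of the centre of mass of each Gaussian mean from the
        centre of its input window (cf. the vector field in Figure 6b).

        means        : (N, h*w) array, each row a Gaussian mean over a window
        window_shape : (h, w) of the input window
        """
        h, w = window_shape
        rows, cols = np.mgrid[0:h, 0:w]
        centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
        displacements = np.empty((means.shape[0], 2))
        for i, m in enumerate(means):
            weights = np.abs(m.reshape(h, w))    # use magnitudes as weights (assumption)
            total = weights.sum()
            com = np.array([(rows * weights).sum(), (cols * weights).sum()]) / total
            displacements[i] = com - centre
        return displacements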

IV. CONCLUSIONS

Partitioned mixture distribution networks are a scalable generalisation of standard mixture distributions. Ultimately, as in all Bayesian networks, the power of these networks derives from the use of probabilities as the basic computational objects. The results of the numerical simulations demonstrate that PMD networks have many desirable properties in common with the visual cortex. Multilayer versions of PMD networks, Luttrell [4, 5], have the potential to extend this correspondence. PMD networks have a structure that is amenable to hardware implementation. This opens up the possibility of constructing a fast low-level vision engine based entirely on rigorous Bayesian principles.

Appendix A

This appendix contains the derivations of two prescriptions for maximising the average logarithmic likelihood that the set of models contained in a PMD fits the training data. The last line of each of these derivations appears in the main text without proof.

1. Re-estimation Prescription

Define the increase in the average logarithmic likelihood as L(s'; s) = L(s') - L(s), where L(s) = \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s). Then

    L(s'; s) = \int dx \, P(x) \sum_{c=1}^{N} \log \frac{Q_c(x | s')}{Q_c(x | s)}
             = \int dx \, P(x) \sum_{c=1}^{N} \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, \frac{Q(x, c' | s') \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)}{Q(x, c' | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')}
             \ge \int dx \, P(x) \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \log \frac{Q(x, c' | s') \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)}{Q(x, c' | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')}
             = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right] + \text{a term that is independent of } s'    (A1)

The penultimate step was obtained by using Jensen's inequality for convex functions, and the last step was obtained by using the result \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1 and the result

    \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, (\,\cdot\,) = \sum_{c'=1}^{N} \sum_{c=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, (\,\cdot\,) = \sum_{c'=1}^{N} \tilde{Q}(c' | x, s) \, (\,\cdot\,)    (A2)

The maximisation of L(s') can now be replaced by the maximisation of its lower bound L(s'; s) with respect to s', which immediately yields the re-estimation equation.

2. Gradient Ascent Prescription

The gradient ascent prescription can be similarly derived by directly differentiating the logarithmic likelihood with respect to the parameter vector:

    \frac{\partial L(s)}{\partial s} = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)
    = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \left[ \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c' | s) - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s) \right]
    = \int dx \, P(x) \sum_{c=1}^{N} \left[ \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \frac{\partial \log Q(x, c' | s)}{\partial s} - \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(c' | s)}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)} \frac{\partial \log Q(c' | s)}{\partial s} \right]
    = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s} \right]    (A3)

As in the re-estimation prescription, the c and c' summations are interchanged to obtain the final result.
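For reference, the form of Jensen's inequality used in the penultimate step of (A1) is the standard one (stated here for convenience), applied with the weights w_{c'} = Q_c(c' | x, s), which are non-negative and sum to 1 within each partition:

    \log \sum_{i} w_i a_i \;\ge\; \sum_{i} w_i \log a_i, \qquad w_i \ge 0, \quad \sum_i w_i = 1, \quad a_i > 0.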

[1] R. T. Cox (1946). Probability, frequency and reasonable expectation. Am. J. Phys., 14(1), 1-13.
[2] S. P. Luttrell (1992). Adaptive Bayesian networks. In Proc. SPIE Conf. on Adaptive Signal Processing, (SPIE, Orlando), 151-140.
[3] S. P. Luttrell (1992). Partitioned mixture distributions: an introduction. DRA, Malvern, Technical Report 4671.
[4] S. P. Luttrell (1990). A trainable texture anomaly detector using the adaptive cluster expansion (ACE) method. RSRE, Malvern, Technical Report 4437.
[5] S. P. Luttrell (1991). A hierarchical network for clutter and texture modelling. In Proc. SPIE Conf. on Adaptive Signal Processing, (SPIE, San Diego), 518-528.