An Adaptive Bayesian Network for Low-Level Image Processing

S P Luttrell
Defence Research Agency, Malvern, Worcs, WR14 3PS, UK
Electronic address: luttrell%uk.mod.hermes@relay.mod.uk

Note: This appeared in Proceedings of the 3rd International Conference on Artificial Neural Networks, Brighton, 61-65, 1993. (c) British Crown Copyright 1993/DRA. Published with the permission of the Controller of Her Britannic Majesty's Stationery Office.

I. INTRODUCTION

Probability calculus, based on the axioms of inference (Cox [1]), is the only consistent scheme for performing inference; this is also known as Bayesian inference. The objects which this approach manipulates, namely probability density functions (PDFs), may be created in a variety of ways, but the focus of this paper is on the use of adaptive PDF networks. Adaptive mixture distribution (MD) networks are already widely used (Luttrell [2]). In this paper an extension of the standard MD approach is presented, called a partitioned mixture distribution (PMD). PMD networks are designed specifically to scale sensibly to high-dimensional problems, such as image processing. Several numerical simulations are performed which demonstrate that the emergent properties of PMD networks are similar to those of biological low-level vision processing systems.

II. THEORY

In this section the use of PDFs as a vehicle for solving inference problems is discussed, and the standard theory of MDs is summarised. The new theory of PMDs is then presented.

A. The Bayesian Approach

The axioms of inference (Cox [1]) lead to the use of probabilities as the unique scheme for performing consistent inference. This approach has three essential stages: (i) choose a state space x, (ii) assign a joint PDF Q(x), and (iii) draw inferences by computing conditional PDFs. This is generally called the Bayesian approach. Usually, x is split into two or more subspaces in order to distinguish between externally accessible components (e.g. data) and inaccessible components (e.g. model parameters).

To illustrate the use of the Bayesian approach, consider the joint PDF Q(x_+, x_-, s), where x_- is the past data, x_+ is the future data, and s are the model parameters. It allows predictions to be made by computing the conditional PDF of the future given the past:

    Q(x_+ | x_-) = \int ds \, Q(x_+ | s) \, Q(s | x_-)    (2.1)

If Q(s | x_-) has a single sharp peak (as a function of s) then the integration can be approximated as follows:

    Q(x_+ | x_-) \approx Q(x_+ | s(x_-)), \qquad
    s(x_-) = \arg\max_s Q(x_- | s) \ \text{(maximum likelihood)} \quad \text{or} \quad
    s(x_-) = \arg\max_s Q(s | x_-) \ \text{(maximum posterior)}    (2.2)

This approximation is accurate only in situations where the integration over model parameters is dominated by contributions in the vicinity of s = s(x_-).

The above approximations can be interpreted in neural network language as follows. The parameters s = s(x_-) correspond to the neural network weights (or whatever) that emerge from the training process. Subsequent computation of Q(x_+ | s) corresponds to testing the neural network in the field. This analogy holds for both supervised and unsupervised networks, although the details of the training algorithm and the correspondence between (x, s) and (network inputs/outputs, network weights) are different in each case. This correspondence between PDF models and neural networks will be tacitly assumed throughout this paper.
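As a toy illustration of the sharp-peak approximation in equation (2.2) (this example is not from the paper; the i.i.d. Gaussian model, the flat prior, and the function name are assumptions made here), a minimal Python sketch might look like this:

    import numpy as np

    def predictive_density_sharp_peak(x_future, x_past, sigma=1.0):
        """Approximate Q(x_+ | x_-) by Q(x_+ | s(x_-)) as in eq. (2.2).

        Toy model: data are i.i.d. N(s, sigma^2) with unknown mean s and a
        flat prior, so maximum likelihood and maximum posterior coincide at
        the sample mean of the past data.
        """
        s_hat = np.mean(x_past)                 # s(x_-): peak of Q(s | x_-)
        z = (x_future - s_hat) / sigma
        return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

Here the full integral over s in equation (2.1) is replaced by a single plug-in value, which is accurate only when Q(s | x_-) is sharply peaked.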
B. The Mixture Distribution

An MD can be written as (omitting s for clarity)

    Q(x) = \sum_{c=1}^{n} Q(x, c) = \sum_{c=1}^{n} Q(x | c) \, Q(c)    (2.3)

An additional variable c is introduced, and the required PDF Q(x) is recovered by summing the joint PDF Q(x, c) over the possible states of c. c can be interpreted as a class label, in which case the Q(x | c) are class likelihoods, and the Q(c) are class prior probabilities. In Figure 1 an MD is drawn as if it were a neural network. It is a type of radial basis function network in which the input layer is transformed through a set of nonlinear functions Q(x | c) (which do not necessarily have a purely radial dependence), and then transformed through a linear summation operation.
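To make equation (2.3) concrete, here is a minimal Python/NumPy sketch of evaluating an MD with Gaussian class likelihoods (the function name and array layout are assumptions of this example, not part of the paper):

    import numpy as np

    def mixture_density(x, means, covariances, priors):
        """Evaluate Q(x) = sum_c Q(x | c) Q(c) for Gaussian class likelihoods.

        x           : (d,) input vector
        means       : (n, d) class means m(c)
        covariances : (n, d, d) class covariances A(c)
        priors      : (n,) class priors Q(c), summing to 1
        """
        d = x.shape[0]
        density = 0.0
        for m, A, p in zip(means, covariances, priors):
            diff = x - m
            norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(A))
            likelihood = np.exp(-0.5 * diff @ np.linalg.solve(A, diff)) / norm
            density += p * likelihood           # Q(x | c) Q(c)
        return density

Each term of the sum corresponds to one hidden unit of the network in Figure 1, and the final accumulation is the linear summation layer.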

Figure 1: Mixture distribution network.

C. The Partitioned Mixture Distribution

A useful extension of the standard MD is a set of coupled MDs, called a partitioned mixture distribution (PMD), Luttrell [3]. The basic idea is to retain the additional variable c, and to construct a large number of MD models, each of which uses only a subset of the states that are available to c, and each of which sees only a subspace of the input data x.

For simplicity, the theory of a 1-dimensional PMD is presented; the extension to higher dimensions or alternative topologies is obvious. Assume that x is N-dimensional, c has N possible states, and each MD uses only n of these states (n <= N, n is odd-valued). Each MD is then defined as (assuming circular boundary conditions, so that the class label c is modulo N)

    Q_c(x) = \frac{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x | c') \, Q(c')}{\sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c')}, \qquad c = 1, 2, \ldots, N    (2.4)

where the integer part of n/2 is used. The set of all N MDs is the PMD. The summations used in this definition of Q_c(x) can be generalised to any symmetric weighted summation.

The network representation of a PMD is shown schematically in Figure 2, where multiple overlapping MDs are computed. The phrase "partitioned mixture distribution" derives from the fact that the state space of the class label c is partitioned into overlapping sets of n states. The network in Figure 2 can be extended laterally without encountering any scaling problems because its connectivity is local. However, it is not able to capture long-range statistical properties of the input vector x (e.g. the correlation between pixels on opposite sides of an image) because no partition simultaneously sees information from widely separated sources. A multilayer PMD network based on the Adaptive Cluster Expansion approach, Luttrell [4, 5], is needed to remove this limitation.

Figure 2: Partitioned mixture distribution network.

D. Optimising the Network Parameters

The PMD network is optimised by maximising the geometric mean of the likelihoods of the N MD models:

    s(x_-) = \arg\max_s \left[ \prod_{c=1}^{N} Q_c(x_- | s) \right]^{1/N}    (2.5)

In the limit of a large training set size this is equivalent to maximising the average logarithmic likelihood:

    s(x_-) = \arg\max_s \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)    (2.6)

The average over training data has been replaced by a shorthand notation in which the PDF P(x) is used to represent the density of training vectors, as follows:

    \frac{1}{T} \sum_{t=1}^{T} (\,\cdot\,) \big|_{x=x_t} \;\longrightarrow\; \int dx \, P(x) \, (\,\cdot\,)    (2.7)

The posterior probability for model c is (omitting the s variables)

    Q_c(c' | x) = \begin{cases} \dfrac{Q(x, c')}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c'')} & |c - c'| \le \lfloor n/2 \rfloor \\ 0 & |c - c'| > \lfloor n/2 \rfloor \end{cases}    (2.8)

which satisfies the normalisation condition \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1. The sum over model index of this posterior probability is

    \tilde{Q}(c | x) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_{c'}(c | x) = \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(x, c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(x, c'')}    (2.9)

and a similar expression for the sum of the prior probabilities is

    \tilde{Q}(c) \equiv \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(c)}{\sum_{c''=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q(c'')}    (2.10)

These definitions satisfy the normalisation conditions \sum_{c=1}^{N} \tilde{Q}(c | x) = \sum_{c=1}^{N} \tilde{Q}(c) = N. Normalisation to N, rather than 1, is used to ensure that \tilde{Q}(c | x) and \tilde{Q}(c) do not depend on the dimensionality N of the input data. Note that this tilde-transform can be used heuristically to normalise any array of numbers.
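As a concrete illustration of equations (2.4) and (2.9), here is a small Python/NumPy sketch (not from the paper; the function and variable names are assumptions) that evaluates all N partition densities Q_c(x) and the summed posteriors \tilde{Q}(c | x) for a 1-dimensional PMD with circular boundary conditions:

    import numpy as np

    def pmd_forward(likelihoods, priors, n):
        """Per-partition mixture densities Q_c(x) (eq. 2.4) and summed
        posteriors Q~(c | x) (eq. 2.9) of a 1-D PMD with circular boundaries.

        likelihoods : (N,) array of Q(x | c) evaluated at one input x
        priors      : (N,) array of Q(c)
        n           : odd number of class states seen by each partition
        """
        N = likelihoods.shape[0]
        half = n // 2                            # integer part of n/2
        joint = likelihoods * priors             # Q(x, c) = Q(x | c) Q(c)
        Qc = np.empty(N)                         # Q_c(x)
        joint_window = np.empty(N)               # sum of Q(x, c') over each window
        for c in range(N):
            window = [(c + k) % N for k in range(-half, half + 1)]
            joint_window[c] = joint[window].sum()
            Qc[c] = joint_window[c] / priors[window].sum()
        # Q~(c | x): sum over the n partitions whose window contains class c
        post = np.empty(N)
        for c in range(N):
            partitions = [(c + k) % N for k in range(-half, half + 1)]
            post[c] = (joint[c] / joint_window[partitions]).sum()
        return Qc, post

Note that the returned posteriors sum to N rather than 1, as required by the normalisation convention above.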

For clarity, two alternative prescriptions for optimising s are now presented without proof. The detailed derivations of these two prescriptions can be found in Appendix A.

1. Re-estimation prescription:

    \hat{s} = \arg\max_{s'} \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right]    (2.11)

2. Gradient ascent prescription:

    \Delta s \propto \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s) = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s} \right]    (2.12)

In both of these prescriptions, the second term derives from the denominator of the PMD expression, and in the case n = N the standard MD prescriptions emerge.

E. The Gaussian Partitioned Mixture Distribution

A popular model to use for the Q(x | c) is a set of Gaussian PDFs with means m(c) and covariances A(c). The prior probabilities Q(c) are their own parameters (i.e. non-parametric). Define some average moments as

    \begin{pmatrix} M_0(c) \\ M_1(c) \\ M_2(c) \end{pmatrix} \equiv \int dx \, P(x) \, \tilde{Q}(c | x) \begin{pmatrix} 1 \\ x \\ x x^T \end{pmatrix}    (2.13)

The re-estimation equations become

    \tilde{\hat{Q}}(c) = M_0(c), \qquad \hat{m}(c) = \frac{M_1(c)}{M_0(c)}, \qquad \hat{A}(c) = \frac{M_2(c)}{M_0(c)} - \hat{m}(c) \, \hat{m}(c)^T    (2.14)

The re-estimation equation for Q(c) is the same whether or not a Gaussian model is assumed, and it must be solved iteratively. The re-estimation equations for m(c) and \hat{A}(c) are straightforward to solve.

The gradient ascent equations become

    \Delta Q(c) \propto \frac{\tilde{Q}(c | x) - \tilde{Q}(c)}{Q(c)}, \qquad
    \Delta m(c) \propto \tilde{Q}(c | x) \, (x - m(c)), \qquad
    \Delta A^{-1}(c) \propto \tilde{Q}(c | x) \left( A(c) - (x - m(c)) (x - m(c))^T \right)    (2.15)

These results for a Gaussian PMD reduce to the standard results for the corresponding Gaussian MD when n = N.
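As an illustration of equations (2.13)-(2.14), here is a sketch of a single re-estimation pass for the Gaussian parameters (this is not from the paper; it assumes the summed posteriors \tilde{Q}(c | x_t) have already been computed, and the function name and array layout are assumptions):

    import numpy as np

    def reestimate_gaussian_pmd(X, resp):
        """One re-estimation pass for the Gaussian parameters of a PMD.

        X    : (T, d) array of training vectors
        resp : (T, N) array of summed posteriors Q~(c | x_t)

        Returns re-estimated means m(c) and covariances A(c).
        """
        T, d = X.shape
        M0 = resp.mean(axis=0)                             # (N,)      M_0(c)
        M1 = resp.T @ X / T                                # (N, d)    M_1(c)
        M2 = np.einsum('tc,ti,tj->cij', resp, X, X) / T    # (N, d, d) M_2(c)

        means = M1 / M0[:, None]                           # m(c) = M_1(c) / M_0(c)
        covs = M2 / M0[:, None, None] - np.einsum('ci,cj->cij', means, means)
        return means, covs                                 # A(c) = M_2/M_0 - m m^T

The prior re-estimation \tilde{\hat{Q}}(c) = M_0(c) is omitted from the sketch because, as noted above, it has to be solved iteratively.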

III. NUMERICAL SIMULATIONS

Numerical simulations were performed, using images as the input data, in order to demonstrate some of the emergent properties of a PMD network. The network architecture that was used in these simulations is shown in Figure 3 (which should be compared with Figure 2).

Figure 3: Partitioned mixture distribution network for image processing.

A. Completeness of Partitioned Mixture Distribution

This simulation was conducted to demonstrate that each partition of the PMD has sufficient resources to construct a full MD within its input window. This is a type of completeness property. The network was a 16 x 16 toroidal array, each input window size was 9 x 9, and each mixture window size was 5 x 5. The training data was normalised by applying the tilde-transform discussed earlier, the variance of the Gaussians was fixed at 0.5, and (for convenience only) the Gaussian means became topographically ordered by including Kohonen-style neighbourhood updates.

Figure 4: (a) Example of a training image. The superimposed square indicates the size of an input window. (b) Montage of Gaussian means after training. The superimposed square indicates the size of a mixture window.

Note how a complete repertoire of Gaussian means can be found within each 5 x 5 mixture window, wherever the window is located. The topographic ordering ensures that the Gaussian mean varies smoothly across the montage. Topographic ordering is not necessary for a PMD to function correctly; it is introduced to make the montage easier to interpret visually. However, in a multilayer PMD network (not studied here) topographic ordering is actually needed for non-cosmetic reasons. In more sophisticated simulations (e.g. pairs of input images) the striations that occur in the montage can be directly related to the dominance columns that are observed in the visual cortex.

B. Translation Invariant Partitions

This simulation was conducted to demonstrate that a PMD network can compute MD models in an approximately translation invariant fashion, despite being non-translation invariant at the level of its class probabilities. A toroidal 64 x 64 network was used.

Figure 5: Response patterns that occur as the input image is moved upwards past the network. (a) Class probabilities. (b) Mixture probabilities.

In Figure 5a note that the class probabilities fluctuate dramatically as the input data is moved past the network, but in Figure 5b the mixture probabilities move with the data and fluctuate only slightly. This invariance property is important in image processing applications, where all parts of the image are (initially) treated on an equal footing. For obvious reasons, it is convenient to refer to the images in Figure 5 as probability images.

C. Normalised Partitioned Mixture Distribution Inputs

This simulation was conducted to demonstrate what happens when the input to each Gaussian is normalised. This forces the PMD to concentrate on the variation of the input data (rather than its absolute value) within each input window. A toroidal 20 x 20 network was used.

Figure 6: (a) Montage of Gaussian means after training with normalised input vectors. (b) Vector field of displacements after training with normalised input vectors.

In Figure 6a the PMD network develops Gaussian means that resemble small blob or bar-like features with various displacements and orientations. For convenience, the displacement of the centre of mass of each Gaussian mean (with respect to its central pixel) is shown in Figure 6b. This resembles an orientation map, as observed in the visual cortex. Each partition of the PMD contains the full repertoire of Gaussian means that is needed to construct an MD using normalised inputs.
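As an illustrative sketch of the quantity plotted in Figure 6b (the paper does not give the exact computation, so the weighting by the magnitude of the mean and the function name below are assumptions made here), the centre-of-mass displacement of each Gaussian mean could be computed as follows:

    import numpy as np

    def centre_of_mass_displacements(means, window_shape):
        """Displacement of the centre of mass of each Gaussian mean from the
        centre of its input window (cf. the vector field in Figure 6b).

        means        : (N, h*w) array, each row a Gaussian mean over a window
        window_shape : (h, w) of the input window
        """
        h, w = window_shape
        rows, cols = np.mgrid[0:h, 0:w]
        centre = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
        displacements = np.empty((means.shape[0], 2))
        for i, m in enumerate(means):
            weights = np.abs(m.reshape(h, w))    # use magnitudes as weights (assumption)
            total = weights.sum()
            com = np.array([(rows * weights).sum(), (cols * weights).sum()]) / total
            displacements[i] = com - centre
        return displacements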

IV. CONCLUSIONS

Partitioned mixture distribution networks are a scalable generalisation of standard mixture distributions. Ultimately, as in all Bayesian networks, the power of these networks derives from the use of probabilities as the basic computational objects. The results of the numerical simulations demonstrate that PMD networks have many desirable properties in common with the visual cortex. Multilayer versions of PMD networks, Luttrell [4, 5], have the potential to extend this correspondence. PMD networks have a structure that is amenable to hardware implementation. This opens up the possibility of constructing a fast low-level vision engine based entirely on rigorous Bayesian principles.

Appendix A

This appendix contains the derivations of two prescriptions for maximising the average logarithmic likelihood that the set of models contained in a PMD fits the training data. The last line of each of these derivations appears in the main text without proof.

1. Re-estimation Prescription

Define the increase in the average logarithmic likelihood as L(s'; s) = L(s') - L(s), where L(s) = \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s). Then

    L(s'; s) = \int dx \, P(x) \sum_{c=1}^{N} \log \frac{Q_c(x | s')}{Q_c(x | s)}
             = \int dx \, P(x) \sum_{c=1}^{N} \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, \frac{Q(x, c' | s') \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)}{Q(x, c' | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')}
             \ge \int dx \, P(x) \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \log \frac{Q(x, c' | s') \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)}{Q(x, c' | s) \sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s')}
             = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \log Q(x, c | s') - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s') \right] + \text{a term that is independent of } s'    (A1)

The penultimate step was obtained by using Jensen's inequality for convex functions, and the last step was obtained by using the result \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x) = 1 and the result

    \sum_{c=1}^{N} \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, (\,\cdot\,) = \sum_{c'=1}^{N} \sum_{c=c'-\lfloor n/2 \rfloor}^{c'+\lfloor n/2 \rfloor} Q_c(c' | x, s) \, (\,\cdot\,) = \sum_{c'=1}^{N} \tilde{Q}(c' | x, s) \, (\,\cdot\,)    (A2)

The maximisation of L(s') can now be replaced by the maximisation of its lower bound L(s'; s) with respect to s', which immediately yields the re-estimation equation.

2. Gradient Ascent Prescription

The gradient ascent prescription can be similarly derived by directly differentiating the logarithmic likelihood with respect to the parameter vector:

    \frac{\partial L(s)}{\partial s} = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \log Q_c(x | s)
    = \frac{\partial}{\partial s} \int dx \, P(x) \sum_{c=1}^{N} \left[ \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(x, c' | s) - \log \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c' | s) \right]
    = \int dx \, P(x) \sum_{c=1}^{N} \left[ \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q_c(c' | x, s) \frac{\partial \log Q(x, c' | s)}{\partial s} - \sum_{c'=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} \frac{Q(c' | s)}{\sum_{c''=c-\lfloor n/2 \rfloor}^{c+\lfloor n/2 \rfloor} Q(c'' | s)} \frac{\partial \log Q(c' | s)}{\partial s} \right]
    = \int dx \, P(x) \sum_{c=1}^{N} \left[ \tilde{Q}(c | x, s) \frac{\partial \log Q(x, c | s)}{\partial s} - \tilde{Q}(c | s) \frac{\partial \log Q(c | s)}{\partial s} \right]    (A3)

As in the re-estimation prescription, the c and c' summations are interchanged to obtain the final result.
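For reference, the form of Jensen's inequality used in the penultimate step of (A1) is the standard one (stated here for convenience), applied with the weights w_{c'} = Q_c(c' | x, s), which are non-negative and sum to 1 within each partition:

    \log \sum_{i} w_i a_i \;\ge\; \sum_{i} w_i \log a_i, \qquad w_i \ge 0, \quad \sum_i w_i = 1, \quad a_i > 0.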

[1] R. T. Cox (1946). Probability, frequency and reasonable expectation. Am. J. Phys., 14(1), 1-13.
[2] S. P. Luttrell (1992). Adaptive Bayesian networks. In Proc. SPIE Conf. on Adaptive Signal Processing, (SPIE, Orlando), 151-140.
[3] S. P. Luttrell (1992). Partitioned mixture distributions: an introduction. DRA, Malvern, Technical Report 4671.
[4] S. P. Luttrell (1990). A trainable texture anomaly detector using the adaptive cluster expansion (ACE) method. RSRE, Malvern, Technical Report 4437.
[5] S. P. Luttrell (1991). A hierarchical network for clutter and texture modelling. In Proc. SPIE Conf. on Adaptive Signal Processing, (SPIE, San Diego), 518-528.