Radial Basis Function Networks: Algorithms
Introduction to Neural Networks: Lecture 13
John A. Bullinaria, 2004

1. The RBF Mapping
2. The RBF Network Architecture
3. Computational Power of RBF Networks
4. Training an RBF Network
5. Unsupervised Optimization of the Basis Functions
6. Finding the Output Weights
The Radial Basis Function (RBF) Mapping

We are working within the standard framework of function approximation. We have a set of N data points in a multi-dimensional space, such that every D dimensional input vector x^p = {x_i^p : i = 1, ..., D} has a corresponding K dimensional target output t^p = {t_k^p : k = 1, ..., K}. The target outputs will generally be generated by some underlying functions g_k(x) plus random noise. The goal is to approximate the g_k(x) with functions y_k(x) of the form

    y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj} \, \phi_j(\mathbf{x})

We shall concentrate on the case of Gaussian basis functions

    \phi_j(\mathbf{x}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2}{2\sigma_j^2} \right)

in which we have basis centres {μ_j} and widths {σ_j}. Naturally, the way to proceed is to develop a process for finding the appropriate values for M, {w_kj}, {μ_ij} and {σ_j}.
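As a minimal illustration (not part of the lecture), the following NumPy sketch evaluates the Gaussian basis functions for a batch of input vectors; the function name and the option of per-basis-function widths are my own assumptions.

```python
import numpy as np

def gaussian_basis(X, centres, sigma):
    """Evaluate phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)) for all inputs.

    X       : (N, D) array of input vectors x^p
    centres : (M, D) array of basis centres mu_j
    sigma   : scalar width, or (M,) array of per-basis widths sigma_j
    Returns an (N, M) array of basis function activations.
    """
    # Squared Euclidean distances ||x^p - mu_j||^2 for every pattern/centre pair
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * np.asarray(sigma) ** 2))
```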
The RBF Network Architecture

We can cast the RBF mapping into a form that resembles a neural network:

[Network diagram: D input units x_i feed M basis functions φ_j(x, μ_j, σ_j) via the "weights" μ_ij, which in turn feed K output units y_k via the weights w_kj.]

The hidden to output layer part operates like a standard feed-forward MLP network, with the sum of the weighted hidden unit activations giving the output unit activations. The hidden unit activations are given by the basis functions φ_j(x, μ_j, σ_j), which depend on the "weights" {μ_ij, σ_j} and input activations {x_i} in a non-standard manner.
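Putting the two layers together, a forward pass can be sketched as below. Treating the j = 0 term of the output sum as a constant bias basis function φ_0 = 1, and using a single shared width, are assumptions on my part for this sketch.

```python
import numpy as np

def rbf_forward(X, centres, sigma, W):
    """Forward pass of an RBF network.

    X       : (N, D) input vectors
    centres : (M, D) basis centres mu_j
    sigma   : shared width sigma
    W       : (K, M+1) hidden-to-output weights w_kj (column 0 for the bias)
    Returns the (N, K) array of network outputs y_k(x^p).
    """
    # Hidden layer: Gaussian basis function activations
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq_dist / (2.0 * sigma ** 2))            # (N, M)
    Phi = np.hstack([np.ones((len(X), 1)), Phi])           # prepend bias phi_0 = 1
    # Output layer: plain linear combination of the hidden unit activations
    return Phi @ W.T
```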
Computational Power of RBF Networks

Intuitively, we can easily understand why linear superpositions of localised basis functions are capable of universal approximation. More formally:

Hartman, Keeler & Kowalski (1990, Neural Computation, vol. 2, pp. 210-215) provided a formal proof of this property for networks with Gaussian basis functions in which the widths {σ_j} are treated as adjustable parameters.

Park & Sandberg (1991, Neural Computation, vol. 3, pp. 246-257; and 1993, Neural Computation, vol. 5, pp. 305-316) showed that with only mild restrictions on the basis functions, the universal function approximation property still holds.

As with the corresponding proofs for MLPs, these are existence proofs which rely on the availability of an arbitrarily large number of hidden units (i.e. basis functions). However, they do provide a theoretical foundation on which practical applications can be based with confidence.
Training RBF Networks

The proofs about computational power tell us what an RBF network can do, but nothing about how to find all its parameters/weights {w_kj, μ_ij, σ_j}.

Unlike in MLPs, in RBF networks the hidden and output layers play very different roles, and the corresponding weights have very different meanings and properties. It is therefore appropriate to use different learning algorithms for them.

The input to hidden weights (i.e. the basis function parameters {μ_ij, σ_j}) can be trained (or set) using any of a number of unsupervised learning techniques. Then, after the input to hidden weights are found, they are kept fixed while the hidden to output weights are learned.

Since this second stage of training involves just a single layer of weights {w_kj} and linear output activation functions, the weights can easily be found analytically by solving a set of linear equations. This can be done very quickly, without the need for a set of iterative weight updates as in gradient descent learning.
Basis Function Optimization

One major advantage of RBF networks is the possibility of choosing suitable hidden unit/basis function parameters without having to perform a full non-linear optimization of the whole network. We shall now look at several ways of doing this:

1. Fixed centres selected at random
2. Orthogonal least squares
3. K-Means clustering

These are all unsupervised techniques, which will be particularly useful in situations where labelled data is in short supply, but there is plenty of unlabelled data (i.e. inputs without output targets). Next lecture we shall look at how we might get better results by performing a full supervised non-linear optimization of the network instead.

With either approach, determining a good value for M remains a problem. It will generally be appropriate to compare the results for a range of different values, following the same kind of validation/cross validation methodology used for optimizing MLPs.
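To make the model-selection step concrete, here is a rough sketch (not from the lecture) that trains a simple RBF network for a range of candidate M values, using random centres, a shared width, and least-squares output weights as described on the following pages, and compares validation errors. The helper names, width heuristic, and train/validation split are illustrative assumptions.

```python
import numpy as np

def basis(X, centres, sigma):
    """Gaussian basis activations, with a constant bias column phi_0 = 1."""
    sq = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-sq / (2.0 * sigma ** 2))])

def train_rbf(X, T, M, rng):
    """Fit an RBF network with M randomly chosen centres, a shared width,
    and output weights found by linear least squares."""
    centres = X[rng.choice(len(X), size=M, replace=False)]
    d_max = np.linalg.norm(centres[:, None] - centres[None, :], axis=2).max()
    sigma = d_max / np.sqrt(2.0 * M)
    W_T = np.linalg.lstsq(basis(X, centres, sigma), T, rcond=None)[0]
    return centres, sigma, W_T

def choose_M(X_tr, T_tr, X_val, T_val, candidates, seed=0):
    """Compare validation sum-squared errors across candidate values of M."""
    rng = np.random.default_rng(seed)
    errors = {}
    for M in candidates:
        centres, sigma, W_T = train_rbf(X_tr, T_tr, M, rng)
        Y_val = basis(X_val, centres, sigma) @ W_T
        errors[M] = np.sum((Y_val - T_val) ** 2)
    return min(errors, key=errors.get), errors
```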
Fixed Centres Selected At Random

The simplest and quickest approach to setting the RBF parameters is to have their centres fixed at M points selected at random from the N data points, and to set all their widths to be equal and fixed at an appropriate size for the distribution of data points.

Specifically, we can use normalised RBFs centred at {μ_j} defined by

    \phi_j(\mathbf{x}) = \exp\!\left( -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2}{2\sigma^2} \right)

where {μ_j} ⊂ {x^p} and the widths σ are all related in the same way to the maximum or average distance between the chosen centres μ_j. Common choices are

    \sigma = \frac{d_{\max}}{\sqrt{2M}}   or   \sigma = 2 d_{\mathrm{ave}}

which ensure that the individual RBFs are neither too wide, nor too narrow, for the given training data. For large training sets, this approach gives reasonable results.
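A minimal sketch of this first approach (function name is my own), using the d_max / sqrt(2M) heuristic for the shared width:

```python
import numpy as np

def random_centres_and_width(X, M, seed=0):
    """Pick M basis centres at random from the data points and set a shared
    width via the heuristic sigma = d_max / sqrt(2M), where d_max is the
    maximum distance between the chosen centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=M, replace=False)]
    # Pairwise distances between the chosen centres
    dists = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    sigma = dists.max() / np.sqrt(2.0 * M)
    return centres, sigma
```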
Orthogonal Least Squares

A more principled approach to selecting a sub-set of the data points as the basis function centres is based on the technique of orthogonal least squares. This involves the sequential addition of new basis functions, each centred on one of the data points.

At each stage, we try out each potential Lth basis function by using the N − L other data points to determine the network's output weights. The potential Lth basis function that leaves the smallest residual sum-squared output error is kept, and we then move on to choosing the (L+1)th basis function to add.

This sounds wasteful, but if we construct a set of orthogonal vectors in the space S spanned by the vectors of hidden unit activations for each pattern in the training set, we can calculate directly which data point should be chosen as the next basis function.

To get good generalization, we generally use validation/cross validation to stop the process when an appropriate number of data points have been selected as centres.
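The orthogonalisation is what makes the real algorithm efficient; the following is only a naive sketch of the underlying greedy selection criterion (it refits the output weights by ordinary least squares for every candidate, omits the bias term, and uses a fixed shared width), intended to make the "smallest residual error" rule concrete rather than to implement OLS itself.

```python
import numpy as np

def forward_select_centres(X, T, sigma, n_centres):
    """Greedily pick basis function centres from the data points: at each stage
    every remaining point is tried as the next centre, the output weights are
    refitted by least squares, and the candidate leaving the smallest residual
    sum-squared error is kept."""
    def design(X, centres):
        sq = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))

    selected, remaining = [], list(range(len(X)))
    for _ in range(n_centres):
        best_err, best_idx = None, None
        for idx in remaining:
            Phi = design(X, X[selected + [idx]])
            W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
            err = np.sum((Phi @ W - T) ** 2)
            if best_err is None or err < best_err:
                best_err, best_idx = err, idx
        selected.append(best_idx)
        remaining.remove(best_idx)
    return X[selected]
```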
K-Means Clustering

A potentially even better approach is to use clustering techniques to find a set of centres which more accurately reflects the distribution of the data points. The K-Means Clustering Algorithm picks the number K of centres in advance, and then follows a simple re-estimation procedure to partition the data points {x^n} into K disjoint sub-sets S_j, containing N_j data points each, so as to minimize the sum squared clustering function

    J = \sum_{j=1}^{K} \sum_{n \in S_j} \lVert \mathbf{x}^n - \boldsymbol{\mu}_j \rVert^2

where μ_j is the mean/centroid of the data points in set S_j, given by

    \boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{n \in S_j} \mathbf{x}^n

Once the basis centres have been determined in this way, the widths can then be set according to the variances of the points in the corresponding cluster.
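A basic K-Means sketch (the initialisation from random data points and the fixed iteration cap are my own choices):

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Alternate between assigning each point to its nearest centre and
    recomputing each centre as the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centre for every data point
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centre becomes the centroid of its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(K)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```

The widths could then be set from the within-cluster spread, e.g. sigma_j = X[labels == j].std() for each cluster j, as one simple reading of "set according to the variances of the points in the corresponding cluster".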
Dealing with the Output Layer

Given that the hidden unit activations φ_j(x, μ_j, σ_j) are fixed while we determine the output weights {w_kj}, we essentially only have to find the weights that optimize a single layer linear network. As with MLPs, we can define a sum-squared output error measure

    E = \frac{1}{2} \sum_p \sum_k \left( y_k(\mathbf{x}^p) - t_k^p \right)^2

but here the outputs are a simple linear combination of the hidden unit activations, i.e.

    y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj} \, \phi_j(\mathbf{x})

At the minimum of E the gradients with respect to all the weights w_ki will be zero, so

    \frac{\partial E}{\partial w_{ki}} = \sum_p \left( \sum_{j=0}^{M} w_{kj} \, \phi_j(\mathbf{x}^p) - t_k^p \right) \phi_i(\mathbf{x}^p) = 0

and linear equations like this are well known to be easy to solve analytically.
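A quick numerical check of this condition (purely illustrative, with random activations and targets standing in for the basis function outputs and the target matrix): the least-squares solution makes all of these gradients vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 6))                 # hidden unit activations, one row per pattern
T = rng.normal(size=(50, 3))                   # target outputs t_k^p
W_T = np.linalg.lstsq(Phi, T, rcond=None)[0]   # least-squares output weights (transposed)

grad = Phi.T @ (Phi @ W_T - T)                 # all the dE/dw_ki, arranged as a matrix
print(np.allclose(grad, 0.0))                  # True: the gradients vanish at the minimum
```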
Computing the Output Weights

Our equations for the weights are most conveniently written in matrix form by defining matrices with components (W)_{kj} = w_{kj}, (Φ)_{pj} = φ_j(x^p), and (T)_{pk} = t_k^p. This gives

    \mathbf{\Phi}^{\mathrm{T}} \left( \mathbf{\Phi} \mathbf{W}^{\mathrm{T}} - \mathbf{T} \right) = 0

and the formal solution for the weights is

    \mathbf{W}^{\mathrm{T}} = \mathbf{\Phi}^{\dagger} \mathbf{T}

in which we have the standard pseudo-inverse of Φ

    \mathbf{\Phi}^{\dagger} \equiv \left( \mathbf{\Phi}^{\mathrm{T}} \mathbf{\Phi} \right)^{-1} \mathbf{\Phi}^{\mathrm{T}}

which can be seen to have the property Φ†Φ = I. We see that the network weights can be computed by fast linear matrix inversion techniques. In practice, we tend to use singular value decomposition (SVD) to avoid possible ill-conditioning of Φ, i.e. Φ^TΦ being singular or near singular.
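In NumPy this whole step is a one-liner; the sketch below assumes Phi has already been built (one row per training pattern, one column per basis function, including any bias column). np.linalg.pinv computes the Moore-Penrose pseudo-inverse via SVD, which copes with Φ^TΦ being singular or nearly singular.

```python
import numpy as np

def output_weights(Phi, T):
    """Solve for the hidden-to-output weights given the matrix of basis
    activations Phi (N x (M+1)) and the target matrix T (N x K)."""
    W_T = np.linalg.pinv(Phi) @ T      # W^T = Phi_dagger T, computed via SVD
    return W_T.T                       # W with shape (K, M+1)
```

Equivalently, np.linalg.lstsq(Phi, T, rcond=None)[0] returns W^T directly and is also SVD-based.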
Overview and Reading

1. We began by defining Radial Basis Function (RBF) mappings and the corresponding network architecture.
2. Then we considered the computational power of RBF networks.
3. We then saw how the two layers of network weights were rather different, and different techniques were appropriate for training each of them.
4. We first looked at several unsupervised techniques for carrying out the first stage, namely optimizing the basis functions.
5. We then saw how the second stage, determining the output weights, could be performed by fast linear matrix inversion techniques.

Reading

1. Bishop: Sections 5.2, 5.3, 5.9, 5.10, 3.4
2. Haykin: Sections 5.4, 5.9, 5.10, 5.13