To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsche. MIT Press, 1997.

Bayesian Model Comparison by Monte Carlo Chaining

David Barber                      D.Barber@aston.ac.uk
Christopher M. Bishop             C.M.Bishop@aston.ac.uk

Neural Computing Research Group
Aston University, Birmingham, B4 7ET, U.K.
http://www.ncrg.aston.ac.uk/

Abstract

The techniques of Bayesian inference have been applied with great success to many problems in neural computing, including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework.

1 Bayesian Model Comparison

In a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution p(w|M), which can be combined with a likelihood function p(D|w, M) using Bayes' theorem to give a posterior distribution p(w|D, M) in the form

    p(w|D, M) = p(D|w, M) p(w|M) / p(D|M)                                   (1)

where D is the data set. Predictions of the model are obtained by performing integrations weighted by the posterior distribution.
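The posterior-weighted integration in the last sentence can be made concrete with a one-parameter toy model, where the integral reduces to a sum over a grid. The model y(x; w) = wx and all numerical values below are illustrative assumptions, not taken from the paper; this is a minimal sketch, not the computation used later in the text.

```python
import numpy as np

# Toy model y(x; w) = w * x with Gaussian prior and Gaussian noise
# (illustrative values only).
xs = np.array([0.0, 0.5, 1.0, 1.5])
ts = np.array([0.1, 0.6, 0.9, 1.6])          # observed targets
beta, prior_var = 25.0, 1.0                  # noise precision, prior variance

w = np.linspace(-4.0, 4.0, 2001)             # grid over the single parameter
dw = w[1] - w[0]
# log posterior = log likelihood + log prior (up to a constant)
log_post = (-0.5 * beta * ((ts[None, :] - w[:, None] * xs[None, :]) ** 2).sum(axis=1)
            - 0.5 * w ** 2 / prior_var)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw                      # normalized posterior p(w|D)

# Prediction at a new input: the integral of y(x; w) weighted by the posterior.
x_new = 2.0
y_pred = np.sum(x_new * w * post) * dw
print(y_pred)
```

In one dimension the grid sum is exact enough for illustration; the point of the paper is precisely that this brute-force strategy is unavailable in the high-dimensional parameter spaces of neural networks.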
The comparison of different models M_i is based on their relative probabilities, which can be expressed, again using Bayes' theorem, in terms of prior probabilities p(M_i) to give

    p(M_i|D) / p(M_j|D) = p(D|M_i) p(M_i) / [p(D|M_j) p(M_j)]              (2)

and so requires that we be able to evaluate the model evidence p(D|M_i), which corresponds to the denominator in (1). The relative probabilities of different models can be used to select the single most probable model, or to form a committee of models, weighted by their probabilities. It is convenient to write the numerator of (1) in the form exp{-E(w)}, where E(w) is an error function. Normalization of the posterior distribution then requires that

    p(D|M) = ∫ exp{-E(w)} dw.                                              (3)

Generally, it is straightforward to evaluate E(w) for a given value of w, although it is extremely difficult to evaluate the corresponding model evidence using (3), since the posterior distribution is typically very small except in narrow regions of the high-dimensional parameter space, which are unknown a priori. Standard numerical integration techniques are therefore inapplicable. One approach is based on a local Gaussian approximation around a mode of the posterior (MacKay, 1992). Unfortunately, this approximation is expected to be accurate only when the number of data points is large in relation to the number of parameters in the model. In fact it is for relatively complex models, or problems for which data is scarce, that Bayesian methods have the most to offer. Indeed, Neal (1996) has argued that, from a Bayesian perspective, there is no reason to limit the number of parameters in a model, other than for computational reasons. We therefore consider an approach to the evaluation of model evidence which overcomes the limitations of the Gaussian framework. For additional techniques and references to Bayesian model comparison, see Gilks et al. (1995) and Kass and Raftery (1995).
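For a model with a single parameter, the integral in (3) can still be evaluated by brute-force quadrature, which gives a concrete check of the analytic evidence and hence of the ratios in (2). The conjugate Gaussian model and prior variance below are illustrative assumptions; this is a minimal sketch of the definitions, not of the chaining method.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(0.5, 1.0, size=20)            # toy data with unit noise variance

def log_evidence_analytic(prior_var):
    # Model M: w ~ N(0, prior_var), t_n ~ N(w, 1).
    # Jointly D ~ N(0, I + prior_var * 11^T), so p(D|M) is analytic.
    n = len(D)
    cov = np.eye(n) + prior_var * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + D @ np.linalg.solve(cov, D))

def log_evidence_quadrature(prior_var, half_width=10.0, n_grid=20001):
    # Direct evaluation of p(D|M) = ∫ exp{-E(w)} dw on a one-dimensional grid,
    # where exp{-E(w)} = p(D|w) p(w).
    w = np.linspace(-half_width, half_width, n_grid)
    dw = w[1] - w[0]
    neg_E = (-0.5 * ((D[None, :] - w[:, None]) ** 2).sum(axis=1)
             - 0.5 * len(D) * np.log(2 * np.pi)
             - 0.5 * w ** 2 / prior_var - 0.5 * np.log(2 * np.pi * prior_var))
    m = neg_E.max()
    return m + np.log(np.exp(neg_E - m).sum() * dw)

# The two agree; differences of such log evidences across models give the
# Bayes factor appearing in (2).
print(log_evidence_analytic(4.0), log_evidence_quadrature(4.0))
```

The quadrature route is exactly what becomes infeasible once w has more than a handful of dimensions, which is what motivates the Monte Carlo treatment that follows.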
2 Chaining

Suppose we have a simple model M_0 for which we can evaluate the evidence analytically, and for which we can easily generate a sample {w_l} (where l = 1, ..., L) from the corresponding distribution p(w|D, M_0). Then the evidence for some other model M can be expressed in the form

    p(D|M) / p(D|M_0) = ∫ exp{-E(w) + E_0(w)} p(w|D, M_0) dw
                      ≃ (1/L) Σ_{l=1}^{L} exp{-E(w_l) + E_0(w_l)}.         (4)

Unfortunately, the Monte Carlo approximation in (4) will be poor if the two error functions are significantly different, since the exponent is dominated by regions where E is relatively small, for which there will be few samples unless E_0 is also small in those regions. A simple Monte Carlo approach will therefore yield poor results. This problem is equivalent to the evaluation of free energies in statistical physics,
which is known to be a challenging problem, and where a number of approaches have been developed (Neal, 1993). Here we discuss one such approach to this problem based on a chain of K successive models M_i which interpolate between M_0 and M, so that the required evidence can be written as

    p(D|M) = p(D|M_0) [p(D|M_1)/p(D|M_0)] [p(D|M_2)/p(D|M_1)] ... [p(D|M)/p(D|M_K)].   (5)

Each of the ratios in (5) can be evaluated using (4). The goal is to devise a chain of models such that each successive pair of models has probability distributions which are reasonably close, so that each of the ratios in (5) can be evaluated accurately, while keeping the total number of links in the chain fairly small to limit the computational costs.

We have chosen the technique of hybrid Monte Carlo (Duane et al., 1987; Neal, 1993) to sample from the various distributions, since this has been shown to be effective for sampling from the complex distributions arising with neural network models (Neal, 1996). This involves introducing Hamiltonian equations of motion in which the parameters w are augmented by a set of fictitious `momentum' variables, which are then integrated using the leapfrog method. At the end of each trajectory the new parameter vector is accepted with a probability governed by the Metropolis criterion, and the momenta are replaced using Gibbs sampling.

As a check on our software implementation of chaining, we have evaluated the evidence for a mixture of two non-isotropic Gaussian distributions, and obtained a result which was within 10% of the analytical solution.

3 Application to Neural Networks

We now consider the application of the chaining method to regression problems involving neural network models. The network corresponds to a function y(x; w), and the data set consists of N pairs of input vectors x_n and corresponding targets t_n, where n = 1, ..., N. Assuming Gaussian noise on the target data, the likelihood function takes the form

    p(D|w, M) = (β/2π)^{N/2} exp{ -(β/2) Σ_{n=1}^{N} ||y(x_n; w) - t_n||² }   (6)

where β is a hyper-parameter representing the inverse of the noise variance.
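The hybrid Monte Carlo update described above — Gibbs replacement of the momenta, leapfrog integration of the Hamiltonian dynamics, and a Metropolis accept/reject step — can be sketched in Python on a toy Gaussian posterior. This is an illustration under assumed settings (step size, trajectory length, target), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: E(w) = ||w||^2 / 2, i.e. a standard Gaussian "posterior".
def energy(w):      return 0.5 * np.dot(w, w)
def grad_energy(w): return w

def hmc_step(w, eps=0.2, n_leapfrog=10):
    """One hybrid Monte Carlo update: resample momenta (Gibbs), integrate the
    Hamiltonian dynamics with the leapfrog method, then accept or reject the
    endpoint with the Metropolis criterion."""
    p = rng.standard_normal(w.shape)            # fresh momenta
    h_old = energy(w) + 0.5 * np.dot(p, p)      # Hamiltonian at the start
    w_new, p_new = w.copy(), p.copy()
    p_new -= 0.5 * eps * grad_energy(w_new)     # initial half momentum step
    for _ in range(n_leapfrog - 1):
        w_new += eps * p_new
        p_new -= eps * grad_energy(w_new)
    w_new += eps * p_new
    p_new -= 0.5 * eps * grad_energy(w_new)     # final half momentum step
    h_new = energy(w_new) + 0.5 * np.dot(p_new, p_new)
    if rng.random() < np.exp(min(0.0, h_old - h_new)):
        return w_new, 1
    return w, 0

w = np.zeros(2)
samples, accepts = [], 0
for _ in range(2000):
    w, accepted = hmc_step(w)
    accepts += accepted
    samples.append(w)
samples = np.array(samples)
print(accepts / 2000, samples.mean(axis=0))
```

Because the leapfrog integrator nearly conserves the Hamiltonian, long trajectories are accepted with high probability while moving far through parameter space, which is why the method suits the strongly correlated posteriors of neural networks.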
We consider networks with a single hidden layer of `tanh' units, and linear output units. Following Neal (1996), we use a diagonal Gaussian prior in which the weights are divided into groups w_k, where k = 1, ..., 4, corresponding to input-to-hidden weights, hidden-unit biases, hidden-to-output weights, and output biases. Each group is governed by a separate `precision' hyper-parameter α_k, so that the prior takes the form

    p(w|{α_k}) = (1/Z_W) exp{ -(1/2) Σ_k α_k w_k^T w_k }                    (7)

where Z_W is the normalization coefficient. The hyper-parameters {α_k} and β are themselves each governed by hyper-priors given by Gamma distributions of the form

    p(α) ∝ α^{s-1} exp(-sα/ω)                                              (8)
in which the mean ω and variance ω²/s are chosen to give very broad hyper-priors, in reflection of our limited prior knowledge of the values of the hyper-parameters. We use the hybrid Monte Carlo algorithm to sample from the joint distribution of parameters and hyper-parameters. For the evaluation of evidence ratios, however, we consider only the parameter samples, and perform the integrals over hyper-parameters analytically, using the fact that the gamma distribution is conjugate to the Gaussian.

In order to apply chaining to this problem, we choose the prior as our reference distribution, and then define a set of intermediate distributions based on a parameter λ which governs the effective contribution from the data term, so that

    E(λ; w) = λ E_D(w) + E_0(w)                                            (9)

where E_D(w) arises from the likelihood term (6) while E_0(w) corresponds to the prior (7). We select a set of 18 values of λ which interpolate between the reference distribution (λ = 0) and the desired model distribution (λ = 1). The evidence for the prior alone is easily evaluated analytically.

4 Gaussian Approximation

As a comparison against the method of chaining, we consider the framework of MacKay (1992) based on a local Gaussian approximation to the posterior distribution. This approach makes use of the evidence approximation, in which the integration over hyper-parameters is approximated by setting them to specific values which are themselves determined by maximizing their evidence functions. This leads to a hierarchical treatment as follows. At the lowest level, the maximum ŵ of the posterior distribution over weights is found for fixed values of the hyper-parameters by minimizing the error function. Periodically the hyper-parameters are re-estimated by evidence maximization, where the evidence is obtained analytically using the Gaussian approximation.
This gives the following re-estimation formulae

    β := (N - γ) / Σ_{n=1}^{N} ||y(x_n; ŵ) - t_n||²,    α_k := γ_k / (ŵ_k^T ŵ_k)   (10)

where γ_k = W_k - α_k Tr_k(A^{-1}), W_k is the total number of parameters in group k, A = ∇∇E(ŵ), γ = Σ_k γ_k, and Tr_k(·) denotes the trace over the kth group of parameters. The weights are updated in an inner loop by minimizing the error function using a conjugate gradient optimizer, while the hyper-parameters are periodically re-estimated using (10).

Once training is complete, the model evidence is evaluated by making a Gaussian approximation around the converged values of the hyper-parameters, and integrating over this distribution analytically. This gives the model log evidence as

    ln p(D|M) = -E(ŵ) - (1/2) ln|A| + (N/2) ln β + ln H! + H ln 2
                + (1/2) Σ_k W_k ln α_k + (1/2) Σ_k ln(2/γ_k) + (1/2) ln(2/(N - γ)).   (11)

Note that we are assuming that the hyper-priors (8) are sufficiently broad that they have no effect on the location of the evidence maximum and can therefore be neglected.
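The hierarchy above can be sketched on a linear-in-the-parameters model, for which the posterior is exactly Gaussian, A is the exact Hessian, and the re-estimation formulae (10) take a particularly simple form with a single weight group. The data and all numerical settings below are illustrative assumptions, and the log-evidence line follows (11) but omits the network-specific symmetry terms ln H! + H ln 2 and the γ-width terms.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic linear regression t = Phi w + noise (illustrative values).
N, W = 200, 6
Phi = rng.standard_normal((N, W))
w_true = rng.standard_normal(W)
beta_true = 25.0                                   # true noise precision
t = Phi @ w_true + rng.standard_normal(N) / np.sqrt(beta_true)

alpha, beta = 1.0, 1.0                             # initial hyper-parameters
for _ in range(50):
    A = alpha * np.eye(W) + beta * Phi.T @ Phi     # Hessian A of the error
    w_hat = beta * np.linalg.solve(A, Phi.T @ t)   # posterior mode, fixed alpha, beta
    gamma = W - alpha * np.trace(np.linalg.inv(A)) # well-determined parameters
    alpha = gamma / (w_hat @ w_hat)                # cf. (10), single weight group
    beta = (N - gamma) / np.sum((Phi @ w_hat - t) ** 2)

# Log evidence at the converged hyper-parameters, cf. (11) without the
# symmetry and hyper-parameter-width terms that are specific to networks.
A = alpha * np.eye(W) + beta * Phi.T @ Phi
w_hat = beta * np.linalg.solve(A, Phi.T @ t)
E_hat = 0.5 * beta * np.sum((Phi @ w_hat - t) ** 2) + 0.5 * alpha * (w_hat @ w_hat)
log_ev = (-E_hat - 0.5 * np.linalg.slogdet(A)[1] + 0.5 * N * np.log(beta)
          + 0.5 * W * np.log(alpha) - 0.5 * N * np.log(2 * np.pi))
print(beta, gamma, log_ev)
```

In this linear setting the fixed point of the updates recovers the generating noise precision to within sampling error, which is the behaviour the evidence framework relies on; for nonlinear networks the same formulae hold only to the extent that the local Gaussian approximation is accurate.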
Here H is the number of hidden units, and the terms ln H! + H ln 2 take account of the many equivalent modes of the posterior distribution arising from sign-flip and hidden unit interchange symmetries in the network model. A derivation of these results can be found in Bishop (1995; pages 434-436).

The result (11) corresponds to a single mode of the distribution. If we initialize the weight optimization algorithm with different random values we can find distinct solutions. In order to compute an overall evidence for the particular network model with a given number of hidden units, we make the assumption that we have found all of the distinct modes of the posterior distribution precisely once each, and then sum the evidences to arrive at the total model evidence. This neglects the possibility that some of the solutions found are related by symmetry transformations (and therefore already taken into account), or that we have missed important modes. While some attempt could be made to detect degenerate solutions, it will be difficult to do much better than the above within the framework of the Gaussian approximation.

5 Results: Robot Arm Problem

As an illustration of the evaluation of model evidence for a larger-scale problem, we consider the modelling of the forward kinematics for a two-link robot arm in a two-dimensional space, as introduced by MacKay (1992). This problem was chosen as MacKay reports good results in using the Gaussian approximation framework to evaluate the evidences, and it provides a good opportunity for comparison with the chaining approach. The task is to learn the mapping (x_1, x_2) → (y_1, y_2) given by

    y_1 = 2.0 cos(x_1) + 1.3 cos(x_1 + x_2)
    y_2 = 2.0 sin(x_1) + 1.3 sin(x_1 + x_2)                                (12)

where the data set consists of 200 input-output pairs with outputs corrupted by zero mean Gaussian noise with standard deviation σ = 0.05. We have used the original training data of MacKay, but generated our own test set of 1000 points using the same prescription.
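The chained evidence evaluation applied in these experiments can be illustrated on a one-parameter conjugate analogue, in which each intermediate distribution defined by (9) is Gaussian and can be sampled exactly (the experiments themselves use hybrid Monte Carlo at each link). All numerical settings below, including the number of λ values and samples per link, are illustrative assumptions; chaining the ratios (4) along the λ path recovers the analytic log evidence.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy analogue: scalar w, prior N(0, v0), likelihood t_n ~ N(w, 1).
v0, n = 4.0, 10
t = rng.normal(0.5, 1.0, size=n)

def log_lik(w):
    # log p(D|w); in the notation of (9), E_D(w) = -log_lik(w).
    return -0.5 * ((t[None, :] - w[:, None]) ** 2).sum(axis=1) - 0.5 * n * np.log(2 * np.pi)

def sample_intermediate(lam, size):
    # pi_lambda(w) ∝ exp(lam * log_lik(w)) * prior(w) is Gaussian here,
    # so each link of the chain can be sampled exactly.
    prec = 1.0 / v0 + lam * n
    mean = lam * t.sum() / prec
    return rng.normal(mean, 1.0 / np.sqrt(prec), size=size)

# Chain of lambda values, dense close to 0 as in the text.
lams = np.linspace(0.0, 1.0, 19) ** 2
log_ev_chain = 0.0
for lam, lam_next in zip(lams[:-1], lams[1:]):
    w = sample_intermediate(lam, 4000)
    r = np.exp((lam_next - lam) * log_lik(w))   # ratio of normalizers, cf. (4)
    log_ev_chain += np.log(r.mean())

# Analytic evidence for comparison: jointly D ~ N(0, I + v0 * 11^T).
cov = np.eye(n) + v0 * np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
log_ev_true = -0.5 * (n * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(cov, t))
print(log_ev_chain, log_ev_true)
```

Because successive λ values keep neighbouring distributions close, every ratio in the product has low variance; applying (4) in a single step from prior to posterior concentrates all of the difference in one exponent and is far less reliable, which is the motivation for the chain.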
The evidence is evaluated using both chaining and the Gaussian approximation, for networks with various numbers of hidden units H. In the chaining method, the particular forms of the gamma priors for the precision variables are as follows: for the input-to-hidden weights and hidden-unit biases, ω = 1, s = 0.2; for the hidden-to-output weights, ω = H, s = 0.2; for the output biases, ω = 0.1, s = 1. The noise level hyper-parameters were ω = 400, s = 0.2. These settings follow closely those used by Neal (1996) for the same problem. The hidden-to-output precision scaling was chosen by Neal such that the limit of an infinite number of hidden units is well defined and corresponds to a Gaussian process prior.

For each evidence ratio in the chain, the first 100 samples from the hybrid Monte Carlo run, obtained with a trajectory length of 50 leapfrog iterations, are omitted to give the algorithm a chance to reach the equilibrium distribution. The next 600 samples are obtained using a trajectory length of 300 and are used to evaluate the evidence ratio. In Figure 1(a) we show the error values of the sampling stage for H = 4 hidden units, where we see that the errors are largely uncorrelated, as required for effective Monte Carlo sampling. In Figure 1(b) we plot the values of ln{p(D|M_i)/p(D|M_{i-1})} against λ_i for i = 1, ..., 18. Note that there is a large change in the evidence ratios at the beginning of the chain, where we sample close to the reference distribution. For this
Figure 1: (a) The error E(λ = 0.6; w) for H = 4, plotted for 600 successive Monte Carlo samples. (b) Values of the ratio ln{p(D|M_i)/p(D|M_{i-1})} for i = 1, ..., 18, for H = 4.

reason, we choose the λ_i to be dense close to λ = 0. We are currently researching more principled approaches to the selection of the partitioning.

Figure 2(a) shows the log model evidence against the number of hidden units. Note that the chaining approach is computationally expensive: for H = 4, a complete chain takes 48 hours in a Matlab implementation running on a Silicon Graphics Challenge L. We see that there is no decline in the evidence as the number of hidden units grows. Correspondingly, in Figure 2(b), we see that the test error performance does not degrade as the number of hidden units increases. This indicates that there is no over-fitting with increasing model complexity, in accordance with Bayesian expectations.

The corresponding results from the Gaussian approximation approach are shown in Figure 3. We see that there is a characteristic `Occam hill' whereby the evidence shows a peak at around H = 12, with a strong decrease for smaller values of H and a slower decrease for larger values. The corresponding test set errors similarly show a minimum at around H = 12, indicating that the Gaussian approximation is becoming increasingly inaccurate for more complex models.

6 Discussion

We have seen that the use of chaining allows the effective evaluation of model evidences for neural networks using Monte Carlo techniques. In particular, we find that there is no peak in the model evidence, or the corresponding test set error, as the number of hidden units is increased, and so there is no indication of over-fitting. This is in accord with the expectation that model complexity should not be limited by the size of the data set, and is in marked contrast to the conventional

Figure 2: (a) Plot of ln p(D|M) for different numbers of hidden units. (b) Test error against the number of hidden units.
Here the theoretical minimum value is 1.0.
Figure 3: (a) Plot of the model evidence for the robot arm problem versus the number of hidden units, using the Gaussian approximation framework. This clearly shows the characteristic `Occam hill' shape. Note that the evidence is computed up to an additive constant, and so the origin of the vertical axis has no significance. (b) Corresponding plot of the test set error versus the number of hidden units. Individual points correspond to particular modes of the posterior weight distribution, while the line shows the mean test set error for each value of H.

maximum likelihood viewpoint. It is also consistent with the result that, in the limit of an infinite number of hidden units, the prior over network weights leads to a well-defined Gaussian prior over functions (Williams, 1997).

An important advantage of being able to make accurate evaluations of the model evidence is the ability to compare quite distinct kinds of model, for example radial basis function networks and multi-layer perceptrons. This can be done either by chaining both models back to a common reference model, or by evaluating normalized model evidences explicitly.

Acknowledgements

We would like to thank Chris Williams and Alastair Bruce for a number of useful discussions. This work was supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195(2), 216-222.

Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1995). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Kass, R. E. and A. E. Raftery (1995). Bayes factors. J. Am. Statist. Ass. 90, 773-795.

MacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation 4(3), 448-472.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods.
Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, Canada.

Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer. Lecture Notes in Statistics 118.

Williams, C. K. I. (1997). Computing with infinite networks. In Advances in Neural Information Processing Systems 9 (this volume).