To appear in: Advances in Neural Information Processing Systems 9, eds. M. C. Mozer, M. I. Jordan and T. Petsche. MIT Press, 1997.

Bayesian Model Comparison by Monte Carlo Chaining

David Barber (D.Barber@aston.ac.uk) and Christopher M. Bishop (C.M.Bishop@aston.ac.uk)
Neural Computing Research Group, Aston University, Birmingham, B4 7ET, U.K.
http://www.ncrg.aston.ac.uk/

Abstract

The techniques of Bayesian inference have been applied with great success to many problems in neural computing, including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework.

1 Bayesian Model Comparison

In a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution p(w|M), which can be combined with a likelihood function p(D|w, M) using Bayes' theorem to give a posterior distribution p(w|D, M) in the form

    p(w|D, M) = \frac{p(D|w, M)\, p(w|M)}{p(D|M)}    (1)

where D is the data set. Predictions of the model are obtained by performing integrations weighted by the posterior distribution.
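For a model with a single parameter, both the posterior in (1) and the evidence p(D|M) that normalizes it can be evaluated directly by quadrature. The sketch below is our own illustration of this (it is not code from the paper); the Gaussian prior and noise model, the data values, and all names in it are assumptions made purely for the example.

```python
# Minimal sketch of Bayes' theorem (1) for a one-parameter toy model.
# The evidence p(D|M) is the normalizing integral, computed here on a grid.
import numpy as np

def log_prior(w, alpha=1.0):
    # Gaussian prior p(w|M) with precision alpha
    return -0.5 * alpha * w**2 - 0.5 * np.log(2 * np.pi / alpha)

def log_likelihood(w, data, beta=4.0):
    # Gaussian noise model p(D|w,M): observations scattered around w
    return np.sum(-0.5 * beta * (data - w)**2 - 0.5 * np.log(2 * np.pi / beta))

data = np.array([0.9, 1.1, 1.3])                       # toy data set D
w_grid = np.linspace(-5.0, 5.0, 2001)
log_joint = log_prior(w_grid) + np.array([log_likelihood(w, data) for w in w_grid])

dw = w_grid[1] - w_grid[0]
evidence = np.sum(np.exp(log_joint)) * dw              # p(D|M), the denominator of (1)
posterior = np.exp(log_joint) / evidence               # p(w|D,M) on the grid
print(evidence, w_grid[np.argmax(posterior)])
```

The point of the paper is that this brute-force normalization is no longer available once w is a high-dimensional weight vector, which is what motivates the chaining construction developed below.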

The comparison of different models M_i is based on their relative probabilities, which can be expressed, again using Bayes' theorem, in terms of prior probabilities p(M_i) to give

    \frac{p(M_i|D)}{p(M_j|D)} = \frac{p(D|M_i)\, p(M_i)}{p(D|M_j)\, p(M_j)}    (2)

and so requires that we be able to evaluate the model evidence p(D|M_i), which corresponds to the denominator in (1). The relative probabilities of different models can be used to select the single most probable model, or to form a committee of models, weighted by their probabilities. It is convenient to write the numerator of (1) in the form exp{-E(w)}, where E(w) is an error function. Normalization of the posterior distribution then requires that

    p(D|M) = \int \exp\{-E(w)\}\, dw    (3)

Generally, it is straightforward to evaluate E(w) for a given value of w, although it is extremely difficult to evaluate the corresponding model evidence using (3), since the posterior distribution is typically very small except in narrow regions of the high-dimensional parameter space, which are unknown a priori. Standard numerical integration techniques are therefore inapplicable. One approach is based on a local Gaussian approximation around a mode of the posterior (MacKay, 1992). Unfortunately, this approximation is expected to be accurate only when the number of data points is large in relation to the number of parameters in the model. In fact it is for relatively complex models, or problems for which data is scarce, that Bayesian methods have the most to offer. Indeed, Neal (1996) has argued that, from a Bayesian perspective, there is no reason to limit the number of parameters in a model, other than for computational reasons. We therefore consider an approach to the evaluation of model evidence which overcomes the limitations of the Gaussian framework. For additional techniques and references on Bayesian model comparison, see Gilks et al. (1995) and Kass and Raftery (1995).

2 Chaining

Suppose we have a simple model M_0 for which we can evaluate the evidence analytically, and for which we can easily generate a sample w_l (where l = 1, ..., L) from the corresponding distribution p(w|D, M_0). Then the evidence for some other model M can be expressed in the form

    \frac{p(D|M)}{p(D|M_0)} = \int \exp\{-E(w) + E_0(w)\}\, p(w|D, M_0)\, dw
                            \simeq \frac{1}{L} \sum_{l=1}^{L} \exp\{-E(w_l) + E_0(w_l)\}    (4)

Unfortunately, the Monte Carlo approximation in (4) will be poor if the two error functions are significantly different, since the exponent is dominated by regions where E is relatively small, for which there will be few samples unless E_0 is also small in those regions. A simple Monte Carlo approach will therefore yield poor results.
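The failure mode of the single-step estimator (4) is easy to reproduce. In the toy sketch below (ours, not the paper's code) both "models" are isotropic Gaussians, so the exact evidence ratio is known; the dimensionality, widths and sample size are arbitrary assumptions for illustration.

```python
# Single-step ratio estimator (4) applied to two isotropic Gaussians whose
# exact evidence ratio is sigma**D.  Samples come from the broad reference
# distribution exp{-E0}; the narrow target is exp{-E}.
import numpy as np

rng = np.random.default_rng(0)
D, sigma = 20, 0.1

def E0(w):                                   # reference model M0
    return 0.5 * np.sum(w**2, axis=-1)

def E(w):                                    # target model M, much narrower
    return 0.5 * np.sum((w / sigma)**2, axis=-1)

L = 100_000
w = rng.normal(0.0, 1.0, size=(L, D))        # exact samples from p(w|D, M0)
ratio_mc = np.mean(np.exp(-E(w) + E0(w)))    # estimator (4)
print(ratio_mc, sigma**D)                    # exact ratio is 1e-20
```

Essentially none of the reference samples land where E is small, so the estimate comes out many orders of magnitude below the true value of 10^{-20}, and the variance of the estimator is astronomical. Chaining addresses exactly this mismatch.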

This problem is equivalent to the evaluation of free energies in statistical physics, which is known to be a challenging problem, and for which a number of approaches have been developed (Neal, 1993). Here we discuss one such approach, based on a chain of K successive models M_i which interpolate between M_0 and M, so that the required evidence can be written as

    p(D|M) = p(D|M_0)\, \frac{p(D|M_1)}{p(D|M_0)}\, \frac{p(D|M_2)}{p(D|M_1)} \cdots \frac{p(D|M)}{p(D|M_K)}    (5)

Each of the ratios in (5) can be evaluated using (4). The goal is to devise a chain of models such that each successive pair of models has probability distributions which are reasonably close, so that each of the ratios in (5) can be evaluated accurately, while keeping the total number of links in the chain fairly small to limit the computational costs.

We have chosen the technique of hybrid Monte Carlo (Duane et al., 1987; Neal, 1993) to sample from the various distributions, since this has been shown to be effective for sampling from the complex distributions arising with neural network models (Neal, 1996). This involves introducing Hamiltonian equations of motion in which the parameters w are augmented by a set of fictitious 'momentum' variables, which are then integrated using the leapfrog method. At the end of each trajectory the new parameter vector is accepted with a probability governed by the Metropolis criterion, and the momenta are replaced using Gibbs sampling. As a check on our software implementation of chaining, we have evaluated the evidence for a mixture of two non-isotropic Gaussian distributions, and obtained a result which was within 10% of the analytical solution.
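The hybrid Monte Carlo sampler described above can be sketched compactly. The following is a minimal implementation of our own, not the paper's code: the step size, trajectory length and the toy quadratic energy (standing in for the network error function of the next section) are arbitrary assumptions, and the Gibbs updates of the hyper-parameters are omitted.

```python
# Minimal hybrid Monte Carlo sketch: Gibbs-refresh the momenta, integrate the
# fictitious Hamiltonian dynamics with the leapfrog method, then accept or
# reject the endpoint with the Metropolis criterion.
import numpy as np

rng = np.random.default_rng(0)

def hmc_sample(E, grad_E, w0, n_samples=1000, n_leapfrog=50, step=0.05):
    w = np.array(w0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=w.shape)              # momenta drawn afresh (Gibbs step)
        w_new, p_new = w.copy(), p.copy()
        p_new -= 0.5 * step * grad_E(w_new)       # leapfrog: initial half step
        for _ in range(n_leapfrog - 1):
            w_new += step * p_new
            p_new -= step * grad_E(w_new)
        w_new += step * p_new
        p_new -= 0.5 * step * grad_E(w_new)       # leapfrog: final half step
        # Metropolis criterion on H(w, p) = E(w) + |p|^2 / 2
        dH = (E(w_new) + 0.5 * p_new @ p_new) - (E(w) + 0.5 * p @ p)
        if rng.uniform() < np.exp(min(0.0, -dH)):
            w = w_new
        samples.append(w.copy())
    return np.array(samples)

# Example: sample a 2-D Gaussian "posterior" with error function E(w) = |w|^2 / 2;
# the sample mean should be close to 0 and the per-dimension std close to 1.
samples = hmc_sample(E=lambda w: 0.5 * w @ w, grad_E=lambda w: w, w0=np.zeros(2))
print(samples.mean(axis=0), samples.std(axis=0))
```

In the paper the energy E is the interpolated error function E(λ; w) defined in the next section, evaluated for a multi-layer network, rather than this toy quadratic.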

in wic te mean! and variance! =s are cosen to give very broad yper-priors in reection of our limited prior knowledge of te values of te yper-parameters. We use te ybrid Monte Carlo algoritm to sample from te joint distribution of parameters and yper-parameters. For te evaluation of evidence ratios, owever, we consider only te parameter samples, and perform te integrals over yperparameters analytically, using te fact tat te gamma distribution is conjugate to te Gaussian. In order to apply caining to tis problem, we coose te prior as our reference distribution, and ten dene a set of intermediate distributions based on a parameter wic governs te eective contribution from te data term, so tat E(; w) = (w) + E 0 (w) (9) were (w) arises from te likeliood term (6) wile E 0 (w) corresponds to te prior (7). We select a set of 8 values of wic interpolate between te reference distribution ( = 0) and te desired model distribution ( = ). Te evidence for te prior alone is easily evaluated analytically. 4 Gaussian Approximation As a comparison against te metod of caining, we consider te framework of MacKay (99) based on a local Gaussian approximation to te posterior distribution. Tis approac makes use of te evidence approximation in wic te integration over yper-parameters is approximated by setting tem to specic values wic are temselves determined by maximizing teir evidence functions. Tis leads to a ierarcical treatment as follows. At te lowest level, te maximum bw of te posterior distribution over weigts is found for xed values of te yperparameters by minimizing te error function. Periodically te yper-parameters are re-estimated by evidence maximization, were te evidence is obtained analytically using te Gaussian approximation. Tis gives te following re-estimation formulae := N NX n= ky(x n ; bw) t n k k := k bw T k bw k (0) were k = W k k Tr k P (A ), W k is te total number of parameters in group k, A = rre(bw), = k k, and Tr k () denotes te trace over te kt group of parameters. Te weigts are updated in an inner loop by minimizing te error function using a conjugate gradient optimizer, wile te yper-parameters are periodically re-estimated using (0). Once training is complete, te model evidence is evaluated by making a Gaussian approximation around te converged values of te yper-parameters, and integrating over tis distribution analytically. Tis gives te model log evidence as ln p(djm) = E(bw) ln jaj + N ln + ln! + ln + X k X k W k ln k + ln (= k ) + ln (=(N )) : () Note tat we are assuming tat te yper-priors (8) are suciently broad tat tey ave no eect on te location of te evidence maximum and can terefore be neglected.

5 Results: Robot Arm Problem

As an illustration of the evaluation of model evidence for a larger-scale problem, we consider the modelling of the forward kinematics of a two-link robot arm in a two-dimensional space, as introduced by MacKay (1992). This problem was chosen because MacKay reports good results from using the Gaussian approximation framework to evaluate the evidences, and so it provides a good opportunity for comparison with the chaining approach. The task is to learn the mapping (x_1, x_2) → (y_1, y_2) given by

    y_1 = 2.0\cos(x_1) + 1.3\cos(x_1 + x_2), \qquad y_2 = 2.0\sin(x_1) + 1.3\sin(x_1 + x_2)    (12)

where the data set consists of 200 input-output pairs, with the outputs corrupted by zero-mean Gaussian noise of standard deviation σ = 0.05. We have used the original training data of MacKay, but generated our own test set of 1000 points using the same prescription. The evidence is evaluated using both chaining and the Gaussian approximation, for networks with various numbers of hidden units.

In the chaining method, the parameters ω and s of the Gamma priors (8) for the precision variables were set separately for the input-to-hidden weights and hidden-unit biases, for the hidden-to-output weights, for the output biases, and for the noise level (for which ω = 400). These settings follow closely those used by Neal (1996) for the same problem. The hidden-to-output precision scaling was chosen by Neal such that the limit of an infinite number of hidden units is well defined and corresponds to a Gaussian process prior.
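For reference, data with the structure of (12) can be generated as in the sketch below (ours; the uniform input ranges are an assumption made for illustration, since the experiments reuse MacKay's original training file):

```python
# Robot-arm data as in (12): targets corrupted by zero-mean Gaussian noise
# with standard deviation 0.05.  The input ranges below are assumed.
import numpy as np

def robot_arm(x1, x2):
    y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2)
    y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2)
    return np.stack([y1, y2], axis=-1)

rng = np.random.default_rng(0)
N = 200
x = np.column_stack([rng.uniform(-2.0, 2.0, N),   # assumed range for x1
                     rng.uniform(-2.0, 2.0, N)])  # assumed range for x2
t = robot_arm(x[:, 0], x[:, 1]) + rng.normal(0.0, 0.05, size=(N, 2))
print(x.shape, t.shape)
```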

For each evidence ratio in the chain, the first 100 samples from the hybrid Monte Carlo run, obtained with a trajectory length of 50 leapfrog iterations, are omitted to give the algorithm a chance to reach the equilibrium distribution. The next 600 samples are obtained using a trajectory length of 300 and are used to evaluate the evidence ratio. In Figure 1(a) we show the error values of the sampling stage for 4 hidden units, where we see that the errors are largely uncorrelated, as required for effective Monte Carlo sampling. In Figure 1(b) we plot the values of ln{p(D|M_i)/p(D|M_{i-1})} against λ_i for i = 1, ..., 8. Note that there is a large change in the evidence ratios at the beginning of the chain, where we sample close to the reference distribution. For this reason, we choose the λ_i to be dense close to λ = 0 (a toy sketch at the end of this section illustrates such a schedule). We are currently researching more principled approaches to the selection of this partitioning.

Figure 1: (a) The error E(λ = 0.6; w) for h = 4, plotted for 600 successive Monte Carlo samples. (b) Values of the ratio ln{p(D|M_i)/p(D|M_{i-1})} for i = 1, ..., 8, for h = 4.

Figure 2(a) shows the log model evidence against the number of hidden units. Note that the chaining approach is computationally expensive: for h = 4, a complete chain takes 48 hours in a Matlab implementation running on a Silicon Graphics Challenge L. We see that there is no decline in the evidence as the number of hidden units grows. Correspondingly, in Figure 2(b), we see that the test error performance does not degrade as the number of hidden units increases. This indicates that there is no over-fitting with increasing model complexity, in accordance with Bayesian expectations.

The corresponding results from the Gaussian approximation approach are shown in Figure 3. We see that there is a characteristic 'Occam hill' whereby the evidence shows a peak at a small number of hidden units, with a strong decrease for smaller values of h and a slower decrease for larger values. The corresponding test set errors similarly show a minimum near the same number of hidden units, indicating that the Gaussian approximation is becoming increasingly inaccurate for more complex models.

Figure 2: (a) Plot of ln p(D|M) for different numbers of hidden units h. (b) Test error against the number of hidden units.
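To make the role of the λ schedule concrete, the following toy sketch (our own, not the paper's experiment) chains from a broad Gaussian "prior" to a narrow Gaussian "target" using the telescoping product (5) with the interpolated energy of (9). Every intermediate distribution here is Gaussian and is sampled exactly, where the paper would run hybrid Monte Carlo at each link; the dimensionality, widths, sample counts and the geometric λ schedule (dense near λ = 0) are all arbitrary assumptions.

```python
# Toy chaining: estimate Z_target / Z_prior for isotropic Gaussians in D
# dimensions by multiplying per-link ratios (4) along a lambda schedule that
# is dense near lambda = 0.  This is the same ratio (1e-20) that defeats the
# single-step estimator in the earlier sketch.
import numpy as np

rng = np.random.default_rng(1)
D, sigma = 20, 0.1

def E_data(w):                     # E(lambda; w) = lambda * E_data(w) + E_0(w)
    return 0.5 * (1.0 / sigma**2 - 1.0) * np.sum(w**2, axis=-1)

def width(lam):                    # each intermediate distribution is Gaussian
    return 1.0 / np.sqrt(1.0 + lam * (1.0 / sigma**2 - 1.0))

lambdas = np.concatenate([[0.0], np.geomspace(1e-3, 1.0, 16)])  # dense near 0
L = 5_000
log_ratio = 0.0
for lam_prev, lam_next in zip(lambdas[:-1], lambdas[1:]):
    w = rng.normal(0.0, width(lam_prev), size=(L, D))           # samples at lam_prev
    dE = (lam_next - lam_prev) * E_data(w)                      # E(next) - E(prev)
    log_ratio += np.log(np.mean(np.exp(-dE)))

print(np.exp(log_ratio), sigma**D)   # chained estimate vs exact ratio
```

With the links spaced geometrically the successive distributions overlap well, so each ratio in (5) is estimated reliably; with the same number of equally spaced λ values the early links are still badly mismatched, which is why a schedule dense near zero is used.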

Figure 3: (a) Plot of the model evidence for the robot arm problem versus the number of hidden units, using the Gaussian approximation framework. This clearly shows the characteristic 'Occam hill' shape. Note that the evidence is computed only up to an additive constant, and so the origin of the vertical axis has no significance. (b) Corresponding plot of the test set error versus the number of hidden units. Individual points correspond to particular modes of the posterior weight distribution, while the line shows the mean test set error for each value of h.

6 Discussion

We have seen that the use of chaining allows the effective evaluation of model evidences for neural networks using Monte Carlo techniques. In particular, we find that there is no peak in the model evidence, or in the corresponding test set error, as the number of hidden units is increased, and so there is no indication of over-fitting. This is in accord with the expectation that model complexity should not be limited by the size of the data set, and is in marked contrast to the conventional maximum likelihood viewpoint. It is also consistent with the result that, in the limit of an infinite number of hidden units, the prior over network weights leads to a well-defined Gaussian prior over functions (Williams, 1997).

An important advantage of being able to make accurate evaluations of the model evidence is the ability to compare quite distinct kinds of model, for example radial basis function networks and multi-layer perceptrons. This can be done either by chaining both models back to a common reference model, or by evaluating normalized model evidences explicitly.

Acknowledgements

We would like to thank Chris Williams and Alastair Bruce for a number of useful discussions. This work was supported by EPSRC grant GR/J7545: Novel Developments in Learning Theory for Neural Networks.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195(2), 216-222.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1995). Markov Chain Monte Carlo in Practice. Chapman and Hall.
Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90, 773-795.
MacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation 4(3), 448-472.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, Canada.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer. Lecture Notes in Statistics 118.
Williams, C. K. I. (1997). Computing with infinite networks. In Advances in Neural Information Processing Systems 9 (this volume). MIT Press.