Bayesian Inference in Neural Networks


Slide 1: Bayesian Inference in Neural Networks

Robert L. Paige, with Ronald W. Butler (Colorado State University)
Saturday, January 12, 2002
Biometrika (2001), Vol. 88, Issue 3

Slide 2: Bayesian Inference in Neural Networks

Neural networks are used extensively for classification and pattern recognition:
- Pattern Recognition and Neural Networks, Ripley
- Neural Networks for Pattern Recognition, Bishop

Bayesian neural network. The expectation of $y_{ij}$ is a quasi-linear function of the vector variables $x_i = (1, x_{i1}, \dots, x_{im})^T$ for $i = 1, \dots, n$ and $j = 1, \dots, r$. Let $Y = (y_{ij})$ be an $n \times r$ matrix of responses and $\varepsilon$ an $n \times r$ matrix of additive errors. The model is
$$Y = X_\theta \beta + \sigma \varepsilon,$$
where $\varepsilon_{11}, \varepsilon_{21}, \dots, \varepsilon_{nr}$ are independent, identically distributed $N(0, 1)$, $\sigma > 0$ is a scale parameter, $X_\theta$ is an $n \times p$ design matrix of the form
$$X_\theta = \begin{pmatrix} x_1^T & \psi(x_1^T \theta_1) & \cdots & \psi(x_1^T \theta_q) \\ \vdots & \vdots & & \vdots \\ x_n^T & \psi(x_n^T \theta_1) & \cdots & \psi(x_n^T \theta_q) \end{pmatrix},$$
and $\beta$ is a $p \times r$ matrix of unknown regression parameters with $p = m + 1 + q$. The elements of $X_\theta$ are functions of the nonlinear parameter matrix $\theta = (\theta_1, \theta_2, \dots, \theta_q)^T$, where $\theta_i = (\theta_{i0}, \theta_{i1}, \dots, \theta_{im})^T$ and $\psi(x) = (1 + e^{-x})^{-1}$.

- Both y- and x-component values are translated and scaled to lie in the range (0, 1).
- Such networks can approximate any continuous function uniformly over a compact set to an arbitrary degree of precision (Cybenko, 1989).
- Fitting uses Ripley's S-Plus routine nnet, as described in Venables and Ripley (1997).
- A prior distribution on $(\theta, \beta, \sigma^2)$ avoids overfitting and encourages model parsimony.
- Marginalization is carried out via Laplace's method.
- Posterior inequivalence of the two sigmoid parameterizations (next slide).
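The model is concrete enough to code directly. Below is a minimal Python sketch of how the design matrix $X_\theta$ could be assembled; the names `psi` and `design_matrix`, and the array shapes, are my choices rather than the paper's:

```python
import numpy as np

def psi(z):
    # Logistic sigmoid: psi(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-z))

def design_matrix(x, theta):
    """Build the n x p design matrix X_theta with p = m + 1 + q.

    x     : (n, m) array of inputs, scaled to (0, 1)
    theta : (q, m+1) array; row i is (theta_i0, theta_i1, ..., theta_im)
    """
    n = x.shape[0]
    x_aug = np.column_stack([np.ones(n), x])        # rows (1, x_i1, ..., x_im)
    return np.column_stack([x_aug, psi(x_aug @ theta.T)])
```

With $m = 1$ and $q = 2$, for instance, `design_matrix` returns an $n \times 4$ matrix whose first two columns are $(1, x_i)$ and whose last two columns are the sigmoidal terms.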

Slide 3: Bayesian Inference in Neural Networks

[Figure: fitted regression function $E(y \mid x)$ plotted against $x$ for the two sigmoid parameterizations.]

- The two fits receive equal weight from the data but not from the prior (different posterior weights).
- Translated sigmoid: $\tilde\psi(x) = \psi(x) - 1/2 = \tanh(x/2)/2$.
- Translate the x's and y's to $(-1/2, 1/2)$: subtract $1/2$ from every sigmoidal entry of $X_\theta$.

Choice of prior.
- Express an a priori preference for smooth regression functions (Occam's razor).
- [Figure: plot of $\psi(ax)$ versus $x$ for several values of $a$: for small $a$ the function is flat or smooth; for large $a$ it approaches a step function (saturation).]
- $\psi(\theta_0 + \theta_1 x)$ is smooth if the values of $\theta_0$ and $\theta_1$ fall within the range $(-4, 4)$.
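A quick numerical check of the translated-sigmoid identity and of the saturation effect; this is a standalone sketch, and the grid and the values of $a$ are arbitrary choices of mine:

```python
import numpy as np

z = np.linspace(-8.0, 8.0, 161)
psi_z = 1.0 / (1.0 + np.exp(-z))

# Translated sigmoid: psi(x) - 1/2 equals tanh(x/2)/2 exactly.
assert np.allclose(psi_z - 0.5, np.tanh(z / 2.0) / 2.0)

# Saturation: psi(a x) is nearly flat for small a, nearly a step for large a.
for a in (0.5, 4.0, 40.0):
    vals = 1.0 / (1.0 + np.exp(-a * z))
    print(a, vals[:3].round(3), vals[-3:].round(3))
```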

Slide 4: Bayesian Inference in Neural Networks

- Marginally, $(\beta, \theta) \sim T(0, 4 I_{rp + q(m+1)})$: the support of each component is concentrated in the range $(-6, 6)$.
- $\psi$ and $\tanh$ generate equivalence classes of local maxima (Bishop, 1995); this reduces the number of local maxima encountered by Laplace's method.

Hierarchical specification.
- $(\beta, \theta) \mid \sigma^2 \sim N\{0, (\sigma^2/\lambda) I_{rp + q(m+1)}\}$, with $\lambda$ the midpoint of $(10^{-4}/4, 10^{-2}/4)$.
- The prior for $\sigma^2$ is inverse gamma, $IG(\nu/2, \nu/2)$ with $\nu = 3$:
$$\pi(\sigma^2) \propto (\sigma^2)^{-(\nu/2 + 1)} \exp\{-\nu/(2\sigma^2)\}.$$
- Vague prior knowledge: $E(\sigma^2) = 4$ and the prior variance is 4.
- Choice of $\lambda$: compatible with Ripley's choice for nnet; $\lambda$ is the midpoint of $(10^{-4}, 10^{-2})$.

Weight decay.
- The posterior depends on $(\theta, \beta)$ through
$$\mathrm{SSE}(\theta, \beta) = \mathrm{tr}\{(Y - X_\theta\beta)^T (Y - X_\theta\beta)\} + \lambda\{\mathrm{tr}(\beta^T\beta) + \mathrm{tr}(\theta^T\theta)\}.$$
- Weight decay discourages saturation and yields a smooth, well-behaved marginal posterior.
- It decreases the number of local maxima and false maxima (saddlepoints) (Ripley, 1996): good for quasi-Newton routines and for Laplace's method.

Posterior calculation.
$$\pi(\theta, \beta, \sigma^2 \mid Y, M_q) \propto (\sigma^2)^{-(d_1 + 1)} \exp\left\{-\frac{\mathrm{SSE}(\theta, \beta) + \nu}{2\sigma^2}\right\}, \qquad d_1 = \{rn + rp + q(m+1) + \nu\}/2.$$
Completing the square in $\beta$ and integrating it out yields (next slide):
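As a sketch (the function name is mine; placing $\lambda$ on both the $\beta$ and $\theta$ penalties follows from the $N\{0, (\sigma^2/\lambda)I\}$ prior above), the weight-decay objective is:

```python
import numpy as np

def penalized_sse(Y, X_theta, beta, theta, lam):
    """SSE(theta, beta): residual sum of squares plus the weight-decay
    penalty lam * (tr beta^T beta + tr theta^T theta)."""
    resid = Y - X_theta @ beta
    return (np.trace(resid.T @ resid)
            + lam * (np.sum(beta ** 2) + np.sum(theta ** 2)))
```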

Slide 5: Bayesian Inference in Neural Networks

Completing the square in $\beta$ gives
$$\pi(\theta, \sigma^2 \mid Y, M_q) \propto (\sigma^2)^{-(d_2 + 1)}\, |X_\theta^T X_\theta + \lambda I_p|^{-r/2} \exp\left\{-\frac{E_q(\theta)}{2\sigma^2}\right\}, \qquad d_2 = \{nr + q(m+1) + \nu\}/2,$$
with
$$E_q(\theta) = \nu + \lambda\,\mathrm{tr}(\theta^T\theta) + \mathrm{tr}\{Y^T (I_n - X_\theta (X_\theta^T X_\theta + \lambda I_p)^{-1} X_\theta^T)\, Y\},$$
where $B = (X_\theta^T X_\theta + \lambda I_p)^{-1} X_\theta^T Y$ is the conditional (ridge) estimate of $\beta$.

Marginalisation in $\sigma^2$ yields
$$\pi(\theta \mid Y, M_q) = \frac{\tilde\pi(\theta \mid Y, M_q)}{c_q}, \qquad \tilde\pi(\theta \mid Y, M_q) = |X_\theta^T X_\theta + \lambda I_p|^{-r/2}\, E_q(\theta)^{-d_2},$$
where $\pi$ is the true posterior, $\tilde\pi$ is the unnormalised posterior from exact marginalisation in $\beta$ and $\sigma^2$, $E_q(\theta)^{-d_2}$ is the dominant term, and the normalising constant $c_q = \int \tilde\pi(\theta \mid Y, M_q)\,d\theta$ involves $\Gamma(d_2)$, $\Gamma(\nu/2)$, and powers of $\lambda$ and $\pi$.

Posterior symmetry in $\theta$.
- Permutations and sign changes of the $q$ sigmoidal parameter sets generate equivalence classes of expectation-equivalent parameter values.
- Each class is identified by $\theta_{MI}$, a maximal invariant; the orbit $O(\theta_{MI})$ contains $2^q q!$ equivalent points.

Approximate posterior expectation: Laplace's approximation with multiple modes. The posterior expectation of an arbitrary smooth, positively-valued function $g$ is approximated as follows (next slide).
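The marginalised quantities above can be evaluated directly. A hedged sketch (my naming, reusing `design_matrix` from the earlier block; $\nu = 3$ as in the prior) of the unnormalised log posterior $\ln \tilde\pi(\theta \mid Y, M_q)$:

```python
import numpy as np

def log_post_tilde(theta, Y, x, lam, nu=3.0):
    """log pi~(theta | Y, M_q) = -(r/2) log|X^T X + lam I_p| - d2 log E_q(theta),
    with d2 = (nr + q(m+1) + nu)/2 and
    E_q = nu + lam tr(theta^T theta) + tr{Y^T (Y - X beta_hat)}."""
    X = design_matrix(x, theta)
    n, p = X.shape
    r = Y.shape[1]
    q, m1 = theta.shape                          # m1 = m + 1
    A = X.T @ X + lam * np.eye(p)
    _, logdetA = np.linalg.slogdet(A)
    B = np.linalg.solve(A, X.T @ Y)              # beta_hat(theta), the ridge fit
    E_q = nu + lam * np.sum(theta ** 2) + np.trace(Y.T @ (Y - X @ B))
    d2 = (n * r + q * m1 + nu) / 2.0
    return -(r / 2.0) * logdetA - d2 * np.log(E_q)
```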

Slide 6: Bayesian Inference in Neural Networks

$$\int g(\theta)\,\pi(\theta \mid y)\,d\theta \approx \sum_l g(\hat\theta_l)\,L(\hat\theta_l), \qquad L(\hat\theta) = \frac{(2\pi)^{q(m+1)/2}}{c_q}\, E_q(\hat\theta)^{-d_2}\, |X_{\hat\theta}^T X_{\hat\theta} + \lambda I_p|^{-r/2}\, |H(\hat\theta)|^{-1/2}.$$
The values $\{\hat\theta_l\}$ comprise the set of local maxima, and $H(\hat\theta)$ is the Hessian at $\hat\theta$ of the dominant portion of $-\ln \tilde\pi(\theta \mid y)$, taken to be $d_2 \ln E_q(\theta)$.

When $g$ is invariant under permutations and sign changes of the $q$ sigmoidal parameters,
$$\int g(\theta)\,\pi(\theta \mid y)\,d\theta \approx 2^q q! \sum_l g(\hat\theta_l)\,L(\hat\theta_l),$$
with the sum taken over one representative mode from each equivalence class.

Marginal inference.

Model choice ($g \equiv 1$). The NN model with $q$ sigmoidal terms is denoted $M_q$ and has posterior probability
$$\Pr(M_q \mid Y) \propto \pi(Y \mid M_q)\,\Pr(M_q), \qquad \pi(Y \mid M_q) = \int \pi(Y \mid \theta, M_q)\,\pi(\theta \mid M_q)\,d\theta.$$

Prediction ($g \equiv 1$). The posterior density for the future observable $Y_f$ at $y_f$, given $x_f$, is computed by including $y_f$ and $x_f$ in the data and marginalizing as for model choice:
$$\pi(y_f \mid x_f, Y, M_q) = \frac{\pi(Y, y_f \mid M_q)}{\pi(Y \mid M_q)} = \frac{\int \pi(Y, y_f \mid \theta, M_q)\,\pi(\theta \mid M_q)\,d\theta}{\int \pi(Y \mid \theta, M_q)\,\pi(\theta \mid M_q)\,d\theta},$$
and, mixing over models,
$$\pi(y_f \mid x_f, Y) = \sum_q \pi(y_f \mid x_f, Y, M_q)\,\pi(M_q \mid Y) = \frac{\sum_q \pi(Y, y_f \mid M_q)\,\Pr(M_q)}{\sum_q \pi(Y \mid M_q)\,\Pr(M_q)}.$$

Bayes estimation of the regression function (next slide).
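A hedged sketch of the multi-mode Laplace machinery; the finite-difference Hessian and the function names are mine, and `logpost` stands for any $\ln\tilde\pi$, e.g. `log_post_tilde` above:

```python
import numpy as np

def num_hessian(f, x0, eps=1e-4):
    # Central finite-difference Hessian of a scalar function f at x0.
    k = x0.size
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            ei = np.zeros(k); ei[i] = eps
            ej = np.zeros(k); ej[j] = eps
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4.0 * eps ** 2)
    return H

def laplace_weight(logpost, theta_hat):
    """Contribution of one mode to the integral of pi~:
    (2 pi)^{k/2} pi~(theta_hat) |H(theta_hat)|^{-1/2},
    with H the Hessian of -log pi~ at the mode and k = q(m+1)."""
    k = theta_hat.size
    shape = theta_hat.shape
    f = lambda v: -logpost(v.reshape(shape))
    _, logdetH = np.linalg.slogdet(num_hessian(f, theta_hat.ravel()))
    return np.exp(0.5 * k * np.log(2.0 * np.pi)
                  + logpost(theta_hat) - 0.5 * logdetH)
```

For a symmetry-invariant $g$, sum `laplace_weight` over one representative mode per equivalence class and multiply by `2**q * math.factorial(q)`; dividing by the analogous sum with $g \equiv 1$ gives the normalised posterior expectation.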

Slide 7: Bayesian Inference in Neural Networks

Take $g(\theta) = X_\theta\,\hat\beta(\theta)$ with $\hat\beta(\theta) = (X_\theta^T X_\theta + \lambda I_p)^{-1} X_\theta^T Y$ (conditional least squares). Since
$$\beta \mid \theta, \sigma^2, Y \sim N\{\hat\beta(\theta),\; \sigma^2 (X_\theta^T X_\theta + \lambda I_p)^{-1} \otimes I_r\},$$
the Bayes estimate of the regression function is
$$E(y \mid X, Y, M_q) = \int X_\theta\,\hat\beta(\theta)\,\pi(\theta \mid Y, M_q)\,d\theta = \frac{\int X_\theta (X_\theta^T X_\theta + \lambda I_p)^{-1} X_\theta^T Y\; \tilde\pi(\theta \mid Y, M_q)\,d\theta}{\int \tilde\pi(\theta \mid Y, M_q)\,d\theta}.$$

Marginal distributions.
- Lack of identifiability arises from the interchange of parameter sets; inference about marginal densities requires identifiability.
- $\theta$ is restricted to an identifiable region $O$ and partitioned into $(\theta_1, \theta_2)$, where $\theta_1$ is one-dimensional:
$$\pi(\theta_1 \mid Y, M_q) = \frac{\int_{O(\theta_1)} \tilde\pi(\theta_1, \theta_2 \mid Y, M_q)\,d\theta_2}{\int_O \tilde\pi(\theta \mid Y, M_q)\,d\theta}, \qquad O(\theta_1) = \{\theta_2 : (\theta_1, \theta_2) \in O\}.$$
- The denominator is approximated as for the Bayes factor; for the numerator, $\theta_1$ is held fixed and Laplace's method is applied in $\theta_2$ with $g \equiv 1$.
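A mode-mixture sketch of the Bayes regression estimate, reusing `design_matrix` above; the weights would be the (unnormalised) Laplace contributions $L(\hat\theta_l)$, and the function name is mine:

```python
import numpy as np

def bayes_regression_estimate(x_new, modes, weights, x, Y, lam):
    """E(y | x_new, Y, M_q), approximately: at each mode theta_hat, plug in
    beta_hat(theta_hat) = (X^T X + lam I)^{-1} X^T Y, then average the
    per-mode fits with the normalised Laplace weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    est = 0.0
    for theta_hat, wt in zip(modes, w):
        X = design_matrix(x, theta_hat)
        beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
        est = est + wt * (design_matrix(x_new, theta_hat) @ beta_hat)
    return est
```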

Slide 8: Bayesian Inference in Neural Networks

Example (univariate regression): nitrate utilization (Bates & Watts). Utilization of nitrate, $y$, in bush beans as a function of light intensity, $x$. On two different days, the primary leaves of three 16-day-old bean plants were subjected to eight levels of light intensity (2.2, 5.5, 9.6, 17.5, 27.0, 46.0, 94.0, ...), in $\mu$E/(m$^2$ s), and the nitrate utilizations, in nmol/(g hr), were measured.

[Figure: scatterplot of nitrate utilization $Y$ against light intensity $X$.]
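One plausible way to reproduce the mode search on such data is a multi-start quasi-Newton optimisation of $\ln\tilde\pi$; this sketch reuses `log_post_tilde` above, with start points drawn from the smooth region $(-4, 4)$, and does not filter duplicate or symmetry-equivalent solutions:

```python
import numpy as np
from scipy.optimize import minimize

def find_modes(x, Y, q, lam, n_starts=50, seed=0):
    """Multi-start BFGS search for local maxima of log pi~(theta | Y, M_q)."""
    rng = np.random.default_rng(seed)
    m1 = x.shape[1] + 1
    modes = []
    for _ in range(n_starts):
        theta0 = rng.uniform(-4.0, 4.0, size=q * m1)
        res = minimize(lambda v: -log_post_tilde(v.reshape(q, m1), Y, x, lam),
                       theta0, method="BFGS")
        if res.success:
            modes.append(res.x.reshape(q, m1))
    return modes
```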

Slide 9: Bayesian Inference in Neural Networks

Posterior modes located for each model:

One sigmoid term:
  mode 1: (..., 5.17)
  mode 2: (..., 5.41)
Two sigmoid terms:
  mode 1: (..., 5.17)
  mode 2: (..., 5.41)
  mode 3: (..., ..., 3.13)
Three sigmoid terms:
  mode 1: (..., 5.17)
  mode 2: (..., 5.41)
  mode 3: (..., ..., 3.13)

Slide 10: Bayesian Inference in Neural Networks

Mode                   Laplace's method   Bayespack (smallest PRE)   Importance sampling
One sigmoid term
  mode 1               ...                ...                        ...
  mode 2               ...                ...                        ...
  Total                ...                ...                        ...
Two sigmoid terms
  mode 1               ...                ...                        ...
  mode 2               ...                ...                        ...
  mode 3               ...                ...                        ...
  Total                ...                ...                        ...
Three sigmoid terms
  mode 1               ...                ...                        ...
  mode 2               ...                ...                        ...
  mode 3               ...                ...                        ...
  Total                ...                ...                        ...
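The importance-sampling column can be cross-checked per mode with a Gaussian proposal centred at that mode; this is a sketch under my naming (Bayespack's adaptive quadrature is not reproduced here), with `logpost` again standing for $\ln\tilde\pi$ and `H` the Hessian of $-\ln\tilde\pi$ at the mode, e.g. from `num_hessian` above:

```python
import numpy as np

def importance_check(logpost, theta_hat, H, n_draws=20000, seed=0):
    """Importance-sampling estimate of one mode's contribution to the
    integral of pi~, using a N(theta_hat, H^{-1}) proposal."""
    rng = np.random.default_rng(seed)
    mu = theta_hat.ravel()
    k = mu.size
    draws = rng.multivariate_normal(mu, np.linalg.inv(H), size=n_draws)
    _, logdetH = np.linalg.slogdet(H)
    diff = draws - mu
    log_prop = (-0.5 * k * np.log(2.0 * np.pi) + 0.5 * logdetH
                - 0.5 * np.einsum('ni,ij,nj->n', diff, H, diff))
    log_w = np.array([logpost(d.reshape(theta_hat.shape)) for d in draws])
    return np.exp(log_w - log_prop).mean()   # average importance weight
```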

Slide 11: Bayesian Inference in Neural Networks

[Figure: Bayes estimates of the regression function $E(y \mid X, Y, M_q)$ plotted against $x$:
1. the exact estimate (solid), fitting a single sigmoid;
2. its Laplace approximation (dashed);
3. the Laplace approximation to the estimate from mixing over models $M_1$ to $M_3$ (dot-dashed).]

Slide 12: Bayesian Inference in Neural Networks

Monotonicity results.

Definition (augmented mode). If $\hat\theta$ is a $q \times (m+1)$ mode for model $M_q$, then the $(q+1) \times (m+1)$ matrix $\hat\theta^+ = (\hat\theta^T, 0)^T$, obtained by adjoining a row of zeros, is said to be an augmented mode for $M_{q+1}$.

Lemma 1. For any $q$, an augmented mode for model $M_{q+1}$ is a critical value of $E_{q+1}$. For sufficiently large $n$, it also locates a local maximum and is therefore a mode.

Lemma 2. For any $q$, let $\hat\theta$ be a mode of $M_q$ and $\hat\theta^+$ its augmented mode for $M_{q+1}$. For large enough $n$, $L(\hat\theta^+)$, the contribution to $\pi(Y \mid M_{q+1})$ from $\hat\theta^+$, is related to $L(\hat\theta)$ according to
$$\frac{L(\hat\theta^+)}{L(\hat\theta)} \to 1.$$
Likewise, the relative contribution to $\pi(Y \mid M_{q+1})$ from the orbit of modes generated by $\hat\theta^+$ is
$$\frac{2^{q+1}(q+1)!\,L(\hat\theta^+)}{2^q\,q!\,L(\hat\theta)} \to 2(q+1).$$

Theorem 1. For sufficiently large $n$, the Laplace approximation to the Bayes factor comparing $M_{q+1}$ and $M_q$ has ratio
$$\frac{\pi(Y \mid M_{q+1})}{\pi(Y \mid M_q)} \geq 2(q+1).$$
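The combinatorics behind Lemma 2 are easy to verify directly; a short sketch (function names mine) constructing an augmented mode and checking the orbit ratio $2^{q+1}(q+1)!/(2^q q!) = 2(q+1)$:

```python
import numpy as np
from math import factorial

def augment_mode(theta_hat):
    # Augmented mode for M_{q+1}: adjoin a row of zeros to a q x (m+1) mode.
    return np.vstack([theta_hat, np.zeros((1, theta_hat.shape[1]))])

def orbit_ratio(q):
    # Lemma 2's relative orbit contribution: 2^{q+1}(q+1)! / (2^q q!).
    return (2 ** (q + 1) * factorial(q + 1)) / (2 ** q * factorial(q))

assert all(orbit_ratio(q) == 2 * (q + 1) for q in range(1, 8))
```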
