Bayesian Inference of Noise Levels in Regression


Christopher M. Bishop
Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K.
cmbishop@microsoft.com
http://research.microsoft.com/~cmbishop

Cazhaow S. Qazaz
Neural Computing Research Group, Aston University, Birmingham, B4 7ET, U.K.

In Proceedings 1996 International Conference on Artificial Neural Networks, ICANN'96, C. von der Malsburg et al. (eds), Springer (1997) 59-64.

Abstract

In most treatments of the regression problem it is assumed that the distribution of target data can be described by a deterministic function of the inputs, together with additive Gaussian noise having constant variance. The use of maximum likelihood to train such models then corresponds to the minimization of a sum-of-squares error function. In many applications a more realistic model would allow the noise variance itself to depend on the input variables. However, the use of maximum likelihood for training such models would give highly biased results. In this paper we show how a Bayesian treatment can allow for an input-dependent variance while overcoming the bias of maximum likelihood.

1 Introduction

In regression problems it is important not only to predict the output variables but also to have some estimate of the error bars associated with those predictions. An important contribution to the error bars arises from the intrinsic noise on the data. In most conventional treatments of regression, it is assumed that the noise can be modelled by a Gaussian distribution with a constant variance. However, in many applications it will be more realistic to allow the noise variance itself to depend on the input variables. A standard maximum likelihood approach would, however, lead to a systematic underestimate of the noise variance. Here we adopt an approximate hierarchical Bayesian treatment (MacKay, 1995) to find the most probable interpolant and the most probable input-dependent noise variance. We compare our results with maximum likelihood and show how this approximate Bayesian treatment leads to a significantly reduced bias.

In order to gain some insight into the limitations of the maximum likelihood approach, and to see how these limitations can be overcome in a Bayesian treatment, it is useful to consider first a much simpler problem involving a single random variable (Bishop, 1995). Suppose that a variable z is known to have a Gaussian distribution, but with unknown mean and variance. Given a sample D = {z_n}, n = 1, ..., N, drawn from that distribution, our goal is to infer values for the mean µ and the variance σ². The likelihood function is given by

    p(D | µ, σ²) = 1/(2πσ²)^{N/2} exp{ −(1/(2σ²)) Σ_{n=1}^{N} (z_n − µ)² }.    (1)
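For concreteness, the likelihood (1) is straightforward to evaluate numerically. The following minimal NumPy sketch (illustrative only; the sample values, grid ranges and function names are assumptions, not taken from the paper) computes log p(D | µ, σ²) over a grid of candidate means and variances and locates the joint maximum:

    import numpy as np

    def log_likelihood(z, mu, sigma2):
        """Log of the likelihood (1) for a one-dimensional sample z."""
        N = len(z)
        return (-0.5 * N * np.log(2.0 * np.pi * sigma2)
                - 0.5 * np.sum((z - mu) ** 2) / sigma2)

    # A small sample from a zero-mean, unit-variance Gaussian (assumed values).
    rng = np.random.default_rng(0)
    z = rng.normal(loc=0.0, scale=1.0, size=4)

    # Evaluate the log-likelihood over a grid of candidate means and variances.
    means = np.linspace(-2.0, 2.0, 81)
    variances = np.linspace(0.1, 3.0, 81)
    grid = np.array([[log_likelihood(z, m, v) for m in means] for v in variances])

    # The grid maximum agrees with the closed-form joint estimates discussed below.
    i, j = np.unravel_index(np.argmax(grid), grid.shape)
    print("joint maximum near mu = %.2f, sigma^2 = %.2f" % (means[j], variances[i]))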

A non-bayesian approach to finding the mean and variance is to maimize the likelihood jointly over µ and σ, corresponding to the intuitive idea of finding the parameter values which are most likely to have given rise to the observed data set. This yields the standard result µ = N z n, σ = N (z n µ). () It is well known that the estimate σ for the variance given in () is biased since the epectation of this estimate is not equal to the true value E[ σ ] = N N σ () where σ is the true variance of the distribution which generated the data, and E[ ] denotes an average over data sets of size N. For large N this effect is small. However, in the case of regression problems there are generally much larger number of degrees of freedom in relation to the number of available data points, in which case the effect of this bias can be very substantial. By adopting a Bayesian viewpoint this bias can be removed. The marginal likelihood of σ should be computed by integrating over the mean µ. Assuming a flat prior p(µ) we obtain p(d σ ) = p(d σ, µ)p(µ) dµ (4) ep σn σ (z n µ). (5) Maimizing (5) with respect to σ then gives σ = N (z n µ) (6) which is unbiased. This result is illustrated in Figure which shows contours of p(d µ, σ ) together with the marginal likelihood p(d σ ) and the conditional likelihood p(d µ, σ ) evaluated at µ = µ..5.5 variance.5 variance.5.5.5 mean 4 6 likelihood Figure : The left hand plot shows contours of the likelihood function p(d µ, σ ) given by () for 4 data points drawn from a Gaussian distribution having zero mean and unit variance. The right hand plot shows the marginal likelihood function p(d σ ) (dashed curve) and the conditional likelihood function p(d µ, σ ) (solid curve). It can be seen that the skewed contours result in a value of σ which is smaller than σ.

2 Bayesian Regression

Consider a regression problem involving the prediction of a noisy variable t given the value of a vector x of input variables.[1] Our goal is to predict both a regression function and an input-dependent noise variance. We shall therefore consider two networks. The first network takes the input vector x and generates an output y(x; w) which represents the regression function, and is governed by a vector of weight parameters w. The second network also takes the input vector x, and generates an output function β(x; u) representing the inverse variance of the noise distribution, and is governed by a vector of weight parameters u. The conditional distribution of target data, given the input vector, is then modelled by a normal distribution p(t | x, w, u) = N(t | y, β^{−1}). From this we obtain the likelihood function

    p(D | w, u) = (1/Z_D) exp( −Σ_n β_n E_n )    (7)

where β_n = β(x_n; u),

    Z_D = Π_{n=1}^{N} (2π/β_n)^{1/2},        E_n = (1/2) (y(x_n; w) − t_n)²    (8)

and D ≡ {x_n, t_n} is the data set. Some simplification of the subsequent analysis is obtained by taking the regression function, and ln β, to be given by linear combinations of fixed basis functions, as in MacKay (1995), so that

    y(x; w) = w^T φ(x),        β(x; u) = exp( u^T ψ(x) )    (9)

and we assume that one basis function in each network is a constant, φ_0 = ψ_0 = 1, so that w_0 and u_0 correspond to bias parameters.

The maximum likelihood procedure chooses values ŵ and û by finding a joint maximum over w and u. As we have already indicated, this will give a biased result since the regression function inevitably fits part of the noise on the data, leading to an over-estimate of β(x). In extreme cases, where the regression curve passes exactly through a data point, the corresponding estimate of β can go to infinity, corresponding to an estimated noise variance of zero.

The solution to this problem has already been indicated in Section 1 and was first suggested in this context by MacKay (1991, Chapter 6). In order to obtain an unbiased estimate of β(x) we must find the marginal distribution of β, or equivalently of u, in which we have integrated out the dependence on w. This leads to a hierarchical Bayesian analysis.

We begin by defining priors over the parameters w and u. Here we consider spherically symmetric Gaussian priors of the form

    p(w | α_w) = (α_w/(2π))^{W/2} exp( −(α_w/2) ‖w‖² )    (10)

    p(u | α_u) = (α_u/(2π))^{U/2} exp( −(α_u/2) ‖u‖² )    (11)

where α_w and α_u are hyperparameters, and W and U denote the dimensionalities of w and u. At the first stage of the hierarchy, we assume that u is fixed to its most probable value u_MP, which will be determined shortly.[2] The value of w_MP is then found by maximizing the posterior distribution

    p(w | D, u_MP, α_w) = p(D | w, u_MP) p(w | α_w) / p(D | u_MP, α_w).    (12)

[1] For simplicity we consider a single output variable. The extension of this work to multiple outputs is straightforward.
[2] Note that the result will be dependent on the choice of parametrization, since the maximum of a distribution is not invariant under a change of variable.
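To make the parametrization (9) concrete before describing how the most probable parameters are found, the following sketch builds the two basis-function expansions from Gaussian basis functions plus a constant bias term (the input range, number of centres and width choice below are illustrative assumptions):

    import numpy as np

    def gaussian_design(x, centres, width):
        """Design matrix whose first column is the constant (bias) basis function."""
        rbf = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / width) ** 2)
        return np.hstack([np.ones((len(x), 1)), rbf])

    # Assumed one-dimensional inputs and basis-function layout.
    x = np.linspace(-4.0, 4.0, 50)
    centres = np.linspace(-4.0, 4.0, 4)
    width = centres[1] - centres[0]            # width equal to the centre spacing

    Phi = gaussian_design(x, centres, width)   # basis functions φ(x) for y(x; w)
    Psi = gaussian_design(x, centres, width)   # basis functions ψ(x) for ln β(x; u)

    # With weight vectors w and u, the two networks of (9) are simply:
    w = np.zeros(Phi.shape[1])
    u = np.zeros(Psi.shape[1])
    y = Phi @ w                                # regression function y(x; w) = w^T φ(x)
    beta = np.exp(Psi @ u)                     # inverse noise variance β(x; u) = exp(u^T ψ(x))

Here the two networks happen to share the same basis set; in general φ and ψ may differ.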

The denominator in (12) is given by

    p(D | u_MP, α_w) = ∫ p(D | w, u_MP) p(w | α_w) dw.    (13)

Taking the negative log of (12), and dropping constant terms, we see that w_MP is obtained by minimizing

    S(w) = Σ_n β_n E_n + (α_w/2) ‖w‖²    (14)

where we have used (7) and (10). For the particular choice of model (9) this minimization represents a linear problem which is easily solved (for a given u) by standard matrix techniques. At the next level of the hierarchy, we find u_MP by maximizing the marginal posterior distribution

    p(u | D, α_u, α_w) = p(D | u, α_w) p(u | α_u) / p(D | α_w, α_u).    (15)

The term p(D | u, α_w) is just the denominator from (12) and is found by integrating over w as in (13). For the model (9) and prior (10) this integral is Gaussian and can be performed analytically without approximation. Again taking logarithms and discarding constants, we have to minimize

    M(u) = Σ_n β_n E_n + (α_u/2) ‖u‖² − (1/2) Σ_n ln β_n + (1/2) ln|A|    (16)

where E_n is evaluated using w = w_MP, and |A| denotes the determinant of the Hessian matrix A given by

    A = Σ_n β_n φ(x_n) φ(x_n)^T + α_w I    (17)

and I is the unit matrix. The function M(u) in (16) can be minimized using standard non-linear optimization algorithms. We use scaled conjugate gradients, in which the necessary derivatives of ln|A| are easily found in terms of the eigenvalues of A.

In summary, the algorithm requires an outer loop in which the most probable value u_MP is found by non-linear minimization of (16), using the scaled conjugate gradient algorithm. Each time the optimization code requires a value of M(u) or its gradient, for a new value of u, the optimum value w_MP must be found by minimizing (14). In effect, w evolves on a fast time-scale and u on a slow time-scale. The corresponding maximum (penalized) likelihood approach consists of a joint non-linear optimization over u and w of the posterior distribution p(w, u | D) obtained from (7), (10) and (11). Finally, the hyperparameters are given equal fixed values α_w = α_u, as this allows the maximum likelihood and Bayesian approaches to be treated on an equal footing.
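The two-level scheme summarized above can be sketched in a few lines of NumPy/SciPy: an inner closed-form solve of (14) for w_MP given u, and an outer numerical minimization of (16). The sketch is illustrative only: the hyperparameter values are arbitrary, and a general-purpose L-BFGS optimizer with numerical gradients is used in place of the scaled conjugate gradient algorithm with analytic derivatives described in the text. The design matrices Phi and Psi can be built as in the earlier sketch.

    import numpy as np
    from scipy.optimize import minimize

    def fit_most_probable(Phi, Psi, t, alpha_w=0.01, alpha_u=0.01):
        """Find u_MP by minimizing M(u) in (16), solving (14) for w_MP at each step."""
        W = Phi.shape[1]

        def w_mp(u):
            beta = np.exp(Psi @ u)                                   # β_n = exp(u^T ψ(x_n))
            A = Phi.T @ (beta[:, None] * Phi) + alpha_w * np.eye(W)  # Hessian A, eq. (17)
            b = Phi.T @ (beta * t)
            return np.linalg.solve(A, b), beta, A                    # minimizer of S(w), eq. (14)

        def M(u):
            w, beta, A = w_mp(u)
            E = 0.5 * (Phi @ w - t) ** 2                  # E_n of eq. (8), at w_MP
            _, logdetA = np.linalg.slogdet(A)
            return (np.sum(beta * E)                      # Σ β_n E_n
                    + 0.5 * alpha_u * np.sum(u ** 2)      # (α_u/2) ||u||²
                    - 0.5 * np.sum(np.log(beta))          # −(1/2) Σ ln β_n
                    + 0.5 * logdetA)                      # (1/2) ln |A|, eq. (16)

        res = minimize(M, np.zeros(Psi.shape[1]), method="L-BFGS-B")  # outer loop over u
        u_best = res.x
        w_best, _, _ = w_mp(u_best)
        return w_best, u_best

Given a fitted pair (w_best, u_best), the most probable regression function and inverse noise variance on new inputs are obtained as Phi_new @ w_best and np.exp(Psi_new @ u_best), where Phi_new and Psi_new are design matrices built from those inputs.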

3 Results and Discussion

As an illustration of this algorithm, we consider a toy problem involving one input and one output, with a noise variance which has a dependence on the input variable. Since the estimated quantities are noisy, due to the finite data set, we consider an averaging procedure as follows. We generate 100 independent data sets, each consisting of N data points. The model is trained on each of the data sets in turn and then tested on the remaining 99 data sets. Both the y(x; w) and β(x; u) networks have 4 Gaussian basis functions (plus a bias) with width parameters chosen to equal the spacing of the centres. Results are shown in Figure 2. It is clear that the maximum likelihood results are biased and that the noise variance is systematically underestimated. By contrast, the Bayesian results show an improved estimate of the noise variance.

Figure 2 (top row: maximum likelihood; bottom row: Bayesian; left-hand panels: output; right-hand panels: noise variance): The left hand plots show the sinusoidal function (dashed curve) from which the data were generated, together with the regression function averaged over the training sets. The right hand plots show the true noise variance (dashed curve) together with the estimated noise variance, again averaged over the data sets.

This is borne out by evaluating the log likelihood for the test data under the corresponding predictive distributions. The Bayesian approach gives a log likelihood per data point, averaged over the runs, of .8. Due to the over-fitting problem, maximum likelihood occasionally gives extremely large negative values for the log likelihood (when β has been estimated to be very large). Even omitting these extreme values, maximum likelihood still gives an average log likelihood per data point of 7., which is substantially smaller than the Bayesian result.

Recently, MacKay (1995) has proposed a different approach to treating this model, in which the numerical optimization to find the most probable parameters is replaced by a fully Bayesian treatment involving Gibbs sampling from the posterior distribution. It will be interesting to compare these two approaches.

Acknowledgements: This work was supported by EPSRC grant GR/K579, Validation and Verification of Neural Network Systems.

References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

MacKay, D. J. C. (1991). Bayesian Methods for Adaptive Models. Ph.D. thesis, California Institute of Technology.

MacKay, D. J. C. (1995). Probabilistic networks: new models and new methods. In F. Fogelman-Soulié and P. Gallinari (Eds.), Proceedings ICANN'95 International Conference on Artificial Neural Networks, pp. 7. Paris: EC2 & Cie.