CSC321: 2011 Introduction to Neural Networks and Machine Learning. Lecture 11: Bayesian learning continued. Geoffrey Hinton

1 CSC321: 2011 Introduction to Neural Networks and Machine Learning. Lecture 11: Bayesian learning continued. Geoffrey Hinton

2 Bayes Theorem. The joint probability of the data and the weights can be factored either way: $p(D)\,p(W \mid D) = p(D, W) = p(W)\,p(D \mid W)$, so

$$p(W \mid D) = \frac{p(W)\,p(D \mid W)}{p(D)}$$

Here $p(W)$ is the prior probability of the weight vector, $p(W \mid D)$ is the posterior probability of the weight vector given the training data, and $p(D \mid W)$ is the probability of the observed data given the weights.

3 Maximum A Posteriori Learning. This trades off the prior probabilities of the parameters against the probability of the data given the parameters: it looks for the parameters that have the greatest product of the prior term and the likelihood term, i.e. that minimize the cost

$$\text{Cost} = -\log p(W \mid D) = -\log p(W) - \log p(D \mid W) + \log p(D)$$

Minimizing the squared weights is equivalent to maximizing the probability of the weights under a zero-mean Gaussian prior:

$$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W}\, e^{-w^2 / 2\sigma_W^2}, \qquad -\log p(w) = \frac{w^2}{2\sigma_W^2} + k$$

4 The Bayesian interpretation of weight decay. Assuming a Gaussian prior for the weights, and assuming that the model makes a Gaussian prediction, the cost (up to a constant) is

$$C = \frac{1}{2\sigma_D^2} \sum_c (y_c - d_c)^2 + \frac{1}{2\sigma_W^2} \sum_i w_i^2$$

Rescaling by $2\sigma_D^2$ gives

$$C^* = E + \frac{\sigma_D^2}{\sigma_W^2} \sum_i w_i^2, \qquad E = \sum_c (y_c - d_c)^2$$

So the correct value of the weight decay parameter is the ratio of two variances. It's not just an arbitrary hack.
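As a minimal sketch, the rescaled cost $C^*$ can be written directly in code; the two variances are assumed to be known here, and the function name is illustrative:

```python
import numpy as np

def weight_decay_cost(w, y, d, sigma2_D, sigma2_W):
    """C* from the slide: squared error plus a weight penalty whose
    coefficient is the ratio of noise variance to prior variance."""
    E = np.sum((y - d) ** 2)                           # data misfit term
    return E + (sigma2_D / sigma2_W) * np.sum(w ** 2)  # plus weight decay
```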

5 Estimating the variance of the output noise. After we have learned a model that minimizes the squared error, we can find the best value for the output noise. The best value is the one that maximizes the probability of producing exactly the correct answers after adding Gaussian noise to the output produced by the neural net. The best value is found by simply using the variance of the residual errors.
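In code this estimate is essentially one line; a sketch with made-up targets and trained-net outputs:

```python
import numpy as np

d = np.array([0.9, 0.1, 0.4])          # made-up targets
y_pred = np.array([0.8, 0.2, 0.5])     # made-up trained-net outputs
residuals = d - y_pred
sigma2_noise = np.mean(residuals ** 2)  # ML estimate of the output-noise variance
```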

6 Estimating the variance of the Gaussian prior on the weights. After learning a model with some initial choice of variance for the weight prior, we could do a dirty trick called empirical Bayes: set the variance of the Gaussian prior to be whatever makes the weights that the model learned most likely. This is done by simply fitting a zero-mean Gaussian to the one-dimensional distribution of the learned weight values.
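Fitting a zero-mean Gaussian to the learned weights is equally short; the weight vector below is made up:

```python
import numpy as np

w = np.array([0.3, -1.2, 0.7, 0.1, -0.4, 0.9])  # made-up learned weights
sigma2_prior = np.mean(w ** 2)  # ML variance of a zero-mean Gaussian fit to w
```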

7 MacKay's quick and dirty method of choosing the ratio of the noise variance to the weight prior variance. Start with guesses for both the noise variance and the weight prior variance. Do some learning. Reset the noise variance to fit the residual errors. Reset the weight prior variance to fit the actual learned weights. Repeat until bored.
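A sketch of the whole loop, assuming hypothetical `train` and `predict` functions supplied elsewhere; only the two re-estimation lines come from the slide:

```python
import numpy as np

def mackay_loop(X, d, train, predict, iters=10):
    """Alternate between learning and refitting the two variances."""
    sigma2_noise, sigma2_prior = 1.0, 1.0                   # initial guesses
    for _ in range(iters):                                  # "repeat until bored"
        w = train(X, d, decay=sigma2_noise / sigma2_prior)  # do some learning
        residuals = d - predict(w, X)
        sigma2_noise = np.mean(residuals ** 2)  # refit the residual errors
        sigma2_prior = np.mean(w ** 2)          # refit the learned weights
    return sigma2_noise / sigma2_prior
```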

8 Full Bayesian Learning. Instead of trying to find the best single setting of the parameters (as in ML or MAP), compute the full posterior distribution over parameter settings. This is extremely computationally intensive for all but the simplest models (it's feasible for a biased coin). To make predictions, let each different setting of the parameters make its own prediction, and then combine all these predictions by weighting each of them by the posterior probability of that setting of the parameters. This is also computationally intensive. The full Bayesian approach allows us to use complicated models even when we do not have much data.
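For the biased-coin case the full posterior really is cheap; a sketch using a uniform prior over a grid of coin biases and made-up flip counts:

```python
import numpy as np

heads, tails = 53, 47                  # made-up flip counts
p_grid = np.linspace(0.01, 0.99, 99)   # possible settings of the coin's bias
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)
post = np.exp(log_lik - log_lik.max())  # uniform prior: posterior ∝ likelihood
post /= post.sum()
# Each bias setting predicts "heads" with its own probability;
# combine the predictions weighted by posterior probability.
p_next_head = np.sum(post * p_grid)
```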

9 Overfitting: A frequentist illusion? If you do not have much data, you should use a simple model, because a complex one will overfit. This is true, but only if you assume that fitting a model means choosing a single best setting of the parameters. If you use the full posterior over parameter settings, overfitting disappears! With little data, you get very vague predictions because many different parameter settings have significant posterior probability.

10 A classic example of overfitting. Which model do you believe? The complicated model fits the data better, but it is not economical and it makes silly predictions. But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution? Now we get vague and sensible predictions. There is no reason why the amount of data should influence our prior beliefs about the complexity of the model.
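A sketch of the polynomial version under standard conjugate-Gaussian assumptions (the prior precision `alpha` and noise precision `beta` are made-up values, not from the lecture); the predictive variance grows where data is scarce, which is the "vague and sensible" behaviour described above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 6)                              # a handful of made-up inputs
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(6)   # noisy targets

def phi(x, order=5):
    """Fifth-order polynomial features [1, x, ..., x^5]."""
    return np.vander(np.atleast_1d(x), order + 1, increasing=True)

alpha, beta = 1.0, 100.0                 # assumed prior / noise precisions
Phi = phi(x)
S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ t                 # posterior mean of the coefficients

p = phi(0.5)[0]                          # features of a test input
pred_mean = p @ m
pred_var = 1.0 / beta + p @ S @ p        # large away from the training data
```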

11 Approximating full Bayesian learning in a neural network. If the neural net only has a few parameters, we could put a grid over the parameter space and evaluate the posterior at each grid-point. This is expensive, but it does not involve any gradient descent and there are no local optimum issues. After evaluating each grid-point, we use all of them to make predictions on test data. This is also expensive, but it works much better than ML learning when the posterior is vague or multimodal (this happens when data is scarce):

$$p(d_{\text{test}} \mid \text{input}_{\text{test}}, D) = \sum_{g \in \text{grid}} p(W_g \mid D)\, p(d_{\text{test}} \mid \text{input}_{\text{test}}, W_g)$$

12 An example of full Bayesian learning. Allow each of the 6 weights or biases to have the 9 possible values [-2 : 0.5 : 2], so there are 9^6 grid-points in parameter space. For each grid-point, compute the probability of the observed outputs of all the training cases (this is the likelihood term and is explained on the next slide). Multiply the prior for each grid-point by the likelihood term and renormalize to get the posterior probability for each grid-point. Make predictions by using the posterior probabilities to average the predictions made by the different grid-points. [Figure: a neural net with 2 inputs, 1 output and 6 parameters, including bias weights.]

13 Computing the likelihood term for a logistic output unit. The output of the logistic unit is the probability that the network assigns to the answer 1; it assigns the complementary probability to the answer 0:

$$y = f(\text{input}, W_g)$$

$$p(\text{output} = d \mid \text{input}, W_g) = \begin{cases} y & \text{if } d = 1 \\ 1 - y & \text{if } d = 0 \end{cases} = y^d (1 - y)^{1-d}$$

$$p(\text{all training outputs} \mid W_g) = \prod_{\text{training cases } c} p(\text{output} = d_c \mid \text{input}_c, W_g)$$
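Putting slides 12 and 13 together as code. The exact wiring of the six-parameter net in the figure isn't recoverable, so a 2-2-1 layout with no hidden biases stands in for it, and the training cases are made up; the grid loop, likelihood product, renormalization and posterior-weighted prediction follow the slides:

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def net_output(w, x):
    """Assumed stand-in: w[0:4] input-to-hidden, w[4:6] hidden-to-output."""
    h0 = sigmoid(w[0] * x[0] + w[1] * x[1])
    h1 = sigmoid(w[2] * x[0] + w[3] * x[1])
    return sigmoid(w[4] * h0 + w[5] * h1)

values = np.arange(-2.0, 2.01, 0.5)     # the 9 allowed values [-2 : 0.5 : 2]
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # made-up training inputs
d = np.array([0., 1., 1., 0.])          # made-up binary targets

grid, posts = [], []
for w in itertools.product(values, repeat=6):        # all 9**6 grid-points (slow!)
    y = np.array([net_output(w, x) for x in X])
    lik = np.prod(y ** d * (1.0 - y) ** (1.0 - d))   # slide 13's likelihood term
    grid.append(w)
    posts.append(lik)                                # uniform prior assumed
posts = np.array(posts)
posts /= posts.sum()                                 # renormalize to a posterior

x_test = np.array([1., 1.])
p_test = sum(p * net_output(w, x_test) for p, w in zip(posts, grid))
```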

14 What can we do if there are too many parameters for a grid to be feasible? The number of grid points is exponential in the number of parameters, so we cannot deal with more than a few parameters using a grid. If there is enough data to make most parameter vectors very unlikely, only a tiny fraction of the grid points makes a significant contribution to the predictions, and maybe we can just evaluate this tiny fraction. It might be good enough to just sample weight vectors according to their posterior probabilities:

$$p(y_{\text{test}} \mid \text{input}_{\text{test}}, D) = \sum_i p(W_i \mid D)\, p(y_{\text{test}} \mid \text{input}_{\text{test}}, W_i)$$

where we sample the weight vectors $W_i$ with probability $p(W_i \mid D)$.
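The sampling shortcut, reusing `grid`, `posts`, `net_output` and `x_test` from the previous sketch: draw a few hundred weight vectors with probability equal to their posterior and average only their predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.choice(len(grid), size=200, p=posts)  # sample W_i with probability p(W_i|D)
p_test_mc = np.mean([net_output(grid[i], x_test) for i in idx])
```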

15 One method for sampling weight vectors. In standard backpropagation we keep moving the weights in the direction that decreases the cost, i.e. the direction that increases the likelihood plus the prior, summed over all training cases. Suppose we add some Gaussian noise to the weight vector after each update. Then the weight vector never settles down; it keeps wandering around, but it tends to prefer low-cost regions of the weight space. Amazing fact: if we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we will get a sample from the true posterior over weight vectors. This is called a Markov Chain Monte Carlo method, and it makes it feasible to use full Bayesian learning with hundreds or thousands of parameters. There are related MCMC methods that are more complicated but more efficient: we don't need to let the weights wander around for so long before we get samples from the posterior.
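A minimal sketch of the noisy-update idea (Langevin dynamics) on a toy one-dimensional cost, so that "just the right amount of noise" is concrete: with step size ε the injected Gaussian noise has standard deviation sqrt(2ε). The quadratic cost here is made up; for a real net the gradient would come from backpropagation through the likelihood plus the prior:

```python
import numpy as np

def grad_cost(w):
    """Toy cost 0.5*(w - 1)^2, so exp(-cost) is the Gaussian posterior N(1, 1)."""
    return w - 1.0

rng = np.random.default_rng(0)
eps = 0.01                     # step size
w, samples = 0.0, []
for t in range(20000):
    w += -eps * grad_cost(w) + np.sqrt(2 * eps) * rng.standard_normal()
    if t >= 5000 and t % 10 == 0:   # let it wander first (burn-in), then sample
        samples.append(w)
# np.mean(samples) ≈ 1 and np.var(samples) ≈ 1: approximate draws from the posterior
```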
