CSC321 2011: Introduction to Neural Networks and Machine Learning. Lecture 10: The Bayesian way to fit models. Geoffrey Hinton
The Bayesian framework

The Bayesian framework assumes that we always have a prior distribution for everything. The prior may be very vague. When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution. The likelihood term takes into account how probable the observed data is given the parameters of the model. It favors parameter settings that make the data likely. It fights the prior. With enough data, the likelihood terms always win.
A coin tossing example

Suppose we know nothing about coins except that each tossing event produces a head with some unknown probability p and a tail with probability 1-p. Our model of a coin has one parameter, p. Suppose we observe 100 tosses and there are 53 heads. What is p?

The frequentist answer: pick the value of p that makes the observation of 53 heads and 47 tails most probable.

P(D) = p^53 (1-p)^47          (probability of a particular sequence)

dP(D)/dp = 53 p^52 (1-p)^47 - 47 p^53 (1-p)^46 = 0  if  p = 0.53
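The frequentist answer above can be checked numerically. This is a minimal sketch (variable names are my own) that evaluates the log likelihood of 53 heads and 47 tails on a grid of candidate values of p and picks the best one:

```python
import numpy as np

# Maximum likelihood for the coin: evaluate log P(D) = 53 log p + 47 log(1-p)
# on a grid of p values and pick the p that makes the data most probable.
heads, tails = 53, 47
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = heads * np.log(p_grid) + tails * np.log(1 - p_grid)
p_ml = p_grid[np.argmax(log_lik)]
print(round(p_ml, 2))  # the maximum is at p = heads/(heads+tails) = 0.53
```

Working with the log likelihood rather than the likelihood itself avoids underflow and does not move the maximum, a point the later slides make explicitly.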
Some problems with picking the parameters that are most likely to generate the data

What if we only tossed the coin once and we got 1 head? Is p = 1 a sensible answer? Surely p = 0.5 is a much better answer.

Is it reasonable to give a single answer? If we don't have much data, we are unsure about p. Our computations of probabilities will work much better if we take this uncertainty into account.
Using a distribution over parameter values

Start with a prior distribution over p. In this case we used a uniform distribution. Multiply the prior probability of each parameter value by the probability of observing a head given that value. Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution.

[Figure: probability density over p on [0, 1] for the uniform prior (area = 1), the prior times the head likelihood, and the renormalized posterior (area = 1).]
Let's do it again: suppose we get a tail

Start with a prior distribution over p. Multiply the prior probability of each parameter value by the probability of observing a tail given that value. Then renormalize to get the posterior distribution. Look how sensible it is!

[Figure: the prior (area = 1), the product with the tail likelihood, and the renormalized posterior over p on [0, 1] (area = 1).]
Let's do it another 98 times

After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior).

[Figure: posterior probability density over p on [0, 1], area = 1, peaked at 0.53.]
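The toss-by-toss updating on these slides can be sketched with a simple grid approximation (the variable names here are my own): start from a uniform prior over p, multiply in the likelihood of each toss, and renormalize after every step so the area under the density stays 1.

```python
import numpy as np

p = np.linspace(0.001, 0.999, 999)        # grid of candidate values of p
dp = p[1] - p[0]
posterior = np.ones_like(p)               # uniform prior
tosses = [1] * 53 + [0] * 47              # 53 heads (1) and 47 tails (0)
for toss in tosses:
    # multiply by the likelihood of this toss: p for a head, 1-p for a tail
    posterior *= p if toss == 1 else (1.0 - p)
    # rescale so the densities integrate to 1 (this is the renormalization step)
    posterior /= posterior.sum() * dp
print(round(p[np.argmax(posterior)], 2))  # the posterior peaks at 0.53
```

Renormalizing after every toss also keeps the numbers in a sensible range; multiplying 100 raw likelihoods together first would underflow.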
Bayes Theorem

joint probability = conditional probability x prior:

p(D) p(W|D) = p(D, W) = p(W) p(D|W)

so

p(W|D) = p(W) p(D|W) / p(D)

where p(W) is the prior probability of the weight vector, p(W|D) is the posterior probability of the weight vector given the training data, and p(D|W) is the probability of the observed data given the weight vector.
A cheap trick to avoid computing the posterior probabilities of all weight vectors

Suppose we just try to find the most probable weight vector. We can do this by starting with a random weight vector and then adjusting it in the direction that improves p(W|D). It is easier to work in the log domain. If we want to minimize a cost we use negative log probabilities:

p(W|D) = p(W) p(D|W) / p(D)

Cost = -log p(W|D) = -log p(W) - log p(D|W) + log p(D)
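This cheap trick can be sketched on a toy problem (everything here, the data values, the prior, the learning rate, is a hypothetical choice of mine): a single weight w with a zero-mean unit-variance Gaussian prior, observations d_c = w + Gaussian noise, and gradient descent on the cost. The -log p(D) term is constant in W, so it drops out of the gradient.

```python
import numpy as np

# Cost = -log p(w) - log p(D|w) + const = w^2/2 + sum_c (d_c - w)^2/2 + const
data = np.array([1.8, 2.2, 1.9, 2.1])   # hypothetical observations
w = 0.0                                  # start from an arbitrary weight
for _ in range(2000):
    grad = w + (w - data).sum()          # d Cost / dw
    w -= 0.01 * grad                     # adjust w in the improving direction
# closed-form most probable weight: sum(data) / (n + 1) = 8.0 / 5 = 1.6
print(round(w, 3))
```

Note how the prior pulls the answer (1.6) below the plain data mean (2.0); with more data the likelihood term would dominate, as the first slide promised.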
Why we maximize sums of log probs

We want to maximize the product of the probabilities of the outputs on all the different training cases. Assume the output errors on different training cases, c, are independent:

p(D|W) = prod_c p(d_c | W)

Because the log function is monotonic, it does not change where the maxima are. So we can maximize sums of log probabilities:

log p(D|W) = sum_c log p(d_c | W)
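There is also a numerical reason to prefer sums of logs. A quick sketch (the probabilities are hypothetical): the product of many small per-case probabilities underflows to zero in floating point, while the sum of their logs stays perfectly well behaved.

```python
import math

probs = [0.01] * 200                         # 200 per-case probabilities
product = math.prod(probs)                   # 1e-400 underflows to 0.0
log_sum = sum(math.log(q) for q in probs)    # fine: 200 * log(0.01)
print(product, log_sum)
```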
An even cheaper trick

Suppose we completely ignore the prior over weight vectors. This is equivalent to giving all possible weight vectors the same prior probability density. Then all we have to do is to maximize:

log p(D|W) = sum_c log p(d_c | W)

This is called maximum likelihood learning. It is very widely used for fitting models in statistics.
Supervised Maximum Likelihood Learning

Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess.

y_c = f(input_c, W) = model's estimate of the most probable value of the output
d_c = the correct answer

p(output = d_c | input_c, W) = (1 / sqrt(2 pi sigma^2)) e^(-(d_c - y_c)^2 / 2 sigma^2)

-log p(output = d_c | input_c, W) = k + (d_c - y_c)^2 / 2 sigma^2
Supervised Maximum Likelihood Learning

Finding a set of weights, W, that minimizes the squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases. We implicitly assume that zero-mean Gaussian noise is added to the model's actual output. We do not need to know the variance of the noise because we are assuming it's the same in all cases. So it just scales the squared error.
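This equivalence can be checked directly on a toy model (the data, the model y = w * x, and the noise level are all hypothetical choices of mine): over a grid of candidate weights, the w that minimizes the summed squared error is exactly the w that maximizes the Gaussian log likelihood of the targets.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
d = 2.0 * x + rng.normal(scale=0.5, size=50)   # targets = w*x + Gaussian noise
sigma = 0.5
ws = np.linspace(0.0, 4.0, 401)                # grid of candidate weights

# summed squared error for each candidate w
sq_err = np.array([((d - w * x) ** 2).sum() for w in ws])
# Gaussian log likelihood: sum_c [ -0.5 log(2 pi sigma^2) - (d_c - y_c)^2 / 2 sigma^2 ]
log_lik = np.array([(-0.5 * np.log(2 * np.pi * sigma**2)
                     - (d - w * x) ** 2 / (2 * sigma**2)).sum() for w in ws])

# both criteria pick out the same weight, since log_lik = const - sq_err/(2 sigma^2)
print(ws[np.argmin(sq_err)], ws[np.argmax(log_lik)])
```

The constant term and the 1/(2 sigma^2) scale never move the optimum, which is exactly why the unknown noise variance does not need to be known.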