CSC2535: 2013 Advanced Machine Learning
Lecture 3a: The Origin of Variational Bayes
Geoffrey Hinton
The origin of variational Bayes
In variational Bayes, we approximate the true posterior distribution over the parameters by a much simpler, factorial distribution.
Since we are being Bayesian, we need a prior over the parameters.
When we use standard L2 weight decay we are implicitly assuming a Gaussian prior with zero mean.
Could we have a more interesting prior?
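The correspondence between weight decay and a Gaussian prior is one line of algebra: the negative log density of a zero-mean Gaussian is a quadratic penalty,

$$-\log p(w) \;=\; \frac{w^2}{2\sigma^2} + \text{const},$$

so minimizing squared error plus an L2 penalty with coefficient $\lambda$ is the same as maximizing a log posterior under an implicit prior variance $\sigma^2 = 1/(2\lambda)$.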
Types of weight penalty
Sometimes it works better to use a weight penalty that has negligible effect on large weights.
We can easily make up a heuristic cost function for this, such as
$$C(w) \;=\; \frac{\lambda\, w^2}{1 + k\,w^2},$$
which rises like $\lambda w^2$ near zero but levels off for large weights.
But we get more insight if we view it as the negative log probability of a weight under a mixture of two zero-mean Gaussians, one narrow and one broad.
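As a concrete illustration, here is a minimal NumPy sketch of that negative log probability under a mixture of a narrow and a broad zero-mean Gaussian. The mixing proportion and the two variances are illustrative choices, not values from the lecture:

```python
import numpy as np

def mixture_penalty(w, pi_narrow=0.9, var_narrow=0.01, var_broad=10.0):
    """Negative log p(w) under pi*N(0, var_narrow) + (1-pi)*N(0, var_broad)."""
    def gauss(w, var):
        return np.exp(-0.5 * w**2 / var) / np.sqrt(2 * np.pi * var)
    p = pi_narrow * gauss(w, var_narrow) + (1 - pi_narrow) * gauss(w, var_broad)
    return -np.log(p)

w = np.linspace(-3, 3, 7)
print(mixture_penalty(w))  # steep near zero, nearly flat for large |w|
```

The penalty is heavy-tailed: the narrow Gaussian pulls small weights hard toward zero, while the broad Gaussian means large weights feel almost no extra pressure.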
Soft weight-sharing
Le Cun showed that networks generalize better if we constrain subsets of the weights to be equal.
This removes degrees of freedom from the parameters, so it simplifies the model.
But for most tasks we do not know in advance which weights should be the same.
Maybe we can learn which weights should be the same.
Modeling the distribution of the weights
The values of the weights form a distribution in a one-dimensional space.
If the weights are tightly clustered, they have high probability density under a mixture of Gaussians model.
To raise the probability density, move each weight towards its nearest cluster center.
[Figure: a mixture-of-Gaussians density p(w) over the weight value w]
Fitting the weights and the mixture prior together
We can alternate between two types of update (sketched below):
- Adjust the weights to reduce the error in the output and to increase the probability density of the weights under the mixture prior.
- Adjust the means, variances, and mixing proportions in the mixture prior to fit the posterior distribution of the weights better. This is called empirical Bayes.
This automatically clusters the weights. We do not need to specify in advance which weights should belong to the same cluster.
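The alternation can be sketched in a few lines of NumPy. The function names and update rules below are my illustration of the idea, not code from the lecture: an EM-style refit of the mixture to the current weights, plus the gradient of the negative log prior, which pulls each weight toward a responsibility-weighted blend of cluster centers. In a real run, the task-error gradient from backpropagation would be added to the prior gradient.

```python
import numpy as np

def responsibilities(w, pi, mu, var):
    """Posterior probability of each Gaussian j given each weight i."""
    # d[i, j] = pi_j * N(w_i | mu_j, var_j)
    d = pi * np.exp(-0.5 * (w[:, None] - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    return d / d.sum(axis=1, keepdims=True)

def update_prior(w, r):
    """Refit mixing proportions, means, and variances to the current weights."""
    n_j = r.sum(axis=0)
    pi = n_j / n_j.sum()
    mu = (r * w[:, None]).sum(axis=0) / n_j
    var = (r * (w[:, None] - mu)**2).sum(axis=0) / n_j
    return pi, mu, np.maximum(var, 1e-6)   # floor keeps variances positive

def prior_gradient(w, r, mu, var):
    """d(-log p(w))/dw: pulls each weight toward the cluster centers."""
    return (r * (w[:, None] - mu) / var).sum(axis=1)

# One round of each update, with illustrative initial values.
w = np.array([-0.9, -1.1, 0.02, -0.05, 1.0])
pi, mu, var = np.ones(3) / 3, np.array([-1.0, 0.0, 1.0]), np.ones(3) * 0.25
pi, mu, var = update_prior(w, responsibilities(w, pi, mu, var))
w -= 0.1 * prior_gradient(w, responsibilities(w, pi, mu, var), mu, var)
```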
A different optimization method
Alternatively, we can just apply conjugate gradient descent to all of the parameters in parallel (which is what we did).
To keep the variances positive, use the log variances in the optimization (these are the natural parameters for a scale variable).
To ensure that the mixing proportions of the Gaussians sum to 1, use the parameters of a softmax in the optimization:
$$\pi_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
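A minimal sketch of this unconstrained parameterization, assuming one flat parameter vector holding means, log-variances, and softmax logits (the layout and names are my choices):

```python
import numpy as np

def unpack(theta, n_components):
    """theta holds [means, log-variances, softmax logits], concatenated."""
    mu = theta[:n_components]
    var = np.exp(theta[n_components:2 * n_components])  # always positive
    logits = theta[2 * n_components:]
    pi = np.exp(logits - logits.max())                  # stable softmax
    pi /= pi.sum()                                      # sums to one
    return mu, var, pi

mu, var, pi = unpack(np.random.randn(9), 3)
print(var.min() > 0, np.isclose(pi.sum(), 1.0))
```

Any real-valued parameter vector now maps to valid variances and mixing proportions, so a generic optimizer can be applied to everything at once.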
The cost function and its derivatives
$$C \;=\; k\sum_c \frac{(t_c - y_c)^2}{2\sigma_{\mathrm{out}}^2} \;-\; \sum_i \log \sum_j \pi_j\, p(w_i \mid \mu_j, \sigma_j)$$
The first term is the negative log probability of the desired output under a Gaussian whose mean is the output of the net; $p(w_i \mid \mu_j, \sigma_j)$ is the probability of weight $i$ under Gaussian $j$.
The derivative with respect to a weight involves $r_j(w_i)$, the posterior probability of Gaussian $j$ given weight $i$:
$$r_j(w_i) = \frac{\pi_j\, p(w_i \mid \mu_j, \sigma_j)}{\sum_m \pi_m\, p(w_i \mid \mu_m, \sigma_m)}, \qquad \frac{\partial C}{\partial w_i} \;=\; k\,\frac{\partial}{\partial w_i}\sum_c \frac{(t_c - y_c)^2}{2\sigma_{\mathrm{out}}^2} \;+\; \sum_j r_j(w_i)\,\frac{w_i - \mu_j}{\sigma_j^2}$$
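A minimal NumPy sketch of evaluating this cost. The function name is mine, and k here folds in the $1/(2\sigma_{\mathrm{out}}^2)$ trade-off coefficient:

```python
import numpy as np

def soft_sharing_cost(t, y, w, pi, mu, var, k=1.0):
    """Squared-error misfit plus negative log prob of the weights
    under the mixture-of-Gaussians prior."""
    data_misfit = k * np.sum((t - y)**2)
    dens = pi * np.exp(-0.5 * (w[:, None] - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    prior_cost = -np.log(dens.sum(axis=1)).sum()
    return data_misfit + prior_cost

print(soft_sharing_cost(np.array([1.0]), np.array([0.8]),
                        np.array([0.5, -0.5]), np.ones(2) / 2,
                        np.array([-0.5, 0.5]), np.ones(2) * 0.1))
```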
The sunspot prediction problem
Predicting the number of sunspots next year is important because they affect weather and communications.
The whole time series has less than 400 points and there is no obvious way to get any more data.
So it is worth using computationally expensive methods to get good predictions.
The best model produced by statisticians was a combination of two linear autoregressive models that switched at a particular threshold value.
Heavy-tailed weight decay works better.
Soft weight-sharing using a mixture of Gaussians prior works even better.
The weights learned by the eight hidden units for predicting the number of sunspots
[Figure: weight diagrams for the eight hidden units, annotated "fin de siècle"]
Rule 1 (uses 2 units): high if high last year.
Rule 2: high if high 6, 9, or 11 years ago.
Rule 3: low if low 1 or 8 ago & high 2 or 3 ago.
Rule 4: low if high 9 ago and low 1 or 3 ago.
The Toronto distribution
The mixture of five Gaussians learned for clustering the weights.
[Figure: the learned mixture density, whose silhouette is labeled "the SkyDome"]
Weights near zero are very cheap because they have high density under the empirical prior.
Predicting sunspot numbers far into the future by iterating the single-year predictions.
The net with soft weight-sharing gives the lowest errors.
The problem with soft weight-sharing
It constructs a sensible empirical prior for the weights.
But it ignores the fact that some weights need to be coded accurately and others can be very imprecise without having much effect on the squared error.
A coding framework needs to model the number of bits required to code the value of a weight, and this depends on the precision as well as the value.
Using the variational approach to make Bayesian learning efficient
Consider a standard backpropagation network with one hidden layer and the squared error function.
The full Bayesian approach to learning is:
- Start with a prior distribution across all possible weight vectors.
- Multiply the prior for each weight vector by the probability of the observed outputs given that weight vector, and then renormalize to get the posterior distribution.
- Use this posterior distribution over all possible weight vectors for making predictions.
This is not feasible for large nets. Can we use a tractable approximation to the posterior?
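In symbols (the notation is added here for concreteness), the second step is Bayes' rule applied to weight vectors $\mathbf{W}$ given training data $D$, and the third step averages the net's predictions over the posterior:

$$p(\mathbf{W}\mid D) \;=\; \frac{p(\mathbf{W})\,p(D\mid \mathbf{W})}{\int p(\mathbf{W}')\,p(D\mid \mathbf{W}')\,d\mathbf{W}'}, \qquad p(y\mid x, D) \;=\; \int p(y\mid x,\mathbf{W})\,p(\mathbf{W}\mid D)\,d\mathbf{W}$$

The integrals over all possible weight vectors are what make the exact approach infeasible for large nets.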
An independence assumption
We can approximate the posterior distribution by assuming that it is an axis-aligned Gaussian in weight space, i.e. we give each weight its own posterior variance.
Weights that are not very important for minimizing the squared error will have big variances.
This can be interpreted nicely in terms of minimum description length:
- Weights with high posterior variances can be communicated in very few bits.
- This is because we can use lots of entropy to pick a precise value from the posterior, so we get lots of bits back.
Communicating a noisy weight
First pick a precise value for the weight from its posterior.
We will get back a number of bits equal to the entropy of the weight.
(We could imagine quantizing with a very small quantization width to eliminate the infinities.)
Then code the precise value under the Gaussian prior.
This costs a number of bits equal to the cross-entropy between the posterior and the prior:
$$\underbrace{\Big(-\!\int Q(w)\log P(w)\,dw\Big)}_{\text{expected bits to send the weight}} \;-\; \underbrace{\Big(-\!\int Q(w)\log Q(w)\,dw\Big)}_{\text{expected bits back}}$$
The cost of communicating a noisy weight
If the sender and receiver agree on a prior distribution, P, for the weights, the cost of communicating a weight with posterior distribution Q is:
$$\mathrm{KL}(Q\,\|\,P) \;=\; \int Q(w)\,\log\frac{Q(w)}{P(w)}\,dw$$
If the distributions are both Gaussian this cost becomes:
$$\mathrm{KL}(Q\,\|\,P) \;=\; \log\frac{\sigma_P}{\sigma_Q} \;+\; \frac{\sigma_Q^2 + (\mu_Q-\mu_P)^2}{2\sigma_P^2} \;-\; \frac{1}{2}$$
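A quick sketch checking the closed-form Gaussian KL above against a Monte Carlo estimate (all numerical values are illustrative):

```python
import numpy as np

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """KL(Q || P) for two 1-D Gaussians, in nats."""
    return (np.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

rng = np.random.default_rng(0)
w = rng.normal(1.0, 0.5, 1_000_000)                       # samples from Q = N(1, 0.5^2)
log_q = -0.5 * ((w - 1.0) / 0.5)**2 - np.log(0.5 * np.sqrt(2 * np.pi))
log_p = -0.5 * (w / 2.0)**2 - np.log(2.0 * np.sqrt(2 * np.pi))  # P = N(0, 2^2)
print(kl_gauss(1.0, 0.5, 0.0, 2.0), np.mean(log_q - log_p))    # should agree
```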
What do noisy weights do to the expected squared error?
Consider a linear neuron with a single input, and let the weight on this input be stochastic: $w \sim \mathcal{N}(\mu_w, \sigma_w^2)$.
The stochastic output of the neuron is $y = wx$, with mean $\langle y \rangle = x\,\mu_w$.
The noise variance of the weight gets multiplied by the squared input value and added to the squared error:
$$\big\langle (t - wx)^2 \big\rangle \;=\; (t - x\mu_w)^2 \;+\; \underbrace{x^2\sigma_w^2}_{\text{extra squared error caused by the noisy weight}}$$
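The identity is easy to verify by sampling (the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x, t, mu, sigma = 1.5, 2.0, 0.8, 0.3
w = rng.normal(mu, sigma, 1_000_000)     # stochastic weight samples
print(np.mean((t - w * x)**2))           # Monte Carlo expected squared error
print((t - x * mu)**2 + x**2 * sigma**2) # closed form: matches
```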
How to deal with the non-linearity in the hidden units
The noise on the incoming connections to a hidden unit is independent, so its variance adds.
This Gaussian input noise to a hidden unit turns into non-Gaussian output noise, but we can use a big table to find the mean and variance of this non-Gaussian noise.
The non-Gaussian noise coming out of each hidden unit is independent, so we can just add up the variances coming into an output unit.
[Figure: independent noise variances σ₁², σ₂², σ₃², σ₄² adding along the connections]
The mean and variance of the output of a logistic hidden unit
[Figure: Gaussian noise on the total input becomes non-Gaussian noise on the output of the logistic]
The forward table
The forward table is indexed by the mean $\mu_{in}$ and the variance $\sigma_{in}^2$ of the Gaussian total input to a hidden unit.
It returns the mean $\mu_{out}$ and variance $\sigma_{out}^2$ of the non-Gaussian output.
This non-Gaussian mean and variance is all we need to compute the expected squared error.
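The lecture uses a precomputed table; purely to illustrate what one entry of it contains, here is a sketch that computes the output moments directly with Gauss-Hermite quadrature (the function name and quadrature order are my choices):

```python
import numpy as np

def logistic_out_moments(mu_in, var_in, n_points=64):
    """Mean and variance of sigmoid(x) when x ~ N(mu_in, var_in)."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    x = mu_in + np.sqrt(var_in) * nodes   # quadrature points for the total input
    y = 1.0 / (1.0 + np.exp(-x))
    p = weights / weights.sum()           # normalized standard-normal weights
    mu_out = np.sum(p * y)
    var_out = np.sum(p * (y - mu_out)**2)
    return mu_out, var_out

print(logistic_out_moments(0.5, 4.0))
```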
The backward table
The backward table is also indexed by the mean and variance of the total input.
It returns four partial derivatives,
$$\frac{\partial \mu_{out}}{\partial \mu_{in}},\quad \frac{\partial \sigma_{out}^2}{\partial \mu_{in}},\quad \frac{\partial \mu_{out}}{\partial \sigma_{in}^2},\quad \frac{\partial \sigma_{out}^2}{\partial \sigma_{in}^2},$$
which are all we need for backpropagating the derivatives of the squared error to the input → hidden weights.
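One backward-table entry can likewise be sketched by finite differences on the forward computation. This reuses logistic_out_moments from the previous sketch, and the step size is an arbitrary choice:

```python
def backward_entry(mu_in, var_in, eps=1e-5):
    """The four partial derivatives of (mu_out, var_out) w.r.t. (mu_in, var_in)."""
    f = logistic_out_moments
    # [d mu_out / d mu_in, d var_out / d mu_in]
    dmu = [(a - b) / (2 * eps)
           for a, b in zip(f(mu_in + eps, var_in), f(mu_in - eps, var_in))]
    # [d mu_out / d var_in, d var_out / d var_in]
    dvar = [(a - b) / (2 * eps)
            for a, b in zip(f(mu_in, var_in + eps), f(mu_in, var_in - eps))]
    return dmu, dvar

print(backward_entry(0.5, 4.0))
```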
Empirical Bayes: fitting the prior
We can now trade off the precision of the weights against the extra squared error caused by noisy weights.
Even though the residuals are non-Gaussian, we can choose to code them using a Gaussian.
We can also learn the width of the prior Gaussian used for coding the weights.
We can even have a mixture of Gaussians prior:
- This allows the posterior weights to form clusters.
- Very good for coding lots of zero weights precisely without using many bits.
- Also makes large weights cheap if they are the same as other large weights.
Some weights learned by variational Bayes
[Figure: the learned weights, showing the output weight and bias for each hidden unit and the weights from the 18 input units]
It learns a few big positive weights, a few big negative weights, and lots of zeros.
It has found four rules that work well.
Only 105 training cases were used to train 51 weights.
The learned empirical prior for the weights
The posterior for the weights needs to be Gaussian to make it possible to figure out the extra squared error caused by noisy weights and the cost of coding the noisy weights.
The learned prior can be a mixture of Gaussians.
This learned prior is a mixture of 5 Gaussians with 14 parameters (5 means, 5 variances, and 4 independent mixing proportions).