Where now? Machine Learning and Bayesian Inference

Machine Learning and Bayesian Inference
Dr Sean Holden, Computer Laboratory, Room FC6
Telephone extension 63725, www.cl.cam.ac.uk/~sbh/
Copyright (c) Sean Holden 2002-17.

Where now?

There are some simple take-home messages from the study of SVMs:

- You can get state-of-the-art performance.
- You can do this using the kernel trick to obtain a non-linear model.
- You can do this without invoking the full machinery of the Bayes-optimal classifier.

BUT: you don't have anything keeping you honest regarding which assumptions you're making. As we shall see, by using the full-strength probabilistic framework we gain some useful extras, in particular the ability to assign confidences to our predictions.

Part III: back to Bayes

- Bayesian neural networks
- Gaussian processes

Bayesian neural networks

We're now going to see how the idea of the Bayes-optimal classifier can be applied to neural networks. We're only going to consider regression; classification can also be done this way, but it's a bit more complicated. For classification we derived the Bayes-optimal classifier as the maximizer of

    Pr(C | x, s) = ∫ Pr(C | w, x) p(w | s) dw

For regression the Bayes-optimal predictor ends up having the same expression as we've already seen. We have:

- A neural network computing a function h_w(x). (In fact this can be pretty much any parameterized function we like.)
- A training sequence s^T = [(x_1, y_1), ..., (x_m, y_m)], split into y = (y_1 y_2 ... y_m)^T and X = (x_1 x_2 ... x_m).

Here s is the training set, x is a new example to be classified, and Y is the RV representing the prediction for x. We want to compute

    p(Y | x, s) = ∫ p(Y | w, x) p(w | s) dw

where the first factor in the integrand is the likelihood and the second is the posterior.

It turns out that if you try to incorporate the density p(x), modelling how feature vectors are generated, things can get complicated. So: we regard all input vectors x as fixed: they are not treated as random variables. This means that, strictly speaking, they should no longer appear in expressions like p(Y | w, x). However, this seems to be uniformly disliked: writing p(Y | w) for an expression that still depends on x seems confusing. Solution: write p(Y | w; x) instead. Note the semi-colon! So we're actually going to look at

    p(Y | y; x, X) = ∫ p(Y | w; x) p(w | y; X) dw

where the first factor in the integrand is the likelihood and the second is the posterior. NOTE: this is a notational hack. There's nothing new, just an attempt at clarity.

What's going on? Turning prior into posterior

Let's make a brief sidetrack into what's going on with the posterior density

    p(w | y; X) ∝ p(y | w; X) p(w)

Typically the prior starts wide, and as we see more data the posterior narrows: the posterior density p(w | y; X) becomes more localised around w_MAP. [Figure: the prior p(w) and the posterior p(w | y; X), with w_MAP marked.] This can be seen very clearly if we use real numbers. [Figure: example data, the prior density p(w), the likelihood p(y | w; X) and the posterior density p(w | y; X).]

So now we have three things to do:

- STEP 1: remind ourselves what p(Y | w; x) is.
- STEP 2: remind ourselves what p(w | y; X) is.
- STEP 3: do the integral. (This is the fun bit.)

The first two steps are straightforward, as we've already derived them when looking at maximum-likelihood and MAP learning.
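
Before going through the steps, here is a tiny numerical illustration of the prior-to-posterior narrowing just described. It uses a deliberately simple assumed model (a single weight w, with y = wx plus Gaussian noise) so that the posterior is available in closed form; the model and all numbers are illustrative only, not the example from the slides.

```python
import numpy as np

alpha, beta = 1.0, 25.0            # prior precision and noise precision (illustrative)
w_true = 0.8                       # assumed "true" weight used to simulate data
rng = np.random.default_rng(0)

for m in [0, 1, 5, 50]:
    x = rng.uniform(-1, 1, size=m)
    y = w_true * x + rng.normal(0.0, 1.0 / np.sqrt(beta), size=m)
    # With prior w ~ N(0, 1/alpha) and likelihood precision beta per example,
    # the posterior over w is Gaussian with:
    post_prec = alpha + beta * np.sum(x ** 2)
    post_mean = beta * np.sum(x * y) / post_prec
    print(f"m = {m:3d}: posterior mean = {post_mean:+.3f}, std = {post_prec ** -0.5:.3f}")
```

The printed standard deviation shrinks as m grows: the prior dominates for small training sets and the data dominate for large ones.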

STEP 1: assuming Gaussian noise is added to the labels, so y = h_w(x) + ε where ε ~ N(0, σ_n²), we have the usual likelihood

    p(Y | w; x) = (1/√(2πσ_n²)) exp(−(Y − h_w(x))² / (2σ_n²))

Here the subscript in σ_n reminds us that this is the noise variance. Traditionally this is re-written using the hyperparameter β = 1/σ_n², so the likelihood is

    p(Y | w; x) ∝ exp(−(β/2)(Y − h_w(x))²)

STEP 2: the posterior is also exactly as it was when we derived the MAP learning algorithms,

    p(w | y; X) ∝ p(y | w; X) p(w)

and, as before, the likelihood is

    p(y | w; X) ∝ exp(−(β/2) Σ_{i=1}^m (y_i − h_w(x_i))²) = exp(−βE(w))

where E(w) = (1/2) Σ_{i=1}^m (y_i − h_w(x_i))². Using a Gaussian prior with mean 0 and covariance Σ = σ²I gives

    p(w) ∝ exp(−(α/2)‖w‖²)

where traditionally the second hyperparameter is α = 1/σ². Combining these,

    p(w | y; X) = (1/Z(α, β)) exp(−((α/2)‖w‖² + βE(w)))

What's going on? Turning prior into posterior

Considering the central part of p(w | y; X),

    (α/2)‖w‖² + βE(w)

what happens as the number m of examples increases? The first term, corresponding to the prior, remains fixed; the second term, corresponding to the likelihood, increases. So for small training sequences the prior dominates, but for large ones w_ML is a good approximation to w_MAP.

STEP 3: putting together steps 1 and 2, the integral we need to evaluate is

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−((α/2)‖w‖² + βE(w))) dw

with the first factor being the likelihood and the second the (unnormalised) posterior. Obviously this gives us all a sad face, because there is no closed-form solution. So what can we do now?
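
To make the quantities above concrete, here is a toy numpy sketch of h_w(x), E(w) and the central part S(w) = (α/2)‖w‖² + βE(w), assuming a small one-hidden-layer tanh network. The architecture, the weight packing and the names (h_w, E, S, n_hidden) are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def h_w(w, X, n_hidden=5):
    """Toy one-hidden-layer tanh network; w packs (W1, b1, W2, b2)."""
    d = X.shape[1]
    W1 = w[:n_hidden * d].reshape(n_hidden, d)
    b1 = w[n_hidden * d:n_hidden * (d + 1)]
    W2 = w[n_hidden * (d + 1):n_hidden * (d + 2)]
    b2 = w[n_hidden * (d + 2)]
    return np.tanh(X @ W1.T + b1) @ W2 + b2

def E(w, X, y):
    """Sum-of-squares error E(w) = (1/2) sum_i (y_i - h_w(x_i))^2."""
    return 0.5 * np.sum((y - h_w(w, X)) ** 2)

def S(w, X, y, alpha, beta):
    """Minus the log posterior, up to a constant: (alpha/2)||w||^2 + beta E(w)."""
    return 0.5 * alpha * (w @ w) + beta * E(w, X, y)

# Total number of weights for d-dimensional inputs: n_hidden*d + 2*n_hidden + 1.
```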

In order to make further progress it's necessary to perform integrals of the general form

    ∫ F(w) p(w | y; X) dw

for various functions F, and this is generally not possible. There are two ways to get around this:

- We can use an approximate form for p(w | y; X).
- We can use Monte Carlo methods.

We'll be taking a look at both possibilities.

Method 1: approximation to p(w | y; X)

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−((α/2)‖w‖² + βE(w))) dw

with the first factor being the likelihood p(Y | w; x) and the second the posterior p(w | y; X). The first approach introduces a Gaussian approximation to p(w | y; X) by using a Taylor expansion of

    S(w) = (α/2)‖w‖² + βE(w)

at the maximum a posteriori weights w_MAP. This allows us to use a standard integral. The result will be approximate, but we hope it's good! Let's recall how Taylor series work.

Reminder: Taylor expansion

In one dimension, the Taylor expansion about a point x₀ ∈ R for a function f : R → R is

    f(x) ≈ f(x₀) + (1/1!)(x − x₀)f'(x₀) + (1/2!)(x − x₀)²f''(x₀) + ... + (1/k!)(x − x₀)^k f^(k)(x₀)

What does this look like for the kinds of function we're interested in? The functions of interest look like exp(−f(x)). [Figure: the function f(x) and the function exp(−f(x)).] As an example, we can try to approximate exp(−f(x)) for a particular f(x) that has a form similar to S(w), but in one dimension. By replacing f(x) with its Taylor expansion about the point at which exp(−f(x)) is maximised, we can see what the approximation to exp(−f(x)) looks like. Note that the exp hugely emphasises peaks. (A sketch of this idea appears below.)
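
The particular one-dimensional f(x) used on the slides is not recoverable from this transcription, so the sketch below uses an assumed quartic to illustrate the same idea: replace f with its second-order Taylor expansion about the point where exp(−f(x)) peaks, then compare the two curves.

```python
import numpy as np

f = lambda x: 0.25 * x**4 - x**2 + 0.5 * x      # assumed example function
df = lambda x: x**3 - 2.0 * x + 0.5             # f'
d2f = lambda x: 3.0 * x**2 - 2.0                # f''

# Find the minimum of f (the maximum of exp(-f)) by a crude grid search.
xs = np.linspace(-3.0, 3.0, 10001)
x0 = xs[np.argmin(f(xs))]

# Second-order Taylor approximation of f about x0 (f'(x0) is ~0 there).
f_quad = lambda x: f(x0) + df(x0) * (x - x0) + 0.5 * d2f(x0) * (x - x0) ** 2

exact = np.exp(-f(xs))
approx = np.exp(-f_quad(xs))
print("peak location:", x0)
print("max abs error near the peak:",
      np.max(np.abs(exact - approx)[np.abs(xs - x0) < 0.5]))
```

Near the peak the quadratic (k = 2) approximation is very accurate, which is exactly what the exp-weighting rewards.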

Reminder: Taylor expansion

Here are the approximations for the first few values of k. [Figure: the exact exp(−f(x)) compared with exp(−f(x)) using the Taylor expansion of f for increasing k.] The use of k = 2 looks promising.

In multiple dimensions, the Taylor expansion for k = 2 is

    f(x) ≈ f(x₀) + (x − x₀)ᵀ ∇f(x)|_{x₀} + (1/2)(x − x₀)ᵀ ∇∇f(x)|_{x₀} (x − x₀)

where ∇ denotes the gradient

    ∇f(x) = (∂f(x)/∂x₁, ..., ∂f(x)/∂x_n)ᵀ

and ∇∇f(x) is the matrix with elements

    M_ij = ∂²f(x) / ∂x_i ∂x_j

(Looks complicated, but it's just the obvious extension of the 1-dimensional case.)

Method 1: approximation to p(w | y; X)

Applying this to S(w) = (α/2)‖w‖² + βE(w) and expanding around w_MAP:

    S(w) ≈ S(w_MAP) + (1/2)(w − w_MAP)ᵀ A (w − w_MAP)

As w_MAP minimises the function, the first derivatives are zero and the corresponding term in the Taylor expansion disappears. The quantity A = ∇∇S(w)|_{w_MAP} can be simplified. This is because

    A = ∇∇((α/2)‖w‖² + βE(w))|_{w_MAP} = αI + β∇∇E(w_MAP)

We actually already know something about how to get w_MAP:

- A method such as backpropagation can be used to compute ∇S(w).
- The vector w_MAP can then be obtained using any standard optimisation method (such as gradient descent).

It's also likely to be straightforward to compute ∇∇E(w): this quantity can be evaluated using an extended form of backpropagation. (A rough numerical sketch follows.)
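
A rough sketch of obtaining w_MAP and A = αI + β∇∇E(w_MAP), reusing the toy h_w, E and S from the earlier sketch. Finite differences stand in for the extended backpropagation mentioned above, and the optimiser choice is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_laplace(X, y, alpha, beta, n_weights, eps=1e-4, seed=0):
    """Find w_MAP by minimising S(w), then build A = alpha*I + beta*Hessian(E)."""
    rng = np.random.default_rng(seed)
    w0 = 0.1 * rng.standard_normal(n_weights)
    # Any standard optimiser will do; here scipy's BFGS minimises S(w).
    w_map = minimize(S, w0, args=(X, y, alpha, beta), method="BFGS").x

    # Finite-difference Hessian of E at w_MAP.
    I = np.eye(n_weights)
    H = np.zeros((n_weights, n_weights))
    E0 = E(w_map, X, y)
    for i in range(n_weights):
        for j in range(n_weights):
            H[i, j] = (E(w_map + eps * (I[i] + I[j]), X, y)
                       - E(w_map + eps * I[i], X, y)
                       - E(w_map + eps * I[j], X, y) + E0) / eps ** 2
    A = alpha * I + beta * 0.5 * (H + H.T)   # symmetrise the numerical Hessian
    return w_map, A
```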

A useful integral

Dropping, for this slide only, the special meaning usually given to the vector x, here is a useful standard integral: if A ∈ R^{n×n} is symmetric, then for b ∈ R^n and c ∈ R

    ∫_{R^n} exp(−(1/2)(xᵀAx + xᵀb + c)) dx = (2π)^{n/2} |A|^{−1/2} exp(−(1/2)(c − (1/4)bᵀA⁻¹b))

You're not expected to know how to evaluate this, but see the handout on the course web page if you're curious. To make this easy to refer to, let's call it the BIG INTEGRAL.

Method 1: approximation to p(w | y; X)

Defining Δw = w − w_MAP, we now have an approximation

    p(w | y; X) ≈ (1/Z) exp(−S(w_MAP) − (1/2)ΔwᵀAΔw)

where, using the BIG INTEGRAL,

    Z = (2π)^{W/2} |A|^{−1/2} exp(−S(w_MAP))

and W is the number of weights. Let's plug this approximation back into the expression for the Bayes-optimum and see what we get. (No, I won't ask you to evaluate it in the exam.)

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−(1/2)ΔwᵀAΔw) dw

with the first factor being the likelihood p(Y | w; x) and the second the approximation to p(w | y; X). There is still no closed-form solution! We need another approximation.

Method 1: second approximation

We can introduce a linear approximation of h_w(x) at w_MAP:

    h_w(x) ≈ h_{w_MAP}(x) + gᵀΔw    where    g = ∇_w h_w(x)|_{w_MAP}

(By linear approximation we just mean the Taylor expansion for k = 1.) This leads to

    p(Y | y; x, X) ∝ ∫ exp(−(β/2)(Y − h_{w_MAP}(x) − gᵀΔw)² − (1/2)ΔwᵀAΔw) dw

SUCCESS!!! This integral can be evaluated (this is an Exercise) using the BIG INTEGRAL to give THE ANSWER:

    p(Y | y; x, X) ≈ (1/√(2πσ_Y²)) exp(−(Y − h_{w_MAP}(x))² / (2σ_Y²))

where

    σ_Y² = 1/β + gᵀA⁻¹g

We really are making assumptions here: this is OK if we assume that p(w | y; X) is narrow, which depends on A.
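
Continuing the same toy sketch, the predictive mean h_{w_MAP}(x) and variance σ_Y² = 1/β + gᵀA⁻¹g can be computed as below. The vector g is again obtained by finite differences rather than backpropagation, and the code reuses the assumed h_w from the earlier sketch.

```python
import numpy as np

def predict(x_new, w_map, A, beta, eps=1e-5):
    """Predictive mean and variance at a single new input x_new (1-D array)."""
    x_new = np.atleast_2d(x_new)
    mean = h_w(w_map, x_new)[0]                       # h_{w_MAP}(x)
    g = np.array([(h_w(w_map + eps * e, x_new)[0] - mean) / eps
                  for e in np.eye(len(w_map))])       # g = grad_w h_w(x) at w_MAP
    var = 1.0 / beta + g @ np.linalg.solve(A, g)      # sigma_Y^2 = 1/beta + g^T A^{-1} g
    return mean, var
```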

Method 1: final expression

Hooray! But what does it mean? This is a Gaussian density, so we can now see that:

- p(Y | y; x, X) peaks at h_{w_MAP}(x): that is, at the MAP solution.
- The variance σ_Y² can be interpreted as a measure of certainty: the first term of σ_Y², 1/β, corresponds to the noise, and the second term, gᵀA⁻¹g, corresponds to the width of p(w | y; X).
- Plotting ±σ_Y around the prediction gives a measure of certainty.

Interpreted graphically: [Figure: typical behaviour of the Bayesian solution.]

Method 2: Markov chain Monte Carlo (MCMC) methods

The second solution to the problem of performing integrals

    I = ∫ F(w) p(w | y; X) dw

is to use Monte Carlo methods. The basic approach is to make the approximation

    I ≈ (1/N) Σ_{i=1}^N F(w_i)

where the w_i have distribution p(w | y; X). Unfortunately, generating w_i with a given distribution can be non-trivial.

MCMC methods

A simple technique is to introduce a random walk, so

    w_{i+1} = w_i + ε

where ε is zero-mean spherical Gaussian with small variance. Obviously the sequence w_i does not have the required distribution. However, we can use the Metropolis algorithm, which does not accept all the steps in the random walk:

- If p(w_{i+1} | y; X) > p(w_i | y; X) then accept the step.
- Else accept the step with probability p(w_{i+1} | y; X) / p(w_i | y; X).

In practice the Metropolis algorithm has several shortcomings, and a great deal of research exists on improved methods; see: R. Neal, "Probabilistic inference using Markov chain Monte Carlo methods", University of Toronto, Department of Computer Science, Technical Report CRG-TR-93-1, 1993.
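
Here is a minimal, self-contained sketch of the random-walk Metropolis scheme just described. It works with any unnormalised log-density (for instance log p(w) = −S(w) from the earlier sketch); the step size and sample count are illustrative.

```python
import numpy as np

def metropolis(log_p, w0, n_samples=5000, step=0.05, seed=0):
    """Random-walk Metropolis: returns an array of (correlated) samples."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    lp = log_p(w)
    samples = []
    for _ in range(n_samples):
        w_prop = w + step * rng.standard_normal(w.shape)   # random-walk proposal
        lp_prop = log_p(w_prop)
        # Accept if more probable; otherwise accept with probability p(w')/p(w).
        if np.log(rng.uniform()) < lp_prop - lp:
            w, lp = w_prop, lp_prop
        samples.append(w.copy())
    return np.array(samples)

# The Monte Carlo estimate of I = E[F(w)] is then, for example:
#     samples = metropolis(lambda w: -S(w, X, y, alpha, beta), w0)
#     I_hat = np.mean([F(w) for w in samples], axis=0)
```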

A (very) brief introduction to how to learn hyperparameters

So far in our coverage of the Bayesian approach to neural networks, the hyperparameters α and β were assumed to be known and fixed. But this is not a good assumption, because α corresponds to the width of the prior and β to the noise variance. So we really want to learn these from the data as well. How can this be done? We now take a look at one of several ways of addressing this problem. Note: from now on I'm going to leave out the dependencies on x and X, as leaving them in starts to make everything cluttered.

The prior and likelihood depend on α and β respectively, so we now make this clear and write

    p(w | y, α, β) = p(y | w, β) p(w | α) / p(y | α, β)

Don't worry about recalling the actual expressions for the prior and likelihood: we're not going to delve deep enough to need them. Let's write down directly something that might be useful to know:

    p(α, β | y) = p(y | α, β) p(α, β) / p(y)

Hierarchical Bayes and the evidence

If we know p(α, β | y) then a straightforward approach is to use the values of α and β that maximise it:

    argmax_{α,β} p(α, β | y)

Here is a standard trick: assume that the prior p(α, β) is flat, so that we can just maximise p(y | α, β). This is called type II maximum likelihood and is one common way of doing the job.

The quantity p(y | α, β) is called the evidence or marginal likelihood. When we re-wrote our earlier equation for the posterior density of the weights, making α and β explicit, we found

    p(w | y, α, β) = p(y | w, β) p(w | α) / p(y | α, β)

so the evidence is the denominator in this equation. This is the common pattern, and it leads to the idea of hierarchical Bayes: the evidence for the hyperparameters at one level is the denominator in the relevant application of Bayes' theorem.

Gaussian processes

There is an alternative approach to Bayesian regression and classification. The fundamental idea is not to think in terms of weights w that specify functions; instead the idea is to deal with functions directly. Fundamental to this is the concept of a Gaussian process. We will continue to omit the dependencies on x and X to keep the notation simple.

We have seen that inference can be performed by:

- Computing the posterior density p(w | y) of the parameters given the observed labels.
- Computing the Bayes-optimal prediction

      p(Y | y) = ∫ p(Y | w) p(w | y) dw

  which is the expected value of the likelihood for a new point.
- Choosing any hyperparameters p using the evidence p(y | p).

But shouldn't we deal with functions directly, not via parameters? What happens if we deal directly with functions f, rather than choosing them via parameters? Can we change the equation for prediction to

    p(Y | y) = ∫ p(Y | f) p(f | y) df

in any sensible way? This obviously requires us to talk about probability densities over functions. That is probably not something you have ever seen before. [Figure: four samples f ~ p(f) from a probability density defined on functions.] In fact this is quite straightforward, using the concept of a Gaussian process (GP).

Gaussian processes

Definition: say we have a set of RVs {F(x)}, one for each input x. This set forms a Gaussian process if any finite subset of them is jointly Gaussian distributed. [Figure: the same four samples f ~ p(f), where F is in fact a GP; the crosses mark the values of the sampled functions at four different values of x.]

What happens when we randomly select a function that is a GP? We are only ever interested in a finite number of its values, because we only need to deal with the values at the points in the training set and at any new points we want to predict. Consequently we can use a GP as a prior, rather than having a prior p(w). Note again the key point: we are randomly selecting functions, and we can say something about their behaviour for any finite collection of arguments. And that is enough, as we only ever have finite quantities of data. Because F is a GP, any such finite set of values has a jointly Gaussian distribution.

GP priors

To specify a GP on vectors x in R^n, we just need:

- A mean function m : R^n → R, with m(x) = E_{f~F}[f(x)].
- A covariance function k : R^n × R^n → R, with k(x, x') = E_{f~F}[(f(x) − m(x))(f(x') − m(x'))].

We then write F ~ GP(m, k) to denote that F is a GP. By specifying m and k we get different kinds of function when sampling F. [Figure: sample functions drawn from a squared exponential prior, a γ-exponential prior, a rational quadratic prior and a neural network prior, for particular hyperparameter settings.]
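
To see what "sampling a function from a GP prior" means in practice, here is a small numpy sketch that draws sample functions from a zero-mean GP with a squared exponential covariance function on a grid of inputs; the lengthscale and grid are illustrative choices.

```python
import numpy as np

def k_se(x1, x2, lengthscale=0.5):
    """Squared exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

xs = np.linspace(-2.0, 2.0, 200)                 # a finite set of input points
K = k_se(xs, xs) + 1e-8 * np.eye(len(xs))        # small jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=4)
print(samples.shape)   # (4, 200): each row is one sampled function evaluated on the grid
```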

Covariance functions

Polynomial: k(x, x') = (c + xᵀx')^k
Exponential: k(x, x') = exp(−‖x − x'‖ / l)
Squared exponential: k(x, x') = exp(−‖x − x'‖² / (2l²))
Gamma exponential: k(x, x') = exp(−(‖x − x'‖ / l)^γ)
Rational quadratic: k(x, x') = (1 + ‖x − x'‖² / (2αl²))^{−α}
Neural network: k(x, x') = sin⁻¹( 2x̃ᵀΣx̃' / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃'ᵀΣx̃')) ), where x̃ᵀ = [1 xᵀ]

As usual these have associated hyperparameters, and these have to be dealt with correctly, as always.

Gaussian processes: generating data

Say we have some data y_i = f(x_i) for i = 1, ..., m, where f ~ GP(m, k). (Remember, the x_i are fixed, not RVs.) Any finite set of points must be jointly Gaussian, so

    p(y) = N(m, K)

where mᵀ = [m(x_1) ... m(x_m)] and K is the Gram matrix with K_ij = k(x_i, x_j).

Note 1: this is not p(y | f). We can completely remove the need for integration!
Note 2: from now on we will assume m(x) = 0. (It is straightforward to incorporate a non-zero mean.)

Gaussian processes: generating data with noise

Now add noise to the data. Say we add Gaussian noise, so y_i = f(x_i) + ε_i. Again i = 1, ..., m and f ~ GP(m, k), but now we also have ε_i ~ N(0, σ²). As we are adding Gaussian RVs, we have

    p(y) = N(0, K + σ²I)

BUT: in order to do prediction we actually need to involve a new point x, for which we want to predict the corresponding value Y.
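
Before turning to prediction, here are two of the covariance functions above written out as code, together with a draw of a noisy data vector with p(y) = N(0, K + σ²I). The hyperparameter values are illustrative, and these sketches assume scalar inputs.

```python
import numpy as np

def k_gamma_exp(x1, x2, l=1.0, gamma=1.5):
    """Gamma exponential: k(x, x') = exp(-(|x - x'| / l)^gamma)."""
    r = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-(r / l) ** gamma)

def k_rq(x1, x2, l=1.0, alpha=2.0):
    """Rational quadratic: k(x, x') = (1 + |x - x'|^2 / (2 alpha l^2))^(-alpha)."""
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return (1.0 + r2 / (2.0 * alpha * l ** 2)) ** (-alpha)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-2.0, 2.0, size=10))
sigma2 = 0.1
K = k_rq(x_train, x_train)
y = rng.multivariate_normal(np.zeros(len(x_train)), K + sigma2 * np.eye(len(x_train)))
print(y)   # one draw of noisy training targets with distribution N(0, K + sigma^2 I)
```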

Gaussian processes: prediction

SO: we incorporate x, for which we want to predict the corresponding value Y. By exactly the same argument,

    p(y, Y) = N(0, K')

where

    K' = [ k    kᵀ      ]
         [ k    K + σ²I ]

with kᵀ = [k(x, x_1) ... k(x, x_m)] and the scalar k = k(x, x) + σ².

Note 1: all we've done here is to expand the Gram matrix by an extra row and column to get K'.
Note 2: whether or not you include σ² in the scalar k is a matter of choice. (What difference does it make? This is an Exercise.)

Gaussian density: marginals and conditionals

For a normal RV x ~ N(µ, Σ),

    p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))

Split x so that

    x = [ x_1 ]      µ = [ µ_1 ]      Σ = [ Σ_11  Σ_12 ]
        [ x_2 ]          [ µ_2 ]          [ Σ_21  Σ_22 ]

What are p(x_1) and p(x_1 | x_2)? Define the precision matrix

    Λ = Σ⁻¹ = [ Λ_11  Λ_12 ]
              [ Λ_21  Λ_22 ]

It is possible to show that

    p(x_1) = N(µ_1, Σ_11)
    p(x_1 | x_2) = N(µ_1 − Λ_11⁻¹ Λ_12 (x_2 − µ_2), Λ_11⁻¹)

Inverting a block matrix

On the last slide we used Σ⁻¹ = Λ written in blocks. Re-writing Σ as

    Σ = [ A  B ]
        [ C  D ]

it is possible to show (it is an Exercise to do this) that

    Λ_11 = M
    Λ_12 = −M B D⁻¹
    Λ_21 = −D⁻¹ C M
    Λ_22 = D⁻¹ + D⁻¹ C M B D⁻¹

where M = (A − B D⁻¹ C)⁻¹.
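
The marginal and conditional formulas above translate directly into code. The sketch below splits a joint Gaussian into blocks and computes p(x_1) and p(x_1 | x_2) via the precision matrix, as on the slides; an equivalent form uses the Schur complement Σ_11 − Σ_12 Σ_22⁻¹ Σ_21 directly.

```python
import numpy as np

def gaussian_marginal_and_conditional(mu, Sigma, x2, d1):
    """Split N(mu, Sigma) so that x1 is the first d1 components; return the
    parameters of the marginal p(x1) and of the conditional p(x1 | x2)."""
    mu1, mu2 = mu[:d1], mu[d1:]
    Sigma11 = Sigma[:d1, :d1]

    Lam = np.linalg.inv(Sigma)                    # precision matrix
    Lam11, Lam12 = Lam[:d1, :d1], Lam[:d1, d1:]

    cond_mean = mu1 - np.linalg.solve(Lam11, Lam12 @ (x2 - mu2))
    cond_cov = np.linalg.inv(Lam11)               # equals Sigma11 - Sigma12 Sigma22^{-1} Sigma21
    return (mu1, Sigma11), (cond_mean, cond_cov)
```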

GP regression

[Figure: GP regression using the squared exponential, γ-exponential, rational quadratic and neural network priors.]

To do prediction, all that's left is to compute p(Y | y). Because everything is Gaussian this turns out to be easy:

    p(y, Y) = N(0, K')      K' = [ k  kᵀ ]      L = K + σ²I
                                 [ k  L  ]

From these we want to know p(Y | y). Only two things are needed: the inverse formula for a block matrix, and the formula for obtaining a conditional from a joint Gaussian. Using these we can show (it is an Exercise to derive this) that

    p(Y | y) = N(kᵀL⁻¹y, k − kᵀL⁻¹k)

with mean kᵀL⁻¹y and variance k − kᵀL⁻¹k.

Learning the hyperparameters

A nice side-effect of this formulation is that we get a usable expression for the marginal likelihood. If we incorporate the hyperparameters p, which in this case are any parameters associated with k(x, x') along with σ², then we've just computed

    p(Y | y, p) = p(Y, y | p) / p(y | p)

The denominator is the marginal likelihood, and we computed it above:

    p(y | p) = N(0, L) = (1/√((2π)^m |L|)) exp(−(1/2)yᵀL⁻¹y)

As usual this looks nicer if we consider its log. This is a rare beast:

    log p(y | p) = −(1/2)log|L| − (1/2)yᵀL⁻¹y − (m/2)log 2π

It's a sensible formula that tells you how good a set p of hyperparameters is. That means you can use it as an alternative to cross-validation when searching for hyperparameters. As a bonus, you can generally differentiate it, so it's possible to use gradient-based search.
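
Putting the pieces together: a sketch of GP regression prediction and of the log marginal likelihood, reusing the assumed k_se covariance function from the sampling sketch above. The noise level and lengthscale are illustrative; in practice you would search over them, or follow their gradients, using log p(y | p).

```python
import numpy as np

def gp_predict(x_train, y_train, x_new, sigma2=0.1, lengthscale=0.5):
    """Predictive mean k^T L^{-1} y and variance k - k^T L^{-1} k at a scalar x_new."""
    L = k_se(x_train, x_train, lengthscale) + sigma2 * np.eye(len(x_train))
    k_vec = k_se(x_train, np.array([x_new]), lengthscale)[:, 0]
    mean = k_vec @ np.linalg.solve(L, y_train)
    # Scalar k = k(x, x) + sigma^2, including the noise as in the slides' choice.
    k_scalar = k_se(np.array([x_new]), np.array([x_new]), lengthscale)[0, 0] + sigma2
    var = k_scalar - k_vec @ np.linalg.solve(L, k_vec)
    return mean, var

def log_marginal_likelihood(x_train, y_train, sigma2=0.1, lengthscale=0.5):
    """log p(y | p) = -(1/2) log|L| - (1/2) y^T L^{-1} y - (m/2) log 2*pi."""
    L = k_se(x_train, x_train, lengthscale) + sigma2 * np.eye(len(x_train))
    _, logdet = np.linalg.slogdet(L)
    m = len(y_train)
    return (-0.5 * logdet - 0.5 * y_train @ np.linalg.solve(L, y_train)
            - 0.5 * m * np.log(2 * np.pi))
```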

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

The maximum margin classifier. Machine Learning and Bayesian Inference. Dr Sean Holden Computer Laboratory, Room FC06

The maximum margin classifier. Machine Learning and Bayesian Inference. Dr Sean Holden Computer Laboratory, Room FC06 Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC06 Suggestion: why not drop all this probability nonsense and just do this: 2 Telephone etension 63725 Email: sbh@cl.cam.ac.uk

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

An example to illustrate frequentist and Bayesian approches

An example to illustrate frequentist and Bayesian approches Frequentist_Bayesian_Eample An eample to illustrate frequentist and Bayesian approches This is a trivial eample that illustrates the fundamentally different points of view of the frequentist and Bayesian

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information

Lecture 3: Pattern Classification. Pattern classification

Lecture 3: Pattern Classification. Pattern classification EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Bayesian Inference of Noise Levels in Regression

Bayesian Inference of Noise Levels in Regression Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop

More information

Regression with Input-Dependent Noise: A Bayesian Treatment

Regression with Input-Dependent Noise: A Bayesian Treatment Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning and Bayesian Inference

Machine Learning and Bayesian Inference Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 63725 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Copyright c Sean Holden 22-17. 1 Artificial

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail

More information

TDA231. Logistic regression

TDA231. Logistic regression TDA231 Devdatt Dubhashi dubhashi@chalmers.se Dept. of Computer Science and Engg. Chalmers University February 19, 2016 Some data 5 x2 0 5 5 0 5 x 1 In the Bayes classifier, we built a model of each class

More information

State Space and Hidden Markov Models

State Space and Hidden Markov Models State Space and Hidden Markov Models Kunsch H.R. State Space and Hidden Markov Models. ETH- Zurich Zurich; Aliaksandr Hubin Oslo 2014 Contents 1. Introduction 2. Markov Chains 3. Hidden Markov and State

More information

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25 Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs

More information

1 Rational Exponents and Radicals

1 Rational Exponents and Radicals Introductory Algebra Page 1 of 11 1 Rational Eponents and Radicals 1.1 Rules of Eponents The rules for eponents are the same as what you saw earlier. Memorize these rules if you haven t already done so.

More information

STA 414/2104, Spring 2014, Practice Problem Set #1

STA 414/2104, Spring 2014, Practice Problem Set #1 STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Lecture 1c: Gaussian Processes for Regression

Lecture 1c: Gaussian Processes for Regression Lecture c: Gaussian Processes for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Artificial Intelligence: what have we seen so far? Machine Learning and Bayesian Inference

Artificial Intelligence: what have we seen so far? Machine Learning and Bayesian Inference Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6375 Email: sbh@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh/ Artificial Intelligence: what have we seen so

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Introduction. So, why did I even bother to write this?

Introduction. So, why did I even bother to write this? Introduction This review was originally written for my Calculus I class, but it should be accessible to anyone needing a review in some basic algebra and trig topics. The review contains the occasional

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Gaussians. Hiroshi Shimodaira. January-March Continuous random variables

Gaussians. Hiroshi Shimodaira. January-March Continuous random variables Cumulative Distribution Function Gaussians Hiroshi Shimodaira January-March 9 In this chapter we introduce the basics of how to build probabilistic models of continuous-valued data, including the most

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Supervised Learning Coursework

Supervised Learning Coursework Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session

More information

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning CSC412 Probabilistic Learning & Reasoning Lecture 12: Bayesian Parameter Estimation February 27, 2006 Sam Roweis Bayesian Approach 2 The Bayesian programme (after Rev. Thomas Bayes) treats all unnown quantities

More information

Relevance Vector Machines

Relevance Vector Machines LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1 Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

CSC2515 Assignment #2

CSC2515 Assignment #2 CSC2515 Assignment #2 Due: Nov.4, 2pm at the START of class Worth: 18% Late assignments not accepted. 1 Pseudo-Bayesian Linear Regression (3%) In this question you will dabble in Bayesian statistics and

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted

More information

Rational Expressions & Equations

Rational Expressions & Equations Chapter 9 Rational Epressions & Equations Sec. 1 Simplifying Rational Epressions We simply rational epressions the same way we simplified fractions. When we first began to simplify fractions, we factored

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 7 Unsupervised Learning Statistical Perspective Probability Models Discrete & Continuous: Gaussian, Bernoulli, Multinomial Maimum Likelihood Logistic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference

More information

1 Bayesian Linear Regression (BLR)

1 Bayesian Linear Regression (BLR) Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,

More information

Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction

Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction Agathe Girard Dept. of Computing Science University of Glasgow Glasgow, UK agathe@dcs.gla.ac.uk Carl Edward Rasmussen Gatsby

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Analytic Long-Term Forecasting with Periodic Gaussian Processes

Analytic Long-Term Forecasting with Periodic Gaussian Processes Nooshin Haji Ghassemi School of Computing Blekinge Institute of Technology Sweden Marc Peter Deisenroth Department of Computing Imperial College London United Kingdom Department of Computer Science TU

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

CALCULUS I. Integrals. Paul Dawkins

CALCULUS I. Integrals. Paul Dawkins CALCULUS I Integrals Paul Dawkins Table of Contents Preface... ii Integrals... Introduction... Indefinite Integrals... Computing Indefinite Integrals... Substitution Rule for Indefinite Integrals... More

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem CS 189 Introduction to Machine Learning Spring 2018 Note 22 1 Duality As we have seen in our discussion of kernels, ridge regression can be viewed in two ways: (1) an optimization problem over the weights

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

+ y = 1 : the polynomial

+ y = 1 : the polynomial Notes on Basic Ideas of Spherical Harmonics In the representation of wavefields (solutions of the wave equation) one of the natural considerations that arise along the lines of Huygens Principle is the

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

The binary entropy function

The binary entropy function ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but

More information

Lecture Machine Learning (past summer semester) If at least one English-speaking student is present.

Lecture Machine Learning (past summer semester) If at least one English-speaking student is present. Advanced Machine Learning Lecture 1 Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwth-aachen.de/

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Computer Vision Group Prof. Daniel Cremers. 3. Regression

Computer Vision Group Prof. Daniel Cremers. 3. Regression Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk 94 Jonathan Richard Shewchuk 7 Neural Networks NEURAL NETWORKS Can do both classification & regression. [They tie together several ideas from the course: perceptrons, logistic regression, ensembles of

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information