Bayesian Inference (A Rough Guide)


1 Bayesian Inference (A Rough Guide)
Anil C. Kokaram, anil.kokaram@tcd.ie
Electrical and Electronic Engineering Dept., University of Dublin, Trinity College.
See for more information.

2 Bayesian Inference
- The best way to think about problem solving (my opinion).
- Well established in Signal Processing since the late 1980s.
- Grew with the rapid increase in the computational power of computers, which enabled diabolical Monte Carlo techniques to become practical.
- Well known in Statistics (see Rev. Bayes) and in fact just another rule of probability.
- "Bayesian Inference" is sometimes used as a euphemism for "using probability to solve your problem".
- Important texts: [1] Numerical Bayesian Inference, J. O Ruanaidh; [2] Numerical Recipes; [3] Image Analysis, Random Fields... by Wilson, which is extremely good.

3 Probability and Marbles
[Figure: box A containing blue and green marbles, and boxes B containing red and yellow marbles, with example outcomes or realisations of the random variables A, B.]
- Consider two random variables A, B, which each have two outcomes or realisations, (blue/green) and (red/yellow) respectively.
- Outcomes are generated by first selecting a marble from box A, then selecting a marble from a box B. But which box B is used depends on what colour you select from box A.

4 Probability and Marbles
[Figure: box A (40 blue, 60 green marbles) and boxes B (red and yellow marbles).]
- Probability is just a number. A probability density function expresses the probability of all outcomes of a R.V. and must obey the following equation:

    ∫ p(A) dA = 1

- In our example

    ∫ p(A) dA = p(A = blue) + p(A = green) = 0.4 + 0.6 = 1.0

- What is p(B = red | A = blue)? The probability of realising or observing a red marble from r.v. B GIVEN that a blue marble was drawn from the probability distribution for r.v. A (i.e. a blue marble was observed or realised as the outcome from r.v. A).
- p(B = red | A = blue) is the CONDITIONAL probability of observing a red marble from box B GIVEN a particular outcome from A. It is the probability of B conditioned on A.
- p(B = red, A = blue) is the JOINT probability of observing a red marble from B AND a blue marble from A.
- p(B, A) is the JOINT probability distribution of A AND B.

5 Probability and Marbles
[Figure: a table of ten example outcomes or realisations of the random variables A, B.]
- We can measure p(B = red | A = blue), or p(B = red, A = blue), by simulating the system (with Matlab say) and observing the frequencies of the various outcomes.
- In this example set of simulated outcomes, p(B = red | A = blue) = 3/5 and p(B = red, A = blue) = 3/10.
- We would need lots of example outcomes to be sure of our measurements. But we can calculate these values using the laws of probability.
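A minimal Matlab sketch of such a simulation is given below. The values p(A = blue) = 0.4 and p(B = red | A = blue) = 0.1 come from the model on the following slides; p(B = red | A = green) is not stated on the slides, so the value used here is an assumption made purely for illustration.

```matlab
% Sketch: simulate the two-box marble system and measure frequencies.
% p(A=blue) = 0.4 and p(B=red|A=blue) = 0.1 are read off the model;
% p(B=red|A=green) = 0.5 is an ASSUMED value for illustration only.
N = 1e5;                          % number of simulated draws
pA_blue    = 0.4;                 % box A: 40 blue, 60 green
pRed_blue  = 0.1;                 % p(B=red | A=blue)
pRed_green = 0.5;                 % p(B=red | A=green), assumed

A_blue = rand(N,1) < pA_blue;     % true where a blue marble came from box A
pRed   = pRed_blue*A_blue + pRed_green*(~A_blue);
B_red  = rand(N,1) < pRed;        % true where a red marble came from box B

% empirical frequencies vs the values given by the laws of probability
cond_est  = sum(B_red & A_blue) / sum(A_blue);   % approx p(B=red | A=blue) = 0.1
joint_est = sum(B_red & A_blue) / N;             % approx p(B=red , A=blue) = 0.04
fprintf('p(B=red|A=blue) ~ %.3f, p(B=red,A=blue) ~ %.3f\n', cond_est, joint_est);
```

With enough draws the measured frequencies settle close to the values calculated from the laws of probability on the next slides.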

6 Probability and Marbles
[Figure: marble boxes A and B as before.]

    p(B = red | A = blue) = 0.1    (just by reading it off from our model)

    p(B = red, A = blue) = p(B = red | A = blue) p(A = blue) = 0.1 × 0.4 = 0.04

    p(B = red | A = blue) = p(B = red, A = blue) / p(A = blue) = 0.04 / 0.4 = 0.1

This is an important equation in conditional probability:

    p(B | A) = p(B, A) / p(A) = p(A, B) / p(A)
    p(A | B) = p(A, B) / p(B) = p(B, A) / p(B)

7 Probability and Bayes Theorem
[Figure: marble boxes A and B as before.]
- Bayes' theorem turns out to be just another law of conditional probability.
- What is p(A = blue | B = red)? Hmmm... we can't just read that off easily now eh? It's sort of upside down...
- Given that you have observed that B = red, what is the probability that a blue marble was drawn from A?
- This is the kind of thing that you end up having to answer a lot in signal processing. Given that the corrupted speech signal at this point is 0.6 volts, what is the probability that the actual signal is 1.0 volts?

8 Probability and Bayes Theorem
[Figure: marble boxes A and B as before.]

    p(A = blue | B = red) = p(B = red, A = blue) / p(B = red)
                          = p(B = red | A = blue) p(A = blue) / p(B = red)

We can find p(B = red | A = blue) easily now:

    p(A = blue | B = red) = (0.1 × 0.4) / p(B = red)

This is Bayes' Theorem:

    p(B | A) = p(A | B) p(B) / p(A)    (1)

It turns the potentially tricky problem of estimating p(B | A) into the hopefully easier problem of estimating p(A | B).

9 Probability and Bayes Theorem
This is Bayes' Theorem:

    p(B | A) = p(A | B) p(B) / p(A)    (2)

- p(A | B) is the Likelihood.
- p(B) is the Prior.
- p(A) is the normalising factor, sometimes called the Evidence.
- p(B | A) is the Posterior distribution, because you are asking questions after the fact.

10 Probability and Bayes Theorem
[Figure: marble boxes A and B as before.]

    p(A = blue | B = red) = p(B = red | A = blue) p(A = blue) / p(B = red)

    p(B = red) = ∫ p(B = red, A) dA
               = p(B = red, A = blue) + p(B = red, A = green)
               = p(B = red | A = blue) p(A = blue) + p(B = red | A = green) p(A = green)

This is marginalisation, or integrating out one of the variables from a joint distribution:

    p(B) = ∫ p(B, A) dA
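The same calculation can be checked numerically; a small sketch follows. The value of p(B = red | A = green) is not given on the slides, so the number used below is an assumption chosen only to make the arithmetic concrete.

```matlab
% Sketch: Bayes' theorem with marginalisation for the marble example.
% p(A=blue) = 0.4 and p(B=red|A=blue) = 0.1 are from the slides;
% p(B=red|A=green) = 0.5 is an ASSUMED value for illustration only.
pA      = [0.4 0.6];             % p(A=blue), p(A=green)
pRed_gA = [0.1 0.5];             % p(B=red|A=blue), p(B=red|A=green) (second assumed)

pRed      = sum(pRed_gA .* pA);          % marginalisation: p(B=red)
post_blue = pRed_gA(1)*pA(1) / pRed;     % posterior p(A=blue | B=red)
fprintf('p(B=red) = %.3f, p(A=blue|B=red) = %.3f\n', pRed, post_blue);
```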

11 Line Fitting
- Want to expose all the details using a very simple example.
- Given N (time) samples of observed data y_n.
- Observation model: y_n = m n + c + e_n, with e_n ~ N(0, σ_e²), i.e. e_n is a Normally distributed (Gaussian) variable with zero mean and variance σ_e².
- Want to estimate m assuming we know c and σ_e², given the observations y_n as well as the model above.
- Remember we can assemble the data into vector form, y = m x + c + e, where x = [1 2 3 ... N]ᵀ.
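A short Matlab sketch of this observation model is below, using the parameter values quoted on the experiments slide later (c = 3, m = tan(40°), σ_e = 10); the number of samples N is an assumption, since it is not specified on the slides.

```matlab
% Sketch: simulate data from the observation model y_n = m*n + c + e_n.
N       = 100;                    % ASSUMED number of samples
n       = (1:N)';                 % the vector x = [1 2 3 ... N]'
m_true  = tan(40*pi/180);
c_true  = 3;
sigma_e = 10;

y = m_true*n + c_true + sigma_e*randn(N,1);    % observed data

plot(n, y, '.', n, m_true*n + c_true, '-');    % observations and the underlying line
xlabel('n'); ylabel('y_n'); legend('observations', 'true line');
```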

12 Line Fitting: Bayes Vs LS
Bayes: Big Picture. Choose m to maximise p(m | y, θ), where θ = [c, σ_e²]:

    p(m | y, θ) = p(y | m, θ) p(m | θ) / p(y | θ)  ∝  p(y | m, θ) p(m | θ)

Least Squares: Big Picture (for m, c). Minimise ||y − m x − c||² w.r.t. m, c. Differentiate and set to 0:

    [ Σ_k k²   Σ_k k ] [ m ]   [ Σ_k k y_k ]
    [ Σ_k k    Σ_k 1 ] [ c ] = [ Σ_k y_k   ]

- Gives you m, c in one shot.
- Doesn't give you σ_e² unless you measure Var(y − m̂ x − ĉ).
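A Matlab sketch of the least-squares one-shot solution via these normal equations (the data-generation lines simply repeat the earlier simulation sketch):

```matlab
% Sketch: least-squares estimate of m and c via the normal equations.
N = 100; k = (1:N)';
y = tan(40*pi/180)*k + 3 + 10*randn(N,1);    % simulated data as before

A   = [sum(k.^2) sum(k); sum(k) N];          % [sum k^2, sum k; sum k, sum 1]
rhs = [sum(k.*y); sum(y)];
mc  = A \ rhs;                               % solves for [m; c] in one shot
m_ls = mc(1);  c_ls = mc(2);

% sigma_e^2 is not part of the LS solution; estimate it from the residuals
sigma2_est = var(y - m_ls*k - c_ls);
```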

13 Likelihood

    p(m | y, θ) ∝ p(y | m, θ) p(m | θ),    y_n = m n + c + e_n

- The Likelihood typically arises directly from the observation model. The likelihood connects our parameters to the observations.
- Knowing m and θ, it is e_n that determines y_n, so

    p(y | m, θ) = p(e) = p(e_1, e_2, e_3, e_4, e_5, ...)

- But the noise at each sample is independent of all the other samples, so

    p(y | m, θ) = p(e_1) p(e_2) p(e_3) p(e_4) ... = Π_k p(e_k)
                = Π_k (1/√(2πσ_e²)) exp( −(y_k − m k − c)² / (2σ_e²) )
                = (2πσ_e²)^(−N/2) exp( −Σ_k (y_k − m k − c)² / (2σ_e²) )
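A small Matlab sketch that evaluates the logarithm of this likelihood for a candidate slope m may help make the formula concrete; the function name and the test values below are illustrative only.

```matlab
% Sketch: log of the likelihood above for a candidate slope m,
% given data y (column vector) and known c and sigma_e.
loglik = @(m, y, c, sigma_e) ...
    -0.5*length(y)*log(2*pi*sigma_e^2) ...
    - sum((y - m*(1:length(y))' - c).^2) / (2*sigma_e^2);

% example: compare two candidate slopes on simulated data
N = 100; k = (1:N)';
y  = tan(40*pi/180)*k + 3 + 10*randn(N,1);
L1 = loglik(0.8, y, 3, 10);                  % near the true slope
L2 = loglik(0.2, y, 3, 10);                  % far from the true slope
```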

14 Prior for m
- The Prior reflects what we know about our parameters before we observe anything. It captures our life-knowledge or intuition or bias about what we feel the answer should be.
- It is a powerful idea because Bayes' theorem is a recipe that shows us how to incorporate this information quantitatively into problem solving.
- So there are many choices for the prior, and they all depend on what you know about your problem. Three examples follow (with a short code sketch after this list).
- If you know nothing (and you never know nothing) then you may choose a UNIFORM prior, i.e. p(m) = α, say.
- If you feel m is between 0.1 and 0.4 then you may choose a box prior, p(m) = 3 (u(m − 0.1) − u(m − 0.4)).
- Maybe you know m should be near some value m̄, so p(m) = (1/√(2πσ_m²)) exp( −(m − m̄)² / (2σ_m²) ).
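A quick Matlab sketch of the three example priors above, evaluated on a grid; the values of α, m̄ and σ_m below are illustrative assumptions.

```matlab
% Sketch: the three example priors for m evaluated over a grid of m values.
m = linspace(-0.5, 1, 600)';

p_uniform = 0.5*ones(size(m));                 % flat prior, alpha = 0.5 (assumed)
p_box     = 3*((m >= 0.1) & (m <= 0.4));       % box prior between 0.1 and 0.4
m_bar = 0.25; sigma_m = 0.1;                   % ASSUMED prior mean and std
p_gauss   = exp(-(m - m_bar).^2/(2*sigma_m^2)) / sqrt(2*pi*sigma_m^2);

plot(m, p_uniform, m, p_box, m, p_gauss);
xlabel('m'); ylabel('p(m)');
legend('uniform', 'box (0.1 to 0.4)', 'Gaussian');
```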

15 Using a uniform prior for m
Let's stupidly say we know nothing: then we are just doing Maximum Likelihood (ML) estimation.

    p(m | y, θ) ∝ p(y | m, θ) p(m | θ) = p(y | m, θ) α
                ∝ (1/(2πσ_e²)^(N/2)) exp( −Σ_k (y_k − m k − c)² / (2σ_e²) )
                ∝ exp( −Σ_k (y_k − m k − c)² / (2σ_e²) )

16 Using a uniform prior for m (cont.)
Choosing m by maximising p(m | y, θ) implies choosing m to minimise the following expression:

    E(m) = Σ_k (y_k − m k − c)² / (2σ_e²)

    ∂E(m)/∂m = −Σ_k 2 (y_k − m̂ k − c) k / (2σ_e²) = 0

    ⇒  m̂ Σ_k k² = Σ_k k (y_k − c)

    ⇒  m̂ = Σ_k k (y_k − c) / Σ_k k²
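The closed-form ML estimate is one line of Matlab once the data are assembled into a vector (the data-generation lines repeat the earlier simulation sketch):

```matlab
% Sketch: closed-form ML estimate of the slope m with c and sigma_e known.
N = 100; k = (1:N)'; c = 3;
y = tan(40*pi/180)*k + c + 10*randn(N,1);    % simulated data as before

m_hat_ml = sum(k.*(y - c)) / sum(k.^2);      % m_hat = sum_k k(y_k - c) / sum_k k^2
```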

17 Using a pulse prior for m
Now let's say we know the range of values for m, i.e. a box prior... Maximum A-Posteriori (MAP) Estimation.

    p(m | y, θ) ∝ p(y | m, θ) p(m | θ) = p(y | m, θ) f(m)
                ∝ exp( −Σ_k (y_k − m k − c)² / (2σ_e²) )    for m in [0.1, 0.4]

Choosing m by maximising p(m | y, θ) implies choosing m to minimise the following expression:

    E(m) = Σ_k (y_k − m k − c)² / (2σ_e²)    for m in [0.1, 0.4]

- Tricky to impose that constraint in a closed form solution. So instead how about just exhaustive search? Choose m = 0.1 : 0.001 : 0.4 say, and pick the m that gives the smallest E(m).
- Note how computationally expensive it is to evaluate the likelihood. This is because you typically have a lot of data, and hence this is usually the main drain in solutions of this kind.
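A Matlab sketch of the exhaustive search is below; the true slope is set to tan(20°) here, an assumption made purely so that it lies inside the box prior.

```matlab
% Sketch: MAP estimate of m with a box prior, by exhaustive search over a grid.
N = 100; k = (1:N)'; c = 3; sigma_e = 10;
y = tan(20*pi/180)*k + c + sigma_e*randn(N,1);   % simulated data (slope inside the box)

m_grid = 0.1:0.001:0.4;                % the support of the box prior
E = zeros(size(m_grid));
for i = 1:length(m_grid)
    E(i) = sum((y - m_grid(i)*k - c).^2) / (2*sigma_e^2);   % E(m) from this slide
end
[~, idx] = min(E);
m_hat_map = m_grid(idx);               % the m giving the smallest E(m)
```

Note that the expensive part is exactly the likelihood evaluation inside the loop, once per grid point.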

18 Using a Gaussian Prior for m
Now let's say we already roughly know what m is, up to some variance σ_m²: Maximum A-Posteriori (MAP) Estimation.

    p(m | y, θ) ∝ p(y | m, θ) p(m | θ)
                = p(y | m, θ) (1/√(2πσ_m²)) exp( −(m − m̄)² / (2σ_m²) )
                ∝ exp( −Σ_k (y_k − m k − c)² / (2σ_e²) ) exp( −(m − m̄)² / (2σ_m²) )
                ∝ exp( −[ Σ_k (y_k − m k − c)² / (2σ_e²) + (m − m̄)² / (2σ_m²) ] )

Choosing m by maximising p(m | y, θ) implies choosing m to minimise the following expression (dropping the common factor 1/(2σ_e²σ_m²)):

    E(m) = σ_m² Σ_k (y_k − m k − c)² + σ_e² (m − m̄)²

19 Using a Gaussian Prior for m
Differentiating and solving...

    E(m) = σ_m² Σ_k (y_k − m k − c)² + σ_e² (m − m̄)²

    ∂E(m)/∂m = −Σ_k 2 σ_m² (y_k − m̂ k − c) k + 2 σ_e² (m̂ − m̄) = 0

    ⇒  m̂ [ σ_m² Σ_k k² + σ_e² ] = σ_m² Σ_k k (y_k − c) + σ_e² m̄

    ⇒  m̂ = [ σ_m² Σ_k k (y_k − c) + σ_e² m̄ ] / [ σ_m² Σ_k k² + σ_e² ]
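A Matlab sketch of this closed-form MAP estimate; the prior mean m̄ and standard deviation σ_m used below are illustrative assumptions.

```matlab
% Sketch: closed-form MAP estimate of m with a Gaussian prior N(m_bar, sigma_m^2).
N = 100; k = (1:N)'; c = 3; sigma_e = 10;
y = tan(40*pi/180)*k + c + sigma_e*randn(N,1);   % simulated data as before
m_bar = 0.8; sigma_m = 0.2;                      % ASSUMED prior mean and std

m_hat_map = (sigma_m^2*sum(k.*(y - c)) + sigma_e^2*m_bar) / ...
            (sigma_m^2*sum(k.^2)       + sigma_e^2);
```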

20 Some experiments
- Matlab code available.
- The experiment uses c = 3, m = tan(40°), σ_e = 10.0.
- Can set what you like for the priors and see the effect of the priors on the solution.

Exercises:
- Derive an expression for p(c | m, θ) using a Gaussian prior on c with mean c̄ and variance σ_c². Derive the closed form solution for the estimate of c given the other parameters by maximising the p(c | m, θ) that you have derived.
- Write Matlab code that simulates a corrupted signal and calculates the estimate for c using c = 3, m = tan(20°), σ_e = 10.0, σ_c = 3, c̄ = 3.
- Generate many realisations of your estimate ĉ by using many realisations of the noise e_k and solving for the parameter c. Hence generate a histogram of your estimates and show how close that distribution is to the p(c | m, θ) that you have derived.
- Using a Laplacian prior on c, i.e. p(c) = (1/Z) exp(−k|c|), derive an expression for p(c | m, θ).
- The noise in the model is now Laplacian, with p(e_n) = (1/Z) exp(−k|e_n|); derive an expression for p(c | m, θ) using the same prior as above.
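One possible Matlab sketch of the simulation part of these exercises is below. The closed-form estimate of c used inside the loop is written by analogy with the MAP estimate of m above, not from the derivation the exercise asks for, and the number of samples N is an assumption.

```matlab
% Sketch: many noise realisations, estimate c each time with a Gaussian prior
% N(c_bar, sigma_c^2), then histogram the estimates.
N = 100; n = (1:N)';                  % ASSUMED number of samples
m = tan(20*pi/180); sigma_e = 10; c_true = 3;
c_bar = 3; sigma_c = 3;               % prior on c, from the exercise

R = 5000;                             % number of noise realisations
c_hat = zeros(R,1);
for r = 1:R
    y = m*n + c_true + sigma_e*randn(N,1);
    % MAP estimate of c (Gaussian prior), written by analogy with the estimate of m:
    c_hat(r) = (sigma_c^2*sum(y - m*n) + sigma_e^2*c_bar) / ...
               (sigma_c^2*N            + sigma_e^2);
end
hist(c_hat, 50);                      % histogram of the estimates
xlabel('c estimate'); ylabel('count');
```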

21 Final Comments
- Bayesian inference is only really useful if you are using a non-uniform prior.
- The success of your solution always depends on your models and priors. Bayes cannot help you if you make bad choices.
- If you are making only Gaussian choices for your distributions then you might just be doing the same thing as a straightforward least squares / Lagrange multiplier approach.
- In pictures, a Laplacian prior for the noise (like DFDs and wavelet coefficients) is almost always better, but inevitably very difficult to manipulate.
- We need to look at marginalisation, sampling, MCMC and priors suitable for 2D next.
