Introduction to Bayesian thinking


1 Introduction to Bayesian thinking. Statistics seminar, Rodrigo Díaz, Geneva Observatory, April 11th, 2016.

2 Agenda (I) Part I. Basics. General concepts & history of Bayesian statistics. Bayesian inference. Part II. Likelihoods. Constructing likelihoods. Part III. Priors. General concepts and problems. Subjective vs objective priors. Assigning priors.

3 Agenda (II) Part IV. Posterior. Conjugate priors. The right prior for a given likelihood. MCMC. Part V. Model comparison. Model evidence and Bayes factor. The philosophy of integrating over parameter space. Estimating the model evidence. Dos and don'ts. Point estimate: BIC. Importance sampling and a family of methods. Nested sampling.

4 Part I basics

5 Statistical inference. [Figure adapted from Gregory (2005): a testable hypothesis (theory) makes predictions via deductive inference; observations (data) feed statistical inference (hypothesis testing, parameter estimation), which updates the hypothesis.]

6 Statistical inference requires a probability theory: frequentist or Bayesian.

7 Thomas Bayes. First appearance of the product rule (the basis for Bayes' theorem; An Essay towards solving a Problem in the Doctrine of Chances):
$p(H_i \mid I, D) = \frac{p(D \mid H_i, I)\, p(H_i \mid I)}{p(D \mid I)}$
Pierre-Simon Laplace. Wide application of Bayes' rule. Principle of insufficient reason (non-informative priors). Primitive version of the Bernstein-von Mises theorem. Laplace's inverse probability is largely rejected for ~100 years. The reign of frequentist probability: Fisher, Pearson, etc.

8 Harold Jeffreys. Objective Bayesian probability revived. Jeffreys rule for priors. (1940s onwards) R. T. Cox, George Pólya, E. T. Jaynes: plausible reasoning, reasoning with uncertainty. Probability theory as an extension of Aristotelian logic. The product and sum rules deduced from basic principles. MAXENT priors. See E. T. Jaynes, Probability Theory: The Logic of Science.

9 Statistical inference. [Figure adapted from Gregory (2005): a testable hypothesis (theory) makes predictions via deductive inference; observations (data) feed statistical inference (hypothesis testing, parameter estimation), which updates the hypothesis.]

10 The three desiderata. Rules of deductive reasoning: strong syllogisms from Aristotelian logic. The brain works using plausible inference (the woman-in-the-wood-cabin story). The rules for plausible reasoning are deduced from three simple desiderata on how the mind of a thinking robot should work (Cox-Pólya-Jaynes). I. Degrees of plausibility are represented by real numbers. II. Qualitative correspondence with common sense. III. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.

11 "...if degrees of plausibility are represented by real numbers, then there is a uniquely determined set of quantitative rules for conducting inference. That is, any other rules whose results conflict with them will necessarily violate an elementary, and nearly inescapable, desideratum of rationality or consistency. But the final result was just the standard rules of probability theory, given already by Bernoulli and Laplace, so why all the fuss? The important new feature was that these rules were now seen as uniquely valid principles of logic in general, making no reference to 'chance' or 'random variables'; so their range of application is vastly greater than had been supposed in the conventional probability theory that was developed in the early 20th century. As a result, the imaginary distinction between 'probability theory' and 'statistical inference' disappears, and the field achieves not only logical unity and simplicity, but far greater technical power and flexibility in applications."

12 The basic rules of probability theory.
Sum: $p(A + B) = p(A) + p(B) - p(AB)$
Product: $p(AB) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$
Once these rules are recognised as general ways of manipulating any proposition, a world of applications opens up. "Aristotelian deductive logic is the limiting form of our rules for plausible reasoning." - E. T. Jaynes

13 "...if degrees of plausibility are represented by real numbers, then there is a uniquely determined set of quantitative rules for conducting inference. That is, any other rules whose results conflict with them will necessarily violate an elementary, and nearly inescapable, desideratum of rationality or consistency. But the final result was just the standard rules of probability theory, given already by Bernoulli and Laplace, so why all the fuss? The important new feature was that these rules were now seen as uniquely valid principles of logic in general, making no reference to 'chance' or 'random variables'; so their range of application is vastly greater than had been supposed in the conventional probability theory that was developed in the early 20th century. As a result, the imaginary distinction between 'probability theory' and 'statistical inference' disappears, and the field achieves not only logical unity and simplicity, but far greater technical power and flexibility in applications."

14 The Bayesian recipe.
Sum: $p(A + B) = p(A) + p(B) - p(AB)$
Product: $p(AB) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$
Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$

15 The Bayesian recipe.
Law of total probability: $p(A \mid I) = \sum_i p(A \mid B_i, I)\,p(B_i \mid I)$, for $\{B_i\}$ exclusive and exhaustive.

16 The Bayesian recipe.
Sum: $p(A + B) = p(A) + p(B) - p(AB)$
Product: $p(AB) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$
Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$
Law of total probability: $p(A \mid I) = \sum_i p(A \mid B_i, I)\,p(B_i \mid I)$, for $\{B_i\}$ exclusive and exhaustive.

17 The Bayesian recipe.
Sum: $p(A + B) = p(A) + p(B) - p(AB)$
Product: $p(AB) = p(A \mid B)\,p(B) = p(B \mid A)\,p(A)$
Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$
Law of total probability: $p(A \mid I) = \sum_i p(A \mid B_i, I)\,p(B_i \mid I)$, for $\{B_i\}$ exclusive and exhaustive.
Normalisation: $\sum_i p(B_i \mid I) = 1$, for $\{B_i\}$ exclusive and exhaustive.
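
As a concrete illustration of how the sum and product rules combine into Bayes' theorem, here is a minimal sketch for a discrete hypothesis space (the numbers are invented for illustration, not taken from the talk):

```python
# Hedged sketch: Bayes' theorem on a discrete hypothesis space with made-up numbers.
prior = {"H1": 0.7, "H2": 0.2, "H3": 0.1}          # p(H_i | I), sums to 1 (normalisation)
likelihood = {"H1": 0.05, "H2": 0.60, "H3": 0.90}  # p(D | H_i, I)

# Law of total probability: p(D | I) = sum_i p(D | H_i, I) p(H_i | I)
p_D = sum(likelihood[h] * prior[h] for h in prior)

# Bayes' theorem: p(H_i | I, D) = p(D | H_i, I) p(H_i | I) / p(D | I)
posterior = {h: likelihood[h] * prior[h] / p_D for h in prior}
print(posterior)  # the posteriors again sum to 1
```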

18 "Bayes", "Bayesian", MCMC

19

20 Two basic tasks of statistical inference: the learning process (parameter estimation) and decision making (model comparison).

21 Learning process. Bayesian probability represents a state of knowledge. Notation: $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data.
Continuous parameter space: prior $p(\theta \mid H_i, I)$, posterior $p(\theta \mid D, H_i, I)$.
Discrete space (hypothesis space): prior $p(H_i \mid I)$, posterior $p(H_i \mid I, D)$.

22 Learning process. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. Enter the likelihood function:
$p(\theta \mid H_i, I, D) = \frac{p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)}{p(D \mid H_i, I)}$
Posterior: $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$, i.e. likelihood times prior.
The proportionality constant has many names: marginal likelihood, global likelihood, model evidence, prior predictive. Hard to compute. Important.
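
A minimal numerical sketch of "posterior proportional to likelihood times prior" for a single continuous parameter; the Gaussian measurement model, uniform prior and grid below are assumptions made for illustration, not the example used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.3, scale=0.5, size=20)   # simulated measurements of an unknown mean
theta = np.linspace(-5.0, 5.0, 2001)             # parameter grid

log_prior = np.full_like(theta, -np.log(10.0))   # uniform prior on [-5, 5]
log_like = np.array([np.sum(-0.5 * (data - t) ** 2 / 0.5 ** 2) for t in theta])

log_post = log_like + log_prior                  # unnormalised log posterior
log_post -= log_post.max()                       # avoid underflow before exponentiating
post = np.exp(log_post)
post /= np.trapz(post, theta)                    # normalise on the grid

print("posterior mean:", np.trapz(theta * post, theta))
```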

23 Optimising the learning process. The likelihood needs to be selective for the learning process to be effective. [Figure from P. Gregory (2005): two panels showing the prior $p(H_0 \mid M_1, I)$, the likelihood $p(D \mid H_0, M_1, I)$ and the posterior $p(H_0 \mid D, M_1, I)$ versus the parameter $H_0$; in one panel the posterior is dominated by prior information, in the other by the data.]

24 Decision making. Bayes' theorem is also the basis for model comparison:
$p(H_i \mid I, D) = \frac{p(D \mid H_i, I)\, p(H_i \mid I)}{p(D \mid I)}$,
but now $p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d^n\theta$.
Computation of the evidence cannot be escaped.

25 Decision making. Model comparison consists in computing the ratio of the posteriors (odds ratio) of two competing hypotheses:
$\frac{p(H_i \mid D, I)}{p(H_j \mid D, I)} = \underbrace{\frac{p(D \mid H_i, I)}{p(D \mid H_j, I)}}_{\text{Bayes factor}} \times \underbrace{\frac{p(H_i \mid I)}{p(H_j \mid I)}}_{\text{prior odds}}$

26 Part II likelihoods

27 The likelihood function. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data.
$p(\theta \mid H_i, I, D) = \frac{p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)}{p(D \mid H_i, I)}$; posterior $\propto \mathcal{L}(\theta) \times$ prior.
The likelihood is the probability of obtaining the data D, for a given prior information I and a set of parameters θ. Remember, the likelihood is not a probability distribution for the parameter vector θ (for that you need the prior).

28 Optimising the learning process. The likelihood needs to be selective for the learning process to be effective. [Figure from P. Gregory (2005): two panels showing the prior $p(H_0 \mid M_1, I)$, the likelihood $p(D \mid H_0, M_1, I)$ and the posterior $p(H_0 \mid D, M_1, I)$ versus the parameter $H_0$; in one panel the posterior is dominated by prior information, in the other by the data.]

29 Likelihood function. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. Ingredients: physical model (analytic model, simulations); statistical (non-deterministic) model (unknown errors / jitter, instrument systematics, complex physics such as activity, ...); error statistics (covariances, non-Gaussianity).
$p(D \mid \theta, H_i, I) = \mathcal{L}(\theta) \overset{\text{indep., Gauss.}}{\propto} \exp(-\chi^2/2)$

30 Constructing the likelihood. The data: $D = D_1, D_2, \ldots, D_n = \{D_i\}$, where $D_i$: the i-th measurement is in the infinitesimal range $y_i$ to $y_i + dy_i$.
The errors: $E_i$: the i-th error is in the infinitesimal range $e_i$ to $e_i + de_i$; $p(E_i \mid \theta, H, I) = f_E(e_i)$, the probability distribution of statement $E_i$. Most used $f_E$: $f_E(e_i) = N(0, \sigma_i^2)$.
The model: $M_i$: the i-th model value is in the infinitesimal range $m_i$ to $m_i + dm_i$; $p(M_i \mid \theta, H, I) = f_M(m_i)$, the probability distribution of statement $M_i$.

31 Constructing the likelihood. The data: $D = D_1, D_2, \ldots, D_n = \{D_i\}$. We want to build the probability distribution $p(D \mid \theta, H, I) = p(D_1, D_2, \ldots, D_n \mid \theta, H, I)$. Write $y_i = m_i + e_i$. It can be shown that
$p(D_i \mid \theta, H, I) = \int dm_i\, f_M(m_i)\, f_E(y_i - m_i)$ (the convolution equation).

32 Constructing the likelihood. $p(D_i \mid \theta, H, I) = \int dm_i\, f_M(m_i)\, f_E(y_i - m_i)$. But for a deterministic model, $m_i$ is obtained from a (usually analytic) function f without any uncertainty (say, a Keplerian curve for RV measurements): $m_i = f(x_i)$, so $f_M(m_i) = \delta(m_i - f(x_i))$. Then
$p(D_i \mid \theta, H, I) = \int dm_i\, \delta(m_i - f(x_i))\, f_E(y_i - m_i) = f_E(y_i - f(x_i)) = p(E_i \mid \theta, H, I)$,
and $p(D \mid \theta, H, I) = p(D_1, D_2, \ldots, D_n \mid \theta, H, I) = p(E_1, E_2, \ldots, E_n \mid \theta, H, I)$.

33 Constructing the likelihood. $p(D \mid \theta, H, I) = p(D_1, \ldots, D_n \mid \theta, H, I) = p(E_1, \ldots, E_n \mid \theta, H, I)$. Now, for independent errors,
$p(D \mid \theta, H, I) = p(E_1, \ldots, E_n \mid \theta, H, I) = p(E_1 \mid \theta, H, I) \cdots p(E_n \mid \theta, H, I) = \prod_{i=1}^{n} p(E_i \mid \theta, H, I)$.
And if, in addition, the errors are distributed normally:
$p(D \mid \theta, H, I) \overset{\text{indep., Gauss.}}{\propto} \exp(-\chi^2/2)$, with $\chi^2 = \sum_i (y_i - f(x_i))^2 / \sigma_i^2$.
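
A hedged sketch of this independent-Gaussian (chi-square) likelihood; the linear model f and the toy data are illustrative assumptions, not the Keplerian radial-velocity model mentioned in the talk:

```python
import numpy as np

def chi2(params, x, y, sigma):
    """chi^2 = sum_i (y_i - f(x_i))^2 / sigma_i^2 for a toy linear model f(x) = a*x + b."""
    a, b = params
    return np.sum((y - (a * x + b)) ** 2 / sigma ** 2)

def log_likelihood(params, x, y, sigma):
    """ln p(D | theta, H, I) for independent Gaussian errors."""
    norm = np.sum(np.log(sigma)) + 0.5 * y.size * np.log(2.0 * np.pi)
    return -0.5 * chi2(params, x, y, sigma) - norm

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
sigma = np.full_like(x, 0.3)
y = 2.0 * x + 1.0 + sigma * rng.normal(size=x.size)  # simulated data
print(log_likelihood((2.0, 1.0), x, y, sigma))
```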

34 Constructing the likelihood. Back to the convolution equation: $p(D_i \mid \theta, H, I) = \int dm_i\, f_M(m_i)\, f_E(y_i - m_i)$. For a non-deterministic model, $M_i$ is distributed: $M_i$: the i-th model value is in the infinitesimal range $m_i$ to $m_i + dm_i$, with $p(M_i \mid \theta, H, I) = f_M(m_i)$, the probability distribution of statement $M_i$. E.g. adding instrumental error / resolution: $f_M(m_i) = N(f(x_i), \sigma_{\mathrm{inst}}^2)$.

35 Part III priors

36 xkcd.com

37 Prior probabilities $p(H_i \mid I)$. $H_i$: hypothesis (can be continuous); $I$: information. Prior information I is always present: the term "prior" does not necessarily mean earlier in time. Philosophical controversy on how to assign priors: subjective vs. objective views. No single universal rule, but a few accepted methods. Informative priors: usually based on the output from previous observations (what was the prior of the first analysis?). Ignorance priors: required as a starting point for the theory.

38 Assigning ignorance priors. 1. Principle of indifference. Given n mutually exclusive, exhaustive hypotheses $\{H_i\}$, with $i = 1, \ldots, n$, the PoI states: $p(H_i \mid I) = 1/n$.

39 Assigning ignorance priors. 2. Transformation groups: location and scale parameters. For certain types of parameters (location and scale), total ignorance can be represented as invariance under a certain (group of) transformation. Location: position of the highest tree along a river. The problem must be invariant under a translation $X' = X + c$:
$p(X \mid I)\,dX = p(X' \mid I)\,dX' = p(X' \mid I)\,d(X + c) = p(X' \mid I)\,dX \;\Rightarrow\; p(X \mid I) = \text{constant}$. Uniform prior.

40 Assigning ignorance priors. 2. Transformation groups: location and scale parameters. For certain types of parameters (location and scale), total ignorance can be represented as invariance under a certain (group of) transformation. Scale: lifetime of a new bacteria, or a Poisson rate. The problem must be invariant under scaling $X' = aX$:
$p(X \mid I)\,dX = p(X' \mid I)\,dX' = p(X' \mid I)\,d(aX) = a\,p(X' \mid I)\,dX \;\Rightarrow\; p(X \mid I) \propto 1/X$. Jeffreys prior.

41 Assigning ignorance priors. 3. Jeffreys rule. Besides location and scale parameters, little more can be said using transformation invariance. Jeffreys priors use the Fisher information; they are parameterisation invariant, but show strange behaviour in many dimensions. Observed Fisher information: $I_D(\theta) = -\frac{d^2 \log L_D(\theta)}{d\theta^2}$. But D is not known when we have to define a prior; use the expectation value over D: $I(\theta) = -\mathrm{E}_D\!\left[\frac{d^2 \log L_D(\theta)}{d\theta^2}\right]$.

42 Assigning ignorance priors. 3. Jeffreys rule says: $p(\theta \mid I) \propto \sqrt{I(\theta)}$. Examples: mean μ of a Normal distribution with known variance σ²: $p(\mu \mid \sigma^2, I) \propto \text{constant}$. Rate λ of a Poisson distribution: $p(\lambda \mid I) \propto 1/\sqrt{\lambda}$. Exercise: scale of a Normal with known mean value?
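
A worked sketch of the exercise (a standard result, not spelled out on the slide): for a Normal likelihood with known mean μ and unknown scale σ,

```latex
% Jeffreys prior for the scale sigma of a Normal with known mean mu.
\log L_D(\sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2 + \mathrm{const},
\qquad
\frac{d^2 \log L_D}{d\sigma^2} = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2 .
% Taking the expectation over the data, E[(y_i - mu)^2] = sigma^2, so
I(\sigma) = -\mathrm{E}_D\!\left[\frac{d^2 \log L_D}{d\sigma^2}\right]
          = -\frac{n}{\sigma^2} + \frac{3n}{\sigma^2} = \frac{2n}{\sigma^2}
\quad\Longrightarrow\quad
p(\sigma \mid I) \propto \sqrt{I(\sigma)} \propto \frac{1}{\sigma},
```

which agrees with the scale-invariance (transformation-group) argument above.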

43 Assigning ignorance priors. 3. Jeffreys rule: $p(\theta \mid I) \propto \sqrt{I(\theta)}$, with $I(\theta) = -\mathrm{E}_D\!\left[\frac{d^2 \log L_D(\theta)}{d\theta^2}\right]$. Favours parts of parameter space where the data provide more information. Is invariant under reparametrisation. Works fine only in one dimension. See more examples here: en.wikipedia.org/wiki/Jeffreys_prior

44 Assigning ignorance priors. 4. Maximum entropy. Uses information theory to define a measure of the information content of a distribution: $S = -\sum_{i} p_i \log(p_i)$. Maximise the entropy subject to constraints (use Lagrange multipliers). Defined strictly for discrete distributions.
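
A hedged numerical sketch of a MAXENT prior: the distribution over the faces of a die constrained to have a prescribed mean (Jaynes's dice example; the mean value 4.5 and the optimiser settings are illustrative assumptions, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)                       # discrete hypothesis space: die faces 1..6

def neg_entropy(p):
    return np.sum(p * np.log(p))              # minimise -S = sum_i p_i log p_i

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},          # normalisation
    {"type": "eq", "fun": lambda p: np.sum(p * faces) - 4.5},  # mean constraint
]
p0 = np.full(6, 1.0 / 6.0)                    # start from the uniform (unconstrained MAXENT) solution
res = minimize(neg_entropy, p0, bounds=[(1e-10, 1.0)] * 6, constraints=constraints)
print(res.x)                                  # MAXENT probabilities, skewed towards high faces
```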

45 Assigning ignorance priors. 5. Reference priors. Recent development in the field (Bernardo; Bernardo, Berger, & Sun). Similar to MAXENT, but maximise the Kullback-Leibler divergence between prior and posterior: priors which maximise the expected difference with the posterior. The rigorous definition is complicated and subtle. In 1-D, reference priors are the Jeffreys priors. In multiple dimensions, reference priors behave better than Jeffreys priors.

46 Agenda (II) Part IV. Posterior. Conjugate priors. The right prior for a given likelihood. MCMC. Part V. Model comparison. Model evidence and Bayes factor. The philosophy of integrating over parameter space. Estimating the model evidence. Dos and don'ts. BIC, Chib & Jeliazkov, etc. Importance sampling and a family of methods.

47 Part IV sampling the posterior / MCMC

48 Sampling from the posterior. $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$. The posterior distribution is proportional to the likelihood times the prior. The normalising constant (called model evidence, marginal likelihood, etc.) is of importance when comparing different models. The posterior contains all the information on a given model a Bayesian statistician can get for a given set of priors and data. The posterior is only analytical in a few cases: conjugate priors. Other methods are needed to sample from the posterior. Remember, most Bayesian computations can be reduced to expectation values with respect to the posterior.

49 Markov Chain Monte Carlo. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$. Metropolis-Hastings: 1. Compute $\mathcal{L}(\theta_0)\, p(\theta_0 \mid I)$ at the current point $\theta_0$. 2. Create a proposal point $\theta^*$ from the proposal distribution $q(\theta_0, \theta^*)$. 3. Compute $\mathcal{L}(\theta^*)\, p(\theta^* \mid I)$. 4. Compute the ratio $r = \frac{\mathcal{L}(\theta^*)\, p(\theta^* \mid I)}{\mathcal{L}(\theta_0)\, p(\theta_0 \mid I)}$.

50 Markov Chain Monte Carlo. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$. Metropolis-Hastings: 1. Compute $\mathcal{L}(\theta_0)\, p(\theta_0 \mid I)$ at the current point $\theta_0$. 2. Create a proposal point $\theta^*$ from $q(\theta_0, \theta^*)$. 3. Compute $\mathcal{L}(\theta^*)\, p(\theta^* \mid I)$. 4. Compute the ratio $r = \frac{\mathcal{L}(\theta^*)\, p(\theta^* \mid I)}{\mathcal{L}(\theta_0)\, p(\theta_0 \mid I)}$. 5. Accept the proposal with probability min(1, r).

51 Markov Chain Monte Carlo. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$. Metropolis-Hastings (continued): steps 1-4 as before, then 5. Accept the proposal with probability min(1, r); if accepted, the chain moves to $\theta^*$, otherwise it stays at $\theta_0$.

52 Markov Chain Monte Carlo. $\theta$: parameter vector; $H_i$: hypothesis; $I$: information; $D$: data. $p(\theta \mid D, H_i, I) \propto \mathcal{L}(\theta)\, p(\theta \mid H_i, I)$. Metropolis-Hastings with proposal $q(\theta_0, \theta^*)$. Algorithms: Metropolis-Hastings, Gibbs sampling, slice sampling, Hybrid Monte Carlo. Codes: pymc, emcee, kombine, cobmcmc.
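
A minimal Metropolis-Hastings sketch in Python; the Gaussian likelihood, uniform prior, step size and chain length are illustrative assumptions (a real analysis would rather use one of the codes listed above):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=1.3, scale=0.5, size=50)        # simulated data set

def log_posterior(theta):
    """Unnormalised log posterior: Gaussian likelihood times a uniform prior on [-10, 10]."""
    if not -10.0 < theta < 10.0:
        return -np.inf                                 # outside the prior support
    return -0.5 * np.sum((data - theta) ** 2 / 0.5 ** 2)

n_steps, step_size = 10_000, 0.2
chain = np.empty(n_steps)
theta0 = 0.0                                           # starting point
logp0 = log_posterior(theta0)

for i in range(n_steps):
    theta_star = theta0 + step_size * rng.normal()     # symmetric Gaussian proposal q
    logp_star = log_posterior(theta_star)
    # accept with probability min(1, r), r = posterior(theta*) / posterior(theta0)
    if np.log(rng.uniform()) < logp_star - logp0:
        theta0, logp0 = theta_star, logp_star
    chain[i] = theta0

print(chain[2000:].mean(), chain[2000:].std())         # posterior mean and width after burn-in
```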

53 5 minutes to LEARN, a lifetime to MASTER.

54 But beware: non-convergence, bad mixing. The dark side of MCMC are they. Problems with correlations and multi-modal distributions.

55 The problem with correlations. If parameters exhibit correlations, then the step size must be small to reach the demanded fraction of accepted jumps. [Figure: random-walk proposals in a correlated 2-D posterior, showing rejected and accepted proposals.] One then needs a very long chain to explore the entire posterior. Or, more relevantly, the entire posterior will not be explored thoroughly (i.e. reduced error bars!).

56 MCMC: the Good, the Bad, and the Ugly. Visual inspection of traces. [Figure: example traces showing good mixing, marginal mixing, and no convergence.]

57 MCMC: the Good, the Bad, and the Ugly Evaluation of the chain auto-correlations Image credit: E. Ford
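
A hedged sketch of how the chain autocorrelation can be evaluated; the AR(1) toy chain stands in for a real MCMC trace, and emcee/pymc provide more careful estimators:

```python
import numpy as np

def autocorrelation(chain, max_lag=200):
    x = chain - chain.mean()
    acf = np.array([np.sum(x[: x.size - k] * x[k:]) for k in range(max_lag)])
    return acf / acf[0]                                # normalised so that rho(0) = 1

def integrated_autocorr_time(chain, max_lag=200):
    rho = autocorrelation(chain, max_lag)
    return 1.0 + 2.0 * np.sum(rho[1:])                 # crude estimate, no windowing

rng = np.random.default_rng(0)
chain = np.empty(5000)
chain[0] = 0.0
for i in range(1, chain.size):                         # AR(1) toy chain, true tau = (1+0.9)/(1-0.9) = 19
    chain[i] = 0.9 * chain[i - 1] + rng.normal()
print(integrated_autocorr_time(chain))
```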

58 Multi-modal posteriors. Difficult for the sampler to move across modes if they are separated by a low-probability barrier. Image credit: E. Ford

59 Multi-modal posteriors. Run as many chains as possible starting from significantly different places in the prior space. Be paranoid! You can always be missing modes.

60

61 Part V model comparison

62 Decision making. Bayes' theorem is also the basis for model comparison:
$p(H_i \mid I, D) = \frac{p(D \mid H_i, I)\, p(H_i \mid I)}{p(D \mid I)}$,
but now $p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d^n\theta$.
Computation of the model evidence (a.k.a. marginal likelihood, global likelihood, prior predictive, ...) cannot be escaped.

63 Decision making. Model comparison consists in computing the ratio of the posteriors (odds ratio) of two competing hypotheses:
$\frac{p(H_i \mid D, I)}{p(H_j \mid D, I)} = \underbrace{\frac{p(D \mid H_i, I)}{p(D \mid H_j, I)}}_{\text{Bayes factor}} \times \underbrace{\frac{p(H_i \mid I)}{p(H_j \mid I)}}_{\text{prior odds}}$

64 A built-in Occam's razor. The Bayes factor naturally punishes models with more parameters:
$p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d\theta$.
E.g.: model $M_0$, without free parameters; model $M_1$, with one parameter θ, such that $M_0 = M_1(\theta = \theta_0)$. Then $p(D \mid M_0, I) = p(D \mid \theta_0, M_1, I)$.
Write $\mathcal{L}(\theta) = p(D \mid \theta, M_1, I)$, and take a flat prior $p(\theta \mid M_1, I) = 1/\Delta\theta$ over a range $\Delta\theta$:
$p(D \mid M_1, I) = \int p(D \mid \theta, M_1, I)\, p(\theta \mid M_1, I)\, d\theta = \frac{1}{\Delta\theta} \int p(D \mid \theta, M_1, I)\, d\theta \approx \frac{\delta\theta}{\Delta\theta}\, p(D \mid \hat{\theta}, M_1, I)$,
where $\hat{\theta}$ is the best-fit value and $\delta\theta$ the width of the likelihood peak. Hence
$B_{10} \approx \frac{p(D \mid \hat{\theta}, M_1, I)}{p(D \mid \theta_0, M_1, I)} \times \frac{\delta\theta}{\Delta\theta}$.
The factor $\delta\theta/\Delta\theta \le 1$ is the Occam's factor, one per parameter. [Figure from Gregory (2005): likelihood $\mathcal{L}(\theta)$ of width $\delta\theta$ within a prior range $\Delta\theta$, versus the parameter θ.]

65 Estimating the marginal likelihood. $p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d^n\theta$. The marginal likelihood is a k-dimensional integral over parameter space. Accounting for the size of parameter space brings many good things to Bayesian statistics, but the integral is intractable. Large number of techniques to estimate the value of the integral: asymptotic estimates (Laplace approximation, BIC); importance sampling; Chib & Jeliazkov; nested sampling.

66 Estimating the marginal likelihood. Point estimates: BIC $= -2 \ln L_{\max} + k \ln N$ (Bayesian Information Criterion). Some nice properties are lost, such as reasonable parameter penalisation (see Liddle 2007). The error term of the BIC is generally O(1): even for large samples, it does not produce the right value.
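
A hedged one-liner for the BIC as a point estimate; the numbers in the example call are invented for illustration:

```python
import numpy as np

def bic(max_log_likelihood, n_params, n_data):
    """BIC = -2 ln L_max + k ln N (lower is better)."""
    return -2.0 * max_log_likelihood + n_params * np.log(n_data)

# Example: compare two hypothetical fits to the same 50 data points.
print(bic(-120.3, 2, 50), bic(-118.9, 4, 50))
```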

67 Importance Sampling (I). Used to obtain moments of distributions using samples from another distribution. The evidence is the expectation value of the likelihood over the prior space: $p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d^n\theta$. The simplest estimate of the evidence is to take samples from the prior and compute the average of $p(D \mid \theta, H_i, I)$:
$\hat{p}(D \mid H_i, I) = \frac{1}{S} \sum_{s=1}^{S} p(D \mid \theta^{(s)}, H_i, I)$.
But this estimator is extremely inefficient if the likelihood is concentrated relative to the prior; a large number of samples is needed, which is computationally expensive.
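
A hedged sketch of this prior arithmetic-mean estimator, reusing the toy Gaussian likelihood and uniform prior assumed in the Metropolis-Hastings sketch above (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=1.3, scale=0.5, size=50)

def log_likelihood(theta):
    return np.sum(-0.5 * (data - theta) ** 2 / 0.5 ** 2
                  - 0.5 * np.log(2.0 * np.pi * 0.5 ** 2))

S = 100_000
theta_prior = rng.uniform(-10.0, 10.0, size=S)                # samples from the prior
log_L = np.array([log_likelihood(t) for t in theta_prior])
log_evidence = np.logaddexp.reduce(log_L) - np.log(S)         # log of the average likelihood
print(log_evidence)
```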

68 Importance Sampling (II). More generally, we can choose a distribution g(θ) (from which we can sample) and express the integral as an expectation value over that distribution (dropping the conditional I for notational simplicity):
$\int p(\theta \mid H_i)\, p(D \mid \theta, H_i)\, d\theta = \int \underbrace{\frac{p(\theta \mid H_i)}{g(\theta)}}_{w_g(\theta)}\, p(D \mid \theta, H_i)\, g(\theta)\, d\theta = \mathrm{E}_{g(\theta)}\!\left[w_g(\theta)\, p(D \mid \theta, H_i)\right] \approx \frac{1}{S} \sum_{s=1}^{S} w_g(\theta^{(s)})\, p(D \mid \theta^{(s)}, H_i)$,
where $\theta^{(s)}$ is sampled from g(θ), the importance sampling function.

69 Importance Sampling (III). Different choices of g(θ) give different estimators. $g(\theta) = p(\theta \mid H_i, I)$ (the prior) gives $\hat{p}(D \mid H_i, I) = \frac{1}{S} \sum_{s} p(D \mid \theta^{(s)}, H_i, I)$. $g(\theta) = p(\theta \mid D, H_i, I)$ (the posterior) gives the harmonic mean: $\hat{p}(D \mid H_i, I) = S \left[\sum_{s} p(D \mid \theta^{(s)}, H_i, I)^{-1}\right]^{-1}$. The harmonic mean does not satisfy a Gaussian central limit theorem (Kass & Raftery 1995); the sum is dominated by the occasional small terms in the likelihood; its variance can be infinite; it is insensitive to diffuse priors (i.e. priors much larger than the likelihood).

70 Importance Sampling (IV). Perrakis et al. (2014) use the product of the posterior marginal distributions: $g(\theta) = \prod_{j=1}^{k} p(\theta_j \mid D, H, I)$, which leads to the estimator
$\hat{p}(D \mid H, I) = \frac{1}{S} \sum_{s=1}^{S} \frac{\mathcal{L}(\theta^{(s)})\, p(\theta^{(s)} \mid H, I)}{\prod_{j=1}^{k} p(\theta_j^{(s)} \mid D, H, I)}$.
Behaves much better than the harmonic mean. Requires estimating the marginal densities in the denominator; different techniques are available (moment-matching, kernel estimation, etc.). Samples from the marginal distributions are readily obtained from MCMC samples.

71 Estimating the marginal likelihood. Estimators based on importance sampling (IS):
- ISF: prior. Leads to: mean estimate. But: very inefficient.
- ISF: posterior. Leads to: harmonic mean estimate. But: dominated by points with low likelihood.
- ISF: posterior & prior mixture. Leads to: mixture estimate. But: requires drawing from both posterior and prior. Kass & Raftery (1995).
- ISF: posterior x 2. Leads to: truncated-posterior mixture estimate. But: inconsistent or only as good as the HM. Tuomi & Jones (2012).

72 Nested sampling (Skilling 2007). $p(D \mid H_i, I) = \int p(D \mid \theta, H_i, I)\, p(\theta \mid H_i, I)\, d^n\theta$. A non-trivial change of variables. Define the prior volume X: $dX = p(\theta \mid I, H)\, d^n\theta$, $X(\lambda) = \int_{\mathcal{L}(\theta) > \lambda} p(\theta \mid I, H)\, d^n\theta$. Then the integral is a simple 1-D integral over prior volume: $p(D \mid H, I) = \int_0^1 \mathcal{L}(X)\, dX$.

73 Nested sampling (Skilling 2007). $p(D \mid H, I) = \int_0^1 \mathcal{L}(X)\, dX$.

74 Nested sampling (Skilling 2007). Algorithm for $p(D \mid H, I) = \int_0^1 \mathcal{L}(X)\, dX$. Start with N points $\theta_1, \ldots, \theta_N$ from the prior; initialise Z = 0, $X_0 = 1$. Repeat for i = 1, 2, ..., j: record the lowest of the current likelihood values as $L_i$; set $X_i = \exp(-i/N)$ (crude) or sample it to get the uncertainty; set $w_i = X_{i-1} - X_i$ (simple) or $(X_{i-1} - X_{i+1})/2$ (trapezoidal); increment Z by $L_i w_i$; then replace the point of lowest likelihood by a new one drawn from within $\mathcal{L}(\theta) > L_i$, in proportion to the prior $\pi(\theta)$. Finally, increment Z by $N^{-1}\,(L(\theta_1) + \ldots + L(\theta_N))\, X_j$. Two keys: assign $X_i$; draw with the condition $\mathcal{L} > L_i$.
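
A hedged sketch of this loop, with the crude $X_i$ assignment and a brute-force rejection step for the constrained draw (real codes use ellipsoids, slice sampling, etc.); the 1-D Gaussian likelihood and uniform prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1.3, scale=0.5, size=50)

def log_likelihood(theta):
    return np.sum(-0.5 * (data - theta) ** 2 / 0.5 ** 2
                  - 0.5 * np.log(2.0 * np.pi * 0.5 ** 2))

def sample_prior():
    return rng.uniform(-5.0, 5.0)                      # uniform prior on [-5, 5]

N, n_iter = 100, 600
live = np.array([sample_prior() for _ in range(N)])    # N live points from the prior
live_logL = np.array([log_likelihood(t) for t in live])

logZ, logX_prev = -np.inf, 0.0                         # Z = 0, X_0 = 1
for i in range(1, n_iter + 1):
    worst = np.argmin(live_logL)                       # lowest likelihood L_i
    logX = -i / N                                      # crude X_i = exp(-i/N)
    logw = np.log(np.exp(logX_prev) - np.exp(logX))    # w_i = X_{i-1} - X_i
    logZ = np.logaddexp(logZ, live_logL[worst] + logw) # Z += L_i * w_i
    logX_prev = logX
    while True:                                        # replace worst point: prior draw with L > L_i
        theta_new = sample_prior()
        logL_new = log_likelihood(theta_new)
        if logL_new > live_logL[worst]:
            break
    live[worst], live_logL[worst] = theta_new, logL_new

# final increment: Z += mean(L of remaining live points) * X_j
logZ = np.logaddexp(logZ, np.logaddexp.reduce(live_logL) - np.log(N) + logX_prev)
print("log evidence ~", logZ)
```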

75 Nested sampling (Skilling 2007). Drawing a new point with the condition $\mathcal{L} > L_i$. Ellipsoidal nested sampling (Mukherjee et al. 2006): uses ellipsoidal contours around the active points; becomes inefficient at high dimensions or for multi-modal posteriors. MultiNest algorithm (Feroz, Hobson, Bridges, 2009): ellipsoids break up as needed to improve efficiency; still problems at high dimensions (the curse goes on).

76 Nested sampling (Skilling 2007). Assigning $X_i$. $p(D \mid H, I) = \int_0^1 \mathcal{L}(X)\, dX$. Use the statistical properties of randomly distributed points: $X_0 = 1$, $X_i = t_i X_{i-1}$, $p(t_i) = N t_i^{N-1}$, $\mathrm{E}[\log t] = -1/N$. Take the mean value $t_i = \exp(-1/N)$ (so $X_i = \exp(-i/N)$), OR draw from the distribution to account for the uncertainty.

77
