Introduction to Bayesian Inference Lecture 2: Key Examples


Tom Loredo, Dept. of Astronomy, Cornell University
CASt Summer School, 10 June

Lecture 2: Key Examples

1. Simple examples: Binary Outcomes; Normal Distribution; Poisson Distribution
2. Multilevel models for measurement error
3. Bayesian calculation
4. Bayesian software
5. Closing reflections

Key Examples, part 1: Simple examples (Binary Outcomes, Normal Distribution, Poisson Distribution)

Binary Outcomes: Parameter Estimation

M = existence of two outcomes, S and F; for each case or trial, the probability for S is α; for F it is (1 − α)
H_i = statements about α, the probability for success on the next trial; seek p(α|D,M)
D = sequence of results from N observed trials, e.g. FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

Likelihood:
p(D|α,M) = p(F|α,M) p(F|α,M) p(S|α,M) · · · = α^n (1 − α)^(N−n) ≡ L(α)

Prior

Starting with no information about α beyond its definition, use as an uninformative prior p(α|M) = 1. Justifications:

Intuition: Don't prefer any α interval to any other of the same size.

Bayes's justification: Ignorance means that before doing the N trials, we have no preference for how many will be successes:
P(n successes|M) = 1/(N+1)  ⇒  p(α|M) = 1

Consider this a convention, an assumption added to M to make the problem well posed.

Prior Predictive

p(D|M) = ∫ dα α^n (1 − α)^(N−n) = B(n+1, N−n+1) = n!(N−n)! / (N+1)!

A Beta integral: B(a,b) ≡ ∫₀¹ dx x^(a−1) (1 − x)^(b−1) = Γ(a)Γ(b)/Γ(a+b).

Posterior

p(α|D,M) = (N+1)!/[n!(N−n)!] α^n (1 − α)^(N−n)

A Beta distribution. Summaries:
Best fit: α̂ = n/N = 2/3; posterior mean ⟨α⟩ = (n+1)/(N+2) ≈ 0.64
Uncertainty: σ_α = √[(n+1)(N−n+1)/((N+2)²(N+3))] ≈ 0.12
Find credible regions numerically, or with the incomplete beta function.

Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence. n is a (minimal) sufficient statistic.
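A quick numerical check of these summaries (not from the slides; it assumes SciPy and uses the n = 8, N = 12 data above; the credible region shown is the central one rather than the HPD region):

    # Beta posterior for alpha with a flat prior: Beta(n+1, N-n+1)
    from scipy import stats

    n, N = 8, 12
    post = stats.beta(n + 1, N - n + 1)

    mode = n / N                      # best fit (posterior mode) = 2/3
    mean = post.mean()                # (n+1)/(N+2) ~ 0.64
    sd = post.std()                   # ~ 0.12
    lo, hi = post.ppf(0.5 - 0.683 / 2), post.ppf(0.5 + 0.683 / 2)
    print(f"mode={mode:.3f} mean={mean:.3f} sd={sd:.3f} 68.3% CR=({lo:.3f}, {hi:.3f})")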


Binary Outcomes: Model Comparison. Equal Probabilities?

M1: α = 1/2
M2: α ∈ [0,1] with flat prior.

Maximum likelihoods:
M1: p(D|M1) = 1/2^N ≈ 2.4 × 10⁻⁴
M2: L(α̂) = (2/3)^n (1/3)^(N−n) ≈ 4.8 × 10⁻⁴

p(D|M1) / p(D|α̂,M2) = 0.51

Maximum likelihoods favor M2 (the data are more probable under M2's best fit).

Bayes Factor (ratio of model likelihoods)

p(D|M1) = 1/2^N;  p(D|M2) = n!(N−n)!/(N+1)!

B_12 ≡ p(D|M1)/p(D|M2) = (N+1)!/[n!(N−n)! 2^N] = 1.57

The Bayes factor (odds) favors M1 (equiprobable).

Note that for n = 6, B_12 = 2.93; for this small amount of data, we can never be very sure the results are equiprobable.

If n = 0, B_12 ≈ 1/315; if n = 2, B_12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect the outcomes have different probabilities.

(Frequentist significance tests can reject the null for any sample size.)
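A minimal sketch (not from the slides) of the Bayes-factor arithmetic above, reproducing the quoted values for n = 8, 6, 2, 0 out of N = 12 trials:

    from math import factorial

    def bayes_factor(n, N):
        """B_12 = p(D|M1)/p(D|M2) for M1: alpha = 1/2 vs M2: flat prior."""
        pD_M1 = 0.5 ** N
        pD_M2 = factorial(n) * factorial(N - n) / factorial(N + 1)
        return pD_M1 / pD_M2

    for n in (8, 6, 2, 0):
        print(n, round(bayes_factor(n, 12), 3))
    # 8 -> 1.571, 6 -> 2.933, 2 -> 0.209 (~1/4.8), 0 -> 0.003 (~1/315)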

Binary Outcomes: Binomial Distribution

Suppose D = n (the number of heads in N trials), rather than the actual sequence. What is p(α|n,M)?

Likelihood
Let S = a sequence of flips with n heads.

p(n|α,M) = Σ_S p(S|α,M) p(n|S,α,M) = α^n (1 − α)^(N−n) C_{n,N}
(the sum runs over sequences with # successes = n)

C_{n,N} = # of sequences of length N with n heads, so

p(n|α,M) = N!/[n!(N−n)!] α^n (1 − α)^(N−n)

The binomial distribution for n given α, N.

12 Posterior p(α n,m) = N! n!(n n)! αn (1 α) N n p(n M) p(n M) = = N! n!(n n)! 1 N +1 dα α n (1 α) N n p(α n,m) = (N +1)! n!(n n)! αn (1 α) N n Same result as when data specified the actual sequence. 12/76

Another Variation: Negative Binomial

Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N,M)?

Likelihood
p(N|α,M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

p(N|α,M) = (N−1)!/[(n−1)!(N−n)!] α^(n−1) (1 − α)^(N−n) · α
         = (N−1)!/[(n−1)!(N−n)!] α^n (1 − α)^(N−n)

The negative binomial distribution for N given α, n.

Posterior

p(α|D,M) = C_{n,N} α^n (1 − α)^(N−n) / p(D|M)

p(D|M) = C_{n,N} ∫ dα α^n (1 − α)^(N−n)

⇒ p(α|D,M) = (N+1)!/[n!(N−n)!] α^n (1 − α)^(N−n)

Same result as the other cases.

Final Variation: Meteorological Stopping

Suppose D = (N, n), the number of samples and number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D,M′)? (M′ adds info about the weather to M.)

Likelihood
p(D|α,M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

p(D|α,M′) = W(N) N!/[n!(N−n)!] α^n (1 − α)^(N−n)

Let C_{n,N} = W(N) (N choose n). We get the same result as before!

Likelihood Principle

To define L(H_i) = p(D_obs|H_i,I), we must contemplate what other data we might have obtained. But the real sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well specified at all. Should this concern us?

Likelihood principle: The results of inferences depend only on how p(D_obs|H_i,I) varies with respect to the hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.

This happens because no sums of probabilities for hypothetical data appear in Bayesian results; Bayesian calculations condition on D_obs.

This is a sensible property that frequentist methods do not share. Frequentist probabilities are long-run rates of performance, and depend on details of the sample space that are irrelevant in a Bayesian calculation.

Goodness-of-fit Violates the Likelihood Principle

Theory (H_0): The number of A stars in a cluster should be 0.1 of the total.

Observations: 5 A stars found out of 96 total stars observed.

Theorist's analysis: Calculate χ² using the expected counts n_A = 9.6 and n_X = 86.4. The significance level is p(>χ² | H_0) = 0.12 (or 0.07 using the more rigorous binomial tail area). The theory is accepted.

Observer's analysis: The actual observing plan was to keep observing until 5 A stars were seen! The random quantity is N_tot, not n_A; it should follow the negative binomial dist'n. Expect N_tot = 50 ± 21. Now p(>χ² | H_0) = …; the theory is rejected.

Telescope technician's analysis: A storm was coming in, so the observations would have ended whether 5 A stars had been seen or not. The proper ensemble should take into account p(storm)...

Bayesian analysis: The Bayes factor is the same for binomial or negative binomial likelihoods, and slightly favors H_0. Include p(storm) if you want; it will drop out!

Inference With Normals/Gaussians

Gaussian PDF:
p(x|µ,σ) = 1/(σ√(2π)) exp[−(x − µ)²/(2σ²)]   over (−∞, ∞)

Common abbreviated notation: x ~ N(µ, σ²)

Parameters:
µ = ⟨x⟩ ≡ ∫ dx x p(x|µ,σ)
σ² = ⟨(x − µ)²⟩ ≡ ∫ dx (x − µ)² p(x|µ,σ)

Gauss's Observation: Sufficiency

Suppose our data consist of N measurements, d_i = µ + ε_i. Suppose the noise contributions are independent, and ε_i ~ N(0, σ²).

p(D|µ,σ,M) = ∏_i p(d_i|µ,σ,M)
           = ∏_i p(ε_i = d_i − µ | µ,σ,M)
           = ∏_i 1/(σ√(2π)) exp[−(d_i − µ)²/(2σ²)]
           = 1/[σ^N (2π)^(N/2)] e^(−Q(µ)/2σ²)

Find the dependence of Q on µ by completing the square:

Q = Σ_i (d_i − µ)²    [Note: Q/σ² = χ²(µ)]
  = Σ_i d_i² − 2µ Σ_i d_i + Nµ²
  = Σ_i d_i² − 2Nµd̄ + Nµ²             where d̄ ≡ (1/N) Σ_i d_i
  = N(µ − d̄)² + (Σ_i d_i² − Nd̄²)
  = N(µ − d̄)² + Nr²                   where r² ≡ (1/N) Σ_i (d_i − d̄)²
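A numerical check of the completed-square identity (not from the slides; the data values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d = rng.normal(5.0, 2.0, size=20)        # any fake data will do
    N, dbar, r2 = d.size, d.mean(), d.var()  # r^2 = (1/N) sum (d_i - dbar)^2

    for mu in (0.0, 3.7, 5.0, 8.2):
        Q_direct = np.sum((d - mu) ** 2)
        Q_square = N * (mu - dbar) ** 2 + N * r2
        assert np.isclose(Q_direct, Q_square)
    print("Q(mu) = N(mu - dbar)^2 + N r^2 holds for all test values")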

The likelihood depends on {d_i} only through d̄ and r:

L(µ,σ) = 1/[σ^N (2π)^(N/2)] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)

The sample mean and variance are sufficient statistics.

This is a miraculous compression of information; the normal dist'n is highly abnormal in this respect!

Estimating a Normal Mean

Problem specification
Model: d_i = µ + ε_i, ε_i ~ N(0, σ²), σ is known → I = (σ, M).
Parameter space: µ; seek p(µ|D,σ,M)

Likelihood
p(D|µ,σ,M) = 1/[σ^N (2π)^(N/2)] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)
           ∝ exp(−N(µ − d̄)²/2σ²)

Uninformative prior
Translation invariance ⇒ p(µ) ∝ C, a constant. This prior is improper unless bounded.

Prior predictive/normalization
p(D|σ,M) = ∫ dµ C exp(−N(µ − d̄)²/2σ²) = C (σ/√N) √(2π)
... minus a tiny bit from the tails, using a proper prior.

Posterior

p(µ|D,σ,M) = 1/[(σ/√N) √(2π)] exp(−N(µ − d̄)²/2σ²)

The posterior is N(d̄, w²), with standard deviation w = σ/√N.
The 68.3% HPD credible region for µ is d̄ ± σ/√N.
Note that C drops out; the limit of infinite prior range is well behaved.
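A minimal sketch of this result (not from the slides; the measurements and the known σ are hypothetical):

    import numpy as np
    from scipy import stats

    sigma = 2.0                                  # assumed known noise level
    d = np.array([4.1, 5.3, 4.8, 6.0, 5.1])      # hypothetical data
    dbar, N = d.mean(), d.size

    post = stats.norm(loc=dbar, scale=sigma / np.sqrt(N))  # N(dbar, sigma^2/N)
    lo, hi = post.interval(0.683)                # 68.3% region = dbar +/- sigma/sqrt(N)
    print(f"posterior mean {dbar:.2f}, 68.3% credible region ({lo:.2f}, {hi:.2f})")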

Informative Conjugate Prior

Use a normal prior, µ ~ N(µ₀, w₀²). Conjugate because the posterior is also normal.

Posterior
Normal, N(µ̃, w̃²), but the mean and std. deviation shrink towards the prior.
Define B = w²/(w² + w₀²), so B < 1 and B → 0 when w₀ is large. Then

µ̃ = d̄ + B (µ₀ − d̄)
w̃ = w √(1 − B)

Principle of stable estimation: The prior affects estimates only when the data are not informative relative to the prior.

Conjugate normal examples: Data have d̄ = 3, σ/√N = 1; priors at µ₀ = 10, with prior widths {5, 2}.
[Figure: prior, likelihood, and posterior densities for the two prior widths]
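A small sketch of the shrinkage formulas for this example (not from the slides), with d̄ = 3, w = σ/√N = 1, and the two prior widths:

    import numpy as np

    def shrink(dbar, w, mu0, w0):
        B = w**2 / (w**2 + w0**2)            # shrinkage factor toward the prior
        return dbar + B * (mu0 - dbar), w * np.sqrt(1 - B)

    for w0 in (5.0, 2.0):
        mu_t, w_t = shrink(3.0, 1.0, 10.0, w0)
        print(f"w0={w0}: posterior mean {mu_t:.2f}, std dev {w_t:.2f}")
    # Wide prior (w0=5): mean ~3.27, barely shifted; narrower prior (w0=2): ~4.40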

Estimating a Normal Mean: Unknown σ

Problem specification
Model: d_i = µ + ε_i, ε_i ~ N(0, σ²), σ is unknown
Parameter space: (µ, σ); seek p(µ|D,M)

Likelihood
p(D|µ,σ,M) = 1/[σ^N (2π)^(N/2)] exp(−Nr²/2σ²) exp(−N(µ − d̄)²/2σ²)
           ∝ (1/σ^N) e^(−Q/2σ²)    where Q = N[r² + (µ − d̄)²]

Uninformative Priors
Assume the priors for µ and σ are independent.
Translation invariance ⇒ p(µ) ∝ C, a constant.
Scale invariance ⇒ p(σ) ∝ 1/σ (flat in log σ).

Joint Posterior for µ, σ
p(µ,σ|D,M) ∝ (1/σ^(N+1)) e^(−Q(µ)/2σ²)

Marginal Posterior

p(µ|D,M) ∝ ∫ dσ (1/σ^(N+1)) e^(−Q/2σ²)

Let τ = Q/(2σ²), so σ = √(Q/2τ) and dσ = −(1/2)√(Q/2) τ^(−3/2) dτ:

p(µ|D,M) ∝ 2^(N/2) Q^(−N/2) ∫ dτ τ^(N/2 − 1) e^(−τ) ∝ Q^(−N/2)

Write Q = Nr²[1 + ((µ − d̄)/r)²] and normalize:

p(µ|D,M) = [(N/2 − 1)! / ((N/2 − 3/2)! √(Nπ))] × (1/(r/√N)) × [1 + (1/N)((µ − d̄)/(r/√N))²]^(−N/2)

Student's t distribution, with t = (µ − d̄)/(r/√N). A bell curve, but with power-law tails.

Large N: p(µ|D,M) ∝ e^(−N(µ−d̄)²/2r²)

This is the rigorous way to adjust σ so that χ²/dof = 1. It doesn't just plug in a best σ; it slightly broadens the posterior to account for σ uncertainty.
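A minimal sketch (not from the slides): the marginal posterior above is a Student's t with N − 1 degrees of freedom, location d̄, and scale S/√N, where S is the ddof=1 sample standard deviation (equivalently r/√(N−1)). The data values are hypothetical:

    import numpy as np
    from scipy import stats

    d = np.array([4.1, 5.3, 4.8, 6.0, 5.1])     # hypothetical measurements
    N, dbar = d.size, d.mean()

    post = stats.t(df=N - 1, loc=dbar, scale=d.std(ddof=1) / np.sqrt(N))
    lo, hi = post.interval(0.683)
    print(f"68.3% credible region for mu: ({lo:.2f}, {hi:.2f})")
    # Slightly wider than the known-sigma result, reflecting sigma uncertainty.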

Student's t examples: p(x) ∝ [1 + x²/n]^(−(n+1)/2), location 0, scale 1.
[Figure: densities for degrees of freedom n = {1, 2, 3, 5, 10, ∞}; the n → ∞ curve is the normal]

Gaussian Background Subtraction

Measure the background rate b = b̂ ± σ_b with the source off.
Measure the total rate r = r̂ ± σ_r with the source on.
Infer the signal source strength s, where r = s + b. With flat priors,

p(s,b|D,M) ∝ exp[−(b − b̂)²/2σ_b²] exp[−(s + b − r̂)²/2σ_r²]

Marginalize over b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

p(s|D,M) ∝ exp[−(s − ŝ)²/2σ_s²]     with  ŝ = r̂ − b̂,  σ_s² = σ_r² + σ_b²

Background subtraction is a special case of background marginalization; i.e., marginalization told us to subtract a background estimate.

Recall the standard derivation of background uncertainty via propagation of errors, based on a Taylor expansion (the statistician's delta method). Marginalization provides a generalization of error propagation, without approximation!
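A numerical check of this marginalization (not from the slides; the measured rates are hypothetical):

    import numpy as np

    b_hat, sigma_b = 10.0, 1.5        # background measurement
    r_hat, sigma_r = 14.0, 2.0        # on-source (total) measurement

    s = np.linspace(-8.0, 16.0, 1201)
    b = np.linspace(0.0, 25.0, 1201)
    S, B = np.meshgrid(s, b, indexing="ij")

    joint = np.exp(-(B - b_hat)**2 / (2 * sigma_b**2)
                   - (S + B - r_hat)**2 / (2 * sigma_r**2))
    ds = s[1] - s[0]
    marg = joint.sum(axis=1)          # integrate out b on the grid
    marg /= marg.sum() * ds

    mean = (s * marg).sum() * ds
    sd = np.sqrt(((s - mean)**2 * marg).sum() * ds)
    print(mean, sd)   # ~4.0 and ~2.5 = sqrt(sigma_r^2 + sigma_b^2)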

Bayesian Curve Fitting & Least Squares

Setup
Data D = {d_i} are measurements of an underlying function f(x;θ) at N sample points {x_i}. Let f_i(θ) ≡ f(x_i;θ):

d_i = f_i(θ) + ε_i,   ε_i ~ N(0, σ_i²)

We seek to learn θ, or to compare different functional forms (model choice, M).

Likelihood
p(D|θ,M) = ∏_{i=1..N} 1/(σ_i√(2π)) exp[−(1/2)((d_i − f_i(θ))/σ_i)²]
         ∝ exp[−(1/2) Σ_i ((d_i − f_i(θ))/σ_i)²]
         = exp[−χ²(θ)/2]

Bayesian Curve Fitting & Least Squares

Posterior
For prior density π(θ),

p(θ|D,M) ∝ π(θ) exp[−χ²(θ)/2]

If you have a least-squares or χ² code: think of χ²(θ) as −2 log L(θ). Bayesian inference amounts to exploration and numerical integration of π(θ) e^(−χ²(θ)/2).
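A minimal sketch of this recipe (not from the slides): a straight-line model with fabricated data, a flat prior, and a brute-force grid over the single parameter:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    d = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # fake data, roughly d = 2x
    sig = 0.3 * np.ones_like(d)

    def chi2(theta):                             # model: f(x; theta) = theta * x
        return np.sum(((d - theta * x) / sig) ** 2)

    theta = np.linspace(1.8, 2.2, 4001)
    logpost = np.array([-0.5 * chi2(t) for t in theta])   # flat prior pi(theta)
    post = np.exp(logpost - logpost.max())
    dtheta = theta[1] - theta[0]
    post /= post.sum() * dtheta

    print("posterior mean slope:", (theta * post).sum() * dtheta)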

Important Case: Separable Nonlinear Models

A (linearly) separable model has parameters θ = (A, ψ):
Linear amplitudes A = {A_α}
Nonlinear parameters ψ

f(x;θ) is a linear superposition of M nonlinear components g_α(x;ψ):

d_i = Σ_{α=1..M} A_α g_α(x_i;ψ) + ε_i,   or   d = Σ_α A_α g_α(ψ) + ε (as vectors over samples).

Why this is important: you can marginalize over A analytically (the Bretthorst algorithm, "Bayesian Spectrum Analysis & Parameter Estimation," 1988). The algorithm is closely related to linear least squares/diagonalization/SVD.

Poisson Dist'n: Infer a Rate from Counts

Problem: Observe n counts in T; infer the rate, r.

Likelihood
L(r) ≡ p(n|r,M) = (rT)^n e^(−rT) / n!

Prior
Two simple standard choices (or the conjugate gamma dist'n):

r known to be nonzero; it is a scale parameter:
p(r|M) = 1/[ln(r_u/r_l)] · 1/r

r may vanish; require p(n|M) ≈ const:
p(r|M) = 1/r_u

Prior predictive

p(n|M) = (1/r_u)(1/n!) ∫₀^{r_u} dr (rT)^n e^(−rT)
       = (1/(r_u T))(1/n!) ∫₀^{r_u T} d(rT) (rT)^n e^(−rT)
       ≈ 1/(r_u T)   for r_u ≫ n/T

Posterior
A gamma distribution:

p(r|n,M) = T (rT)^n e^(−rT) / n!

Gamma Distributions

A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

p_Γ(x|α,s) = 1/[sΓ(α)] (x/s)^(α−1) e^(−x/s)

Moments: E(x) = sα;  Var(x) = s²α

Our posterior corresponds to α = n+1, s = 1/T.
Mode r̂ = n/T; mean ⟨r⟩ = (n+1)/T (shift down 1 with the 1/r prior)
Std. dev'n σ_r = √(n+1)/T; credible regions are found by integrating (can use the incomplete gamma function)
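A short sketch using SciPy's gamma distribution (not from the slides; the counts and interval are hypothetical):

    from scipy import stats

    n, T = 16, 2.0                               # 16 counts in 2 time units
    post = stats.gamma(a=n + 1, scale=1.0 / T)   # flat-prior posterior for r

    mode = n / T
    mean, sd = post.mean(), post.std()    # (n+1)/T and sqrt(n+1)/T
    lo, hi = post.interval(0.683)         # central 68.3% credible region
    print(f"mode={mode:.2f} mean={mean:.2f} sd={sd:.2f} CR=({lo:.2f}, {hi:.2f})")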


The flat prior

Bayes's justification: Not that ignorance of r ⇒ p(r|I) = C; rather, require the (discrete) predictive distribution to be flat:

p(n|I) = ∫ dr p(r|I) p(n|r,I) = C   ⇒   p(r|I) = C

Useful conventions
Use a flat prior for a rate that may be zero
Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
Use proper (normalized, bounded) priors
Plot the posterior with an abscissa that makes the prior flat

The On/Off Problem

Basic problem
Look off-source; unknown background rate b. Count N_off photons in interval T_off.
Look on-source; the rate is r = s + b with unknown signal s. Count N_on photons in interval T_on.
Infer s.

Conventional solution
b̂ = N_off/T_off;  σ_b = √N_off / T_off
r̂ = N_on/T_on;   σ_r = √N_on / T_on
ŝ = r̂ − b̂;       σ_s = √(σ_r² + σ_b²)

But ŝ can be negative!

Examples: Spectra of X-Ray Sources
[Figures: spectra from Bassani et al. and Di Salvo et al.]

Spectrum of Ultrahigh-Energy Cosmic Rays (Nagano & Watson 2000; HiRes Team 2007)
[Figure: Flux × E³/10²⁴ (eV² m⁻² s⁻¹ sr⁻¹) vs. log₁₀(E) (eV), showing HiRes-1/HiRes-2 monocular and AGASA data]


N is Never Large

"Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is large enough, you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc., etc.). N is never enough because if it were enough you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story."

Andrew Gelman (blog entry, 31 July 2005)

Bayesian Solution to the On/Off Problem

First consider the off-source data; use it to estimate b:

p(b|N_off, I_off) = T_off (bT_off)^N_off e^(−bT_off) / N_off!

Use this as a prior for b to analyze the on-source data. For the on-source analysis, I_all = (I_on, N_off, I_off):

p(s,b|N_on, I_all) ∝ p(s) p(b) [(s+b)T_on]^N_on e^(−(s+b)T_on)

p(s|I_all) is flat, but p(b|I_all) = p(b|N_off, I_off), so

p(s,b|N_on, I_all) ∝ (s+b)^N_on b^N_off e^(−sT_on) e^(−b(T_on+T_off))

Now marginalize over b:

p(s|N_on, I_all) = ∫ db p(s,b|N_on, I_all)
                 ∝ ∫ db (s+b)^N_on b^N_off e^(−sT_on) e^(−b(T_on+T_off))

Expand (s+b)^N_on and do the resulting Γ integrals:

p(s|N_on, I_all) = Σ_{i=0..N_on} C_i [T_on (sT_on)^i e^(−sT_on) / i!]

C_i ∝ (1 + T_off/T_on)^i (N_on + N_off − i)! / (N_on − i)!   (normalized so Σ_i C_i = 1)

The posterior is a weighted sum of Gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via a recursive algorithm or the confluent hypergeometric function.)
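A minimal sketch (not from the slides) that obtains the same posterior by direct numerical marginalization over b instead of the analytic expansion; the counts and exposures are hypothetical:

    import numpy as np

    N_on, T_on = 16, 1.0
    N_off, T_off = 9, 1.0

    s = np.linspace(0.0, 40.0, 801)
    b = np.linspace(1e-6, 40.0, 801)
    S, B = np.meshgrid(s, b, indexing="ij")

    # Work in logs to avoid overflow, then exponentiate relative to the max.
    logjoint = (N_on * np.log(S + B) + N_off * np.log(B)
                - S * T_on - B * (T_on + T_off))
    post = np.exp(logjoint - logjoint.max()).sum(axis=1)   # marginalize over b
    post /= post.sum() * (s[1] - s[0])

    print("posterior mode for s:", s[np.argmax(post)])     # near N_on/T_on - N_off/T_off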

Example On/Off Posteriors: Short Integrations (T_on = 1)
[Figure: posterior PDFs for s]

Example On/Off Posteriors: Long Background Integrations (T_on = 1)
[Figure: posterior PDFs for s]

Recap of Key Findings From Examples

Sufficiency: model-dependent summary of the data
Conjugate priors
Likelihood principle
Marginalization: generalizes background subtraction and propagation of errors
Student's t for handling σ uncertainty
Exact treatment of Poisson background uncertainty (don't subtract!)

Key Examples, part 2: Multilevel models for measurement error

Complications With Survey Data

Selection effects (truncation, censoring): obvious (usually)
  Typically treated by correcting the data
  Most sophisticated: product-limit estimators

Scatter effects (measurement error, etc.): insidious
  Typically ignored (average out???)

Many Guises of Measurement Error

Auger data above the GZK cutoff (PAO 2007; Soiaporn et al.)
QSO hardness vs. luminosity (Kelly 2007, 2011)

Accounting For Measurement Error

Introduce latent/hidden/incidental parameters.
Suppose f(x|θ) is a distribution for an observable, x. From N precisely measured samples, {x_i}, we can infer θ from

L(θ) ≡ p({x_i}|θ) = ∏_i f(x_i|θ)

Graphical representation

Nodes/vertices = uncertain quantities
Edges specify conditional dependence
Absence of an edge denotes conditional independence

[Graph: θ at the top, with edges to x_1, x_2, ..., x_N]

L(θ) ≡ p({x_i}|θ) = ∏_i f(x_i|θ)

But what if the x data are noisy, D_i = {x_i + ε_i}?

We should somehow incorporate ℓ_i(x_i) = p(D_i|x_i):

L(θ,{x_i}) ≡ p({D_i}|θ,{x_i}) = ∏_i ℓ_i(x_i) f(x_i|θ)

Marginalize over {x_i} to summarize inferences for θ.
Marginalize over θ to summarize inferences for {x_i}.

Key point: Maximizing over x_i and integrating over x_i can give very different results!

To estimate x_1:

L_m(x_1) = ∫ dθ ℓ_1(x_1) f(x_1|θ) ∏_{i=2..N} ∫ dx_i ℓ_i(x_i) f(x_i|θ)
         ≈ ℓ_1(x_1) f(x_1|θ̂)   with θ̂ determined by the remaining data.

f(x_1|θ̂) behaves like a prior that shifts the x_1 estimate away from the peak of ℓ_1(x_1).

This generalizes the corrections derived by Malmquist and Lutz-Kelker. Landy & Szalay (1992) proposed adaptive Malmquist corrections that can be viewed as an approximation to this.
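A toy illustration of this shift (not from the slides): one noisy datum, a Gaussian population f(x|θ̂), and a comparison of the raw maximum-likelihood estimate with the estimate after weighting by the population term; all numbers are invented:

    import numpy as np

    theta_hat, tau = 0.0, 1.0      # population mean/width from the other objects
    D1, sigma = 2.5, 1.0           # noisy measurement of x_1 and its error

    x = np.linspace(-5.0, 8.0, 5001)
    dx = x[1] - x[0]
    like = np.exp(-0.5 * ((D1 - x) / sigma) ** 2)        # l_1(x_1)
    pop = np.exp(-0.5 * ((x - theta_hat) / tau) ** 2)    # f(x_1|theta_hat)
    post = like * pop
    post /= post.sum() * dx

    print("max-likelihood x_1:", x[np.argmax(like)])     # the raw datum, 2.5
    print("posterior mean x_1:", (x * post).sum() * dx)  # shrunk toward theta_hat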

Graphical representation

[Graph: θ at the top; edges to x_1, x_2, ..., x_N; each x_i has an edge to its datum D_i]

L(θ,{x_i}) ≡ p({D_i}|θ,{x_i}) = ∏_i p(D_i|x_i) f(x_i|θ) = ∏_i ℓ_i(x_i) f(x_i|θ)

A two-level multilevel model (MLM).

Bayesian MLMs in Astronomy

Surveys (number counts / log N–log S / Malmquist):
  GRB peak flux dist'n (Loredo & Wasserman 1998)
  TNO/KBO magnitude distribution (Gladman et al.; Petit et al.)
  MLM tutorial; Malmquist-type biases in cosmology (Loredo & Hendry 2009, in the BMIC book)
  Extreme deconvolution for proper motion surveys (Bovy, Hogg & Roweis 2011)

Directional & spatio-temporal coincidences:
  GRB repetition (Luo et al.; Graziani et al.)
  GRB host ID (Band 1998; Graziani et al.)
  VO cross-matching (Budavári & Szalay 2008)

Time series:
  SN 1987A neutrinos, uncertain energy vs. time (Loredo & Lamb 2002)
  Multivariate Bayesian Blocks (Dobigeon, Tourneret & Scargle 2007)
  SN Ia multicolor light curve modeling (Mandel et al. 2011)

Linear regression with measurement error:
  QSO hardness vs. luminosity (Kelly 2007, 2011)

See last year's Surveys School notes for more!

Key Examples, part 3: Bayesian calculation

Statistical Integrals

Inference with independent data
Consider N data, D = {x_i}, and a model M with m parameters. Suppose L(θ) = p(x_1|θ) p(x_2|θ) · · · p(x_N|θ).

Frequentist integrals
Find long-run properties of procedures via sample-space integrals:

I(θ) = ∫ dx_1 p(x_1|θ) ∫ dx_2 p(x_2|θ) · · · ∫ dx_N p(x_N|θ) f(D,θ)

Rigorous analysis must explore the θ dependence; this is rarely done in practice.
Plug-in approximation: report properties of the procedure for θ = θ̂. Asymptotically accurate (for large N, expect θ̂ → θ).
Plug-in results are easy via Monte Carlo (due to independence).

Bayesian integrals

∫ d^m θ g(θ) p(θ|M) L(θ) = ∫ d^m θ g(θ) q(θ),   with q(θ) ≡ p(θ|M) L(θ)

g(θ) = 1      →  p(D|M) (norm. const., model likelihood)
g(θ) = "box"  →  credible region
g(θ) = θ      →  posterior mean for θ

Such integrals are sometimes easy if analytic (especially in low dimensions), and often easier than their frequentist counterparts (e.g., normal credible regions, Student's t).

Asymptotic approximations: require ingredients familiar from frequentist calculations. Bayesian calculation is not significantly harder than frequentist calculation in this limit.

Numerical calculation: for large m (> 4 is often enough!) the integrals are often very challenging because of structure (e.g., correlations) in parameter space. This is usually pursued without making any procedural approximations.

Bayesian Computation

Large sample size: Laplace approximation
  Approximate the posterior as multivariate normal → det(covar) factors
  Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
  Often accurate to O(1/N)

Modest-dimensional models (d ≈ 10 to 20)
  Adaptive cubature
  Monte Carlo integration (importance & stratified sampling, adaptive importance sampling, quasi-random MC)

High-dimensional models (d > 5)
  Posterior sampling: create an RNG that samples the posterior
  MCMC is the most general framework → Murali's lab
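A small sketch of the Laplace approximation (not from the slides), applied to the model likelihood p(D|M2) from the binary-outcome example and compared with the exact Beta integral:

    import numpy as np
    from math import factorial

    n, N = 8, 12
    a_hat = n / N                                      # MLE of alpha
    L_hat = a_hat**n * (1 - a_hat)**(N - n)            # peak likelihood
    d2lnL = -n / a_hat**2 - (N - n) / (1 - a_hat)**2   # Hessian of ln L at the peak
    sigma = np.sqrt(-1.0 / d2lnL)

    laplace = L_hat * np.sqrt(2 * np.pi) * sigma       # flat prior density = 1 on [0,1]
    exact = factorial(n) * factorial(N - n) / factorial(N + 1)
    print(f"Laplace {laplace:.3e}  exact {exact:.3e}  ratio {laplace/exact:.3f}")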

See the SCMA 5 Bayesian Computation tutorial notes for more!

Key Examples, part 4: Bayesian software

Tools for Computational Bayes

Astronomer/Physicist Tools

BIE (Bayesian Inference Engine): General framework for Bayesian inference, tailored to astronomical and earth-science survey data. Built-in database capability to support analysis of terabyte-scale data sets. Inference is by Bayes via MCMC.
CIAO/Sherpa: On/off marginal likelihood support, and Bayesian Low-Count X-ray Spectral (BLoCXS) analysis via MCMC (the pyblocxs extension).
XSpec: Includes some basic MCMC capability.
CosmoMC: Parameter estimation for cosmological models using the CMB, etc., via MCMC.
MultiNest: Bayesian inference via an approximate implementation of the nested sampling algorithm.
ExoFit: Adaptive MCMC for fitting exoplanet RV data.
extreme-deconvolution: Multivariate density estimation with measurement error, via a multivariate normal finite mixture model; partly Bayesian; Python & IDL wrappers.

Astronomer/Physicist Tools, cont'd

ROOT/RooStats: Statistical tools for particle physicists; Bayesian support being incorporated.
CDF Bayesian Limit Software: Limits for Poisson counting processes, with background & efficiency uncertainties.
SuperBayeS: Bayesian exploration of supersymmetric theories in particle physics using the MultiNest algorithm; includes a MATLAB GUI for plotting.
CUBA: Multidimensional integration via adaptive cubature, adaptive importance sampling & stratification, and QMC (C/C++, Fortran, and Mathematica; R interface also available via the 3rd-party R2Cuba).
Cubature: Subregion-adaptive cubature in C, with a 3rd-party R interface; intended for low dimensions (< 7).
APEMoST: Automated Parameter Estimation and Model Selection Toolkit in C, a general-purpose MCMC environment that includes parallel computing support via MPI; motivated by asteroseismology problems.
Inference: Forthcoming; several self-contained Bayesian modules; Parametric Inference Engine.

Python

PyMC: A framework for MCMC via Metropolis-Hastings; also implements Kalman filters and Gaussian processes. Targets biometrics, but is general.
SimPy: SimPy (rhymes with "Blimpie") is a process-oriented public-domain package for discrete-event simulation.
RSPython: Bi-directional communication between Python and R.
MDP: Modular toolkit for Data Processing; current emphasis is on machine learning (PCA, ICA, ...). Modularity allows combination of algorithms and other data processing elements into flows.
Orange: Component-based data mining, with preprocessing, modeling, and exploration components. Python/GUI interfaces to C++ implementations. Some Bayesian components.
ELEFANT: Machine learning library and platform providing Python interfaces to efficient, lower-level implementations. Some Bayesian components (Gaussian processes; Bayesian ICA/PCA).

R and S

CRAN Bayesian task view: Overview of many R packages implementing various Bayesian models and methods; pedagogical packages; packages linking R to other Bayesian software (BUGS, JAGS).
Omega-hat: RPython, RMatlab, R-XLisp.
BOA (Bayesian Output Analysis): Convergence diagnostics and statistical and graphical analysis of MCMC output; can read BUGS output files.
CODA (Convergence Diagnosis and Output Analysis): Menu-driven R/S plugins for analyzing BUGS output.
R2Cuba: R interface to Thomas Hahn's CUBA library (see above) for deterministic and Monte Carlo cubature.

Java

Hydra: Provides methods for implementing MCMC samplers using Metropolis, Metropolis-Hastings, and Gibbs methods. In addition, it provides classes implementing several unique adaptive and multiple-chain/parallel MCMC methods.
YADAS: Software system for statistical analysis using MCMC, based on the multi-parameter Metropolis-Hastings algorithm (rather than parameter-at-a-time Gibbs sampling).
Omega-hat: Java environment for statistical computing, being developed by the XLisp-stat and R developers.

C/C++/Fortran

BayeSys 3: Sophisticated suite of MCMC samplers, including transdimensional capability, by the author of MemSys.
fbm (Flexible Bayesian Modeling): MCMC for simple Bayes, nonparametric Bayesian regression and classification models based on neural networks and Gaussian processes, and Bayesian density estimation and clustering using mixture models and Dirichlet diffusion trees.
BayesPack, DCUHRE: Adaptive quadrature, randomized quadrature, Monte Carlo integration.
BIE, CDF Bayesian limits, CUBA (see above).

Other Statisticians' & Engineers' Tools

BUGS/WinBUGS (Bayesian inference Using Gibbs Sampling): Flexible software for the Bayesian analysis of complex statistical models using MCMC.
OpenBUGS: BUGS on Windows and Linux, and from inside R.
JAGS (Just Another Gibbs Sampler): MCMC for Bayesian hierarchical models.
XLisp-stat: Lisp-based data analysis environment, with an emphasis on providing a framework for exploring the use of dynamic graphical methods.
ReBEL: Library supporting recursive Bayesian estimation in MATLAB (Kalman filter, particle filters, sequential Monte Carlo).

Key Examples, part 5: Closing reflections

Closing Reflections

Philip Dawid (2000):
"What is the principal distinction between Bayesian and classical statistics? It is that Bayesian statistics is fundamentally boring. There is so little to do: just specify the model and the prior, and turn the Bayesian handle. There is no room for clever tricks or an alphabetic cornucopia of definitions and optimality criteria. I have heard people use this dullness as an argument against Bayesianism. One might as well complain that Newton's dynamics, being based on three simple laws of motion and one of gravitation, is a poor substitute for the richness of Ptolemy's epicyclic system. All my experience teaches me that it is invariably more fruitful, and leads to deeper insights and better data analyses, to explore the consequences of being a thoroughly boring Bayesian."

Dennis Lindley (2000):
"The philosophy places more emphasis on model construction than on formal inference... I do agree with Dawid that Bayesian statistics is fundamentally boring... my only qualification would be that the theory may be boring but the applications are exciting."
