Treatment and analysis of data, Applied statistics
Lecture 6: Bayesian estimation
Topics covered: Bayes' theorem again; relation to likelihood; transformation of a pdf; a trivial example; the Wiener filter; Malmquist bias; Lutz-Kelker bias; Bayes versus likelihood; the bus problem.

Sept-Oct 2006, Statistics for astronomers (L. Lindegren, Lund Observatory), Lecture 6.

Bayesian estimation: Thomas Bayes

Thomas Bayes, British mathematician, Presbyterian minister and Fellow of the Royal Society. The manuscript "Essay towards solving a problem in the doctrine of chances" was found after his death and published posthumously. It establishes a mathematical basis for probability inference by:
- treating model parameters as random variables with a prior distribution
- prescribing how the distribution is modified by data (Bayes' theorem)
- basing the inference on the resulting posterior distribution
Bayes' theorem

P(A&B) = P(A) P(B|A) = P(B) P(A|B), hence

P(A|B) = P(A) P(B|A) / P(B)

With A = model (M) and B = data (D):

P(M|D) = P(M) P(D|M) / P(D)

P(M) = prior probability of M (before D)
P(M|D) = posterior probability of M (in light of D)
P(D|M) = likelihood of M (given D)
P(D) = fixed [only needed to normalize P(M|D)]

For a continuous parameter θ with data x:

f_Θ(θ|x) = f(x|θ) f_Θ(θ) / ∫ f(x|θ) f_Θ(θ) dθ ∝ L(θ|x) f_Θ(θ)

Relation to Likelihood

f_Θ(θ|x) ∝ f_Θ(θ) L(θ|x), i.e. posterior ∝ prior × likelihood.

Treating θ as a random variable is still (after 240 years!) a somewhat controversial issue. Also the choice of prior distribution is seen as problematic ("subjective"). If the prior pdf is flat (over a reasonable interval of θ), then maximum a posteriori (MAP) estimation is equivalent to maximum likelihood (ML). A slanted or peaked prior may push the MAP away from the ML. If the data do not determine the parameter well (wide likelihood function), then the posterior depends strongly on the prior. Conversely, for well-determined problems the prior has little influence. Note that Bayes' theorem gives a pdf for θ, not a value (point estimate).
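These statements about flat versus peaked priors are easy to check numerically. A minimal sketch (my own hypothetical example, not from the lecture): 7 successes in 10 Bernoulli trials, posterior evaluated on a grid. With a flat prior the MAP reproduces the MLE k/n = 0.7; a prior peaked at 0.5 pulls the MAP toward its own center.

```python
import numpy as np

# Posterior on a grid: f(theta|x) is proportional to prior(theta) * L(theta|x).
# Hypothetical data: k = 7 successes in n = 10 Bernoulli trials.
theta = np.linspace(1e-6, 1 - 1e-6, 100001)
like = theta**7 * (1 - theta)**3          # L(theta | k=7, n=10), unnormalized

# Flat prior: the MAP estimate coincides with the MLE k/n = 0.7
# (normalization does not move the mode, so it is omitted here).
map_flat = theta[np.argmax(like)]

# Prior peaked at 0.5: the MAP is pulled away from the ML value.
prior = np.exp(-(theta - 0.5)**2 / (2 * 0.1**2))
map_peaked = theta[np.argmax(prior * like)]

print(map_flat)     # ≈ 0.7
print(map_peaked)   # between 0.5 and 0.7
```

Widening the likelihood (fewer trials) would move `map_peaked` further toward 0.5, which is the lecture's point about poorly determined parameters.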
Transformation of pdf

Let X be a random variable with pdf f_X(x) and Y another random variable obtained by the transformation Y = g(X), where g is some known function. What is the pdf f_Y(y) of the transformed variable? The general case is complex, and f_Y(y) may not even exist. However, if g is continuous and monotone, the answer is simple:

f_X(x) |dx| = f_Y(y) |dy|  =>  f_Y(y) = f_X(g⁻¹(y)) |dx/dy|

Note that the absolute value is needed in case g is decreasing.

Multivariate case: Y = g(X),

f_Y(y) = f_X(g⁻¹(y)) |det(∂x/∂yᵀ)|

The determinant is the Jacobian of the transformation.

A trivial (?) example (1/2)

Suppose we want to measure the intensity λ of a source by counting the number of photons, n, detected in a certain time interval. We assume that n ~ Poisson(λ). Given n = 10, what is the estimate of λ?

L(λ|n) = λⁿ exp(−λ) / n!  =>  MLE λ* = n = 10 (reasonable!)

Bayesian estimation (MAP) gives the same result if the prior distribution is flat between (say) 0 and 100. But if the prior state of knowledge is that we have no idea even about the order of magnitude of λ, then it can be argued that the prior pdf is flat in log λ rather than in λ. (Cf. the frequency of first digits in natural constants!) This implies a prior pdf inversely proportional to λ; thus

posterior pdf ∝ λ⁻¹ L(λ|n) ∝ λⁿ⁻¹ exp(−λ)

which has a maximum at λ = n − 1. Thus, the Bayesian MAP estimate is 9. (This gets really weird when n = 1.)
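A quick grid check of this example (a sketch, not part of the lecture): the posterior mode for n = 10 counts under the flat prior and under the 1/λ prior.

```python
import numpy as np

# Posterior mode for n = 10 Poisson counts under two priors.
n = 10
lam = np.linspace(1e-6, 80.0, 800001)
like = lam**n * np.exp(-lam)              # L(lambda|n), unnormalized

map_flat = lam[np.argmax(like)]           # flat prior: MAP = MLE = n
map_log = lam[np.argmax(like / lam)]      # prior ~ 1/lambda: MAP = n - 1

print(map_flat, map_log)                  # ≈ 10.0, 9.0
```

Setting n = 1 in the same script shows the "really weird" case: the 1/λ prior puts the posterior mode at λ = 0.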
A trivial (?) example (2/2)

But is the MAP (maximum a posteriori) estimate really what we want? An alternative in Bayesian estimation is to compute the posterior mean. Using that ∫₀^∞ λⁿ exp(−λ) dλ = n!, we find

for prior ∝ λ⁰: E(λ|n) = n + 1
for prior ∝ λ⁻¹: E(λ|n) = n

The MAP estimate and the posterior mean are not invariant to transformation of λ (while the MLE is). There is yet another Bayesian estimate which is invariant to transformation, namely the posterior median. It is more complicated to compute, but to a good approximation we have

for prior ∝ λ⁰: median(λ|n) ≈ n + 2/3
for prior ∝ λ⁻¹: median(λ|n) ≈ n − 1/3 (n > 0)

With different estimators we get anything between n − 1 and n + 1, but does it matter?

A more interesting example: Wiener filter (1/2)

Suppose we observe a continuous variable y = x + ε, where x and ε are independent Gaussian random variables with zero mean and standard deviations s (for signal) and n (for noise): x ~ N(0, s²), ε ~ N(0, n²). Given the value y, what is the estimate of x?

Prior pdf for x: f(x) = (2π)^(−1/2) s⁻¹ exp[−x²/2s²]
Likelihood function: L(x|y) = (2π)^(−1/2) n⁻¹ exp[−(y−x)²/2n²]
Posterior pdf for x: f(x|y) ∝ exp[−x²/2s² − (y−x)²/2n²]

Completing the square, we find that f(x|y) is Gaussian with variance s²n²/(s² + n²) and mean value x̂ = y s²/(s² + n²), which is the Bayesian estimate of x. (A Wiener filter has the transfer function R²/(1 + R²), where R = S/N.)
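A Monte Carlo check of the shrinkage estimate x̂ = y s²/(s² + n²) (my sketch, with arbitrary values s = 2, n = 1): it should beat the raw estimate x̂ = y in mean squared error, achieving the posterior variance s²n²/(s² + n²).

```python
import numpy as np

rng = np.random.default_rng(0)

s, nsd = 2.0, 1.0                          # signal and noise standard deviations
x = rng.normal(0.0, s, 200_000)            # x ~ N(0, s^2)
y = x + rng.normal(0.0, nsd, 200_000)      # observed y = x + eps

shrink = s**2 / (s**2 + nsd**2)            # Wiener factor, here 0.8
mse_raw = np.mean((y - x)**2)              # raw estimate xhat = y: MSE ≈ n^2 = 1
mse_wiener = np.mean((shrink * y - x)**2)  # ≈ s^2 n^2 / (s^2 + n^2) = 0.8

print(mse_raw, mse_wiener)                 # Wiener MSE is smaller (≈ 0.8 vs ≈ 1.0)
```

The shrinkage factor 0.8 here is exactly the transfer function R²/(1 + R²) with R = s/n = 2.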
Wiener filter (2/2)

Sometimes it is helpful to visualize Bayes' theorem by means of the joint pdf of the parameter (x) and the data (y).

[Figure: equiprobability curves of the joint pdf of (x, y), with the lines y = x and y = x + ε, the observed value y, and the resulting estimate x̂.]

Another example: Malmquist bias (1/3)

If a class of objects has an intrinsic spread in luminosity (or absolute magnitude M), and we pick at random an object on the sky with apparent magnitude m, then that object is likely to be more luminous than is typical for the class. Malmquist (Lund Medd. Ser. II, No. 22, 1920) derived the required correction to the mean observed M as a function of the intrinsic spread in M and the observed distribution of m. In Bayesian terms, the effect can be understood as the difference between the prior (intrinsic, or true) distribution of M and the posterior (apparent) distribution of M for given m.

For a certain class of objects, assume for simplicity:
1. that the intrinsic luminosity function is M ~ N(M0, σ²);
2. that the objects are on average uniformly distributed in space;
3. that there is no extinction.

Let x = m − M = 5 log(r/10 pc) denote the distance modulus.
Malmquist bias (2/3)

Assumptions 2 and 3 imply that the distance modulus x = m − M = 5 log(r/10 pc) has the (improper) pdf f(x) ∝ exp(γx), where γ = 0.6 ln 10 ≈ 1.3816. Thus:

f(m − M) ∝ exp[γ(m − M)]
f(M) ∝ exp[−(M − M0)²/2σ²]
f(M|m) ∝ exp[γ(m − M)] exp[−(M − M0)²/2σ²]
       = exp[γ(m − M) − (M − M0)²/2σ²]
       = exp[γ(m − M0) + γ²σ²/2] exp[−(M − M0 + γσ²)²/2σ²]

thus f(M|m) ~ N(M0 − γσ², σ²), so the mean absolute magnitude of the objects with apparent magnitude m is ⟨M⟩ = M0 − γσ².

Malmquist bias (3/3)

[Figure: equiprobability curves of the joint pdf of (M, m), with lines of constant x, illustrating the shift of the mean of M at given m from M0 to M0 − γσ².]
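The result ⟨M⟩ = M0 − γσ² can be verified by simulation. A sketch (my own choice of values, M0 = 0 and σ = 0.5): draw absolute magnitudes from N(M0, σ²), distance moduli with density ∝ exp(γx) (objects uniform in space), and average M over a narrow slice in apparent magnitude.

```python
import numpy as np

rng = np.random.default_rng(1)

M0, sigma = 0.0, 0.5
gamma = 0.6 * np.log(10)                 # ≈ 1.3816
N = 2_000_000
M = rng.normal(M0, sigma, N)             # intrinsic absolute magnitudes

# Distance moduli with density ~ exp(gamma*x) on [0, 10] (uniform in space),
# drawn by inverse-transform sampling.
u = rng.random(N)
x = np.log(u * (np.exp(gamma * 10) - 1) + 1) / gamma
m = M + x                                # apparent magnitudes

sel = np.abs(m - 8.0) < 0.05             # narrow slice in apparent magnitude
mean_sel = M[sel].mean()
print(mean_sel)                          # ≈ M0 - gamma*sigma^2 ≈ -0.345
```

The slice at m = 8 is chosen so that the ±3σ range of M stays inside the simulated distance-modulus interval; near the edges of the interval the improper prior is truncated and the simple formula would not apply.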
Yet another example: Lutz-Kelker bias (1/3)

Let p0 be the true parallax of a star and p the measured value. Assume that the measurement errors are Gaussian with zero mean and s.d. σ. Then, for any given star (with true parallax p0),

P(p < p0 | p0) = P(p > p0 | p0)

i.e., positive and negative errors are equally probable. Now consider instead any given measured parallax value p. Then, in general,

P(p0 < p | p) ≠ P(p0 > p | p)

that is, positive and negative errors are not equally probable! This may at first seem paradoxical, but a single example may be enough to make the statement credible: it is possible to obtain a negative value of the measured parallax, in which case P(p0 < p | p) = 0 and P(p0 > p | p) = 1.

Lutz-Kelker bias (2/3)

Lutz & Kelker (PASP 85, 573, 1973) discussed the use of trigonometric parallaxes for luminosity calibration and derived a systematic correction depending on the relative parallax error (σ/p). In a stellar sample selected according to a lower limit on the observed parallax, the sample mean parallax is systematically too large, because the random errors will scatter more stars into the volume (with positive errors) than out of it (with negative errors).

The effect can be formulated in Bayesian terms. Let us assume:
1. that the observed parallax has the distribution p ~ N(p0, σ²);
2. that the number density (n, in pc⁻³) of stars of a given class decreases exponentially with the height z above the Galactic plane: n = n0 exp(−|z|/H), H = scale height;
3. that there is no extinction.

As a function of distance r = 1/p0 we have n(r) = n0 exp(−βr), where β = |sin b|/H.
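The scatter-in effect is easy to demonstrate by simulation. A sketch with my own arbitrary units (β = 1, σ = 0.1): since the number of stars per distance interval is dN/dr ∝ r² exp(−βr), distances can be drawn from a Gamma(3, 1/β) distribution; selecting on the observed parallax then gives measured values that are too large on average.

```python
import numpy as np

rng = np.random.default_rng(2)

# True parallaxes p0 = 1/r with dN/dr ~ r^2 exp(-beta*r), i.e. r ~ Gamma(3, 1/beta).
beta, sigma = 1.0, 0.1
r = rng.gamma(shape=3.0, scale=1.0 / beta, size=1_000_000)
p0 = 1.0 / r
p = p0 + rng.normal(0.0, sigma, p0.size)   # measured parallaxes

sel = p > 0.3                              # lower limit on the OBSERVED parallax
bias = (p[sel] - p0[sel]).mean()
print(bias)                                # positive: the sample mean parallax is too large
```

Selecting on the true parallax (`p0 > 0.3`) instead makes the bias vanish, which isolates the selection effect.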
Lutz-Kelker bias (3/3)

The number of stars in solid angle ω with distance r to r + dr is dN = ω r² n(r) dr. With r = p0⁻¹ and dr = −p0⁻² dp0, this gives dN/dp0 ∝ p0⁻⁴ exp(−β/p0), thus:

f(p0) ∝ p0⁻⁴ exp(−β/p0)  (prior)
f(p|p0) ∝ exp[−(p − p0)²/2σ²]  (likelihood)
f(p0|p) ∝ p0⁻⁴ exp[−β/p0 − (p − p0)²/2σ²]  (posterior)

[Figure: likelihood L(p0), prior f(p0) and posterior B(p0) as functions of p0, for p = 10, σ = 1 and for p = 10, σ = 2, β = 10.]

Bayes versus Likelihood

Let D1 and D2 be two independent data sets relevant to the same model M. Since the data sets are independent, the total likelihood is

L(M|D1, D2) = L(M|D1) L(M|D2)

If we regard D1 as representing the knowledge about M before introducing D2, we have, after renormalization, essentially Bayes' theorem. The nice thing about Bayesian theory is that it gives a framework for treating the prior and posterior knowledge on exactly the same footing. The Bayesian approach also encourages us to think about the a priori assumptions in any experiment, which is probably a good thing.

Acknowledgement: In this lecture I have made use of some good ideas from Ned Wright's Journal Club Talk on Statistics.
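As a footnote to the Lutz-Kelker example, the posterior f(p0|p) ∝ p0⁻⁴ exp[−β/p0 − (p − p0)²/2σ²] can be evaluated on a grid. A sketch with one of the parameter sets from the figure (p = 10, σ = 2, β = 10): both the posterior mode and the posterior mean fall below the measured parallax, i.e. the measured value systematically overestimates p0.

```python
import numpy as np

# Lutz-Kelker posterior on a grid (p = 10, sigma = 2, beta = 10).
p, sigma, beta = 10.0, 2.0, 10.0
p0 = np.linspace(0.5, 25.0, 250001)
dp0 = p0[1] - p0[0]

logpost = -4.0 * np.log(p0) - beta / p0 - (p - p0)**2 / (2 * sigma**2)
post = np.exp(logpost - logpost.max())   # subtract max for numerical stability
post /= post.sum() * dp0                 # normalize on the grid

mode = p0[np.argmax(post)]
mean = (p0 * post).sum() * dp0
print(mode, mean)                        # both < 10
```

Shrinking σ toward 0 (a well-measured parallax) moves both numbers back toward p = 10, in line with the general remark that a sharp likelihood makes the prior unimportant.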
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationBayesian Inference in Astronomy & Astrophysics A Short Course
Bayesian Inference in Astronomy & Astrophysics A Short Course Tom Loredo Dept. of Astronomy, Cornell University p.1/37 Five Lectures Overview of Bayesian Inference From Gaussians to Periodograms Learning
More informationPhysics 6720 Introduction to Statistics April 4, 2017
Physics 6720 Introduction to Statistics April 4, 2017 1 Statistics of Counting Often an experiment yields a result that can be classified according to a set of discrete events, giving rise to an integer
More informationPredictive Hypothesis Identification
Marcus Hutter - 1 - Predictive Hypothesis Identification Predictive Hypothesis Identification Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ ANU RSISE NICTA Marcus Hutter - 2 - Predictive
More informationParameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn
Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation
More informationTheory of Maximum Likelihood Estimation. Konstantin Kashin
Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationDoing Bayesian Integrals
ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to
More informationClassical and Bayesian inference
Classical and Bayesian inference AMS 132 Claudia Wehrhahn (UCSC) Classical and Bayesian inference January 8 1 / 11 The Prior Distribution Definition Suppose that one has a statistical model with parameter
More informationStatistical Methods in Particle Physics Lecture 1: Bayesian methods
Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan
More informationDD Advanced Machine Learning
Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend
More informationTLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall TLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall Problem 1.
TLT-5/56 COMMUNICATION THEORY, Exercise 3, Fall Problem. The "random walk" was modelled as a random sequence [ n] where W[i] are binary i.i.d. random variables with P[W[i] = s] = p (orward step with probability
More informationPhysics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10
Physics 509: Error Propagation, and the Meaning of Error Bars Scott Oser Lecture #10 1 What is an error bar? Someone hands you a plot like this. What do the error bars indicate? Answer: you can never be
More informationCS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev
CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts
More informationWinter 2019 Math 106 Topics in Applied Mathematics. Lecture 8: Importance Sampling
Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 8: Importance Sampling 8.1 Importance Sampling Importance sampling
More informationDS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling
DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationPhysics 2B Chapter 17 Notes - First Law of Thermo Spring 2018
Internal Energy o a Gas Work Done by a Gas Special Processes The First Law o Thermodynamics p Diagrams The First Law o Thermodynamics is all about the energy o a gas: how much energy does the gas possess,
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More information