Statistics for Data Analysis: a toolkit for the (astro)physicist


1 Statistics for Data Analysis: a toolkit for the (astro)physicist. Likelihood in Astrophysics. Denis Bastieri, Padova, February 21st 2018

2 RECAP The main product of statistical inference is the pdf of the model parameters given the data, a.k.a. the posterior probability. Under wide hypotheses, it is proportional to the likelihood. Maximizing the likelihood is then equivalent to finding the most probable values of the model parameters, i.e. the model that most likely fits the collected data. The analysis gets much simpler if the parameters of the model are independent, so that their probabilities can be factorized (much easier for the prior distribution), e.g. with a uniform (box) prior:

P(α | β, I) = P(α | I) = Π(α; α_min, α_max) = 1 / (α_max − α_min) for α_min ≤ α ≤ α_max, 0 otherwise

Given the huge dynamic range of the likelihood, it is better to work with its logarithm: the loglike.
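To make the first statement explicit (this just restates the recap as a worked equation; nothing here goes beyond the slide): by Bayes' theorem,

P(θ | {data}, I) = P({data} | θ, I) · P(θ | I) / P({data} | I) ∝ L(θ)

whenever the prior P(θ | I) is flat over the allowed range, as in the box prior above, since the denominator does not depend on θ. Hence maximizing the loglike ln L(θ) and maximizing the posterior pick out the same parameter values.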

3 RECAP If you think that the average of the draws here is a good estimator, think again! The average was not as good as the direct computation of the likelihood. The average does not really make sense here, nor does σ as an estimate of the uncertainty. Nevertheless, we can predict reasonably well the outcome of a draw. [Figure: sequence of draws plotted against draw number (up to 512), with values between 0 and 2.]

4 Revisiting Confidence Intervals If we can obtain the posterior pdf, the most probable value of the model's parameter will be the one that maximizes the posterior pdf:

dP(θ | {data}, I)/dθ |_{θ₀} = 0   and   d²P(θ | {data}, I)/dθ² |_{θ₀} < 0

The CI is related to the width of the posterior pdf around θ₀. We should rather deal with its logarithm, L = ln P(θ | {data}, I), which is much smoother than the posterior pdf itself. It is better to expand everything in a Taylor series around θ₀:

L(θ) = L(θ₀) + (1/2) d²L/dθ² |_{θ₀} (θ − θ₀)² + ...

5 Finding again sigma L = ln P(θ | {data}, I) ≈ L(θ₀) + (1/2) d²L/dθ² |_{θ₀} (θ − θ₀)²

Taking into account that the linear term is 0 and truncating after the second order, we obtain:

P(θ | {data}, I) ∝ exp[ (1/2) d²L/dθ² |_{θ₀} (θ − θ₀)² ]

and by comparison with the normal distribution we see that a good guess for the width is σ, with:

σ = ( − d²L/dθ² |_{θ₀} )^{−1/2}   and   θ ∈ [ θ₀ − σ, θ₀ + σ ]
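As a numerical illustration of this recipe (not from the slides), the sketch below builds a toy gaussian log-likelihood with known σ = 2, locates its maximum on a grid, and reads the 1-σ width off the curvature, exactly as in the formula above; every number in it is an assumption made for the example.

```python
import numpy as np

# Toy sketch: best-fit value and 1-sigma width from the curvature of the
# log-likelihood at its maximum. All numbers below are illustrative.
data = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=50)

def loglike(mu, sigma=2.0):          # Gaussian log-likelihood, sigma known
    return -0.5 * np.sum(((data - mu) / sigma) ** 2)

mu_grid = np.linspace(0.0, 6.0, 2001)
L = np.array([loglike(m) for m in mu_grid])
i0 = int(np.argmax(L))
mu0 = mu_grid[i0]

# numerical second derivative d2L/dmu2 at the maximum (central differences)
h = mu_grid[1] - mu_grid[0]
d2L = (L[i0 + 1] - 2.0 * L[i0] + L[i0 - 1]) / h**2
sigma_mu = 1.0 / np.sqrt(-d2L)

print(f"mu0 = {mu0:.3f} +/- {sigma_mu:.3f}")   # expect roughly 2/sqrt(50) ~ 0.28
```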

6 A gaussian measurement During the measurement of a quantity, the collected data will be affected by gaussian noise, i.e. the errors follow a gaussian distribution. The i-th datum will come out with probability:

P(x_i | μ, σ, I) = 1/(σ√(2π)) exp[ −(1/2) ((x_i − μ)/σ)² ]

Let us suppose for now that σ is known, so it could be subsumed into I; we will nevertheless keep it explicit, so that the formula can be adapted more easily to the case where σ is not known.

7 Inferring μ The most probable value of μ can be found by looking at its posterior pdf:

P(μ | {x_i}, σ, I) ∝ P({x_i} | μ, σ, I) · P(μ | σ, I)

Assuming the independence of the x_i:

L(μ, {x_i}) = P({x_i} | μ, σ, I) = ∏_{i=1}^N P(x_i | μ, σ, I)

and with the further assumption of maximum ignorance on μ:

P(μ | σ, I) = P(μ | I) = 1/(μ_max − μ_min) for μ_min ≤ μ ≤ μ_max, 0 otherwise

so that:

ln P(μ | {x_i}, σ, I) = const + ln L = const + Σ_{i=1}^N ln P(x_i | μ, σ, I)

8 Inferring μ as the sample average! Making the gaussian dependence explicit:

ln L = const − (1/2) Σ_{i=1}^N ((x_i − μ)/σ)²

The maximum likelihood is at the stationary point:

d(ln L)/dμ |_{μ₀} = 0  ⟹  Σ_{i=1}^N (x_i − μ₀)/σ² = 0  ⟹  μ₀ = (1/N) Σ_{i=1}^N x_i

with second derivative d²(ln L)/dμ² |_{μ₀} = −N/σ², so that the CI is:

μ ∈ [ μ₀ − σ/√N, μ₀ + σ/√N ]
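A quick numerical cross-check of this result (again a toy example with assumed numbers, not from the slides): maximizing the gaussian loglike with an optimizer should return the sample average, with uncertainty σ/√N.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy check: with known sigma, the ML estimate of mu coincides with the
# sample average, with error sigma/sqrt(N). All numbers are illustrative.
rng = np.random.default_rng(1)
sigma = 0.5
x = rng.normal(loc=10.0, scale=sigma, size=100)

def neg_loglike(mu):
    return 0.5 * np.sum(((x - mu) / sigma) ** 2)

res = minimize_scalar(neg_loglike)

print("ML estimate  :", res.x)
print("sample mean  :", x.mean())
print("sigma/sqrt(N):", sigma / np.sqrt(len(x)))
```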

9 No info about σ The i-th datum will come out with probability:

P(x_i | μ, σ, I) = 1/(σ√(2π)) exp[ −(1/2) ((x_i − μ)/σ)² ]

Being interested in the actual value of the measurement, we should integrate out the σ dependence:

P(μ | {x_i}, I) = ∫₀^∞ dσ P(μ, σ | {x_i}, I)

A uniform prior could be: P(μ, σ | I) = constant for σ > 0, 0 otherwise.

Leveraging event independence, the likelihood will be:

P({x_i} | μ, σ, I) = (σ√(2π))^{−N} exp[ −(1/(2σ²)) Σ_{i=1}^N (x_i − μ)² ]

10 No info about σ The posterior probability will then be described by (substituting s = 1/σ):

P(μ | {x_i}, I) ∝ ∫₀^∞ ds s^{N−2} exp[ −(s²/2) Σ_i (x_i − μ)² ]

Substituting finally x = s [ Σ_i (x_i − μ)² ]^{1/2}, and absorbing the definite integral in x into the proportionality relation:

P(μ | {x_i}, I) ∝ [ Σ_{i=1}^N (x_i − μ)² ]^{−(N−1)/2}

Finding the maximum and the width:

d(ln P_μ)/dμ |_{μ₀} = 0  ⟹  Σ_i (x_i − μ₀) = 0  ⟹  μ₀ = (1/N) Σ_{i=1}^N x_i

d²(ln P_μ)/dμ² |_{μ₀} = − N(N−1) / Σ_i (x_i − μ₀)²

so that μ ≈ μ₀ ± S/√N, with S² = Σ_i (x_i − μ₀)² / (N − 1)
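The same result can also be checked by brute force. The sketch below (the data set, the grids and the prior cut-offs are all assumptions made for the example) marginalizes σ numerically with a flat prior and compares the width of the resulting μ posterior with the analytic S/√N; for small N the two agree only approximately, since the marginal posterior is broader than a gaussian.

```python
import numpy as np

# Toy sketch: numerical marginalization of sigma with a flat prior, compared
# with the analytic mu0 +/- S/sqrt(N). All numbers are illustrative.
rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.5, size=20)
N = len(x)

mu_grid = np.linspace(3.5, 6.5, 601)
sig_grid = np.linspace(0.3, 5.0, 500)
dmu, dsig = mu_grid[1] - mu_grid[0], sig_grid[1] - sig_grid[0]

M, SIG = np.meshgrid(mu_grid, sig_grid, indexing="ij")
chi2 = ((x[None, None, :] - M[:, :, None]) / SIG[:, :, None]) ** 2   # (mu, sigma, i)
loglike = -N * np.log(SIG) - 0.5 * chi2.sum(axis=-1)

post_mu = np.exp(loglike - loglike.max()).sum(axis=1) * dsig         # marginalize sigma
post_mu /= post_mu.sum() * dmu

mu0 = (mu_grid * post_mu).sum() * dmu
width = np.sqrt(((mu_grid - mu0) ** 2 * post_mu).sum() * dmu)
S2 = ((x - x.mean()) ** 2).sum() / (N - 1)
print(f"numerical: mu = {mu0:.3f} +/- {width:.3f}")
print(f"analytic : mu = {x.mean():.3f} +/- {np.sqrt(S2 / N):.3f}")
```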

11 Signal and background The typical experiment in astrophysics is photon counting: in a given direction of the sky, in a given energy bin, we detect O_i photons, partly due to a source (the signal) and partly due to a diffuse component (the background). How do we disentangle the two? The index i could as well run over angular and/or energy bins. Assuming the signal follows a gaussian shape:

E_i = N₀ [ S exp( −(1/2) ((x_i − μ)/σ)² ) + B ]

where N₀ is a constant related to the measurement time, S represents the signal, B the background, x_i is the position in energy or direction, and E_i is the expected number of photons.

12 Poisson (or Cash) statistics The probability of observing O_i events when we expect E_i follows a Poisson distribution:

P(O_i | S, B, I) = E_i^{O_i} e^{−E_i} / O_i!

Assuming the independence of events (here more tricky):

P({O_i} | S, B, I) = ∏_{i=1}^N E_i^{O_i} e^{−E_i} / O_i!

and a uniform prior over the models (even more tricky):

P(S, B | I) = constant for S ≥ 0 and B ≥ 0, 0 otherwise

the logarithm of the posterior pdf is now:

L = ln P(S, B | {O_i}, I) = const + Σ_{i=1}^N [ O_i ln(E_i) − E_i ]
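As a purely illustrative sketch of a Cash-statistic fit, the snippet below generates toy counts from the gaussian-plus-background model of the previous slide and recovers S and B by maximizing Σ_i [O_i ln(E_i) − E_i]; every numerical value (N₀, μ, σ, the true S and B, the binning) is an assumption made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch of a Cash-statistic fit. Model of the previous slide:
#   E_i = N0 * ( S * exp(-0.5*((x_i - mu)/sigma)**2) + B )
# All numbers below are illustrative assumptions.
rng = np.random.default_rng(3)
x = np.linspace(-5.0, 5.0, 31)            # bin centres (energy or position)
N0, mu, sigma = 100.0, 0.0, 1.0           # taken as known here
S_true, B_true = 0.8, 0.2

def expected(S, B):
    return N0 * (S * np.exp(-0.5 * ((x - mu) / sigma) ** 2) + B)

O = rng.poisson(expected(S_true, B_true))  # observed counts per bin

def neg_loglike(params):                   # -(sum O_i ln E_i - E_i), constant dropped
    S, B = params
    E = expected(S, B)
    return -np.sum(O * np.log(E) - E)

fit = minimize(neg_loglike, x0=[0.5, 0.5], bounds=[(1e-6, None), (1e-6, None)])
print("best-fit S, B:", fit.x)             # should be close to 0.8, 0.2
```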

13 Posterior pdf [Figure: contour plots of the posterior pdf of (S, B), drawn at the 10%, 30%, 50%, 70% and 90% levels, for μ = 0 and FWHM equal to the data range (15 bins). Row 1: N₀ such that max{E_i} = 100. Row 2: observation time T_Row2 = T_Row1 / 10 (truncation at the S, B ≥ 0 boundary!). Row 3: 31 data points (max E_i = 100). Row 4: 7 data points (max E_i = 100): correlation between S and B!]

14 Likelihood ratio test The likelihood ratio is the (maximized) likelihood of the null hypothesis divided by that of the alternative hypothesis:

Λ = sup{L(H₀; {data})} / sup{L(H₁; {data})}

There is no dependence on P({data} | I)! The Neyman-Pearson lemma holds: the likelihood ratio test has the highest power at a given significance level. The null hypothesis is rejected if Λ is too small.

15 Likelihood ratio Hypotheses may be nested: the one with more parameters can be transformed into the simpler model by imposing a set of constraints. In a way, this simplifies the ratio P({model 0} | I) / P({model 1} | I)! If the hypotheses are nested, Wilks' theorem holds: the test statistic TS = −2 ln(Λ) is distributed like a χ²(dof₁ − dof₀). The null hypothesis is rejected when TS is large!

16 Likelihood ratio (example) Let us fit our data with a power law (PL) and with a power law with exponential cutoff (PLE):

dN/dE = A_PL E^{−α_PL}        (PL)
dN/dE = A_PLE E^{−α_PLE} exp(−β_PLE E)        (PLE)

The data should be fitted better by the PLE (otherwise there is no need for an additional parameter: Occam's razor) and the likelihood should increase. The models are nested, since setting β_PLE = 0 turns the PLE into a PL, and the difference in dof is 1. We should compute the two likelihood suprema (one per hypothesis), their ratio Λ (< 1) and TS = −2 ln(Λ), then compare against the χ²(1) statistics.
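A minimal sketch of the last step (the two maximized log-likelihood values are invented for illustration; only the TS and p-value machinery matters):

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods for the two nested fits (made up):
lnL_pl  = -1234.7    # power law, H0
lnL_ple = -1219.3    # power law with exponential cutoff, H1 (one extra dof)

TS = 2.0 * (lnL_ple - lnL_pl)    # = -2 ln(Lambda)
p_value = chi2.sf(TS, df=1)      # Wilks: TS ~ chi2(dof_1 - dof_0) = chi2(1)

# The cutoff is significant if p_value is small; sqrt(TS) is the usual
# shorthand for the equivalent number of sigmas (see the next slides).
print(f"TS = {TS:.1f}, p = {p_value:.2e}, sqrt(TS) = {TS**0.5:.1f}")
```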

18 How likely is a source? A point-like source should produce some flux, so some of the detected gamma rays should be attributed to it. If the flux model describing the new, hypothetical source has n parameters (examples: PL = 2 parameters, PLE = 3, LogParabola = 3, ...), then we can compare the maximum likelihood of H₀ (no new source) with that of H₁ (H₀ plus the additional source). Λ, converted to a TS value, will follow a χ²(n) distribution. Quite typically, TS ≈ σ².
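A short sketch of that conversion (the TS value and the parameter counts are made-up examples): turn a detection TS into a p-value with the appropriate χ²(n), then into a gaussian-equivalent significance; for n = 1 this approximately reproduces the TS ≈ σ² rule of thumb.

```python
from scipy.stats import chi2, norm

TS = 25.0                          # hypothetical detection TS
for n in (1, 2, 3):                # number of extra source parameters (assumed)
    p = chi2.sf(TS, df=n)
    sigma = norm.isf(p)            # one-sided gaussian-equivalent significance
    print(f"n = {n}: p = {p:.2e}  ->  ~{sigma:.2f} sigma  (sqrt(TS) = {TS**0.5:.1f})")
```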

19 Unbinned Likelihood With many bins in (E, Ω), the probability of having a γ in bin i during ΔT is low (Gauss → Poisson). If there are n_i observed γ in a given bin i, and the model predicts m_i γ:

p_i = m_i^{n_i} e^{−m_i} / n_i!

Likelihood: L = exp[−N_model] ∏_i m_i^{n_i} / n_i!

and the loglike: ln L = −N_model + Σ_i n_i ln(m_i) − Σ_i ln(n_i!)
(−N_model grows with the model counts, Σ_i n_i ln(m_i) grows with a correct prediction, Σ_i ln(n_i!) is a constant).

UNBINNED: the bins are so small that n_i = 0 or 1, hence

L_unbinned = exp[−N_model] ∏_i m_i   (i runs over the non-empty bins, i.e. over the detected events)
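As a toy illustration of the unbinned formula (every number, the power-law spectral shape and the fixed amplitude are assumptions made for the example, not from the slides), the sketch below draws event energies from a power law and fits the index by maximizing ln L = −N_pred + Σ_events ln m(E):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy unbinned likelihood:  ln L = -N_pred + sum_k ln m(E_k)
# with a power-law differential rate m(E) = A * E**(-gamma) on [Emin, Emax].
# All numbers below are illustrative assumptions.
rng = np.random.default_rng(4)
Emin, Emax = 1.0, 100.0
gamma_true, A_true = 2.0, 300.0

def n_pred(gamma, A=A_true):
    return A * (Emin**(1 - gamma) - Emax**(1 - gamma)) / (gamma - 1)

# draw toy event energies from the power law (inverse-transform sampling)
n_events = rng.poisson(n_pred(gamma_true))
u = rng.uniform(size=n_events)
E = (Emin**(1 - gamma_true)
     + u * (Emax**(1 - gamma_true) - Emin**(1 - gamma_true))) ** (1 / (1 - gamma_true))

def neg_unbinned_loglike(gamma, A=A_true):
    return n_pred(gamma, A) - np.sum(np.log(A * E**(-gamma)))

fit = minimize_scalar(neg_unbinned_loglike, bounds=(1.2, 3.0), method="bounded")
print("true gamma:", gamma_true, "  fitted gamma:", round(fit.x, 3))
```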

20 gtlike output

EG_v02: Normalization: / Npred: Flux: e-05 +/- e-05 photons/cm^2/s
GAL_v02: Value: / Npred: Flux: / e-05 photons/cm^2/s
3c454: Integral: / Index: / LowerLimit: 100 UpperLimit: Npred: ROI distance: TS value: Flux: e-05 +/- e-07 photons/cm^2/s
Total number of observed counts: 5330
Total number of model events:
log(likelihood):
Elapsed CPU time:

gtlike creates two output files: 1) results.dat: the fit results; 2) counts_spectra.fits: the counts in a proper energy binning

21 gtlike output The solid lines follow the order in which the sources are listed in the file results.dat: black) ROI fit; red) 1st source (3c454); green) 2nd source (isotropic); blue) 3rd source (galactic); ...

22 RECAP The likelihood is what the experimentalist supplies. Under wide hypotheses, it coincides with the posterior pdf. Maximizing the likelihood finds the set of model parameters that most likely fits the data. The zero of the 1st derivative of the log posterior pdf gives the best estimate of the parameter, and the 2nd derivative at that point gives its uncertainty. The likelihood ratio can be used to compare the likelihood of different hypotheses. Leveraging the Neyman-Pearson lemma, the likelihood ratio is the most powerful method to reject H₀. Leveraging Wilks' theorem, under not-so-strict hypotheses, we are able to compute the significance of the likelihood ratio.
