Bayesian Modeling of log N – log S


1 Bayesian Modeling of log N – log S. Irina S. Udaltsova, Vasileios Stampoulis. Department of Statistics, University of California, Davis. July 26th.

2 Introduction

Scientific Objectives: develop a comprehensive method to infer (properties of) the distribution of source fluxes for a wide variety of source populations.

Statistical Objectives:
- Inference: account for non-ignorable missing data
- Model Checking: evaluate the adequacy of a given model
- Model Selection: select the best model for a given dataset

Collaborators: Paul D. Baines (UCD), Andreas Zezas (University of Crete & CfA), Vinay Kashyap (CfA).

3 CHANDRA

4 Introduction: log N – log S

Cumulative number of sources detectable at a given sensitivity:

N(>S) = ∑_{i=1}^{N} I{S_i > S},

i.e., the number of source fluxes brighter than a threshold S. The name log N – log S refers to the relationship between (or plot of) log10 N(>S) and log10 S. In many cosmological applications there is strong theory expecting the log N – log S to obey a power law:

N(>S) = α S^(−θ), S > S_min.

Taking logarithms gives the linear log N(>S) vs. log S relationship.
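To make the definition concrete, here is a minimal sketch (ours, not the authors'; all names and values are made up) of computing the empirical log N – log S curve from a set of fluxes. On fluxes simulated from a Pareto law, a naive least-squares fit to the curve should recover a slope of roughly −θ:

```python
import numpy as np

def log_n_log_s(fluxes):
    """Return (log10 S, log10 N(>=S)) evaluated at each observed flux."""
    s = np.sort(np.asarray(fluxes))
    n_geq = np.arange(len(s), 0, -1)   # number of fluxes >= each sorted value
    return np.log10(s), np.log10(n_geq)

rng = np.random.default_rng(0)
S = 1e-17 * rng.random(500) ** (-1.0 / 0.8)    # Pareto(theta=0.8, tau=1e-17)
log_s, log_n = log_n_log_s(S)
slope = np.polyfit(log_s, log_n, 1)[0]         # should be near -0.8
print(f"fitted slope: {slope:.2f}")
```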

5 The Rationale for log N – log S Fitting

- Knowing the specific relationship for different objects (e.g., stars, galaxies, pulsars) gives a lot of information about the underlying physics (e.g., the mass of galaxies).
- It helps in tests of cosmological parameters, constrains evolutionary models, etc.

Primary Goal: estimate θ, the power-law slope, while properly accounting for detector uncertainties and biases.

Note: there is uncertainty on both the x- and y-axes (i.e., on N and S).

6 Flux Distribution

Probabilistic Connection: under independent sampling, linearity on the log N – log S scale is equivalent to the flux distribution being a Pareto distribution:

S_i | τ_1, θ ~ iid Pareto(θ, τ_1), i = 1, ..., N.

The piecewise-linear power law extends to a broken power-law model for the flux distribution, subject to a continuity constraint at the breakpoint τ_2:

log10(1 − F(s)) = α_0 − θ_1 log10(s), for τ_1 < s ≤ τ_2,
log10(1 − F(s)) = α_1 − θ_2 log10(s), for s > τ_2.

Equivalently, as a mixture (see the sketch below):

Y ≡ I·X_1 + (1 − I)·X_2, where I ~ Bernoulli(1 − (τ_1/τ_2)^(θ_1)), X_1 ~ Truncated-Pareto(τ_1, θ_1, τ_2), X_2 ~ Pareto(τ_2, θ_2).

This can be extended to a multiple broken power law with an arbitrary number, m − 1, of break-points.
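The mixture representation suggests a direct simulator. The sketch below is our own illustration (not the authors' code): it draws fluxes from a one-break broken Pareto by flipping the Bernoulli indicator I and inverting the two component CDFs.

```python
import numpy as np

def rbroken_pareto(n, theta1, theta2, tau1, tau2, rng):
    """Draw n fluxes from a one-break broken Pareto via the mixture
    Y = I*X1 + (1-I)*X2 with I ~ Bernoulli(1 - (tau1/tau2)**theta1)."""
    p = 1.0 - (tau1 / tau2) ** theta1     # probability of falling below the break
    below = rng.random(n) < p             # the indicator I
    s = np.empty(n)
    # X1 ~ Truncated-Pareto(tau1, theta1, tau2), by inverting its CDF
    u1 = rng.random(below.sum())
    s[below] = tau1 * (1.0 - u1 * p) ** (-1.0 / theta1)
    # X2 ~ Pareto(tau2, theta2)
    u2 = rng.random(n - below.sum())
    s[~below] = tau2 * u2 ** (-1.0 / theta2)
    return s

rng = np.random.default_rng(1)
fluxes = rbroken_pareto(10_000, theta1=0.5, theta2=1.5,
                        tau1=1e-17, tau2=1e-16, rng=rng)
```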

7 Missing Data

There are many potential causes of uncertainty and missing data in astronomical data:
- Low-count sources (below the detection threshold)
- Detector schedules (source not within detector range)
- Background contamination (e.g., total = source + background)
- Foreground contamination (other objects between the source and detector)
- etc.

Important: whether a source is observed is a function of its source count (intensity), which is unobserved for undetected sources. The missing-data mechanism is therefore non-ignorable and needs to be carefully accounted for in the analysis.

8 The Data

Flux is not measured directly; instead, the data are a list of photon counts, with some extra information about the background and detector properties:

Src_ID | Counts | Bgr_intensity | Src_area | Off_axis_L | Effective_area

together with an incompleteness function (selection function), specifying the probability of source detection under a range of conditions:

P(detecting a source with flux S, background intensity B, location L, and effective area E) ≡ g(S, B, L, E)

9 Single Power-Law Model

Standard power-law flux distribution:

S_i | τ, θ ~ iid Pareto(θ, τ), i = 1, ..., N.

Source and background photon counts:

Y_i^tot | S_i, B_i, L_i, E_i ~ indep Pois(λ(S_i, B_i, L_i, E_i) + k(B_i, L_i, E_i)), i = 1, ..., N.

Incompleteness (missing-data indicators):

I_i ~ Bernoulli(g(S_i, B_i, L_i, E_i)).

Prior distributions:

N ~ Neg-Bin(a_N, b_N), θ ~ Gamma(a, b), τ ~ Gamma(a_m, b_m).
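A generative sketch of this hierarchy follows. It is illustrative only: the detector maps λ(·) and k(·), the incompleteness curve g, and all hyperparameter values below are made-up stand-ins, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hyperparameters (illustrative values only)
a_N, b_N = 9.0, 0.03        # N ~ Neg-Bin(a_N, b_N), E[N] = a_N/b_N = 300
a, b = 10.0, 10.0           # theta ~ Gamma(a, b)
a_m, b_m = 2.0, 2.0e17      # tau ~ Gamma(a_m, b_m), E[tau] = 1e-17

theta = rng.gamma(a, 1.0 / b)
tau = rng.gamma(a_m, 1.0 / b_m)
N = rng.negative_binomial(a_N, b_N / (1.0 + b_N))

S = tau * rng.random(N) ** (-1.0 / theta)      # S_i ~ Pareto(theta, tau)

E = rng.uniform(0.5e17, 2.0e17, N)             # effective-area proxy (made up)
B = rng.uniform(0.1, 2.0, N)                   # expected background counts (made up)
lam, k = S * E, B                              # stand-ins for lambda(.) and k(.)
g = 1.0 - np.exp(-lam / 5.0)                   # toy incompleteness function

Y_tot = rng.poisson(lam + k)                   # total counts for every source
I = rng.random(N) < g                          # detection indicators I_i
print(f"N = {N}, detected n = {I.sum()}")
```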

10 Broken Power-Law Model

Broken power-law flux distribution:

S_i | τ, θ ~ iid Broken-Pareto(θ, τ), i = 1, ..., N.

Source and background photon counts:

Y_i^tot | S_i, B_i, L_i, E_i ~ indep Pois(λ(S_i, B_i, L_i, E_i) + k(B_i, L_i, E_i)), i = 1, ..., N.

Incompleteness (missing-data indicators):

I_i ~ Bernoulli(g(S_i, B_i, L_i, E_i)).

Prior distributions:

N ~ Neg-Bin(a_N, b_N), θ_j ~ indep Gamma(a_j, b_j), j = 1, ..., m, τ_1 ~ Gamma(a_m, b_m),
τ_j = τ_1 + ∑_{k=2}^{j} e^(η_k), η_j ~ indep Normal(µ_j, c_j), j = 2, ..., m.

11 Model Overview

Unusual points and important notes:
- The dimension of the missing data is unknown (care must be taken with conditioning)
- The incompleteness function g must be known well; it can take any form and is problem-specific
- The number of flux populations in the sample, m, must be specified
- The flux lower limit and break-points, τ_j, j = 1, ..., m, can be estimated
- Prior parameters can be science-based, i.e., weakly informative

12 Posterior Inference (Single Pareto)

Inference about θ, N, S, τ is based on the observed-data posterior distribution. Care must be taken with the variable-dimension marginalization over the unobserved fluxes. The single power-law posterior can be shown to be:

p(N, θ, τ, S_obs, Y_obs^src | n, Y_obs^tot, B_obs, L_obs, E_obs)
∝ C(N, n) (1 − π(θ, τ))^(N−n) I{n ≤ N}
× C(N + a_N − 1, a_N − 1) (1/(1 + b_N))^N (b_N/(1 + b_N))^(a_N) I{N ∈ Z+}
× (b^a / Γ(a)) θ^(a−1) e^(−bθ) I{θ > 0}
× (b_m^(a_m) / Γ(a_m)) τ^(a_m − 1) e^(−b_m τ) I{τ > 0}
× ∏_{i=1}^{n} [ p(B_i, L_i, E_i) g(S_i, B_i, L_i, E_i) θ τ^θ S_i^(−(θ+1)) I{τ < S_i}
    × ((λ_i + k_i)^(Y_i^tot) / Y_i^tot!) e^(−(λ_i + k_i)) I{Y_i^tot ∈ Z+}
    × C(Y_i^tot, Y_i^src) (λ_i/(λ_i + k_i))^(Y_i^src) (1 − λ_i/(λ_i + k_i))^(Y_i^tot − Y_i^src) I{Y_i^src ∈ {0, 1, ..., Y_i^tot}} ]

with λ_i ≡ λ(S_i, B_i, L_i, E_i) and k_i ≡ k(B_i, L_i, E_i).

13 Posterior Inference (Broken-Pareto)

The broken power-law posterior can be shown to be:

p(N, θ, τ, S_obs, Y_obs^src | n, Y_obs^tot, B_obs, L_obs, E_obs)
= [p(n, Y_obs^tot, B_obs, L_obs, E_obs)]^(−1) C(N, n) (1 − π(θ, τ))^(N−n) I{n ≤ N}
× C(N + a_N − 1, a_N − 1) (1/(1 + b_N))^N (b_N/(1 + b_N))^(a_N) I{N ∈ Z+}
× ∏_{j=1}^{m} (b_j^(a_j) / Γ(a_j)) θ_j^(a_j − 1) e^(−b_j θ_j) I{θ_j > 0}
× p(τ_1, ..., τ_m) I{0 < τ_1 < τ_2 < ... < τ_m}
× ∏_{i=1}^{n} [ p(B_i, L_i, E_i) g(S_i, B_i, L_i, E_i)
    × ∏_{j=1}^{m} { [∏_{l=1}^{j−1} (τ_{l+1}/τ_l)^(−θ_l)] (θ_j/τ_j) (S_i/τ_j)^(−(θ_j+1)) }^I{τ_j ≤ S_i < τ_{j+1}}
    × ((λ_i + k_i)^(Y_i^tot) / Y_i^tot!) e^(−(λ_i + k_i)) I{Y_i^tot ∈ Z+}
    × C(Y_i^tot, Y_i^src) (λ_i/(λ_i + k_i))^(Y_i^src) (1 − λ_i/(λ_i + k_i))^(Y_i^tot − Y_i^src) I{Y_i^src ∈ {0, 1, ..., Y_i^tot}} ]

with τ_{m+1} = +∞, λ_i ≡ λ(S_i, B_i, L_i, E_i), k_i ≡ k(B_i, L_i, E_i), and the empty-product convention ∏_{l=1}^{0} (τ_{l+1}/τ_l)^(−θ_l) = 1.

14 Computation Strategy: Gibbs Sampler

The Gibbs sampler consists of five steps:

[N | n, θ], [θ | n, N, S_obs, τ], [τ | n, N, θ, S_obs, B_obs, L_obs, E_obs],
[S_obs | N, θ, τ, I_obs, Y_obs^tot, Y_obs^src, B_obs, L_obs, E_obs],
[Y_obs^src | Y_obs^tot, B_obs, L_obs, E_obs, I_obs, S_obs],

where each parameter block is sampled via various numerical and probabilistic routines, including Metropolis–Hastings.

15 Simulation Example

Figure: (Left) MCMC output; posterior log N – log S (red: missing, gray: observed) and truth (blue). (Right) Posterior distributions for N, θ, τ.

16 Non-Ignorable Missingness

(Left) Non-ignorable (full) analysis: θ̂ = 0.990, interval (0.803, 1.192). (Right) Ignorable analysis: θ̂ = 0.784, interval (0.520, 0.978). Truth: θ =

17 Posterior Predictive Checking

Consider testing the hypothesis:

H_0: the model is correctly specified, vs. H_1: the model is not correctly specified.

MCMC draws allow us to check the adequacy of the model fit for Bayesian models using the posterior predictive distribution:

p(y* | y) = ∫ p(y*, θ | y) dθ = ∫ p(y* | θ) p(θ | y) dθ

Idea: (assuming conditional independence) we expect the predictive distribution of new data to look similar to the empirical distribution of the observed data.

Extension: we expect functional summaries of interesting features (e.g., test statistics) of the new data to look similar to the corresponding functions of the observed data.

18 Posterior Predictive Checking

Procedure:
1. Sample θ* from the posterior distribution p(θ | y)
2. Given θ*, sample y* (replicate photon counts) from p(y* | θ*)
3. Compute the desired statistic T(y*)
4. Compute the posterior predictive p-value = the proportion of replicate cases in which T(y*) exceeds T(y):

p_B = P(T(y*) ≥ T(y) | y, H_0).

Choices of test statistic T(y) to summarize the desired features of the distribution of the photon counts (a toy implementation follows):
- n = dim(y_1, ..., y_n)
- Max = max(y_1, ..., y_n)
- Med = Q_2(y_1, ..., y_n)
- IQR = Q_3(y_1, ..., y_n) − Q_1(y_1, ..., y_n)
- Skewness = skewness(y_1, ..., y_n)
- R²_{S_crude} = coefficient of determination between log N and log S_crude, where S_crude = (y − bg)·γ/E
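A minimal sketch of this procedure on a toy conjugate model (the helper names and the Poisson stand-in are ours; in the real application the replicates come from the log N – log S model):

```python
import numpy as np

def ppp_value(y_obs, post_draws, simulate, stat, rng):
    """Posterior predictive p-value: P(T(y*) >= T(y) | y)."""
    t_obs = stat(y_obs)
    t_rep = np.array([stat(simulate(th, rng)) for th in post_draws])
    return np.mean(t_rep >= t_obs)

rng = np.random.default_rng(3)
y = rng.poisson(5.0, size=100)                         # "observed" counts
# Conjugate Gamma(1, 1) posterior for a Poisson rate, as a stand-in sampler
draws = rng.gamma(1 + y.sum(), 1.0 / (1 + len(y)), size=1000)
iqr = lambda v: np.subtract(*np.percentile(v, [75, 25]))
simulate = lambda th, rng: rng.poisson(th, size=len(y))
print(f"PPP (IQR) = {ppp_value(y, draws, simulate, iqr, rng):.2f}")
```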

19 Posterior Predictive Check: Examples

If the model is true, the posterior predictive p-value will almost certainly be close to 0.5.

Figure: (Left) If the model is in fact true, the sample IQR of new sets of n observations looks similar to that of the observed n data points. (Right) The sample value of R²_{S_crude} for new sets of n observations does not look similar to that of the observed n data points, so the data are unlikely to have been produced by the current model.

20 Bivariate PP p-values

The bivariate posterior predictive p-value = the area under a slice of the bivariate posterior predictive distribution of two test statistics.

Figure: (Left) Model correctly specified. (Right) Model incorrectly specified.

21 Model Selection

Figure: (Left) Regular Pareto. (Right) Broken Pareto.

Note: similar posterior log N – log S, but very different predictive and inferential properties.

22 Model Selection

Model selection is a challenging problem in Bayesian applications. Popular methods are: Bayes factors, DIC, and WAIC.

Bayes factor:

p(M_1 | y) / p(M_2 | y) = [p(y | M_1) / p(y | M_2)] × [p(M_1) / p(M_2)] = BF_12 × p(M_1) / p(M_2)

- Has good model selection performance in simple settings
- Problem: requires evaluating the marginal probability of the data given the model M_k:

p(y | M_k) = ∫_Ω p(y | θ) p(θ) dθ.

23 Model Selection

DIC, the Deviance Information Criterion (Spiegelhalter, 2002):

DIC = −2 log p(y | θ̂_Bayes) + 2 p_DIC
p_DIC = 2 { log p(y | θ̂_Bayes) − E[log p(y | θ) | y] }

WAIC, the Widely Applicable Information Criterion (Watanabe, 2010):

WAIC = −2 ∑_{i=1}^{N} log E[p(y_i | θ) | y] + 2 p_WAIC
p_WAIC = 2 ∑_{i=1}^{N} { log E[p(y_i | θ) | y] − E[log p(y_i | θ) | y] }

Problem: in complex hierarchical problems, DIC and WAIC can have poor model selection performance due to over-fitting. Alternatives?
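For reference, WAIC in the form displayed above can be computed directly from a matrix of pointwise log-likelihood evaluations over MCMC draws. This is a generic sketch, not tied to the log N – log S code:

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """WAIC from log_lik of shape (L, N): entries log p(y_i | theta^(l))."""
    L = log_lik.shape[0]
    lppd_i = logsumexp(log_lik, axis=0) - np.log(L)   # log E[p(y_i|theta) | y]
    p_waic = 2.0 * np.sum(lppd_i - log_lik.mean(axis=0))
    return -2.0 * np.sum(lppd_i) + 2.0 * p_waic

# Toy usage: normal model with known variance, draws around the data mean
rng = np.random.default_rng(4)
y = rng.normal(0.0, 1.0, size=50)
mus = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)
log_lik = -0.5 * (y[None, :] - mus[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
print(f"WAIC = {waic(log_lik):.1f}")
```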

24 Adaptive Fence Method (AF)

The original adaptive fence method (Jiang et al., 2008). For a collection of nested candidate models M_1, M_2, ..., M_k, with full model M̃ = M_k:
- Select a measure of goodness of fit, Q (e.g., Q(M) = −log p(y | θ̂_MLE))
- Select an optimality criterion (minimal dimension)
- Construct a fence, Q(M_j) − Q(M̃) ≤ c, where c is a constant; necessarily Q(M_j) ≥ Q(M̃) for all j
- Label a model M_c if it is inside the fence and satisfies the optimality criterion
- The model maximizing p* = P*(M_c = M_opt) is chosen as the correct model
- The probability P* is evaluated under the full model M̃

25 Adaptive Fence Method (AF)

A recent article (Jiang, 2014) reviews various fence methods.

Figure: example of the adaptive fence method (plot of p* against the cut-off c).

- The adaptive fence (AF) selects the cut-off c by maximizing the empirical probability of selection, p*
- A parametric bootstrap is used to evaluate the maximum empirical probability P*, given the parameter MLE of model M̃
- Problem: MLEs are not easily available for the log N – log S application with missing data
- Problem: AF has adequate performance when the true model is (closest to) the boundary model of the nested candidate models
- Problem: non-nested candidate models, for which no full model can be easily identified?

26 Bayesian Adaptive Fence Method (BAFM)

We extend the AF method to Bayesian applications and propose:
- Use a non-parametric bootstrap
- Use alternative (Bayesian-version) measures of goodness of fit:

Q_1 = −∑_{i=1}^{N} log [ (1/L) ∑_{l=1}^{L} p(y_i | β^(l)_MCMC) ],
Q_2 = −log p(y | β_PostMean),
Q_3 = −log p(y | β_PostMedian),
Q_4 = −log p(y | β_PostMode),
Q_5 = −(1/L) ∑_{l=1}^{L} log p(y | β^(l)_MCMC),
Q_6 = −Median{ log p(y | β^(l)_MCMC), l = 1, ..., L },

where β^(l)_MCMC, l = 1, ..., L, are the collected MCMC parameter draws.
- Introduce two phantom boundary models of best and worst fit:

Q(M̃_{k+1}) = 0 and Q(M_0) > max{ Q(M_1), ..., Q(M_k) }.
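Given stored MCMC log-likelihood evaluations, several of the Q measures reduce to one-liners. A sketch (assuming, as in the reconstruction above, that each Q is a negative log-likelihood, so smaller means better fit):

```python
import numpy as np
from scipy.special import logsumexp

def bafm_q(log_lik_joint, log_lik_pointwise):
    """Q1, Q5, Q6 from MCMC output.

    log_lik_joint     : shape (L,),   log p(y | beta^(l))
    log_lik_pointwise : shape (L, N), log p(y_i | beta^(l))
    (Q2-Q4 additionally require re-evaluating the likelihood at the
    posterior mean / median / mode of beta.)
    """
    L = log_lik_pointwise.shape[0]
    q1 = -np.sum(logsumexp(log_lik_pointwise, axis=0) - np.log(L))
    q5 = -np.mean(log_lik_joint)
    q6 = -np.median(log_lik_joint)
    return q1, q5, q6
```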

27 Example: p* vs. c plot (BAFM)

Figure: p* based on Q_6, with N = 200 (plot of p* against the cut-off c).

28 Simulation of Model Selection Performance

Simulated true models:

SimID | Truth | τ_1, τ_2, τ_3 | θ_1, θ_2, θ_3
1 | bp | – | 0.5, –, –
2 | bp | – | 0.5, 0.7, –
3 | bp | – | 0.5, 1.5, –
4 | bp | – | –, 1.0, –
5 | bp | – | –, 0.7, 0.9

Results:

SimID | Truth | DIC mean | DIC median | DIC mode | BF | BAFM
1 | bp | – | – | – | – | (Q_4)
2 | bp | – | – | – | – | (Q_4)
3 | bp | – | – | – | – | (Q_4)
4 | bp | – | – | – | – | (Q_4)
5 | bp | – | – | – | – | (Q_2)

Table: proportion of correctly selecting the true model based on DIC at the posterior mean, median, and mode, BF with a harmonic-mean approximation, and the Bayesian Adaptive Fence Method for the best-performing measure Q with automatic peak detection.

29 BAFM Issues

- Which criterion measure Q should be used? The complete likelihood is unknown.
- The middle peak does not always exist, so a single best model cannot always be selected uniquely.
- Very computationally intensive: the MCMC needs to be re-run for each bootstrap re-sample.

30 References

- P.D. Baines, I.S. Udaltsova, A. Zezas, V.L. Kashyap (2011). Bayesian Estimation of log N – log S. Proc. of Statistical Challenges in Modern Astronomy V.
- P.D. Baines, I.S. Udaltsova, A. Zezas, V.L. Kashyap (2016). Hierarchical Modeling for log N – log S: Theory, Methods and Applications. Working paper.
- J. Jiang et al. (2008). Fence methods for mixed model selection. Ann. Stat., 36.
- J. Jiang (2014). The Fence Methods. Adv. Stat., 2014.
- R.J.A. Little, D.B. Rubin (2002). Statistical Analysis with Missing Data. Wiley.
- R.K.W. Wong, P. Baines, A. Aue, T.C.M. Lee, V.L. Kashyap (2014). Automatic estimation of flux distributions of astrophysical source populations. Ann. Appl. Stat., 8:1690–1712.
- A. Zezas et al. (2004). Chandra survey of the Bar region of the SMC. Revista Mexicana de Astronomia y Astrofisica (Serie de Conferencias), Vol. 20, IAU Colloquium 194.

31 Validating Bayesian Computation

Given the complexity of the hierarchical model and computation, it is important to validate that everything works correctly: a self-consistency check.

Procedure:
1. Simulate parameters from the prior, and data from the model given those parameters
2. Fit the model to obtain posterior intervals
3. Record whether or not the true value of the parameter falls within the interval
4. Repeat steps 1–3 a large number of times and calculate the average coverage

The average and nominal coverages should be equal. These validation checks are extremely important when dealing with complex procedures.
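A toy version of this check, using a conjugate Poisson–Gamma model in place of the full log N – log S model (all names and values below are illustrative; in the real application step 2 runs the MCMC):

```python
import numpy as np

rng = np.random.default_rng(11)
a, b, n, nominal = 2.0, 1.0, 50, 0.90
n_rep, hits = 500, 0

for _ in range(n_rep):
    theta = rng.gamma(a, 1.0 / b)                  # 1. truth from the prior ...
    y = rng.poisson(theta, size=n)                 #    ... and data given the truth
    post_a, post_b = a + y.sum(), b + n            # 2. exact posterior here
    draws = rng.gamma(post_a, 1.0 / post_b, 5000)
    lo, hi = np.quantile(draws, [(1 - nominal) / 2, (1 + nominal) / 2])
    hits += (lo <= theta <= hi)                    # 3. record coverage
                                                   # 4. repeat and average
print(f"actual coverage = {hits / n_rep:.3f} vs nominal {nominal}")
```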

32 Model Validation

Figure: (Left) 90% posterior intervals vs. truth for τ_1 = S_min across 200 simulated datasets show reasonable coverage, roughly 90% of the time, when nominal coverage is set to 90% (flagged intervals marked). (Right) Actual vs. nominal coverage (200 datasets): the average coverage probability for each parameter (N, θ, S_min, S_obs) is within the expected bounds at each nominal probability.

33 Computational Details

The Gibbs sampler consists of five steps:

[N | n, θ], [θ | n, N, S_obs, τ], [τ | n, N, θ, S_obs, B_obs, L_obs, E_obs],
[S_obs | N, θ, τ, I_obs, Y_obs^tot, Y_obs^src, B_obs, L_obs, E_obs],
[Y_obs^src | Y_obs^tot, B_obs, L_obs, E_obs, I_obs, S_obs].

Sample the total number of sources, N (numerical integration):

p(N | ···) ∝ I{n ≤ N} C(N, n) (1 − π(θ, τ))^(N−n) p(N) p(S_obs | N, θ, τ)
∝ [Γ(N + a_N) / Γ(N − n + 1)] [(1 − π(θ, τ)) / (b_N + 1)]^N I{n ≤ N}

The marginal probability of observing a source, π(θ, τ), is pre-computed via numerical integration:

π(θ, τ) = ∫ g(s, B, L, E) p(s | θ, τ) p(B, L, E) ds dB dL dE.
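Because the conditional for N is a discrete distribution with geometrically decaying tails, one simple way to draw from it is to enumerate log-probabilities on a grid up to a cap. This is our own sketch of such a draw (the paper's numerical-integration details may differ; all parameter values are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def sample_N(n, a_N, b_N, pi, rng, N_max=10_000):
    """Draw N from p(N|...) ∝ Γ(N+a_N)/Γ(N-n+1) * ((1-pi)/(1+b_N))^N, N >= n,
    by enumerating normalized probabilities on a finite grid."""
    N_grid = np.arange(n, N_max + 1)
    logp = (gammaln(N_grid + a_N) - gammaln(N_grid - n + 1)
            + N_grid * np.log((1.0 - pi) / (1.0 + b_N)))
    p = np.exp(logp - logp.max())          # stabilize before normalizing
    return rng.choice(N_grid, p=p / p.sum())

rng = np.random.default_rng(5)
print(sample_N(n=225, a_N=9.0, b_N=0.03, pi=0.3, rng=rng))
```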

34 Computational Details cont...

Sample the power-law slope, θ (Metropolis–Hastings using a Normal proposal):

p(θ | ···) ∝ p(θ) p(S_obs | N, θ, τ) (1 − π(θ, τ))^(N−n)
∝ (1 − π(θ, τ))^(N−n) Gamma(θ; a + n, b + ∑_{i=1}^{n} log(S_i/τ))

Sample the observed source photon counts Y_obs,i^src, i = 1, ..., n:

p(Y_i^src | ···) = Bin(Y_i^src; Y_i^tot, λ(S_i, B_i, L_i, E_i) / (λ(S_i, B_i, L_i, E_i) + k(B_i, L_i, E_i)))
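A sketch of these two updates (ours, under the conditionals reconstructed above): a random-walk Metropolis–Hastings step for θ, plus the conjugate binomial draw for Y^src. Here pi_fn is a placeholder for the pre-computed marginal detection probability π(θ, τ):

```python
import numpy as np
from scipy.stats import gamma

def theta_mh_step(theta, S_obs, tau, N, n, pi_fn, a, b, step, rng):
    """One MH update targeting (1-pi(theta))^(N-n) * Gamma(theta; a+n, rate)."""
    rate = b + np.sum(np.log(S_obs / tau))
    def log_target(t):
        if t <= 0:
            return -np.inf
        return ((N - n) * np.log1p(-pi_fn(t))
                + gamma.logpdf(t, a + n, scale=1.0 / rate))
    prop = theta + step * rng.normal()          # symmetric Normal proposal
    accept = np.log(rng.random()) < log_target(prop) - log_target(theta)
    return prop if accept else theta

def sample_y_src(y_tot, lam, k, rng):
    """Y_i^src | ... ~ Binomial(Y_i^tot, lam_i / (lam_i + k_i)), vectorized."""
    return rng.binomial(y_tot, lam / (lam + k))
```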

35 Computational Details cont...

Sample the minimum flux τ (Metropolis–Hastings using a truncated-Normal proposal after a log transformation):

p(τ | ···) ∝ p(τ) p(θ) p(N) p(B_obs, L_obs, E_obs) p(n, S_obs, I_obs | N, θ, τ, B_obs, L_obs, E_obs)
∝ Gamma(τ; a_m + nθ, b_m) (1 − π(θ, τ))^(N−n) I{0 < τ < c_m}

p(η = log(τ) | ···) = e^η p(τ = e^η | ···) ∝ e^(η(nθ + a_m)) e^(−b_m e^η) (1 − π(θ, τ = e^η))^(N−n) I{η < log(c_m)}.

Sample the fluxes S_obs,i, i = 1, ..., n (Metropolis–Hastings using a Normal proposal):

p(S_i | ···) ∝ p(S_i | N, θ, τ) p(I_i = 1 | S_i, B_i, L_i, E_i) p(Y_i^tot | S_i, B_i, L_i, E_i) p(Y_i^src | Y_i^tot, S_i, B_i, L_i, E_i)
∝ Pareto(S_i; θ, τ) g(S_i, B_i, L_i, E_i) Pois(Y_i^tot; λ(S_i, B_i, L_i, E_i) + k(B_i, L_i, E_i)) Bin(Y_i^src; Y_i^tot, λ(S_i, B_i, L_i, E_i) / (λ(S_i, B_i, L_i, E_i) + k(B_i, L_i, E_i)))
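A sketch of the per-source flux update. For positivity we use a random walk on log S (a slight variation of the Normal proposal stated above), which adds a Jacobian term S′/S to the acceptance ratio; λ(·) = S·E and k(·) = B are our simplified stand-ins for the detector maps:

```python
import numpy as np

def flux_mh_step(S, y_tot, y_src, E, B, theta, tau, g_fn, step, rng):
    """MH update for one observed flux, targeting
    Pareto(S) * g(S) * Pois(y_tot; lam+k) * Bin(y_src; y_tot, lam/(lam+k))."""
    def log_target(s):
        if s <= tau:
            return -np.inf
        lam, k = s * E, B                       # simplified detector maps
        return (-(theta + 1.0) * np.log(s)      # Pareto kernel
                + np.log(g_fn(s))               # detection probability
                + y_tot * np.log(lam + k) - (lam + k)        # Poisson kernel
                + y_src * np.log(lam / (lam + k))            # Binomial kernel
                + (y_tot - y_src) * np.log(k / (lam + k)))
    prop = S * np.exp(step * rng.normal())      # random walk on log S
    log_ratio = log_target(prop) - log_target(S) + np.log(prop / S)
    return prop if np.log(rng.random()) < log_ratio else S
```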

36 Computational Details cont...

Sample θ = (θ_1, ..., θ_m)^T (Metropolis–Hastings using a Normal proposal):

p(θ | ···) ∝ (1 − π(θ, τ))^(N−n) ∏_{j=1}^{m} Gamma(θ_j; a_j + n(j), b_j + I{j < m} [∑_{i=j+1}^{m} n(i)] log(τ_{j+1}/τ_j) + ∑_{i ∈ I(j)} log(s_i/τ_j))

where I(j) = {i : τ_j ≤ s_i < τ_{j+1}} and n(j) is the cardinality of I(j), i.e., I(j) (n(j)) denotes the set (number) of source indices whose flux is contained in the interval corresponding to the j-th mixture component.

Sample the break-points τ = (τ_2, ..., τ_m)^T (Metropolis–Hastings on the transformed scale η_j = h(τ)_j = log(τ_j − τ_{j−1}), j = 2, ..., m):

p(η | ···) ∝ (1 − π(θ, τ))^(N−n) exp{ −(1/2) ∑_{j=2}^{m} c_j (η_j − µ_j)² } I{τ_1 < τ_2 < ... < τ_m}
× ∏_{j=1}^{m} [∏_{l=1}^{j−1} (τ_{l+1}/τ_l)^(−θ_l)]^(n(j)) ∏_{i ∈ I(j)} (θ_j/τ_j) (s_i/τ_j)^(−(θ_j+1)) I{τ_1 < min(s_1, ..., s_n)}

37 Computational Details cont...

Fluxes of missing sources can (optionally) be imputed to produce posterior draws of a corrected log N – log S.

Impute the missing fluxes S_mis,i (rejection sampling):

(B_i, L_i, E_i) ~ p(B_i, L_i, E_i)
p(S_i | n, N, θ, τ, B_i, L_i, E_i, I_i = 0) ∝ (1 − g(S_i, B_i, L_i, E_i)) Pareto(S_i; θ, τ).
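Since 1 − g ≤ 1, the Pareto itself serves as the rejection-sampling envelope: propose from Pareto(θ, τ) and accept with the non-detection probability. A sketch (the g below is a toy curve; the real g is the instrument's incompleteness function):

```python
import numpy as np

def impute_missing_flux(theta, tau, g_fn, rng):
    """Rejection sampler: S ~ Pareto(theta, tau), accepted w.p. 1 - g(S)."""
    while True:
        S = tau * rng.random() ** (-1.0 / theta)   # inverse-CDF Pareto draw
        if rng.random() < 1.0 - g_fn(S):
            return S

rng = np.random.default_rng(2)
g = lambda s: 1.0 - np.exp(-s / 3e-17)             # toy incompleteness curve
missing = [impute_missing_flux(0.8, 1e-17, g, rng) for _ in range(1000)]
```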

38 Computational Notes

Some important things to note:
- For single power-law models, computation is relatively fast (a few hours) and insensitive to the number of missing sources
- The fluxes of the missing sources need not be imputed; they can (optionally) be imputed to produce posterior draws of a corrected log N – log S
- Computation for the broken power-law model is slower (about a day)
- Generalized mixtures of Paretos (or other forms) require only minor modifications of the general scheme

39 Application: CDFN X-rays

Application: X-ray emission from point sources in the CHANDRA Deep Field North (CDFN) survey.
- The deepest keV-band survey ever made
- 2 Ms exposure covering a field of 448 sq. arcmin
- Sources subsampled based on an off-axis angle threshold of 8 arcmin
- The dataset consists of 225 observed sources

Priors: a = 10, b = 10, E[N] = 300, Var(N) = 100², E[τ_1] = , Var(τ_1) = ( )², E[log(τ_2 − τ_1)] = −38, Var(log(τ_2 − τ_1)) = .

40 CDFN log(n)-log(s)

41

Method | bp0 | bp1 | bp2 | bp0 (g = 1) | BAFM | Sel.
DIC.Mean | – | – | – | – | Q_1 | bp2
DIC.Median | – | – | – | – | Q_2 | bp2
DIC.Mode | – | – | – | – | Q_3 | bp2
DIC.V | – | – | – | – | Q_4 | bp2
WAIC | – | – | – | – | Q_5 | bp2
WAIC | – | – | – | – | Q_6 | bp2

Table: DIC, WAIC, and BAFM for the CDFN analysis; the selected model is bp2 in every row.

42

Statistic | bp0 | bp1 | bp2 | g = 1
# obs src | (0.63) | 0.07 | (0.08) | 0.23
Minimum ph ct | (0.40) | 0.40 | (0.41) | 0.36
Maximum ph ct | (0.26) | 0.22 | (0.24) | 0.08
Median ph ct | (0.10) | 0.03 | (0.05) | 0.40
Lower quartile of ph cts | (0.23) | 0.17 | (0.17) | 0.28
Upper quartile of ph cts | (0.12) | 0.04 | (0.07) | 0.32
IQR of ph cts | (0.11) | 0.04 | (0.07) | 0.32
Crude estimate of R² | (0.33) | 0.31 | (0.31) | 0.15
# obs src vs. Median ph ct | (0.22) | 0.15 | (0.20) | 0.65
Lower vs. Upper quartile | (0.26) | 0.16 | (0.21) | 0.75
# obs sources vs. IQR | (0.20) | 0.13 | (0.20) | 0.72
# obs sources vs. R² | (0.59) | 0.59 | (0.60) | 0.18

Table: posterior predictive p-values for the CDFN analysis. (Adjusted for low flux*.)

43 CDFN Conclusions

Parameter | Estimate Mean (Med) | 95% (central) interval | (Wong et al., 2014) Estimate | SE
θ_1 | – (0.59) | (0.40, 1.09) | – | –
θ_2 | – (0.56) | (0.36, 1.18) | – | –
θ_3 | – (0.82) | (0.67, 1.03) | – | –
log10(τ_1) | – (−16.44) | (−16.60, −16.35) | – | –
log10(τ_2) | – (−16.11) | (−16.36, −15.61) | – | –
log10(τ_3) | – (−15.74) | (−16.00, −15.47) | – | –
N | 281 (280) | (259, 312) | – | –

Table: parameter estimates for the CDFN analysis based on the bp2 model, compared to a competing method that estimates the number of break-points and parameters via interwoven EM (Wong et al., 2014).

44 CDFN Conclusions

- Estimated slope of 0.46 (median); 95% (central) interval = (0.312, 0.681)
- Wong et al. (2014) estimated broken power-law slopes of θ̂ = (0.483, 0.854)
- Estimated missing-data fraction of 24%, interval (0.0%, 28.25%)
- Evidence of a possible break in the power law in the observed log N – log S
- Note: the ignorable analysis gives a posterior median of 0.363, a mean of 0.367, and a 95% interval of (0.248, 0.511).
