Strong Lens Modeling (II): Statistical Methods
Chuck Keeton
Rutgers, the State University of New Jersey
Probability theory

- multiple random variables, a and b
- joint distribution p(a, b)
- conditional distribution p(a | b)
- marginal distribution p(a) = ∫ p(a, b) db
- note: p(a, b) = p(a | b) p(b)
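These identities are easy to check numerically. A minimal sketch in Python (the 2×2 joint table is invented for illustration):

    import numpy as np

    # Hypothetical joint distribution p(a, b) on a 2x2 grid (rows: a, cols: b).
    p_ab = np.array([[0.1, 0.2],
                     [0.3, 0.4]])

    # Marginal: p(a) = sum over b of p(a, b)  (the integral becomes a sum here).
    p_a = p_ab.sum(axis=1)

    # Conditional: p(a | b) = p(a, b) / p(b).
    p_b = p_ab.sum(axis=0)
    p_a_given_b = p_ab / p_b

    # Check the product rule: p(a, b) = p(a | b) p(b).
    assert np.allclose(p_a_given_b * p_b, p_ab)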
- even if the model is correct, the measured data may not exactly match the model predictions, because of noise
- example: suppose the model predicts d_mod, and the measured value d_obs follows a Gaussian distribution with uncertainty σ:

    L(d_obs | d_mod) ∝ exp[ -(d_obs - d_mod)² / (2σ²) ] = e^{-χ²/2}

- call this the likelihood of the data given the model
- if the model predictions depend on a set of parameters q, write this as

    L(d_obs | q) ∝ exp[ -(d_obs - d_mod(q))² / (2σ²) ]

- how to use? when the model is wrong, d_obs is far from d_mod(q), so χ² is high and L is low; adjust the model to reduce χ² and increase L → the maximum likelihood method
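A minimal sketch of the maximum likelihood method under these assumptions (the linear toy model, the data, and σ are invented for illustration):

    import numpy as np
    from scipy.optimize import minimize

    # Toy setup: model d_mod(q) = q[0] + q[1]*x, Gaussian noise with known sigma.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    sigma = 0.1
    d_obs = 1.0 + 2.0 * x + rng.normal(0, sigma, x.size)

    def chi2(q):
        d_mod = q[0] + q[1] * x
        return np.sum((d_obs - d_mod) ** 2 / sigma ** 2)

    # Maximum likelihood = minimum chi^2.
    res = minimize(chi2, x0=[0.0, 0.0])
    print(res.x)   # best-fit parameters, near (1, 2)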
Bayesian inference

- goal: see what we can infer about the parameters from the data; so shift from p(d | q) to p(q | d)
- note Bayes's theorem: p(d, q) = p(d | q) p(q) = p(q | d) p(d), so

    p(q | d) = p(d | q) p(q) / p(d)

  where
    p(d | q) = likelihood L(d | q)
    p(q) = prior probability distribution for q
    p(d) = evidence (more later)
    p(q | d) = posterior probability distribution for q given d
- idea: use the posterior to quantify constraints on the parameters
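A minimal sketch of Bayes's theorem on a one-dimensional parameter grid (the Gaussian toy likelihood and the flat prior are invented for illustration):

    import numpy as np

    # Posterior on a 1D grid: p(q | d) proportional to L(d | q) p(q).
    # Toy Gaussian likelihood about q = 2 with width 0.5; flat prior on [0, 5].
    q = np.linspace(0, 5, 1001)
    dq = q[1] - q[0]
    prior = np.ones_like(q) / 5.0                      # p(q), normalized
    like = np.exp(-0.5 * ((q - 2.0) / 0.5) ** 2)       # L(d | q), toy example

    evidence = np.sum(like * prior) * dq               # p(d) = integral of L p dq
    posterior = like * prior / evidence                # p(q | d), normalized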
Quantifying constraints

- how do we use p(q | d) to quantify parameter constraints?
- could use µ and σ, but those have a specific meaning only for Gaussian distributions
- better to generalize: median and 68% confidence interval
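Given posterior samples (e.g. from the MCMC methods below), the median and 68% interval are just percentiles. A sketch, with stand-in Gaussian samples:

    import numpy as np

    # Median and 68% confidence interval from posterior samples.
    samples = np.random.default_rng(1).normal(2.0, 0.5, 10000)  # stand-in samples
    lo, med, hi = np.percentile(samples, [16, 50, 84])
    print(f"q = {med:.2f} (+{hi - med:.2f} / -{med - lo:.2f})")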
Nuisance parameters

- suppose we have some joint p(a, b), but we are mainly interested in a; we say b is a nuisance parameter
- probability theory lets us integrate out b to get the marginalized distribution for a:

    p(a) = ∫ p(a, b) db

- in general, this is not the same as optimizing the nuisance parameter
- example: with σ_y = 1 + x²,

    p(x, y) ∝ exp[ -(x - µ_x)² / (2σ_x²) ] exp[ -y² / (2σ_y²) ]

- optimize over y: p(x) ∝ exp[ -(x - µ_x)² / (2σ_x²) ]
- marginalize over y: p(x) ∝ (1 + x²) exp[ -(x - µ_x)² / (2σ_x²) ]
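A numerical check of this example (a sketch; the grid ranges and the µ_x, σ_x values are arbitrary choices):

    import numpy as np

    # p(x, y) as above, with sigma_y = 1 + x^2; marginalizing over y should
    # pull in a factor sigma_y = (1 + x^2) relative to optimizing over y.
    mu_x, sigma_x = 1.0, 0.5
    x = np.linspace(-2, 4, 601)
    y = np.linspace(-100, 100, 8001)
    X, Y = np.meshgrid(x, y, indexing="ij")
    sigma_y = 1 + X**2
    p_xy = np.exp(-0.5*((X - mu_x)/sigma_x)**2) * np.exp(-0.5*(Y/sigma_y)**2)

    dy = y[1] - y[0]
    p_opt = p_xy.max(axis=1)            # optimize over y (profile)
    p_marg = p_xy.sum(axis=1) * dy      # marginalize over y

    # p_marg / p_opt should be proportional to (1 + x^2):
    ratio = p_marg / p_opt
    print(np.allclose(ratio / ratio[0], (1 + x**2) / (1 + x[0]**2), rtol=1e-3))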
Evidence

- quantifies the overall probability of getting these data from this model:

    Z ≡ p(d) = ∫ L(d | q) p(q) dq

- can be used to compare different models (even if they have different numbers of parameters)
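Continuing the grid sketch above, the same machinery compares two toy models through their evidences (both priors are invented for illustration; the ratio Z_A / Z_B is the Bayes factor):

    # Model A: narrow prior concentrated near the likelihood peak.
    prior_A = np.where(np.abs(q - 2.0) < 0.5, 1.0, 0.0)
    prior_A /= np.sum(prior_A) * dq
    Z_A = np.sum(like * prior_A) * dq

    # Model B: broad flat prior on [0, 5].
    prior_B = np.ones_like(q) / 5.0
    Z_B = np.sum(like * prior_B) * dq

    print(Z_A / Z_B)   # > 1: the data favor the model whose prior puts
                       # its probability where the likelihood is high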
Monte Carlo Markov Chains

- often it is inconvenient or even impossible to analyze the full posterior
- instead, turn to statistical sampling: a set of points {q_k} drawn from the posterior (for now: assume flat priors, so p(q | d) ∝ L(q))
- method:
  - pick some starting point q_1
  - postulate some trial distribution, p_try(q)
  - draw a trial point, q_try, from p_try; the probability to accept is

      min[ L(q_try) / L(q_1), 1 ]

  - if we accept the trial point, put q_2 = q_try; otherwise, put q_2 = q_1
  - iterate!
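A minimal sketch of this Metropolis scheme (the 2D Gaussian log-likelihood, starting point, step size, and chain length are all invented for illustration; flat priors are assumed, as on the slide):

    import numpy as np

    def log_like(q):
        # toy 2D Gaussian likelihood, standing in for the real model
        return -0.5 * np.sum(q**2)

    rng = np.random.default_rng(0)
    q = np.array([2.0, 2.5])              # starting point q_1
    step = 0.5                            # width of the Gaussian trial distribution
    chain = [q]
    for _ in range(5000):
        q_try = q + rng.normal(0, step, 2)    # draw a trial point from p_try
        # accept with probability min[L(q_try)/L(q_k), 1], done in log space:
        if np.log(rng.uniform()) < log_like(q_try) - log_like(q):
            q = q_try
        chain.append(q)
    chain = np.array(chain)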
[Figures: Metropolis sampling with a simple Gaussian trial distribution, shown after increasing numbers of steps.]
When to stop?

- want: to sample L well, and to get results that are independent of the starting point
- solution: run multiple chains; keep going until the statistical properties of the chains are equivalent; throw away the first half of each chain to eliminate memory of the starting point
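One common way to make "statistical properties of the chains are equivalent" concrete is the Gelman–Rubin statistic, which is not named on the slide but is in this spirit; a sketch for a single parameter:

    import numpy as np

    def gelman_rubin(chains):
        """chains: array of shape (n_chains, n_steps) for one parameter."""
        half = chains[:, chains.shape[1] // 2:]   # discard first half of each chain
        m, n = half.shape
        chain_means = half.mean(axis=1)
        W = half.var(axis=1, ddof=1).mean()       # within-chain variance
        B = n * chain_means.var(ddof=1)           # between-chain variance
        var_hat = (n - 1) / n * W + B / n
        return np.sqrt(var_hat / W)               # R-hat approaches 1 at convergence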
[Figure: multiple chains run with the simple Gaussian trial distribution.]
Q: Can we pick the trial distribution to make the sampling more efficient?
A: Use the covariance matrix of the points sampled so far.

[Figure: sampling with adaptive steps.]
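A hedged sketch of one such adaptive scheme (the update interval and the 2.38²/dim rescaling, a standard choice for random-walk Metropolis, are assumptions, not from the slide):

    import numpy as np

    def log_like(q):
        return -0.5 * np.sum(q**2)    # same toy likelihood as before

    rng = np.random.default_rng(0)
    q = np.array([2.0, 2.5])
    cov = 0.25 * np.eye(2)            # initial guess for the trial covariance
    chain = [q]
    for k in range(1, 5001):
        if k % 500 == 0:
            # re-estimate the trial covariance from the points sampled so far
            cov = 2.38**2 / 2 * np.cov(np.array(chain).T)
        q_try = rng.multivariate_normal(q, cov)
        if np.log(rng.uniform()) < log_like(q_try) - log_like(q):
            q = q_try
        chain.append(q)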
How big to make the steps?

[Figures: sampling with tiny steps vs. with an adjustable step size.]
Results

- joint posterior, p(x, y): just plot all the sampled points

[Figure: scatter plot of the sampled points in the (x, y) plane.]
- marginalized posterior, p(x): just plot a histogram of the x-values of all the sampled points; likewise for p(y)

[Figure: histograms of the sampled x- and y-values.]
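A sketch of both plots, assuming the `chain` array from the Metropolis sketch above (the matplotlib panel layout is an arbitrary choice):

    import matplotlib.pyplot as plt

    # Joint posterior as a scatter plot, marginals as histograms.
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    axes[0].plot(chain[:, 0], chain[:, 1], ".", ms=1)   # joint p(x, y)
    axes[1].hist(chain[:, 0], bins=50, density=True)    # marginal p(x)
    axes[2].hist(chain[:, 1], bins=50, density=True)    # marginal p(y)
    plt.show()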
Nested sampling

- introduced by Skilling (2004, 2006)
- peel away layers of constant likelihood one by one
- estimate the volume of each layer statistically
- combine the (L_i, V_i) values to estimate the Bayesian evidence
- get a sample of points as a by-product (courtesy R. Fadely)
- variants: Shaw et al. (2007), Feroz & Hobson (2008), Brewer et al. (2009), Betancourt (2011); statistical uncertainties: CRK (2011)
- given likelihood L(q) and prior π(q), write the evidence as

    Z = ∫ L(q) π(q) dq

- define the fractional volume with likelihood higher than L:

    X(L) = ∫_{L(q) > L} π(q) dq

- in principle, can invert to find L(X), then write

    Z = ∫_0^1 L(X) dX

- discretize: if we can find a set of points (L_i, X_i), then we can write

    Z ≈ Σ_{i=1}^{N_nest} L_i (X_{i-1} - X_i)
how to get the points?

- L_i is easy: draw uniformly (from the prior) in the region with L > L_{i-1}
- X_i is harder: in principle, it requires integration
- proceed statistically...
- consider M points drawn uniformly from the region with L > L_0; draw likelihood contours through them, and let the enclosed volumes be V_1 > V_2 > ... > V_M
- these are random variables; write V_1 = V_0 t where t ∈ (0, 1)
- then t is the largest of M random variables drawn uniformly between 0 and 1, characterized by the probability distribution

    p(t) = M t^{M-1}  with  ⟨t⟩ = M / (M + 1)
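A quick numerical check that the largest of M uniform draws behaves as claimed:

    import numpy as np

    # The largest of M uniform draws has p(t) = M t^(M-1), mean M/(M+1).
    M, trials = 10, 100000
    rng = np.random.default_rng(0)
    t = rng.uniform(size=(trials, M)).max(axis=1)
    print(t.mean(), M / (M + 1))   # the two values should agree closely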
- begin with M live points drawn uniformly from the full prior; let their likelihoods be L_µ (µ = 1, ..., M)
- at step k:
  - extract the live point with the lowest L, and call it the k-th sampled point: L_k = min_µ(L_µ)
  - estimate the associated volume as X_k = X_{k-1} t_k, where t_k is a random number drawn from p(t) = M t^{M-1}
  - replace the extracted live point with a new point drawn from the prior but restricted to the region L(q) ≥ L_k
- iterate for N_nest steps
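A minimal sketch of this loop for a toy 2D Gaussian likelihood with a uniform prior on [-5, 5]² (the rejection sampling used for the constrained draw is fine for a demo but inefficient in general; all the specific numbers are invented for illustration):

    import numpy as np

    def log_like(q):
        return -0.5 * np.sum(q**2)

    rng = np.random.default_rng(0)
    M, N_nest = 100, 600
    live = rng.uniform(-5, 5, size=(M, 2))            # M live points from the prior
    live_logL = np.array([log_like(p) for p in live])

    Z, X_prev = 0.0, 1.0
    for k in range(N_nest):
        worst = np.argmin(live_logL)                  # lowest-likelihood live point
        L_k = np.exp(live_logL[worst])
        t_k = rng.uniform(size=M).max()               # t drawn from p(t) = M t^(M-1)
        X_k = X_prev * t_k                            # shrink the volume
        Z += L_k * (X_prev - X_k)                     # accumulate the evidence
        while True:                                   # replacement draw with L >= L_k
            q_new = rng.uniform(-5, 5, 2)
            if log_like(q_new) >= live_logL[worst]:
                break
        live[worst], live_logL[worst] = q_new, log_like(q_new)
        X_prev = X_k

    Z += np.mean(np.exp(live_logL)) * X_prev          # contribution of live points
    print(Z, 2 * np.pi / 100)                         # analytic value: 2*pi / prior volume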
[Figures: nested sampling illustrated on a two-dimensional example, from the initial setup through successive steps as the likelihood threshold rises.]
development of the evidence: contributions from the live points, the sampled points, and the total

[Figure: running evidence estimate vs. step number.] (See CRK 2011 for statistical uncertainties.)