Multimodal Nested Sampling

Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge

Inverse Problems & Cosmology The most obvious example is the standard CMB data analysis pipeline, but there are many others: object detection, signal enlargement, signal separation, and more.

Bayesian Inference Definition: an approach to statistics in which all forms of uncertainty are expressed in terms of probability (Radford M. Neal).

The Bayesian Way

Bayesian Inference Bayes' theorem: Pr(Θ | D, H) = Pr(D | Θ, H) Pr(Θ | H) / Pr(D | H), i.e. Posterior = Likelihood × Prior / Evidence. Model Selection: Pr(H_1 | D) / Pr(H_0 | D) = [Pr(D | H_1) / Pr(D | H_0)] × [Pr(H_1) / Pr(H_0)].

Bayesian Computation Priors and posteriors are often complex distributions and may not be easily represented as formulas. Represent the distribution by drawing random samples from it; visualize these samples directly or via low-dimensional projections; make Monte Carlo estimates of probabilities and expectations. Sampling from the prior is often easy; sampling from the posterior is difficult.

Some Cosmological Posteriors Some are nice, others are nasty. Maximization (local or global) and covariance matrices give only partial information; it is better to sample from the posterior using MCMC.

Bayesian Evidence Evidence = Z = ∫ L(θ) π(θ) d^n θ. Evaluation of this n-dimensional integral presents a great numerical challenge. If the dimension n of the parameter space is small, calculate the unnormalized posterior P(θ) = L(θ) π(θ) over a grid in parameter space and obtain the evidence trivially. For higher-dimensional problems this approach rapidly becomes impossible, so alternative methods are needed, e.g. the Gaussian approximation or the Savage-Dickey ratio. Evidence evaluation is at least an order of magnitude more costly than parameter estimation.
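
A minimal sketch of the low-dimensional grid approach just described, using an illustrative 2-D Gaussian likelihood and a flat prior on a box (both are assumptions for the example, not from the slides):

```python
import numpy as np

# Toy 2-D Gaussian likelihood (normalized), flat prior on [-5, 5]^2.
def log_likelihood(theta):
    return -0.5 * np.sum(theta**2, axis=-1) - np.log(2 * np.pi)

lo, hi, n = -5.0, 5.0, 200
prior_density = 1.0 / (hi - lo)**2            # pi(theta) on the box
x = np.linspace(lo, hi, n)
X, Y = np.meshgrid(x, x)
grid = np.stack([X, Y], axis=-1)              # shape (n, n, 2)

unnorm_post = np.exp(log_likelihood(grid)) * prior_density   # P(theta) = L * pi
cell = (x[1] - x[0])**2                       # area of one grid cell
Z = np.sum(unnorm_post) * cell                # Z = integral of L * pi d^2 theta
print("grid estimate of Z:", Z)               # ~0.01 here, since L integrates to 1
```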

Nested Sampling Introduced by John Skilling in 2004; a Monte Carlo technique for efficient evaluation of the Bayesian evidence. Re-parameterize the integral in terms of the prior mass X, defined by dX = π(θ) d^n θ, so that X(λ) = ∫_{L(θ) > λ} π(θ) d^n θ; X is defined such that it uniquely specifies the likelihood, and Z = ∫_0^1 L(X) dX. Suppose we can evaluate L_j = L(X_j), where 0 < X_m < ... < X_2 < X_1 < 1; then Z ≈ Σ_{j=1}^m L_j w_j, where w_j = (X_{j-1} - X_{j+1}) / 2.

Nested Sampling: Algorithm
1. Set j = 0; initially X_0 = 1, Z = 0.
2. Sample N live points uniformly inside the initial prior space (X_0 = 1) and calculate their likelihoods.
3. Set j = j + 1.
4. Find the live point with the lowest likelihood, L_j, and remove it from the list of live points.
5. Increment the evidence: Z = Z + L_j (X_{j-1} - X_{j+1}) / 2.
6. Reduce the prior volume: X_j / X_{j-1} = t_j, where P(t) = N t^(N-1).
7. Replace the rejected point with a new point sampled from π(θ) within the hard-edged region L > L_j.
8. If L_max X_j < αZ, set Z = Z + (X_j / N) Σ_{i=1}^N L(θ_i) and stop; otherwise go to 3.
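
A minimal sketch of the algorithm above on a 2-D toy problem (an illustrative implementation, not the speaker's code). The hard-constraint draw in step 7 is done here by naive rejection sampling from the prior, which is only viable for toy problems; the ellipsoidal and Metropolis schemes on later slides replace this step:

```python
import numpy as np

rng = np.random.default_rng(0)
N, alpha = 100, 1e-2                       # number of live points, stopping tolerance

def sample_prior(n):                       # uniform prior on the unit square
    return rng.uniform(0.0, 1.0, size=(n, 2))

def likelihood(theta):                     # toy Gaussian blob, analytic log Z ~ -2.77
    return np.exp(-0.5 * np.sum((theta - 0.5)**2, axis=-1) / 0.01)

live = sample_prior(N)
live_L = likelihood(live)
Z, X_prev = 0.0, 1.0
for j in range(1, 5000):
    i = np.argmin(live_L)                  # step 4: worst live point sets the constraint L_i
    L_i = live_L[i]
    X_j = np.exp(-j / N)                   # step 6: expected shrinkage X_j ~ exp(-j/N)
    Z += L_i * (X_prev - X_j)              # step 5 (simple weights; slides use the midpoint rule)
    X_prev = X_j
    while True:                            # step 7: draw from the prior with L > L_i
        cand = sample_prior(1)
        cand_L = likelihood(cand)[0]
        if cand_L > L_i:
            live[i], live_L[i] = cand[0], cand_L
            break
    if live_L.max() * X_j < alpha * Z:     # step 8: termination criterion
        break
Z += X_j * np.mean(live_L)                 # remaining live-point contribution
print("nested sampling estimate of log Z:", np.log(Z))
```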

Error Estimation The bulk of the posterior lies around X ≈ e^{-H}, where H is the information, H = ∫ log(dP/dX) dP, with dP = L dX / Z. Since log X_i ≈ -(i ± √i) / N, we expect the procedure to take NH ± √(NH) steps to shrink down to the bulk of the posterior. The dominant uncertainty in Z is due to the Poisson variability in this number of steps, NH ± √(NH), required to reach the bulk of the posterior. log X_i, and hence log Z, are therefore subject to a standard deviation uncertainty of √(H/N): log Z = log[ Σ_i L_i (X_{i-1} - X_{i+1}) / 2 ] ± √(H/N).

Nested Sampling: Demonstration Egg-Box Posterior

Nested Sampling: Demonstration 250 200 150 100 50 0 0 1 0.2 0.8 0.4 0.6 0.6 0.4 0.8 0.2 0 Egg-Box Posterior

Nested Sampling Advantages: typically requires around 100 times fewer samples than thermodynamic integration for evidence calculation; does not get stuck at phase changes; parallelization is possible if the efficiency is known. Bonus: posterior samples are easily obtained as a by-product. Take the full sequence of rejected points, θ_i, and weight the i-th sample by p_i = L_i w_i / Z. Problem: one must sample efficiently from the prior within a complicated, hard-edged likelihood constraint; MCMC can be inefficient.
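
A small sketch of the by-product posterior weighting just described, assuming the rejected points' likelihoods L and prior-mass weights w have been recorded during the run:

```python
import numpy as np

def posterior_resample(L, w, size, rng=np.random.default_rng(1)):
    """Draw indices of equally weighted posterior samples from the rejected
    points, given their likelihoods L and prior-mass weights w."""
    p = L * w                              # p_i proportional to L_i * w_i
    Z = p.sum()                            # evidence estimate from the same run
    return rng.choice(len(L), size=size, p=p / Z)
```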

Ellipsoidal Nested Sampling Mukherjee et al. (2006) introduced an ellipsoidal bound for the remaining prior volume with the hard constraint L > L_i at each iteration. Construct an n-dimensional ellipsoid using the covariance matrix of the current live points and enlarge this ellipsoid by some enlargement factor f. Easily extendable to multi-modal problems through clustering.
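
A sketch of the two ingredients above: constructing a bounding ellipsoid from the live-point covariance (with an enlargement factor f) and drawing uniformly inside it. The scaling details are illustrative assumptions rather than the exact prescription of Mukherjee et al.:

```python
import numpy as np

def bounding_ellipsoid(live_points, f=1.2):
    """Centre mu and matrix C such that (x - mu)^T C^{-1} (x - mu) <= 1 encloses
    all live points, with C scaled by the enlargement factor f."""
    mu = live_points.mean(axis=0)
    cov = np.cov(live_points, rowvar=False)
    d = live_points - mu
    # scale so that the most distant live point lies on the ellipsoid surface
    k = np.max(np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d))
    return mu, cov * k * f

def sample_ellipsoid(mu, C, rng=np.random.default_rng(2)):
    """Draw one point uniformly from the ellipsoid defined by (mu, C)."""
    ndim = len(mu)
    z = rng.normal(size=ndim)
    z *= rng.uniform() ** (1.0 / ndim) / np.linalg.norm(z)   # uniform in the unit ball
    return mu + np.linalg.cholesky(C) @ z
```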

Ellipsoidal Nested Sampling - Problems

Ellipsoidal Nested Sampling - Solution

Simultaneous Nested Sampling Introduced by Feroz & Hobson (2007) (arXiv:0704.3704). Improvements over recursive ellipsoidal nested sampling: it is non-recursive, so it requires fewer likelihood evaluations in multimodal problems; it identifies the number of clusters using X-means; it can use ellipsoidal, Metropolis or any other sampling method to sample from the hard constraint; and it evaluates local as well as global evidence values.

Identification of Clusters Infer the appropriate number of clusters from the current live point set using X-means (Pelleg et al. 2000). X-means partitions the points into the number of clusters that optimizes the Bayesian Information Criterion (BIC). X-means performs well overall but has some inconsistencies.

Evaluation of Local Evidences In simultaneous nested sampling, if a cluster is non-intersecting with its sibling and non-ancestor clusters, it is added to the list of isolated clusters. Sum the evidence contributions from the rejected points inside this isolated cluster into the local evidence of the corresponding mode. This underestimates the local evidence of modes that are sufficiently close together. Solution: store information about the clusters of the past few iterations, match the isolated clusters with those at past iterations, and increment a cluster's local evidence if the rejected points in those iterations fall into its matched clusters.

Sampling from Overlapping Ellipsoids Suppose there are k clusters at iteration i, with n_1, n_2, ..., n_k points and V_1, V_2, ..., V_k volumes of the corresponding (enlarged) ellipsoids. Choose an ellipsoid with probability p_k = V_k / V_tot, where V_tot = Σ_{j=1}^k V_j. Sample from the chosen ellipsoid with the hard constraint L > L_i. Find the number n of ellipsoids in which the chosen sample lies, and accept the sample with probability 1/n.
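
A sketch of this acceptance rule (helper names are illustrative; the hard constraint L > L_i would be applied by the caller to the returned point):

```python
import numpy as np

def in_ellipsoid(x, mu, C):
    d = x - mu
    return d @ np.linalg.solve(C, d) <= 1.0

def draw_in_ellipsoid(mu, C, rng):
    z = rng.normal(size=len(mu))
    z *= rng.uniform() ** (1.0 / len(mu)) / np.linalg.norm(z)   # uniform in unit ball
    return mu + np.linalg.cholesky(C) @ z

def sample_overlapping(ellipsoids, volumes, rng=np.random.default_rng(3)):
    """ellipsoids: list of (mu, C) pairs; volumes: their (enlarged) volumes V_k."""
    p = np.asarray(volumes) / np.sum(volumes)
    while True:
        k = rng.choice(len(ellipsoids), p=p)            # choose with p_k = V_k / V_tot
        x = draw_in_ellipsoid(*ellipsoids[k], rng)
        n = sum(in_ellipsoid(x, mu, C) for mu, C in ellipsoids)
        if rng.uniform() < 1.0 / n:                     # accept with probability 1/n
            return x
```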

Dealing with Degeneracies One ellipsoid is a very bad approximation to a banana-shaped likelihood region. Sub-cluster every cluster found by X-means, with the minimum number of points in each sub-cluster being (D + 1), where D is the dimensionality of the problem. Expand these sub-clusters by sharing points with neighboring sub-clusters and sample from them using the strategy outlined in the previous section.

Metropolis Nested Sampling (MNS) Replace ellipsoidal sampling in simultaneous ellipsoidal nested sampling with the Metropolis-Hastings method. Proposal distribution: an isotropic Gaussian with width σ held fixed during a nested sampling iteration. At each iteration, pick one of the N live points at random as the starting position for a random walk and take n_s (= 20) steps from it, with each new sample x' being accepted only if L(x') > L_i. Adjust σ after every nested sampling iteration to maintain an acceptance rate of around 50%.
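
A sketch of one such constrained random walk, assuming a unit-cube prior and a user-supplied likelihood function (both assumptions for illustration):

```python
import numpy as np

def metropolis_constrained(live, L_i, likelihood, sigma, n_s=20,
                           rng=np.random.default_rng(4)):
    """Random walk from a randomly chosen live point, keeping only moves that
    stay inside the (assumed unit-cube) prior and satisfy L(x') > L_i."""
    x = live[rng.integers(len(live))].copy()
    accepted = 0
    for _ in range(n_s):
        x_new = x + rng.normal(scale=sigma, size=x.shape)   # isotropic Gaussian proposal
        if np.all((x_new >= 0.0) & (x_new <= 1.0)) and likelihood(x_new) > L_i:
            x, accepted = x_new, accepted + 1
    # the caller adjusts sigma between nested sampling iterations
    # to keep accepted / n_s near 0.5
    return x, accepted / n_s
```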

Example: Gaussian Shells The likelihood is defined as L(x) = circ(x; c_1, r_1, w_1) + circ(x; c_2, r_2, w_2), where circ(x; c, r, w) = exp[-(|x - c| - r)² / (2w²)] / √(2π w²). Typical of the degeneracies in many beyond-the-Standard-Model parameter space scans in particle physics.
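
The shell likelihood written out directly from the formula above (the default centres, radius and width match the 2-D example on the next slide):

```python
import numpy as np

def circ(x, c, r, w):
    """Gaussian shell of radius r and width w centred at c."""
    return np.exp(-(np.linalg.norm(x - c) - r)**2 / (2 * w**2)) / np.sqrt(2 * np.pi * w**2)

def shells(x, c1=(-3.5, 0.0), c2=(3.5, 0.0), r=2.0, w=0.1):
    return circ(x, np.asarray(c1), r, w) + circ(x, np.asarray(c2), r, w)

# e.g. shells(np.array([1.5, 0.0])) lies on the second shell, ~1/sqrt(2*pi*w**2) ~ 3.99
```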

Gaussian Shells in 2D: Results w_1 = w_2 = 0.1, r_1 = r_2 = 2, c_1 = (-3.5, 0.0), c_2 = (3.5, 0.0). Analytical results: log Z = -1.75, log Z_1 = -2.44, log Z_2 = -2.44. Ellipsoidal & Metropolis nested sampling with N_like ≈ 20,000: log Z = -1.78 ± 0.08, log Z_1 = -2.49 ± 0.09, log Z_2 = -2.47 ± 0.09. The bank sampler (a modified Metropolis-Hastings method, arXiv:0705.0486) required N_like ≈ 1 × 10^6 for parameter estimation alone, with no evidence evaluation.

Gaussian Shells up to 100D: Results (Analytical vs. Metropolis Nested Sampling)

dim | Analytical log Z | Analytical local log Z* | MNS log Z | MNS local log Z_1 | MNS local log Z_2 | N_like
10  | -14.6  | -15.3  | -14.6 ± 0.2  | -15.4 ± 0.2  | -15.3 ± 0.2  | 127,463
30  | -60.1  | -60.8  | -60.1 ± 0.5  | -60.4 ± 0.5  | -61.3 ± 0.5  | 489,416
50  | -112.4 | -113.1 | -112.2 ± 0.5 | -112.9 ± 0.5 | -113.0 ± 0.5 | 857,937
70  | -168.2 | -168.9 | -167.5 ± 0.6 | -167.7 ± 0.6 | -170.7 ± 0.7 | 1,328,012
100 | -255.6 | -255.3 | -254.2 ± 0.8 | -254.4 ± 0.8 | -256.7 ± 0.8 | 2,091,314

* analytically, local log Z_1 = local log Z_2 = local log Z

Application: Astronomical Object Detection Main problems: parameter estimation, model comparison, and quantification of detection.

Quantifying Cluster Detection R = Pr(H_1 | D) / Pr(H_0 | D) = [Pr(D | H_1) Pr(H_1)] / [Pr(D | H_0) Pr(H_0)] = (Z_1 / Z_0) Pr(H_1) / Pr(H_0), where H_0 = there is no cluster with its center lying in the region S, and H_1 = there is one cluster with its center lying in the region S. The null evidence is Z_0 = ∫_S L_0 dX = L_0, since the null likelihood L_0 is constant. For clusters distributed according to a Poisson distribution, Pr(H_1) = μ_S Pr(H_0), so R = Z_1 μ_S / L_0.
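
A tiny worked example of the detection ratio R (the numerical values are illustrative assumptions, not from the slides):

```python
import numpy as np

log_Z1 = -10.0    # local evidence for "one cluster with centre in S" (illustrative)
log_L0 = -12.5    # constant null likelihood L_0 (illustrative)
mu_S = 0.05       # expected number of clusters with centre in S (illustrative)

R = np.exp(log_Z1 - log_L0) * mu_S      # R = Z_1 * mu_S / L_0
print("detection ratio R =", R)         # R > 1 favours the cluster hypothesis
```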

Weak Gravitational Lensing Figure panels: unlensed galaxies; projected mass with shear map overlaid; lensed galaxies.

Wide Field Weak Gravitational Lensing Figure panels: true, noisy and inferred convergence maps. 0.5 × 0.5 deg² field, 100 galaxies per arcmin², σ = 0.3. Concordance ΛCDM cosmology with cluster masses and redshifts drawn from the Press-Schechter mass function.

Wide Field Lensing: Application to N-Body Simulations Simulation produced by Martin White, 2005, covering 3 × 3 deg². Concordance ΛCDM cosmology, 65 galaxies per arcmin², σ = 0.3. 1350 halos with M_200 > 10^13.5 h^-1 M_sun.

Figures: number of clusters (total and detected) and completeness, plotted as functions of redshift z and mass M_200 (h^-1 M_sun). 146 positive detections, of which 131 are true.

(In)completeness of Weak Lensing (Hennawi & Spergel 2007; Feroz et al., in preparation) Figures: completeness as a function of redshift z and of mass M_200 (h^-1 M_sun).

Bayesian Analysis of mSUGRA The most popular realization of the MSSM, with universal boundary conditions. 5 mSUGRA parameters (M_1/2, m_0, tan β, A_0, sgn(μ)) plus Standard Model parameters. Allanach et al. performed a bank sampler analysis to obtain parameter constraints. Bayesian evidence based model comparison is vital for analyzing models of SUSY breaking at low energies using LHC data.

Bayesian Analysis of mSUGRA: Results (Allanach et al., arXiv:0705.0487; Feroz et al., in preparation) Figures: 2-D posteriors in the m_0 (TeV) versus M_1/2 (TeV) and m_0 (TeV) versus tan(β) planes.

Conclusions The Bayesian framework provides a unified approach with two levels of inference: parameter estimation and confidence limits, by maximising or exploring the posterior; and model selection, by integrating the posterior to obtain the evidence. Nested sampling is efficient in both evidence evaluation and parameter estimation; the main issue is sampling from the prior within the hard likelihood constraint, for which MCMC and ellipsoidal bound methods are promising, and clustering allows sampling from multimodal/degenerate posteriors. There are many cosmological and particle physics applications, so try it for yourself!

Cluster Tomography (Wittman et al. 2001) Assume a mass profile and fit for shear as a function of source photometric redshift. How reliable is this technique? Figure: spectroscopic redshift.

Weak Lensing: Parameter Constraints Figures: 2-D posterior constraints on cluster position x, y (arcsec), concentration c, mass M_200 (h^-1 M_sun) and redshift z.

Weak Lensing: Parameter Constraints Figures: 1-D posterior constraints on x, y (arcsec), concentration c, M_200 (h^-1 M_sun) and redshift z, comparing a Press-Schechter prior with uninformative priors.

Metropolis-Hastings Algorithm The Metropolis-Hastings algorithm samples from P(θ). Start at an arbitrary point θ_0. At each step, draw a trial point θ' from the proposal distribution Q(θ' | θ_n), calculate the ratio r = P(θ') Q(θ_n | θ') / [P(θ_n) Q(θ' | θ_n)], and accept θ_{n+1} = θ' with probability min(1, r); otherwise set θ_{n+1} = θ_n. After an initial burn-in period, any (positive) proposal Q gives convergence to P(θ). A common choice of Q is a multivariate Gaussian centred on θ_n, but many others are possible.
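
A minimal sketch of the algorithm above for a symmetric Gaussian proposal, so that the Q terms cancel in the acceptance ratio; the target log-density is supplied by the caller (an assumption for the example):

```python
import numpy as np

def metropolis_hastings(log_p, theta0, n_steps, width, rng=np.random.default_rng(5)):
    """Sample from P(theta) given its log-density log_p, using an isotropic
    Gaussian proposal of the given width centred on the current point."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = np.empty((n_steps, theta.size))
    lp = log_p(theta)
    for n in range(n_steps):
        trial = theta + rng.normal(scale=width, size=theta.shape)
        lp_trial = log_p(trial)
        if np.log(rng.uniform()) < lp_trial - lp:   # accept with probability min(1, r)
            theta, lp = trial, lp_trial
        chain[n] = theta                            # repeat the current point if rejected
    return chain

# usage: samples (after discarding burn-in) from a 2-D standard Gaussian
# chain = metropolis_hastings(lambda t: -0.5 * np.sum(t**2), np.zeros(2), 10000, 0.5)
```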

Metropolis-Hastings Algorithm: Some Problems The choice of proposal Q strongly affects the convergence rate and sampling efficiency: with a large proposal width ε, trial points are rarely accepted; with a small proposal width ε, the chain explores P(θ) by a random walk, which is very slow. If the largest scale of P(θ) is L, the typical diffusion time is t ~ (L/ε)²; if the smallest scale of P(θ) is l, we need ε ~ l, giving a diffusion time t ~ (L/l)². This is particularly bad for multimodal distributions: transitions between distant modes are very rare, and no single choice of proposal width ε works. Standard convergence tests will suggest convergence, when in fact the chain has converged only within a subset of the modes.

Thermodynamic Integration MCMC sampling (with annealing) from the full posterior requires no assumptions regarding hypotheses or priors. The basic method is thermodynamic integration: define Z(λ) = ∫ L^λ(θ) π(θ) dθ, so the required evidence value is Z(1). Begin MCMC sampling from the distribution proportional to L^λ(θ) π(θ), starting with λ = 0 and then slowly raising λ according to some annealing schedule until λ = 1. Use the N_s samples corresponding to any particular value of λ to obtain an estimate of ⟨log L⟩_λ. But ⟨log L⟩_λ = (1/Z) dZ/dλ = d log Z / dλ, so log Z(1) = log Z(0) + ∫_0^1 ⟨log L⟩_λ dλ ≈ Σ_j ⟨log L⟩_{λ_j} Δλ_j.
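
A sketch of the final quadrature step above, assuming the per-λ MCMC runs have already produced estimates of ⟨log L⟩_λ on a ladder of λ values (the ladder in the usage comment is hypothetical):

```python
import numpy as np

def log_evidence_from_ladder(lambdas, mean_logL):
    """lambdas: increasing values from 0 to 1; mean_logL: <log L>_lambda at each.
    Returns log Z(1), taking log Z(0) = 0 for a normalized prior."""
    return np.trapz(mean_logL, lambdas)

# usage with a hypothetical geometric ladder and precomputed <log L>_lambda values:
# lambdas = np.concatenate([[0.0], np.geomspace(1e-4, 1.0, 50)])
# log_Z = log_evidence_from_ladder(lambdas, mean_logL)
```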

Thermodynamic Integration Problems: the evidence value is stochastic, so multiple runs are needed to estimate the error on the evidence; accurate evidence evaluation requires slow annealing; common schedules (linear, geometric) can get stuck in local maxima; and annealing cannot navigate through phase changes. Let dX = π(θ) dθ denote the prior mass. As λ goes from 0 to 1, annealing should track along the log L versus log X curve, but d log L / d log X = -1/λ, so the annealing schedule cannot navigate through convex regions of this curve (phase changes).