Sampling versus optimization in very high dimensional parameter spaces

Size: px

Start display at page:

Download "Sampling versus optimization in very high dimensional parameter spaces"

Bryan Gilmore
5 years ago
Views:

1 Sampling versus optimization in very high dimensional parameter spaces Grigor Aslanyan Berkeley Center for Cosmological Physics UC Berkeley Statistical Challenges in Modern Astronomy VI Carnegie Mellon University June 6, 2016

2 Collaborators Uroš Seljak Yu Feng Chirag Modi

3 Sampling: Hamiltonian Monte Carlo (HMC) Each parameter becomes a particle position. Momentum variables are introduced. Particles follow Hamiltonian dynamics. U = -ln(posterior). Huge advantage over random walk: Information in the derivatives is used to walk in the right direction. Acceptance rate = 1 theoretically. For each iteration need to integrate equations of motion numerically using staggered leapfrog (or similar) methods. Typically ~10 numerical integration steps are taken per iteration. Method of choice for sampling high dimensional parameter spaces. Tuning: masses, integration steps, integration time.

Optimization: BFGS Quasi-Newton method. Needs the first, but not second derivatives. At each iteration the inverse Hessian is estimated from previous iterations (never stored explicitly).

4 Optimization: BFGS Quasi-Newton method. Needs the first, but not second derivatives. At each iteration the inverse Hessian is estimated from previous iterations (never stored explicitly). A direction of move is deduced, followed by line search. Line search: Moré-Theunte 1992 L-BFGS: Limited memory BFGS. Store and use only a few previous iterations. Works almost as well as BFGS!

5 Linear Model data d = Rs + n S = ss N = nn signal noise C dd = RSR + N Binning: S l = {s ml } (m l =1,...,M l ) Q l = R l R RSR = X l Likelihood: l Q l L(d ) =(2 ) N/2 det(c) 1/2 exp 1 2 d C 1 d projection matrix Minimum variance estimator (Wiener Filter): ŝ = SR C 1 d For gaussian fields this is the same as the maximum probability estimator! U.Seljak astro-ph/

6 Linear Model Fisher matrix: F ll 0 2 ln L(d l 0 ˆ The inverse is an estimate of the covariance matrix of the parameters: D ˆ ˆ E D ˆ EDˆ E = F 1 Calculation: F ll 0 = 1 2 tr Q lc 1 Q l 0C 1 Window: W ll 0 = F ll P 0 l F 0 ll 0 Power spectrum quadratic estimator: ˆ l = 1 2 X l 0 F 1 ll 0 d C 1 Q l 0C 1 d b l 0 b l = tr NC 1 Q l C 1 D ˆ l E = X l 0 W ll 0 l 0 U.Seljak astro-ph/

7 Estimating Noise Bias and Fisher Matrix Noise bias: simulate noise: d n Pass through optimizer: ŝ n b l = l S 1 ŝ nŝ n S 1 l Fisher matrix: simulate signal: s s Pass through optimizer: ŝ s For each bin l 0 simulate extra signal in that bin only: s l 0 s l = s 0 s + s l 0 Pass through optimizer: ŝ l 0 * P ŝ l = ŝ 0 l ŝ 0 s F ll = K l 0 ŝ k l l (k l ) 2 + with P 0 k s l 0 l 0 (k l 0) l K l 0 is the number of modes Power spectrum: ˆ l = X kl ŝ(k l ) 2 * P ŝn (k l ) 2 l W 0 ll 0K 1 l 0 P k l 0 s s(k l 0) 2 P k l ŝ s (k l ) 2 +

8 Linear case: Weak Lensing Power spectrum Toy Example: 64x64 grid

9 Linear case: Weak Lensing Toy Example: 64x64 grid Add noise and mask Observed data:

10 Original Results L-BFGS Linear Algebra Fisher matrix:

11 Linear case: Weak Lensing 1024x1024 grid: ~million parameters Power spectrum

12 Linear case: Weak Lensing 1024x1024 grid: ~million parameters

13 Reconstruction Original L-BFGS in action CPU time ~ 1 min.

14 L-BFGS vs. Conjugate Gradient

15 Power Spectrum Fiducial power is 50% larger than the true power. CPU time ~ 1 hour

16 Nonlinear Case: Large Scale Structure Grid: 128 3, Box size: (750 Mpc/h) 3 Nonlinear evolution: 2LPT (Second order Lagrangian perturbation theory) Simulation 2LPT Noise Code used: fastpm (Yu Feng, Man-Yat Chu, Uros Seljak, )

17 Reconstruction Nonlinear density Original L-BFGS in action CPU time ~ 1 hour

18 L-BFGS vs. Conjugate Gradient

19 Power Spectrum PRELIMINARY Fiducial power has no wiggles. Power spectra are convolved with window. CPU time ~ 60 hours

20 Power Spectrum PRELIMINARY Showing ratio to fiducial (smooth) power. CPU time ~ 60 hours

21 Comparison with HMC Previously: HMC: Jasche, Wandelt, , Wang, Mo, Yang, van den Bosch, , HMC Burnin: ~500 iterations (~5,000 function/derivative calls). L-BFGS optimization: ~100 iterations (~100 function/ derivative calls). HMC correlation length: ~200. HMC sample of 10,000: ~100,000 function/derivative calls. L-BFGS full fisher matrix/power spectrum estimation: ~5,000 function/derivative calls.

22 Summary L-BFGS is a fast optimizer for very high dimensional parameter spaces. Conjugate gradient works almost as well as L-BFGS for the linear case. Not so much for the nonlinear case. Our reconstruction method works well for both linear and nonlinear models with ~million parameters at least. HMC is at least an order of magnitude more expensive. But if you really need a full sample then HMC is the way to go. DO NOT use HMC for minimization! Optimizers (L-BFGS, Conjugate gradient) and HMC publicly available (soon) as a part of the cosmo++ package: cosmopp.com

Large Scale Bayesian Inference

Large Scale Bayesian Inference Large Scale Bayesian I in Cosmology Jens Jasche Garching, 11 September 2012 Introduction Cosmography 3D density and velocity fields Power-spectra, bi-spectra Dark Energy, Dark Matter, Gravity Cosmological