Practical Numerical Methods in Physics and Astronomy, Lecture 5: Optimisation and Search Techniques. Pat Scott, Department of Physics, McGill University. January 30, 2013. Slides available from http://www.physics.mcgill.ca/~patscott
Outline 1 General Considerations 2
General Considerations
Optimisation: the problem
Optimisation is finding global minima and maxima: for what x = x_needle does min[f_haystack(x)] = f_haystack(x_needle)?
Maximisation is just minimisation of -f_haystack(x), so usually everything is posed as minimisation.
The general strategy
To optimise, we always require an objective or fitness function f_haystack. Any search problem can be posed in terms of some sort of fitness function. We may care just about finding x_needle, or about mapping f_haystack in the region of x_needle. e.g. comparing theory to data:
- just the best-fit parameters?
- or errors on the best fit?
- or just a good overall map, without even finding the exact best fit?
The goal is the global minimum, but often the result is a local minimum.
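As a concrete (invented) illustration of a fitness function: a chi-squared objective for fitting a straight line y = a*x + b to some mock data with uncertainties. The data values and the name f_haystack are assumptions for the sketch; minimising this function over (a, b) gives the best-fit parameters.

```python
# Hypothetical mock data set (invented values) for a straight-line fit
x_data = [0.0, 1.0, 2.0, 3.0]
y_data = [1.1, 2.9, 5.2, 6.8]
sigma = [0.2, 0.2, 0.2, 0.2]   # measurement uncertainties

def f_haystack(params):
    """Chi-squared fitness: sum of squared, sigma-weighted residuals."""
    a, b = params
    return sum(((y - (a * x + b)) / s) ** 2
               for x, y, s in zip(x_data, y_data, sigma))

print(f_haystack((2.0, 1.0)))   # fitness at one trial point in (a, b) space
```

Any of the optimisers discussed in this lecture could then be pointed at f_haystack to find the best-fit (a, b).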
Optimisation vs root finding
Multi-d optimisation is usually easier than multi-d root finding. Optimisation by root finding on grad f_haystack = 0 doesn't work:
- it makes all local minima and maxima (and points of inflection!) degenerate
- so it is highly unlikely to find the global extremum
Root finding for h(x) = 0 by minimisation of h^2(x) is not enough either:
- you generally run into problems with local minima
- it can be improved by combination with Newton's method in multi-d
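A small demonstration of the second pitfall, with an invented example function: gradient descent on g = h^2 for h(x) = x^3 - 2x + 2 converges to a local minimum of h^2 near x ≈ 0.816, where h ≈ 0.91 is nowhere near zero, completely missing the true root near x ≈ -1.77. The step size and iteration count are illustrative assumptions.

```python
def h(x):
    # Invented test function: one real root (near x = -1.77), but h^2 has
    # a spurious local minimum where h' = 0 and h != 0
    return x**3 - 2.0 * x + 2.0

def dg(x):
    # Derivative of g(x) = h(x)^2, i.e. 2 * h * h'
    return 2.0 * h(x) * (3.0 * x**2 - 2.0)

x = 1.0                      # start on the wrong side of the barrier
for _ in range(500):
    x -= 0.05 * dg(x)        # plain gradient descent on h^2
# x is now stuck at a local minimum of h^2 that is not a root of h
print(x, h(x))
```

Minimising h^2 only finds a root if you happen to start in the basin of an actual zero.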
Options:
deterministic, non-gradient methods
- Brent's method in 1D
- downhill simplex in multi-d
deterministic, gradient-based methods
- steepest descent
stochastic, gradient-inspired methods
- MCMCs
- nested sampling
- simulated annealing
stochastic, non-gradient methods
- genetic algorithms
- differential evolution
many others...
Brent's method in 1D
Synopsis: bracket the minimum with 3 points and use Brent's usual tricks.
Tracks 6 individual points: always 2 brackets, plus a third point lower than both brackets
quadratic (parabolic) interpolation + golden-section steps
similar point-ID shuffling to the root-finding version
similar conditions for accepting an interpolation step
can be improved with derivative information
[figure: successive parabolas through the tracked points (e.g. through points 1-2-3, then 1-2-4) homing in on the minimum]
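The flavour of the method can be sketched in a few lines. This is a simplified stand-in, not Brent's full algorithm: it keeps a 3-point bracket, takes the parabolic-interpolation step when the parabola's vertex is useful, and falls back to a golden-section step otherwise. The function name, tolerance, and iteration cap are assumptions.

```python
def bracket_minimise(f, a, b, c, tol=1e-8, max_iter=100):
    """Simplified Brent-style minimiser. Requires a bracket a < b < c
    with f(b) < f(a) and f(b) < f(c)."""
    golden = 0.3819660112501051          # golden-section fraction
    fa, fb, fc = f(a), f(b), f(c)
    for _ in range(max_iter):
        if c - a < tol:
            break
        # Vertex of the parabola through (a, fa), (b, fb), (c, fc)
        num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        x = b - 0.5 * num / den if den != 0 else a
        if not (a < x < c) or x == b:
            # Unhelpful parabola: golden-section step into larger side
            if c - b > b - a:
                x = b + golden * (c - b)
            else:
                x = b - golden * (b - a)
        fx = f(x)
        # Shrink the bracket, keeping the lowest point in the middle
        if fx < fb:
            if x > b:
                a, fa = b, fb
            else:
                c, fc = b, fb
            b, fb = x, fx
        else:
            if x > b:
                c, fc = x, fx
            else:
                a, fa = x, fx
    return b, fb
```

The real Brent's method adds the bookkeeping of six points and stricter conditions on when an interpolation step is allowed, which is what makes it robust.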
Steepest descent
Synopsis: follow the gradient downhill until you hit a local (line) minimum; reassess. At each line minimum you always hang a 90-degree turn, left or right. Works (for local minima), but inefficient. Requires a 1D minimisation routine (e.g. Brent's).
Variants on steepest descent
The general idea of line minimisation can be improved. The improvement comes from a better directional basis set: direction set methods. There are many ways to choose the basis; the goal is to choose directions such that successive line minimisations don't interfere with each other. These still use 1D minimisation along a line, so still require Brent's method or similar.
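A minimal sketch of plain steepest descent with line minimisation, under some stated assumptions: numerical central-difference gradients, a fixed step interval [0, 1] for the line search, and a golden-section routine standing in for Brent's method. All function names are invented for the sketch.

```python
def grad(f, x, h=1e-6):
    """Central-difference numerical gradient of f at point x (a list)."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def line_min(f, x, d, t_max=1.0, n=60):
    """Golden-section search for the best step t in [0, t_max] along d
    (a simple stand-in for a Brent-style 1D minimiser)."""
    invphi = (5.0 ** 0.5 - 1.0) / 2.0
    phi = lambda t: f([xi + t * di for xi, di in zip(x, d)])
    a, b = 0.0, t_max
    c, e = b - invphi * (b - a), a + invphi * (b - a)
    fc, fe = phi(c), phi(e)
    for _ in range(n):
        if fc < fe:
            b, e, fe = e, c, fc
            c = b - invphi * (b - a)
            fc = phi(c)
        else:
            a, c, fc = c, e, fe
            e = a + invphi * (b - a)
            fe = phi(e)
    return 0.5 * (a + b)

def steepest_descent(f, x, steps=50):
    for _ in range(steps):
        g = grad(f, x)
        # Straight downhill; at an exact line minimum the next gradient
        # is perpendicular to d -- the "90-degree turns" of the synopsis
        d = [-gi for gi in g]
        t = line_min(f, x, d)
        x = [xi + t * di for xi, di in zip(x, d)]
    return x
```

On an anisotropic bowl like f = (x0-1)^2 + 2(x1+2)^2 the zig-zagging is visible in the iterates, which is exactly the inefficiency that direction set methods fix.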
The downhill simplex method
Synopsis: ooze down the slope and around corners like a blob of goo (or an amoeba).
short, simple, fun, effective
works in any dimension
no brackets, derivatives or line minimisations required
still only good for local minima
1 Evaluate f(x) at the corners of the simplex
2 Find the worst-fit corner
3 Replace the worst-fit corner with a new point, reflected across the remaining points
4 If the new point is awesome*, extend the simplex further in the same direction (reflection and expansion)
5 If the new point is terrible, discard it and try a 1D contraction instead
6 If the 1D contraction is also terrible, do a multi-d contraction about the best corner
*awesome = better than the best fit; terrible = worse than the second-worst fit; OK = in between
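The steps above can be sketched compactly for a 2-D function. This is a minimal rendering of the simplex moves (Nelder-Mead style); the reflection/expansion/contraction coefficients and the fixed iteration count are standard but assumed here, and there is no proper stopping rule.

```python
def nelder_mead(f, simplex, n_iter=200):
    """Downhill simplex sketch in 2D. simplex: list of three (x, y) tuples."""
    for _ in range(n_iter):
        simplex.sort(key=f)                  # best first, worst last
        best, second, worst = simplex
        # Centroid of the face opposite the worst-fit corner
        cx = 0.5 * (best[0] + second[0])
        cy = 0.5 * (best[1] + second[1])
        refl = (2 * cx - worst[0], 2 * cy - worst[1])        # reflection
        if f(refl) < f(best):                                # "awesome"
            expd = (3 * cx - 2 * worst[0], 3 * cy - 2 * worst[1])
            simplex[2] = expd if f(expd) < f(refl) else refl # + expansion
        elif f(refl) < f(second):                            # "OK"
            simplex[2] = refl
        else:                                                # "terrible"
            contr = (0.5 * (cx + worst[0]), 0.5 * (cy + worst[1]))
            if f(contr) < f(worst):
                simplex[2] = contr                           # 1D contraction
            else:                                            # multi-d contraction
                simplex[1] = (0.5 * (best[0] + second[0]),   # about best corner
                              0.5 * (best[1] + second[1]))
                simplex[2] = (0.5 * (best[0] + worst[0]),
                              0.5 * (best[1] + worst[1]))
    simplex.sort(key=f)
    return simplex[0]
```

Watching the simplex on a contour plot makes the amoeba analogy obvious: it stretches downhill and squeezes through valleys.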
MCMCs
Synopsis: jump around like a particle diffusing down a gradient.
A biased random walk. Trotta's example: like an elephant on the savannah looking for water
- wanders randomly until it finds a few puddles
- moves generally and stochastically around the surrounding area until it sights a bigger puddle
- in doing so, moves on average in the direction of more puddles
- ... until it finds a stream ...
- follows the stream to the jackpot
Definition 1 (Monte Carlo): direct simulation of some stochastic process by drawing repeated samples from a known distribution
Definition 2 (Markov Chain): a string of system states/samples where each state depends only on the previous one
Definition 3 (Markov Chain Monte Carlo): a Monte Carlo sampling from a distribution where each new sample is drawn with some reference to the last
Metropolis-Hastings sampling
One particular sampling scheme for generating Markov Chains. The best known, with nice statistical properties (more later).
Randomly generate a new proposed point x_maybe
Test whether f_haystack(x_maybe) < f_haystack(x_current)
If so, x_maybe becomes x_new
If not, x_maybe becomes x_new with probability f_haystack(x_current)/f_haystack(x_maybe)... and x_current stays as x_new with probability 1 - f_haystack(x_current)/f_haystack(x_maybe)
Proposal functions
Q: How do you generate the proposed point?
A: You need a proposal function P(x). Generally some local distribution centred on the current point, e.g. a product of 1D Gaussians in every direction, or a multi-d Gaussian pdf (not the same thing!!)
Proposal functions & burn-in
Ideally P ≈ f_haystack in the vicinity of x_current, but this is not usually practical. P should be chosen adaptively to get the best approximation to f_haystack, e.g. by analysing previous points and adjusting σ for a Gaussian P. After a suitable number of steps, the memory of the starting point is gone: this is the burn-in period, and all points generated during burn-in should be discarded. The proposal function may be fixed after burn-in (more later).
MCMC step by step (for minimisation)
1 Initialise P
2 Choose a random starting point z
3 Take a Metropolis-Hastings step:
  a. Choose a proposal point y from P
  b. If α ≡ f(z)/f(y) ≥ 1, accept y as the new z
  c. Otherwise (α < 1), generate a random uniform deviate β
  d. If β < α, accept y as the new z
  e. Otherwise, z remains the same
4 If burn-in is still going, adjust P (usually on the basis of previous points)
5 If burn-in is finished, test for convergence
6 Repeat from Step 3
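The Metropolis-Hastings loop above can be sketched in 1D. Assumptions in this sketch: a Gaussian proposal of fixed width (no adaptive P, no convergence test), f strictly positive so the ratio α = f(z)/f(y) is well defined, and illustrative values for the chain length and burn-in.

```python
import random

def mcmc_minimise(f, z0, sigma=0.5, n_steps=5000, burn_in=1000, seed=1):
    """Metropolis-Hastings chain for minimising a positive function f in 1D."""
    rng = random.Random(seed)
    z, chain = z0, []
    for step in range(n_steps):
        y = rng.gauss(z, sigma)          # proposal P: Gaussian about z
        alpha = f(z) / f(y)              # acceptance ratio
        if alpha >= 1 or rng.random() < alpha:
            z = y                        # accept the proposed point
        if step >= burn_in:              # discard burn-in samples
            chain.append(z)
    return min(chain, key=f), chain

best, chain = mcmc_minimise(lambda x: (x - 3.0) ** 2 + 1.0, 0.0)
print(best)   # near the minimum at x = 3
```

Moves downhill (α ≥ 1) are always accepted; moves uphill are accepted with probability α < 1, which is what lets the walker escape shallow local minima.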
Statistical features of MCMCs
Bayesians love MCMCs... the MCMC procedure ensures that the density of points in the chain is proportional to the value of f_haystack. This makes marginalising (integrating) over uninteresting parameters easy: just sum up the number of points. The proposal function must be fixed for this property to hold ⇒ it is extra important to throw out the burn-in points.
MCMCs and similar algorithms can also be good for frequentists: don't fix the proposal function, let it keep optimising itself on the go to find the global minimum, and use a very strict convergence criterion.
Convergence
Local minima are an issue: it is easy to get stuck if a local mode is wider/deeper than the proposal function. You need to use multiple chains with different starting values, and combine the results.
Convergence criteria: the coarsest option is to test the variance σ²_running of the last few points in the chain
- σ²_running < σ²_threshold ⇒ the chain has found a minimum (local or global)
- very rough, but OK(ish) if you know f_haystack is unimodal
Criteria can also be defined in terms of the fractional change in the Bayesian evidence (∫ f(x) dx). There are many other, more sophisticated schemes. Some people just use a constant chain length; this can be risky.
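The coarse running-variance test is a one-liner in practice. The window size and threshold below are illustrative assumptions; in real use they would be tuned to the problem.

```python
def running_variance(chain, window=50):
    """Variance of the last `window` values in the chain."""
    tail = chain[-window:]
    mean = sum(tail) / len(tail)
    return sum((v - mean) ** 2 for v in tail) / len(tail)

def converged(chain, window=50, threshold=1e-4):
    """Coarse convergence test: the last `window` points have settled."""
    return len(chain) >= window and running_variance(chain, window) < threshold
```

As the slide warns, this only detects that the chain has settled somewhere, not that it has settled at the global minimum.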
A couple of other random points...
Temperature: chains can be assigned different temperatures T such that
α ≡ (f(z)/f(y))^(1/T) = exp[(ln f(z) - ln f(y))/T] = exp(ln f(z)/T) / exp(ln f(y)/T)
T > 1 ⇒ for α < 1, α goes up relative to a normal MCMC ⇒ steps are more easily accepted. This is like giving the jumpy, diffusive particle a higher temperature, and allows it to skip over local minima more easily. Combining chains with different T breaks the statistical properties (I think).
Alternative sampling methods: Gibbs sampling, slice sampling, others...
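A quick numerical check of the temperature claim, with invented illustrative values: raising the acceptance ratio to the power 1/T with T > 1 pushes any α < 1 toward 1, so uphill steps are accepted more readily.

```python
def alpha_T(f_z, f_y, T):
    """Tempered Metropolis-Hastings acceptance ratio (f(z)/f(y))^(1/T)."""
    return (f_z / f_y) ** (1.0 / T)

a_cold = alpha_T(1.0, 4.0, T=1.0)   # ordinary MCMC: alpha = 0.25
a_hot = alpha_T(1.0, 4.0, T=4.0)    # hotter chain: alpha = 0.25^(1/4) ~ 0.71
print(a_cold, a_hot)
```

The hot chain accepts this uphill step almost three times as often, which is exactly how tempering helps a walker hop between modes.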
A few MCMC examples in research
[figure: MCMC scan in the (m_1/2, m_0) plane, axes in TeV; Putze et al. (2010)]
When to use which method?
... as always, this is problem-specific... make sure to try a few.
For 1D where you can bracket the minimum, Brent's method is best.
For multi-d with few modes, direction-set-type methods do OK.
For multi-d with many modes, and/or a badly-behaved f, you need MCMC / MultiNest / GAs.
Housekeeping
Next lecture: Monday Feb 4, Numerical Integration I