Metropolis Algorithm


A6523 Modeling, Inference, and Mining — Jim Cordes, Cornell University

Lecture: MCMC example

Reading:
- Gregory (chapters assigned previously)
- MacKay, Chapter 29 (Monte Carlo Methods), http://…/book.pdf
- An Introduction to MCMC for Machine Learning (Andrieu et al., Machine Learning), http://link.springer.com/article/…
- Genetic Algorithms: Principles of Natural Selection Applied to Computation (Stephanie Forrest), http://science.sciencemag.org/content/…

Webpage: Projects! Abstract, paper, presentation (May).

Metropolis Algorithm

From the current state X_t, some other state is proposed as X_{t+1} and accepted with acceptance probability a (rejection probability 1 − a). Choose a such that:
- the probabilities of reaching different values of X are given by the target PDF;
- the target PDF is reached asymptotically, at a rate that depends on the proposal PDF used to generate trial values of X_{t+1};
- detailed balance is achieved (as many transitions out of as into a given state), which also means that the Markov sequence is time reversible.

Determining the acceptance probability

On previous pages we used the true transition matrix q(x'|x) that defines the Markov chain and that has the target PDF as its eigen-PDF. For MCMC problems we are free to choose any transition matrix we like, but its performance may or may not be suitable for a particular application. As Gregory says, finding an ideal proposal distribution is an art.

So let a candidate transition matrix be Q(x'|x), normalized in the usual way: Σ_{x'} Q(x'|x) = 1. Generally Q will not satisfy detailed balance for the target PDF:

P(x') Q(x|x') ≠ P(x) Q(x'|x).

We fix this by putting in a fudge factor a(x'|x) that multiplies the proposal density:

P(x) Q(x'|x) a(x'|x) = P(x') Q(x|x'),   or   a(x'|x) = P(x') Q(x|x') / [P(x) Q(x'|x)].

We don't want the factor to exceed unity, however, so we write

a(x'|x) = min{ 1, P(x') Q(x|x') / [P(x) Q(x'|x)] }.

MCMC exploits this convergence to the ensemble state probabilities. The simplest form of the algorithm:

1. Choose a proposal density Q(y|x_t) that will be used to determine the value of x_{t+1}. Suppose that this proposal density is symmetric in its arguments.
2. Generate a value y from the proposal density.
3. Calculate the test ratio a = P(y)/P(x_t). The test ratio is the acceptance probability for the candidate sample y.
4. Choose a random number u ∈ [0, 1].
5. If a ≥ 1, accept the sample and set x_{t+1} = y.
6. If a < 1, accept y if u ≤ a and set x_{t+1} = y.
7. Otherwise set x_{t+1} = x_t (i.e. the new value equals the previous value).
8. Each time step has a value.
9. The sampling steers the time sequence favorably toward regions of higher probability but allows the trajectory to move to regions of low probability.
10. Samples are correlated, as with a random-walk type process.
11. The burn-in time corresponds to the initial, transient portion of the time series x_t that it takes the Markov process to converge. Often the autocorrelation function of the time sequence is used to diagnose the time series.
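The following is a minimal Python sketch of this algorithm for a one-dimensional target with a symmetric (random-walk) proposal. It is not code from the notes; the function and parameter names are illustrative.

```python
import numpy as np

def metropolis(log_target, x0, n_steps, prop_sigma=1.0, rng=None):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    log_target : function returning log P(x) up to an additive constant
    x0         : starting state
    n_steps    : number of MCMC steps
    prop_sigma : standard deviation of the Gaussian proposal (tuning parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_steps)
    x, logp_x = x0, log_target(x0)
    for t in range(n_steps):
        y = x + prop_sigma * rng.standard_normal()   # draw from Q(y | x_t)
        logp_y = log_target(y)
        # test ratio a = P(y)/P(x_t); accept if a >= 1 or with probability a
        if np.log(rng.uniform()) < logp_y - logp_x:
            x, logp_x = y, logp_y
        chain[t] = x                                  # every step records a value
    return chain

# Example: sample a unit-variance Gaussian target with mean 3 (illustrative values)
if __name__ == "__main__":
    log_p = lambda x: -0.5 * (x - 3.0) ** 2
    samples = metropolis(log_p, x0=0.0, n_steps=20000, prop_sigma=2.0)
    print(samples[5000:].mean(), samples[5000:].std())  # discard burn-in
```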

[Figure from MacKay: the Metropolis–Hastings method in one dimension, showing the target density P(x) and the proposal density Q(x'; x) centered on two different current states x^(1) and x^(2). The proposal distribution Q(x'; x) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.]

For general, possibly asymmetric forms of the transition matrix, the test ratio is

a = P(y) Q(x_t | y) / [P(x_t) Q(y | x_t)].

It reduces to the previous form when Q is symmetric in its arguments. This form preserves detailed balance of the Markov process (meaning that statistically the same results are obtained under time reversal), which is required in order for the state probability vector to converge to the desired target PDF.

A system in thermal equilibrium has as many particles leaving a state as entering it. By analogy, a Markov process that has stationary statistics must also satisfy detailed balance. With the acceptance probability defined above, the Markov chain will satisfy detailed balance. See Gregory for a proof; also the paper by Andrieu et al. on the course web page.
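To make the asymmetric case concrete, here is a minimal sketch (not from the lecture) of the general MH accept/reject test, assuming the user supplies log-density functions; the names log_target and log_q are illustrative.

```python
import numpy as np

def mh_accept(log_target, log_q, x, y, rng=None):
    """General Metropolis-Hastings test for a proposed move x -> y.

    log_target(z) : log P(z) up to an additive constant
    log_q(a, b)   : log Q(a | b), the log proposal density of a given b
    Implements a = P(y) Q(x_t | y) / [P(x_t) Q(y | x_t)], accepting with prob. min(1, a).
    """
    rng = np.random.default_rng() if rng is None else rng
    log_a = (log_target(y) + log_q(x, y)) - (log_target(x) + log_q(y, x))
    return np.log(rng.uniform()) < min(0.0, log_a)
```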

[Excerpt reproduced from the paper:]

Machine Learning, © Kluwer Academic Publishers. Manufactured in The Netherlands.

An Introduction to MCMC for Machine Learning

CHRISTOPHE ANDRIEU (C.Andrieu@bristol.ac.uk), Department of Mathematics, Statistics Group, University of Bristol, UK.
NANDO DE FREITAS (nando@cs.ubc.ca), Department of Computer Science, University of British Columbia, Vancouver, Canada.
ARNAUD DOUCET (doucet@ee.mu.oz.au), Department of Electrical and Electronic Engineering, University of Melbourne, Australia.
MICHAEL I. JORDAN (jordan@cs.berkeley.edu), Departments of Computer Science and Statistics, University of California at Berkeley, USA.

Abstract. The purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing an introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.

Keywords: Markov chain Monte Carlo, MCMC, sampling, stochastic algorithms.

Introduction. A recent survey places the Metropolis algorithm among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl & Sullivan). This algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC). These algorithms have played a significant role in statistics, econometrics, physics and computing science over the last two decades. There are several high-dimensional problems, such as computing the volume of a convex body in d dimensions, for which MCMC simulation is the only known general approach for providing a solution within a reasonable time (polynomial in d) (Dyer, Frieze, & Kannan; Jerrum & Sinclair).

While convalescing from an illness, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully (Eckhardt, 1987). After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.

[Figures from Andrieu et al.: pseudocode of the Metropolis–Hastings algorithm, and the target distribution with histograms of the MCMC samples at different iteration points.]

The MH algorithm is very simple, but it requires careful design of the proposal distribution q(x*|x). In subsequent sections, we will see that many MCMC algorithms arise by considering specific choices of this distribution. In general, it is possible to use suboptimal inference and learning algorithms to generate data-driven proposal distributions. The transition kernel for the MH algorithm is

K_MH(x^(i+1) | x^(i)) = q(x^(i+1) | x^(i)) A(x^(i), x^(i+1)) + δ_{x^(i)}(x^(i+1)) r(x^(i)),

Toy examples of MCMC using Gaussian target and proposal PDFs

The target PDF is N(μ, σ²). For a proposal PDF we use N(μ_p, σ_p²), wide enough that generated values overlap with the target PDF; so use μ_p = 0 and σ_p of order |μ| plus a few target widths. In practice, of course, we would not know the parameters of the target PDF (otherwise what would be the point of doing MCMC?) and we might not know its support in parameter space. Experimentation may be required to ensure that the parameter space is adequately sampled.

Plots:
- Histograms of MC points x_t, t = 1, ..., N for different N and different μ and σ.
- Autocovariance functions of the MC time series x_t for single realizations, which show the correlation time for μ̂_x.

Lessons: the more the target and proposal PDFs differ, the longer it takes for the time series to show stationary statistics that conform to the target PDF. The burn-in time is thus longer in such cases because it is related to the autocorrelation time. Example time series are shown for two of the cases that illustrate the burn-in time and the correlation time.
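A sketch of this toy experiment (assuming NumPy; the target parameters and proposal width below are illustrative, not the values used for the figures). Because the zero-mean proposal is drawn independently of the current state, the full MH ratio is used here to correct for the asymmetry; the notes' simple Metropolis ratio applies when the proposal is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.9             # target N(mu, sigma^2), illustrative values
sigma_p = abs(mu) + 4 * sigma    # proposal width wide enough to span the target

def log_target(x):
    return -0.5 * ((x - mu) / sigma) ** 2

def log_prop(x):
    return -0.5 * (x / sigma_p) ** 2

x = 0.0
chain = []
for t in range(50000):
    y = sigma_p * rng.standard_normal()           # independent draw from the proposal
    log_a = (log_target(y) + log_prop(x)) - (log_target(x) + log_prop(y))
    if np.log(rng.uniform()) < log_a:
        x = y
    chain.append(x)

chain = np.array(chain)
print("target mu, sigma:", mu, sigma)
print("chain  mu, sigma:", chain[1000:].mean(), chain[1000:].std())  # after burn-in
```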

Histograms

These demonstrate how the distribution of MC points trends to the target PDF. Target PDF = Gaussian with non-zero mean; proposal PDF = N(0, σ_p²) with σ_p wide enough to span the target PDF.

[Figure: histograms of MCMC samples of an offset Gaussian target PDF using a zero-mean Gaussian proposal PDF, for increasing chain lengths N; each panel lists the target (μ, σ) and the sample estimates (μ̂, σ̂), with the target and proposal PDFs overplotted.]

[Figure, continued: the same histogram panels annotated with the target μ, σ and with μ, σ estimated from the MC values; as N grows, the histograms and the estimates converge to the target PDF.]

[Figure: histograms for an even broader target PDF, with the proposal PDF made proportionately broader.]

Four cases with different target PDFs. Even for target PDFs with large means, we obtain convergence.

[Figures: histogram panels for additional cases, including a narrow target PDF; note that only a few distinct states appear over the first MC samples.]

[Figures: histogram panels for a broader target PDF and for a sequence of progressively narrower target PDFs.]

[Figures: further histogram panels for the narrow-target cases.]

ACFs of MCMC-generated Time Series

- The width of the ACF gives the correlation time of the time series.
- Too long a correlation time → inefficient sampling of parameter space.
- Longer correlation times correspond to proposal PDFs that have larger support relative to the support of the target PDF.

Time Series of MCMC Samples

[Figure: time series of MCMC samples for a case with a wide target PDF; the target (μ, σ) and sample estimates (μ̂, σ̂) are listed, with the sample value plotted against time (steps).]
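A sketch of estimating the autocovariance function and a crude correlation time from a chain (assuming NumPy; the 1/e criterion is one simple, illustrative choice).

```python
import numpy as np

def autocovariance(x):
    """Biased sample autocovariance of a 1-D MCMC time series, lags 0..N-1."""
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    return np.correlate(x, x, mode="full")[n - 1:] / n

def correlation_time(x):
    """Crude correlation time: first lag where the normalized ACF drops below 1/e."""
    acf = autocovariance(x)
    acf = acf / acf[0]
    below = np.where(acf < np.exp(-1))[0]
    return below[0] if below.size else len(acf)

# Usage (with 'chain' from a sampler such as the sketches above):
# tau = correlation_time(chain); keep roughly one sample per correlation time.
```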

Time Series of MCMC Samples

[Figure: time series for a case with a narrow target PDF, for two realizations; the burn-in time is visible as the initial transient before the chain settles onto the target distribution.]

[Figure: the full autocovariance function (ACV) of the chain versus lag (time steps), and a zoom in to the innermost lags (same case, different realization).]

Relatively wide target PDF.

[Figures: ACV versus lag (time steps) for two realizations.]

Wider target PDF → narrower ACF.

[Figures: ACV versus lag for the wider-target case.]

Narrower target PDF → wider ACF.

[Figures: ACV versus lag (time steps) for two narrow-target cases.]

Narrower target PDF → wider ACF.

[Figure: ACV versus lag (time steps) for the narrowest-target case.]

Unsuitable Proposal PDFs

[Figures only on the following pages: examples of MCMC runs with unsuitable proposal PDFs.]


Gibbs Sampling

Gibbs sampling in MCMC is a simplified way to explore an N-dimensional space. It proceeds by MC-sampling points along each axis sequentially, so it is similar to the 2D examples in class and in MacKay Chapter 29 (a minimal sketch appears after the reproduced abstract below).

[Reproduced from Statistics and Computing:] A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Cajo J. F. Ter Braak.

Abstract: Differential Evolution (DE) is a simple genetic algorithm for numerical optimization in real parameter spaces. In a statistical context one would not just want the optimum but also its uncertainty. The uncertainty distribution can be obtained by a Bayesian analysis (after specifying prior and likelihood) using Markov Chain Monte Carlo (MCMC) simulation. This paper integrates the essential ideas of DE and MCMC, resulting in Differential Evolution Markov Chain (DE-MC). DE-MC is a population MCMC algorithm, in which multiple chains are run in parallel. DE-MC solves an important problem in MCMC, namely that of choosing an appropriate scale and orientation for the jumping distribution. In DE-MC the jumps are simply a fixed multiple of the differences of two random parameter vectors that are currently in the population. The selection process of DE-MC works via the usual Metropolis ratio, which defines the probability with which a proposal is accepted. In tests with known uncertainty distributions, the efficiency of DE-MC with respect to random walk Metropolis with optimal multivariate Normal jumps ranged from moderate for small population sizes to substantially higher for large population sizes, and higher still for the 97.5% point of a variable from a multi-dimensional Student distribution. Two Bayesian examples illustrate the potential of DE-MC in practice. DE-MC is shown to facilitate multidimensional updates in a multi-chain Metropolis-within-Gibbs sampling approach. The advantages of DE-MC over conventional MCMC are simplicity, speed of calculation and convergence, even for nearly collinear parameters and multimodal densities.
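A minimal illustration (not from the notes) of sequential axis-by-axis sampling for a two-dimensional Gaussian target, assuming NumPy; the correlation value is illustrative. Each coordinate is drawn exactly from its conditional distribution given the current value of the other.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, rng=None):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit variances and
    correlation rho: each coordinate is updated in turn from its exact conditional."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(1.0 - rho ** 2)                     # conditional standard deviation
    x1, x2 = 0.0, 0.0
    out = np.empty((n_steps, 2))
    for t in range(n_steps):
        x1 = rho * x2 + s * rng.standard_normal()   # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rho * x1 + s * rng.standard_normal()   # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        out[t] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.9, n_steps=20000)
print(np.corrcoef(samples[2000:].T))                # close to 0.9 after burn-in
```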

Bayesian Spectral Estimation

A Parametric Bayesian Approach to Spectral Analysis

Consider a stochastic process whose (ensemble-average) power spectrum is S(f). From a data vector x = col(x(t_1), ..., x(t_n)) we want to estimate the spectrum. The covariance matrix C for x has elements equal to appropriate values of the autocorrelation function R(τ), which can be calculated from the Fourier transform of the power spectrum (Wiener–Khinchin theorem):

C = ⟨x x†⟩ = {C_ij} = {R(t_i − t_j)}.

The spectrum has parameters θ: S(f) = S(f; θ), so the covariance matrix is also a function of the parameters, C = C(θ). A quadratic form for the data (similar to a cost function) is

Q = x† C⁻¹(θ) x.

If x(t) is a Gaussian process, then the likelihood function is

L(θ) = (2π)^{−n/2} [det C(θ)]^{−1/2} exp[ −(1/2) x† C⁻¹(θ) x ].

If the prior PDF for the parameters is f(θ), the posterior PDF for θ is

f(θ | x) = f(θ) L(θ) / ∫ dθ f(θ) L(θ).
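As a concrete numerical sketch of evaluating this likelihood (not code from the notes), the following assumes NumPy/SciPy and uses the simplest covariance model, white noise with a single variance parameter, as the parametric spectrum; the function names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(x, C):
    """Gaussian-process log-likelihood
    ln L(theta) = -1/2 [ x^T C^-1 x + ln det C + n ln 2*pi ]
    for a real data vector x and covariance matrix C = C(theta)."""
    n = len(x)
    cf = cho_factor(C, lower=True)
    quad = x @ cho_solve(cf, x)                       # x^T C^-1 x via Cholesky solves
    logdet = 2.0 * np.sum(np.log(np.diag(cf[0])))     # ln det C from the Cholesky factor
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

def C_white(n, sigma2):
    """Illustrative parametric covariance: white noise of variance sigma2."""
    return sigma2 * np.eye(n)

rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=200)
print(log_likelihood(x, C_white(len(x), 4.0)))        # evaluate at one parameter value
```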

Examples

White noise: The covariance matrix is diagonal and, if all elements are equal ("homoscedastic"), the only parameter is σ², and the ACF is R(τ) = σ² δ_{τ,0}. So the spectral estimate is simply Ŝ(f) = σ̂²/B, where σ̂² is the posterior estimate of the variance and B is the bandwidth. Note that it is consistent to use a band-limited process for white noise that is a discrete process. For a continuous process, the Dirac delta function that would apply is formally inconsistent with a finite bandwidth. Real-world cases do not have this issue!

Power-law power spectrum: This case is tricky because a spectral cutoff is required to avoid a divergence in the total variance, but a finite data set may not sample the entire extent of the spectrum. Let the spectrum have the form

S(f) = S₀ f^{−γ},   f ≥ f_L,

where the lower-frequency cutoff f_L is required to keep the variance finite for steep spectral indices. We will ignore shallower power laws; they require an upper frequency cutoff to keep the total variance finite. The covariance matrix elements C_ij follow from the Wiener–Khinchin transform of this spectrum; van Haasteren and Levin (MNRAS) give a closed-form expression consisting of a term proportional to (2π f_L τ_ij)^{γ−1} plus an infinite series in even powers of (2π f_L τ_ij), where τ_ij = t_i − t_j. In general, the model parameters are θ = col(S₀, f_L, γ). For finite data sets, terms in the infinite sum can be truncated beyond some n when f_L τ_max ≪ 1 (van Haasteren and Levin). In some applications for pulsars, where a quadratic polynomial is removed from data of length T, the dependence of C_ij on the cutoff f_L is removed in some cases. If there is no dependence on f_L, the model parameters are θ = col(S₀, γ). When the time series duration T satisfies T f_L ≪ 1, we expect the covariance matrix to be independent of f_L because the lowest-frequency sinusoids contained in the spectrum are effectively constant over the interval [0, T].

Steep power-law spectra yield time series with sample variances that vary radically from realization to realization. For steep spectral indices, values of the variance are spread over two orders of magnitude or more (Shannon and Cordes, ApJ). This implies that the spectral estimate from an individual realization may give a significantly biased result for at least the amplitude of the spectrum Ŝ₀ and perhaps for the spectral index γ̂. Numerical experiments indicate that the Bayesian estimates from simulated spectra recover the input parameters reasonably well.

Bayesian Inference on a Chirped Sinusoid

Chirped Signals: A Bayesian Approach to Spectral Analysis

Chirped signals are oscillating signals with time-variable frequencies, usually with a linear variation of frequency with time, e.g.

f(t) = A cos(ω t + α t² + φ).

Examples:
- plasma wave diagnostic signals
- signals propagated through dispersive media (seismic cases, plasmas)
- gravitational waves from inspiraling binary stars
- Doppler-shifted signals over fractions of an orbit (e.g. acceleration of a pulsar in its orbit)

Jaynes' approach to spectral analysis: cf. Jaynes, "Bayesian Spectrum and Chirp Analysis," in Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems. Cited by Bretthorst in Bayesian Spectrum Analysis and Parameter Estimation, and briefly by Gregory in Bayesian Logical Data Analysis for the Physical Sciences.

Result: Optimal processing is a nonlinear operation on the data, without recourse to smoothing. However, the DFT-based spectrum (the "periodogram") plays a key role in the estimation.

Fresnel function: c(t) = e^{iαt²}. What is the FT of c(t)?

Start with Bayes' theorem:

p(H|DI) = p(H|I) p(D|HI) / p(D|I),

i.e. posterior probability = prior probability × (probability of the new data under H) / (evidence).

In this context, probabilities represent a simple mapping of degrees of belief onto real numbers. Recall:

p(D|HI) vs. D for fixed H = sampling distribution;
p(D|HI) vs. H for fixed D = likelihood function.

Read H as a statement that a parameter vector lies in a region of parameter space.

Data model:

y(t) = f(t) + e(t),
f(t) = A cos(ω t + α t² + φ), with ω = ω₀ and α = α₀ for the data,
e(t) = white Gaussian noise, ⟨e⟩ = 0, ⟨e²⟩ = σ².

Data set: D = {y(t), −T ≤ t ≤ T}, N = 2T + 1 data points.
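To make the data model concrete, a minimal simulation sketch in Python follows (the numerical values are illustrative, not taken from the lecture):

```python
import numpy as np

# Simulated data set of the form used here:
#   y(t) = A cos(w0 t + a0 t^2 + phi) + e(t),  t = -T, ..., T  (N = 2T + 1 samples),
# with white Gaussian noise of variance sigma^2.
rng = np.random.default_rng(42)
T = 256
t = np.arange(-T, T + 1)                   # N = 2T + 1 samples
A, w0, a0, phi = 1.0, 0.20, 2.0e-4, 0.7    # amplitude, frequency, chirp rate, phase
sigma = 1.0
y = A * np.cos(w0 * t + a0 * t**2 + phi) + sigma * rng.standard_normal(t.size)
```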

Data probability: The probability of obtaining a data set of N samples is

P(D|HI) = Π_t P[y(t)] = Π_{t=−T}^{T} (2πσ²)^{−1/2} exp{ −[y(t) − f(t)]² / 2σ² },

which we can rewrite as a likelihood function once we acquire a data set and evaluate the probability for a specific H. Writing out the parameters explicitly, the likelihood function is

L(A, ω, α, φ) ∝ exp{ −(1/2σ²) Σ_{t=−T}^{T} [y(t) − A cos(ω t + α t² + φ)]² }.

For simplicity, assume that ωT ≫ 1 so that many cycles of oscillation are summed over. Then

Σ_t cos²(ω t + α t² + φ) = Σ_t (1/2)[1 + cos 2(ω t + α t² + φ)] ≈ (2T + 1)/2 = N/2.

Expanding the argument of the exponential in the likelihood function, we have

[y(t) − A cos(ω t + α t² + φ)]² = y²(t) + A² cos²(ω t + α t² + φ) − 2A y(t) cos(ω t + α t² + φ).

We care only about terms that are functions of the parameters, so we drop the y²(t) term to get

Σ_{t=−T}^{T} [y(t) − A cos(ω t + α t² + φ)]² → Σ_t [A² cos²(ω t + α t² + φ) − 2A y(t) cos(ω t + α t² + φ)] ≈ NA²/2 − 2A Σ_t y(t) cos(ω t + α t² + φ).

The likelihood function becomes

L(A, ω, α, φ) ∝ exp{ −NA²/4σ² + (A/σ²) Σ_t y(t) cos(ω t + α t² + φ) }.

Integrating out the phase: In calculating a power spectrum [in this case, a chirped power spectrum (a "chirpogram")], we do not care about the phase of any sinusoid in the data. In Bayesian estimation, such a parameter is called a nuisance parameter. Since we do not know anything about φ, we integrate over its prior distribution, a PDF that is

uniform over [0, 2π]:

f(φ) = 1/2π for 0 ≤ φ ≤ 2π, and 0 otherwise.

The marginalized likelihood function becomes

L(A, ω, α) = ∫ dφ f(φ) L(A, ω, α, φ) ∝ exp(−NA²/4σ²) (1/2π) ∫₀^{2π} dφ exp{ (A/σ²) Σ_t y(t) cos(ω t + α t² + φ) }.

Using the identity

cos(ω t + α t² + φ) = cos(ω t + α t²) cos φ − sin(ω t + α t²) sin φ,

we have

Σ_t y(t) cos(ω t + α t² + φ) = cos φ Σ_t y(t) cos(ω t + α t²) − sin φ Σ_t y(t) sin(ω t + α t²) ≡ P cos φ − Q sin φ,

and

P cos φ − Q sin φ = (P² + Q²)^{1/2} cos[φ + tan⁻¹(Q/P)].

This result may be used to evaluate the integral over φ in the marginalized likelihood function. To evaluate it we use the identity

I₀(x) = (1/2π) ∫₀^{2π} dφ e^{x cos φ} = modified Bessel function.

This yields

(1/2π) ∫₀^{2π} dφ exp{ (A/σ²)(P² + Q²)^{1/2} cos[φ + tan⁻¹(Q/P)] } = I₀[ (A/σ²)(P² + Q²)^{1/2} ],

since the phase shift tan⁻¹(Q/P) is irrelevant over a full period. We now simplify P² + Q²:

P² + Q² = [Σ_t y(t) cos(ω t + α t²)]² + [Σ_t y(t) sin(ω t + α t²)]²
        = Σ_t Σ_{t'} y(t) y(t') [cos(ω t + α t²) cos(ω t' + α t'²) + sin(ω t + α t²) sin(ω t' + α t'²)]
        = Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)].

Define

C(ω, α) ≡ (1/N)(P² + Q²) = (1/N) Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)].

Then the integral over φ gives

(1/2π) ∫ dφ L(A, ω, α, φ) ∝ I₀[ (A/σ²)(N C(ω, α))^{1/2} ],

and the marginalized likelihood is

L(A, ω, α) = e^{−NA²/4σ²} I₀[ (A/σ²)(N C(ω, α))^{1/2} ].

Notes:

(1) The data appear only in C(ω, α).

(2) C is a sufficient statistic, meaning that it contains all information from the data that is relevant to inference using the likelihood function.

(3) How do we read L(A, ω, α)? As the probability distribution of the parameters A, ω, α in terms of the data-dependent quantity C(ω, α). (Note that L is not normalized as a PDF.) As such, L is a quite different quantity from the Fourier-based power spectrum.

(4) What is the quantity C(ω, α) = (1/N) Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)]? For a given data set, ω and α are variables. If we plot C(ω, α), we expect to get a large value when ω = ω_signal and α = α_signal.

(5) For a non-chirped but oscillatory signal (α = 0), the quantity C(ω, 0) is nothing other than the periodogram (the squared magnitude of the Fourier transform of the data). We then see that, for this case, the likelihood function is a nonlinear function of the Fourier estimate of the power spectrum.
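A direct numerical sketch of the sufficient statistic and the marginalized likelihood (assuming NumPy/SciPy; grid ranges, parameter values and function names are illustrative choices, not the lecture's code):

```python
import numpy as np
from scipy.special import i0e

def chirpogram(y, t, omegas, alphas):
    """Sufficient statistic C(omega, alpha) = (P^2 + Q^2)/N, with
    P = sum_t y(t) cos(omega t + alpha t^2) and Q = sum_t y(t) sin(omega t + alpha t^2)."""
    N = len(y)
    C = np.empty((len(alphas), len(omegas)))
    for i, a in enumerate(alphas):
        for j, w in enumerate(omegas):
            phase = w * t + a * t**2
            P = np.sum(y * np.cos(phase))
            Q = np.sum(y * np.sin(phase))
            C[i, j] = (P**2 + Q**2) / N
    return C

def log_marginal_likelihood(C, A, sigma2, N):
    """log L(A, omega, alpha) = -N A^2/(4 sigma^2) + ln I0(A sqrt(N C)/sigma^2),
    evaluated stably with the exponentially scaled Bessel function i0e."""
    x = A * np.sqrt(N * C) / sigma2
    return -N * A**2 / (4.0 * sigma2) + x + np.log(i0e(x))

# Usage with the simulated chirped data (y, t) from the earlier sketch:
# C = chirpogram(y, t, omegas=np.linspace(0.1, 0.3, 200), alphas=np.linspace(0, 5e-4, 100))
# logL = log_marginal_likelihood(C, A=1.0, sigma2=1.0, N=len(y))
```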

Interpretation of the Bayesian and Fourier Approaches

We found the marginalized likelihood for the frequency and chirp rate to be

L(A, ω, α) = e^{−NA²/4σ²} I₀[ (A/σ²)(N C(ω, α))^{1/2} ],

and the limiting form of the Bessel function for large argument x is

I₀(x) ≈ e^x / (2πx)^{1/2}.

In this case the marginalized likelihood is

L(A, ω, α) ≈ e^{−NA²/4σ²} exp[ (A/σ²)(N C(ω, α))^{1/2} ] / [2π (A/σ²)(N C(ω, α))^{1/2}]^{1/2}.

Since C(ω, α) is large when ω and α match those of any true signal, we see that it is exponentiated, as compared to appearing linearly in the periodogram.

Now let's consider the case with no chirp rate, α = 0. Examples in the literature show that the width of the Bayesian PDF is much narrower than the periodogram C(ω, 0). Does this mean that the uncertainty principle has been avoided? The answer is no!

Uncertainty principle in the periodogram: for a data set of length T, the frequency resolution implied by the spectral window function is Δf ≈ 1/T.

Width of the Bayesian PDF: when the argument of the Bessel function is large, the exponentiation causes the PDF to be much narrower than the spectral window of the periodogram.

Interpretation:

The periodogram is the distribution of power (or variance) with frequency for the particular realization of data used to form the periodogram. The spectral window also depicts the distribution of variance for a pure sinusoid in the data (with infinite signal-to-noise ratio). The Bayesian posterior is the PDF for the frequency of a sinusoid and therefore represents a very different quantity from the periodogram; the two are thus not directly comparable.

1. The Bayesian method addresses the question: what is the PDF for the frequency of the sinusoid that is in the data?

2. The periodogram is the distribution of variance in frequency.

3. If we use the periodogram to estimate the sinusoid's frequency, we get a result that is more comparable:

(a) First note that the width of the posterior PDF involves the signal-to-noise ratio (through the square root of the periodogram, via the factor A√(N C)/σ²), while the width of the periodogram's spectral window is independent of the SNR.

(b) General result: if a spectral line has width Δν, its centroid can be determined to an accuracy of about Δν/SNR. This result follows from matched filtering, which we will discuss later on.

(c) Quantitatively, the periodogram yields the same information about the location of the spectral line as does the posterior PDF.

4. Problem: derive an estimate for the width of the posterior PDF that can be compared with the estimate for the periodogram.

Comparison of Spectral Line Localization Properties

Claim: While the periodogram gives a spectral line that is much broader than the width of the posterior PDF for frequency, the ability to localize the spectral line in frequency is the same for both approaches.

Periodogram: The signal-to-noise ratio (S/N) of the line scales with √N A/σ (as in the DFT of a complex exponential). The spectral resolution is Δω_res ≈ 2π/(2T + 1), since our time interval is [−T, T]. The width of the line (e.g. FWHM) is of order the spectral resolution. Assume the S/N is large.

Posterior PDF: The PDF for ω is dominated by the exponential factor

E(ω) = exp[ A (N C(ω, α))^{1/2} / σ² ].

From the expression for C we have C_max = C(ω = ω₀) = N A²/4, so

E_max = exp[ A (N C_max)^{1/2} / σ² ] = exp[ (N/2)(A/σ)² ].

For offset frequencies ω = ω₀ + δω we can expand various things to show that E(ω₀ + δω) falls off from E_max roughly as a Gaussian in δω(2T + 1), scaled by (N/2)(A/σ)². This function has a width δω (where the exponential has fallen to 1/2) that is smaller than the periodogram resolution by roughly the S/N factor. In terms of resolution units,

δω / Δω_res ≈ 1 / (√N A/σ) = 1 / (S/N of the line in the periodogram).

[Figure: Left: time series of a sinusoid plus white noise with the stated A/σ, sampled N times over the interval [−T, T]. Right: the periodogram (red) and the Bayesian PDF of the time series.]

[Figure: Left: time series of a sinusoid plus white noise with a different (smaller) A/σ, sampled N times over the interval [−T, T]. Right: the periodogram (red) and the Bayesian PDF of the time series.]

Prewhitening and Sinusoid Detection

What is prewhitening? Prewhitening is an operation that processes a time series (or some other data sequence) to make it behave statistically like white noise. The "pre" means that whitening precedes some other analysis that likely works better if the additive noise is white. These operations can be viewed in either the time domain or the frequency domain:

1. Make the ACF of the time series appear more like a delta function.
2. Make the spectrum appear flat.

Example data sets that may require prewhitening:

1. A well-behaved noise process with an additive low-frequency (or polynomial) trend added to it.
2. A deterministic signal with an additive red-noise process.

Viewed in the frequency domain, prewhitening means that the dynamic range of the measured data is reduced.

Why bother? Recall from our discussions of spectral analysis the issues of leakage and bias. These arise from sidelobes inherent to spectral estimation. We can minimize leakage in two ways: (1) make sidelobes smaller and (2) minimize the power that is prone to leaking into sidelobes. Spectral windows address the former while prewhitening mitigates the latter.

Leakage into sidelobes also constitutes bias in spectral estimates. However, bias appears in other data-analysis procedures as well. Consider least-squares fitting of a sinusoid to a signal of the form

x(t) = A cos(ω₀ t + φ) + r(t) + n(t),

where n(t) is WSS white noise and r(t) is red noise with a steep power spectrum. Red noise can strongly bias fitting of a model x̂(t) = Â cos(ω̂ t + φ̂) because its power can leak across the underlying spectrum, causing a least-squares fit to give highly discrepant values of Â, ω̂, and φ̂. Prewhitening of the time series would ideally yield a transformed time series of the form

x'(t) = A' cos(ω₀ t + φ') + n'(t),

to which fitting a sinusoidal model will be less biased.

Procedures: We have already seen one analysis that is related to prewhitening: the matched filter (MF). The MF doesn't whiten the spectrum of the output, but it does weight the frequency components of the measured quantity to maximize the S/N of the signal. The signal model in this case is x(t) = a A(t) + n(t). Recall, for an arbitrary spectrum S_n(f) of the additive noise, that the frequency-domain MF for a signal A(t) is

h̃(f) ∝ Ã*(f) / S_n(f).

Taking equality for simplicity, when the filter is applied to the measurements x(t) we have

ỹ(f) = x̃(f) h̃(f) = a |Ã(f)|² / S_n(f) + ñ(f) Ã*(f) / S_n(f).

This means that the ensemble-average spectrum of the filter output is

⟨|ỹ(f)|²⟩ = a² |Ã(f)|⁴ / S_n²(f) + |Ã(f)|² / S_n(f) = [ |Ã(f)|² / S_n(f) ] [ a² |Ã(f)|² / S_n(f) + 1 ].

Signals with trends: A common situation is where a quantity of the form a A(t) + n(t) is superposed with a strong trend, such as a baseline variation. Similar issues arise in measurements of spectra. Consequences of trends include:

1. Bias in estimating parameters of A(t − t₀) or its spectral analog.

2. Erroneous estimates of cross-correlations between two time series such as x(t) = s₁(t) + n₁(t) and y(t) = s₂(t) + n₂(t), where s₁,₂ are signals of interest and n₁,₂ are measurement errors. I.e., we may be interested in the correlation

C₁₂ = (1/N_t) Σ_t s₁(t) s₂(t)   or   C₁₂ = (1/N_t) Σ_t [s₁(t) − s̄₁][s₂(t) − s̄₂],

where s̄₁,₂ = (1/N_t) Σ_t s₁,₂(t) are the sample means. If there are trends p₁,₂(t) added to x(t) and y(t), the correlation Ĉ₁₂ of x and y used to estimate C₁₂ may be dominated completely by the trends and not by the signal parts of the measurements.

A fix: Trends can often be modeled as a polynomial of some order that can be fitted to the measurements. The order of the polynomial needs to be chosen wisely. For a pulse or spectral line confined to some range of t or f this is straightforward. But for a detection problem where the signal location is not known, the situation is very tricky.

Prewhitening filter: Consider again x(t) = a A(t) + n(t) and let's trivially construct a frequency-domain filter that whitens the measurements. We want a filter h(t) that flattens the noise n(t) in the frequency domain. Let y(t) = x(t) ∗ h(t), where ∗ means convolution. All we need is

h̃(f) = 1 / [S_n(f)]^{1/2}.

Then the ensemble spectrum of the output ỹ(f) is

⟨|ỹ(f)|²⟩ = ⟨|x̃(f)|²⟩ |h̃(f)|² = ⟨|x̃(f)|²⟩ / S_n(f) = a² |Ã(f)|² / S_n(f) + 1.

Note how this differs from the result for a matched filter. But the result is that, in the mean, the spectrum of the additive noise has been flattened. Prewhitening is important in both detection and estimation applications.
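A minimal frequency-domain prewhitening sketch (not from the notes), assuming the noise spectrum S_n(f) is known on the DFT frequency grid; the red-noise model used for illustration is an assumption.

```python
import numpy as np

def prewhiten(x, S_n):
    """Frequency-domain prewhitening: multiply the DFT of x by h(f) = 1/sqrt(S_n(f))
    so that, in the ensemble mean, the additive-noise spectrum of the output is flat.

    S_n : noise power spectrum sampled on the same DFT frequency grid as np.fft.fft(x).
    """
    X = np.fft.fft(x)
    H = 1.0 / np.sqrt(S_n)
    return np.real(np.fft.ifft(X * H))

# Illustrative use with an assumed red-noise spectrum model S_n(f) ~ f^-2:
N = 1024
f = np.fft.fftfreq(N)
S_n = 1.0 / np.maximum(np.abs(f), 1.0 / N) ** 2   # regularize the f = 0 bin
rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(N))             # a simple red-noise-like series
y = prewhiten(x, S_n)                             # y has an approximately flat spectrum
```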

Leakage and Bias

Prewhitening in the least-squares estimation context: Consider our standard linear model

y = X θ + n,

which has the least-squares solution for the parameter vector

θ̂ = (X† C_n⁻¹ X)⁻¹ X† C_n⁻¹ y,

where the covariance matrix of the noise vector n is C_n = ⟨n n†⟩. This is also the maximum-likelihood solution in the right circumstances (which are?). As with any covariance matrix, C_n is Hermitian and positive semi-definite. This means that the quadratic form for an arbitrary vector z satisfies z† C_n z ≥ 0. Such matrices can always be factored according to the Cholesky decomposition:

C_n = L L†,

where L is a lower-triangular matrix, e.g.

L = [ a 0 0 0
      b c 0 0
      d e f 0
      g h i j ].

See http://en.wikipedia.org/wiki/Cholesky_decomposition

Utility: we can transform the model as follows using L:

y = L y_w,   X = L X_w.

Substituting into the solution vector for θ, and using

y† = (L y_w)† = y_w† L†,   X† = (L X_w)† = X_w† L†,   and   C_n⁻¹ = (L L†)⁻¹ = L⁻† L⁻¹,

yields

θ̂ = (X† C_n⁻¹ X)⁻¹ X† C_n⁻¹ y = (X_w† X_w)⁻¹ X_w† y_w.

So what? The solution is identical to the least-squares case where the noise covariance matrix is diagonal; i.e. the noise vector n_w = L⁻¹ n has been transformed to white noise. We have whitened the data.

When is this useful? An example is the fitting of a sinusoidal function amid red noise, where leakage effects are important just as they are for spectral analysis. A specific example is the fitting of astrometric parameters or periodicities in radial velocity data.

What's the catch? You need to know the covariance matrix of the noise n to do the Cholesky decomposition. This can be easier said than done!

Examples of sine wave + red and white noise

Examples were generated with a signal

y(t) = cos(2π t / P + φ) + r(t)/snr_r + w(t)/snr_w,

where r and w have unit variance and are scaled by the signal-to-noise ratios snr_r and snr_w, respectively. The covariance matrix for the combined noise n = r + w was calculated by averaging C_n = ⟨n n†⟩ over realizations. Note that for some real situations, where we have only a single time series, we would need to calculate C_n differently, e.g. from first principles, prior knowledge, etc. In practice, realizations of r were generated and the mean subtracted. Then white noise was added to form n, and the Cholesky decomposition was done using the command

L = scipy.linalg.cholesky(C_n, lower=True)

For data vectors of length N, the lower-triangular matrix L is N × N. If the mean had been subtracted from the white noise as well, the covariance matrix would be rank deficient and the decomposition would fail.

Results in the following figures indicate that:

1. Power-law red noise with shallow spectral indices does not benefit particularly from whitening because leakage is much less.
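A sketch of the whitening transformation for the linear model, assuming NumPy/SciPy; the toy covariance and design matrix below are illustrative, not the ones used to make the figures.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whiten(y, X, C_n):
    """Whiten a linear model y = X theta + n with noise covariance C_n = L L^T:
    solving L y_w = y and L X_w = X gives noise n_w = L^{-1} n with unit covariance,
    after which ordinary (unweighted) least squares applies."""
    L = cholesky(C_n, lower=True)
    y_w = solve_triangular(L, y, lower=True)
    X_w = solve_triangular(L, X, lower=True)
    return y_w, X_w

# Illustrative use: fit a sinusoid (cos/sin pair) amid correlated noise.
rng = np.random.default_rng(7)
N = 200
t = np.arange(N)
X = np.column_stack([np.cos(2 * np.pi * t / 25.0), np.sin(2 * np.pi * t / 25.0)])
C_n = np.eye(N) + 0.5 * np.exp(-np.abs(t[:, None] - t[None, :]) / 10.0)  # toy red+white covariance
n = cholesky(C_n, lower=True) @ rng.standard_normal(N)                   # noise with covariance C_n
y = X @ np.array([1.0, 0.5]) + n
y_w, X_w = whiten(y, X, C_n)
theta_hat, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
print(theta_hat)   # should be close to [1.0, 0.5]
```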

2. What matters is the signal-to-noise ratio of the cosine relative to the red-noise power contained in one resolution bandwidth Δf ≈ 1/T centered on the frequency of the sinusoid. For a steep power law, only a small fraction of the total power in the red noise is in this band, whereas the flatter the spectrum, the larger this fraction is.

[Figure 1: Example of whitening using the Cholesky decomposition. The signal consists of a sine wave with additive red and white noise; the signal-to-noise ratios relative to each kind of noise are given. Top left: original time series (red) and whitened time series (black). Bottom left: original noise (red) and whitened noise (black). Top right: power spectra of the original and whitened time series. Bottom right: power spectra of the original and whitened noise sequences.]

[Figures 2–9: further examples of Cholesky whitening for different red-noise spectral indices and signal-to-noise ratios, with the same panel layout as Figure 1: original and whitened time series (top left), original and whitened noise (bottom left), and the corresponding power spectra (right panels).]

Impulse Response and Spectrum of the Whitening Filter

We can think of the Cholesky decomposition as a filter that suppresses low frequencies for the purpose of estimating the parameters of a sinusoid. The filter response can be calculated from the impulse response as follows. Construct a data vector i with i_j = 0 for all j except j = j₀, where i_{j₀} = 1. Then the impulse response is h = L⁻¹ i. Expressed as a time function h_j, j = 1, ..., N, the frequency-domain response is the squared magnitude of the DFT of h_j, |H_k|² (a sketch of this calculation is given after the figure below).

[Figure 10: Example of whitening using the Cholesky decomposition along with the impulse response and its spectrum. The signal consists of a sine wave with additive red and white noise; the signal-to-noise ratios relative to each kind of noise are given. Left figure: top left, original time series (red) and whitened time series (black); bottom left, original noise (red) and whitened noise (black); top right, power spectra of the original and whitened time series; bottom right, power spectra of the original and whitened noise sequences. Right figure: top panel, input impulse (red) and impulse response of the Cholesky filter; bottom panel, spectra of the impulse and of the impulse response, respectively. The filter shows the suppression of frequencies below a cutoff that is signal-to-noise-ratio dependent.]

Note that the form of the impulse response is that of a running difference, which will remove a low-frequency variation.
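A sketch of computing the impulse response and its spectrum from the Cholesky factor (assuming NumPy/SciPy; the choice of impulse location j0 is arbitrary and the covariance C_n is whatever model is in hand).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whitening_impulse_response(C_n, j0=None):
    """Impulse response of the Cholesky whitening filter: h = L^{-1} i, where i is a
    unit impulse at sample j0, together with its power response |H_k|^2 from the DFT of h."""
    N = C_n.shape[0]
    j0 = N // 2 if j0 is None else j0
    L = cholesky(C_n, lower=True)
    i = np.zeros(N)
    i[j0] = 1.0
    h = solve_triangular(L, i, lower=True)    # apply the whitening operator to the impulse
    H2 = np.abs(np.fft.fft(h)) ** 2           # frequency-domain power response
    return h, H2
```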


More information

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Department of Electrical Engineering University of Arkansas ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Definition of stochastic process (random

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Sensing Networking Leonidas Guibas Stanford University Computation CS428 Sensor systems are about sensing, after all... System State Continuous and Discrete Variables The quantities

More information

Cosmology & CMB. Set5: Data Analysis. Davide Maino

Cosmology & CMB. Set5: Data Analysis. Davide Maino Cosmology & CMB Set5: Data Analysis Davide Maino Gaussian Statistics Statistical isotropy states only two-point correlation function is needed and it is related to power spectrum Θ(ˆn) = lm Θ lm Y lm (ˆn)

More information

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Lecture 23:! Nonlinear least squares!! Notes Modeling2015.pdf on course

More information

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment:

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment: Stochastic Processes: I consider bowl of worms model for oscilloscope experiment: SAPAscope 2.0 / 0 1 RESET SAPA2e 22, 23 II 1 stochastic process is: Stochastic Processes: II informally: bowl + drawing

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Organization. I MCMC discussion. I project talks. I Lecture.

Organization. I MCMC discussion. I project talks. I Lecture. Organization I MCMC discussion I project talks. I Lecture. Content I Uncertainty Propagation Overview I Forward-Backward with an Ensemble I Model Reduction (Intro) Uncertainty Propagation in Causal Systems

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Lecture 4 See web page later tomorrow Searching for Monochromatic Signals

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

covariance function, 174 probability structure of; Yule-Walker equations, 174 Moving average process, fluctuations, 5-6, 175 probability structure of

covariance function, 174 probability structure of; Yule-Walker equations, 174 Moving average process, fluctuations, 5-6, 175 probability structure of Index* The Statistical Analysis of Time Series by T. W. Anderson Copyright 1971 John Wiley & Sons, Inc. Aliasing, 387-388 Autoregressive {continued) Amplitude, 4, 94 case of first-order, 174 Associated

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: May 24, 2011. Lecture 6p: Spectral Density, Passing Random Processes through LTI Systems, Filtering Terms

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3.0 INTRODUCTION The purpose of this chapter is to introduce estimators shortly. More elaborated courses on System Identification, which are given

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

2A1H Time-Frequency Analysis II

2A1H Time-Frequency Analysis II 2AH Time-Frequency Analysis II Bugs/queries to david.murray@eng.ox.ac.uk HT 209 For any corrections see the course page DW Murray at www.robots.ox.ac.uk/ dwm/courses/2tf. (a) A signal g(t) with period

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Lecture 27 Frequency Response 2

Lecture 27 Frequency Response 2 Lecture 27 Frequency Response 2 Fundamentals of Digital Signal Processing Spring, 2012 Wei-Ta Chu 2012/6/12 1 Application of Ideal Filters Suppose we can generate a square wave with a fundamental period

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

MCMC notes by Mark Holder

MCMC notes by Mark Holder MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I

Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I 1 Sum rule: If set {x i } is exhaustive and exclusive, pr(x

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Bayesian room-acoustic modal analysis

Bayesian room-acoustic modal analysis Bayesian room-acoustic modal analysis Wesley Henderson a) Jonathan Botts b) Ning Xiang c) Graduate Program in Architectural Acoustics, School of Architecture, Rensselaer Polytechnic Institute, Troy, New

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 Lecture 1 Organization:» Syllabus (text, requirements, topics)» Course approach (goals, themes) Book: Gregory, Bayesian

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Results: MCMC Dancers, q=10, n=500

Results: MCMC Dancers, q=10, n=500 Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time

More information