A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2015


1 A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2015

Lecture 25: Markov Processes and Markov Chain Monte Carlo

Reading:
Chapter 29 of MacKay (Monte Carlo Methods)
Chapter 12 of Gregory (MCMC)
An Introduction to MCMC for Machine Learning (Andrieu et al. 2003, Machine Learning, 50, 5)
Genetic Algorithms: Principles of Natural Selection Applied to Computation (Stephanie Forrest, Science 1993, 261, 872)

2 Markov Processes

Markov processes are used for modeling as well as in statistical inference problems.
Markov processes are generally nth order: the current state of a system may depend on the n previous states. Most applications consider 1st-order processes.
Hidden Markov processes: a physical system may involve transitions between discrete states, but observables may reflect those states only indirectly (e.g. measurement noise, other physics, etc.).

3 Markov Chains and Markov Processes

Definitions: A Markov process has future samples determined only by the present state and by a transition probability from the present state to a future state. A Markov chain is one that has a countable number of states.

Transitions between states are described by an $n \times n$ stochastic matrix $Q$ with elements $q_{ij}$ comprising the probabilities for changing in a single time step from state $s_i$ to state $s_j$, with $i, j = 1, \ldots, n$. The state probability vector $P$ has elements comprising the ensemble probability of finding the system in each state. E.g. for a three-state system:

States $= \{s_1, s_2, \ldots, s_n\}$,
$$Q = \begin{pmatrix} q_{11} & q_{12} & q_{13} \\ q_{21} & q_{22} & q_{23} \\ q_{31} & q_{32} & q_{33} \end{pmatrix}.$$

Normalization across a row is $\sum_j q_{ij} = 1$ since the system must be in some state at any time. In a single time step the probability of staying in the $i$th state is the metastability $q_{ii}$, and the probability of residing in that state for a time $T$ is proportional to $q_{ii}^{T}$.

4 Example of a two-state Markov process

States $= \{s_1, s_2\}$, $Q = \begin{pmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{pmatrix}$. So
$$Q^2 = \begin{pmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{pmatrix}\begin{pmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{pmatrix} = \begin{pmatrix} q_{11}^2 + q_{12} q_{21} & q_{11} q_{12} + q_{12} q_{22} \\ q_{21} q_{11} + q_{22} q_{21} & q_{21} q_{12} + q_{22}^2 \end{pmatrix}.$$

We want $\lim_{t\to\infty} Q^t$. This gets messy very quickly even though there are only two independent quantities, since $q_{12} = 1 - q_{11}$ and $q_{21} = 1 - q_{22}$. But it can be shown that
$$Q^{\infty} = \begin{pmatrix} p_1 & p_2 \\ p_1 & p_2 \end{pmatrix}, \qquad \text{where} \qquad p_1 = \frac{T_1}{T_1 + T_2}, \qquad p_2 = \frac{T_2}{T_1 + T_2},$$
and $T_1 = (1 - q_{11})^{-1}$ and $T_2 = (1 - q_{22})^{-1}$. Thus the transition probabilities $q_{11}, q_{22}$ determine both the mean lifetime of each state, $T_1$ and $T_2$, and the probabilities $p_1$ and $p_2$ of finding the process in each state.
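A minimal Python sketch (with assumed values for $q_{11}$ and $q_{22}$) makes the convergence concrete:

import numpy as np

# Assumed transition probabilities, for illustration only.
q11, q22 = 0.9, 0.8
Q = np.array([[q11, 1.0 - q11],
              [1.0 - q22, q22]])

# Mean state lifetimes T_i = (1 - q_ii)^(-1) and occupation probabilities.
T1, T2 = 1.0 / (1.0 - q11), 1.0 / (1.0 - q22)
p = np.array([T1, T2]) / (T1 + T2)

# Q^t converges to a matrix whose rows equal the state probability vector.
Qinf = np.linalg.matrix_power(Q, 1000)
print(p)        # [0.6667 0.3333]
print(Qinf[0])  # matches p to numerical precision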

5 Two-state Markov Processes

6 The probability density function (PDF) for the duration of a given state is therefore a geometric series that sums to
$$f_T(T) = T_i^{-1}\,(1 - T_i^{-1})^{T-1}, \qquad T = 1, 2, \ldots, \qquad (1)$$
with mean and rms values
$$T_i = (1 - q_{ii})^{-1}, \qquad \sigma_{T_i}/T_i = q_{ii}^{1/2}. \qquad (2)$$

Asymptotic behavior as the number of steps $t \to \infty$: The transition matrix after $t$ steps is $Q^t$. Under the reasonable assumptions that all elements of $Q$ are non-negative and that all states are accessible in a finite number of steps, $Q^t$ converges to a steady-state form $Q^{\infty}$ as $t \to \infty$ that has identical rows. Each row of $Q^{\infty}$ is equal to the state probability vector $P_{\infty}$, the elements of which are the probabilities that a given time sample is in a particular state. $P_{\infty}$ also equals the normalized left eigenvector of $Q$ that has unity eigenvalue, i.e. $P_{\infty} Q = P_{\infty}$ (e.g. Papoulis). For $P_{\infty}$ to exist, the determinant $\det(Q - I) = 0$ (where $I$ is the identity matrix), but this is automatically satisfied for a stochastic matrix corresponding to a stationary process.

Convergence of $Q^t$ to a matrix with identical rows implies that the transition probabilities trend to those appropriate for an i.i.d. process when the time step $t$ is much larger than the mean lifetimes $T_i$ of any of the states. For a two-state system $P_{\infty}$ has elements $p_1 = (1 - q_{22})/(2 - q_{11} - q_{22})$ and $p_2 = 1 - p_1$.

7 Utility of Markov processes:

1. Modeling: Many processes in the lab and in nature are consistent with being Markov chains. The key elements are a set of discrete states and transitions that are random but occur according to a transition matrix.

2. Sampling: A Markov chain can define a trajectory in the relevant space which can be used to randomly but efficiently sample the space. The key aspect of Markov Chain Monte Carlo is that the trajectory conforms statistically to the asymptotic form of the transition matrix.

8 First-order Markov processes: exponential PDFs for state durations.
Pure two-state processes with different transition probabilities.
Two states with a periodic driving function → quasi-periodic state switching.

9 [Figure-only slide.]

10 State Changes in Pulsars

B yr of state changes (Young et al. 2013); Kramer et al. 2006.
State durations are widely but NOT exponentially distributed.
A strictly periodic forcing function (e.g. an orbit) can produce quasi-periodic state changes.
A stochastic resonance model can produce similar histograms.

11 Statistics are nice, but what are the physics?

Effective potential of a two-state system: state changes = stochastic jumps between wells.
A pulsar magnetosphere + accelerator is essentially a diode circuit with a return current.
Recent models (Liu, Spitkovsky, Timokhin, +) incorporate disks for the return current.
Stochastic resonance arises from periodic modulation of the potential.
Markov switching and stochastic resonance are seen in laboratory diode circuits.
Pulsars are more complicated because they are 2D circuits.
Periodic forcing in the equatorial disk can drive SR.

12 [Reproduced article page:] pubs.acs.org/JCTC

Identifying Metastable States of Folding Proteins
Abhinav Jain and Gerhard Stock*
Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany

ABSTRACT: Recent molecular dynamics simulations of biopolymers have shown that in many cases the global features of the free energy landscape can be characterized in terms of the metastable conformational states of the system. To identify these states, a conceptually and computationally simple approach is proposed. It consists of (i) an initial preprocessing via principal component analysis to reduce the dimensionality of the data, followed by k-means clustering to generate up to 10^4 microstates, (ii) the most probable path algorithm to identify the metastable states of the system, and (iii) boundary corrections of these states via the introduction of cluster cores in order to obtain the correct dynamics. By adopting two well-studied model problems, hepta-alanine and the villin headpiece protein, the potential and the performance of the approach are demonstrated.

1. INTRODUCTION

While molecular dynamics (MD) simulations account for the structure and dynamics of biomolecules in microscopic detail, they generate huge amounts of data. To extract the essential information and reduce the complex and highly correlated biomolecular motion from 3N atomic coordinates to a few collective degrees of freedom, dimensionality reduction methods such as principal component analysis (PCA) are commonly employed. The resulting low-dimensional representation of the dynamics can then be used to construct the free energy landscape ΔG(V) = -k_B T ln P(V), where P is the probability distribution of the molecular system along the principal components V = {V_1, V_2, ...}. Characterized by its minima (which represent the metastable conformational states of the systems) and its barriers (which connect these states), the energy landscape allows us to account for the pathways and their kinetics occurring in a biomolecular process.

Recent simulations of peptides, proteins, and RNA have shown that in many cases the free energy landscape can be well characterized in terms of metastable conformational states. As an example, Figure 1A shows a two-dimensional free energy landscape of hepta-alanine (Ala_7) obtained from an 800 ns MD simulation with subsequent PCA of the φ, ψ backbone dihedral angles (see section 3). The purple circles on the contour plot readily indicate about 30 well-defined minima (or basins) of the energy surface. They correspond to metastable conformational states, which can be employed to construct a transition network of the dynamics of the system. The network can be analyzed to reveal the relevant pathways of the considered process, or to discuss general features of the system such as the topology (i.e., a hierarchical structure) of the energy landscape and network properties such as scale-freeness.

Also, in protein folding, metastable states have emerged as a new paradigm. Augmenting the funnel picture of folding, the presence of thermally populated metastable states may result in an ensemble of (rather than one or a few) folding pathways. Moreover, they can result in kinetic traps, which may considerably extend the average folding time. As an example, Figure 1B shows the free energy landscape of the villin headpiece subdomain, obtained from a PCA of extensive folding trajectories by Pande and co-workers (see section 4).
Due to the high dimensionality of the energy landscape, the two-dimensional projection only vaguely indicates the multiple minima of the protein. Although energy landscapes as in Figure 1 appear to easily provide the location of the energy minima, in general it turns out that metastable states are surprisingly difficult to identify, even for a seemingly simple system like Ala_7. To partition the conformational space into clusters of data points representing the states, one may use either geometric clustering methods such as k-means, which require only data in a metric space, or kinetic clustering methods, which additionally require dynamical information on the process. While geometrical methods are fast and easy to use, they show several well-known flaws. For example, since they usually require one to fix the number of clusters k beforehand, it easily happens that one combines two separate states into one (if k is chosen too small) or cuts one state into two (if k is chosen too large). Another problem is the appropriate definition of the border between two clusters. From a dynamical point of view, the correct border is clearly located at the top of the energy barrier between the two states. Using exclusively geometrical criteria, however, the middle between the two cluster centers appears as an obvious choice; see Figure 2A. As a consequence, conformational fluctuations in a single minimum of the energy surface may erroneously be taken as transitions to another energy minimum; see Figure 2B. The same problem may occur for systems with low energy barrier heights, say ΔG_B ≲ 3 k_B T.

Kinetic cluster algorithms may avoid these problems by using the dynamical information provided by the time evolution of the MD trajectory. In a first step, the conformational space is partitioned into disjoint microstates, which can be obtained, e.g., by geometrical clustering (see section 2.1). Employing these microstates, we calculate the transition matrix {T_mn} from the MD trajectory, where T_mn represents the probability that

Special Issue: Wilfred F. van Gunsteren Festschrift. Received: January 3, 2012. © 2012 American Chemical Society. dx.doi.org/10.1021/ct300077q, J. Chem. Theory Comput.

13 Journal of Chemical Theory and Computation, Article (continued)

Figure 2. Common problems in the identification of metastable conformational states, illustrated for a two-state model, which is represented by a schematic free energy curve along some reaction coordinate r. (A) Although the top of the energy barrier between the two states clearly represents the correct border, geometrical clustering methods may rather choose the geometrical middle between the two cluster centers. (B) Typical time evolution of an MD trajectory along r for the two-state model and the corresponding probability distribution P(r). Low barrier heights or an inaccurate definition of the separating barrier may cause intrastate fluctuations to be mistaken for interstate transitions. The overlapping region of the two states is indicated by Δr. The introduction of cluster cores (shaded areas) can correct for this.

Figure 1. Free energy landscape (in units of k_B T) of (A) hepta-alanine and (B) the villin headpiece as a function of the first two principal components. The metastable conformational states of the system show up as minima of the energy surface (see Tables 1 and 2 for the labeling).

...analyzing the eigenfunctions of the transition matrix, or by employing steepest-descent-type algorithms. The choice of the method may depend to a large extent on the application in mind. For example, an important purpose of kinetic clustering is the construction of discrete Markov state

14 Properties of Markov processes relevant to MCMC

1. The state probability vector evolves as $P_t = P_{t-1} Q$, where $Q$ is the transition matrix.
2. This implies $P_t = P_0 Q^t$.
3. As $t \to \infty$, $Q^t \to Q^{\infty}$, where $Q^{\infty}$ has equal rows, each equal to the asymptotic state probability vector, i.e. the ensemble probabilities for each state, which we write as $P_{\infty}$ (rather than $P_t$).
4. The eigenvectors of the single-step transition matrix $Q$ include one that equals the state probability vector; the associated eigenvalue is unity (see Papoulis).

15 Numerical example of a 3x3 matrix (Python output)

[Python output: mean state durations $T_1, T_2, T_3$ for a 3x3 transition matrix $Q$ whose first row was $[0.5,\ 0.1,\ 0.4]$, followed by successive powers $Q^2, Q^{10}, Q^{50}, \ldots$; the remaining numerical entries were lost in extraction.]

Convergence to equal rows = state probability vector.

16 Numerical example of a 3x3 matrix (Python output), continued

[Python output: $Q^t$ for large $t$ showing convergence to equal rows = state probability vector.]

Eigenvalue problem: $P$ = state probability vector (row vector), $Q$ = transition matrix. $PQ = P \Rightarrow P(Q - I) = 0$, so the eigenvector that has unit eigenvalue is equal to $P$.

[Python output: eigenvalues and eigenvectors of $Q$; normalizing the eigenvector with eigenvalue = 1 gives the state probability vector $P$. The state probabilities from $Q^t$ for large $t$ and from the eigenvector are the same.]
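A short Python sketch of this example shows both the convergence of $Q^t$ and the left-eigenvector calculation. Only the first row of the slide's matrix survived extraction, so rows 2 and 3 below are stand-ins:

import numpy as np

# Stand-in 3x3 stochastic matrix (rows sum to 1); only row 1 is from the slide.
Q = np.array([[0.50, 0.10, 0.40],
              [0.25, 0.60, 0.15],
              [0.30, 0.30, 0.40]])

# Convergence of Q^t to equal rows.
for t in (2, 10, 50):
    print(t, np.linalg.matrix_power(Q, t)[0])

# Left eigenvector with unit eigenvalue: solve P Q = P via eig of Q^T.
vals, vecs = np.linalg.eig(Q.T)
k = np.argmin(np.abs(vals - 1.0))
P = np.real(vecs[:, k])
P /= P.sum()   # normalize to a probability vector
print(P)       # equals any row of Q^t for large t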

17 MCMC

18 Markov Chain Monte Carlo

Monte Carlo methods: various statistical calculations are done by using random samples that represent the relevant domain. Stated generally, an integral
$$I = \int dx\, g(x)$$
can be approximated as a sum over samples $x_j$, $j = 1, \ldots, n$:
$$I \approx \frac{1}{n} \sum_{j=1}^{n} g(x_j).$$
For a simple domain (e.g. 1D, 2D) samples over a uniform grid can be used. However, for a high number of dimensions, and where the full extent of the function is not known, a more intelligently selected set of samples may yield faster convergence.

The error in the estimate of $I$ is
$$\sigma_{\hat I} \approx \frac{\sigma_g}{\sqrt{n}}, \qquad \text{where} \qquad \sigma_g^2 = \frac{1}{n}\sum_j g^2(x_j) - \left[\frac{1}{n}\sum_j g(x_j)\right]^2.$$
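A minimal sketch of this estimator, using an arbitrary example integrand g(x) = exp(-x^2) on [0, 1] with uniform samples:

import numpy as np

rng = np.random.default_rng(0)

g = lambda x: np.exp(-x**2)   # arbitrary example integrand
n = 100_000
x = rng.uniform(0.0, 1.0, n)  # uniform samples over the domain
gx = g(x)

I_hat = gx.mean()             # Monte Carlo estimate of I
err = gx.std() / np.sqrt(n)   # sigma_g / sqrt(n) error estimate
print(I_hat, '+/-', err)      # true value is about 0.7468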

19 Sampling methods: A common problem is the sampling of a posterior PDF in a Bayesian analysis with high dimensionality. In general it is much easier to get the shape of the PDF than it is to get the normalization, because the latter requires integration over the full parameter space. Also, even if the normalization were known, sampling in multiple dimensions is difficult.

Consider sampling from a function $P(x)$ that could be, for example, a posterior PDF where $x$ is a vector of parameters. If we don't know the normalization, we can write $P(x) = P^*(x)/Z$, where $P^*$ is the unnormalized function.

Uniform sampling: sample randomly but with uniform probability over each dimension of the parameter space. This is highly inefficient if probability is concentrated in islands in parameter space.

Importance sampling: samples are drawn from a different function $Q(x)$ whose support covers that of $P(x)$. $Q$ is chosen to be a function that is simpler to draw samples from. In the desired summation, samples are then weighted according to
$$w_j = \frac{P^*(x_j)}{Q(x_j)}, \qquad \hat I = \frac{\sum_j w_j\, g(x_j)}{\sum_j w_j}.$$
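A minimal sketch of self-normalized importance sampling, with an assumed unnormalized target P*(x) (a unit-variance Gaussian offset to mean 2, Z deliberately unused) and a broad Gaussian sampling density Q(x):

import numpy as np

rng = np.random.default_rng(1)

# Unnormalized target and Gaussian sampling density (both illustrative).
P_star = lambda x: np.exp(-0.5 * (x - 2.0)**2)
mu_q, sig_q = 0.0, 4.0
x = rng.normal(mu_q, sig_q, 100_000)
Q_pdf = np.exp(-0.5 * ((x - mu_q) / sig_q)**2) / (sig_q * np.sqrt(2.0 * np.pi))

w = P_star(x) / Q_pdf                 # importance weights w_j
I_hat = np.sum(w * x) / np.sum(w)     # weighted estimate of E[x] under P
print(I_hat)                          # ~ 2.0, the target mean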

20 Rejection sampling: A proposal density $Q(x)$ is used, scaled by a constant $c$ so that $cQ(x)$ is larger than $P^*(x)$ for all $x$. First a random $x_j$ is generated from $Q$. Then a uniform number $u \in [0, 1]$ is generated. If $u < P^*(x_j)/cQ(x_j)$, the sample is accepted; otherwise it is rejected. See Figure 29.8 of MacKay.

Metropolis-Hastings (MH) method: Unlike the rejection method, the MH method chooses a new sample based on the current value of $x$, i.e. a sequence of $x_j$ values is viewed as a time sequence of a Markov process or chain. The proposal density depends on the current state and is in fact related to the transition matrix of a Markov process.

The MH algorithm exploits the fact that a Markov process has a state probability vector that converges to a stable form if the process satisfies two conditions: (1) it is not periodic; and (2) all states are accessible from any other state. If these are satisfied, the Markov process will have stationary statistics.

The trick and beauty of the MH algorithm is that a well-chosen transition matrix will allow the Markov process to converge to a state probability vector equal to that of the PDF $P(x)$ even if the normalization of $P(x)$ is not known. In this context, $P(x)$ is called the target density.

The MH algorithm provides two things: (1) a time sequence of samples drawn from the target density $P(x)$; (2) the PDF $P(x)$ itself.
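A minimal sketch of rejection sampling for the same assumed target as above, with an envelope chosen by hand so that cQ(x) = exp(-(x-2)^2/18) >= P*(x) everywhere:

import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target, maximum value 1 at x = 2 (illustrative).
P_star = lambda x: np.exp(-0.5 * (x - 2.0)**2)
# Gaussian proposal and constant c such that c*Q(x) >= P*(x) for all x.
mu_q, sig_q = 2.0, 3.0
c = sig_q * np.sqrt(2.0 * np.pi)
Q_pdf = lambda x: np.exp(-0.5 * ((x - mu_q) / sig_q)**2) / (sig_q * np.sqrt(2.0 * np.pi))

samples = []
while len(samples) < 10_000:
    xj = rng.normal(mu_q, sig_q)          # draw from Q
    u = rng.uniform()                     # u in [0, 1]
    if u < P_star(xj) / (c * Q_pdf(xj)):  # accept with probability P*/(cQ)
        samples.append(xj)

samples = np.asarray(samples)
print(samples.mean(), samples.std())      # ~ 2.0 and ~ 1.0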

21 [Figure, from MacKay: functions involved in importance sampling. We wish to estimate the expectation of $\phi(x)$ under $P(x) \propto P^*(x)$. We can generate samples from the simpler distribution $Q(x) \propto Q^*(x)$. We can evaluate $Q^*$ and $P^*$ at any point.]

22 (a) P (x) cq (x) (b) P (x) cq (x) u x x x Figure Rejection sampling. (a) The functions involved in rejection sampling. We desire samples from P (x) P (x). We are able to draw samples from Q(x) Q (x), and we know a value c such that c Q (x) > P (x) for all x. (b) A point (x, u) is generated at random in the lightly shaded area under the curve c Q (x). If this point also lies below P (x) then it is accepted. From MacKay

23 Detailed balance and the choice of acceptance probability

The Metropolis-Hastings algorithm hinges on three requirements for a Markov chain:

1. All states are accessible from any other state in a finite number of steps. This is often called irreducibility, because if some states were not accessible, the chain could be reduced in size.

2. The chain must not be periodic. If it were, it could get stuck in a limit cycle.

3. The chain asymptotes to a state probability vector that is stationary: it does not depend on time (as in our usual definition of stationarity). This also means that the asymptotic PDF is equal to the target PDF (e.g. the posterior PDF of Bayesian inference). The target PDF is a left eigenvector of the transition matrix $Q$ with unit eigenvalue: $PQ = P$.

4. For this to be true, detailed balance must hold, meaning that the probability flux between any two possible values of the chain is the same in both directions.

24 The eigenvalue equation can be written
$$P(x) = \sum_{x'} P(x')\, q(x' \to x)$$
where

$P(x)$ = state probability vector or target PDF; often written as $\pi(x)$. Gregory writes this as $P(X_t \mid D, I)$ for Bayesian inference contexts.

$q(x' \to x)$ = transition probability between states $x'$ and $x$. Not to be confused with the proposal density. Gregory writes this as $p(X_{t+1} \mid X_t)$.

Satisfying the eigenvalue equation requires that
$$P(x')\, q(x' \to x) = P(x)\, q(x \to x'). \qquad (*)$$

This can be demonstrated explicitly:
$$P(x) = \sum_{x'} P(x')\, q(x' \to x) \qquad \text{substitute using equation (*)}$$
$$= \sum_{x'} P(x)\, q(x \to x') \qquad \text{factor out } P(x)$$
$$= P(x) \sum_{x'} q(x \to x') \qquad \text{sum of destination probabilities} = 1$$
$$= P(x).$$

25 It is useful to separate the transition probability into two terms: one for the probability of moving to a new state, the other for staying in the same state:
$$q(x \to x') = \tilde q(x \to x') + r(x)\, \delta_{x,x'}.$$
These satisfy
$$\sum_{x'} q(x \to x') = 1 = \sum_{x'} \tilde q(x \to x') + \underbrace{\sum_{x'} r(x)\, \delta_{x,x'}}_{=\, r(x)}$$
so that
$$\sum_{x'} \tilde q(x \to x') = 1 - r(x).$$
Using this separation it can be shown that detailed balance still holds.

26 Metropolis Algorithm

[Diagram: from the current state $X_t$, a candidate state $X_{t+1}$ is accepted with acceptance probability $a$ and rejected with probability $1 - a$.]

Choose $a$ such that the probabilities of reaching different values of $X$ are given by the target PDF.
The target PDF is reached asymptotically, at a rate that depends on the proposal PDF used to generate trial values of $X_{t+1}$.
Detailed balance is achieved (as many transitions out of as into a given state), which also means that the Markov sequence is time reversible.

27 Determining the acceptance probability: On previous pages we used the true transition matrix $q(x \to x')$ that defines the Markov chain and that has the target PDF as its eigen-PDF. For MCMC problems, we are free to choose any transition matrix we like, but its performance may or may not be suitable for a particular application. As Gregory says, finding an ideal proposal distribution is an art.

So let a candidate transition matrix be $Q(x \to x')$, normalized in the usual way:
$$\sum_{x'} Q(x \to x') = 1.$$
Generally $Q$ will not satisfy detailed balance for the target PDF:
$$P(x)\, Q(x \to x') \ne P(x')\, Q(x' \to x).$$
We fix this by putting in a fudge factor $a(x \to x')$:
$$P(x)\, Q(x \to x')\, a(x \to x') = P(x')\, Q(x' \to x)$$
or
$$a(x \to x') = \frac{P(x')\, Q(x' \to x)}{P(x)\, Q(x \to x')}.$$
We don't want the factor to exceed unity, however, so we write
$$a(x \to x') = \min\left[1,\ \frac{P(x')\, Q(x' \to x)}{P(x)\, Q(x \to x')}\right].$$

28 MCMC exploits this convergence to the ensemble state probabilities. The simplest form of the algorithm:

1. Choose a proposal density $Q(y, x_t)$ that will be used to determine the value of $x_{t+1}$. Suppose that this proposal density is symmetric in its arguments.
2. Generate a value $y$ from the proposal density.
3. Calculate the test ratio $a = P(y)/P(x_t)$. The test ratio is the acceptance probability for the candidate sample $y$.
4. Choose a random number $u \in [0, 1]$.
5. If $a \ge 1$, accept the sample and set $x_{t+1} = y$.
6. If $a < 1$, accept $y$ if $u \le a$ and set $x_{t+1} = y$.
7. Otherwise set $x_{t+1} = x_t$ (i.e. the new value equals the previous value).
8. Each time step has a value.
9. The sampling steers the time sequence favorably toward regions of higher probability but allows the trajectory to move to regions of low probability.
10. Samples are correlated, as with a random-walk type process.
11. The burn-in time corresponds to the initial, transient portion of the time series $x_t$ that it takes the Markov process to converge. Often the autocorrelation function of the time sequence is used to diagnose the time series.
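A minimal Python sketch of steps 1-7, using an assumed Gaussian target (mean 2, unit variance) and a symmetric random-walk Gaussian proposal:

import numpy as np

rng = np.random.default_rng(3)

def metropolis(P_star, x0, sigma_prop, n):
    # Random-walk Metropolis with a symmetric Gaussian proposal.
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        y = x[t] + rng.normal(0.0, sigma_prop)  # step 2: propose
        a = P_star(y) / P_star(x[t])            # step 3: test ratio
        # steps 4-7: accept if a >= 1, else accept with probability a
        x[t + 1] = y if rng.uniform() <= a else x[t]
    return x

P_star = lambda x: np.exp(-0.5 * (x - 2.0)**2)  # assumed unnormalized target
chain = metropolis(P_star, x0=0.0, sigma_prop=1.0, n=50_000)
burn = 1000                                     # discard the burn-in
print(chain[burn:].mean(), chain[burn:].std())  # ~ 2.0 and ~ 1.0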

29 [Figure, from MacKay: Metropolis-Hastings method in one dimension. The proposal distribution $Q(x'; x)$ is here shown as having a shape that changes as $x$ changes, though this is not typical of the proposal densities used in practice.]

30 For general, possibly asymmetric forms of the transition matrix, the test ratio is
$$a = \frac{P(y)\, Q(x_t; y)}{P(x_t)\, Q(y; x_t)}.$$
It reduces to the previous form when $Q$ is symmetric in its arguments.

This form preserves the detailed balance of the Markov process (meaning that statistically the same results are obtained under time reversal) that is required in order for the state probability vector to converge to the desired target PDF.

A system in thermal equilibrium has as many particles leaving a state as entering it. By analogy, a Markov process that has stationary statistics must also satisfy detailed balance. With the acceptance probability defined above, the Markov chain will satisfy detailed balance. See Gregory, Section 12.3 for a proof; also the paper by Andrieu et al. on the course web page.
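For a discrete chain this detailed-balance property can be checked numerically. The sketch below builds the full MH transition matrix K for an assumed three-state target P and an assumed asymmetric proposal matrix Q, then verifies P_i K_ij = P_j K_ji and P K = P:

import numpy as np

P = np.array([0.2, 0.5, 0.3])    # assumed target PDF on 3 states
Q = np.array([[0.1, 0.6, 0.3],   # assumed asymmetric proposal matrix
              [0.3, 0.4, 0.3],
              [0.5, 0.2, 0.3]])

n = len(P)
K = np.zeros((n, n))             # MH transition matrix
for i in range(n):
    for j in range(n):
        if i != j:
            a = min(1.0, P[j] * Q[j, i] / (P[i] * Q[i, j]))
            K[i, j] = Q[i, j] * a
    K[i, i] = 1.0 - K[i].sum()   # rejection probability stays at state i

F = P[:, None] * K               # probability flux P_i K_ij
print(np.allclose(F, F.T))       # True: detailed balance holds
print(P @ K)                     # equals P: the target is stationary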

31 Machine Learning, 50, 5-43, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

An Introduction to MCMC for Machine Learning

CHRISTOPHE ANDRIEU, C.Andrieu@bristol.ac.uk, Department of Mathematics, Statistics Group, University of Bristol, University Walk, Bristol BS8 1TW, UK
NANDO DE FREITAS, nando@cs.ubc.ca, Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, Canada
ARNAUD DOUCET, doucet@ee.mu.oz.au, Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Victoria 3052, Australia
MICHAEL I. JORDAN, jordan@cs.berkeley.edu, Departments of Computer Science and Statistics, University of California at Berkeley, 387 Soda Hall, Berkeley, CA, USA

Abstract. The purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing an introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.

Keywords: Markov chain Monte Carlo, MCMC, sampling, stochastic algorithms

1. Introduction

A recent survey places the Metropolis algorithm among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl & Sullivan, 2000). This algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC). These algorithms have played a significant role in statistics, econometrics, physics and computing science over the last two decades. There are several high-dimensional problems, such as computing the volume of a convex body in d dimensions, for which MCMC simulation is the only known general approach for providing a solution within a reasonable time (polynomial in d) (Dyer, Frieze, & Kannan, 1991; Jerrum & Sinclair, 1996).

While convalescing from an illness in 1946, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully (Eckhardt, 1987). After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.

32 [Page 16 of Andrieu et al.: Figure 5, the Metropolis-Hastings algorithm; Figure 6, target distribution and histogram of the MCMC samples at different iteration points.]

The MH algorithm is very simple, but it requires careful design of the proposal distribution $q(x^* \mid x)$. In subsequent sections, we will see that many MCMC algorithms arise by considering specific choices of this distribution. In general, it is possible to use suboptimal inference and learning algorithms to generate data-driven proposal distributions.

The transition kernel for the MH algorithm is
$$K_{\rm MH}\big(x^{(i+1)} \mid x^{(i)}\big) = q\big(x^{(i+1)} \mid x^{(i)}\big)\, \mathcal{A}\big(x^{(i)}, x^{(i+1)}\big) + \delta_{x^{(i)}}\big(x^{(i+1)}\big)\, r\big(x^{(i)}\big),$$

33 Toy examples of MCMC using Gaussian target and proposal PDFs

The target PDF is $N(\mu, \sigma^2)$. For a proposal PDF we use $N(\mu_p, \sigma_p^2)$ that is wide enough that values are generated that overlap with the target PDF. So we use $\mu_p = 0$ and $\sigma_p = 3(\mu^2 + \sigma^2)^{1/2}$.

In practice, of course, we would not know the parameters of the target PDF (otherwise what would be the point of doing MCMC?) and we might not know its support in parameter space. Experimentation may be required to ensure that the parameter space is adequately sampled.

34 Plots:

Histograms of MC points $x_t$, $t = 1, \ldots, N$, for different $N$ and different $\mu$ and $\sigma$.
Autocovariance functions of $x_t - \hat\mu_x$ for single realizations, which show the correlation time of the MC time series.

Lessons: the more the target and proposal PDFs differ, the longer it takes for the time series to show stationary statistics that conform to the target PDF. The burn-in time is thus longer in such cases because it is related to the autocorrelation time.

Example time series are shown for two of the cases to illustrate the burn-in time and the correlation time.
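A sketch that loosely reproduces these experiments (the slide's numerical values were largely lost, so the target parameters, proposal width, and chain lengths below are stand-ins). Because the proposal here is a zero-mean independence sampler rather than a symmetric random walk, the general test ratio from slide 30, with the proposal factors included, is used:

import numpy as np

rng = np.random.default_rng(4)

mu, sig = 1.38, 0.29                       # stand-in target N(mu, sig^2)
sig_p = 3.0 * np.sqrt(mu**2 + sig**2)      # broad zero-mean proposal width
P_star = lambda x: np.exp(-0.5 * ((x - mu) / sig)**2)
Q_shape = lambda x: np.exp(-0.5 * (x / sig_p)**2)  # proposal shape (norm cancels)

x = 0.0
for N in (8, 32, 128, 512, 2048, 8192):
    xs = np.empty(N)
    for t in range(N):
        y = rng.normal(0.0, sig_p)         # draw from the zero-mean proposal
        a = P_star(y) * Q_shape(x) / (P_star(x) * Q_shape(y))
        if rng.uniform() <= a:
            x = y
        xs[t] = x
    print(N, xs.mean(), xs.std())          # estimates approach mu and sig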

35 Histograms

Demonstrate how the distribution of MC points trends to the target PDF.
Target PDF = Gaussian with non-zero mean.
Proposal PDF = $N(0, \sigma_p^2)$ with $\sigma_p$ wide enough to span the target PDF.

36 [Figure: MCMC of an offset Gaussian target PDF using a zero-mean Gaussian proposal PDF. Histograms of the MC samples for increasing chain lengths (N = 8, 32, ...), with the proposal and target PDFs overlaid and the target parameters $\mu, \sigma$ and sample estimates $\hat\mu, \hat\sigma$ annotated on each panel.]

37 [Same figure as slide 36, with the target $\mu, \sigma$ annotations highlighted.]

38 [Same figure as slide 36, with the sample estimates $\hat\mu, \hat\sigma$ from the MC values highlighted.]

39 Four cases with different target PDFs

Even for target PDFs with large means, we obtain convergence.

40 [Figure: histograms for four cases of offset Gaussian target PDFs sampled with zero-mean Gaussian proposal PDFs. Note only 2 states for the 1st 8 MC samples.]

41 [Figure: narrow target PDF case; the histograms converge to the target as N grows.]

42 [Figure: broader target PDF case.]

43 [Figure: even broader target PDF, with the proposal PDF proportionately broader.]

44 [Figure: a sequence of progressively narrower target PDFs.]

45 [Figure: continuation of the narrowing-target sequence.]

46 [Figure: narrowest target PDF in the sequence.]

47 ACFs of MCMC-generated Time Series

The width of the ACF gives the correlation time of the time series.
Too long a correlation time → inefficient sampling of parameter space.
Longer correlation times correspond to proposal PDFs that have larger support relative to the support of the target PDF.
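A minimal sketch of this diagnostic: estimate the normalized ACF of a chain and its integrated correlation time, which sets the effective number of independent samples. An AR(1) series stands in for MCMC output here; any chain from the earlier sketches can be substituted:

import numpy as np

def acf(x, max_lag):
    # Normalized autocovariance of a time series; acf[0] = 1.
    x = np.asarray(x) - np.mean(x)
    var = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / (len(x) * var)
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(5)

# AR(1) stand-in for a correlated MCMC time series.
rho, n = 0.9, 50_000
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + rng.normal()

r = acf(chain, max_lag=200)
tau_int = 1.0 + 2.0 * r[1:].sum()   # integrated correlation time
print(tau_int, n / tau_int)         # tau ~ (1+rho)/(1-rho) = 19; eff. samples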

48 Time Series of MCMC Samples: case with a wide target PDF. [Figure: the chain $x_t$ versus time step.]

49 Time Series of MCMC Samples: case with a narrow target PDF. [Figure: two realizations of the chain, with the burn-in time indicated.]

50 [Figure: full ACF of the MCMC time series, and a zoom-in to the innermost lags (same case, different realization).]

51 Relatively wide target PDF. [Figure: ACVs versus lag for two realizations.]

52 Wider target PDF → narrower ACF. [Figure: ACVs versus lag.]

53 Narrower target PDF → wider ACF. [Figure: ACV versus lag.]

54 Narrower target PDF → wider ACF (continued). [Figure: ACV versus lag.]

55 Narrower target PDF → wider ACF (continued). [Figure: ACV versus lag.]

56 Unsuitable Proposal PDFs

57-68 [Figure-only slides illustrating unsuitable proposal PDFs.]


More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Chapter 5 Markov Chain Monte Carlo MCMC is a kind of improvement of the Monte Carlo method By sampling from a Markov chain whose stationary distribution is the desired sampling distributuion, it is possible

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Adaptive Monte Carlo methods

Adaptive Monte Carlo methods Adaptive Monte Carlo methods Jean-Michel Marin Projet Select, INRIA Futurs, Université Paris-Sud joint with Randal Douc (École Polytechnique), Arnaud Guillin (Université de Marseille) and Christian Robert

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

arxiv:astro-ph/ v1 14 Sep 2005

arxiv:astro-ph/ v1 14 Sep 2005 For publication in Bayesian Inference and Maximum Entropy Methods, San Jose 25, K. H. Knuth, A. E. Abbas, R. D. Morris, J. P. Castle (eds.), AIP Conference Proceeding A Bayesian Analysis of Extrasolar

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Bayes Nets: Sampling Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma

COMS 4771 Probabilistic Reasoning via Graphical Models. Nakul Verma COMS 4771 Probabilistic Reasoning via Graphical Models Nakul Verma Last time Dimensionality Reduction Linear vs non-linear Dimensionality Reduction Principal Component Analysis (PCA) Non-linear methods

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Monte Carlo (MC) Simulation Methods. Elisa Fadda

Monte Carlo (MC) Simulation Methods. Elisa Fadda Monte Carlo (MC) Simulation Methods Elisa Fadda 1011-CH328, Molecular Modelling & Drug Design 2011 Experimental Observables A system observable is a property of the system state. The system state i is

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

Markov Chains Handout for Stat 110

Markov Chains Handout for Stat 110 Markov Chains Handout for Stat 0 Prof. Joe Blitzstein (Harvard Statistics Department) Introduction Markov chains were first introduced in 906 by Andrey Markov, with the goal of showing that the Law of

More information

REVIEW FOR EXAM III SIMILARITY AND DIAGONALIZATION

REVIEW FOR EXAM III SIMILARITY AND DIAGONALIZATION REVIEW FOR EXAM III The exam covers sections 4.4, the portions of 4. on systems of differential equations and on Markov chains, and..4. SIMILARITY AND DIAGONALIZATION. Two matrices A and B are similar

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II

References. Markov-Chain Monte Carlo. Recall: Sampling Motivation. Problem. Recall: Sampling Methods. CSE586 Computer Vision II References Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

More information

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem? Who was Bayes? Bayesian Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison October 6, 2011 The Reverand Thomas Bayes was born in London in 1702. He was the

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information