Metropolis Algorithm


A6523 Modeling, Inference, and Mining — Jim Cordes, Cornell University

Lecture: MCMC example

Reading:
- Gregory (chapters assigned previously)
- MacKay, Chapter 29 (Monte Carlo Methods), http://…/book.pdf
- An Introduction to MCMC for Machine Learning (Andrieu et al., Machine Learning), http://link.springer.com/article/…
- Genetic Algorithms: Principles of Natural Selection Applied to Computation (Stephanie Forrest), http://science.sciencemag.org/content/…

Webpage: Projects! Abstract, paper, presentation (May).

Metropolis Algorithm

From the current state X_t, some other state is proposed as X_{t+1} and accepted with acceptance probability a (rejection probability 1 − a). Choose a such that:
- the probabilities of reaching different values of X are given by the target PDF;
- the target PDF is reached asymptotically, at a rate that depends on the proposal PDF used to generate trial values of X_{t+1};
- detailed balance is achieved (as many transitions out of as into a given state), which also means that the Markov sequence is time reversible.

Determining the acceptance probability

On previous pages we used the true transition matrix q(x'|x) that defines the Markov chain and that has the target PDF as its eigen-PDF. For MCMC problems we are free to choose any transition matrix we like, but its performance may or may not be suitable for a particular application. As Gregory says, finding an ideal proposal distribution is an art.

So let a candidate transition matrix be Q(x'|x), normalized in the usual way: Σ_{x'} Q(x'|x) = 1. Generally Q will not satisfy detailed balance for the target PDF:

P(x') Q(x|x') ≠ P(x) Q(x'|x).

We fix this by putting in a fudge factor a(x'|x) that multiplies the proposal density:

P(x) Q(x'|x) a(x'|x) = P(x') Q(x|x'),   or   a(x'|x) = P(x') Q(x|x') / [P(x) Q(x'|x)].

We don't want the factor to exceed unity, however, so we write

a(x'|x) = min{ 1, P(x') Q(x|x') / [P(x) Q(x'|x)] }.

MCMC exploits this convergence to the ensemble state probabilities. The simplest form of the algorithm:

1. Choose a proposal density Q(y|x_t) that will be used to determine the value of x_{t+1}. Suppose that this proposal density is symmetric in its arguments.
2. Generate a value y from the proposal density.
3. Calculate the test ratio a = P(y)/P(x_t). The test ratio is the acceptance probability for the candidate sample y.
4. Choose a random number u ∈ [0, 1].
5. If a ≥ 1, accept the sample and set x_{t+1} = y.
6. If a < 1, accept y if u ≤ a and set x_{t+1} = y.
7. Otherwise set x_{t+1} = x_t (i.e. the new value equals the previous value).
8. Each time step has a value.
9. The sampling steers the time sequence favorably toward regions of higher probability but allows the trajectory to move to regions of low probability.
10. Samples are correlated, as with a random-walk type process.
11. The burn-in time corresponds to the initial, transient portion of the time series x_t that it takes the Markov process to converge. Often the autocorrelation function of the time sequence is used to diagnose the time series.
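The following is a minimal Python sketch of this algorithm for a one-dimensional target with a symmetric (random-walk) proposal. It is not code from the notes; the function and parameter names are illustrative.

```python
import numpy as np

def metropolis(log_target, x0, n_steps, prop_sigma=1.0, rng=None):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    log_target : function returning log P(x) up to an additive constant
    x0         : starting state
    n_steps    : number of MCMC steps
    prop_sigma : standard deviation of the Gaussian proposal (tuning parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    chain = np.empty(n_steps)
    x, logp_x = x0, log_target(x0)
    for t in range(n_steps):
        y = x + prop_sigma * rng.standard_normal()   # draw from Q(y | x_t)
        logp_y = log_target(y)
        # test ratio a = P(y)/P(x_t); accept if a >= 1 or with probability a
        if np.log(rng.uniform()) < logp_y - logp_x:
            x, logp_x = y, logp_y
        chain[t] = x                                  # every step records a value
    return chain

# Example: sample a unit-variance Gaussian target with mean 3 (illustrative values)
if __name__ == "__main__":
    log_p = lambda x: -0.5 * (x - 3.0) ** 2
    samples = metropolis(log_p, x0=0.0, n_steps=20000, prop_sigma=2.0)
    print(samples[5000:].mean(), samples[5000:].std())  # discard burn-in
```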

[Figure from MacKay: the Metropolis–Hastings method in one dimension, showing the target density P(x) and the proposal density Q(x'; x) centered on two different current states x^(1) and x^(2). The proposal distribution Q(x'; x) is here shown as having a shape that changes as x changes, though this is not typical of the proposal densities used in practice.]

For general, possibly asymmetric forms of the transition matrix, the test ratio is

a = P(y) Q(x_t | y) / [P(x_t) Q(y | x_t)].

It reduces to the previous form when Q is symmetric in its arguments. This form preserves detailed balance of the Markov process (meaning that statistically the same results are obtained under time reversal), which is required in order for the state probability vector to converge to the desired target PDF.

A system in thermal equilibrium has as many particles leaving a state as entering it. By analogy, a Markov process that has stationary statistics must also satisfy detailed balance. With the acceptance probability defined above, the Markov chain will satisfy detailed balance. See Gregory for a proof; also the paper by Andrieu et al. on the course web page.
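To make the asymmetric case concrete, here is a minimal sketch (not from the lecture) of the general MH accept/reject test, assuming the user supplies log-density functions; the names log_target and log_q are illustrative.

```python
import numpy as np

def mh_accept(log_target, log_q, x, y, rng=None):
    """General Metropolis-Hastings test for a proposed move x -> y.

    log_target(z) : log P(z) up to an additive constant
    log_q(a, b)   : log Q(a | b), the log proposal density of a given b
    Implements a = P(y) Q(x_t | y) / [P(x_t) Q(y | x_t)], accepting with prob. min(1, a).
    """
    rng = np.random.default_rng() if rng is None else rng
    log_a = (log_target(y) + log_q(x, y)) - (log_target(x) + log_q(y, x))
    return np.log(rng.uniform()) < min(0.0, log_a)
```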

[Excerpt reproduced from the paper:]

Machine Learning, © Kluwer Academic Publishers. Manufactured in The Netherlands.

An Introduction to MCMC for Machine Learning

CHRISTOPHE ANDRIEU (C.Andrieu@bristol.ac.uk), Department of Mathematics, Statistics Group, University of Bristol, UK.
NANDO DE FREITAS (nando@cs.ubc.ca), Department of Computer Science, University of British Columbia, Vancouver, Canada.
ARNAUD DOUCET (doucet@ee.mu.oz.au), Department of Electrical and Electronic Engineering, University of Melbourne, Australia.
MICHAEL I. JORDAN (jordan@cs.berkeley.edu), Departments of Computer Science and Statistics, University of California at Berkeley, USA.

Abstract. The purpose of this introductory paper is threefold. First, it introduces the Monte Carlo method with emphasis on probabilistic machine learning. Second, it reviews the main building blocks of modern Markov chain Monte Carlo simulation, thereby providing an introduction to the remaining papers of this special issue. Lastly, it discusses new interesting research horizons.

Keywords: Markov chain Monte Carlo, MCMC, sampling, stochastic algorithms.

Introduction. A recent survey places the Metropolis algorithm among the ten algorithms that have had the greatest influence on the development and practice of science and engineering in the 20th century (Beichl & Sullivan). This algorithm is an instance of a large class of sampling algorithms, known as Markov chain Monte Carlo (MCMC). These algorithms have played a significant role in statistics, econometrics, physics and computing science over the last two decades. There are several high-dimensional problems, such as computing the volume of a convex body in d dimensions, for which MCMC simulation is the only known general approach for providing a solution within a reasonable time (polynomial in d) (Dyer, Frieze, & Kannan; Jerrum & Sinclair).

While convalescing from an illness, Stan Ulam was playing solitaire. It then occurred to him to try to compute the chances that a particular solitaire laid out with 52 cards would come out successfully (Eckhardt, 1987). After attempting exhaustive combinatorial calculations, he decided to go for the more practical approach of laying out several solitaires at random and then observing and counting the number of successful plays. This idea of selecting a statistical sample to approximate a hard combinatorial problem by a much simpler problem is at the heart of modern Monte Carlo simulation.

[Figures from Andrieu et al.: pseudocode of the Metropolis–Hastings algorithm, and the target distribution with histograms of the MCMC samples at different iteration points.]

The MH algorithm is very simple, but it requires careful design of the proposal distribution q(x*|x). In subsequent sections, we will see that many MCMC algorithms arise by considering specific choices of this distribution. In general, it is possible to use suboptimal inference and learning algorithms to generate data-driven proposal distributions. The transition kernel for the MH algorithm is

K_MH(x^(i+1) | x^(i)) = q(x^(i+1) | x^(i)) A(x^(i), x^(i+1)) + δ_{x^(i)}(x^(i+1)) r(x^(i)),

Toy examples of MCMC using Gaussian target and proposal PDFs

The target PDF is N(μ, σ²). For a proposal PDF we use N(μ_p, σ_p²), wide enough that generated values overlap with the target PDF; so use μ_p = 0 and σ_p of order |μ| plus a few target widths. In practice, of course, we would not know the parameters of the target PDF (otherwise what would be the point of doing MCMC?) and we might not know its support in parameter space. Experimentation may be required to ensure that the parameter space is adequately sampled.

Plots:
- Histograms of MC points x_t, t = 1, ..., N for different N and different μ and σ.
- Autocovariance functions of the MC time series x_t for single realizations, which show the correlation time for μ̂_x.

Lessons: the more the target and proposal PDFs differ, the longer it takes for the time series to show stationary statistics that conform to the target PDF. The burn-in time is thus longer in such cases because it is related to the autocorrelation time. Example time series are shown for two of the cases that illustrate the burn-in time and the correlation time.
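A sketch of this toy experiment (assuming NumPy; the target parameters and proposal width below are illustrative, not the values used for the figures). Because the zero-mean proposal is drawn independently of the current state, the full MH ratio is used here to correct for the asymmetry; the notes' simple Metropolis ratio applies when the proposal is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.9             # target N(mu, sigma^2), illustrative values
sigma_p = abs(mu) + 4 * sigma    # proposal width wide enough to span the target

def log_target(x):
    return -0.5 * ((x - mu) / sigma) ** 2

def log_prop(x):
    return -0.5 * (x / sigma_p) ** 2

x = 0.0
chain = []
for t in range(50000):
    y = sigma_p * rng.standard_normal()           # independent draw from the proposal
    log_a = (log_target(y) + log_prop(x)) - (log_target(x) + log_prop(y))
    if np.log(rng.uniform()) < log_a:
        x = y
    chain.append(x)

chain = np.array(chain)
print("target mu, sigma:", mu, sigma)
print("chain  mu, sigma:", chain[1000:].mean(), chain[1000:].std())  # after burn-in
```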

Histograms

These demonstrate how the distribution of MC points trends to the target PDF. Target PDF = Gaussian with non-zero mean; proposal PDF = N(0, σ_p²) with σ_p wide enough to span the target PDF.

[Figure: histograms of MCMC samples of an offset Gaussian target PDF using a zero-mean Gaussian proposal PDF, for increasing chain lengths N; each panel lists the target (μ, σ) and the sample estimates (μ̂, σ̂), with the target and proposal PDFs overplotted.]

[Figure, continued: the same histogram panels annotated with the target μ, σ and with μ, σ estimated from the MC values; as N grows, the histograms and the estimates converge to the target PDF.]

[Figure: histograms for an even broader target PDF, with the proposal PDF made proportionately broader.]

Four cases with different target PDFs. Even for target PDFs with large means, we obtain convergence.

[Figures: histogram panels for additional cases, including a narrow target PDF; note that only a few distinct states appear over the first MC samples.]

[Figures: histogram panels for a broader target PDF and for a sequence of progressively narrower target PDFs.]

[Figures: further histogram panels for the narrow-target cases.]

ACFs of MCMC-generated Time Series

- The width of the ACF gives the correlation time of the time series.
- Too long a correlation time → inefficient sampling of parameter space.
- Longer correlation times correspond to proposal PDFs that have larger support relative to the support of the target PDF.

Time Series of MCMC Samples

[Figure: time series of MCMC samples for a case with a wide target PDF; the target (μ, σ) and sample estimates (μ̂, σ̂) are listed, with the sample value plotted against time (steps).]
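A sketch of estimating the autocovariance function and a crude correlation time from a chain (assuming NumPy; the 1/e criterion is one simple, illustrative choice).

```python
import numpy as np

def autocovariance(x):
    """Biased sample autocovariance of a 1-D MCMC time series, lags 0..N-1."""
    x = np.asarray(x) - np.mean(x)
    n = len(x)
    return np.correlate(x, x, mode="full")[n - 1:] / n

def correlation_time(x):
    """Crude correlation time: first lag where the normalized ACF drops below 1/e."""
    acf = autocovariance(x)
    acf = acf / acf[0]
    below = np.where(acf < np.exp(-1))[0]
    return below[0] if below.size else len(acf)

# Usage (with 'chain' from a sampler such as the sketches above):
# tau = correlation_time(chain); keep roughly one sample per correlation time.
```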

Time Series of MCMC Samples

[Figure: time series for a case with a narrow target PDF, for two realizations; the burn-in time is visible as the initial transient before the chain settles onto the target distribution.]

[Figure: the full autocovariance function (ACV) of the chain versus lag (time steps), and a zoom in to the innermost lags (same case, different realization).]

Relatively wide target PDF.

[Figures: ACV versus lag (time steps) for two realizations.]

Wider target PDF → narrower ACF.

[Figures: ACV versus lag for the wider-target case.]

Narrower target PDF → wider ACF.

[Figures: ACV versus lag (time steps) for two narrow-target cases.]

Narrower target PDF → wider ACF.

[Figure: ACV versus lag (time steps) for the narrowest-target case.]

Unsuitable Proposal PDFs

[Figures only on the following pages: examples of MCMC runs with unsuitable proposal PDFs.]


Gibbs Sampling

Gibbs sampling in MCMC is a simplified way to explore an N-dimensional space. It proceeds by MC-sampling points along each axis sequentially, so it is similar to the 2D examples in class and in MacKay Chapter 29 (a minimal sketch appears after the reproduced abstract below).

[Reproduced from Statistics and Computing:] A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces. Cajo J. F. Ter Braak.

Abstract: Differential Evolution (DE) is a simple genetic algorithm for numerical optimization in real parameter spaces. In a statistical context one would not just want the optimum but also its uncertainty. The uncertainty distribution can be obtained by a Bayesian analysis (after specifying prior and likelihood) using Markov Chain Monte Carlo (MCMC) simulation. This paper integrates the essential ideas of DE and MCMC, resulting in Differential Evolution Markov Chain (DE-MC). DE-MC is a population MCMC algorithm, in which multiple chains are run in parallel. DE-MC solves an important problem in MCMC, namely that of choosing an appropriate scale and orientation for the jumping distribution. In DE-MC the jumps are simply a fixed multiple of the differences of two random parameter vectors that are currently in the population. The selection process of DE-MC works via the usual Metropolis ratio, which defines the probability with which a proposal is accepted. In tests with known uncertainty distributions, the efficiency of DE-MC with respect to random walk Metropolis with optimal multivariate Normal jumps ranged from moderate for small population sizes to substantially higher for large population sizes, and higher still for the 97.5% point of a variable from a multi-dimensional Student distribution. Two Bayesian examples illustrate the potential of DE-MC in practice. DE-MC is shown to facilitate multidimensional updates in a multi-chain Metropolis-within-Gibbs sampling approach. The advantages of DE-MC over conventional MCMC are simplicity, speed of calculation and convergence, even for nearly collinear parameters and multimodal densities.
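A minimal illustration (not from the notes) of sequential axis-by-axis sampling for a two-dimensional Gaussian target, assuming NumPy; the correlation value is illustrative. Each coordinate is drawn exactly from its conditional distribution given the current value of the other.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, rng=None):
    """Gibbs sampler for a zero-mean bivariate Gaussian with unit variances and
    correlation rho: each coordinate is updated in turn from its exact conditional."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sqrt(1.0 - rho ** 2)                     # conditional standard deviation
    x1, x2 = 0.0, 0.0
    out = np.empty((n_steps, 2))
    for t in range(n_steps):
        x1 = rho * x2 + s * rng.standard_normal()   # x1 | x2 ~ N(rho*x2, 1 - rho^2)
        x2 = rho * x1 + s * rng.standard_normal()   # x2 | x1 ~ N(rho*x1, 1 - rho^2)
        out[t] = x1, x2
    return out

samples = gibbs_bivariate_normal(rho=0.9, n_steps=20000)
print(np.corrcoef(samples[2000:].T))                # close to 0.9 after burn-in
```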

Bayesian Spectral Estimation

A Parametric Bayesian Approach to Spectral Analysis

Consider a stochastic process whose (ensemble-average) power spectrum is S(f). From a data vector x = col(x(t_1), ..., x(t_n)) we want to estimate the spectrum. The covariance matrix C for x has elements equal to appropriate values of the autocorrelation function R(τ), which can be calculated from the Fourier transform of the power spectrum (Wiener–Khinchin theorem):

C = ⟨x x†⟩ = {C_ij} = {R(t_i − t_j)}.

The spectrum has parameters θ: S(f) = S(f; θ), so the covariance matrix is also a function of the parameters, C = C(θ). A quadratic form for the data (similar to a cost function) is

Q = x† C⁻¹(θ) x.

If x(t) is a Gaussian process, then the likelihood function is

L(θ) = (2π)^{−n/2} [det C(θ)]^{−1/2} exp[ −(1/2) x† C⁻¹(θ) x ].

If the prior PDF for the parameters is f(θ), the posterior PDF for θ is

f(θ | x) = f(θ) L(θ) / ∫ dθ f(θ) L(θ).
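As a concrete numerical sketch of evaluating this likelihood (not code from the notes), the following assumes NumPy/SciPy and uses the simplest covariance model, white noise with a single variance parameter, as the parametric spectrum; the function names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood(x, C):
    """Gaussian-process log-likelihood
    ln L(theta) = -1/2 [ x^T C^-1 x + ln det C + n ln 2*pi ]
    for a real data vector x and covariance matrix C = C(theta)."""
    n = len(x)
    cf = cho_factor(C, lower=True)
    quad = x @ cho_solve(cf, x)                       # x^T C^-1 x via Cholesky solves
    logdet = 2.0 * np.sum(np.log(np.diag(cf[0])))     # ln det C from the Cholesky factor
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

def C_white(n, sigma2):
    """Illustrative parametric covariance: white noise of variance sigma2."""
    return sigma2 * np.eye(n)

rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=200)
print(log_likelihood(x, C_white(len(x), 4.0)))        # evaluate at one parameter value
```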

Examples

White noise: The covariance matrix is diagonal and, if all elements are equal ("homoscedastic"), the only parameter is σ², and the ACF is R(τ) = σ² δ_{τ,0}. So the spectral estimate is simply Ŝ(f) = σ̂²/B, where σ̂² is the posterior estimate of the variance and B is the bandwidth. Note that it is consistent to use a band-limited process for white noise that is a discrete process. For a continuous process, the Dirac delta function that would apply is formally inconsistent with a finite bandwidth. Real-world cases do not have this issue!

Power-law power spectrum: This case is tricky because a spectral cutoff is required to avoid a divergence in the total variance, but a finite data set may not sample the entire extent of the spectrum. Let the spectrum have the form

S(f) = S₀ f^{−γ},   f ≥ f_L,

where the lower-frequency cutoff f_L is required to keep the variance finite for steep spectral indices. We will ignore shallower power laws; they require an upper frequency cutoff to keep the total variance finite. The covariance matrix elements C_ij follow from the Wiener–Khinchin transform of this spectrum; van Haasteren and Levin (MNRAS) give a closed-form expression consisting of a term proportional to (2π f_L τ_ij)^{γ−1} plus an infinite series in even powers of (2π f_L τ_ij), where τ_ij = t_i − t_j. In general, the model parameters are θ = col(S₀, f_L, γ). For finite data sets, terms in the infinite sum can be truncated beyond some n when f_L τ_max ≪ 1 (van Haasteren and Levin). In some applications for pulsars, where a quadratic polynomial is removed from data of length T, the dependence of C_ij on the cutoff f_L is removed in some cases. If there is no dependence on f_L, the model parameters are θ = col(S₀, γ). When the time series duration T satisfies T f_L ≪ 1, we expect the covariance matrix to be independent of f_L because the lowest-frequency sinusoids contained in the spectrum are effectively constant over the interval [0, T].

Steep power-law spectra yield time series with sample variances that vary radically from realization to realization. For steep spectral indices, values of the variance are spread over two orders of magnitude or more (Shannon and Cordes, ApJ). This implies that the spectral estimate from an individual realization may give a significantly biased result for at least the amplitude of the spectrum Ŝ₀ and perhaps for the spectral index γ̂. Numerical experiments indicate that the Bayesian estimates from simulated spectra recover the input parameters reasonably well.

Bayesian Inference on a Chirped Sinusoid

Chirped Signals: A Bayesian Approach to Spectral Analysis

Chirped signals are oscillating signals with time-variable frequencies, usually with a linear variation of frequency with time, e.g.

f(t) = A cos(ω t + α t² + φ).

Examples:
- plasma wave diagnostic signals
- signals propagated through dispersive media (seismic cases, plasmas)
- gravitational waves from inspiraling binary stars
- Doppler-shifted signals over fractions of an orbit (e.g. acceleration of a pulsar in its orbit)

Jaynes' approach to spectral analysis: cf. Jaynes, "Bayesian Spectrum and Chirp Analysis," in Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems. Cited by Bretthorst in Bayesian Spectrum Analysis and Parameter Estimation, and briefly by Gregory in Bayesian Logical Data Analysis for the Physical Sciences.

Result: Optimal processing is a nonlinear operation on the data, without recourse to smoothing. However, the DFT-based spectrum (the "periodogram") plays a key role in the estimation.

Fresnel function: c(t) = e^{iαt²}. What is the FT of c(t)?

Start with Bayes' theorem:

p(H|DI) = p(H|I) p(D|HI) / p(D|I),

i.e. posterior probability = prior probability × (probability of the new data under H) / (evidence).

In this context, probabilities represent a simple mapping of degrees of belief onto real numbers. Recall:

p(D|HI) vs. D for fixed H = sampling distribution;
p(D|HI) vs. H for fixed D = likelihood function.

Read H as a statement that a parameter vector lies in a region of parameter space.

Data model:

y(t) = f(t) + e(t),
f(t) = A cos(ω t + α t² + φ), with ω = ω₀ and α = α₀ for the data,
e(t) = white Gaussian noise, ⟨e⟩ = 0, ⟨e²⟩ = σ².

Data set: D = {y(t), −T ≤ t ≤ T}, N = 2T + 1 data points.
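To make the data model concrete, a minimal simulation sketch in Python follows (the numerical values are illustrative, not taken from the lecture):

```python
import numpy as np

# Simulated data set of the form used here:
#   y(t) = A cos(w0 t + a0 t^2 + phi) + e(t),  t = -T, ..., T  (N = 2T + 1 samples),
# with white Gaussian noise of variance sigma^2.
rng = np.random.default_rng(42)
T = 256
t = np.arange(-T, T + 1)                   # N = 2T + 1 samples
A, w0, a0, phi = 1.0, 0.20, 2.0e-4, 0.7    # amplitude, frequency, chirp rate, phase
sigma = 1.0
y = A * np.cos(w0 * t + a0 * t**2 + phi) + sigma * rng.standard_normal(t.size)
```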

Data probability: The probability of obtaining a data set of N samples is

P(D|HI) = Π_t P[y(t)] = Π_{t=−T}^{T} (2πσ²)^{−1/2} exp{ −[y(t) − f(t)]² / 2σ² },

which we can rewrite as a likelihood function once we acquire a data set and evaluate the probability for a specific H. Writing out the parameters explicitly, the likelihood function is

L(A, ω, α, φ) ∝ exp{ −(1/2σ²) Σ_{t=−T}^{T} [y(t) − A cos(ω t + α t² + φ)]² }.

For simplicity, assume that ωT ≫ 1 so that many cycles of oscillation are summed over. Then

Σ_t cos²(ω t + α t² + φ) = Σ_t (1/2)[1 + cos 2(ω t + α t² + φ)] ≈ (2T + 1)/2 = N/2.

Expanding the argument of the exponential in the likelihood function, we have

[y(t) − A cos(ω t + α t² + φ)]² = y²(t) + A² cos²(ω t + α t² + φ) − 2A y(t) cos(ω t + α t² + φ).

We care only about terms that are functions of the parameters, so we drop the y²(t) term to get

Σ_{t=−T}^{T} [y(t) − A cos(ω t + α t² + φ)]² → Σ_t [A² cos²(ω t + α t² + φ) − 2A y(t) cos(ω t + α t² + φ)] ≈ NA²/2 − 2A Σ_t y(t) cos(ω t + α t² + φ).

The likelihood function becomes

L(A, ω, α, φ) ∝ exp{ −NA²/4σ² + (A/σ²) Σ_t y(t) cos(ω t + α t² + φ) }.

Integrating out the phase: In calculating a power spectrum [in this case, a chirped power spectrum (a "chirpogram")], we do not care about the phase of any sinusoid in the data. In Bayesian estimation, such a parameter is called a nuisance parameter. Since we do not know anything about φ, we integrate over its prior distribution, a PDF that is

uniform over [0, 2π]:

f(φ) = 1/2π for 0 ≤ φ ≤ 2π, and 0 otherwise.

The marginalized likelihood function becomes

L(A, ω, α) = ∫ dφ f(φ) L(A, ω, α, φ) ∝ exp(−NA²/4σ²) (1/2π) ∫₀^{2π} dφ exp{ (A/σ²) Σ_t y(t) cos(ω t + α t² + φ) }.

Using the identity

cos(ω t + α t² + φ) = cos(ω t + α t²) cos φ − sin(ω t + α t²) sin φ,

we have

Σ_t y(t) cos(ω t + α t² + φ) = cos φ Σ_t y(t) cos(ω t + α t²) − sin φ Σ_t y(t) sin(ω t + α t²) ≡ P cos φ − Q sin φ,

and

P cos φ − Q sin φ = (P² + Q²)^{1/2} cos[φ + tan⁻¹(Q/P)].

This result may be used to evaluate the integral over φ in the marginalized likelihood function. To evaluate it we use the identity

I₀(x) = (1/2π) ∫₀^{2π} dφ e^{x cos φ} = modified Bessel function.

This yields

(1/2π) ∫₀^{2π} dφ exp{ (A/σ²)(P² + Q²)^{1/2} cos[φ + tan⁻¹(Q/P)] } = I₀[ (A/σ²)(P² + Q²)^{1/2} ],

since the phase shift tan⁻¹(Q/P) is irrelevant over a full period. We now simplify P² + Q²:

P² + Q² = [Σ_t y(t) cos(ω t + α t²)]² + [Σ_t y(t) sin(ω t + α t²)]²
        = Σ_t Σ_{t'} y(t) y(t') [cos(ω t + α t²) cos(ω t' + α t'²) + sin(ω t + α t²) sin(ω t' + α t'²)]
        = Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)].

Define

C(ω, α) ≡ (1/N)(P² + Q²) = (1/N) Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)].

Then the integral over φ gives

(1/2π) ∫ dφ L(A, ω, α, φ) ∝ I₀[ (A/σ²)(N C(ω, α))^{1/2} ],

and the marginalized likelihood is

L(A, ω, α) = e^{−NA²/4σ²} I₀[ (A/σ²)(N C(ω, α))^{1/2} ].

Notes:

(1) The data appear only in C(ω, α).

(2) C is a sufficient statistic, meaning that it contains all information from the data that is relevant to inference using the likelihood function.

(3) How do we read L(A, ω, α)? As the probability distribution of the parameters A, ω, α in terms of the data-dependent quantity C(ω, α). (Note that L is not normalized as a PDF.) As such, L is a quite different quantity from the Fourier-based power spectrum.

(4) What is the quantity C(ω, α) = (1/N) Σ_t Σ_{t'} y(t) y(t') cos[ω(t − t') + α(t² − t'²)]? For a given data set, ω and α are variables. If we plot C(ω, α), we expect to get a large value when ω = ω_signal and α = α_signal.

(5) For a non-chirped but oscillatory signal (α = 0), the quantity C(ω, 0) is nothing other than the periodogram (the squared magnitude of the Fourier transform of the data). We then see that, for this case, the likelihood function is a nonlinear function of the Fourier estimate of the power spectrum.
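A direct numerical sketch of the sufficient statistic and the marginalized likelihood (assuming NumPy/SciPy; grid ranges, parameter values and function names are illustrative choices, not the lecture's code):

```python
import numpy as np
from scipy.special import i0e

def chirpogram(y, t, omegas, alphas):
    """Sufficient statistic C(omega, alpha) = (P^2 + Q^2)/N, with
    P = sum_t y(t) cos(omega t + alpha t^2) and Q = sum_t y(t) sin(omega t + alpha t^2)."""
    N = len(y)
    C = np.empty((len(alphas), len(omegas)))
    for i, a in enumerate(alphas):
        for j, w in enumerate(omegas):
            phase = w * t + a * t**2
            P = np.sum(y * np.cos(phase))
            Q = np.sum(y * np.sin(phase))
            C[i, j] = (P**2 + Q**2) / N
    return C

def log_marginal_likelihood(C, A, sigma2, N):
    """log L(A, omega, alpha) = -N A^2/(4 sigma^2) + ln I0(A sqrt(N C)/sigma^2),
    evaluated stably with the exponentially scaled Bessel function i0e."""
    x = A * np.sqrt(N * C) / sigma2
    return -N * A**2 / (4.0 * sigma2) + x + np.log(i0e(x))

# Usage with the simulated chirped data (y, t) from the earlier sketch:
# C = chirpogram(y, t, omegas=np.linspace(0.1, 0.3, 200), alphas=np.linspace(0, 5e-4, 100))
# logL = log_marginal_likelihood(C, A=1.0, sigma2=1.0, N=len(y))
```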

Interpretation of the Bayesian and Fourier Approaches

We found the marginalized likelihood for the frequency and chirp rate to be

L(A, ω, α) = e^{−NA²/4σ²} I₀[ (A/σ²)(N C(ω, α))^{1/2} ],

and the limiting form of the Bessel function for large argument x is

I₀(x) ≈ e^x / (2πx)^{1/2}.

In this case the marginalized likelihood is

L(A, ω, α) ≈ e^{−NA²/4σ²} exp[ (A/σ²)(N C(ω, α))^{1/2} ] / [2π (A/σ²)(N C(ω, α))^{1/2}]^{1/2}.

Since C(ω, α) is large when ω and α match those of any true signal, we see that it is exponentiated, as compared to appearing linearly in the periodogram.

Now let's consider the case with no chirp rate, α = 0. Examples in the literature show that the width of the Bayesian PDF is much narrower than the periodogram C(ω, 0). Does this mean that the uncertainty principle has been avoided? The answer is no!

Uncertainty principle in the periodogram: for a data set of length T, the frequency resolution implied by the spectral window function is Δf ≈ 1/T.

Width of the Bayesian PDF: when the argument of the Bessel function is large, the exponentiation causes the PDF to be much narrower than the spectral window of the periodogram.

Interpretation:

The periodogram is the distribution of power (or variance) with frequency for the particular realization of data used to form the periodogram. The spectral window also depicts the distribution of variance for a pure sinusoid in the data (with infinite signal-to-noise ratio). The Bayesian posterior is the PDF for the frequency of a sinusoid and therefore represents a very different quantity from the periodogram; the two are thus not directly comparable.

1. The Bayesian method addresses the question: what is the PDF for the frequency of the sinusoid that is in the data?

2. The periodogram is the distribution of variance in frequency.

3. If we use the periodogram to estimate the sinusoid's frequency, we get a result that is more comparable:

(a) First note that the width of the posterior PDF involves the signal-to-noise ratio (through the square root of the periodogram, via the factor A√(N C)/σ²), while the width of the periodogram's spectral window is independent of the SNR.

(b) General result: if a spectral line has width Δν, its centroid can be determined to an accuracy of about Δν/SNR. This result follows from matched filtering, which we will discuss later on.

(c) Quantitatively, the periodogram yields the same information about the location of the spectral line as does the posterior PDF.

4. Problem: derive an estimate for the width of the posterior PDF that can be compared with the estimate for the periodogram.

Comparison of Spectral Line Localization Properties

Claim: While the periodogram gives a spectral line that is much broader than the width of the posterior PDF for frequency, the ability to localize the spectral line in frequency is the same for both approaches.

Periodogram: The signal-to-noise ratio (S/N) of the line scales with √N A/σ (as in the DFT of a complex exponential). The spectral resolution is Δω_res ≈ 2π/(2T + 1), since our time interval is [−T, T]. The width of the line (e.g. FWHM) is of order the spectral resolution. Assume the S/N is large.

Posterior PDF: The PDF for ω is dominated by the exponential factor

E(ω) = exp[ A (N C(ω, α))^{1/2} / σ² ].

From the expression for C we have C_max = C(ω = ω₀) = N A²/4, so

E_max = exp[ A (N C_max)^{1/2} / σ² ] = exp[ (N/2)(A/σ)² ].

For offset frequencies ω = ω₀ + δω we can expand various things to show that E(ω₀ + δω) falls off from E_max roughly as a Gaussian in δω(2T + 1), scaled by (N/2)(A/σ)². This function has a width δω (where the exponential has fallen to 1/2) that is smaller than the periodogram resolution by roughly the S/N factor. In terms of resolution units,

δω / Δω_res ≈ 1 / (√N A/σ) = 1 / (S/N of the line in the periodogram).

[Figure: Left: time series of a sinusoid plus white noise with the stated A/σ, sampled N times over the interval [−T, T]. Right: the periodogram (red) and the Bayesian PDF of the time series.]

[Figure: Left: time series of a sinusoid plus white noise with a different (smaller) A/σ, sampled N times over the interval [−T, T]. Right: the periodogram (red) and the Bayesian PDF of the time series.]

Prewhitening and Sinusoid Detection

What is prewhitening? Prewhitening is an operation that processes a time series (or some other data sequence) to make it behave statistically like white noise. The "pre" means that whitening precedes some other analysis that likely works better if the additive noise is white. These operations can be viewed in either the time domain or the frequency domain:

1. Make the ACF of the time series appear more like a delta function.
2. Make the spectrum appear flat.

Example data sets that may require prewhitening:

1. A well-behaved noise process with an additive low-frequency (or polynomial) trend added to it.
2. A deterministic signal with an additive red-noise process.

Viewed in the frequency domain, prewhitening means that the dynamic range of the measured data is reduced.

Why bother? Recall from our discussions of spectral analysis the issues of leakage and bias. These arise from sidelobes inherent to spectral estimation. We can minimize leakage in two ways: (1) make sidelobes smaller and (2) minimize the power that is prone to leaking into sidelobes. Spectral windows address the former while prewhitening mitigates the latter.

Leakage into sidelobes also constitutes bias in spectral estimates. However, bias appears in other data-analysis procedures as well. Consider least-squares fitting of a sinusoid to a signal of the form

x(t) = A cos(ω₀ t + φ) + r(t) + n(t),

where n(t) is WSS white noise and r(t) is red noise with a steep power spectrum. Red noise can strongly bias fitting of a model x̂(t) = Â cos(ω̂ t + φ̂) because its power can leak across the underlying spectrum, causing a least-squares fit to give highly discrepant values of Â, ω̂, and φ̂. Prewhitening of the time series would ideally yield a transformed time series of the form

x'(t) = A' cos(ω₀ t + φ') + n'(t),

to which fitting a sinusoidal model will be less biased.

Procedures: We have already seen one analysis that is related to prewhitening: the matched filter (MF). The MF doesn't whiten the spectrum of the output, but it does weight the frequency components of the measured quantity to maximize the S/N of the signal. The signal model in this case is x(t) = a A(t) + n(t). Recall, for an arbitrary spectrum S_n(f) of the additive noise, that the frequency-domain MF for a signal A(t) is

h̃(f) ∝ Ã*(f) / S_n(f).

Taking equality for simplicity, when the filter is applied to the measurements x(t) we have

ỹ(f) = x̃(f) h̃(f) = a |Ã(f)|² / S_n(f) + ñ(f) Ã*(f) / S_n(f).

This means that the ensemble-average spectrum of the filter output is

⟨|ỹ(f)|²⟩ = a² |Ã(f)|⁴ / S_n²(f) + |Ã(f)|² / S_n(f) = [ |Ã(f)|² / S_n(f) ] [ a² |Ã(f)|² / S_n(f) + 1 ].

Signals with trends: A common situation is where a quantity of the form a A(t) + n(t) is superposed with a strong trend, such as a baseline variation. Similar issues arise in measurements of spectra. Consequences of trends include:

1. Bias in estimating parameters of A(t − t₀) or its spectral analog.

2. Erroneous estimates of cross-correlations between two time series such as x(t) = s₁(t) + n₁(t) and y(t) = s₂(t) + n₂(t), where s₁,₂ are signals of interest and n₁,₂ are measurement errors. I.e., we may be interested in the correlation

C₁₂ = (1/N_t) Σ_t s₁(t) s₂(t)   or   C₁₂ = (1/N_t) Σ_t [s₁(t) − s̄₁][s₂(t) − s̄₂],

where s̄₁,₂ = (1/N_t) Σ_t s₁,₂(t) are the sample means. If there are trends p₁,₂(t) added to x(t) and y(t), the correlation Ĉ₁₂ of x and y used to estimate C₁₂ may be dominated completely by the trends and not by the signal parts of the measurements.

A fix: Trends can often be modeled as a polynomial of some order that can be fitted to the measurements. The order of the polynomial needs to be chosen wisely. For a pulse or spectral line confined to some range of t or f this is straightforward. But for a detection problem where the signal location is not known, the situation is very tricky.

Prewhitening filter: Consider again x(t) = a A(t) + n(t) and let's trivially construct a frequency-domain filter that whitens the measurements. We want a filter h(t) that flattens the noise n(t) in the frequency domain. Let y(t) = x(t) ∗ h(t), where ∗ means convolution. All we need is

h̃(f) = 1 / [S_n(f)]^{1/2}.

Then the ensemble spectrum of the output ỹ(f) is

⟨|ỹ(f)|²⟩ = ⟨|x̃(f)|²⟩ |h̃(f)|² = ⟨|x̃(f)|²⟩ / S_n(f) = a² |Ã(f)|² / S_n(f) + 1.

Note how this differs from the result for a matched filter. But the result is that, in the mean, the spectrum of the additive noise has been flattened. Prewhitening is important in both detection and estimation applications.
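A minimal frequency-domain prewhitening sketch (not from the notes), assuming the noise spectrum S_n(f) is known on the DFT frequency grid; the red-noise model used for illustration is an assumption.

```python
import numpy as np

def prewhiten(x, S_n):
    """Frequency-domain prewhitening: multiply the DFT of x by h(f) = 1/sqrt(S_n(f))
    so that, in the ensemble mean, the additive-noise spectrum of the output is flat.

    S_n : noise power spectrum sampled on the same DFT frequency grid as np.fft.fft(x).
    """
    X = np.fft.fft(x)
    H = 1.0 / np.sqrt(S_n)
    return np.real(np.fft.ifft(X * H))

# Illustrative use with an assumed red-noise spectrum model S_n(f) ~ f^-2:
N = 1024
f = np.fft.fftfreq(N)
S_n = 1.0 / np.maximum(np.abs(f), 1.0 / N) ** 2   # regularize the f = 0 bin
rng = np.random.default_rng(3)
x = np.cumsum(rng.standard_normal(N))             # a simple red-noise-like series
y = prewhiten(x, S_n)                             # y has an approximately flat spectrum
```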

Leakage and Bias

Prewhitening in the least-squares estimation context: Consider our standard linear model

y = X θ + n,

which has the least-squares solution for the parameter vector

θ̂ = (X† C_n⁻¹ X)⁻¹ X† C_n⁻¹ y,

where the covariance matrix of the noise vector n is C_n = ⟨n n†⟩. This is also the maximum-likelihood solution in the right circumstances (which are?). As with any covariance matrix, C_n is Hermitian and positive semi-definite. This means that the quadratic form for an arbitrary vector z satisfies z† C_n z ≥ 0. Such matrices can always be factored according to the Cholesky decomposition:

C_n = L L†,

where L is a lower-triangular matrix, e.g.

L = [ a 0 0 0
      b c 0 0
      d e f 0
      g h i j ].

See http://en.wikipedia.org/wiki/Cholesky_decomposition

Utility: we can transform the model as follows using L:

y = L y_w,   X = L X_w.

Substituting into the solution vector for θ, and using

y† = (L y_w)† = y_w† L†,   X† = (L X_w)† = X_w† L†,   and   C_n⁻¹ = (L L†)⁻¹ = L⁻† L⁻¹,

yields

θ̂ = (X† C_n⁻¹ X)⁻¹ X† C_n⁻¹ y = (X_w† X_w)⁻¹ X_w† y_w.

So what? The solution is identical to the least-squares case where the noise covariance matrix is diagonal; i.e. the noise vector n_w = L⁻¹ n has been transformed to white noise. We have whitened the data.

When is this useful? An example is the fitting of a sinusoidal function amid red noise, where leakage effects are important just as they are for spectral analysis. A specific example is the fitting of astrometric parameters or periodicities in radial velocity data.

What's the catch? You need to know the covariance matrix of the noise n to do the Cholesky decomposition. This can be easier said than done!

Examples of sine wave + red and white noise

Examples were generated with a signal

y(t) = cos(2π t / P + φ) + r(t)/snr_r + w(t)/snr_w,

where r and w have unit variance and are scaled by the signal-to-noise ratios snr_r and snr_w, respectively. The covariance matrix for the combined noise n = r + w was calculated by averaging C_n = ⟨n n†⟩ over realizations. Note that for some real situations, where we have only a single time series, we would need to calculate C_n differently, e.g. from first principles, prior knowledge, etc. In practice, realizations of r were generated and the mean subtracted. Then white noise was added to form n, and the Cholesky decomposition was done using the command

L = scipy.linalg.cholesky(C_n, lower=True)

For data vectors of length N, the lower-triangular matrix L is N × N. If the mean had been subtracted from the white noise as well, the covariance matrix would be rank deficient and the decomposition would fail.

Results in the following figures indicate that:

1. Power-law red noise with shallow spectral indices does not benefit particularly from whitening because leakage is much less.
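A sketch of the whitening transformation for the linear model, assuming NumPy/SciPy; the toy covariance and design matrix below are illustrative, not the ones used to make the figures.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whiten(y, X, C_n):
    """Whiten a linear model y = X theta + n with noise covariance C_n = L L^T:
    solving L y_w = y and L X_w = X gives noise n_w = L^{-1} n with unit covariance,
    after which ordinary (unweighted) least squares applies."""
    L = cholesky(C_n, lower=True)
    y_w = solve_triangular(L, y, lower=True)
    X_w = solve_triangular(L, X, lower=True)
    return y_w, X_w

# Illustrative use: fit a sinusoid (cos/sin pair) amid correlated noise.
rng = np.random.default_rng(7)
N = 200
t = np.arange(N)
X = np.column_stack([np.cos(2 * np.pi * t / 25.0), np.sin(2 * np.pi * t / 25.0)])
C_n = np.eye(N) + 0.5 * np.exp(-np.abs(t[:, None] - t[None, :]) / 10.0)  # toy red+white covariance
n = cholesky(C_n, lower=True) @ rng.standard_normal(N)                   # noise with covariance C_n
y = X @ np.array([1.0, 0.5]) + n
y_w, X_w = whiten(y, X, C_n)
theta_hat, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
print(theta_hat)   # should be close to [1.0, 0.5]
```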

2. What matters is the signal-to-noise ratio of the cosine relative to the red-noise power contained in one resolution bandwidth Δf ≈ 1/T centered on the frequency of the sinusoid. For a steep power law, only a small fraction of the total power in the red noise is in this band, whereas the flatter the spectrum, the larger this fraction is.

[Figure 1: Example of whitening using the Cholesky decomposition. The signal consists of a sine wave with additive red and white noise; the signal-to-noise ratios relative to each kind of noise are given. Top left: original time series (red) and whitened time series (black). Bottom left: original noise (red) and whitened noise (black). Top right: power spectra of the original and whitened time series. Bottom right: power spectra of the original and whitened noise sequences.]

[Figures 2–9: further examples of Cholesky whitening for different red-noise spectral indices and signal-to-noise ratios, with the same panel layout as Figure 1: original and whitened time series (top left), original and whitened noise (bottom left), and the corresponding power spectra (right panels).]

Impulse Response and Spectrum of the Whitening Filter

We can think of the Cholesky decomposition as a filter that suppresses low frequencies for the purpose of estimating the parameters of a sinusoid. The filter response can be calculated from the impulse response as follows. Construct a data vector i with i_j = 0 for all j except j = j₀, where i_{j₀} = 1. Then the impulse response is h = L⁻¹ i. Expressed as a time function h_j, j = 1, ..., N, the frequency-domain response is the squared magnitude of the DFT of h_j, |H_k|² (a sketch of this calculation is given after the figure below).

[Figure 10: Example of whitening using the Cholesky decomposition along with the impulse response and its spectrum. The signal consists of a sine wave with additive red and white noise; the signal-to-noise ratios relative to each kind of noise are given. Left figure: top left, original time series (red) and whitened time series (black); bottom left, original noise (red) and whitened noise (black); top right, power spectra of the original and whitened time series; bottom right, power spectra of the original and whitened noise sequences. Right figure: top panel, input impulse (red) and impulse response of the Cholesky filter; bottom panel, spectra of the impulse and of the impulse response, respectively. The filter shows the suppression of frequencies below a cutoff that is signal-to-noise-ratio dependent.]

Note that the form of the impulse response is that of a running difference, which will remove a low-frequency variation.
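A sketch of computing the impulse response and its spectrum from the Cholesky factor (assuming NumPy/SciPy; the choice of impulse location j0 is arbitrary and the covariance C_n is whatever model is in hand).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whitening_impulse_response(C_n, j0=None):
    """Impulse response of the Cholesky whitening filter: h = L^{-1} i, where i is a
    unit impulse at sample j0, together with its power response |H_k|^2 from the DFT of h."""
    N = C_n.shape[0]
    j0 = N // 2 if j0 is None else j0
    L = cholesky(C_n, lower=True)
    i = np.zeros(N)
    i[j0] = 1.0
    h = solve_triangular(L, i, lower=True)    # apply the whitening operator to the impulse
    H2 = np.abs(np.fft.fft(h)) ** 2           # frequency-domain power response
    return h, H2
```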


More information

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process

ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Department of Electrical Engineering University of Arkansas ELEG 3143 Probability & Stochastic Process Ch. 6 Stochastic Process Dr. Jingxian Wu wuj@uark.edu OUTLINE 2 Definition of stochastic process (random

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Sensing Networking Leonidas Guibas Stanford University Computation CS428 Sensor systems are about sensing, after all... System State Continuous and Discrete Variables The quantities

More information

Cosmology & CMB. Set5: Data Analysis. Davide Maino

Cosmology & CMB. Set5: Data Analysis. Davide Maino Cosmology & CMB Set5: Data Analysis Davide Maino Gaussian Statistics Statistical isotropy states only two-point correlation function is needed and it is related to power spectrum Θ(ˆn) = lm Θ lm Y lm (ˆn)

More information

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Lecture 23:! Nonlinear least squares!! Notes Modeling2015.pdf on course

More information

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment:

Stochastic Processes: I. consider bowl of worms model for oscilloscope experiment: Stochastic Processes: I consider bowl of worms model for oscilloscope experiment: SAPAscope 2.0 / 0 1 RESET SAPA2e 22, 23 II 1 stochastic process is: Stochastic Processes: II informally: bowl + drawing

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Organization. I MCMC discussion. I project talks. I Lecture.

Organization. I MCMC discussion. I project talks. I Lecture. Organization I MCMC discussion I project talks. I Lecture. Content I Uncertainty Propagation Overview I Forward-Backward with an Ensemble I Model Reduction (Intro) Uncertainty Propagation in Causal Systems

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Lecture 4 See web page later tomorrow Searching for Monochromatic Signals

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

covariance function, 174 probability structure of; Yule-Walker equations, 174 Moving average process, fluctuations, 5-6, 175 probability structure of

covariance function, 174 probability structure of; Yule-Walker equations, 174 Moving average process, fluctuations, 5-6, 175 probability structure of Index* The Statistical Analysis of Time Series by T. W. Anderson Copyright 1971 John Wiley & Sons, Inc. Aliasing, 387-388 Autoregressive {continued) Amplitude, 4, 94 case of first-order, 174 Associated

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: May 24, 2011. Lecture 6p: Spectral Density, Passing Random Processes through LTI Systems, Filtering Terms

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE

3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3. ESTIMATION OF SIGNALS USING A LEAST SQUARES TECHNIQUE 3.0 INTRODUCTION The purpose of this chapter is to introduce estimators shortly. More elaborated courses on System Identification, which are given

More information

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo

Winter 2019 Math 106 Topics in Applied Mathematics. Lecture 9: Markov Chain Monte Carlo Winter 2019 Math 106 Topics in Applied Mathematics Data-driven Uncertainty Quantification Yoonsang Lee (yoonsang.lee@dartmouth.edu) Lecture 9: Markov Chain Monte Carlo 9.1 Markov Chain A Markov Chain Monte

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

2A1H Time-Frequency Analysis II

2A1H Time-Frequency Analysis II 2AH Time-Frequency Analysis II Bugs/queries to david.murray@eng.ox.ac.uk HT 209 For any corrections see the course page DW Murray at www.robots.ox.ac.uk/ dwm/courses/2tf. (a) A signal g(t) with period

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).

More information

Lecture 27 Frequency Response 2

Lecture 27 Frequency Response 2 Lecture 27 Frequency Response 2 Fundamentals of Digital Signal Processing Spring, 2012 Wei-Ta Chu 2012/6/12 1 Application of Ideal Filters Suppose we can generate a square wave with a fundamental period

More information

Sequential Monte Carlo Methods for Bayesian Computation

Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

An introduction to Sequential Monte Carlo

An introduction to Sequential Monte Carlo An introduction to Sequential Monte Carlo Thang Bui Jes Frellsen Department of Engineering University of Cambridge Research and Communication Club 6 February 2014 1 Sequential Monte Carlo (SMC) methods

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

MCMC notes by Mark Holder

MCMC notes by Mark Holder MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I

Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I Bayesian rules of probability as principles of logic [Cox] Notation: pr(x I) is the probability (or pdf) of x being true given information I 1 Sum rule: If set {x i } is exhaustive and exclusive, pr(x

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Bayesian room-acoustic modal analysis

Bayesian room-acoustic modal analysis Bayesian room-acoustic modal analysis Wesley Henderson a) Jonathan Botts b) Ning Xiang c) Graduate Program in Architectural Acoustics, School of Architecture, Rensselaer Polytechnic Institute, Troy, New

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011

A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 A523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 Lecture 1 Organization:» Syllabus (text, requirements, topics)» Course approach (goals, themes) Book: Gregory, Bayesian

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J.

Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox Richard A. Norton, J. Deblurring Jupiter (sampling in GLIP faster than regularized inversion) Colin Fox fox@physics.otago.ac.nz Richard A. Norton, J. Andrés Christen Topics... Backstory (?) Sampling in linear-gaussian hierarchical

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Results: MCMC Dancers, q=10, n=500

Results: MCMC Dancers, q=10, n=500 Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time

More information