Machine Learning, Lecture 7: MCMC and Sampling Methods (Chapter 11). T. Schön, Automatic Control (Reglerteknik), Linköpings universitet.


Lecture 7: MCMC and Sampling Methods
Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 3-373, www.control.isy.liu.se/~schon/

About the Exam (I/II)
If you have followed the course and completed the exercises, you will not be surprised when you see the exam. You will learn new things during the exam.
Practicalities:
- Time frame: days (h), somewhere during week 3.
- Within hours after you have collected the exam, you put your solutions in an envelope (seal it) and hand it in.

About the Exam (II/II)
As usual, the graduate exam honor code applies. This means:
- The course books, other books and MATLAB are all allowed aids.
- Internet services such as email, web browsers and other communication with the surrounding world concerning the exam are NOT allowed.
- You are NOT allowed to actively search for the solutions in books, papers, the Internet or anywhere else.
- You are NOT allowed to talk to others (save for the responsible teachers) about the exam at all.
- If anything is unclear concerning what is allowed and not, just ask any of the responsible teachers.
- You are not allowed to look at exams from earlier versions of the course (obviously hard in this course anyway...).

Outline of Lecture 7
- Summary of lecture
- Motivation for Monte Carlo methods
- Basic sampling methods: transformation methods, rejection sampling, importance sampling
- Markov chain Monte Carlo (MCMC): general properties, Metropolis-Hastings algorithm, Gibbs sampling
(Chapter 11)

Summary of Lecture (I/III)
In boosting we train a sequence of M models $y_m(x)$, where the error function used to train a certain model depends on the performance of the previous models. The models are then combined to produce the resulting classifier (for the two-class problem) according to
$Y_M(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m y_m(x) \right)$
We saw that the AdaBoost algorithm can be interpreted as a sequential minimization of an exponential cost function.
Graphical models: a graphical description of a probabilistic model where variables are represented by nodes and the relationships between variables correspond to edges.

Summary of Lecture (II/III)
We started introducing some basic concepts for probabilistic graphical models $G = (V, L)$, consisting of
1. a set of nodes $V$ (a.k.a. vertices) representing the random variables, and
2. a set of links $L$ (a.k.a. edges or arcs) containing elements $(i, j) \in L$ connecting a pair of nodes $i, j \in V$ and thereby encoding the probabilistic relations between nodes.
(Figure: directed graph of a state-space model, with latent states $x_t$ and observations $y_t$.)

Summary of Lecture (III/III)
The set of parents of node j is defined as $P(j) \triangleq \{ i \in V \mid (i, j) \in L \}$. The directed graph describes how the joint distribution factors into a product of factors $p(x_i \mid x_{P(i)})$, each depending only on a subset of the variables,
$p(x_V) = \prod_{i \in V} p(x_i \mid x_{P(i)})$
Hence, for the state-space model on the previous slide, we have
$p(X, Y) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}) \prod_{t=1}^{T} p(y_t \mid x_t)$

Motivation for Monte Carlo (I/III)
Probabilistic inference obviously depends on probability density functions. We have two important problems with probabilistic inference:
1. Computing integrals, e.g. $\int f(x)\, dx$.
2. Optimization, e.g. $\hat{x} = \arg\max_x f(x)$.
Examples:
- Bayesian inference: $p(x \mid y) = \dfrac{p(y \mid x)\, p(x)}{\int p(y \mid x)\, p(x)\, dx}$
- Marginalization: $p(x_1) = \int p(x_1, x_2)\, dx_2$
- Maximum likelihood: $\hat{x}_{\text{ML}} = \arg\max_x p(y \mid x)$
- Maximum a posteriori: $\hat{x}_{\text{MAP}} = \arg\max_x p(x \mid y)$

Motivation for Monte Carlo (II/III)
The models of reality are becoming too complicated for us to be able to perform these operations exactly. The standard way of solving these problems is to make analytical approximations, either in the model or in the solution. Examples:
- Use conjugate priors to make analytical calculation of the integrals and optimization possible.
- Variational approximations in the solutions.
Analytical approximations change either the problem or the solution that we are trying to obtain, and the effects are not always predictable.

Motivation for Monte Carlo (III/III)
Another framework is to use numerical methods called Monte Carlo methods to solve these problems. With Monte Carlo, no sacrifice in the model or in the solution is made; the accuracy is limited only by our computational resources. Both integration and optimization can be done with these methods. If one can represent a complicated density $p(x)$ with samples as
$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)})$,
any integral can be calculated as
$\int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$

Basic Sampling Algorithms
Most processing environments have built-in random number generators for the uniform distribution. Assume that we would like to generate samples from a general univariate density $p(x)$. The following scheme provides the way.
Generating samples from the density $p(x)$:
1. Generate $u \sim \mathcal{U}(0, 1)$.
2. Calculate $x^{(i)} = \mathrm{cdf}^{-1}(u)$, where $\mathrm{cdf}(x) = \int_{-\infty}^{x} p(x')\, dx'$ is the cumulative distribution function of $p(x)$.

Transformation Method (I/II)
Suppose we have a random variable $z \sim p_z(z)$ which we can obtain samples from. If we transform the samples of z with an invertible function $f(\cdot)$ as $y = f(z)$, the density of the samples we obtain would be
$p_y(y) = p_z(z) \left| \frac{dz}{dy} \right| = p_z(f^{-1}(y)) \left| \frac{d f^{-1}(y)}{dy} \right|$
The second term on the r.h.s. is absolutely necessary; without it the r.h.s. might not even be a proper density.
Example: Let $z \sim \mathcal{U}(0, 1)$ and $f(z) = z^2$. Then $f^{-1}(y) = \sqrt{y}$, and
$p(y) = p_z(\sqrt{y}) \left| \frac{d\sqrt{y}}{dy} \right| = \frac{1}{2\sqrt{y}}$ for $0 < y \le 1$, and 0 otherwise.
(Figure: the densities $p_z(z)$ and $p_y(y)$.)
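As a concrete instance of the inverse-CDF scheme, here is a minimal Python sketch for an exponential target, whose CDF can be inverted in closed form (the choice of target is an illustrative assumption, not from the slides):

```python
import numpy as np

def sample_exponential(lam, n, rng=np.random.default_rng(0)):
    """Inverse-CDF sampling: x = cdf^{-1}(u) with u ~ U(0, 1).

    For the exponential density p(x) = lam * exp(-lam * x),
    cdf(x) = 1 - exp(-lam * x), so cdf^{-1}(u) = -ln(1 - u) / lam.
    """
    u = rng.uniform(size=n)          # u ~ U(0, 1)
    return -np.log(1.0 - u) / lam    # x = cdf^{-1}(u)

samples = sample_exponential(lam=2.0, n=100_000)
print(samples.mean())  # close to the true mean 1/lam = 0.5
```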

Transformation Method (II/II)
For Gaussian random variable generation, the following transformation method can be used.
Box-Muller method:
1. Generate $z_1, z_2 \sim \mathcal{U}(-1, 1)$.
2. If $r^2 = z_1^2 + z_2^2 > 1$, discard the samples and go to 1.
3. If $r^2 \le 1$, calculate
$y_1 = z_1 \sqrt{\frac{-2 \ln r^2}{r^2}}, \quad y_2 = z_2 \sqrt{\frac{-2 \ln r^2}{r^2}}$
for which $y_1, y_2 \sim \mathcal{N}(0, 1)$.
(Figure: uniform samples $(z_1, z_2)$ in the unit disc and the resulting Gaussian samples $(y_1, y_2)$.)

Random Number Generation - Perfect Sampling
Assumption: we have access to samples from the density $p(x)$ we seek to estimate,
$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)})$
We are seeking an estimate according to
$\hat{I}_N(g(x)) = \int g(x)\, \hat{p}(x)\, dx = \frac{1}{N} \sum_{i=1}^{N} g(x^{(i)})$
- The strong law of large numbers: $\hat{I}_N(g(x)) \xrightarrow{\text{a.s.}} I(g(x))$ as $N \to \infty$.
- Central limit theorem: $\sqrt{N}\left(\hat{I}_N(g(x)) - I(g(x))\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ as $N \to \infty$.

Perfect Sampling - Example and Problem
Consider sampling from the following Gaussian mixture:
$p(x) = 0.3\, \mathcal{N}(x; \mu_1, \sigma_1^2) + 0.7\, \mathcal{N}(x; 9, 9)$
(Figure: histograms of the samples for increasing numbers of samples N.)

Random Number Generation
- Target density $p(x)$: we seek samples distributed according to this density.
- Proposal density $q(x)$: this density is simple to generate samples from.
- Acceptance probability $w(x)$: used to decide whether the sample is OK, with $p(x) \propto w(x)\, q(x)$.
Obvious problem: in general we are NOT able to sample from the density we are interested in! Three common algorithms are based on this idea:
1. Rejection sampling
2. Importance sampling
3. Metropolis-Hastings
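A direct Python sketch of the Box-Muller steps above (the function name and sample counts are illustrative):

```python
import numpy as np

def box_muller_polar(n, rng=np.random.default_rng(0)):
    """Polar Box-Muller: turn uniform samples into N(0, 1) samples."""
    out = []
    while len(out) < n:
        z1, z2 = rng.uniform(-1.0, 1.0, size=2)  # step 1: z1, z2 ~ U(-1, 1)
        r2 = z1**2 + z2**2
        if r2 > 1.0 or r2 == 0.0:                # step 2: reject points outside the unit disc
            continue
        scale = np.sqrt(-2.0 * np.log(r2) / r2)  # step 3
        out.extend([z1 * scale, z2 * scale])     # y1, y2 ~ N(0, 1), independent
    return np.array(out[:n])

y = box_muller_polar(100_000)
print(y.mean(), y.std())  # approximately 0 and 1
```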

Rejection Sampling (I/IV)
Suppose $p(x)$ is too complicated to sample from directly. Let us introduce a latent variable u and consider the joint distribution $p(x, u) = p(x)\, p(u \mid x)$, where we define $p(u \mid x) = \mathcal{U}(u; 0, p(x))$. Then
$p(x, u) = \begin{cases} 1, & \text{if } 0 \le u \le p(x) \\ 0, & \text{otherwise} \end{cases}$
Hence, we must sample uniformly over the area under $p(x)$. For this purpose, we use a proposal density $q(x)$ that is easy to sample from, such that $p(x) \le K q(x)$ for all x.

Rejection Sampling (II/IV)
Rejection sampling:
1. Sample $x^{(i)} \sim q(x)$.
2. Sample $u \sim \mathcal{U}(0, K q(x^{(i)}))$.
3. If $u \le p(x^{(i)})$, accept $x^{(i)}$ as a valid sample from $p(x)$. Go to 1.
4. Otherwise, discard $x^{(i)}$ and go to 1.

Rejection Sampling (III/IV)
- This procedure does not depend on the fact that $p(x)$ is normalized, i.e., all $p(x)$ terms can be replaced by an unnormalized version $\tilde{p}(x)$ such that $Z = \int \tilde{p}(x)\, dx$.
- The procedure can be used with multivariate densities in the same way.
- The rejection rate is governed by K: it determines the percentage of samples we waste. Therefore, one must select K as small as possible while still satisfying $p(x) \le K q(x)$ for all x.
- There are adaptive versions where one tries to obtain better proposals during the sampling process.
- Even the optimal K generally grows exponentially as the dimension increases.

Rejection Sampling (IV/IV)
Example: kernel-based density estimates from samples obtained with rejection sampling.
(Figure: density estimates improving as the number of samples N increases.)
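A minimal Python sketch of the accept/reject loop above (the bimodal target, the proposal, and the constant K are illustrative assumptions; K was checked by inspection to satisfy $p(x) \le K q(x)$):

```python
import numpy as np
from scipy.stats import norm

# Illustrative target: a two-component Gaussian mixture.
p = lambda x: 0.3 * norm.pdf(x, 0.0, 1.0) + 0.7 * norm.pdf(x, 9.0, 3.0)
q = norm(4.0, 6.0)   # wide Gaussian proposal
K = 3.0              # chosen so that p(x) <= K * q.pdf(x) for all x

def rejection_sample(n, rng=np.random.default_rng(0)):
    out = []
    while len(out) < n:
        x = q.rvs(random_state=rng)            # 1. x ~ q(x)
        u = rng.uniform(0.0, K * q.pdf(x))     # 2. u ~ U(0, K q(x))
        if u <= p(x):                          # 3. accept if u <= p(x)
            out.append(x)                      # 4. otherwise discard
    return np.array(out)

samples = rejection_sample(10_000)
print(samples.mean())  # close to 0.3*0 + 0.7*9 = 6.3
```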

Importance Sampling (I/V)
So far, we have presented sampling approaches where each sample has equal contribution (importance) in the approximating particle sum
$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^{(i)})$,
which gave
$\int f(x)\, p(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})$
In importance sampling, one samples from a proposal density, $x^{(i)} \sim q(x)$, and uses a weighted approximation for $p(x)$.

Importance Sampling (II/V)
We have the integral
$\int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} w(x^{(i)})\, f(x^{(i)})$,
where $w(x^{(i)}) = \frac{p(x^{(i)})}{q(x^{(i)})}$. This approximation procedure is equivalent to approximating $p(x)$ as
$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} w(x^{(i)})\, \delta(x - x^{(i)})$
When the density $p(x)$ is not normalized, one uses the approximation
$\hat{p}(x) = \sum_{i=1}^{N} \tilde{w}^{(i)}\, \delta(x - x^{(i)})$, where $\tilde{w}^{(i)} = \frac{w^{(i)}}{\sum_{j=1}^{N} w^{(j)}}$

Importance Sampling (III/V)
Algorithm 1 (Importance sampling):
1. Generate N i.i.d. samples $\{x^i\}_{i=1}^{N}$ from the proposal density $q(x)$ and compute the importance weights $w^i = p(x^i)/q(x^i)$, $i = 1, \ldots, N$.
2. Form the acceptance probabilities by normalization, $\tilde{w}^i = w^i / \sum_{j=1}^{N} w^j$, $i = 1, \ldots, N$.
This results in the following approximation of the target density:
$\hat{p}(x) = \sum_{i=1}^{N} \tilde{w}^i\, \delta(x - x^i)$

Importance Sampling (IV/V)
In rejection and other types of sampling, only the particles' positions (their denseness) carry information. In importance sampling, the weights also carry important information. Note that a large weight does not necessarily mean that the density value there is high; the particles' density is still important.
(Figure: the same target approximated by rejection sampling, where information sits only in the particle positions, and by importance sampling, where it sits in both positions and weights.)
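A self-normalized importance sampling sketch in Python, following Algorithm 1 (the unnormalized target, the proposal, and the test function are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 50_000

p_tilde = lambda x: np.exp(-0.5 * (x - 2.0) ** 2)  # unnormalized target (a shifted Gaussian)
q = norm(0.0, 3.0)                                 # wide Gaussian proposal

x = q.rvs(size=N, random_state=rng)   # 1. x^i ~ q(x)
w = p_tilde(x) / q.pdf(x)             #    importance weights w^i = p(x^i)/q(x^i)
w_tilde = w / w.sum()                 # 2. normalized weights

# Estimate E_p[f(x)] for f(x) = x: should be close to the true mean 2.0.
print(np.sum(w_tilde * x))
```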

Importance Sampling (V/V)
Proposal selection is very important. Proposals narrower than the density can cause poor representation of the density in some parts of the space. It is, in general, a good idea to choose wide proposals, keeping in mind that a too wide proposal results in too many samples with tiny weights, which is a waste of computation.
(Figure: weighted approximations $\hat{p}(x)$ for increasing numbers of samples N.)

Sampling Importance Resampling
$p(x) \approx \sum_{i=1}^{N} \tilde{w}^i\, \delta(x - x^i)$
We can now make use of resampling in order to generate an unweighted set of samples. This is done by drawing new samples $\tilde{x}^i$ with replacement according to
$P(\tilde{x}^i = x^j) = \tilde{w}^j, \quad j = 1, \ldots, N$,
resulting in the following unweighted approximation:
$\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - \tilde{x}^i)$

Sampling Importance Resampling - Algorithm
Algorithm 2 (Sampling Importance Resampling, SIR):
1. Generate N i.i.d. samples $\{x^i\}_{i=1}^{N}$ from the proposal density $q(x)$ and compute the importance weights $w^i = p(x^i)/q(x^i)$, $i = 1, \ldots, N$.
2. Form the acceptance probabilities by normalization, $\tilde{w}^i = w^i / \sum_{j=1}^{N} w^j$, $i = 1, \ldots, N$.
3. For each $i = 1, \ldots, N$, draw a new particle $\tilde{x}^i$ with replacement (resample) according to $P(\tilde{x}^i = x^j) = \tilde{w}^j$, $j = 1, \ldots, N$.

The Importance of a Good Importance Density
(Figure: two Gaussian proposals, $q_1(x)$ and $q_2(x)$, with different parameters; the same number of samples was used in both experiments.)
Lesson learned: it is very important to be careful in selecting the importance density.
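Step 3 of Algorithm 2 (the resampling) is a single categorical draw with replacement; a minimal Python sketch with made-up particles and weights:

```python
import numpy as np

def resample(x, w_tilde, rng=np.random.default_rng(1)):
    """Draw len(x) particles with replacement: P(x_new = x[j]) = w_tilde[j]."""
    idx = rng.choice(len(x), size=len(x), p=w_tilde)
    return x[idx]  # unweighted particle set representing the same density

# Tiny usage example with illustrative particles and normalized weights:
x = np.array([0.0, 1.0, 2.0, 3.0])
w_tilde = np.array([0.1, 0.2, 0.3, 0.4])
print(resample(x, w_tilde))
```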

State Estimation
For a nonlinear state-space model
$x_{t+1} \sim p(x_{t+1} \mid x_t)$
$y_t \sim p(y_t \mid x_t)$
we can show that the filtering and the one-step-ahead prediction densities are
$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})}$
$p(x_{t+1} \mid y_{1:t}) = \int p(x_{t+1} \mid x_t)\, p(x_t \mid y_{1:t})\, dx_t$

The Particle Filter
Idea: make use of SIR in order to find a first particle filter, i.e., an algorithm that provides approximations of $p(x_t \mid y_{1:t})$. Recall that we have (from the previous slide)
$\underbrace{p(x_t \mid y_{1:t})}_{p(x_t)} = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \propto \underbrace{p(y_t \mid x_t)}_{a(x_t)}\ \underbrace{p(x_t \mid y_{1:t-1})}_{q(x_t)}$
This implies that SIR can be used to produce estimates of $p(x_t \mid y_{1:t})$.

A First Particle Filter
Algorithm 3 (A first particle filter):
1. Initialize the particles, $\{x_0^i\}_{i=1}^{N} \sim p(x_0)$, and let t := 1.
2. Predict the particles by drawing i.i.d. samples, $x_t^i \sim p(x_t \mid x_{t-1}^i)$, $i = 1, \ldots, N$.
3. Compute the importance weights $\{w_t^i\}_{i=1}^{N}$, $w_t^i = p(y_t \mid x_t^i)$, $i = 1, \ldots, N$, and normalize, $\tilde{w}_t^i = w_t^i / \sum_{j=1}^{N} w_t^j$.
4. For each $i = 1, \ldots, N$, draw a new particle $\tilde{x}_t^i$ with replacement, $P(\tilde{x}_t^i = x_t^j) = \tilde{w}_t^j$, $j = 1, \ldots, N$.
5. Set t := t + 1 and repeat from step 2.

IS Example - Linear System Identification (I/III)
Consider the following linear scalar state-space model:
$x_{k+1} = \theta x_k + v_k$
$y_k = x_k + e_k$
$\begin{pmatrix} v_k \\ e_k \end{pmatrix} \sim \mathcal{N}\left( 0, \begin{pmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_e^2 \end{pmatrix} \right)$
The initial state: $x_0 \sim \mathcal{N}(x_0; 0, \Sigma_0)$, and $\theta$ has prior distribution $\theta \sim \mathcal{N}(\theta; 0, \sigma_\theta^2)$.
The identification problem is now to determine the posterior $p(\theta \mid y_{1:T})$ using importance sampling. As usual, note the difference in notation compared to Bishop! The observations are denoted Y and the latent variables are given by X.
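To make Algorithm 3 concrete, here is a minimal bootstrap particle filter sketch in Python, run on data simulated from the scalar linear model above (all parameter values, including the known θ = 0.8, are illustrative assumptions; for a linear Gaussian model the Kalman filter would of course be exact):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 1000, 50
theta, sig_v, sig_e = 0.8, 1.0, 0.5   # assumed parameter values

# Simulate data from x_{k+1} = theta*x_k + v_k,  y_k = x_k + e_k.
x_true = np.zeros(T)
x_true[0] = rng.standard_normal()
for t in range(1, T):
    x_true[t] = theta * x_true[t - 1] + sig_v * rng.standard_normal()
y = x_true + sig_e * rng.standard_normal(T)

# Algorithm 3: a first (bootstrap) particle filter.
particles = rng.standard_normal(N)          # 1. initialize from p(x_0) = N(0, 1)
x_hat = np.zeros(T)
for t in range(T):
    if t > 0:
        particles = theta * particles + sig_v * rng.standard_normal(N)  # 2. predict
    w = np.exp(-0.5 * ((y[t] - particles) / sig_e) ** 2)                # 3. weights p(y_t | x_t^i)
    w /= w.sum()
    particles = particles[rng.choice(N, size=N, p=w)]                   # 4. resample
    x_hat[t] = particles.mean()             # point estimate from the filtered particles

print(np.mean((x_hat - x_true) ** 2))       # small mean-squared filtering error
```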

IS Example - Linear System Identification (II/III)
$x_{k+1} = \theta x_k + v_k$, $y_k = x_k + e_k$, with noise variances $\sigma_v^2$ and $\sigma_e^2$ as before.
We have solved this problem with both EM and VB using the latent variables $X = \{x_1, \ldots, x_T\}$. The main equation for the importance sampling in this example is
$p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta)$
The problem here is that we cannot normalize this density analytically, since $p(y_{1:T} \mid \theta)$ is too complicated. We can still evaluate $p(y_{1:T} \mid \theta)$ for different values of $\theta$.

IS Example - Linear System Identification (III/III)
We would like to sample from $p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta)$.
- Choose the proposal as the prior, $q(\theta) = p(\theta)$.
- Sample $\theta^{(i)} \sim q(\theta)$.
- Set the weights as
$w^{(i)} = \frac{p(y_{1:T} \mid \theta^{(i)})\, p(\theta^{(i)})}{q(\theta^{(i)})} = p(y_{1:T} \mid \theta^{(i)})$
- Normalize the weights, $\tilde{w}^{(i)} = w^{(i)} / \sum_{j=1}^{N} w^{(j)}$, which yields an approximation of $p(\theta \mid y_{1:T})$.
(Figure: the IS approximation of the posterior compared with the true posterior, the true θ, and the EM and VB solutions.)
The likelihood $p(y_{1:T} \mid \theta^{(i)})$ required for the weights above is given by a Kalman filter as
$p(y_{1:T} \mid \theta^{(i)}) = p(y_1) \prod_{k=2}^{T} p(y_k \mid y_{1:k-1}, \theta^{(i)}) = \prod_{k=1}^{T} \mathcal{N}\left(y_k;\ \hat{y}_{k \mid k-1}(\theta^{(i)}),\ S_{k \mid k-1}(\theta^{(i)})\right)$

Markov Chain Monte Carlo
Importance sampling is also bound to fail in high dimensions. This is due to the fact that, in high dimensions, the support of the density to be sampled from is only a tiny region of the overall space, and in this case it is very difficult to find a proposal without knowing the actual density. Markov chain Monte Carlo (MCMC) is proposed to overcome this problem.
Whereas in standard sampling methods the samples are independent of each other, MCMC uses dependent samples. Due to the Markov property, each sample depends on the previous sample, i.e., each $x^{(i)}$ depends on $x^{(i-1)}$. In general, $x^{(i)}$ is generated as $x^{(i)} \sim q(x \mid x^{(i-1)})$. Obviously, we want the overall behavior of the generated samples to be similar to samples from the density $p(x)$ we want to sample from. MCMC methods provide a way to do this with an arbitrary proposal $q(x \mid x')$.

Metropolis-Hastings Algorithm
- Generate an initial sample $x^{(0)} \sim q_0(x)$.
- For $i = 1, \ldots, N$:
  - Sample $x^* \sim q(x \mid x^{(i-1)})$.
  - Sample $u \sim \mathcal{U}(0, 1)$.
  - Set the new sample $x^{(i)}$ as
$x^{(i)} = \begin{cases} x^*, & \text{if } u < \min\left(1,\ \dfrac{p(x^*)\, q(x^{(i-1)} \mid x^*)}{p(x^{(i-1)})\, q(x^* \mid x^{(i-1)})}\right) \\ x^{(i-1)}, & \text{otherwise} \end{cases}$
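A compact sketch of the Metropolis-Hastings loop in Python, using a Gaussian random-walk proposal (which is symmetric, so the q-ratio in the acceptance probability cancels); the bimodal target and the step size are illustrative assumptions:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_iters, step=1.0, rng=np.random.default_rng(0)):
    """Random-walk M-H. log_p is the (unnormalized) log target density."""
    x = x0
    chain = np.empty(n_iters)
    for i in range(n_iters):
        x_prop = x + step * rng.standard_normal()      # x* ~ q(x | x^(i-1))
        # Symmetric proposal: acceptance ratio reduces to p(x*)/p(x^(i-1)).
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x = x_prop                                 # accept
        chain[i] = x                                   # otherwise keep the old sample
    return chain

# Example: unnormalized bimodal target, 0.5*N(-3, 1) + 0.5*N(3, 1) up to a constant.
log_p = lambda x: np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)
chain = metropolis_hastings(log_p, x0=0.0, n_iters=50_000, step=2.5)
print(chain[5000:].mean())  # near 0 by symmetry of the two modes
```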

M-H Example: Sampling from a Gaussian (I/II)
Suppose the target density over $\mathbb{R}^2$ is a correlated Gaussian,
$p(x) = \mathcal{N}(x; 0, \Sigma)$, with correlation coefficient $\rho$ between the two components.
Choose the proposal density $q(x \mid x^{(i-1)})$ as a random walk centered at the previous sample,
$q(x \mid x^{(i-1)}) = \mathcal{N}(x; x^{(i-1)}, \sigma^2 I)$
Noticing that $q(x \mid x^{(i-1)}) = q(x^{(i-1)} \mid x)$ for all x, we have
$\min\left(1, \frac{p(x^*)\, q(x^{(i-1)} \mid x^*)}{p(x^{(i-1)})\, q(x^* \mid x^{(i-1)})}\right) = \min\left(1, \frac{p(x^*)}{p(x^{(i-1)})}\right)$
This version of the Metropolis-Hastings algorithm, with a symmetric proposal, is called the Metropolis algorithm.

M-H Example: Sampling from a Gaussian (II/II)
(Figure: the chain after increasing numbers of samples; the scales $\sigma_{\min}$ and $\sigma_{\max}$ of the elongated target and the proposal width govern how quickly the random walk explores it.)

M-H Some General Issues (I/II)
- M-H makes the samples converge to samples from a stationary distribution, which is the target distribution $p(x)$.
- The time that passes before the samples start to represent the target density is called the burn-in period. We generally have to use only the samples obtained after the burn-in period.
- After the burn-in period is over, the Markov chain is said to have mixed.
- Diagnosing convergence to the target distribution with MCMC algorithms is still an active area of research.

M-H Some General Issues (II/II)
Proposal selection is still an important problem.
- If the proposal is selected too narrow, the step sizes get smaller and the burn-in period becomes longer.
- If the proposal is too wide, the burn-in gets shorter; however, the acceptance rate decreases significantly.
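The narrow-versus-wide trade-off is easy to see numerically; a small self-contained Python sketch measuring the acceptance rate of a random-walk Metropolis chain on a standard Gaussian target, for three illustrative step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2   # standard Gaussian target, up to a constant

for step in (0.1, 2.5, 50.0):
    x, accepted = 0.0, 0
    for _ in range(20_000):
        x_prop = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
            x, accepted = x_prop, accepted + 1
    print(f"step={step:5.1f}  acceptance rate={accepted / 20_000:.2f}")
# Tiny steps: nearly everything is accepted, but the chain moves slowly
# (long burn-in). Huge steps: most proposals are rejected, the chain sticks.
```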

From Spall, J.C., "Estimation via Markov chain Monte Carlo," IEEE Control Systems Magazine, April 2003:

"Markov chain Monte Carlo (MCMC) is a powerful means for generating random samples that can be used in computing statistical estimates and marginal and conditional probabilities. MCMC methods rely on dependent (Markov) sequences having a limiting distribution corresponding to a distribution of interest. Note that the use of the term Markov chain in MCMC is not entirely consistent with standard usage in stochastic processes. The term is generally reserved for use with processes having discrete outcomes. For consistency with standard terminology in the MCMC area, however, we follow suit in this article in using the term Markov chain under the more general application to discrete or continuous outcomes. The use of dependent sequences contrasts with many classical Monte Carlo methods, which are based on independent samples. MCMC methods have the great advantage that they apply to a broader class of multivariate problems than methods based on independent sampling. Over the last 10-15 years, the approach has had a large impact on the theory and practice of statistical modeling. On the other hand, MCMC has had relatively little impact (yet) on estimation problems in control.

Background: This article is a survey of popular implementations of MCMC, focusing particularly on the two most popular specific implementations of MCMC: Metropolis-Hastings (M-H) and Gibbs sampling. Although the results presented have a rigorous basis, the presentation here is relatively informal in the hopes of conveying the main ideas relevant to implementation. Our aim is to provide the reader with some of the central motivation and the rudiments needed for a straightforward application. The cited references provide extensive detail not presented here, including the rigorous justification for many of the results. Although MCMC has general applicability, one area where it has had a revolutionary impact is Bayesian analysis. MCMC has greatly expanded the range of problems for which Bayesian methods can be applied. Metropolis et al. [] introduced MCMC-type methods in the area of physics. Following key papers by Hastings [], Geman and Geman [], and Tanner and Wong [], the paper of Gelfand and Smith [] is largely credited with introducing the applications of MCMC methods in modern statistics, specifically in Bayesian modeling. Over the last decade, many papers and books have been published displaying the power of MCMC in dealing with realistic problems in a wide variety of areas. Among the excellent review papers in this area are the survey by Besag et al. [] and the vignettes by Cappé and Robert [7] and Gelfand []."

From Doucet, A. and Wang, X., "Monte Carlo methods for signal processing: a review in the statistical signal processing context," IEEE Signal Processing Magazine, November 2005:

"In many problems encountered in signal processing, it is possible to accurately describe the underlying statistical model using probability distributions. Statistical inference can then theoretically be performed based on the relevant likelihood function or posterior distribution in a Bayesian framework. However, most problems encountered in applied research require non-Gaussian and/or nonlinear models to correctly account for the observed data. In these cases, it is typically impossible to obtain the required statistical estimates of interest [e.g., maximum likelihood (ML) or conditional expectation] in closed form, as it requires integration and/or maximization of complex multidimensional functions. A standard approach consists of making model simplifications or crude analytic approximations to obtain algorithms that can be easily implemented. With the recent availability of high-powered computers, numerical-simulation-based ..."

Gibbs Sampling
Gibbs sampling is a special case of the Metropolis-Hastings algorithm where the proposal function is set to be the conditional distribution of the variables. It is especially useful when the dimension of the space to sample is very large, e.g., images. Suppose we are sampling in a two-dimensional space, $x = [x_1, x_2]^T$. Then the Gibbs sampler works as follows.
Gibbs sampler for 2D:
- Sample $x^{(1)} \sim q_0(x)$.
- For $i = 2, 3, \ldots$:
  - Sample $x_1^{(i)} \sim p(x_1 \mid x_2^{(i-1)})$.
  - Sample $x_2^{(i)} \sim p(x_2 \mid x_1^{(i)})$.
  - Set $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]^T$.
Note that, due to the special proposal, a Gibbs sampler does not have an accept-reject step as M-H does: every proposed sample is accepted.
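The two-variable Gibbs sweep above is easy to write down for a target whose conditionals are tractable; a minimal Python sketch for a bivariate Gaussian with correlation ρ = 0.9 (an illustrative choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9              # correlation of the illustrative N(0, [[1, rho], [rho, 1]]) target
n_iters = 20_000

x1, x2 = 0.0, 0.0
chain = np.empty((n_iters, 2))
for i in range(n_iters):
    # For this Gaussian, p(x1 | x2) = N(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x1 ~ p(x1 | x2)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal()  # x2 ~ p(x2 | x1)
    chain[i] = (x1, x2)   # every draw is accepted: no accept-reject step

burn_in = 1000
print(np.corrcoef(chain[burn_in:].T)[0, 1])  # close to rho = 0.9
```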
Optimization with MCMC
Maximum a posteriori estimation requires
$\hat{x}_{\text{MAP}} = \arg\max_x p(x \mid y) = \arg\max_x p(x, y)$
The densities, in general, have multiple modes and local optima. MCMC methods can be used to find the global optimum of such densities. For this purpose, a time-varying target density is selected in the MCMC iterations (as in simulated annealing).

References
Spall, J.C., "Estimation via Markov chain Monte Carlo," IEEE Control Systems Magazine, vol. 23, no. 2, pp. 34-45, Apr. 2003. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=77&isnumber=9
Doucet, A. and Wang, X., "Monte Carlo methods for signal processing: a review in the statistical signal processing context," IEEE Signal Processing Magazine, vol. 22, no. 6, Nov. 2005. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9&isnumber=33
Andrieu, C., De Freitas, N., Doucet, A. and Jordan, M.I., "An introduction to MCMC for machine learning," Machine Learning, vol. 50, pp. 5-43, 2003. http://www.cs.ubc.ca/~arnaud/andrieu_defreitas_doucet_jordan_intromontecarlomachinelearning.pdf
Robert, C.P. and Casella, G., Monte Carlo Statistical Methods, Springer, 2004. (Some density functions in this lecture came from this one!)
Gilks, W.R., Richardson, S. and Spiegelhalter, D.J., Markov Chain Monte Carlo in Practice, Chapman & Hall, 1996.
Bishop, C., Pattern Recognition and Machine Learning, Springer, 2006.
MacKay, D.J.C., Information Theory, Inference and Learning Algorithms, Cambridge University Press, 2003. (Available online: http://www.inference.phy.cam.ac.uk/mackay/itila/)

A Few Concepts to Summarize Lecture 7
- Monte Carlo methods: approximate inference tools that use samples from the target densities.
- Basic sampling methods: methods for obtaining independent samples from target densities. Though quite powerful, these give bad results in high dimensions.
- MCMC: Monte Carlo methods that produce dependent samples but are more robust in high dimensions.
- Metropolis-Hastings algorithm: the most well-known MCMC algorithm, using arbitrary proposal densities.
- Gibbs sampler: a special case of the M-H algorithm which samples from the conditionals iteratively and always accepts the new sample.