
Hamiltonian Monte Carlo
October 20, 2018
Debdeep Pati

References:
Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.
Betancourt, Michael. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint (2017).

1 Geometry of high dimensional probability distributions

The neighborhood immediately around the mode features large densities, but in more than a few dimensions the small volume of that neighborhood prevents it from contributing much to any expectation. On the other hand, the complementary neighborhood far away from the mode features a much larger volume, but the vanishing densities lead to similarly negligible contributions to expectations. The only significant contributions come from the neighborhood between these two extremes, known as the typical set (Figure 1). Importantly, because probability densities and volumes transform oppositely under any reparameterization, the typical set is an invariant object that does not depend on the irrelevant details of any particular choice of parameters.

Figure 1: A typical set

As the dimension of parameter space increases, the tension between the density and the volume grows, and the region where the density and volume are both large enough to yield a significant contribution becomes more and more narrow. Consequently the typical set becomes more singular with increasing dimension, a manifestation of concentration of measure.

The immediate consequence of concentration of measure is that the only significant contributions to any expectation come from the typical set; evaluating the integrand outside of the typical set has a negligible effect on expectations and hence is a waste of precious computational resources. In other words, we can accurately estimate expectations by averaging over the typical set instead of the entirety of parameter space. Consequently, in order to compute expectations efficiently, we have to be able to identify the typical set and then focus our computational resources on it.

2 Returning to MCMC again

Given a Markov transition that targets the desired distribution, Markov chain Monte Carlo defines a generic strategy for quantifying the typical set. Constructing such a transition, however, is itself a nontrivial problem. Fortunately there are various procedures for automatically constructing appropriate transitions for any given target distribution, foremost amongst them the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).

The Metropolis-Hastings algorithm consists of two steps: a proposal and a correction. The proposal is any stochastic perturbation of the initial state, while the correction rejects any proposals that stray too far away from the typical set of the target distribution. More formally, let $q(x; x')$ be the probability density of proposing $x'$ from the current state $x$. The probability of accepting a given proposal is then

$$\alpha(x; x') = \min\left\{1, \frac{q(x'; x)\,\pi(x')}{q(x; x')\,\pi(x)}\right\}.$$

The original Markov chain Monte Carlo algorithm, and one still commonly in use today, uses a Gaussian distribution as its proposal mechanism,

$$q(x; x') = N(x'; x, \Sigma),$$

an algorithm which we will refer to as Random Walk Metropolis. Because the proposal mechanism is symmetric under the exchange of the initial and proposed points, the proposal density cancels and

$$\alpha(x; x') = \min\left\{1, \frac{\pi(x')}{\pi(x)}\right\}.$$

Random Walk Metropolis is not only simple to implement, it also has a particularly nice intuition.
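The Random Walk Metropolis update described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `random_walk_metropolis`, the isotropic choice $\Sigma = \text{step}^2 I$, and the standard Gaussian test target are all assumptions made for the example.

```python
import numpy as np

def random_walk_metropolis(log_pi, x0, n_iter, step, rng):
    """Random Walk Metropolis with proposal N(x'; x, step^2 I) (assumed isotropic)."""
    x = np.asarray(x0, dtype=float)
    log_p = log_pi(x)
    chain = np.empty((n_iter, x.size))
    n_accept = 0
    for i in range(n_iter):
        # Proposal: symmetric Gaussian perturbation of the current state.
        x_prop = x + step * rng.normal(size=x.size)
        log_p_prop = log_pi(x_prop)
        # Correction: accept with probability min{1, pi(x') / pi(x)}.
        accept_prob = np.exp(min(0.0, log_p_prop - log_p))
        if rng.uniform() < accept_prob:
            x, log_p = x_prop, log_p_prop
            n_accept += 1
        chain[i] = x  # a rejected proposal repeats the current state
    return chain, n_accept / n_iter

def log_std_normal(x):
    """Log density of a standard Gaussian, up to an additive constant."""
    return -0.5 * np.sum(x**2)

rng = np.random.default_rng(0)
chain, acc_rate = random_walk_metropolis(log_std_normal, np.zeros(2), 20000, 1.0, rng)
```

Only ratios of $\pi$ enter the update, so the target needs to be known only up to a normalizing constant; working with log densities avoids numerical underflow.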
The proposal distribution is biased towards large volumes, and hence the tails of the target distribution, while the Metropolis correction rejects those proposals that jump into neighborhoods where the density is too small. The combined procedure then preferentially selects those proposals that fall into neighborhoods of high probability mass, concentrating towards the typical set as desired.

Because of its conceptual simplicity and the ease with which it can be implemented by practitioners, Random Walk Metropolis is still popular in many applications. Unfortunately, that seductive simplicity hides a performance that scales poorly with increasing dimension and complexity of the target distribution. As the dimension of the target distribution increases, the volume exterior to the typical set overwhelms the volume interior to the typical set, and almost every Random Walk Metropolis proposal will produce a point outside of the typical set, towards the tails. The density at these points, however, is so small that the acceptance probability becomes negligible. In this case almost all of the proposals will be rejected and the resulting Markov chain will only rarely move. We can induce a larger acceptance probability by shrinking the size of the proposal to stay within the typical set, but those small jumps will move the Markov chain extremely slowly.

It thus makes sense to seek ways of accelerating (a) the convergence of a given MCMC algorithm to its stationary distribution, (b) the convergence of a given MCMC estimate to its expectation, and/or (c) the exploration by a given MCMC algorithm of the support of the target distribution. Those goals are related but still distinct. For instance, a chain initialised by simulating from the target distribution may still fail to explore the whole support in an acceptable number of iterations. While there is no optimal and universal solution to this issue, we will discuss below approaches that are as generic as possible, as opposed to artificial ones taking advantage of the mathematical structure of a specific target distribution. Ideally, we aim at covering realistic situations where the target density is only known [up to a constant or an additional completion step] as the output of an existing computer code. Pragmatically, we also cover here solutions that require more effort and calibration steps when they apply to a wide enough class of problems.

3 Hamiltonian Monte Carlo

The guess-and-check strategy of Random Walk Metropolis is doomed to fail in high-dimensional spaces, where there are an exponential number of directions in which to guess but only a singular number of directions that stay within the typical set and pass the check. In order to make large jumps away from the initial point, and into new, unexplored regions of the typical set, we need to exploit information about the geometry of the typical set itself. Specifically, we need transitions that can follow the contours of high probability mass, coherently gliding through the typical set. How can we distill the geometry of the typical set into information about how to move through it?
When the sample space is continuous, a natural way of encoding this direction information is with a vector field aligned with the typical set. A vector field is the assignment of a direction at every point in parameter space, and if those directions are aligned with the typical set then they act as a guide through this neighborhood of largest target probability. In other words, instead of fumbling around parameter space with random, uninformed jumps, we can follow the direction assigned at each point for a small distance. By construction this will move us to a new point in the typical set, where we will find a new direction to follow. Continuing this process traces out a coherent trajectory through the typical set that efficiently moves us far away from the initial point to new, unexplored regions of the typical set as quickly as possible.

From the point of view of this review, Hamiltonian (or hybrid) Monte Carlo (HMC) is an auxiliary variable technique that takes advantage of a continuous-time Markov process to sample from the target $\pi$. This approach comes from physics (Duane et al., 1987) [Simon Duane in Imperial College London, Physics Letters B] and was popularized in statistics by Neal (1999, 2011) and MacKay (2002). Given a target $\pi(\theta)$, where $\theta \in \mathbb{R}^d$, an artificial auxiliary variable $\nu \in \mathbb{R}^d$ is introduced along with a density $\omega(\nu \mid \theta)$ so that the joint distribution of $(\theta, \nu)$ enjoys $\pi(\theta)$ as its marginal. While there is complete freedom in this representation, the HMC literature often calls $\nu$ the momentum of a particle located at $\theta$, by analogy with physics. The joint distribution is represented as

$$p(\theta, \nu) = \pi(\theta)\,\omega(\nu \mid \theta) \propto \exp\{-H(\theta, \nu)\},$$

where $H(\cdot)$ is called the Hamiltonian. Hamiltonian Monte Carlo is associated with the continuous-time process $(\theta_t, \nu_t)$ generated by the so-called Hamiltonian equations:

$$\frac{d\theta_t}{dt} = \frac{\partial H}{\partial \nu}(\theta_t, \nu_t), \qquad \frac{d\nu_t}{dt} = -\frac{\partial H}{\partial \theta}(\theta_t, \nu_t),$$

which keep the Hamiltonian stable over time, since

$$\frac{dH(\theta_t, \nu_t)}{dt} = \frac{\partial H}{\partial \nu}(\theta_t, \nu_t)\,\frac{d\nu_t}{dt} + \frac{\partial H}{\partial \theta}(\theta_t, \nu_t)\,\frac{d\theta_t}{dt} = 0.$$

Obviously, the above continuous-time Markov process is deterministic and only explores a given level set,

$$\{(\theta, \nu) : H(\theta, \nu) = H(\theta_0, \nu_0)\},$$

instead of the whole augmented state space $\mathbb{R}^{2d}$, which induces an issue with irreducibility. An acceptable solution to this problem is to refresh the momentum, $\nu_t \sim \omega(\nu \mid \theta_{t-})$, at random times $\{\tau_n\}$, where $\theta_{t-}$ denotes the location of $\theta$ immediately prior to time $t$, and the random durations $\{\tau_n - \tau_{n-1}\}$ follow an exponential distribution. By construction, the continuous-time Hamiltonian Markov chain can be regarded as a specific piecewise deterministic Markov process using Hamiltonian dynamics (Davis, 1984, 1993; Bou-Rabee et al., 2017), and our target $\pi$ is the marginal of its associated invariant distribution.

Before moving to the practical implementation of the concept, let us point out that the free cog in the machinery is the conditional density $\omega(\nu \mid \theta)$, which is usually chosen as a Gaussian density with either a constant covariance matrix $M$ corresponding to the target covariance or a local curvature depending on $\theta$, as in Riemannian Hamiltonian Monte Carlo (Girolami and Calderhead, 2011). Betancourt (2017) argues in favour of these two cases against non-Gaussian alternatives, and Livingstone et al. (2017) analyse how different choices of kinetic energy in Hamiltonian Monte Carlo affect algorithm performance. For a fixed covariance matrix, the Hamiltonian equations become

$$\frac{d\theta_t}{dt} = M^{-1}\nu_t, \qquad \frac{d\nu_t}{dt} = -\nabla U(\theta_t),$$

where $U(\theta) = -\log \pi(\theta)$, so that $-\nabla U(\theta_t) = \nabla \log \pi(\theta_t)$ is the score function. Henceforth, for ease of notation, we shall denote $\nu_t$ by $\nu(t)$ and $\theta_t$ by $\theta(t)$.
In the special case when $\pi(\theta) \propto \exp\{-\tfrac{1}{2}\|\theta\|^2\}$ and $\omega(\nu \mid \theta) \propto \exp\{-\tfrac{1}{2}\nu^\top M^{-1}\nu\}$, where $M$ is a diagonal matrix with entries $m_j$, it is possible to solve the equations as

$$\theta_j(t) = r_j \cos\left(a_j + t/\sqrt{m_j}\right), \qquad \nu_j(t) = -r_j \sqrt{m_j}\,\sin\left(a_j + t/\sqrt{m_j}\right),$$

where the constants $r_j$ and $a_j$ are determined by the initial state. For $m_j = 1$ this reduces to $\theta_j(t) = r_j\cos(a_j + t)$ and $\nu_j(t) = -r_j\sin(a_j + t)$.

4 Properties of Hamiltonian dynamics

First, Hamiltonian dynamics is reversible: the mapping $T_s$ from the state at time $t$, $(\theta(t), \nu(t))$, to the state at time $t + s$, $(\theta(t+s), \nu(t+s))$, is one-to-one and hence has an inverse $T_{-s}$. The inverse mapping is obtained by simply negating the time derivatives in the Hamiltonian equations.

Second, the dynamics conserves the Hamiltonian. For Metropolis updates using a proposal found by Hamiltonian dynamics, which form part of the HMC method, the acceptance probability is one if $H$ is kept invariant. We will see later, however, that in practice we can only make $H$ approximately invariant, and hence we won't have an acceptance probability of one.
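As a sanity check on the Gaussian special case and on the two properties just stated, the following sketch evaluates a closed-form flow and confirms numerically that it conserves $H$ and is undone by running time backwards. It is an illustration, not part of the original notes. It assumes the per-coordinate solution $\theta_j(t) = r_j\cos(a_j + t/\sqrt{m_j})$, written in the equivalent form that uses the initial state instead of the amplitude $r_j$ and phase $a_j$ (for $m_j = 1$ this is $r_j\cos(a_j + t)$). The names `hamiltonian` and `exact_flow` are hypothetical.

```python
import numpy as np

def hamiltonian(theta, nu, m):
    """H = U + K with U = |theta|^2 / 2 and K = sum_j nu_j^2 / (2 m_j)."""
    return 0.5 * np.sum(theta**2) + 0.5 * np.sum(nu**2 / m)

def exact_flow(theta0, nu0, m, t):
    """Closed-form map T_t for the Gaussian special case (assumed form)."""
    w = 1.0 / np.sqrt(m)                  # angular frequency of coordinate j
    c, s = np.cos(w * t), np.sin(w * t)
    theta = theta0 * c + nu0 * w * s      # solves d(theta)/dt = nu / m
    nu = nu0 * c - theta0 / w * s         # solves d(nu)/dt = -theta
    return theta, nu

rng = np.random.default_rng(0)
m = np.array([0.5, 1.0, 4.0])             # diagonal of M (illustrative values)
theta0, nu0 = rng.normal(size=3), rng.normal(size=3)

# Conservation: H is unchanged along the trajectory.
theta_t, nu_t = exact_flow(theta0, nu0, m, t=2.7)
energy_drift = abs(hamiltonian(theta_t, nu_t, m) - hamiltonian(theta0, nu0, m))

# Reversibility: applying T_{-s} after T_s recovers the initial state.
theta_back, nu_back = exact_flow(theta_t, nu_t, m, t=-2.7)
```

Here `energy_drift` should be zero up to floating-point rounding, and `(theta_back, nu_back)` should match `(theta0, nu0)`.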

A third fundamental property is that Hamiltonian dynamics preserves volume in $(\theta, \nu)$ space, a result known as Liouville's theorem. If we apply the mapping $T_s$ to the points in some region $R$ of $(\theta, \nu)$ space with volume $V$, the image of $R$ under $T_s$ will also have volume $V$. The significance of volume preservation for MCMC is that we needn't account for a Jacobian in the acceptance probability for Metropolis updates.

The preservation of volume can be proved in several ways. One is to note that the divergence of the vector field defined by the Hamiltonian equations is zero, which can be readily seen as

$$\sum_{j=1}^{d}\left[\frac{\partial}{\partial \theta_j}\frac{d\theta_j}{dt} + \frac{\partial}{\partial \nu_j}\frac{d\nu_j}{dt}\right] = \sum_{j=1}^{d}\left[\frac{\partial^2 H}{\partial \theta_j\,\partial \nu_j} - \frac{\partial^2 H}{\partial \nu_j\,\partial \theta_j}\right] = 0.$$

Next, we will show that Hamiltonian dynamics preserves volume without presuming this property of the divergence. Consider the dimension to be 1. We can approximate $T_\delta$ for $\delta$ near 0 as

$$T_\delta(\theta, \nu) = \begin{bmatrix}\theta \\ \nu\end{bmatrix} + \delta\begin{bmatrix} d\theta/dt \\ d\nu/dt\end{bmatrix} + O(\delta^2).$$

Then the Jacobian can be written as

$$B_\delta = \begin{bmatrix} 1 + \delta\,\dfrac{\partial^2 H}{\partial\theta\,\partial\nu} & \delta\,\dfrac{\partial^2 H}{\partial \nu^2} \\[2ex] -\delta\,\dfrac{\partial^2 H}{\partial \theta^2} & 1 - \delta\,\dfrac{\partial^2 H}{\partial\nu\,\partial\theta}\end{bmatrix} + O(\delta^2),$$

so that

$$\det(B_\delta) = 1 + \delta\,\frac{\partial^2 H}{\partial\theta\,\partial\nu} - \delta\,\frac{\partial^2 H}{\partial\nu\,\partial\theta} + O(\delta^2) = 1 + O(\delta^2).$$

Since $\log(1+x) \approx x$ for $x$ near zero, $\log\det(B_\delta)$ is zero except perhaps for terms of order $\delta^2$ (though we will see later that it is exactly zero). Now consider $\log\det(B_s)$ for some time interval $s$ that is not close to zero. Setting $\delta = s/n$ for some integer $n$, we can write $T_s$ as the composition of $T_\delta$ applied $n$ times (from $n$ points along the trajectory), so $\det(B_s)$ is the $n$-fold product of $\det(B_\delta)$ evaluated at these points. We then find that

$$\log\det(B_s) = \sum_{i=1}^{n}\log\det(B_\delta) = \sum_{i=1}^{n} O(1/n^2) = O(1/n).$$

Taking $n \to \infty$, we get $\log\det(B_s) = 0$, which is the result.

4.1 Numerically simulating the Hamiltonian dynamics

In general, it is not possible to solve Hamilton's equations analytically as we did for the simple case above. Instead, it is common to discretize the simulation of the differential equations with some step size $\epsilon$. We briefly discuss two options here: Euler's method (performs poorly) and the leapfrog method (performs better).
For simplicity, assume that the conditional distribution of $\nu$ does not depend on $\theta$, so that $H(\theta, \nu) = U(\theta) + K(\nu)$.

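Euler's method:

$$\nu_j(t+\epsilon) = \nu_j(t) + \epsilon\,\frac{d\nu_j}{dt}(t) = \nu_j(t) - \epsilon\,\frac{\partial U}{\partial \theta_j}(\theta(t)),$$

$$\theta_j(t+\epsilon) = \theta_j(t) + \epsilon\,\frac{d\theta_j}{dt}(t) = \theta_j(t) + \epsilon\,\frac{\partial K}{\partial \nu_j}(\nu(t)).$$

Unfortunately, Euler's method performs poorly. The result often diverges, meaning that the approximation error grows and the Hamiltonian is no longer preserved. Much better results can be obtained by slightly modifying Euler's method, as follows:

$$\nu_{t+\epsilon} = \nu_t - \epsilon\,\nabla U(\theta_t), \qquad \theta_{t+\epsilon} = \theta_t + \epsilon\,M^{-1}\nu_{t+\epsilon}.$$

We simply use the new value of the momentum variables, $\nu_{t+\epsilon}$, when computing the new value of the position variables, $\theta_{t+\epsilon}$.

The leapfrog method, which is what is used in practice, goes one step further: it makes only an $\epsilon/2$ step in $\nu$ first, uses that to update $\theta$, and then comes back to $\nu$ for the remaining half step. This discretization is well suited to the Hamiltonian equations in that it is reversible and volume preserving, so that the Metropolis-corrected chain preserves the stationary distribution (Betancourt, 2017). It is called a symplectic integrator, and one version, in the independent case with constant covariance, consists of the following (so-called leapfrog) steps:

$$\nu_{t+\epsilon/2} = \nu_t - \frac{\epsilon}{2}\,\nabla U(\theta_t),$$

$$\theta_{t+\epsilon} = \theta_t + \epsilon\,M^{-1}\nu_{t+\epsilon/2},$$

$$\nu_{t+\epsilon} = \nu_{t+\epsilon/2} - \frac{\epsilon}{2}\,\nabla U(\theta_{t+\epsilon}).$$

The leapfrog approach diverges far less quickly than Euler's method. Recall the similarity with approximating

$$y(t+\epsilon) - y(t) = \int_t^{t+\epsilon} y'(s)\,ds \approx \epsilon\,y'(t)$$

(the left-endpoint rule, as in Euler's method) versus

$$\int_t^{t+\epsilon} y'(s)\,ds \approx \epsilon\,y'(t+\epsilon/2)$$

(the midpoint rule, as in the leapfrog method).

We now have the necessary tools to describe how to formulate an MCMC strategy using Hamiltonian dynamics. The first two leapfrog steps can be combined to get

$$\theta_{t+\epsilon} = \theta_t - \frac{\epsilon^2}{2}\,M^{-1}\nabla U(\theta_t) + \epsilon\,M^{-1}\nu_t,$$

which is similar to Langevin MC. Suppose we want to sample from $\pi(\theta) \propto e^{-U(\theta)}$. Then the recursion

$$X_{t+1} = X_t - \xi\,\nabla U(X_t) + \sqrt{2\xi}\,Z_{t+1},$$

where $\xi$ is the step size and the $Z_t$ are iid standard Gaussian random variables, has a stationary distribution that approximates $\pi$ for small $\xi$; the combined leapfrog step above corresponds to $\xi = \epsilon^2/2$ with $M = I$. If $\pi$ is log-concave, the chain $X_t$ has $\pi$ as its target distribution in this approximate sense. (Verify this when $\pi \propto e^{-\tau\theta^2/2}$.)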
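To make the comparison between Euler's method and the leapfrog method concrete, the sketch below integrates the one-dimensional case $U(\theta) = \theta^2/2$, $K(\nu) = \nu^2/2$ with both schemes and records the largest deviation of $H$ from its initial value. The target, step size, number of steps, and function names are illustrative assumptions, not taken from the notes.

```python
def grad_U(theta):
    """Gradient of the assumed potential U(theta) = theta^2 / 2."""
    return theta

def H(theta, nu):
    """Hamiltonian with unit mass: U(theta) + K(nu)."""
    return 0.5 * theta**2 + 0.5 * nu**2

def euler_step(theta, nu, eps):
    # Both updates use the OLD state, as in Euler's method.
    nu_new = nu - eps * grad_U(theta)
    theta_new = theta + eps * nu
    return theta_new, nu_new

def leapfrog_step(theta, nu, eps):
    nu_half = nu - 0.5 * eps * grad_U(theta)           # half step in momentum
    theta_new = theta + eps * nu_half                   # full step in position
    nu_new = nu_half - 0.5 * eps * grad_U(theta_new)    # remaining half step
    return theta_new, nu_new

def max_energy_error(step, eps=0.1, n_steps=200):
    """Largest |H - H_0| seen along a trajectory started at (0, 1)."""
    theta, nu = 0.0, 1.0
    h0 = H(theta, nu)
    worst = 0.0
    for _ in range(n_steps):
        theta, nu = step(theta, nu, eps)
        worst = max(worst, abs(H(theta, nu) - h0))
    return worst

euler_err = max_energy_error(euler_step)
leapfrog_err = max_energy_error(leapfrog_step)
```

For this quadratic Hamiltonian each Euler step multiplies $H$ by $1 + \epsilon^2$, so the energy grows geometrically, while the leapfrog error stays bounded at order $\epsilon^2$ over the whole trajectory.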

5 Hamiltonian Monte Carlo algorithm

Using Hamiltonian dynamics to sample from a distribution requires translating the density function of this distribution into a potential energy function and introducing momentum variables to go with the original variables of interest (now seen as position variables). We can then simulate a Markov chain in which each iteration resamples the momentum and then does a Metropolis update with a proposal found using Hamiltonian dynamics.

We now have the background needed to present the Hamiltonian Monte Carlo (HMC) algorithm. HMC can be used to sample only from continuous distributions on $\mathbb{R}^d$ for which the density function can be evaluated (perhaps up to an unknown normalizing constant). For the moment, we will also assume that the density is non-zero everywhere. We must also be able to compute the partial derivatives of the log of the density function. These derivatives must therefore exist, except perhaps on a set of points with probability zero, for which some arbitrary value could be returned.

HMC samples from the canonical distribution for $(\theta, \nu)$, in which $\theta$ has the distribution of interest $\pi(\theta)$, as specified using the potential energy function $U(\theta)$. We can choose the distribution of the momentum variables $\nu$, which are independent of $\theta$, as we wish, specifying the distribution via the kinetic energy function $K(\nu)$. Current practice with HMC is to use a quadratic kinetic energy, which leads $\nu$ to have a zero-mean multivariate Gaussian distribution. Most often, the components of $\nu$ are specified to be independent, with component $j$ having variance $m_j$. The kinetic energy function producing this distribution (setting the temperature $T$ of the canonical distribution to 1) is

$$K(\nu) = \sum_{j=1}^{d} \frac{\nu_j^2}{2 m_j},$$

so that the momentum density is proportional to $\exp\{-K(\nu)\}$.

5.1 The two steps of the HMC algorithm

Each iteration of the HMC algorithm has two steps. The first changes only the momentum; the second may change both position and momentum.
Both steps leave the canonical joint distribution of $(\theta, \nu)$ invariant, and hence their combination also leaves this distribution invariant.

In the first step, new values for the momentum variables are randomly drawn from their Gaussian distribution, independently of the current values of the position variables. For the kinetic energy above, the $d$ momentum variables are independent, with $\nu_j$ having mean zero and variance $m_j$. Since $\theta$ isn't changed, and $\nu$ is drawn from its correct conditional distribution given $\theta$ (the same as its marginal distribution, due to independence), this step obviously leaves the canonical joint distribution invariant.

In the second step, a Metropolis update is performed, using Hamiltonian dynamics to propose a new state. Starting with the current state $(\theta, \nu)$, Hamiltonian dynamics is simulated for $L$ steps using the leapfrog method (or some other reversible method that preserves volume), with a step size of $\epsilon$. Here, $L$ and $\epsilon$ are parameters of the algorithm, which need to be tuned to obtain good performance. The momentum variables at the end of this $L$-step trajectory are then negated, giving a proposed state $(\theta^*, \nu^*)$. This proposed state is accepted as the next state of the Markov chain with probability

$$\min\left\{1, \exp\{-H(\theta^*, \nu^*) + H(\theta, \nu)\}\right\}.$$

If the proposed state is not accepted (i.e., it is rejected), the next state is the same as the current state (and is counted again when estimating the expectation of some function of the state by its average over states of the Markov chain). The negation of the momentum variables at the end of the trajectory makes the Metropolis proposal symmetrical, as needed for the acceptance probability above to be valid. This negation need not be done in practice, since $K(\nu) = K(-\nu)$, and the momentum will be replaced before it is used again, in the first step of the next iteration. (This assumes that these HMC updates are the only ones performed.)
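The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned or definitive implementation: the names `leapfrog` and `hmc`, the diagonal mass vector `m`, the standard Gaussian test target, and the values of $\epsilon$ and $L$ are all choices made for the example.

```python
import numpy as np

def leapfrog(theta, nu, grad_U, eps, L, m):
    """Run L leapfrog steps of size eps, then negate the momentum."""
    theta, nu = theta.copy(), nu.copy()
    nu -= 0.5 * eps * grad_U(theta)        # initial half step in momentum
    for i in range(L):
        theta += eps * nu / m              # full step in position
        if i < L - 1:
            nu -= eps * grad_U(theta)      # two adjacent half steps merged
    nu -= 0.5 * eps * grad_U(theta)        # final half step in momentum
    return theta, -nu

def hmc(U, grad_U, theta0, n_iter, eps, L, m, rng):
    """HMC with independent Gaussian momenta, nu_j ~ N(0, m_j)."""
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    n_accept = 0
    for it in range(n_iter):
        # Step 1: resample the momentum, independently of the position.
        nu = rng.normal(size=theta.size) * np.sqrt(m)
        # Step 2: Metropolis update with a Hamiltonian-dynamics proposal.
        theta_prop, nu_prop = leapfrog(theta, nu, grad_U, eps, L, m)
        h_cur = U(theta) + 0.5 * np.sum(nu**2 / m)
        h_prop = U(theta_prop) + 0.5 * np.sum(nu_prop**2 / m)
        if rng.uniform() < np.exp(min(0.0, h_cur - h_prop)):
            theta = theta_prop
            n_accept += 1
        samples[it] = theta                # a rejection repeats the current state
    return samples, n_accept / n_iter

# Illustrative target: standard Gaussian in d = 10, so U(theta) = |theta|^2 / 2.
d = 10
rng = np.random.default_rng(1)
samples, acc_rate = hmc(lambda th: 0.5 * np.sum(th**2), lambda th: th,
                        np.zeros(d), 2000, 0.2, 20, np.ones(d), rng)
```

The momentum negation is kept here only to mirror the description; as noted above, it has no effect on the chain because $K(\nu) = K(-\nu)$ and the momentum is redrawn at the start of the next iteration.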
