Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February to June 2018

Outline
- Overview
- Introduction
- Linear Algebra
- Probability
- Linear Regression 1
- Linear Regression 2
- Linear Classification 1
- Linear Classification 2
- Kernel Methods
- Sparse Kernel Methods
- Mixture Models and EM 1
- Mixture Models and EM 2
- Neural Networks 1
- Neural Networks 2
- Principal Component Analysis
- Autoencoders
- Graphical Models 1
- Graphical Models 2
- Graphical Models 3
- Sequential Data 1
- Sequential Data 2

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part XVI: Sampling
- Sampling from the Uniform Distribution
- Sampling from Standard Distributions
- Rejection Sampling
- Importance Sampling
- Gibbs Sampling
Approximating π - Buffon's Needle (1777)

Drop a needle of length c onto a floor ruled with parallel lines a distance d apart.
- (i) Needle falls perpendicular to the lines (a = π/2): the probability of crossing a line is c/d.
- (ii) Needle falls at an arbitrary angle a: the probability of crossing is c sin(a)/d.
Every angle is equally probable, so calculate the mean over a:

p(crossing) = ∫₀^π (c/d) sin(a) dp(a) = (1/π)(c/d) ∫₀^π sin(a) da = (2/π)(c/d)

Observing n crossings in N experiments gives n/N ≈ (2/π)(c/d), so π can be estimated as π ≈ 2cN/(dn).
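The experiment above can be simulated directly. A minimal sketch (the function name `buffon_pi` and the default parameters are illustrative, not from the slides):

```python
import math
import random

def buffon_pi(c=1.0, d=2.0, trials=100_000):
    """Estimate pi via Buffon's needle: a needle of length c dropped on
    lines spaced d apart (c <= d) crosses a line with probability 2c/(pi d)."""
    crossings = 0
    for _ in range(trials):
        y = random.uniform(0.0, d / 2)        # distance of needle centre to nearest line
        a = random.uniform(0.0, math.pi / 2)  # needle angle against the lines
        if y <= (c / 2) * math.sin(a):        # the needle reaches the line
            crossings += 1
    # Invert n/N = 2c/(pi d) for pi.
    return (2.0 * c * trials) / (d * crossings)
```

With 100,000 drops the estimate is typically within a few hundredths of π; the slow 1/√N convergence is characteristic of Monte Carlo estimates.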
For most probabilistic models of practical interest, exact inference is intractable, so we need approximation schemes: numerical sampling, also known as Monte Carlo methods.

Fundamental problem: find the expectation of some function f(z) with respect to a probability distribution p(z),

E[f] = ∫ f(z) p(z) dz

Key idea: draw independent samples z^(l), l = 1, ..., L, from p(z) and approximate the expectation by

f̂ = (1/L) Σ_{l=1}^{L} f(z^(l))

Problem: how do we obtain independent samples from p(z)?
Approximating the Expectation of f(z)

Samples must be independent, otherwise the effective sample size is much smaller than the apparent sample size.
If f(z) is small in regions where p(z) is large (or vice versa): large sample sizes are needed to catch the contributions from all regions.

Figure: a function f(z) together with a distribution p(z) whose mass does not coincide with the region where f(z) is large.
Sampling from the Uniform Distribution

On a computer, sampling is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers. Use a mathematically well crafted pseudorandom number generator.
From now on we assume that a good pseudorandom number generator for uniformly distributed data is available.
If you don't trust any algorithm: three carefully adjusted radio receivers picking up atmospheric noise provide real random numbers at http://www.random.org/
Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function

h(y) = ∫_{−∞}^{y} p(x) dx

Transform the samples z from U(z|0, 1) by y = h^{−1}(z) to obtain samples y distributed according to p(y).
It is easy to check that h is then the CDF of y:

p(y < x) = p(h^{−1}(z) < x) = p(z < h(x)) = h(x).
Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
If a uniformly distributed random variable z is transformed using y = h^{−1}(z), then y will be distributed according to p(y).

Figure: the CDF h(y) rises from 0 to 1; uniform samples on the vertical axis map through h^{−1} to samples of p(y) on the horizontal axis.
Sampling from the Exponential Distribution

Goal: sample from the exponential distribution

p(y) = λ e^{−λy} for y ≥ 0, and p(y) = 0 for y < 0,

with rate parameter λ > 0.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function

h(y) = ∫_{−∞}^{y} p(x) dx = ∫₀^{y} λ e^{−λx} dx = 1 − e^{−λy}

Transform the samples z from U(z|0, 1) by

y = h^{−1}(z) = −(1/λ) ln(1 − z)

to obtain samples y distributed according to the exponential distribution.
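The inverse-CDF transform above is a one-liner in code. A minimal sketch (the function name `sample_exponential` is illustrative):

```python
import math
import random

def sample_exponential(lam, n):
    """Inverse-CDF sampling of the exponential distribution with rate lam:
    draw z ~ U(0, 1) and return y = -ln(1 - z)/lam."""
    return [-math.log(1.0 - random.random()) / lam for _ in range(n)]
```

The sample mean of the output should approach 1/λ, the mean of the exponential distribution.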
Sampling from Multivariate Distributions

The generalisation to multiple variables is straightforward: consider a change of variables via the Jacobian,

p(y_1, ..., y_M) = p(z_1, ..., z_M) |∂(z_1, ..., z_M)/∂(y_1, ..., y_M)|

Technical challenges: multiple integrals; inverting nonlinear functions of multiple variables.
Sampling from the Gaussian Distribution - Box-Muller

1. Generate pairs of uniformly distributed random numbers z_1, z_2 ∈ (−1, 1) (e.g. z_i = 2z − 1 for z from U(z|0, 1)).
2. Discard any pair (z_1, z_2) unless z_1² + z_2² ≤ 1. This results in a uniform distribution inside the unit circle, p(z_1, z_2) = 1/π.
3. Evaluate r² = z_1² + z_2² and

y_1 = z_1 (−2 ln r² / r²)^{1/2}
y_2 = z_2 (−2 ln r² / r²)^{1/2}

4. y_1 and y_2 are independent with joint distribution

p(y_1, y_2) = p(z_1, z_2) |∂(z_1, z_2)/∂(y_1, y_2)| = (1/√(2π)) e^{−y_1²/2} · (1/√(2π)) e^{−y_2²/2}
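The four steps above translate almost line by line into code. A minimal sketch (the function name `box_muller` is illustrative):

```python
import math
import random

def box_muller():
    """Polar Box-Muller: rejection-sample a point in the unit disc,
    then transform it into two independent N(0, 1) samples."""
    while True:
        z1 = 2.0 * random.random() - 1.0   # uniform on (-1, 1)
        z2 = 2.0 * random.random() - 1.0
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:                # keep only points inside the unit circle
            break
    factor = math.sqrt(-2.0 * math.log(r2) / r2)
    return z1 * factor, z2 * factor
```

About 1 − π/4 ≈ 21% of candidate pairs are discarded in step 2, which is the price paid for avoiding the evaluation of sin and cos in the classic Box-Muller form.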
Sampling from the Multivariate Gaussian

Generate x ~ N(0, I) and let y = A x + b. Then

E[y] = b
cov[y] = E[(y − E[y])(y − E[y])^T] = A E[x x^T] A^T = A A^T

We know y is Gaussian (a linear transformation of a Gaussian random variable is again Gaussian). So to sample y ~ N(m, Σ), put

b = m,  A A^T = Σ,

where A is the Cholesky factor of Σ.
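For the 2×2 case the Cholesky factor can be written out by hand, which makes the recipe y = A x + m concrete. A minimal sketch, assuming a positive-definite covariance (the function names are illustrative):

```python
import math
import random

def cholesky_2x2(sigma):
    """Lower-triangular A with A A^T = sigma, for a 2x2 covariance matrix."""
    (s11, s12), (_, s22) = sigma
    a11 = math.sqrt(s11)
    a21 = s12 / a11
    a22 = math.sqrt(s22 - a21 * a21)
    return [[a11, 0.0], [a21, a22]]

def sample_mvn(mean, sigma):
    """Draw y ~ N(mean, sigma) as y = A x + mean with x ~ N(0, I)."""
    a = cholesky_2x2(sigma)
    x = [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]
    return [mean[0] + a[0][0] * x[0],
            mean[1] + a[1][0] * x[0] + a[1][1] * x[1]]
```

In practice one would call a library Cholesky routine for general dimensions; the structure of the sampler is unchanged.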
Rejection Sampling

Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p̃(z) up to some unknown normalisation constant Z_p:

p(z) = (1/Z_p) p̃(z)

Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k it holds for all z that

k q(z) ≥ p̃(z)

Figure: the scaled proposal k q(z) envelopes p̃(z); a pair (z_0, u_0) is drawn with u_0 uniform on [0, k q(z_0)].
Rejection Sampling

1. Generate a random number z_0 from the distribution q(z).
2. Generate a number u_0 from the uniform distribution over [0, k q(z_0)].
3. If u_0 > p̃(z_0), reject the pair (z_0, u_0).
4. The remaining pairs have uniform distribution under the curve p̃(z).
5. The z values are distributed according to p(z).
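The five steps above can be sketched as a generic sampler. The example target p̃(z) = z(1 − z) on [0, 1] with a uniform proposal and k = 0.25 is our own illustration, not from the slides:

```python
import random

def rejection_sample(p_tilde, q_sample, q_pdf, k, n):
    """Rejection sampling: accept z0 ~ q when a uniform u0 on
    [0, k*q(z0)] falls under p_tilde(z0).  Requires k*q(z) >= p_tilde(z)."""
    samples = []
    while len(samples) < n:
        z0 = q_sample()                         # step 1
        u0 = random.uniform(0.0, k * q_pdf(z0)) # step 2
        if u0 <= p_tilde(z0):                   # step 3
            samples.append(z0)
    return samples

# Illustrative use: unnormalised p_tilde(z) = z*(1-z) on [0, 1],
# uniform proposal q(z) = 1, envelope constant k = 0.25 = max p_tilde.
draws = rejection_sample(lambda z: z * (1 - z),
                         random.random, lambda z: 1.0, 0.25, 10_000)
```

Here p̃(z) = z(1 − z) is an unnormalised Beta(2, 2) density; the sampler never needs its normalisation constant.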
Rejection Sampling - Example

Consider the Gamma distribution for a > 1:

Gam(z | a, b) = b^a z^{a−1} exp(−bz) / Γ(a)

A suitable q(z) is based on the Cauchy distribution:

q(z) = k / (1 + (z − c)²/b²)

Samples z from q(z) are obtained from uniformly distributed y via the transformation z = b tan y + c, with c = a − 1, b² = 2a − 1, and k as small as possible subject to k q(z) ≥ p(z).

Figure: the Gamma distribution p(z) under its scaled Cauchy envelope.
Adaptive Rejection Sampling

A suitable form for the proposal distribution q(z) might be difficult to find. If p(z) is log-concave (ln p(z) has nonincreasing derivatives), the derivatives can be used to construct an envelope:

1. Start with an initial grid of points z_1, ..., z_M and construct the envelope from the tangents to ln p(z) at the points z_i, i = 1, ..., M.
2. Draw a sample from the envelope function and apply the rejection criterion against p(z); if the sample is rejected, use it to refine the grid.

Figure: ln p(z) with tangent lines at the grid points z_1, z_2, z_3 forming a piecewise-linear envelope.
Adaptive Rejection Sampling - Example

The piecewise exponential distribution is defined as

p(z) = k_m λ_m e^{−λ_m (z − z_{m−1})},  ẑ_{m−1,m} < z ≤ ẑ_{m,m+1},

where ẑ_{m−1,m} is the point of intersection of the tangent lines at z_{m−1} and z_m, λ_m is the slope of the tangent at z_m, and k_m accounts for the corresponding offset.

Figure: ln p(z) with tangents at the grid points z_1, z_2, z_3.
Rejection Sampling - Problems

- We need to find a proposal distribution q(z) which is a close upper bound to p(z); otherwise many samples are rejected.
- Curse of dimensionality: for multivariate distributions the acceptance rate typically decays rapidly with the dimension.
Importance Sampling

Directly calculates E_p[f(z)] with respect to some p(z); it does not sample from p(z) as an intermediate step. Again we use samples z^(l) from some proposal q(z):

E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^{L} (p(z^(l))/q(z^(l))) f(z^(l))
Importance Sampling - Unnormalised Distributions

Consider unnormalised distributions p(z) = p̃(z)/Z_p and q(z) = q̃(z)/Z_q. With r_l = p̃(z^(l))/q̃(z^(l)), it follows that

E[f] = ∫ f(z) p(z) dz = (Z_q/Z_p) ∫ f(z) (p̃(z)/q̃(z)) q(z) dz ≈ (Z_q/Z_p) (1/L) Σ_{l=1}^{L} r_l f(z^(l)).

Using the same samples (and f(z) = 1) gives

Z_p/Z_q ≈ (1/L) Σ_{l=1}^{L} r_l

and therefore

E[f] ≈ Σ_{l=1}^{L} (r_l / Σ_{m=1}^{L} r_m) f(z^(l)).
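The self-normalised estimator in the last line can be coded directly; the unknown constants Z_p and Z_q cancel in the normalised weights. A minimal sketch, estimating E[z²] = 1 under an unnormalised standard normal with a wide uniform proposal (the example and function name are our own illustration):

```python
import math
import random

def importance_expectation(f, p_tilde, q_tilde, q_sample, L):
    """Self-normalised importance sampling: weights r_l = p_tilde/q_tilde
    are normalised so the unknown constants Z_p and Z_q cancel."""
    zs = [q_sample() for _ in range(L)]
    rs = [p_tilde(z) / q_tilde(z) for z in zs]
    total = sum(rs)
    return sum(r * f(z) for r, z in zip(rs, zs)) / total

# Illustrative use: E[z^2] under p(z) proportional to exp(-z^2/2),
# with an (unnormalised) uniform proposal on (-5, 5).
est = importance_expectation(lambda z: z * z,
                             lambda z: math.exp(-0.5 * z * z),
                             lambda z: 0.1,
                             lambda: random.uniform(-5.0, 5.0),
                             100_000)
```

Note that q̃(z) = 0.1 is deliberately unnormalised: any positive constant gives the same estimate, demonstrating that only the ratio p̃/q̃ up to a common factor matters.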
Importance Sampling - Key Points

- The importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).
- Success depends on how well q(z) approximates p(z).
- Wherever p(z) > 0, we also need q(z) > 0.
Markov Chain Monte Carlo

Goal: sample from p(z) by generating a sequence using a Markov chain.

1. Generate a new sample z* ~ q(z | z^(l)), conditional on the previous sample z^(l).
2. Accept or reject the new sample according to some appropriate criterion:

z^(l+1) = z* if accepted, z^(l) if rejected.

3. For an appropriate proposal and corresponding acceptance criterion, as l → ∞, z^(l) approaches an independent sample from p(z).
Metropolis Algorithm

1. Choose a symmetric proposal distribution: q(z_A | z_B) = q(z_B | z_A).
2. Accept the new sample z* with probability

A(z*, z^(τ)) = min(1, p̃(z*)/p̃(z^(τ)))

e.g. let u ~ Uniform(0, 1) and accept if p̃(z*)/p̃(z^(τ)) > u.
3. Unlike rejection sampling, on rejection of the proposal we include the previous sample again:

z^(τ+1) = z* if accepted, z^(τ) if rejected.
Metropolis Algorithm - Illustration

Sampling from a Gaussian distribution (the black contour shows one standard deviation). The proposal is q(z | z^(l)) = N(z | z^(l), σ²I) with σ = 0.2. Of 150 candidates generated, 43 were rejected.

Figure: the Gaussian contour with the Metropolis random walk; accepted steps and rejected steps are marked.
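A one-dimensional version of this random-walk Metropolis sampler fits in a few lines. A minimal sketch, targeting an unnormalised standard normal (the function name `metropolis` is illustrative):

```python
import math
import random

def metropolis(p_tilde, z0, sigma, n_steps):
    """Random-walk Metropolis: Gaussian proposal centred on the current
    state, accepted with probability min(1, p_tilde(z*) / p_tilde(z))."""
    z = z0
    chain = []
    for _ in range(n_steps):
        z_new = z + random.gauss(0.0, sigma)
        if random.random() < min(1.0, p_tilde(z_new) / p_tilde(z)):
            z = z_new                 # accept the proposal
        chain.append(z)               # on rejection, the old state is repeated
    return chain

# Illustrative use: sample a standard normal from its unnormalised density.
chain = metropolis(lambda z: math.exp(-0.5 * z * z), 0.0, 1.0, 50_000)
```

Repeating the old state on rejection (rather than discarding it, as rejection sampling does) is exactly what makes the chain leave p(z) invariant.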
Markov Chain Monte Carlo - Why it works

A first-order Markov chain is a series of random variables z^(1), ..., z^(M) such that the following property holds:

p(z^(m+1) | z^(1), ..., z^(m)) = p(z^(m+1) | z^(m))

Marginal probability:

p(z^(m+1)) = Σ_{z^(m)} p(z^(m+1) | z^(m)) p(z^(m)) = Σ_{z^(m)} T_m(z^(m), z^(m+1)) p(z^(m))

where T_m(z^(m), z^(m+1)) are the transition probabilities.

Figure: graphical model of the chain z_1 → z_2 → ... → z_m → z_{m+1}.
Markov Chain Monte Carlo - Why it works

Marginal probability:

p(z^(m+1)) = Σ_{z^(m)} T_m(z^(m), z^(m+1)) p(z^(m))

A Markov chain is called homogeneous if the transition probabilities are the same for all m, denoted T(z', z).
A distribution is invariant, or stationary, with respect to a Markov chain if each step leaves the distribution unchanged. For a homogeneous Markov chain, the distribution p*(z) is invariant if

p*(z) = Σ_{z'} T(z', z) p*(z').

(Note: there can be many invariant distributions. If T is the identity matrix, every distribution is invariant.)
Markov Chain Monte Carlo - Why it works

Detailed balance,

p*(z) T(z, z') = p*(z') T(z', z),

is sufficient (but not necessary) for p*(z) to be invariant. (A Markov chain that respects detailed balance is called reversible.)
A Markov chain is ergodic if it converges to the invariant distribution irrespective of the choice of initial conditions; the invariant distribution is then called the equilibrium distribution. An ergodic Markov chain can have only one equilibrium distribution.
Why does this work? Choose the transition probabilities T to satisfy detailed balance for the target distribution p(z).
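The claim that detailed balance implies invariance can be checked numerically on a small discrete chain. A sketch of our own construction: a 3-state Metropolis-style transition matrix built for a chosen target p*, which then satisfies p*(z) T(z, z') = p*(z') T(z', z) and leaves p* invariant:

```python
# Target distribution over three states.
p_star = [0.2, 0.3, 0.5]

def build_T(p):
    """Metropolis-style transition matrix with uniform proposals:
    T[i][j] = (1/(n-1)) * min(1, p[j]/p[i]) for i != j, diagonal fills the rest.
    By construction p[i]*T[i][j] = (1/(n-1)) * min(p[i], p[j]) is symmetric,
    i.e. detailed balance holds."""
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                T[i][j] = (1.0 / (n - 1)) * min(1.0, p[j] / p[i])
        T[i][i] = 1.0 - sum(T[i])   # remaining mass: stay in state i
    return T

T = build_T(p_star)

# One step of the chain applied to p_star: should reproduce p_star.
new_p = [sum(p_star[i] * T[i][j] for i in range(3)) for j in range(3)]
```

Summing the detailed balance relation over z' immediately gives the invariance condition, which is what the numerical check confirms.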
Markov Chain Monte Carlo - Metropolis-Hastings

A generalisation of the Metropolis algorithm to nonsymmetric proposal distributions q_k. At step τ, draw a sample z* from the distribution q_k(z | z^(τ)), where k labels the set of possible transitions. Accept with probability

A_k(z*, z^(τ)) = min(1, [p̃(z*) q_k(z^(τ) | z*)] / [p̃(z^(τ)) q_k(z* | z^(τ))])

The choice of proposal distribution is critical. A common choice is a Gaussian centred on the current state:
- small variance → high acceptance rate, but a slow walk through the state space; samples are not independent
- large variance → high rejection rate
Markov Chain Monte Carlo - Metropolis-Hastings

The transition probability of this Markov chain is

T(z, z') = q_k(z' | z) A_k(z', z)

To prove that p(z) is the invariant distribution, it suffices to show that detailed balance holds:

p(z) T(z, z') = p(z') T(z', z).

Using the symmetry min(a, b) = min(b, a), detailed balance follows:

p(z) q_k(z' | z) A_k(z', z) = min(p(z) q_k(z' | z), p(z') q_k(z | z'))
= min(p(z') q_k(z | z'), p(z) q_k(z' | z))
= p(z') q_k(z | z') A_k(z, z').
Markov Chain Monte Carlo - Metropolis-Hastings

Consider an isotropic Gaussian proposal distribution (blue) for a correlated multivariate Gaussian target (red).
- To keep the rejection rate low, the proposal scale should be of the order of the smallest standard deviation σ_min of the target.
- This leads to random walk behaviour and thus slow exploration of the state space: the number of steps separating states that are approximately independent is of order (σ_max/σ_min)².

Figure: an isotropic proposal of scale ρ inside an elongated Gaussian with standard deviations σ_min and σ_max.
Gibbs Sampling

Goal: sample from a joint distribution p(z) = p(z_1, ..., z_M).
How? Sample each variable in turn from its distribution conditioned on all the other variables.
Example: given p(z_1, z_2, z_3), at step τ we have samples z_1^(τ), z_2^(τ), z_3^(τ). The samples for step τ + 1 are obtained as

z_1^(τ+1) ~ p(z_1 | z_2^(τ), z_3^(τ))
z_2^(τ+1) ~ p(z_2 | z_1^(τ+1), z_3^(τ))
z_3^(τ+1) ~ p(z_3 | z_1^(τ+1), z_2^(τ+1))
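The scheme is easy to run when the full conditionals are known in closed form. A minimal sketch for a zero-mean bivariate normal with correlation ρ, where each conditional is N(ρ · other, 1 − ρ²) (the function name is illustrative):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampling for a zero-mean bivariate normal with unit marginal
    variances and correlation rho: alternately draw z1 | z2 and z2 | z1,
    each conditional being N(rho * other, 1 - rho^2)."""
    z1, z2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)
    chain = []
    for _ in range(n_steps):
        z1 = random.gauss(rho * z2, sd)   # sample z1 | z2
        z2 = random.gauss(rho * z1, sd)   # sample z2 | z1 (using the new z1)
        chain.append((z1, z2))
    return chain
```

The empirical correlation of the chain should converge to ρ; the stronger the correlation, the slower the chain mixes, mirroring the random-walk behaviour discussed for Metropolis-Hastings.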
Gibbs Sampling - Why does it work?

1. p(z) is invariant under each of the Gibbs sampling steps, and hence under the whole Markov chain: (a) the marginal distribution p(z_{\i}) does not change, and (b) by definition each step samples from the correct conditional p(z_i | z_{\i}).
2. Ergodicity: it is sufficient that none of the conditional distributions is zero anywhere.
3. Gibbs sampling is a Metropolis-Hastings sampler in which every step is accepted (the acceptance probability is always 1).

Figure: Gibbs sampling path on a correlated Gaussian in the (z_1, z_2) plane; the conditionals have width of order l while the distribution extends a distance L along the correlated direction.