Computer Intensive Methods in Mathematical Statistics
Department of Mathematics, KTH (johawes@kth.se)
Lecture 16: Advanced topics in computational statistics
18 May 2017
Plan of today's lecture
1 Last time: the EM algorithm
2 Smoothing in general hidden Markov models
Last time: the EM algorithm

The algorithm goes as follows.

Data: initial value θ_0
Result: {θ_l ; l ∈ N}
for l = 0, 1, 2, ... do
    set Q_{θ_l}(θ) = E_{θ_l}[ log f_θ(X, Y) | Y ];
    set θ_{l+1} = argmax_{θ ∈ Θ} Q_{θ_l}(θ)
end

The two steps within the main loop are referred to as the expectation and maximization steps, respectively.
Last time: the EM inequality

By rearranging the terms we obtain the following.

Theorem (the EM inequality)
For all (θ, θ′) ∈ Θ² it holds that
    l(θ) − l(θ′) ≥ Q_{θ′}(θ) − Q_{θ′}(θ′),
where the inequality is strict unless f_θ(x | Y) = f_{θ′}(x | Y).

Thus, by the very construction of {θ_l ; l ∈ N} it is guaranteed that {l(θ_l) ; l ∈ N} is non-decreasing. Hence, the EM algorithm is a monotone optimization algorithm.
EM in exponential families

In order to be practically useful, the E- and M-steps of EM have to be feasible. A rather general context in which this is the case is the following.

Definition (exponential family)
The family {f_θ(x, y) ; θ ∈ Θ} defines an exponential family if the complete data likelihood is of the form
    f_θ(x, y) = exp( ψ(θ)ᵀ φ(x) − c(θ) ) h(x),
where φ and ψ are (possibly) vector-valued functions on R^d and Θ, respectively, and h is a non-negative real-valued function on R^d. All these quantities may depend on y.
EM in exponential families (cont'd)

The intermediate quantity becomes
    Q_{θ′}(θ) = ψ(θ)ᵀ E_{θ′}[ φ(X) | Y ] − c(θ) + E_{θ′}[ log h(X) | Y ],
where the last term does not depend on θ and may thus be ignored. Consequently, in order to be able to apply EM we need
1 to be able to compute the smoothed sufficient statistics
    τ = E_{θ′}[ φ(X) | Y ] = ∫ φ(x) f_{θ′}(x | Y) dx,
2 maximization of θ ↦ ψ(θ)ᵀ τ − c(θ) to be feasible for all τ.
Example: nonlinear Gaussian model

Let (y_i)_{i=1}^n be independent observations of a random variable
    Y = h(X) + σ_y ε_y,
where h is a possibly nonlinear function and X = μ + σ_x ε_x is not observable. Here ε_x and ε_y are independent standard Gaussian noise variables. The parameters θ = (μ, σ_x²) governing the distribution of the unobservable variable X are unknown, while it is known that σ_y = 0.4.
Example: nonlinear Gaussian model (cont.)

In this case the complete data likelihood belongs to an exponential family with
    φ(x_{1:n}) = ( Σ_{i=1}^n x_i², Σ_{i=1}^n x_i ),
    ψ(θ) = ( −1/(2σ_x²), μ/σ_x² ),
and
    c(θ) = (n/2) log σ_x² + nμ²/(2σ_x²).
Letting τ_i = E_{θ_l}[ φ_i(X_{1:n}) | Y_{1:n} ], i ∈ {1, 2}, leads to
    μ_{l+1} = τ_2 / n,    (σ_x²)_{l+1} = τ_1 / n − μ²_{l+1}.
Example: nonlinear Gaussian model (cont.)

Since the smoothed sufficient statistics are of additive form it holds that, e.g.,
    τ_1 = E_{θ_l}[ Σ_{i=1}^n X_i² | Y_{1:n} ] = Σ_{i=1}^n E_{θ_l}[ X_i² | Y_i ]
(and similarly for τ_2). However, computing expectations under
    f_{θ_l}(x_i | y_i) ∝ exp( −{y_i − h(x_i)}²/(2σ_y²) − {x_i − μ_l}²/(2σ_{x,l}²) )
is in general infeasible (i.e. when the transformation h is a general nonlinear function).
Example: nonlinear Gaussian model (cont.)

Thus, within each iteration of EM we sample each component f_{θ_l}(x_i | y_i) using MH. For simplicity we use the independent proposal r(x_i) = f_{θ_l}(x_i), yielding the MH acceptance probability
    α(X_i^{(k)}, X_i*) = 1 ∧ f_{θ_l}(Y_i | X_i*) / f_{θ_l}(Y_i | X_i^{(k)})
                       = 1 ∧ exp( −(1/(2σ_y²)) { h²(X_i*) − h²(X_i^{(k)}) + 2Y_i ( h(X_i^{(k)}) − h(X_i*) ) } ).
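A minimal sketch of this independence MH sampler for one component; the function name `mh_independent` and its interface are illustrative, not part of the slides.

```python
import numpy as np

def mh_independent(y_i, h, mu_l, sigma2x_l, sigma_y, n_iter=200, rng=None):
    """Independence MH sampler targeting f_{theta_l}(x_i | y_i).

    Proposing from the prior r(x) = N(mu_l, sigma2x_l) makes the
    acceptance ratio reduce to the likelihood ratio
    f(y_i | x*) / f(y_i | x^{(k)})."""
    rng = np.random.default_rng() if rng is None else rng
    loglik = lambda x: -0.5 * (y_i - h(x)) ** 2 / sigma_y**2
    x = mu_l  # arbitrary starting point
    chain = np.empty(n_iter)
    for k in range(n_iter):
        x_star = mu_l + np.sqrt(sigma2x_l) * rng.standard_normal()
        # accept with probability 1 ∧ exp(loglik(x*) - loglik(x))
        if np.log(rng.uniform()) < loglik(x_star) - loglik(x):
            x = x_star
        chain[k] = x
    return chain
```

In the linear case h(x) = x the target is Gaussian, which gives an easy sanity check of the sampler.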
Example: nonlinear Gaussian model (cont.)

Data: initial value θ_0; Y_{1:n}
Result: {θ_l ; l ∈ N}
for l = 0, 1, 2, ... do
    set τ̂_j = 0 for all j;
    for i = 1, ..., n do
        run an MH sampler targeting f_{θ_l}(x_i | Y_i), yielding (X_i^{(k)})_{k=1}^{N_l};
        set τ̂_1 = τ̂_1 + Σ_{k=1}^{N_l} (X_i^{(k)})² / N_l;
        set τ̂_2 = τ̂_2 + Σ_{k=1}^{N_l} X_i^{(k)} / N_l;
    end
    set μ_{l+1} = τ̂_2 / n;
    set (σ_x²)_{l+1} = τ̂_1 / n − μ²_{l+1};
end
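The whole Monte Carlo EM loop above can be sketched as follows; `mcem` and its argument names are illustrative, and the inner sampler is the prior-proposal independence MH described on the previous slide.

```python
import numpy as np

def mcem(y, sigma_y, h, theta0, n_em=50, n_mh=200, seed=0):
    """Monte Carlo EM for Y_i = h(X_i) + sigma_y * eps, X_i ~ N(mu, s2).

    Each E-step approximates the smoothed sufficient statistics by an
    independence MH sampler with the prior as proposal."""
    rng = np.random.default_rng(seed)
    mu, s2 = theta0
    trace = [(mu, s2)]
    for _ in range(n_em):
        t1 = t2 = 0.0
        for y_i in y:
            x = mu
            xs = np.empty(n_mh)
            for k in range(n_mh):
                x_star = mu + np.sqrt(s2) * rng.standard_normal()
                log_a = ((y_i - h(x)) ** 2 - (y_i - h(x_star)) ** 2) / (2 * sigma_y**2)
                if np.log(rng.uniform()) < log_a:
                    x = x_star
                xs[k] = x
            t1 += np.mean(xs**2)  # contribution to tau_hat_1
            t2 += np.mean(xs)     # contribution to tau_hat_2
        n = len(y)
        mu, s2 = t2 / n, t1 / n - (t2 / n) ** 2  # M-step
        trace.append((mu, s2))
    return np.array(trace)
```

In the linear case the iterates should settle around the exact MLE, up to Monte Carlo noise of order 1/N_l.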
Example: nonlinear Gaussian model (cont.)

In order to assess the performance of this Monte Carlo EM algorithm we focus on the linear case, h(x) = x. A data record comprising n = 40 values was produced by simulation under (μ, σ_x) = (1, 0.4) with σ_y = 0.4. In this case the true MLE is known and given by (check this!)
    μ̂(Y_{1:n}) = (1/n) Σ_{i=1}^n Y_i = 1.04,
    σ̂_x²(Y_{1:n}) = (1/n) Σ_{i=1}^n { Y_i − μ̂(Y_{1:n}) }² − 0.4² = 0.27.
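The "check this!" can be verified directly: in the linear case Y_i ~ N(μ, σ_x² + σ_y²), so the MLE of the total variance is the empirical variance, from which σ_y² is subtracted. The data below are freshly simulated for illustration, so the numbers differ from the 1.04 and 0.27 on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_y = 0.4
# linear case: Y_i ~ N(mu, sigma_x^2 + sigma_y^2) with mu = 1, sigma_x^2 = 0.16
y = 1.0 + np.sqrt(0.16 + sigma_y**2) * rng.standard_normal(40)

mu_hat = y.mean()                                    # MLE of mu
s2x_hat = np.mean((y - mu_hat) ** 2) - sigma_y**2    # MLE of sigma_x^2
```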
Example: nonlinear Gaussian model (cont.) 1.4 1.2 1 µ estimate 0.8 0.6 0.4 0.2 0 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for µ in the case h(x) = x. Red-dashed line indicates true MLE. N l = 200 for all iterations. Computer Intensive Methods (14)
Example: nonlinear Gaussian model (cont.) 0.5 0.45 0.4 σ 2 x estimate 0.35 0.3 0.25 0.2 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for σ 2 x in the case h(x) = x. Red-dashed line indicates true MLE. N l = 200 for all iterations. Computer Intensive Methods (15)
Averaging

Roughly, for the MCEM algorithm one may prove that the variance of θ_l − θ̂ (where θ̂ is the MLE) is of order 1/N_l. In the idealised situation where the estimates (θ_l)_l are uncorrelated (which is not the case here) one may obtain an improved estimator of θ̂ by combining the individual estimates θ_l in proportion to the inverse of their variances. Starting the averaging at iteration l_0 leads to
    θ̄_l = Σ_{l′=l_0}^{l} N_{l′} θ_{l′} / Σ_{l′=l_0}^{l} N_{l′},   l ≥ l_0.
In the idealised situation the variance of θ̄_l decreases as 1 / Σ_{l′=l_0}^{l} N_{l′} (the inverse of the total number of simulations).
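The weighted running average above is a one-liner with cumulative sums; `averaged_estimates` is an illustrative name.

```python
import numpy as np

def averaged_estimates(theta, N, l0):
    """Running weighted averages
    theta_bar_l = sum_{l'=l0..l} N_{l'} theta_{l'} / sum_{l'=l0..l} N_{l'},
    for l >= l0, where theta holds the per-iteration MCEM estimates and
    N the Monte Carlo sample sizes N_l."""
    theta, N = np.asarray(theta, float), np.asarray(N, float)
    return np.cumsum(N[l0:] * theta[l0:]) / np.cumsum(N[l0:])
```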
Example: nonlinear Gaussian model (cont.) 1.4 1.2 1 µ estimate 0.8 0.6 0.4 0.2 0 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for µ in the case h(x) = x. Red-dashed line indicates true MLE. Averaging after l 0 = 30 iterations. Computer Intensive Methods (17)
Example: nonlinear Gaussian model (cont.) 0.5 0.45 0.4 σ 2 x estimate 0.35 0.3 0.25 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for σ 2 x in the case h(x) = x. Red-dashed line indicates true MLE. Averaging after l 0 = 30 iterations. Computer Intensive Methods (18)
Outline
1 Last time: the EM algorithm
2 Smoothing in general hidden Markov models
General hidden Markov models (HMMs)

A hidden Markov model (HMM) comprises
1 a Markov chain (X_k)_{k≥0} with transition density q, i.e.
    X_{k+1} | X_k = x_k ~ q(x_{k+1} | x_k),
which is hidden away from us but partially observed through
2 an observation process (Y_k)_{k≥0} such that, conditionally on the chain (X_k)_{k≥0},
    (i) the Y_k's are independent, with
    (ii) the conditional distribution of each Y_k depending on the corresponding X_k only.
The density of the conditional distribution Y_k | (X_k)_{k≥0} = Y_k | X_k will be denoted by p(y_k | x_k).
General HMMs (cont.)

Graphically: the observations ..., Y_{k−1}, Y_k, Y_{k+1}, ... sit above the hidden Markov chain ..., X_{k−1}, X_k, X_{k+1}, ..., with each Y_k connected to its own X_k only. In summary:
    Y_k | X_k = x_k ~ p(y_k | x_k)          (observation density)
    X_{k+1} | X_k = x_k ~ q(x_{k+1} | x_k)  (transition density)
    X_0 ~ χ(x_0)                            (initial distribution)
The smoothing distribution

In an HMM, the smoothing distribution f_n(x_{0:n} | y_{0:n}) is the conditional distribution of X_{0:n} given Y_{0:n} = y_{0:n}.

Theorem (smoothing distribution)
    f_n(x_{0:n} | y_{0:n}) = χ(x_0) p(y_0 | x_0) ∏_{k=1}^n p(y_k | x_k) q(x_k | x_{k−1}) / L_n(y_{0:n}),
where L_n(y_{0:n}) is the density of the observations y_{0:n},
    L_n(y_{0:n}) = ∫ χ(x_0) p(y_0 | x_0) ∏_{k=1}^n p(y_k | x_k) q(x_k | x_{k−1}) dx_{0:n}.
The goal

We wish, for n ∈ N, to estimate the expected value
    τ_n = E[ h_n(X_{0:n}) | Y_{0:n} ]
under the smoothing distribution. Here h_n(x_{0:n}) is of additive form, that is,
    h_n(x_{0:n}) = Σ_{i=0}^{n−1} h̃_i(x_{i:i+1}),
for instance the smoothed sufficient statistics needed for the EM algorithm.
Using basic SISR

Given particles and weights {(X_{0:n}^i, ω_n^i)}_{i=1}^N targeting f(x_{0:n} | y_{0:n}), we update these to an estimate at time n + 1 using the following steps.

Selection: draw indices {I_n^i}_{i=1}^N ~ Mult({ω_n^i}_{i=1}^N).
Mutation: for i = 1, ..., N, draw X_{n+1}^i ~ q(· | X_n^{I_n^i}) and set X_{0:n+1}^i = (X_{0:n}^{I_n^i}, X_{n+1}^i).
Weighting: for i = 1, ..., N, set ω_{n+1}^i = p(Y_{n+1} | X_{n+1}^i).
Estimation: we estimate τ_{n+1} using
    τ̂_{n+1} = Σ_{i=1}^N ( ω_{n+1}^i / Σ_{l=1}^N ω_{n+1}^l ) h_{n+1}(X_{0:n+1}^i).
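The selection/mutation/weighting steps above can be sketched as a single SISR update; the helper names `q_sample` and `p_density` are illustrative placeholders for the model's transition sampler and observation density.

```python
import numpy as np

def sisr_step(x, w, y_next, q_sample, p_density, rng):
    """One bootstrap SISR step: selection, mutation, weighting.

    x, w: particles and (unnormalized) weights at time n.
    q_sample(x, rng): draws X_{n+1} ~ q(. | x) componentwise.
    p_density(y, x): observation density p(y | x), vectorized in x."""
    N = len(x)
    idx = rng.choice(N, size=N, p=w / w.sum())  # selection (multinomial)
    x_new = q_sample(x[idx], rng)               # mutation
    w_new = p_density(y_next, x_new)            # weighting
    return x_new, w_new
```

The self-normalized weighted average of h over the new particles then gives the estimate of τ_{n+1}.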
The problem of resampling

We saw when comparing with the SIS algorithm that resampling was necessary to achieve a stable estimate. But resampling also depletes the historical trajectories: there will be an integer k such that, at time n in the algorithm, X_{0:k}^i = X_{0:k}^j for all i and j.
Figure: Ancestral particle trajectories (state value against time index), illustrating how resampling depletes the historical trajectories.
Fixing the degeneracy

The key to fixing this problem is the following observation: conditionally on the observations, the hidden Markov chain is also a Markov chain in the reverse direction. We denote the transition density of this reversed Markov chain by q←(x_k | x_{k+1}, y_{0:k}); this density can be written as
    q←(x_k | x_{k+1}, y_{0:k}) = f(x_k | y_{0:k}) q(x_{k+1} | x_k) / ∫ f(x | y_{0:k}) q(x_{k+1} | x) dx.
Since we can estimate f(x_k | y_{0:k}) well using SISR, we can use the following particle approximation of the backward distribution:
    q←_N(x_k^i | x_{k+1}^j, y_{0:k}) = ω_k^i q(x_{k+1}^j | x_k^i) / Σ_{l=1}^N ω_k^l q(x_{k+1}^j | x_k^l).
Forward Filtering Backward Smoothing (FFBSm)

Using this approximation of the backward transition density we get the Forward Filtering Backward Smoothing algorithm. First we run the SISR algorithm and save all the filter estimates {(x_k^i, ω_k^i)}_{i=1}^N for k = 0, ..., n. We then estimate τ_n using
    τ̂_n = Σ_{i_0=1}^N ··· Σ_{i_n=1}^N ( ∏_{s=0}^{n−1} ω_s^{i_s} q(x_{s+1}^{i_{s+1}} | x_s^{i_s}) / Σ_{l=1}^N ω_s^l q(x_{s+1}^{i_{s+1}} | x_s^l) ) ( ω_n^{i_n} / Σ_{l=1}^N ω_n^l ) h_n(x_0^{i_0}, x_1^{i_1}, ..., x_n^{i_n}).
A forward-only version

The previous algorithm requires two passes over the data: first the forward filtering, then the backward smoothing. When working with functions of additive form it is possible to perform smoothing forward in time. Introduce the auxiliary function T_k(x), defined as
    T_k(x) = E[ h_k(X_{0:k}) | Y_{0:k}, X_k = x ].
We can update this function recursively by noting that
    T_{k+1}(x) = E[ T_k(X_k) + h̃_k(X_k, X_{k+1}) | Y_{0:k}, X_{k+1} = x ],
that is, the expected value of the previous function plus the next additive term, under the backward distribution!
Using this decomposition

Assume that we have filter estimates {(x_k^i, ω_k^i)}_{i=1}^N at time k, together with estimates {T̂_k(x_k^i)}_{i=1}^N of {T_k(x_k^i)}_{i=1}^N. Propagate the filter estimates using one step of the SISR algorithm to get {(x_{k+1}^i, ω_{k+1}^i)}_{i=1}^N; then, for i = 1, ..., N, estimate T_{k+1}(x_{k+1}^i) using
    T̂_{k+1}(x_{k+1}^i) = Σ_{j=1}^N ( ω_k^j q(x_{k+1}^i | x_k^j) / Σ_{l=1}^N ω_k^l q(x_{k+1}^i | x_k^l) ) ( T̂_k(x_k^j) + h̃_k(x_k^j, x_{k+1}^i) ).
Finally, τ_{k+1} is estimated using
    τ̂_{k+1} = Σ_{i=1}^N ( ω_{k+1}^i / Σ_{l=1}^N ω_{k+1}^l ) T̂_{k+1}(x_{k+1}^i).
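The O(N²) forward-only update above can be sketched with broadcasting; the function name and the vectorized `q_density` / `h_tilde` interfaces are illustrative.

```python
import numpy as np

def ffbsm_forward_update(x_prev, w_prev, T_prev, x_new, q_density, h_tilde):
    """Forward-only FFBSm statistic update, O(N^2):
    T_new[i] = sum_j B[i, j] * (T_prev[j] + h_tilde(x_prev[j], x_new[i])),
    where B[i, :] are the normalized backward weights for particle i."""
    # unnormalized backward weights: rows = new particles i, cols = old j
    Q = q_density(x_new[:, None], x_prev[None, :]) * w_prev[None, :]
    B = Q / Q.sum(axis=1, keepdims=True)
    # matrix of T_k(x^j) + h_tilde(x^j, x^i), shape (i, j)
    M = T_prev[None, :] + h_tilde(x_prev[None, :], x_new[:, None])
    return (B * M).sum(axis=1)
```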
Speeding up the algorithm

This algorithm works, but it is quite slow. Idea: can we replace the computation of the expected value by an MC estimator? Can we sample sufficiently fast from the backward distribution? Formally, we wish to sample J such that, given all particles and weights at times k and k + 1 and an index i,
    P(J = j) = ω_k^j q(x_{k+1}^i | x_k^j) / Σ_{l=1}^N ω_k^l q(x_{k+1}^i | x_k^l).
We can do this efficiently using accept-reject sampling!
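A sketch of the accept-reject draw of a backward index, under the extra assumption (not stated on the slide, but standard for this trick) that q(x′ | x) ≤ q_max for all x, x′; the function name is illustrative.

```python
import numpy as np

def sample_backward_index(x_prev, w_prev, x_next_i, q_density, q_max, rng):
    """Accept-reject draw of J with P(J = j) ∝ w_prev[j] * q(x_next_i | x_prev[j]),
    assuming q(x' | x) <= q_max. Proposes J ~ Mult(w_prev) and accepts with
    probability q(x_next_i | x_prev[J]) / q_max, avoiding the O(N)
    normalization over all old particles."""
    p = w_prev / w_prev.sum()
    while True:
        j = rng.choice(len(x_prev), p=p)
        if rng.uniform() < q_density(x_next_i, x_prev[j]) / q_max:
            return j
```

When q is reasonably well mixing, each draw costs O(1) proposals on average.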
The PaRIS algorithm

Now we can describe the PaRIS algorithm. Given particles and weights {(x_k^i, ω_k^i)}_{i=1}^N targeting the filter distribution at time k, together with values {T̂_k(x_k^i)}_{i=1}^N estimating T_k(x), we proceed by first taking one step of the SISR algorithm to get particles and weights {(x_{k+1}^i, ω_{k+1}^i)}_{i=1}^N targeting the filter at time k + 1. For i = 1, ..., N we draw Ñ indices {J_{i′}}_{i′=1}^{Ñ} from the backward distribution on the previous slide and calculate
    T̂_{k+1}(x_{k+1}^i) = (1/Ñ) Σ_{i′=1}^{Ñ} ( T̂_k(x_k^{J_{i′}}) + h̃_k(x_k^{J_{i′}}, x_{k+1}^i) ).
The estimate of τ_{k+1} is then given by
    τ̂_{k+1} = Σ_{i=1}^N ( ω_{k+1}^i / Σ_{l=1}^N ω_{k+1}^l ) T̂_{k+1}(x_{k+1}^i).
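The PaRIS statistic update can be sketched as follows; for clarity this sketch draws the backward indices by exact multinomial sampling rather than the accept-reject scheme, and the function name and interfaces are illustrative.

```python
import numpy as np

def paris_update(x_prev, w_prev, T_prev, x_new, q_density, h_tilde, N_tilde, rng):
    """One PaRIS update of the auxiliary statistics: for each new particle
    i, draw N_tilde indices J with P(J = j) ∝ w_prev[j] * q(x_new[i] | x_prev[j])
    and average T_prev[J] + h_tilde(x_prev[J], x_new[i])."""
    T_new = np.empty(len(x_new))
    for i in range(len(x_new)):
        bw = w_prev * q_density(x_new[i], x_prev)   # backward weights for i
        J = rng.choice(len(x_prev), size=N_tilde, p=bw / bw.sum())
        T_new[i] = np.mean(T_prev[J] + h_tilde(x_prev[J], x_new[i]))
    return T_new
```

The smoothing estimate τ̂_{k+1} is then the self-normalized weighted average of T_new over the new particles.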
The PaRIS algorithm (cont.)

This gives an efficient algorithm for online estimation of the expected value under the joint smoothing distribution. We require the target function to be of additive form. The number of backward samples needed in the algorithm turns out to be as small as Ñ = 2. Under some additional assumptions we can prove the following:
The computational complexity grows linearly with N.
The estimator is asymptotically consistent.
There is a CLT for the error, which behaves like 1/√N.
An example: parameter estimation

It turns out that the PaRIS algorithm can be used efficiently for online parameter estimation. We tested the method on the stochastic volatility model
    X_{t+1} = φ X_t + σ V_{t+1},
    Y_t = β exp(X_t / 2) U_t,   t ∈ N,
where {V_t}_{t∈N} and {U_t}_{t∈N} are independent sequences of mutually independent standard Gaussian noise variables. The parameters to be estimated are θ = (φ, σ², β²).
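The stochastic volatility model above is easy to simulate, which is the first step of any such experiment; the function name and the particular parameter values used below are illustrative, not those of the slides' experiment.

```python
import numpy as np

def simulate_sv(phi, sigma, beta, T, rng):
    """Simulate the stochastic volatility model
        X_{t+1} = phi * X_t + sigma * V_{t+1},
        Y_t     = beta * exp(X_t / 2) * U_t,
    with X_0 drawn from the stationary law N(0, sigma^2 / (1 - phi^2))."""
    x = np.empty(T)
    x[0] = rng.standard_normal() * sigma / np.sqrt(1 - phi**2)
    for t in range(T - 1):
        x[t + 1] = phi * x[t] + sigma * rng.standard_normal()
    y = beta * np.exp(x / 2) * rng.standard_normal(T)
    return x, y
```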
Parameter estimation (cont.)

Figure: PaRIS-based recursive maximum likelihood (RML) learning trajectories for φ, σ², and β² against time.
End of the course

That is it for the lectures in this course. Hope you have enjoyed it.
Review of HA2: to be sent by mail to johawes@kth.se by 24 May, 12:00.
Exam: 30 May, 14:00-19:00.
Re-exam: 14 August, 08:00-13:00.