Computer Intensive Methods in Mathematical Statistics


Computer Intensive Methods in Mathematical Statistics
Department of Mathematics, johawes@kth.se
Lecture 16: Advanced topics in computational statistics
18 May 2017

Plan of today's lecture
1. Last time: the EM algorithm
2. Online smoothing in general hidden Markov models


Last time: the EM algorithm

The algorithm goes as follows.

Data: initial value $\theta_0$
Result: $\{\theta_l;\ l \in \mathbb{N}\}$
for $l = 0, 1, 2, \ldots$ do
    set $Q_{\theta_l}(\theta) \leftarrow \mathbb{E}_{\theta_l}(\log f_\theta(X, Y) \mid Y)$;
    set $\theta_{l+1} \leftarrow \arg\max_{\theta \in \Theta} Q_{\theta_l}(\theta)$
end

The two steps within the main loop are referred to as the expectation and maximization steps, respectively.

Last time: the EM inequality

By rearranging the terms we obtain the following.

Theorem (the EM inequality). For all $(\theta, \theta') \in \Theta^2$ it holds that
$\ell(\theta) - \ell(\theta') \geq Q_{\theta'}(\theta) - Q_{\theta'}(\theta')$,
where the inequality is strict unless $f_\theta(x \mid Y) = f_{\theta'}(x \mid Y)$.

Thus, by the very construction of $\{\theta_l;\ l \in \mathbb{N}\}$ it is ensured that $\{\ell(\theta_l);\ l \in \mathbb{N}\}$ is non-decreasing. Hence, the EM algorithm is a monotone optimization algorithm.

EM in exponential families

In order to be practically useful, the E- and M-steps of EM have to be feasible. A rather general context in which this is the case is the following.

Definition (exponential family). The family $\{f_\theta(x, y);\ \theta \in \Theta\}$ defines an exponential family if the complete data likelihood is of the form
$f_\theta(x, y) = \exp\big(\psi(\theta)^{\top} \phi(x) - c(\theta)\big) h(x)$,
where $\phi$ and $\psi$ are (possibly) vector-valued functions on $\mathbb{R}^d$ and $\Theta$, respectively, and $h$ is a non-negative real-valued function on $\mathbb{R}^d$. All these quantities may depend on $y$.

EM in exponential families (cont'd)

The intermediate quantity becomes
$Q_{\theta'}(\theta) = \psi(\theta)^{\top} \mathbb{E}_{\theta'}(\phi(X) \mid Y) - c(\theta) + \underbrace{\mathbb{E}_{\theta'}(\log h(X) \mid Y)}_{(*)}$,
where $(*)$ does not depend on $\theta$ and may thus be ignored. Consequently, in order to be able to apply EM we need
1. to be able to compute the smoothed sufficient statistics
   $\tau = \mathbb{E}_{\theta'}(\phi(X) \mid Y) = \int \phi(x) f_{\theta'}(x \mid Y)\, dx$,
2. the maximization of $\theta \mapsto \psi(\theta)^{\top} \tau - c(\theta)$ to be feasible for all $\tau$.

Example: nonlinear Gaussian model

Let $(y_i)_{i=1}^{n}$ be independent observations of a random variable
$Y = h(X) + \sigma_y \varepsilon_y$,
where $h$ is a possibly nonlinear function and $X = \mu + \sigma_x \varepsilon_x$ is not observable. Here $\varepsilon_x$ and $\varepsilon_y$ are independent standard Gaussian noise variables. The parameters $\theta = (\mu, \sigma_x^2)$ governing the distribution of the unobservable variable $X$ are unknown, while it is known that $\sigma_y = 0.4$.

Example: nonlinear Gaussian model (cont.)

In this case the complete data likelihood belongs to an exponential family with
$\phi(x_{1:n}) = \Big( \sum_{i=1}^{n} x_i^2,\ \sum_{i=1}^{n} x_i \Big)$, $\psi(\theta) = \Big( -\frac{1}{2\sigma_x^2},\ \frac{\mu}{\sigma_x^2} \Big)$, and
$c(\theta) = \frac{n}{2} \log \sigma_x^2 + \frac{n \mu^2}{2 \sigma_x^2}$.

Letting $\tau_i = \mathbb{E}_{\theta_l}(\phi_i(X_{1:n}) \mid Y_{1:n})$, $i \in \{1, 2\}$, leads to
$\mu_{l+1} = \frac{\tau_2}{n}$, $\quad (\sigma_x^2)_{l+1} = \frac{\tau_1}{n} - \mu_{l+1}^2$.
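As a minimal illustration (not part of the slides), the M-step above is a two-line computation once the smoothed sufficient statistics are available; the function name and plain-float interface are assumptions.

```python
def m_step(tau1, tau2, n):
    """M-step for the nonlinear Gaussian example: map the smoothed sufficient
    statistics tau1 ~ E[sum X_i^2 | Y] and tau2 ~ E[sum X_i | Y] to new
    parameter values (mu, sigma_x^2)."""
    mu_new = tau2 / n
    sigma2_x_new = tau1 / n - mu_new ** 2
    return mu_new, sigma2_x_new
```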

Example: nonlinear Gaussian model (cont.)

Since the smoothed sufficient statistics are of additive form it holds that, e.g.,
$\tau_1 = \mathbb{E}_{\theta_l}\Big( \sum_{i=1}^{n} X_i^2 \,\Big|\, Y_{1:n} \Big) = \sum_{i=1}^{n} \mathbb{E}_{\theta_l}\big( X_i^2 \mid Y_i \big)$
(and similarly for $\tau_2$). However, computing expectations under
$f_{\theta_l}(x_i \mid y_i) \propto \exp\Big( -\frac{1}{2\sigma_y^2} \{y_i - h(x_i)\}^2 - \frac{1}{2 (\sigma_x^2)_l} \{x_i - \mu_l\}^2 \Big)$
is in general infeasible (i.e. when the transformation $h$ is a general nonlinear function).

Example: nonlinear Gaussian model (cont.)

Thus, within each iteration of EM we sample each component $f_{\theta_l}(x_i \mid y_i)$ using MH. For simplicity we use the independent proposal $r(x_i) = f_{\theta_l}(x_i)$, yielding the MH acceptance probability
$\alpha(X_i^{(k)}, X_i^{*}) = 1 \wedge \frac{f_{\theta_l}(Y_i \mid X_i^{*})}{f_{\theta_l}(Y_i \mid X_i^{(k)})}
= 1 \wedge \exp\Big( -\frac{1}{2\sigma_y^2} \big\{ h^2(X_i^{*}) - h^2(X_i^{(k)}) + 2 Y_i \big( h(X_i^{(k)}) - h(X_i^{*}) \big) \big\} \Big)$.
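A hedged Python sketch of this independence MH sampler; the function name, the starting point, and the choice $h(x) = \sin(x)$ in the usage comment are illustrative assumptions, not the course code.

```python
import numpy as np

def mh_independent_sampler(y_i, mu_l, sigma2_x_l, sigma_y, h, n_iter, rng):
    """Sample from f_{theta_l}(x_i | y_i) with an independence MH sampler whose
    proposal is the prior N(mu_l, sigma2_x_l); the acceptance ratio then reduces
    to the likelihood ratio f_{theta_l}(y_i | x*) / f_{theta_l}(y_i | x)."""
    x = mu_l  # arbitrary starting point (assumption)
    draws = np.empty(n_iter)
    for k in range(n_iter):
        x_star = mu_l + np.sqrt(sigma2_x_l) * rng.standard_normal()
        log_alpha = (-(y_i - h(x_star)) ** 2 + (y_i - h(x)) ** 2) / (2.0 * sigma_y ** 2)
        if np.log(rng.uniform()) < log_alpha:
            x = x_star
        draws[k] = x
    return draws

# Example usage (illustrative values):
# rng = np.random.default_rng(0)
# samples = mh_independent_sampler(0.7, 1.0, 0.25, 0.4, np.sin, 500, rng)
```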

Example: nonlinear Gaussian model (cont.)

Data: initial value $\theta_0$; $Y_{1:n}$
Result: $\{\theta_l;\ l \in \mathbb{N}\}$
for $l = 0, 1, 2, \ldots$ do
    set $\hat{\tau}_j \leftarrow 0$ for all $j$;
    for $i = 1, \ldots, n$ do
        run an MH sampler targeting $f_{\theta_l}(x_i \mid Y_i)$, yielding $(X_i^{(k)})_{k=1}^{N_l}$;
        set $\hat{\tau}_1 \leftarrow \hat{\tau}_1 + \sum_{k=1}^{N_l} (X_i^{(k)})^2 / N_l$;
        set $\hat{\tau}_2 \leftarrow \hat{\tau}_2 + \sum_{k=1}^{N_l} X_i^{(k)} / N_l$;
    end
    set $\mu_{l+1} \leftarrow \hat{\tau}_2 / n$;
    set $(\sigma_x^2)_{l+1} \leftarrow \hat{\tau}_1 / n - \mu_{l+1}^2$;
end
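Below is a minimal, self-contained Python sketch of this Monte Carlo EM loop, specialised to the linear case $h(x) = x$ so that it runs as-is; the function names, default sample sizes, and simulation settings are assumptions for illustration only.

```python
import numpy as np

def mcem(y, theta0, sigma_y, n_em=100, n_mh=200, seed=0):
    """Monte Carlo EM for Y_i = X_i + sigma_y * eps_y, X_i ~ N(mu, sigma2_x).
    Each E-step replaces the smoothed sufficient statistics by MH averages."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mu, sigma2_x = theta0
    trace = [(mu, sigma2_x)]
    for _ in range(n_em):
        tau1 = tau2 = 0.0
        for yi in y:
            x = mu  # independence MH targeting f_theta_l(x_i | y_i)
            xs = np.empty(n_mh)
            for k in range(n_mh):
                x_star = mu + np.sqrt(sigma2_x) * rng.standard_normal()
                log_alpha = (-(yi - x_star) ** 2 + (yi - x) ** 2) / (2 * sigma_y ** 2)
                if np.log(rng.uniform()) < log_alpha:
                    x = x_star
                xs[k] = x
            tau1 += np.mean(xs ** 2)
            tau2 += np.mean(xs)
        mu = tau2 / n                    # M-step
        sigma2_x = tau1 / n - mu ** 2
        trace.append((mu, sigma2_x))
    return np.array(trace)

# Simulated data roughly matching the slides (n = 40, mu = 1, sigma_x = sigma_y = 0.4):
rng = np.random.default_rng(1)
y = 1.0 + 0.4 * rng.standard_normal(40) + 0.4 * rng.standard_normal(40)
print(mcem(y, theta0=(0.0, 1.0), sigma_y=0.4)[-1])
```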

Example: nonlinear Gaussian model (cont.)

In order to assess the performance of this Monte Carlo EM algorithm we focus on the linear case, $h(x) = x$. A data record comprising $n = 40$ values was produced by simulation under $(\mu, \sigma_x) = (1, 0.4)$ with $\sigma_y = 0.4$. In this case, the true MLE is known and given by (check this!)
$\hat{\mu}(Y_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} Y_i = 1.04$,
$\hat{\sigma}_x^2(Y_{1:n}) = \frac{1}{n} \sum_{i=1}^{n} \{Y_i - \hat{\mu}(Y_{1:n})\}^2 - 0.4^2 = 0.27$.
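A small numerical check of these closed-form expressions (variable names are assumptions):

```python
import numpy as np

def exact_mle(y, sigma_y=0.4):
    """MLE of (mu, sigma_x^2) in the linear case h(x) = x, where
    Y_i ~ N(mu, sigma_x^2 + sigma_y^2) with sigma_y known."""
    mu_hat = np.mean(y)
    sigma2_x_hat = np.mean((y - mu_hat) ** 2) - sigma_y ** 2
    return mu_hat, sigma2_x_hat
```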

Example: nonlinear Gaussian model (cont.) 1.4 1.2 1 µ estimate 0.8 0.6 0.4 0.2 0 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for µ in the case h(x) = x. Red-dashed line indicates true MLE. N l = 200 for all iterations. Computer Intensive Methods (14)

Example: nonlinear Gaussian model (cont.) 0.5 0.45 0.4 σ 2 x estimate 0.35 0.3 0.25 0.2 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for σ 2 x in the case h(x) = x. Red-dashed line indicates true MLE. N l = 200 for all iterations. Computer Intensive Methods (15)

Averaging

Roughly, for the MCEM algorithm one may prove that the variance of $\theta_l - \hat{\theta}$ (where $\hat{\theta}$ is the MLE) is of order $1 / N_l$. In the idealised situation where the estimates $(\theta_l)_l$ are uncorrelated (which is not the case here) one may obtain an improved estimator of $\hat{\theta}$ by combining the individual estimates $\theta_l$ in proportion to the inverse of their variances. Starting the averaging at iteration $l_0$ leads to
$\tilde{\theta}_l = \frac{\sum_{l'=l_0}^{l} N_{l'} \theta_{l'}}{\sum_{l'=l_0}^{l} N_{l'}}, \quad l \geq l_0.$
In the idealised situation the variance of $\tilde{\theta}_l$ decreases as $1 / \sum_{l'=l_0}^{l} N_{l'}$ (= the total number of simulations).
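A hedged sketch of this averaging scheme, assuming the per-iteration estimates $\theta_l$ and Monte Carlo sample sizes $N_l$ have been stored:

```python
import numpy as np

def averaged_estimates(theta, N, l0):
    """Running weighted average tilde_theta_l = sum_{l'=l0}^{l} N_l' theta_l' / sum N_l'
    for l >= l0; `theta` has shape (L, d) (or (L,) for a scalar parameter), `N` shape (L,)."""
    theta = np.asarray(theta, dtype=float)
    if theta.ndim == 1:
        theta = theta[:, None]
    N = np.asarray(N, dtype=float)
    num = np.cumsum(N[l0:, None] * theta[l0:], axis=0)
    den = np.cumsum(N[l0:])[:, None]
    return num / den
```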

Example: nonlinear Gaussian model (cont.) 1.4 1.2 1 µ estimate 0.8 0.6 0.4 0.2 0 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for µ in the case h(x) = x. Red-dashed line indicates true MLE. Averaging after l 0 = 30 iterations. Computer Intensive Methods (17)

Example: nonlinear Gaussian model (cont.) 0.5 0.45 0.4 σ 2 x estimate 0.35 0.3 0.25 0 10 20 30 40 50 60 70 80 90 100 EM interation index Figure: EM learning trajectory (blue curve) for σ 2 x in the case h(x) = x. Red-dashed line indicates true MLE. Averaging after l 0 = 30 iterations. Computer Intensive Methods (18)

Outline
1. Last time: the EM algorithm
2. Online smoothing in general hidden Markov models


General hidden Markov models (HMMs)

A hidden Markov model (HMM) comprises
1. a Markov chain $(X_k)_{k \geq 0}$ with transition density $q$, i.e. $X_{k+1} \mid X_k = x_k \sim q(x_{k+1} \mid x_k)$, which is hidden from us but partially observed through
2. an observation process $(Y_k)_{k \geq 0}$ such that, conditionally on the chain $(X_k)_{k \geq 0}$,
   (i) the $Y_k$'s are independent, with
   (ii) the conditional distribution of each $Y_k$ depending on the corresponding $X_k$ only.

The density of the conditional distribution $Y_k \mid (X_k)_{k \geq 0} \stackrel{d}{=} Y_k \mid X_k$ will be denoted by $p(y_k \mid x_k)$.


General HMMs (cont.)

Graphically:

      Y_{k-1}     Y_k      Y_{k+1}      (Observations)
        ↑          ↑          ↑
... → X_{k-1} →   X_k   →  X_{k+1} → ...  (Markov chain)

$Y_k \mid X_k = x_k \sim p(y_k \mid x_k)$            (Observation density)
$X_{k+1} \mid X_k = x_k \sim q(x_{k+1} \mid x_k)$    (Transition density)
$X_0 \sim \chi(x_0)$                                 (Initial distribution)

The smoothing distribution

In an HMM, the smoothing distribution $f_n(x_{0:n} \mid y_{0:n})$ is the conditional distribution of $X_{0:n}$ given $Y_{0:n} = y_{0:n}$.

Theorem (smoothing distribution).
$f_n(x_{0:n} \mid y_{0:n}) = \frac{\chi(x_0)\, p(y_0 \mid x_0) \prod_{k=1}^{n} p(y_k \mid x_k)\, q(x_k \mid x_{k-1})}{L_n(y_{0:n})}$,
where $L_n(y_{0:n})$ is the density of the observations $y_{0:n}$,
$L_n(y_{0:n}) = \int \chi(x_0)\, p(y_0 \mid x_0) \prod_{k=1}^{n} p(y_k \mid x_k)\, q(x_k \mid x_{k-1}) \, dx_{0:n}$.
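As a small illustration (not from the slides), the unnormalised log smoothing density of a given trajectory is just a sum of log factors; the function handles below are assumptions:

```python
def log_joint(x, y, log_chi, log_p, log_q):
    """Unnormalised log smoothing density
    log[ chi(x_0) p(y_0|x_0) prod_{k>=1} p(y_k|x_k) q(x_k|x_{k-1}) ]
    for a trajectory x = (x_0, ..., x_n) and observations y = (y_0, ..., y_n)."""
    val = log_chi(x[0]) + log_p(y[0], x[0])
    for k in range(1, len(x)):
        val += log_p(y[k], x[k]) + log_q(x[k], x[k - 1])
    return val
```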

The goal

We wish, for $n \in \mathbb{N}$, to estimate the expected value
$\tau_n = \mathbb{E}[h_n(X_{0:n}) \mid Y_{0:n}]$
under the smoothing distribution. Here $h_n(x_{0:n})$ is of additive form, that is,
$h_n(x_{0:n}) = \sum_{i=0}^{n-1} h_i(x_{i:i+1})$.
For instance, the smoothed sufficient statistics needed for the EM algorithm are of this form.


Using basic SISR

Given particles and weights $\{(X_{0:n}^i, \omega_n^i)\}_{i=1}^N$ targeting $f(x_{0:n} \mid y_{0:n})$, we update them to an estimate at time $n+1$ using the following steps:

Selection: draw indices $\{I_n^i\}_{i=1}^N \sim \mathrm{Mult}(\{\omega_n^i\}_{i=1}^N)$.
Mutation: for $i = 1, \ldots, N$, draw $X_{n+1}^i \sim q(\cdot \mid X_n^{I_n^i})$ and set $X_{0:n+1}^i = (X_{0:n}^{I_n^i}, X_{n+1}^i)$.
Weighting: for $i = 1, \ldots, N$, set $\omega_{n+1}^i = p(y_{n+1} \mid X_{n+1}^i)$.
Estimation: we estimate $\tau_{n+1}$ using
$\hat{\tau}_{n+1} = \sum_{i=1}^{N} \frac{\omega_{n+1}^i}{\sum_{l=1}^{N} \omega_{n+1}^l}\, h_{n+1}(X_{0:n+1}^i)$.
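A hedged Python sketch of one such SISR update; the generic interface and the stochastic volatility example in the usage lines (which anticipates the model at the end of the lecture, with assumed parameter values) are illustrative, not the course code:

```python
import numpy as np

def sisr_step(particles, weights, y_next, propagate, density, rng):
    """One SISR update: multinomial selection, mutation through the transition
    density (via `propagate`), and weighting by the observation density.
    Returns the new particles, unnormalised weights, and the ancestor indices."""
    N = len(weights)
    idx = rng.choice(N, size=N, p=weights / weights.sum())   # selection
    new_particles = propagate(particles[idx], rng)           # mutation
    new_weights = density(y_next, new_particles)             # weighting
    return new_particles, new_weights, idx

# Illustrative usage with a stochastic volatility model (assumed parameter values):
phi, sigma, beta = 0.98, 0.2, 0.7
propagate = lambda x, rng: phi * x + sigma * rng.standard_normal(len(x))
density = lambda y, x: np.exp(-0.5 * y ** 2 / (beta ** 2 * np.exp(x))) / (
    np.sqrt(2 * np.pi) * beta * np.exp(x / 2))
rng = np.random.default_rng(0)
particles = rng.standard_normal(500)
weights = np.ones(500)
particles, weights, idx = sisr_step(particles, weights, y_next=0.3,
                                    propagate=propagate, density=density, rng=rng)
```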

The problem of resampling

We saw when comparing with the SIS algorithm that resampling is necessary to achieve a stable estimate. But the resampling also depletes the historical trajectories: there will be an integer $k$ such that, at time $n$ in the algorithm, $X_{0:k}^i = X_{0:k}^j$ for all $i$ and $j$.

[Figure: particle trajectories (state value vs. time index); the historical paths coalesce into a single ancestral trajectory, illustrating path degeneracy.]

Fixing the degeneracy

The key to fixing this problem is the following observation: conditioned on the observations, the hidden Markov chain is also a Markov chain in the reverse direction. We denote the transition density of this backward Markov chain by $\overleftarrow{q}(x_k \mid x_{k+1}, y_{0:k})$; this density can be written as
$\overleftarrow{q}(x_k \mid x_{k+1}, y_{0:k}) = \frac{f(x_k \mid y_{0:k})\, q(x_{k+1} \mid x_k)}{\int f(x \mid y_{0:k})\, q(x_{k+1} \mid x) \, dx}$.
Since we can estimate $f(x_k \mid y_{0:k})$ well using SISR, we can use the following particle approximation of the backward distribution:
$\overleftarrow{q}_N(x_k^i \mid x_{k+1}^j, y_{0:k}) = \frac{\omega_k^i\, q(x_{k+1}^j \mid x_k^i)}{\sum_{l=1}^{N} \omega_k^l\, q(x_{k+1}^j \mid x_k^l)}$.


Forward Filtering Backward Smoothing (FFBSm)

Using this approximation of the backward transition density we get the Forward Filtering Backward Smoothing (FFBSm) algorithm. First we run the SISR algorithm and save all the filter estimates $\{(x_k^i, \omega_k^i)\}_{i=1}^N$ for $k = 0, \ldots, n$. We then estimate $\tau_n$ using
$\hat{\tau}_n = \sum_{i_0=1}^{N} \cdots \sum_{i_n=1}^{N} \left( \prod_{s=0}^{n-1} \frac{\omega_s^{i_s}\, q(x_{s+1}^{i_{s+1}} \mid x_s^{i_s})}{\sum_{l=1}^{N} \omega_s^l\, q(x_{s+1}^{i_{s+1}} \mid x_s^l)} \right) \frac{\omega_n^{i_n}}{\sum_{l=1}^{N} \omega_n^l}\, h_n(x_0^{i_0}, x_1^{i_1}, \ldots, x_n^{i_n})$.

A forward-only version

The previous algorithm requires two passes over the data: first the forward filtering, then the backward smoothing. When working with functions of additive form it is possible to perform the smoothing in a single forward pass. Introduce the auxiliary function $T_k(x)$, defined as
$T_k(x) = \mathbb{E}[h_k(X_{0:k}) \mid Y_{0:k}, X_k = x]$.
We can update this function recursively by noting that
$T_{k+1}(x) = \mathbb{E}[T_k(X_k) + h_k(X_k, X_{k+1}) \mid Y_{0:k}, X_{k+1} = x]$,
that is, the expected value of the previous function plus the next additive term, taken under the backward distribution!

Using this decomposition

Given the filter estimates at time $k$, $\{(x_k^i, \omega_k^i)\}_{i=1}^N$, and the estimates $\{T_k(x_k^i)\}_{i=1}^N$, propagate the filter estimates using one step of the SISR algorithm to get $\{(x_{k+1}^i, \omega_{k+1}^i)\}_{i=1}^N$. Then, for $i = 1, \ldots, N$, estimate the function $T_{k+1}(x_{k+1}^i)$ using
$T_{k+1}(x_{k+1}^i) = \sum_{j=1}^{N} \frac{\omega_k^j\, q(x_{k+1}^i \mid x_k^j)}{\sum_{l=1}^{N} \omega_k^l\, q(x_{k+1}^i \mid x_k^l)} \big( T_k(x_k^j) + h_k(x_k^j, x_{k+1}^i) \big)$.
$\tau_{k+1}$ is then estimated using
$\hat{\tau}_{k+1} = \sum_{i=1}^{N} \frac{\omega_{k+1}^i}{\sum_{l=1}^{N} \omega_{k+1}^l}\, T_{k+1}(x_{k+1}^i)$.
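A hedged sketch of this exact $O(N^2)$ forward-only update; `log_q` (log transition density) and the increment function `h_k` are assumed to be vectorised function handles:

```python
import numpy as np

def update_T(x_k, w_k, T_k, x_kp1, log_q, h_k):
    """Exact O(N^2) forward-only smoothing update:
    T_{k+1}(x^i_{k+1}) = sum_j B[i, j] * (T_k(x^j_k) + h_k(x^j_k, x^i_{k+1})),
    where B[i, j] is proportional to w^j_k q(x^i_{k+1} | x^j_k)."""
    # log backward weights, shape (N, N): rows index i (time k+1), cols index j (time k)
    log_B = np.log(w_k)[None, :] + log_q(x_kp1[:, None], x_k[None, :])
    B = np.exp(log_B - log_B.max(axis=1, keepdims=True))
    B /= B.sum(axis=1, keepdims=True)
    incr = h_k(x_k[None, :], x_kp1[:, None])   # incr[i, j] = h_k(x^j_k, x^i_{k+1})
    return (B * (T_k[None, :] + incr)).sum(axis=1)
```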

Speeding up the algorithm

This algorithm works, but it is quite slow.

Idea: can we replace the exact computation of the expected value with a Monte Carlo estimator? Can we sample sufficiently fast from the backward distribution?

Formally, we wish to sample $J$ such that, given all particles and weights at times $k$ and $k+1$ and an index $i$,
$\mathbb{P}(J = j) = \frac{\omega_k^j\, q(x_{k+1}^i \mid x_k^j)}{\sum_{l=1}^{N} \omega_k^l\, q(x_{k+1}^i \mid x_k^l)}$.
We can do this efficiently using accept-reject sampling, as sketched below.
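A hedged sketch of the accept-reject draw, assuming (as is standard for this trick) that the transition density is bounded, $q(x' \mid x) \leq q_{\max}$; the names are illustrative:

```python
import numpy as np

def sample_backward_index(x_k, w_k, x_ikp1, q, q_max, rng):
    """Draw J with P(J = j) proportional to w^j_k q(x^i_{k+1} | x^j_k) by accept-reject:
    propose J ~ Mult(w_k) and accept with probability q(x^i_{k+1} | x^J_k) / q_max."""
    p = w_k / w_k.sum()
    while True:
        j = rng.choice(len(w_k), p=p)
        if rng.uniform() * q_max < q(x_ikp1, x_k[j]):
            return j
```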



The PaRIS algorithm

Now we can describe the PaRIS algorithm. Given particles and weights $\{(x_k^i, \omega_k^i)\}_{i=1}^N$ targeting the filter distribution at time $k$, together with values $\{T_k(x_k^i)\}_{i=1}^N$ estimating $T_k(x)$, we proceed by first taking one step of the SISR algorithm to get particles and weights $\{(x_{k+1}^i, \omega_{k+1}^i)\}_{i=1}^N$ targeting the filter at time $k+1$. For $i = 1, \ldots, N$ we draw $\tilde{N}$ indices $\{J_{i'}\}_{i'=1}^{\tilde{N}}$ as described on the previous slide and calculate
$T_{k+1}(x_{k+1}^i) = \frac{1}{\tilde{N}} \sum_{i'=1}^{\tilde{N}} \Big( T_k(x_k^{J_{i'}}) + h_k(x_k^{J_{i'}}, x_{k+1}^i) \Big)$.
The estimate of $\tau_{k+1}$ is then given by
$\hat{\tau}_{k+1} = \sum_{i=1}^{N} \frac{\omega_{k+1}^i}{\sum_{l=1}^{N} \omega_{k+1}^l}\, T_{k+1}(x_{k+1}^i)$.
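A hedged sketch of one PaRIS statistic update, inlining the accept-reject draw from the earlier sketch and using $\tilde{N} = 2$ backward samples as on the slide; the function handles and names are assumptions:

```python
import numpy as np

def paris_update(x_k, w_k, T_k, x_kp1, q, q_max, h_k, rng, N_tilde=2):
    """PaRIS update of the auxiliary statistics:
    T_{k+1}(x^i_{k+1}) = (1/N_tilde) * sum over backward draws J of
                         T_k(x^J_k) + h_k(x^J_k, x^i_{k+1})."""
    N = len(x_kp1)
    p = w_k / w_k.sum()
    T_kp1 = np.zeros(N)
    for i in range(N):
        for _ in range(N_tilde):
            while True:  # accept-reject draw of the backward index J
                j = rng.choice(N, p=p)
                if rng.uniform() * q_max < q(x_kp1[i], x_k[j]):
                    break
            T_kp1[i] += T_k[j] + h_k(x_k[j], x_kp1[i])
    return T_kp1 / N_tilde

# The smoothing estimate at time k+1 is then the weighted mean:
# tau_hat = np.sum(w_kp1 * T_kp1) / np.sum(w_kp1)
```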

The PaRIS algorithm (cont.)

Using this algorithm we obtain an efficient method for online estimation of expected values under the joint smoothing distribution.
- We require the target function to be of additive form.
- The number of backward samples $\tilde{N}$ needed in the algorithm turns out to be 2.

We can (under some additional assumptions) prove the following:
- The computational complexity grows linearly with $N$.
- The estimator is asymptotically consistent.
- There is a CLT for the error, with rate $1/\sqrt{N}$.

An example: parameter estimation

It turns out that the PaRIS algorithm can be used efficiently for online parameter estimation. We tested the method on the stochastic volatility model
$X_{t+1} = \phi X_t + \sigma V_{t+1}$,
$Y_t = \beta \exp(X_t / 2)\, U_t$, $\quad t \in \mathbb{N}$,
where $\{V_t\}_{t \in \mathbb{N}}$ and $\{U_t\}_{t \in \mathbb{N}}$ are independent sequences of mutually independent standard Gaussian noise variables. The parameters to be estimated are $\theta = (\phi, \sigma^2, \beta^2)$.
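A hedged sketch simulating data from this stochastic volatility model; the default parameter values are assumptions chosen only for illustration:

```python
import numpy as np

def simulate_sv(T, phi=0.975, sigma=0.16, beta=0.63, seed=0):
    """Simulate X_{t+1} = phi X_t + sigma V_{t+1}, Y_t = beta exp(X_t/2) U_t
    with independent standard Gaussian noise sequences V and U."""
    rng = np.random.default_rng(seed)
    x = np.zeros(T)
    x[0] = sigma / np.sqrt(1 - phi ** 2) * rng.standard_normal()  # stationary start
    for t in range(T - 1):
        x[t + 1] = phi * x[t] + sigma * rng.standard_normal()
    y = beta * np.exp(x / 2) * rng.standard_normal(T)
    return x, y

x, y = simulate_sv(10_000)
```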

Parameter estimation (cont.)

[Figure: PaRIS-based RML estimates of $\phi$, $\sigma^2$, and $\beta^2$ plotted against time (three panels).]

End of the course

That is it for the lectures in this course. Hope you have enjoyed it.
- Review of HA2: sent by mail to johawes@kth.se by 24 May, 12:00.
- Exam: 30 May, 14:00–19:00.
- Re-exam: 14 August, 08:00–13:00.