
Temporal probability models Chapter 15

Outline
- Time and uncertainty
- Inference: filtering, prediction, smoothing
- Hidden Markov models
- Kalman filters (a brief mention)
- Dynamic Bayesian networks
- Particle filtering

Time and uncertainty

The world changes; we need to track and predict it.
- Diabetes management vs vehicle diagnosis
Basic idea: copy state and evidence variables for each time step.
- X_t = set of unobservable state variables at time t,
  e.g., BloodSugar_t, StomachContents_t, etc.
- E_t = set of observable evidence variables at time t,
  e.g., MeasuredBloodSugar_t, PulseRate_t, FoodEaten_t
This assumes discrete time; step size depends on the problem.
Notation: X_{a:b} = X_a, X_{a+1}, ..., X_{b-1}, X_b

Reducing Causal Dependencies

Only the essential causal relationships should be in our models; otherwise they become too complex. Keys to reducing the number of variable dependencies:
- Sensor model: evidence depends only on the current state (removes all other state-to-evidence links).
- Markov assumption: the current state is a function of a finite history of earlier states (reduces the state-to-state links).
- Stationary process assumption: the same causal mechanisms operate at every step.
[Figure: transition model links X(t-1) -> X(t) -> X(t+1); sensor model links each X(t) to e(t). X = state of heart, kidneys, lungs, liver, etc.; e = pulse rate, blood pressure, temperature, etc.]

Markov processes (Markov chains)

Construct a Bayes net from these variables: parents?
Markov assumption: X_t depends on a bounded subset of X_{0:t-1}
First-order Markov process: P(X_t | X_{0:t-1}) = P(X_t | X_{t-1})
Second-order Markov process: P(X_t | X_{0:t-1}) = P(X_t | X_{t-2}, X_{t-1})
[Figure: first-order chain X_{t-2} -> X_{t-1} -> X_t -> X_{t+1} -> X_{t+2}; second-order chain adds links that skip one step]
Sensor Markov assumption: P(E_t | X_{0:t}, E_{0:t-1}) = P(E_t | X_t)
Stationary process: transition model P(X_t | X_{t-1}) and sensor model P(E_t | X_t) fixed for all t

Example

Transition model P(R_t | R_{t-1}):
  R_{t-1} = t: 0.7
  R_{t-1} = f: 0.3
Sensor model P(U_t | R_t):
  R_t = t: 0.9
  R_t = f: 0.2
[Figure: chain Rain_{t-1} -> Rain_t -> Rain_{t+1}, with each Rain_t -> Umbrella_t]
First-order Markov assumption not exactly true in real world!
Possible fixes:
1. Increase order of Markov process
2. Augment state, e.g., add Temp_t, Pressure_t
Example: robot motion. Augment position and velocity with Battery_t

Inference tasks

- Filtering: P(X_t | e_{1:t})
  belief state; input to the decision process of a rational agent
- Prediction: P(X_{t+k} | e_{1:t}) for k > 0
  evaluation of possible action sequences; like filtering without the evidence
- Smoothing: P(X_k | e_{1:t}) for 0 <= k < t
  better estimate of past states, essential for learning
- Most likely explanation: arg max_{x_{1:t}} P(x_{1:t} | e_{1:t})
  speech recognition, decoding with a noisy channel
All are abductive tasks: given evidence, compute the probabilities of the causes (states).
All use Bayes' rule: compute P(X | e) from the sensor model P(e | X).
All use the transition model P(X_{t+1} | X_t).

Filtering Overview

[Figure: transition model over X(0), X(1), ..., X(t), ...; sensor model linking each state to its evidence e(1), ..., e(t); the query is the current state X(t)]
Given: evidence from time 0 to the present (e.g., tiredness, fever, runny nose)
Compute: probability of a given CURRENT state (e.g., influenza?)

Filtering

Aim: devise a recursive state estimation algorithm:
  P(X_{t+1} | e_{1:t+1}) = f(e_{t+1}, P(X_t | e_{1:t}))

P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})
  = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
  = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})
I.e., prediction + estimation. Prediction by summing out X_t:
P(X_{t+1} | e_{1:t+1}) = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t, e_{1:t}) P(x_t | e_{1:t})
  = α P(e_{t+1} | X_{t+1}) Σ_{x_t} P(X_{t+1} | x_t) P(x_t | e_{1:t})
f_{1:t+1} = Forward(f_{1:t}, e_{t+1}) where f_{1:t} = P(X_t | e_{1:t})
Time and space requirements are constant (independent of t)
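To make the recursion concrete, here is a minimal Python sketch of the forward update on the umbrella world, using the transition and sensor numbers from the Example slide; the representation (plain lists, state order [rain, not rain]) and all names are illustrative, not from the original slides.

```python
# Umbrella world, state order [rain, not rain]; values from the Example slide.
T = [[0.7, 0.3],              # T[i][j] = P(X_{t+1} = j | X_t = i)
     [0.3, 0.7]]
O = {True:  [0.9, 0.2],       # P(Umbrella = true  | rain), P(... | not rain)
     False: [0.1, 0.8]}       # P(Umbrella = false | rain), P(... | not rain)

def forward(f, evidence):
    """One filtering step: predict by summing out x_t, then weight by the
    sensor model and normalize (the alpha in the equations above)."""
    predicted = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
    unnorm = [O[evidence][j] * predicted[j] for j in range(2)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

f = [0.5, 0.5]                # prior P(Rain_0)
for e in [True, True]:        # umbrella seen on days 1 and 2
    f = forward(f, e)
print(f)                      # ~[0.883, 0.117], matching the filtering example
```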

Generalized Bayes Rule

P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)

Proof:
P(Y | X, e) = P(X, Y, e) / P(X, e)
  = P(X | Y, e) P(Y, e) / P(X, e)
  = P(X | Y, e) P(Y | e) P(e) / (P(X | e) P(e))
  = P(X | Y, e) P(Y | e) / P(X | e)

Generalized Bayes Rule in Filtering

P(X_{t+1} | e_{1:t}, e_{t+1}) = P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t}) / P(e_{t+1} | e_{1:t})
  = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})
Big win: now we are using the causal (sensor) model for e_{t+1}, via P(e_{t+1} | X_{t+1}, e_{1:t}), which reduces to P(e_{t+1} | X_{t+1}).
Dealing with P(X_{t+1} | e_{1:t}) is easy, since it reduces to Markov transitions and a recursive call.
P(e_{t+1} | e_{1:t}) is the correct normalizing constant because: in the course of generating the distribution over X_{t+1}, we go through all possible routes from e_{1:t} to e_{t+1}, via all possible values x_{t+1} in the numerator of the generalized Bayes equation. So the sum of these numerators over all possible x_{t+1} is P(e_{t+1} | e_{1:t}).

Recursive Nature of Filtering

[Figure: to compute P(X_{t+1} | e_{1:t+1}), combine the forward message P(X_t | e_{1:t}) (the recursion), the Markov transition model P(X_{t+1} | x_t) for all x_t, and the sensor model P(e_{t+1} | X_{t+1})]

Filtering example

[Figure: umbrella world over Rain_0, Rain_1, Rain_2 with Umbrella_1, Umbrella_2 observed. P(Rain_0) = <0.500, 0.500>; prediction for Rain_1 = <0.500, 0.500>; after Umbrella_1, P(Rain_1 | u_1) = <0.818, 0.182>; prediction for Rain_2 = <0.627, 0.373>; after Umbrella_2, P(Rain_2 | u_{1:2}) = <0.883, 0.117>. Each step is a Markovian transition X_t => X_{t+1} followed by a Bayesian update of the state probabilities based on evidence]

Forward Messaging

For n k-valued state variables: k^n elements in the state vector.
For m j-valued evidence vars: j^m elements in the evidence vector.
[Figure: size of circle i = probability that atomic-state event x_i is explained by the evidence from times 1 to i; the transition model and sensor model are applied alternately as evidence arrives at t_1, t_2, ...]

Prediction Overview

[Figure: transition model over X(0), ..., X(t) and on to X(t+k); sensor model provides evidence only up to e(t); the query is the FUTURE state X(t+k)]
Given: evidence (e.g., high sugar intake, sedentary lifestyle)
Compute: probability of a FUTURE state (e.g., diabetes)

Prediction Recursion

P(X_{t+k+1} | e_{1:t}) = Σ_{x_{t+k}} P(X_{t+k+1} | x_{t+k}) P(x_{t+k} | e_{1:t})
The second term of the summation is the recursion; the base case is the forward message P(x_t | e_{1:t}).
[Figure: chain X(t-1) -> X(t) -> X(t+1) -> X(t+2) -> ... -> X(t+k), with the transition model carrying the forward message beyond the last evidence e(t)]
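Continuing the same Python sketch, prediction is just the forward update with the sensor step dropped; the choice of k and the convergence remark are illustrative.

```python
def predict(f, k):
    """Push the filtered distribution k steps into the future by repeatedly
    summing out the current state; no new evidence is incorporated."""
    for _ in range(k):
        f = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
    return f

# For this transition model, predictions decay toward the stationary
# distribution <0.5, 0.5> as k grows.
print(predict([0.883, 0.117], 10))
```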

Smoothing Overview

[Figure: transition model over X(0), ..., X(k), ..., X(t); sensor model with evidence e(1), ..., e(k), ..., e(t); the query is an EARLIER state X(k)]
Given: evidence (e.g., memory loss today, loss of balance yesterday)
Compute: probability of an EARLIER state (e.g., stroke 2 days ago)

Smoothing and Bi-Directional Recursion

[Figure: forward message flows from X(0) toward X(k); backward message flows from X(t) back toward X(k)]
"Forward" actually means that we recurse backwards in time, but the base-case probability is then propagated forward from time 0 and modified by the transition and sensor models. Here the base case is P(X_0 | e_0) = P(X_0), since (by convention) there is no evidence at time 0.
"Backward" involves forward recursion in time, with the eventual base-case probability propagating backwards from time t. Here the base case is P(e_{t+1:t} | X_t) = 1, since the evidence "from t+1 to t" is the empty set.

Smoothing

Divide evidence e_{1:t} into e_{1:k}, e_{k+1:t}:
P(X_k | e_{1:t}) = P(X_k | e_{1:k}, e_{k+1:t})
  = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k, e_{1:k})
  = α P(X_k | e_{1:k}) P(e_{k+1:t} | X_k)
  = α f_{1:k} b_{k+1:t}
Backward message computed by a backwards recursion:
P(e_{k+1:t} | X_k) = Σ_{x_{k+1}} P(e_{k+1:t} | X_k, x_{k+1}) P(x_{k+1} | X_k)
  = Σ_{x_{k+1}} P(e_{k+1:t} | x_{k+1}) P(x_{k+1} | X_k)
  = Σ_{x_{k+1}} P(e_{k+1} | x_{k+1}) P(e_{k+2:t} | x_{k+1}) P(x_{k+1} | X_k)
Three parts of the final sum: sensor model, recursive step, transition model.
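One step of this backward recursion in the same Python sketch (T and O as defined in the filtering sketch; names illustrative):

```python
def backward(b, evidence):
    """One backward step: b_new(i) = sum_j P(e_{k+1} | j) * b(j) * P(j | i),
    i.e., sensor model * recursive message * transition model."""
    return [sum(O[evidence][j] * b[j] * T[i][j] for j in range(2))
            for i in range(2)]
```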

Smoothing Recursion

[Figure: bi-directional recursion at X(k). The forward message P(x_{k-1} | e_{1:k-1}) arrives from the left and the backward message P(e_{k+2:t} | x_{k+1}) from the right; they are combined via the sensor models P(e_k | X_k), P(e_{k+1} | X_{k+1}) and the transition models P(X_k | x_{k-1}), P(x_{k+1} | x_k)]

Backward Messaging

[Figure: size of circle i = probability that atomic-state event x_i explains/causes the remaining evidence; the transition and sensor models are applied as the message moves back through Evidence@t-1, Evidence@t]

Smoothing example

[Figure: umbrella world over Rain_0, Rain_1, Rain_2 with Umbrella_1, Umbrella_2. Forward messages: <0.500, 0.500>, <0.818, 0.182>, <0.883, 0.117>; backward messages: <0.690, 0.410>, <1.000, 1.000>; smoothed estimates for Rain_1 and Rain_2: both <0.883, 0.117>]
Forward-backward algorithm: cache forward messages along the way.
Time linear in t (polytree inference), space O(t |f|).
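Combining cached forward messages with the backward recursion gives a sketch of the forward-backward algorithm (reusing forward, backward, T, and O from the sketches above; names illustrative):

```python
def forward_backward(evidence_seq, prior):
    """Smoothed estimates P(X_k | e_{1:t}) for k = 1..t, caching forward
    messages on the way out and sweeping the backward message on the way back."""
    fs = [prior]
    for e in evidence_seq:                      # forward pass, cached
        fs.append(forward(fs[-1], e))
    b = [1.0, 1.0]                              # backward base case
    smoothed = [None] * len(evidence_seq)
    for k in range(len(evidence_seq), 0, -1):
        unnorm = [fs[k][i] * b[i] for i in range(2)]
        z = sum(unnorm)
        smoothed[k - 1] = [u / z for u in unnorm]
        b = backward(b, evidence_seq[k - 1])
    return smoothed

print(forward_backward([True, True], [0.5, 0.5]))
# both entries ~[0.883, 0.117], matching the smoothing example
```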

Combining Forward and Backward Messages

[Figure: forward messaging produces P(x_k | e_{1:k}) and P(x_{k+1} | e_{1:k+1}) via alternating transition and sensor steps; backward messaging produces P(e_{k+1:t} | x_k) and P(e_{k+2:t} | x_{k+1}). To smooth: 1) multiply the forward and backward vectors pointwise, 2) normalize the product vector, giving P(x_k | e_{1:t})]

Most likely explanation

Most likely sequence ≠ sequence of most likely states!!!!
Most likely path to each x_{t+1} = most likely path to some x_t plus one more step:
max_{x_1...x_t} P(x_1, ..., x_t, X_{t+1} | e_{1:t+1})
  = P(e_{t+1} | X_{t+1}) max_{x_t} ( P(X_{t+1} | x_t) max_{x_1...x_{t-1}} P(x_1, ..., x_{t-1}, x_t | e_{1:t}) )
Identical to filtering, except f_{1:t} is replaced by
m_{1:t} = max_{x_1...x_{t-1}} P(x_1, ..., x_{t-1}, X_t | e_{1:t}),
i.e., m_{1:t}(i) gives the probability of the most likely path to state i.
Update has the sum replaced by max, giving the Viterbi algorithm:
m_{1:t+1} = P(e_{t+1} | X_{t+1}) max_{x_t} ( P(X_{t+1} | x_t) m_{1:t} )

Viterbi example

[Figure: state-space paths for Rain_1 ... Rain_5 (true/false) with umbrella sequence true, true, false, true, true. Message values:
m_{1:1} = <.8182, .1818>, m_{1:2} = <.5155, .0491>, m_{1:3} = <.0361, .1237>, m_{1:4} = <.0334, .0173>, m_{1:5} = <.0210, .0024>; bold arcs mark the most likely path to each state]
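A sketch of the Viterbi update in the same Python setting (T and O from the filtering sketch; messages are left unnormalized, so only their ratios match the m values above):

```python
def viterbi(evidence_seq, prior):
    """Most likely state sequence: the filtering recursion with the sum over
    x_t replaced by max, plus back-pointers to recover the path."""
    m = [O[evidence_seq[0]][j] * sum(T[i][j] * prior[i] for i in range(2))
         for j in range(2)]
    back = []
    for e in evidence_seq[1:]:
        ptr, new_m = [], []
        for j in range(2):
            best_i = max(range(2), key=lambda i: T[i][j] * m[i])
            ptr.append(best_i)
            new_m.append(O[e][j] * T[best_i][j] * m[best_i])
        back.append(ptr)
        m = new_m
    state = max(range(2), key=lambda i: m[i])   # best final state
    path = [state]
    for ptr in reversed(back):                  # follow pointers backwards
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi([True, True, False, True, True], [0.5, 0.5]))
# [0, 0, 1, 0, 0]: rain on every day except day 3
```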

Hidden Markov models

X_t is a single, discrete variable (usually E_t is too). Domain of X_t is {1, ..., S}.
Transition matrix T_{ij} = P(X_t = j | X_{t-1} = i), e.g., ( 0.7 0.3 ; 0.3 0.7 )
Sensor matrix O_t for each time step, diagonal elements P(e_t | X_t = i);
e.g., with U_1 = true, O_1 = ( 0.9 0 ; 0 0.2 )
Forward and backward messages as column vectors:
f_{1:t+1} = α O_{t+1} T^T f_{1:t}
b_{k+1:t} = T O_{k+1} b_{k+2:t}
Forward-backward algorithm needs time O(S^2 t) and space O(S t)
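The same umbrella model in matrix form, as a numpy sketch (state order [rain, not rain] again; the diagonal sensor matrices follow the O_t definition above, and all names are illustrative):

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])            # T[i, j] = P(X_t = j | X_{t-1} = i)
O = {True:  np.diag([0.9, 0.2]),      # O_t for U_t = true
     False: np.diag([0.1, 0.8])}      # O_t for U_t = false

def forward_step(f, e):
    v = O[e] @ T.T @ f                # f_{1:t+1} = alpha O_{t+1} T' f_{1:t}
    return v / v.sum()

def backward_step(b, e):
    return T @ O[e] @ b               # b_{k+1:t} = T O_{k+1} b_{k+2:t}

f = np.array([0.5, 0.5])
for e in [True, True]:
    f = forward_step(f, e)
print(f)                              # ~[0.883, 0.117]
```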

Country dance algorithm

Can avoid storing all forward messages in smoothing by running the forward algorithm backwards:
f_{1:t+1} = α O_{t+1} T^T f_{1:t}
O_{t+1}^{-1} f_{1:t+1} = α T^T f_{1:t}
α' (T^T)^{-1} O_{t+1}^{-1} f_{1:t+1} = f_{1:t}
Algorithm: forward pass computes f_t, backward pass does f_i, b_i
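A one-step sketch of this inverted update (numpy T and O from the previous sketch; assumes T and the sensor diagonals are invertible):

```python
def forward_backwards_step(f_next, e_next):
    """Recover f_{1:t} from f_{1:t+1} by inverting the forward update:
    f_{1:t} = alpha' (T^T)^{-1} O_{t+1}^{-1} f_{1:t+1}."""
    v = np.linalg.inv(T.T) @ np.linalg.inv(O[e_next]) @ f_next
    return v / v.sum()

print(forward_backwards_step(np.array([0.883, 0.117]), True))
# ~[0.818, 0.182]: the earlier forward message, recovered without storing it
```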


Dynamic Bayesian networks (DBNs)

X_t, E_t contain arbitrarily many variables in a replicated Bayes net.
[Figure: two slices of a robot DBN. The internal state at time t (Battery Level, Motor RPMs, Acceleration, Velocity, Position) connects via the transition model to the same variables at time t+1; the sensor model connects state variables to Battery Gauge, Velocity Gauge, and GPS in each slice]

DBN Requirements

To build a DBN, you need:
- Prior probabilities for the internal state variables: P(X_0).
- Transition model: P(X_{t+1} | X_t).
- Sensor model: P(E_t | X_t).
Given this, we can do filtering, prediction, smoothing, and best-explanation tasks.
However, this can be expensive. Each conditional probability query invokes the Bayesian reasoner, which may:
- Generate an exact answer via enumeration.
- Generate an approximate answer via sampling.

DBNs vs. HMMs

Every HMM is a single-variable DBN; every discrete DBN is an HMM.
[Figure: a DBN slice with variables X_t, Y_t, Z_t and sparse links to X_{t+1}, Y_{t+1}, Z_{t+1}]
Sparse dependencies => exponentially fewer parameters;
e.g., with 20 Boolean state variables, three parents each:
DBN has 20 × 2^3 = 160 parameters; HMM has 2^20 × 2^20 ≈ 10^12

Exact inference in DBNs

Naive method: unroll the network and run any exact algorithm.
[Figure: the umbrella DBN unrolled for seven steps, with prior P(R_0) = 0.7 and the same transition CPT P(R_t | R_{t-1}) = (0.7 / 0.3) and sensor CPT P(U_t | R_t) = (0.9 / 0.2) replicated in every slice]
Problem: inference cost for each update grows with t.
Rollup filtering: add slice t+1, sum out slice t using variable elimination.
Largest factor is O(d^{n+1}), update cost O(d^{n+2}) (cf. HMM update cost O(d^{2n})).

Likelihood weighting for DBNs

Set of weighted samples approximates the belief state.
[Figure: the unrolled umbrella DBN, Rain_0 ... Rain_5 with Umbrella_1 ... Umbrella_5]
LW samples pay no attention to the evidence!
=> fraction agreeing falls exponentially with t
=> number of samples required grows exponentially with t
[Plot: RMS error vs. time step (0 to 50) for LW(10), LW(100), LW(1000), LW(10000); error rises with t for all sample sizes]

Particle filtering

Basic idea: ensure that the population of particles (i.e., sample events covering all state variables) tracks the high-likelihood regions of the state space.
Replicate particles proportional to likelihood for e_t: how well the states explain the evidence.
[Figure: particle populations for Rain_t through Rain_{t+1} under (a) Propagate, (b) Weight, (c) Resample; here the evidence at t+1 is not(umbrella)]
Widely used for tracking nonlinear systems, especially in vision.
Also used for simultaneous localization and mapping in mobile robots with 10^5-dimensional state spaces.

Particle filtering Algorithm

1. Generate N sample events over all the state variables, using the prior P(X_0) as the basis.
2. t = 0.
3. For each sample x_t, propagate it forward to x_{t+1} using the transition model P(X_{t+1} | X_t).
4. weight(x_{t+1}) = P(e_{t+1} | x_{t+1}): how well does the sample agree with the evidence?
5. Resample the population based on the weights of samples at t+1: choose N new sample events from the current pool of N samples. Highly weighted samples will be chosen (hence replicated) many times; low-weight samples may not be chosen much (if at all).
6. If at the final time step: stop; base state probabilities on particle ratios.
7. Else: t = t + 1; go to step 3.
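A self-contained Python sketch of this loop for the umbrella world (parameter values from the Example slide; N, the 0/1 state encoding, and all names are illustrative):

```python
import random

trans = [[0.7, 0.3], [0.3, 0.7]]                # P(X_{t+1} | X_t), rows by state
sensor = {True: [0.9, 0.2], False: [0.1, 0.8]}  # P(e | X), states [rain, not rain]

def particle_step(particles, e):
    """One update: propagate each particle through the transition model,
    weight by the sensor model, then resample N particles by weight."""
    moved = [random.choices([0, 1], weights=trans[x])[0] for x in particles]
    weights = [sensor[e][x] for x in moved]
    return random.choices(moved, weights=weights, k=len(particles))

particles = [random.choice([0, 1]) for _ in range(1000)]   # prior <0.5, 0.5>
for e in [True, True]:
    particles = particle_step(particles, e)
print(particles.count(0) / len(particles))      # ~0.88, cf. exact filtering
```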

Particle filtering contd.

Assume the sample population is consistent at time t: N(x_t | e_{1:t}) / N = P(x_t | e_{1:t})
Propagate forward: populations of x_{t+1} are
  N(x_{t+1} | e_{1:t}) = Σ_{x_t} P(x_{t+1} | x_t) N(x_t | e_{1:t})
Weight samples by their likelihood for e_{t+1}:
  W(x_{t+1} | e_{1:t+1}) = P(e_{t+1} | x_{t+1}) N(x_{t+1} | e_{1:t})
Resample to obtain populations proportional to W:
  N(x_{t+1} | e_{1:t+1}) / N = α W(x_{t+1} | e_{1:t+1})
  = α P(e_{t+1} | x_{t+1}) N(x_{t+1} | e_{1:t})
  = α P(e_{t+1} | x_{t+1}) Σ_{x_t} P(x_{t+1} | x_t) N(x_t | e_{1:t})
  = α' P(e_{t+1} | x_{t+1}) Σ_{x_t} P(x_{t+1} | x_t) P(x_t | e_{1:t})
  = P(x_{t+1} | e_{1:t+1})

Particle filtering performance

Approximation error of particle filtering remains bounded over time, at least empirically; theoretical analysis is difficult.
[Plot: average absolute error vs. time step for LW(25), LW(100), LW(1000), LW(10000) and ER/SOF(25)]

Summary

- Temporal models use state and sensor variables replicated over time.
- Markov assumptions and stationarity assumption, so we need:
  transition model P(X_t | X_{t-1})
  sensor model P(E_t | X_t)
- Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step.
- Hidden Markov models have a single discrete state variable; used for speech recognition.
- Kalman filters allow n state variables, linear Gaussian, O(n^3) update.
- Dynamic Bayes nets subsume HMMs, Kalman filters; exact update intractable.
- Particle filtering is a good approximate filtering algorithm for DBNs.

Island algorithm

Idea: run forward-backward, storing f_t, b_t at only k-1 points.
Call recursively (depth-first) on k subtasks.
O(k |f| log_k t) space, O(k log_k t) more time.

Online fixed-lag smoothing

[Figure: a sliding window from t-d to t; as t advances to t+1, the forward message at the trailing edge and the backward message at the leading edge must advance too]
Obvious method runs forward-backward for d steps each time.
Recursively compute f_{1:t-d+1}, b_{t-d+2:t+1} from f_{1:t-d}, b_{t-d+1:t}?
Forward message OK; backward message not directly obtainable.

Online fixed-lag smoothing contd.

Define B_{j:k} = Π_{i=j}^{k} T O_i, so
  b_{t-d+1:t} = B_{t-d+1:t} 1
  b_{t-d+2:t+1} = B_{t-d+2:t+1} 1
Now we can get a recursive update for B:
  B_{t-d+2:t+1} = O_{t-d+1}^{-1} T^{-1} B_{t-d+1:t} T O_{t+1}
Hence update cost is constant, independent of lag d.
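A numpy sketch of this constant-time B update (matrix names follow the equations above; assumes T and each sensor matrix are invertible):

```python
import numpy as np

def update_B(B, O_oldest, O_newest, T):
    """Slide the window one step:
    B_{t-d+2:t+1} = O_{t-d+1}^{-1} T^{-1} B_{t-d+1:t} T O_{t+1}."""
    return np.linalg.inv(O_oldest) @ np.linalg.inv(T) @ B @ T @ O_newest

# The backward message for the window is then b = B @ np.ones(B.shape[0]).
```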

Approximate inference in DBNs

- Particle filtering (Gordon, 1994; Kanazawa, Koller, and Russell, 1995; Blake and Isard, 1996)
- Factored approximation (Boyen and Koller, 1999)
- Loopy propagation (Pearl, 1988; Yedidia, Freeman, and Weiss, 2000)
- Variational approximation (Ghahramani and Jordan, 1997)
- Decayed MCMC (unpublished)

Evidence reversal

Better to propose new samples conditioned on the new evidence.
Minimizes the variance of the posterior estimates (Kong & Liu, 1996).
[Figure: the umbrella DBN, Rain_0 ... Rain_5 with Umbrella_1 ... Umbrella_5. Plot: average absolute error vs. time step for LW/ER(25), PF(25), PF/ER(25), PF/ER(1000)]

Example: DBN for speech recognition

[Figure: DBN slices with variables phoneme index, end-of-word observation (deterministic, fixed: P(OBS = 2) = 1 at word end, P(OBS = not 2) = 0), phoneme (transitions stochastic, learned), articulators such as tongue and lips (deterministic, fixed), and acoustic observation (stochastic, learned)]
Also easy to add variables for, e.g., gender, accent, speed.
Zweig and Russell (1998) show up to 40% error reduction over HMMs.