PROBABILISTIC REASONING Outline

PROBABILISTIC REASONING Outline Danica Kragic Uncertainty Probability Syntax and Semantics Inference Independence and Bayes Rule September 23, 2005 Uncertainty - Let action A t = leave for airport t minutes before flight - Will A t get me there on time? Problems: 1) partial observability (road state, other drivers plans, etc.) 2) noisy sensors (KCBS traffic reports) 3) uncertainty in action outcomes (flat tire, etc.) 4) immense complexity of modelling and predicting traffic A purely logical approach either Uncertainty 1) false: A 25 will get me there on time, or 2) leads to conclusions that are too weak for decision making: A 25 will get me there on time if there s no accident on the bridge and it doesn t rain and my tires remain intact etc etc. (A 1440 might reasonably be said to get me there on time but I d have to stay overnight in the airport...)

Methods for handling uncertainty - Default or nonmonotonic logic: * Assume my car does not have a flat tire * Assume A 25 works unless contradicted by evidence Issues: What assumptions are reasonable? How to handle contradiction? - Rules: A 25 0.3 AtAirportOnTime Sprinkler 0.99 WetGrass WetGrass 0.7 Rain Issues: Problems with combination, e.g., Sprinkler causes Rain?? - Probability Given the available evidence, A 25 will get me there on time with probability 0.04 Subjective or Bayesian probability: Probability Probabilities relate propositions to one s own state of knowledge e.g., P(A 25 no reported accidents) = 0.06 These are not claims of a probabilistic tendency in the current situation (but might be learned from past experience of similar situations) Probabilities of propositions change with new evidence: e.g., P(A 25 no reported accidents, 5 a.m.) = 0.15 Suppose I believe the following: Which action to choose? Making decisions under uncertainty P(A 25 gets me there on time...) = 0.04 P(A 90 gets me there on time...) = 0.70 P(A 120 gets me there on time...) = 0.95 P(A 1440 gets me there on time...) = 0.9999 Depends on my preferences for missing flight vs. airport cuisine, etc. Utility theory is used to represent and infer preferences Decision theory = utility theory + probability theory Probability basics A set Ω the sample space e.g., 6 possible rolls of a die. ω Ω is a sample point/atomic event A probability space or probability model is a sample space with an assignment P(ω) for every ω Ω s.t. 0 P(ω) 1, ω P(ω) = 1 e.g., P(1)=P(2)=P(3)=P(4)=P(5)=P(6)=1/6. An event A is any subset of Ω P(A) = P(ω) {ω A} E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2

Propositions Random variables A random variable is a function from sample points to some range, e.g., the reals or Booleans e.g., Odd(1) =true. P induces a probability distribution for any r.v. X: P(X =x i ) = {ω:x(ω)=x i } P(ω) e.g., P(Odd =true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2 Think of a proposition as the event (set of sample points) where the proposition is true Given Boolean random variables A and B: event a = set of sample points where A(ω)=true event a = set of sample points where A(ω)= f alse event a b = points where A(ω)=true and B(ω)=true The sample points often defined by the values of a set of random variables, i.e., the sample space is the Cartesian product of the ranges of the variables Boolean variables: sample point = propositional logic model e.g., A=true, B= f alse, or a b Proposition = disjunction of atomic events in which it is true e.g., (a b) ( a b) (a b) (a b) = P(a b) = P( a b) + P(a b) + P(a b) Prior probability Syntax for propositions Propositional or Boolean random variables e.g., Cavity (do I have a cavity?) Cavity =true is a proposition, also written cavity Discrete random variables (finite or infinite) e.g., Weather is one of sunny, rain, cloudy, snow Weather = rain is a proposition Values must be exhaustive and mutually exclusive Continuous random variables (bounded or unbounded) e.g., Temp=21.6; also allow, e.g., Temp < 22.0. Prior or unconditional probabilities of propositions e.g., P(Cavity=true) = 0.1 and P(Weather =sunny) = 0.72 correspond to belief prior to arrival of any (new) evidence Probability distribution gives values for all possible assignments: P(Weather) = 0.72, 0.1, 0.08, 0.1 (normalized, i.e., sums to 1) Joint probability distribution for a set of values gives the probability of every atomic event on those values P(Weather,Cavity) = a 4 2 matrix of values: Weather = sunny rain cloudy snow Cavity =true 0.144 0.02 0.016 0.02 Cavity = f alse 0.576 0.08 0.064 0.08 Every question about a domain can be answered by the joint distribution because every event is a sum of sample points

Conditional probability Conditional or posterior probabilities e.g., P(cavity toothache) = 0.8 i.e., given that toothache is all I know NOT if toothache then 80% chance of cavity Notation for conditional distributions: P(Cavity Toothache) = 2-element vector of 2-element vectors Note: the less specific belief remains valid after more evidence arrives, but is not always useful New evidence may be irrelevant, allowing simplification, e.g., P(cavity toothache, sunny) = P(cavity toothache) = 0.8 This kind of inference, sanctioned by domain knowledge, is crucial Definition of conditional probability: Conditional probability P(a b) = P(a b) P(b) Product rule gives an alternative formulation: P(a b) = P(a b)p(b) = P(b a)p(a) if P(b) 0 A general version holds for whole distributions, e.g., P(Weather,Cavity) = P(Weather Cavity)P(Cavity) View as 4 2 set of equations, not matrix mult Chain rule is derived by successive application of product rule: P(X 1,...,X n ) = P(X 1,...,X n 1 ) P(X n X 1,...,X n 1 ) = P(X 1,...,X n 2 ) P(X n 1 X 1,...,X n 2 ) P(X n X 1,...,X n 1 ) =... = n i=1 P(X i X 1,...,X i 1 ) Start with the joint distribution: Inference by enumeration Start with the joint distribution: Inference by enumeration For any proposition φ, sum the atomic events where it is true: P(φ) =Σ ω:ω =φ P(ω) For any proposition φ, sum the atomic events where it is true: P(φ) = ω:ω =φ P(ω) P(cavity toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

Inference by enumeration Normalization Start with the joint distribution: Computing conditional probabilities: P( cavity toothache) P( cavity toothache) = P(toothache) 0.016 + 0.064 = 0.108 + 0.012 + 0.016 + 0.064 = 0.4 Denominator can be viewed as a normalization constant α P(Cavity toothache) = α P(Cavity, toothache) = α[p(cavity, toothache, catch) + P(Cavity, toothache, catch)] = α[ 0.108, 0.016 + 0.012, 0.064 ] = α 0.12,0.08 = 0.6,0.4 Independence A and B are independent iff P(A B) = P(A) or P(B A) = P(B) or P(A, B) = P(A)P(B) P(Toothache,Catch,Cavity,Weather) = P(Toothache,Catch,Cavity)P(Weather) 32 entries reduced to 12 Absolute independence powerful but rare Conditional independence If I have a cavity, the probability that the probe catches in it doesn t depend on whether I have a toothache: (1) P(catch toothache, cavity) = P(catch cavity) The same independence holds if I haven t got a cavity: (2) P(catch toothache, cavity) = P(catch cavity) Catch is conditionally independent of Toothache given Cavity: P(Catch Toothache,Cavity) = P(Catch Cavity) Equivalent statements: P(Toothache Catch,Cavity) = P(Toothache Cavity) P(Toothache,Catch Cavity) = P(Toothache Cavity)P(Catch Cavity)

Conditional independence contd. Write out full joint distribution using chain rule: P(Toothache,Catch,Cavity) = P(Toothache Catch,Cavity)P(Catch,Cavity) = P(Toothache Catch,Cavity)P(Catch Cavity)P(Cavity) = P(Toothache Cavity)P(Catch Cavity)P(Cavity) I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2) In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. Conditional independence is our most basic and robust form of knowledge about uncertain environments. Bayes Rule Product rule P(a b) = P(a b)p(b) = P(b a)p(a) or in distribution form = Bayes rule P(a b) = P(b a)p(a) P(b) P(Y X) = P(X Y )P(Y ) P(X) = αp(x Y )P(Y ) Bayes Rule and conditional independence Bayes Rule Useful for assessing diagnostic probability from causal probability: P(Cause E f f ect) = E.g., let M be meningitis, S be stiff neck: P(m s) = P(s m)p(m) P(s) P(E f f ect Cause)P(Cause) P(E f f ect) = 0.8 0.0001 0.1 = 0.0008 P(Cavity toothache catch) = α P(toothache catch Cavity)P(Cavity) = α P(toothache Cavity)P(catch Cavity)P(Cavity) This is an example of a naive Bayes model: P(Cause,E f f ect 1,...,E f f ect n ) = P(Cause) P(E f f ect i Cause) i Note: posterior probability of meningitis still very small! Total number of parameters is linear in n

Wumpus World a pit causes breezes in the neighboring cells all cells but [1,1] may contain a pit with equal probability P i j =true iff [i, j] contains a pit B i j =true iff [i, j] is breezy Include only B 1,1,B 1,2,B 2,1 in the probability model Specifying the probability model The full joint distribution is P(P 1,1,...,P 4,4,B 1,1,B 1,2,B 2,1 ) Apply product rule: P(B 1,1,B 1,2,B 2,1 P 1,1,...,P 4,4 )P(P 1,1,...,P 4,4 ) (Do it this way to get P(E f f ect Cause).) First term: 1 if pits are adjacent to breezes, 0 otherwise Second term: pits are placed randomly, probability 0.2 per square: for n pits. P(P 1,1,...,P 4,4 ) = 4,4 i, j =1,1 P(P i, j ) = 0.2 n 0.8 16 n We know the following facts: b = b 1,1 b 1,2 b 2,1 known = p 1,1 p 1,2 p 2,1 Query is P(P 1,3 known,b) Observations and query Define Unknown = P i j s other than P 1,3 and Known For inference by enumeration, we have P(P 1,3 known,b) = α unknown P(P 1,3,unknown,known,b) Using conditional independence Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares Define Unknown = Fringe Other P(b P 1,3,Known,Unknown) = P(b P 1,3,Known,Fringe) Manipulate query into a form where we can use this! Grows exponentially with number of squares!

P(P 1,3 known,b) = α Using conditional independence contd. unknown P(P 1,3,unknown,known,b) = α unknown P(b P 1,3,known,unknown)P(P 1,3,known,unknown) = α P(b known,p 1,3, f ringe,other)p(p 1,3,known, f ringe,other) = α f ringe other f ringe other = α f ringe = α f ringe P(b known,p 1,3, f ringe)p(p 1,3,known, f ringe,other) P(b known,p 1,3, f ringe) P(P 1,3,known, f ringe,other) other P(b known,p 1,3, f ringe) P(P 1,3 )P(known)P( f ringe)p(other) other = αp(known)p(p 1,3 ) f ringe P(b known,p 1,3, f ringe)p( f ringe) P(other) other = α P(P 1,3 ) P(b known,p 1,3, f ringe)p( f ringe) f ringe Using conditional independence contd. P(P 1,3 known,b) = α 0.2(0.04 + 0.16 + 0.16), 0.8(0.04 + 0.16) 0.31, 0.69 P(P 2,2 known,b) 0.86,0.14 Summary Probability is a rigorous formalism for uncertain knowledge Joint probability distribution specifies probability of every atomic event Queries can be answered by summing over atomic events For nontrivial domains, we must find a way to reduce the joint size Independence and conditional independence provide the tools PROBABILISTIC REASONING Danica Kragic September 23, 2005

Bayesian networks Simple, graphical notation for conditional independence assertions Compact specification of full joint distributions Syntax: * a set of nodes, one per variable * a directed, acyclic graph (link directly influences ) * a conditional distribution for each node given its parents: P(X i Parents(X i )) In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over X i for each combination of parent values Example Topology of network encodes conditional independence assertions: Weather is independent of the other variables Toothache and Catch are conditionally independent given Cavity Example Example contd. I m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn t call. Sometimes it s set off by minor earthquakes. Is there a burglar? Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls Network topology reflects causal knowledge: A burglar can start the alarm An earthquake can start the alarm The alarm can cause Mary to call The alarm can cause John to call

Compactness A CPT for Boolean X i with k Boolean parents has 2 k rows for the combinations of parent values Each row requires one number p for X i =true (the number for X i = f alse is just 1 p) If each variable has no more than k parents, the complete network requires O(n 2 k ) numbers I.e., grows linearly with n, vs. O(2 n ) for the full joint distribution For burglary net, 1 + 1 + 4 + 2 + 2=10 numbers (vs. 2 5 1 = 31) Global semantics Global semantics defines the full joint distribution as the product of the local conditional distributions: e.g., P( j m a b e) P(x 1,...,x n ) = n i=1 P(x i parents(x i )) = P( j a)p(m a)p(a b, e)p( b)p( e) = 0.9 0.7 0.001 0.999 0.998 0.00063 Numerical and topological semantics (left) Local semantics: each node is conditionally independent of its nondescendants given its parents (right) Each node is conditionally independent of all others given its Markov blanket: parents + children + children s parents Constructing Bayesian networks Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics 1. Choose an ordering of variables X 1,...,X n 2. For i = 1 to n add X i to the network select parents from X 1,...,X i 1 such that P(X i Parents(X i )) = P(X i X 1,..., X i 1 ) This choice of parents guarantees the global semantics: P(X 1,...,X n ) = = n i=1 n i=1 P(X i X 1,..., X i 1 ) P(X i Parents(X i )) (chain rule) (by construction)

Example Suppose we choose the ordering M, J, A, B, E Compact conditional distributions CPT grows exponentially with number of parents CPT becomes infinite with continuous-valued parent or child Solution: canonical distributions that are defined compactly Deterministic nodes are the simplest case: X = f (Parents(X)) for some function f E.g., Boolean functions NorthAmerican Canadian US Mexican Hybrid (discrete+continuous) networks Discrete (Subsidy? and Buys?); continuous (Harvest and Cost) Option 1: discretization possibly large errors, large CPTs Option 2: finitely parameterized canonical families 1) Continuous variable, discrete+continuous parents (e.g., Cost) 2) Discrete variable, continuous parents (e.g., Buys?) Summary Bayes nets provide a natural representation for (causally induced) conditional independence Topology + CPTs = compact representation of joint distribution Generally easy for (non)experts to construct Canonical distributions = compact representation of CPTs Continuous variables = parameterized distributions (e.g., linear Gaussian)

Inference in Bayesian networks Exact inference by enumeration Exact inference by variable elimination Approximate inference by stochastic simulation Approximate inference by Markov chain Monte Carlo Inference tasks compute posterior probability distribution for a set of query variables, given some observed event P(X e) Inference by enumeration Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation Simple query on the burglary network: P(B j,m) = = P(B, j,m)/p( j,m)= = αp(b, j,m)= = α e a P(B,e,a, j,m) Rewrite full joint entries using product of CPT entries: P(B j,m) = α e a P(B)P(e)P(a B,e)P( j a)p(m a) = αp(b) e P(e) a P(a B,e)P( j a)p(m a) Recursive depth-first enumeration Evaluation tree Enumeration is inefficient: repeated computation e.g., computes P( j a)p(m a) for each value of e

Complexity of exact inference Inference by variable elimination Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation a form of dynamic programming Singly connected networks (or polytrees): any two nodes are connected by at most one (undirected) path time and space complexity of exact inference in linear in the size of network Multiply connected networks: variable elimination may have exponential time and space complexity in the worst case Steps: Approximate inference by stochastic simulation 1) Draw N samples from a sampling distribution S 2) Compute an approximate posterior probability ˆP 3) Show this converges to the true probability P Approaches: Sampling from an empty network Rejection sampling: reject samples disagreeing with evidence Likelihood weighting: use evidence to weight samples Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior Example - sampling from an empty network sample each variable in turn in topological order

Example Example Example Example

Example Example Sampling from an empty network contd. Probability that PRIORSAMPLE generates a particular event S PS (x 1...x n ) = n i=1 P(x i parents(x i )) = P(x 1...x n ) i.e., the true prior probability E.g., S PS (t, f,t,t) = 0.5 0.9 0.8 0.9 = 0.324 = P(t, f,t,t) Let N PS (x 1...x n ) be the number of samples generated for event x 1,...,x n Then we have lim ˆP(x 1,...,x n ) = lim N PS (x 1,...,x n )/N N N = S PS (x 1,...,x n ) = P(x 1...x n ) Other approximate inference approaches Rejection sampling: ˆP(X e) estimated from samples agreeing with e - generates samples given the prior distribution specified by the network - all samples that do not match the evidence are rejected Likelihood weighting: - generates only samples that are consistent with the evidence - uses all the generated samles Shorthand: ˆP(x 1,...,x n ) P(x 1...x n )

The Markov chain With Sprinkler =true,wetgrass =true, there are four states: Approximate inference using MCMC Markov Chain Monte Carlo State of network = current assignment to all variables. Generate next state by sampling one variable given Markov blanket; Sample each variable in turn, keeping evidence fixed Can also choose a variable to sample at random each time Wander about for a while, average what you see PROBABILISTIC REASONING OVER TIME Danica Kragic Time and uncertainty The world changes; we need to track and predict it Diabetes management vs vehicle diagnosis Basic idea: copy state and evidence variables for each time step X t = set of unobservable state variables at time t e.g., BloodSugar t, StomachContents t, etc. E t = set of observable evidence variables at time t e.g., MeasuredBloodSugar t, PulseRate t, FoodEaten t This assumes discrete time; step size depends on problem Notation: X a:b = X a,x a+1,...,x b 1,X b September 23, 2005

Markov processes (Markov chains) Construct a Bayes net from these variables: parents? Markov assumption: X t depends on bounded subset of X 0:t 1 First-order Markov process: P(X t X 0:t 1 ) = P(X t X t 1 ) Second-order Markov process: P(X t X 0:t 1 ) = P(X t X t 2,X t 1 ) Sensor Markov assumption: P(E t X 0:t,E 0:t 1 ) = P(E t X t ) Stationary process: transition model P(X t X t 1 ) and sensor model P(E t X t ) fixed for all t Example First-order Markov assumption not exactly true in real world! Possible fixes: 1. Increase order of Markov process 2. Augment state, e.g., add Temp t, Pressure t Example: robot motion. Augment position and velocity with Battery t Inference tasks Filtering Filtering: P(X t e 1:t ) belief state input to the decision process of a rational agent Prediction: P(X t+k e 1:t ) for k > 0 evaluation of possible action sequences; like filtering without the evidence Smoothing: P(X k e 1:t ) for 0 k < t better estimate of past states, essential for learning Most likely explanation: argmaxx 1:t P(x 1:t e 1:t ) speech recognition, decoding with a noisy channel Aim: devise a recursive state estimation algorithm: P(X t+1 e 1:t+1 ) = f (e t+1,p(x t e 1:t )) P(X t+1 e 1:t+1 ) = P(X t+1 e 1:t,e t+1 ) = αp(e t+1 X t+1,e 1:t )P(X t+1 e 1:t ) = αp(e t+1 X t+1 )P(X t+1 e 1:t )

Filtering Filtering example I.e., prediction + estimation. Prediction by summing out X t : P(X t+1 e 1:t+1 ) = αp(e t+1 X t+1 ) xt P(X t+1 x t,e 1:t )P(x t e 1:t ) = αp(e t+1 X t+1 ) xt P(X t+1 x t )P(x t e 1:t ) f 1:t+1 = FORWARD(f 1:t,e t+1 ) where f 1:t =P(X t e 1:t ) Time and space constant (independent of t) Smoothing Smoothing computing the distribution over past states given evidence up to the present Backward message computed by a backwards recursion: Divide evidence e 1:t into e 1:k, e k+1:t : P(X k e 1:t ) = P(X k e 1:k,e k+1:t ) = αp(x k e 1:k )P(e k+1:t X k,e 1:k ) = αp(x k e 1:k )P(e k+1:t X k ) = αf 1:k b k+1:t P(e k+1:t X k ) = xk+1 P(e k+1:t X k,x k+1 )P(x k+1 X k ) = xk+1 P(e k+1:t x k+1 )P(x k+1 X k ) = xk+1 P(e k+1 x k+1 )P(e k+2:t x k+1 )P(x k+1 X k )

Smoothing example Forward backward algorithm: cache forward messages along the way O(t f ) Most likely explanation Most likely sequence sequence of most likely states!!!! Most likely path to each x t+1 = most likely path to some x t plus one more step max P(x x 1...x 1,...,x t,x t+1 e 1:t+1 ) t = P(e t+1 X t+1 )max x t ( P(X t+1 x t ) max x 1...x t 1 P(x 1,...,x t 1,x t e 1:t ) ) Most likely explanation Viterbi example Identical to filtering, except f 1:t replaced by m 1:t = max x 1...x t 1 P(x 1,...,x t 1,X t e 1:t ), I.e., m 1:t (i) gives the probability of the most likely path to state i. Update has sum replaced by max, giving the Viterbi algorithm: m 1:t+1 = P(e t+1 X t+1 )max x t (P(X t+1 x t )m 1:t )

Hidden Markov models - X t is a single, discrete variable (usually E t is too) Domain of X t is {1,...,S} ( ) 0.7 0.3 - Transition matrix T i j = P(X t = j X t 1 =i), e.g., 0.3 0.7 - Sensor matrix O t for each ( time step, ) diagonal elements P(e t X t =i) 0.9 0 e.g., with U 1 =true, O 1 = 0 0.2 - Forward and backward messages as column vectors: Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i f 1:t+1 = αo t+1 T f 1:t b k+1:t = TO k+1 b k+2:t Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i

Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i

Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Summary Temporal models use state and sensor variables replicated over time Markov assumptions and stationarity assumption, so we need transition modelp(x t X t 1 ) sensor model P(E t X t ) Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step