PROBABILISTIC REASONING Outline

Similar documents
COMP9414/ 9814/ 3411: Artificial Intelligence. 14. Uncertainty. Russell & Norvig, Chapter 13. UNSW c AIMA, 2004, Alan Blair, 2012

Uncertainty and Bayesian Networks

Uncertainty. Chapter 13, Sections 1 6

Uncertainty. Chapter 13

Probabilistic Reasoning

Pengju XJTU 2016

Uncertainty. Outline. Probability Syntax and Semantics Inference Independence and Bayes Rule. AIMA2e Chapter 13

Outline. Uncertainty. Methods for handling uncertainty. Uncertainty. Making decisions under uncertainty. Probability. Uncertainty

Uncertain Knowledge and Reasoning

Uncertainty. Chapter 13. Chapter 13 1

Artificial Intelligence

Uncertainty. Outline

Artificial Intelligence Uncertainty

Objectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II.

Probabilistic Reasoning Systems

CS 561: Artificial Intelligence

Uncertainty. Chapter 13

Bayesian Network. Outline. Bayesian Network. Syntax Semantics Exact inference by enumeration Exact inference by variable elimination

Uncertainty. Introduction to Artificial Intelligence CS 151 Lecture 2 April 1, CS151, Spring 2004

Uncertainty. 22c:145 Artificial Intelligence. Problem of Logic Agents. Foundations of Probability. Axioms of Probability

Chapter 13 Quantifying Uncertainty

Quantifying uncertainty & Bayesian networks

Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science

Bayesian networks. Chapter 14, Sections 1 4

Pengju

Hidden Markov models 1

Bayesian networks. Chapter Chapter

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Probabilistic representation and reasoning

Graphical Models - Part I

Probabilistic Robotics

Course Overview. Summary. Outline

Uncertainty (Chapter 13, Russell & Norvig) Introduction to Artificial Intelligence CS 150 Lecture 14

Bayesian Networks. Philipp Koehn. 6 April 2017

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS

PROBABILISTIC REASONING OVER TIME

Web-Mining Agents Data Mining

CS Belief networks. Chapter

CS 380: ARTIFICIAL INTELLIGENCE UNCERTAINTY. Santiago Ontañón

PROBABILISTIC REASONING SYSTEMS

Basic Probability and Decisions

Bayesian networks. Chapter Chapter

Bayesian networks. Chapter AIMA2e Slides, Stuart Russell and Peter Norvig, Completed by Kazim Fouladi, Fall 2008 Chapter 14.

Bayesian Networks. Philipp Koehn. 29 October 2015

Uncertainty. Logic and Uncertainty. Russell & Norvig. Readings: Chapter 13. One problem with logical-agent approaches: C:145 Artificial

Bayesian networks. Chapter Chapter Outline. Syntax Semantics Parameterized distributions. Chapter

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Probabilistic representation and reasoning

This lecture. Reading. Conditional Independence Bayesian (Belief) Networks: Syntax and semantics. Chapter CS151, Spring 2004

Quantifying Uncertainty & Probabilistic Reasoning. Abdulla AlKhenji Khaled AlEmadi Mohammed AlAnsari

Temporal probability models. Chapter 15

Uncertainty and Belief Networks. Introduction to Artificial Intelligence CS 151 Lecture 1 continued Ok, Lecture 2!

Resolution or modus ponens are exact there is no possibility of mistake if the rules are followed exactly.

Bayesian networks: Modeling

Introduction to Artificial Intelligence Belief networks

Brief Intro. to Bayesian Networks. Extracted and modified from four lectures in Intro to AI, Spring 2008 Paul S. Rosenbloom

Probabilistic Models

Bayesian networks. Soleymani. CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018

Bayesian Networks BY: MOHAMAD ALSABBAGH

14 PROBABILISTIC REASONING

UNCERTAINTY. In which we see what an agent should do when not all is crystal-clear.

Uncertainty (Chapter 13, Russell & Norvig)

Uncertainty. Russell & Norvig Chapter 13.

An AI-ish view of Probability, Conditional Probability & Bayes Theorem

10/18/2017. An AI-ish view of Probability, Conditional Probability & Bayes Theorem. Making decisions under uncertainty.

Review: Bayesian learning and inference

ARTIFICIAL INTELLIGENCE. Uncertainty: probabilistic reasoning

Dynamic Bayesian Networks and Hidden Markov Models Decision Trees

Belief Networks for Probabilistic Inference

Reasoning with Uncertainty. Chapter 13

Informatics 2D Reasoning and Agents Semester 2,

Lecture 10: Bayesian Networks and Inference

CS 188: Artificial Intelligence Fall 2009

Artificial Intelligence

Probabilistic Models. Models describe how (a portion of) the world works

Uncertainty. CmpE 540 Principles of Artificial Intelligence Pınar Yolum Uncertainty. Sources of Uncertainty

Inference in Bayesian Networks

Inference in Bayesian networks

CSE 473: Artificial Intelligence Autumn 2011

Cartesian-product sample spaces and independence

CS 5522: Artificial Intelligence II

CS 5100: Founda.ons of Ar.ficial Intelligence

School of EECS Washington State University. Artificial Intelligence

CS 188: Artificial Intelligence Spring Announcements

Bayesian networks (1) Lirong Xia

Bayesian Networks. Motivation

Course Introduction. Probabilistic Modelling and Reasoning. Relationships between courses. Dealing with Uncertainty. Chris Williams.

Stochastic inference in Bayesian networks, Markov chain Monte Carlo methods

Reasoning Under Uncertainty

CS 343: Artificial Intelligence

Y. Xiang, Inference with Uncertain Knowledge 1

Probability. CS 3793/5233 Artificial Intelligence Probability 1

Outline } Conditional independence } Bayesian networks: syntax and semantics } Exact inference } Approximate inference AIMA Slides cstuart Russell and

Ch.6 Uncertain Knowledge. Logic and Uncertainty. Representation. One problem with logical approaches: Department of Computer Science

Bayesian Networks. Vibhav Gogate The University of Texas at Dallas

Fusion in simple models

Bayesian Networks. Vibhav Gogate The University of Texas at Dallas

CS 188: Artificial Intelligence. Our Status in CS188

Basic Probability. Robert Platt Northeastern University. Some images and slides are used from: 1. AIMA 2. Chris Amato

Transcription:

PROBABILISTIC REASONING Outline Danica Kragic Uncertainty Probability Syntax and Semantics Inference Independence and Bayes Rule September 23, 2005 Uncertainty - Let action A t = leave for airport t minutes before flight - Will A t get me there on time? Problems: 1) partial observability (road state, other drivers plans, etc.) 2) noisy sensors (KCBS traffic reports) 3) uncertainty in action outcomes (flat tire, etc.) 4) immense complexity of modelling and predicting traffic A purely logical approach either Uncertainty 1) false: A 25 will get me there on time, or 2) leads to conclusions that are too weak for decision making: A 25 will get me there on time if there s no accident on the bridge and it doesn t rain and my tires remain intact etc etc. (A 1440 might reasonably be said to get me there on time but I d have to stay overnight in the airport...)

Methods for handling uncertainty - Default or nonmonotonic logic: * Assume my car does not have a flat tire * Assume A 25 works unless contradicted by evidence Issues: What assumptions are reasonable? How to handle contradiction? - Rules: A 25 0.3 AtAirportOnTime Sprinkler 0.99 WetGrass WetGrass 0.7 Rain Issues: Problems with combination, e.g., Sprinkler causes Rain?? - Probability Given the available evidence, A 25 will get me there on time with probability 0.04 Subjective or Bayesian probability: Probability Probabilities relate propositions to one s own state of knowledge e.g., P(A 25 no reported accidents) = 0.06 These are not claims of a probabilistic tendency in the current situation (but might be learned from past experience of similar situations) Probabilities of propositions change with new evidence: e.g., P(A 25 no reported accidents, 5 a.m.) = 0.15 Suppose I believe the following: Which action to choose? Making decisions under uncertainty P(A 25 gets me there on time...) = 0.04 P(A 90 gets me there on time...) = 0.70 P(A 120 gets me there on time...) = 0.95 P(A 1440 gets me there on time...) = 0.9999 Depends on my preferences for missing flight vs. airport cuisine, etc. Utility theory is used to represent and infer preferences Decision theory = utility theory + probability theory Probability basics A set Ω the sample space e.g., 6 possible rolls of a die. ω Ω is a sample point/atomic event A probability space or probability model is a sample space with an assignment P(ω) for every ω Ω s.t. 0 P(ω) 1, ω P(ω) = 1 e.g., P(1)=P(2)=P(3)=P(4)=P(5)=P(6)=1/6. An event A is any subset of Ω P(A) = P(ω) {ω A} E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2

Propositions Random variables A random variable is a function from sample points to some range, e.g., the reals or Booleans e.g., Odd(1) =true. P induces a probability distribution for any r.v. X: P(X =x i ) = {ω:x(ω)=x i } P(ω) e.g., P(Odd =true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2 Think of a proposition as the event (set of sample points) where the proposition is true Given Boolean random variables A and B: event a = set of sample points where A(ω)=true event a = set of sample points where A(ω)= f alse event a b = points where A(ω)=true and B(ω)=true The sample points often defined by the values of a set of random variables, i.e., the sample space is the Cartesian product of the ranges of the variables Boolean variables: sample point = propositional logic model e.g., A=true, B= f alse, or a b Proposition = disjunction of atomic events in which it is true e.g., (a b) ( a b) (a b) (a b) = P(a b) = P( a b) + P(a b) + P(a b) Prior probability Syntax for propositions Propositional or Boolean random variables e.g., Cavity (do I have a cavity?) Cavity =true is a proposition, also written cavity Discrete random variables (finite or infinite) e.g., Weather is one of sunny, rain, cloudy, snow Weather = rain is a proposition Values must be exhaustive and mutually exclusive Continuous random variables (bounded or unbounded) e.g., Temp=21.6; also allow, e.g., Temp < 22.0. Prior or unconditional probabilities of propositions e.g., P(Cavity=true) = 0.1 and P(Weather =sunny) = 0.72 correspond to belief prior to arrival of any (new) evidence Probability distribution gives values for all possible assignments: P(Weather) = 0.72, 0.1, 0.08, 0.1 (normalized, i.e., sums to 1) Joint probability distribution for a set of values gives the probability of every atomic event on those values P(Weather,Cavity) = a 4 2 matrix of values: Weather = sunny rain cloudy snow Cavity =true 0.144 0.02 0.016 0.02 Cavity = f alse 0.576 0.08 0.064 0.08 Every question about a domain can be answered by the joint distribution because every event is a sum of sample points

Conditional probability Conditional or posterior probabilities e.g., P(cavity toothache) = 0.8 i.e., given that toothache is all I know NOT if toothache then 80% chance of cavity Notation for conditional distributions: P(Cavity Toothache) = 2-element vector of 2-element vectors Note: the less specific belief remains valid after more evidence arrives, but is not always useful New evidence may be irrelevant, allowing simplification, e.g., P(cavity toothache, sunny) = P(cavity toothache) = 0.8 This kind of inference, sanctioned by domain knowledge, is crucial Definition of conditional probability: Conditional probability P(a b) = P(a b) P(b) Product rule gives an alternative formulation: P(a b) = P(a b)p(b) = P(b a)p(a) if P(b) 0 A general version holds for whole distributions, e.g., P(Weather,Cavity) = P(Weather Cavity)P(Cavity) View as 4 2 set of equations, not matrix mult Chain rule is derived by successive application of product rule: P(X 1,...,X n ) = P(X 1,...,X n 1 ) P(X n X 1,...,X n 1 ) = P(X 1,...,X n 2 ) P(X n 1 X 1,...,X n 2 ) P(X n X 1,...,X n 1 ) =... = n i=1 P(X i X 1,...,X i 1 ) Start with the joint distribution: Inference by enumeration Start with the joint distribution: Inference by enumeration For any proposition φ, sum the atomic events where it is true: P(φ) =Σ ω:ω =φ P(ω) For any proposition φ, sum the atomic events where it is true: P(φ) = ω:ω =φ P(ω) P(cavity toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

Inference by enumeration Normalization Start with the joint distribution: Computing conditional probabilities: P( cavity toothache) P( cavity toothache) = P(toothache) 0.016 + 0.064 = 0.108 + 0.012 + 0.016 + 0.064 = 0.4 Denominator can be viewed as a normalization constant α P(Cavity toothache) = α P(Cavity, toothache) = α[p(cavity, toothache, catch) + P(Cavity, toothache, catch)] = α[ 0.108, 0.016 + 0.012, 0.064 ] = α 0.12,0.08 = 0.6,0.4 Independence A and B are independent iff P(A B) = P(A) or P(B A) = P(B) or P(A, B) = P(A)P(B) P(Toothache,Catch,Cavity,Weather) = P(Toothache,Catch,Cavity)P(Weather) 32 entries reduced to 12 Absolute independence powerful but rare Conditional independence If I have a cavity, the probability that the probe catches in it doesn t depend on whether I have a toothache: (1) P(catch toothache, cavity) = P(catch cavity) The same independence holds if I haven t got a cavity: (2) P(catch toothache, cavity) = P(catch cavity) Catch is conditionally independent of Toothache given Cavity: P(Catch Toothache,Cavity) = P(Catch Cavity) Equivalent statements: P(Toothache Catch,Cavity) = P(Toothache Cavity) P(Toothache,Catch Cavity) = P(Toothache Cavity)P(Catch Cavity)

Conditional independence contd. Write out full joint distribution using chain rule: P(Toothache,Catch,Cavity) = P(Toothache Catch,Cavity)P(Catch,Cavity) = P(Toothache Catch,Cavity)P(Catch Cavity)P(Cavity) = P(Toothache Cavity)P(Catch Cavity)P(Cavity) I.e., 2 + 2 + 1 = 5 independent numbers (equations 1 and 2 remove 2) In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n. Conditional independence is our most basic and robust form of knowledge about uncertain environments. Bayes Rule Product rule P(a b) = P(a b)p(b) = P(b a)p(a) or in distribution form = Bayes rule P(a b) = P(b a)p(a) P(b) P(Y X) = P(X Y )P(Y ) P(X) = αp(x Y )P(Y ) Bayes Rule and conditional independence Bayes Rule Useful for assessing diagnostic probability from causal probability: P(Cause E f f ect) = E.g., let M be meningitis, S be stiff neck: P(m s) = P(s m)p(m) P(s) P(E f f ect Cause)P(Cause) P(E f f ect) = 0.8 0.0001 0.1 = 0.0008 P(Cavity toothache catch) = α P(toothache catch Cavity)P(Cavity) = α P(toothache Cavity)P(catch Cavity)P(Cavity) This is an example of a naive Bayes model: P(Cause,E f f ect 1,...,E f f ect n ) = P(Cause) P(E f f ect i Cause) i Note: posterior probability of meningitis still very small! Total number of parameters is linear in n

Wumpus World a pit causes breezes in the neighboring cells all cells but [1,1] may contain a pit with equal probability P i j =true iff [i, j] contains a pit B i j =true iff [i, j] is breezy Include only B 1,1,B 1,2,B 2,1 in the probability model Specifying the probability model The full joint distribution is P(P 1,1,...,P 4,4,B 1,1,B 1,2,B 2,1 ) Apply product rule: P(B 1,1,B 1,2,B 2,1 P 1,1,...,P 4,4 )P(P 1,1,...,P 4,4 ) (Do it this way to get P(E f f ect Cause).) First term: 1 if pits are adjacent to breezes, 0 otherwise Second term: pits are placed randomly, probability 0.2 per square: for n pits. P(P 1,1,...,P 4,4 ) = 4,4 i, j =1,1 P(P i, j ) = 0.2 n 0.8 16 n We know the following facts: b = b 1,1 b 1,2 b 2,1 known = p 1,1 p 1,2 p 2,1 Query is P(P 1,3 known,b) Observations and query Define Unknown = P i j s other than P 1,3 and Known For inference by enumeration, we have P(P 1,3 known,b) = α unknown P(P 1,3,unknown,known,b) Using conditional independence Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares Define Unknown = Fringe Other P(b P 1,3,Known,Unknown) = P(b P 1,3,Known,Fringe) Manipulate query into a form where we can use this! Grows exponentially with number of squares!

P(P 1,3 known,b) = α Using conditional independence contd. unknown P(P 1,3,unknown,known,b) = α unknown P(b P 1,3,known,unknown)P(P 1,3,known,unknown) = α P(b known,p 1,3, f ringe,other)p(p 1,3,known, f ringe,other) = α f ringe other f ringe other = α f ringe = α f ringe P(b known,p 1,3, f ringe)p(p 1,3,known, f ringe,other) P(b known,p 1,3, f ringe) P(P 1,3,known, f ringe,other) other P(b known,p 1,3, f ringe) P(P 1,3 )P(known)P( f ringe)p(other) other = αp(known)p(p 1,3 ) f ringe P(b known,p 1,3, f ringe)p( f ringe) P(other) other = α P(P 1,3 ) P(b known,p 1,3, f ringe)p( f ringe) f ringe Using conditional independence contd. P(P 1,3 known,b) = α 0.2(0.04 + 0.16 + 0.16), 0.8(0.04 + 0.16) 0.31, 0.69 P(P 2,2 known,b) 0.86,0.14 Summary Probability is a rigorous formalism for uncertain knowledge Joint probability distribution specifies probability of every atomic event Queries can be answered by summing over atomic events For nontrivial domains, we must find a way to reduce the joint size Independence and conditional independence provide the tools PROBABILISTIC REASONING Danica Kragic September 23, 2005

Bayesian networks Simple, graphical notation for conditional independence assertions Compact specification of full joint distributions Syntax: * a set of nodes, one per variable * a directed, acyclic graph (link directly influences ) * a conditional distribution for each node given its parents: P(X i Parents(X i )) In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over X i for each combination of parent values Example Topology of network encodes conditional independence assertions: Weather is independent of the other variables Toothache and Catch are conditionally independent given Cavity Example Example contd. I m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn t call. Sometimes it s set off by minor earthquakes. Is there a burglar? Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls Network topology reflects causal knowledge: A burglar can start the alarm An earthquake can start the alarm The alarm can cause Mary to call The alarm can cause John to call

Compactness A CPT for Boolean X i with k Boolean parents has 2 k rows for the combinations of parent values Each row requires one number p for X i =true (the number for X i = f alse is just 1 p) If each variable has no more than k parents, the complete network requires O(n 2 k ) numbers I.e., grows linearly with n, vs. O(2 n ) for the full joint distribution For burglary net, 1 + 1 + 4 + 2 + 2=10 numbers (vs. 2 5 1 = 31) Global semantics Global semantics defines the full joint distribution as the product of the local conditional distributions: e.g., P( j m a b e) P(x 1,...,x n ) = n i=1 P(x i parents(x i )) = P( j a)p(m a)p(a b, e)p( b)p( e) = 0.9 0.7 0.001 0.999 0.998 0.00063 Numerical and topological semantics (left) Local semantics: each node is conditionally independent of its nondescendants given its parents (right) Each node is conditionally independent of all others given its Markov blanket: parents + children + children s parents Constructing Bayesian networks Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics 1. Choose an ordering of variables X 1,...,X n 2. For i = 1 to n add X i to the network select parents from X 1,...,X i 1 such that P(X i Parents(X i )) = P(X i X 1,..., X i 1 ) This choice of parents guarantees the global semantics: P(X 1,...,X n ) = = n i=1 n i=1 P(X i X 1,..., X i 1 ) P(X i Parents(X i )) (chain rule) (by construction)

Example Suppose we choose the ordering M, J, A, B, E Compact conditional distributions CPT grows exponentially with number of parents CPT becomes infinite with continuous-valued parent or child Solution: canonical distributions that are defined compactly Deterministic nodes are the simplest case: X = f (Parents(X)) for some function f E.g., Boolean functions NorthAmerican Canadian US Mexican Hybrid (discrete+continuous) networks Discrete (Subsidy? and Buys?); continuous (Harvest and Cost) Option 1: discretization possibly large errors, large CPTs Option 2: finitely parameterized canonical families 1) Continuous variable, discrete+continuous parents (e.g., Cost) 2) Discrete variable, continuous parents (e.g., Buys?) Summary Bayes nets provide a natural representation for (causally induced) conditional independence Topology + CPTs = compact representation of joint distribution Generally easy for (non)experts to construct Canonical distributions = compact representation of CPTs Continuous variables = parameterized distributions (e.g., linear Gaussian)

Inference in Bayesian networks Exact inference by enumeration Exact inference by variable elimination Approximate inference by stochastic simulation Approximate inference by Markov chain Monte Carlo Inference tasks compute posterior probability distribution for a set of query variables, given some observed event P(X e) Inference by enumeration Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation Simple query on the burglary network: P(B j,m) = = P(B, j,m)/p( j,m)= = αp(b, j,m)= = α e a P(B,e,a, j,m) Rewrite full joint entries using product of CPT entries: P(B j,m) = α e a P(B)P(e)P(a B,e)P( j a)p(m a) = αp(b) e P(e) a P(a B,e)P( j a)p(m a) Recursive depth-first enumeration Evaluation tree Enumeration is inefficient: repeated computation e.g., computes P( j a)p(m a) for each value of e

Complexity of exact inference Inference by variable elimination Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation a form of dynamic programming Singly connected networks (or polytrees): any two nodes are connected by at most one (undirected) path time and space complexity of exact inference in linear in the size of network Multiply connected networks: variable elimination may have exponential time and space complexity in the worst case Steps: Approximate inference by stochastic simulation 1) Draw N samples from a sampling distribution S 2) Compute an approximate posterior probability ˆP 3) Show this converges to the true probability P Approaches: Sampling from an empty network Rejection sampling: reject samples disagreeing with evidence Likelihood weighting: use evidence to weight samples Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior Example - sampling from an empty network sample each variable in turn in topological order

Example Example Example Example

Example Example Sampling from an empty network contd. Probability that PRIORSAMPLE generates a particular event S PS (x 1...x n ) = n i=1 P(x i parents(x i )) = P(x 1...x n ) i.e., the true prior probability E.g., S PS (t, f,t,t) = 0.5 0.9 0.8 0.9 = 0.324 = P(t, f,t,t) Let N PS (x 1...x n ) be the number of samples generated for event x 1,...,x n Then we have lim ˆP(x 1,...,x n ) = lim N PS (x 1,...,x n )/N N N = S PS (x 1,...,x n ) = P(x 1...x n ) Other approximate inference approaches Rejection sampling: ˆP(X e) estimated from samples agreeing with e - generates samples given the prior distribution specified by the network - all samples that do not match the evidence are rejected Likelihood weighting: - generates only samples that are consistent with the evidence - uses all the generated samles Shorthand: ˆP(x 1,...,x n ) P(x 1...x n )

The Markov chain With Sprinkler =true,wetgrass =true, there are four states: Approximate inference using MCMC Markov Chain Monte Carlo State of network = current assignment to all variables. Generate next state by sampling one variable given Markov blanket; Sample each variable in turn, keeping evidence fixed Can also choose a variable to sample at random each time Wander about for a while, average what you see PROBABILISTIC REASONING OVER TIME Danica Kragic Time and uncertainty The world changes; we need to track and predict it Diabetes management vs vehicle diagnosis Basic idea: copy state and evidence variables for each time step X t = set of unobservable state variables at time t e.g., BloodSugar t, StomachContents t, etc. E t = set of observable evidence variables at time t e.g., MeasuredBloodSugar t, PulseRate t, FoodEaten t This assumes discrete time; step size depends on problem Notation: X a:b = X a,x a+1,...,x b 1,X b September 23, 2005

Markov processes (Markov chains) Construct a Bayes net from these variables: parents? Markov assumption: X t depends on bounded subset of X 0:t 1 First-order Markov process: P(X t X 0:t 1 ) = P(X t X t 1 ) Second-order Markov process: P(X t X 0:t 1 ) = P(X t X t 2,X t 1 ) Sensor Markov assumption: P(E t X 0:t,E 0:t 1 ) = P(E t X t ) Stationary process: transition model P(X t X t 1 ) and sensor model P(E t X t ) fixed for all t Example First-order Markov assumption not exactly true in real world! Possible fixes: 1. Increase order of Markov process 2. Augment state, e.g., add Temp t, Pressure t Example: robot motion. Augment position and velocity with Battery t Inference tasks Filtering Filtering: P(X t e 1:t ) belief state input to the decision process of a rational agent Prediction: P(X t+k e 1:t ) for k > 0 evaluation of possible action sequences; like filtering without the evidence Smoothing: P(X k e 1:t ) for 0 k < t better estimate of past states, essential for learning Most likely explanation: argmaxx 1:t P(x 1:t e 1:t ) speech recognition, decoding with a noisy channel Aim: devise a recursive state estimation algorithm: P(X t+1 e 1:t+1 ) = f (e t+1,p(x t e 1:t )) P(X t+1 e 1:t+1 ) = P(X t+1 e 1:t,e t+1 ) = αp(e t+1 X t+1,e 1:t )P(X t+1 e 1:t ) = αp(e t+1 X t+1 )P(X t+1 e 1:t )

Filtering Filtering example I.e., prediction + estimation. Prediction by summing out X t : P(X t+1 e 1:t+1 ) = αp(e t+1 X t+1 ) xt P(X t+1 x t,e 1:t )P(x t e 1:t ) = αp(e t+1 X t+1 ) xt P(X t+1 x t )P(x t e 1:t ) f 1:t+1 = FORWARD(f 1:t,e t+1 ) where f 1:t =P(X t e 1:t ) Time and space constant (independent of t) Smoothing Smoothing computing the distribution over past states given evidence up to the present Backward message computed by a backwards recursion: Divide evidence e 1:t into e 1:k, e k+1:t : P(X k e 1:t ) = P(X k e 1:k,e k+1:t ) = αp(x k e 1:k )P(e k+1:t X k,e 1:k ) = αp(x k e 1:k )P(e k+1:t X k ) = αf 1:k b k+1:t P(e k+1:t X k ) = xk+1 P(e k+1:t X k,x k+1 )P(x k+1 X k ) = xk+1 P(e k+1:t x k+1 )P(x k+1 X k ) = xk+1 P(e k+1 x k+1 )P(e k+2:t x k+1 )P(x k+1 X k )

Smoothing example Forward backward algorithm: cache forward messages along the way O(t f ) Most likely explanation Most likely sequence sequence of most likely states!!!! Most likely path to each x t+1 = most likely path to some x t plus one more step max P(x x 1...x 1,...,x t,x t+1 e 1:t+1 ) t = P(e t+1 X t+1 )max x t ( P(X t+1 x t ) max x 1...x t 1 P(x 1,...,x t 1,x t e 1:t ) ) Most likely explanation Viterbi example Identical to filtering, except f 1:t replaced by m 1:t = max x 1...x t 1 P(x 1,...,x t 1,X t e 1:t ), I.e., m 1:t (i) gives the probability of the most likely path to state i. Update has sum replaced by max, giving the Viterbi algorithm: m 1:t+1 = P(e t+1 X t+1 )max x t (P(X t+1 x t )m 1:t )

Hidden Markov models - X t is a single, discrete variable (usually E t is too) Domain of X t is {1,...,S} ( ) 0.7 0.3 - Transition matrix T i j = P(X t = j X t 1 =i), e.g., 0.3 0.7 - Sensor matrix O t for each ( time step, ) diagonal elements P(e t X t =i) 0.9 0 e.g., with U 1 =true, O 1 = 0 0.2 - Forward and backward messages as column vectors: Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i f 1:t+1 = αo t+1 T f 1:t b k+1:t = TO k+1 b k+2:t Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i

Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i

Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Country dance algorithm Can avoid storing all forward messages in smoothing by running forward algorithm backwards: f 1:t+1 = αo t+1 T f 1:t Ot+1 1 1:t+1 = αt f 1:t α (T ) 1 Ot+1 1 1:t+1 = f 1:t Algorithm: forward pass computes f t, backward pass does f i, b i Summary Temporal models use state and sensor variables replicated over time Markov assumptions and stationarity assumption, so we need transition modelp(x t X t 1 ) sensor model P(E t X t ) Tasks are filtering, prediction, smoothing, most likely sequence; all done recursively with constant cost per time step