Approximate Inference


Approximate Inference

Simulation has a name: sampling. Sampling is a hot topic in machine learning, and it's really simple.

Basic idea:
- Draw N samples from a sampling distribution S
- Compute an approximate posterior probability
- Show this converges to the true probability P

Why sample?
- Learning: get samples from a distribution you don't know
- Inference: getting a sample is often faster than computing the right answer (e.g., with variable elimination)

This slide deck courtesy of Dan Klein at UC Berkeley.

Prior Sampling

CPTs for the Cloudy / Sprinkler / Rain / WetGrass network:

P(C):       +c 0.5 | -c 0.5
P(S | C):   +c: +s 0.1, -s 0.9 | -c: +s 0.5, -s 0.5
P(R | C):   +c: +r 0.8, -r 0.2 | -c: +r 0.2, -r 0.8
P(W | S,R): +s,+r: +w 0.99, -w 0.01 | +s,-r: +w 0.90, -w 0.10
            -s,+r: +w 0.90, -w 0.10 | -s,-r: +w 0.01, -w 0.99

Samples:
+c, -s, +r, +w
-c, +s, -r, +w

Prior Sampling

This process generates samples with probability

  S_PS(x_1, ..., x_n) = prod_i P(x_i | Parents(X_i)) = P(x_1, ..., x_n)

i.e., the BN's joint probability. Let N_PS(x_1, ..., x_n) be the number of samples of an event. Then

  lim_{N -> inf} P_hat(x_1, ..., x_n) = lim_{N -> inf} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1, ..., x_n)

i.e., the sampling procedure is consistent.

Example

First: get a bunch of samples from the BN:
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w
-c, -s, -r, +w

Example: we want to know P(W). We have counts <+w: 4, -w: 1>; normalize to get the approximate P(W) = <+w: 0.8, -w: 0.2>. This will get closer to the true distribution with more samples. We can estimate anything else, too: what about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)? Fast: can use fewer samples if less time (what's the drawback?)
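As a minimal sketch of the procedure above, here is prior sampling for the Cloudy / Sprinkler / Rain / WetGrass network, using the CPTs from the slides, to estimate P(+w):

```python
import random

# CPTs from the slides: P(C), P(S|C), P(R|C), P(W|S,R), keyed by parent values
P_C = 0.5                                       # P(+c)
P_S = {True: 0.1, False: 0.5}                   # P(+s | C)
P_R = {True: 0.8, False: 0.2}                   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}  # P(+w | S, R)

def prior_sample():
    """Sample every variable in topological order, each given its parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return c, s, r, w

# Estimate P(+w) by counting samples where W = +w
random.seed(0)
N = 100_000
count = sum(1 for _ in range(N) if prior_sample()[3])
print(count / N)   # approaches the true P(+w) = 0.65 as N grows
```

The estimate converges because each full assignment is drawn with exactly its joint probability under the BN.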

Rejection Sampling

Let's say we want P(C). No point keeping all samples around: just tally counts of C as we go.

Let's say we want P(C | +s). Same thing: tally C outcomes, but ignore (reject) samples which don't have S = +s. This is called rejection sampling. It is also consistent for conditional probabilities (i.e., correct in the limit).

+c, -s, +r, +w   (rejected)
+c, +s, +r, +w
-c, +s, +r, -w
+c, -s, +r, +w   (rejected)
-c, -s, -r, +w   (rejected)
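A sketch of rejection sampling for P(+c | +s) on the same network, with the CPTs from the slides; samples whose S value doesn't match the evidence are simply discarded:

```python
import random

def prior_sample():
    """One forward sample from the Cloudy/Sprinkler/Rain/WetGrass net (CPTs from the slides)."""
    c = random.random() < 0.5
    s = random.random() < (0.1 if c else 0.5)
    r = random.random() < (0.8 if c else 0.2)
    w = random.random() < {(True, True): 0.99, (True, False): 0.90,
                           (False, True): 0.90, (False, False): 0.01}[(s, r)]
    return c, s, r, w

def estimate_c_given_s(n):
    """Estimate P(+c | +s): keep n samples with S = +s, rejecting the rest."""
    kept = c_count = 0
    while kept < n:
        c, s, r, w = prior_sample()
        if not s:
            continue          # reject: evidence doesn't match
        kept += 1
        c_count += c
    return c_count / n

random.seed(1)
est = estimate_c_given_s(50_000)
print(est)   # true value is P(+c,+s)/P(+s) = 0.05/0.30 = 1/6
```

Note the drawback the next slide develops: since P(+s) = 0.3, roughly 70% of the work is thrown away, and it gets far worse when the evidence is unlikely.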

Sampling Example There are 2 cups. The first contains 1 penny and 1 quarter The second contains 2 quarters Say I pick a cup uniformly at random, then pick a coin randomly from that cup. It's a quarter (yes!). What is the probability that the other coin in that cup is also a quarter?
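The quiz above can be checked by exactly the kind of sampling this lecture describes. This tiny rejection-sampling simulation conditions on having drawn a quarter and tallies how often the other coin in the same cup is also a quarter:

```python
import random

random.seed(0)
hits = total = 0
for _ in range(100_000):
    cup = random.choice([['penny', 'quarter'], ['quarter', 'quarter']])
    i = random.randrange(2)            # draw one coin at random from the cup
    if cup[i] != 'quarter':
        continue                       # reject: we observed a quarter
    total += 1
    hits += (cup[1 - i] == 'quarter')  # is the other coin also a quarter?
print(hits / total)                    # converges to 2/3, not 1/2
```

Intuitively, observing a quarter makes the all-quarters cup twice as likely as the mixed cup, giving the answer 2/3.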

Likelihood Weighting

Problem with rejection sampling: if the evidence is unlikely, you reject a lot of samples, and you don't exploit your evidence as you sample. Consider P(B | +a):

  -b, -a   (rejected)
  -b, -a   (rejected)
  -b, -a   (rejected)
  -b, -a   (rejected)
  +b, +a

Idea: fix the evidence variables and sample the rest:

  -b, +a
  -b, +a
  -b, +a
  -b, +a
  +b, +a

Problem: the sample distribution is no longer consistent! Solution: weight each sample by the probability of the evidence given its parents.

Likelihood Weighting

Same network and CPTs as before:

P(C):       +c 0.5 | -c 0.5
P(S | C):   +c: +s 0.1, -s 0.9 | -c: +s 0.5, -s 0.5
P(R | C):   +c: +r 0.8, -r 0.2 | -c: +r 0.2, -r 0.8
P(W | S,R): +s,+r: +w 0.99, -w 0.01 | +s,-r: +w 0.90, -w 0.10
            -s,+r: +w 0.90, -w 0.10 | -s,-r: +w 0.01, -w 0.99

Samples: +c, +s, +r, +w

Likelihood Weighting

Sampling distribution if z is sampled and e is fixed evidence:

  S_WS(z, e) = prod_i P(z_i | Parents(Z_i))

Now, samples have weights:

  w(z, e) = prod_i P(e_i | Parents(E_i))

Together, the weighted sampling distribution is consistent:

  S_WS(z, e) * w(z, e) = prod_i P(z_i | Parents(Z_i)) * prod_i P(e_i | Parents(E_i)) = P(z, e)
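A sketch of likelihood weighting on the Cloudy network, assuming (for illustration) that the evidence is S = +s, W = +w: evidence variables are clamped rather than sampled, and each sample's weight collects P(evidence | parents):

```python
import random

# CPTs from the slides
P_S = {True: 0.1, False: 0.5}   # P(+s | C)
P_R = {True: 0.8, False: 0.2}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}  # P(+w | S, R)

def weighted_sample():
    """Fix evidence S=+s, W=+w; sample C and R; weight by P(evidence | parents)."""
    weight = 1.0
    c = random.random() < 0.5        # sample C from P(C)
    s = True                         # evidence: clamped, not sampled...
    weight *= P_S[c]                 # ...but its probability scales the weight
    r = random.random() < P_R[c]     # sample R from P(R | c)
    w = True                         # evidence again
    weight *= P_W[(s, r)]
    return c, weight

# Estimate P(+c | +s, +w) as a weighted average
random.seed(2)
num = den = 0.0
for _ in range(100_000):
    c, wt = weighted_sample()
    num += wt * c
    den += wt
print(num / den)   # the exact answer here works out to about 0.175
```

Every sample is kept; unlikely evidence shows up as a small weight instead of a rejection.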

Likelihood Weighting

Likelihood weighting is good:
- We have taken evidence into account as we generate the sample. E.g., here, W's value will get picked based on the evidence values of S, R.
- More of our samples will reflect the state of the world suggested by the evidence.

Likelihood weighting doesn't solve all our problems:
- Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence).
- We would like to consider evidence when we sample every variable.

Markov Chain Monte Carlo*

Idea: instead of sampling from scratch, create samples that are each like the last one.

Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed. E.g., for P(B | +c):

  +b, +a, +c  ->  -b, +a, +c  ->  -b, -a, +c

Properties: now the samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators! What's the point: both upstream and downstream variables condition on evidence.
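The slide's burglary example doesn't give its CPTs, so this sketch runs the same procedure (Gibbs sampling, one variable resampled at a time given its Markov blanket, evidence held fixed) on the earlier Cloudy network, targeting P(+c | +s, +w):

```python
import random

# CPTs from the Cloudy/Sprinkler/Rain/WetGrass slides
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def bern(p_true, p_false):
    """Sample True with probability p_true / (p_true + p_false)."""
    return random.random() < p_true / (p_true + p_false)

random.seed(3)
s, w = True, True       # evidence: fixed for the whole run
c, r = True, True       # non-evidence: arbitrary starting state
count_c = 0
burn, n = 1_000, 100_000
for t in range(burn + n):
    # Resample C given its Markov blanket: P(c | s, r) ∝ P(c) P(s|c) P(r|c)
    pc = {v: 0.5 * (P_S[v] if s else 1 - P_S[v]) * (P_R[v] if r else 1 - P_R[v])
          for v in (True, False)}
    c = bern(pc[True], pc[False])
    # Resample R given its Markov blanket: P(r | c, s, w) ∝ P(r|c) P(w|s,r)
    pr = {v: (P_R[c] if v else 1 - P_R[c]) * (P_W[(s, v)] if w else 1 - P_W[(s, v)])
          for v in (True, False)}
    r = bern(pr[True], pr[False])
    if t >= burn:
        count_c += c
print(count_c / n)   # consistent estimate of P(+c | +s, +w), about 0.175
```

Each new state differs from the last in at most two variables, yet the long-run averages match the true conditional distribution.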

Reasoning over Time

Often, we want to reason about a sequence of observations:
- Speech recognition
- Robot localization
- User attention
- Medical monitoring

We need to introduce time into our models. Basic approach: hidden Markov models (HMMs). More general: dynamic Bayes nets.

Markov Models

A Markov model is a chain-structured BN. Each node is identically distributed (stationarity). The value of X at a given time is called the state.

As a BN:  X_1 -> X_2 -> X_3 -> X_4

Parameters: called transition probabilities or dynamics, they specify how the state evolves over time (also, initial probabilities).

Conditional Independence

  X_1 -> X_2 -> X_3 -> X_4

Basic conditional independence: the past and the future are independent given the present; each time step only depends on the previous one. This is called the (first-order) Markov property.

Note that the chain is just a (growing) BN: we can always use generic BN reasoning on it if we truncate the chain at a fixed length.

Example: Markov Chain

Weather:
- States: X = {rain, sun}
- Transitions: P(sun | sun) = 0.9, P(rain | sun) = 0.1, P(sun | rain) = 0.1, P(rain | rain) = 0.9 (this is a CPT, not a BN!)
- Initial distribution: 1.0 sun

What's the probability distribution after one step?

Mini-Forward Algorithm

Question: what's the probability of being in state x at time t?

Slow answer: enumerate all sequences of length t which end in x, and add up their probabilities.

Mini-Forward Algorithm

Question: what's P(X_t) on some day t?

Forward simulation: push the distribution forward one step at a time through the chain:

  P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})

Example

From an initial observation of sun: tables of P(X_1), P(X_2), P(X_3), ..., P(X_inf)
From an initial observation of rain: tables of P(X_1), P(X_2), P(X_3), ..., P(X_inf)
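A minimal sketch of the mini-forward update for the sun/rain chain, using the transition probabilities from the earlier example; started from either initial observation, the distributions converge to the same limit:

```python
# Transition model from the example: stay in the same weather w.p. 0.9
T = {'sun': {'sun': 0.9, 'rain': 0.1},
     'rain': {'sun': 0.1, 'rain': 0.9}}

def forward(dist, steps):
    """Iterate P(X_t) = sum_{x_{t-1}} P(X_t | x_{t-1}) P(x_{t-1})."""
    for _ in range(steps):
        dist = {x: sum(T[prev][x] * p for prev, p in dist.items())
                for x in ('sun', 'rain')}
    return dist

from_sun = forward({'sun': 1.0, 'rain': 0.0}, 1)
print(from_sun)                                    # {'sun': 0.9, 'rain': 0.1}
print(forward({'sun': 1.0, 'rain': 0.0}, 200))     # both starting points end up
print(forward({'sun': 0.0, 'rain': 1.0}, 200))     # near the 50/50 limit
```

The long-run behavior previews the next slide: after enough steps, the starting observation no longer matters.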

Stationary Distributions

If we simulate the chain long enough, what happens? Uncertainty accumulates; eventually, we have no idea what the state is!

Stationary distributions: for most chains, the distribution we end up in is independent of the initial distribution. It is called the stationary distribution of the chain. Usually, we can only predict a short time out.

Web Link Analysis

PageRank over a web graph: each web page is a state.
- Initial distribution: uniform over pages
- Transitions: with probability c, uniform jump to a random page (dotted lines, not all shown); with probability 1-c, follow a random outlink (solid lines)

Stationary distribution:
- Will spend more time on highly reachable pages (e.g., there are many ways to get to the Acrobat Reader download page)
- Somewhat robust to link spam
- Google 1.0 returned the set of pages containing all your keywords, in decreasing rank; now all search engines use link analysis along with many other factors (rank is actually getting less important over time)
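As a sketch, PageRank's stationary distribution can be found by iterating the transition update on a toy graph (the four pages and their links below are hypothetical, chosen only for illustration):

```python
# Toy web graph: page -> outlinks (hypothetical, for illustration only)
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
pages = list(links)
c = 0.15   # probability of a uniform random jump

# Start uniform, then repeatedly apply the random-surfer transition
rank = {p: 1 / len(pages) for p in pages}
for _ in range(100):
    new = {p: c / len(pages) for p in pages}      # mass from random jumps
    for p in pages:
        for q in links[p]:
            new[q] += (1 - c) * rank[p] / len(links[p])  # mass from outlinks
    rank = new
print(rank)   # C, linked from everywhere, ranks highest; D, with no inlinks, lowest
```

Highly reachable pages accumulate probability mass, exactly as the slide describes; a page with no inlinks keeps only the random-jump floor c/N.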

Hidden Markov Models

Markov chains are not so useful for most agents: eventually you don't know anything anymore, so you need observations to update your beliefs.

Hidden Markov models (HMMs):
- Underlying Markov chain over states S
- You observe outputs (effects) at each time step

As a Bayes net:

  X_1 -> X_2 -> X_3 -> X_4 -> X_5
   |      |      |      |      |
  E_1    E_2    E_3    E_4    E_5

Example

An HMM is defined by:
- Initial distribution: P(X_1)
- Transitions: P(X_t | X_{t-1})
- Emissions: P(E_t | X_t)

Ghostbusters HMM

- P(X_1) = uniform over the grid (1/9 for each of the nine cells)
- P(X' | X) = usually move clockwise, but sometimes move in a random direction or stay in place (e.g., from X = <1,2>: probability 1/2 on one neighbor, 1/6 on each of three others, 0 elsewhere)
- P(R_ij | X) = same sensor model as before: red means close, green means far away

Conditional Independence

HMMs have two important independence properties:
- Markov hidden process: the future depends on the past via the present
- The current observation is independent of all else given the current state

Quiz: does this mean that the observations are independent? [No, they are correlated by the hidden state.]

Real HMM Examples

Speech recognition HMMs:
- Observations are acoustic signals (continuous valued)
- States are specific positions in specific words (so, tens of thousands)

Machine translation HMMs:
- Observations are words (tens of thousands)
- States are translation options

Robot tracking:
- Observations are range readings (continuous)
- States are positions on a map (continuous)

Filtering / Monitoring

Filtering, or monitoring, is the task of tracking the distribution B(X) (the belief state) over time. We start with B(X) in an initial setting, usually uniform; as time passes, or as we get observations, we update B(X).

The Kalman filter was invented in the 1960s and first implemented as a method of trajectory estimation for the Apollo program.
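A sketch of one filtering update on the sun/rain chain from earlier: each step is a time update through the transition model followed by a Bayes-rule observation update. The emission model here (observing whether people carry umbrellas) is an assumption added for illustration, not from the slides:

```python
# Transitions from the earlier weather chain; emission model is hypothetical
T = {'sun': {'sun': 0.9, 'rain': 0.1},
     'rain': {'sun': 0.1, 'rain': 0.9}}
E = {'sun': {'umbrella': 0.2, 'no-umbrella': 0.8},   # assumed P(e | X)
     'rain': {'umbrella': 0.9, 'no-umbrella': 0.1}}

def filter_step(belief, obs):
    """One filtering step: time update (transition), then observation update (Bayes)."""
    # Predict: B'(x) = sum_{x'} P(x | x') B(x')
    predicted = {x: sum(T[prev][x] * p for prev, p in belief.items()) for x in T}
    # Update: B(x) ∝ P(obs | x) B'(x), then normalize
    unnorm = {x: E[x][obs] * predicted[x] for x in T}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

B = {'sun': 0.5, 'rain': 0.5}                         # initial belief: uniform
for obs in ['umbrella', 'umbrella', 'no-umbrella']:
    B = filter_step(B, obs)
    print(obs, B)
```

Two umbrella sightings push the belief strongly toward rain; the final no-umbrella observation pulls it partway back, with the transition model's inertia smoothing the swings.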