Handling uncertainty over time: predicting, estimating, recognizing, learning
Chris Atkeson 2004

Why do we care?
Speech recognition makes use of the dependence of words and phonemes across time. Knowing where your robot is makes use of reasoning about processes that unfold over time. So does most reasoning, actually: medical diagnosis, politics, the stock market, and the choices you make every day, for example.

Examples
Discrete and continuous states and measurements:
How tall am I? (continuous)
What room am I in? (discrete)

What room am I in?
State: Kitchen (K) vs. Den (D)
Measurement: Do I smell food (F)?
Priors: p(K) = 0.7, p(D) = ? = 1 − p(K)
Measurement model: p(F|K) = 0.8, p(F|D) = 0.1, p(~F|K) = ?, p(~F|D) = ?
What is p(K|F)? p(K|~F)?

Bayes Rule
p(A&B) = p(A|B)p(B) = p(B|A)p(A)
So p(A|B) = p(B|A)p(A)/p(B)
Example: p(K|F) = p(F|K)p(K)/p(F)
Note that p(F) = p(F|K)p(K) + p(F|D)p(D)
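Before the numbers are worked out on the next slide, it may help to see the whole calculation in a few lines of Python; a minimal sketch of the kitchen/den example (the variable names are mine, not from the slides):

```python
# Bayes rule for the kitchen (K) vs. den (D) example.
p_K = 0.7                  # prior p(K); p(D) = 1 - p(K)
p_D = 1.0 - p_K
p_F_K = 0.8                # measurement model p(F|K)
p_F_D = 0.1                # measurement model p(F|D)

# Total probability: p(F) = p(F|K)p(K) + p(F|D)p(D)
p_F = p_F_K * p_K + p_F_D * p_D             # 0.59

# Bayes rule: p(K|F) = p(F|K)p(K)/p(F)
p_K_F = p_F_K * p_K / p_F                   # ~0.95
p_K_notF = (1 - p_F_K) * p_K / (1 - p_F)    # ~0.34

print(p_F, p_K_F, p_K_notF)
```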

The Numbers
p(F) = p(F|K)p(K) + p(F|D)p(D): 0.59 = 0.8*0.7 + 0.1*0.3
p(K|F) = p(F|K)p(K)/p(F): 0.95 = 0.8*0.7/0.59
p(K|~F) = p(~F|K)p(K)/p(~F), where p(~F) = 1 − p(F)
So p(K|~F) = 0.34 = 0.2*0.7/0.41

What happens with a second measurement?
Same formulas, new prior. (A code sketch of this sequential updating follows this page.)
First measurement F: p(K|F) = 0.95
Measurements are conditionally independent given the state, so p(F2|K,F1) = p(F2|K), etc.
p(F2|F1) = p(F2|K)p(K|F1) + p(F2|D)p(D|F1) = 0.77
p(K|F1,F2) = p(F2|K)p(K|F1)/p(F2|F1): 0.99 = 0.8*0.95/0.77
p(K|F1,~F2) = p(~F2|K)p(K|F1)/p(~F2|F1): 0.83 = 0.2*0.95/0.23
Draw a picture.

Continuous states and measurements

Unit variance Gaussian
p(x) = (1/√(2π)) exp(−x²/2)
E[X] = 0, Var[X] = 1

General Gaussian
p(x) = (1/√(2πσ²)) exp(−(x−µ)²/(2σ²))
E[X] = µ, Var[X] = σ²
(Figure: Gaussian density with µ = 100, σ = 15.)
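The "same formulas, new prior" idea is easy to express as a reusable update; a small sketch (the function name and defaults are mine):

```python
def bayes_update(p_K, p_F_K=0.8, p_F_D=0.1, smelled_food=True):
    """One Bayes-rule update of p(K) given one food-smell measurement."""
    lik_K = p_F_K if smelled_food else 1 - p_F_K   # p(y|K)
    lik_D = p_F_D if smelled_food else 1 - p_F_D   # p(y|D)
    evidence = lik_K * p_K + lik_D * (1 - p_K)     # p(y)
    return lik_K * p_K / evidence

p = 0.7                                   # prior p(K)
p = bayes_update(p, smelled_food=True)    # ~0.95 after F
p = bayes_update(p, smelled_food=True)    # ~0.99 after F, F
```

Each call's output becomes the next call's prior, which is exactly the slide's point.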

General Gaussian
Also known as the normal distribution or bell-shaped curve:
p(x) = (1/√(2πσ²)) exp(−(x−µ)²/(2σ²))
E[X] = µ, Var[X] = σ²
Shorthand: we say X ~ N(µ,σ²) to mean X is distributed as a Gaussian with parameters µ and σ². In the figure above, X ~ N(100,15²).

The Central Limit Theorem
If (X_1, X_2, ..., X_n) are i.i.d. continuous random variables, define z = f(x_1, x_2, ..., x_n) = (1/n) Σ_{i=1}^{n} x_i.
As n → ∞, p(z) → Gaussian with mean E[X_i] and variance Var[X_i]/n.
Somewhat of a justification for assuming Gaussian noise is common.

Combining Measurements: 1D
True value x. Measurements m_1, m_2: E(m_1 − x) = 0, Var(m_1) = σ_1²; E(m_2 − x) = 0, Var(m_2) = σ_2²; independent.
Linear estimate: x̂ = k_1 m_1 + k_2 m_2
An unbiased estimate means k_2 = 1 − k_1, so E(x̂) = x.
Minimize Var(x̂) = k_1² σ_1² + (1 − k_1)² σ_2²
Setting ∂Var(x̂)/∂k_1 = 0 → k_1(σ_1² + σ_2²) − σ_2² = 0
So k_1 = σ_2²/(σ_1² + σ_2²), k_2 = σ_1²/(σ_1² + σ_2²)
So Var(x̂) = σ_1² σ_2²/(σ_1² + σ_2²)
What happens when σ_2² = 0? σ_2² = ∞? (See the sketch after this page.)
BLUE: Best Linear Unbiased Estimator

Handling Measurements Over Time
Just keep doing what we did before: p(state | new measurement) ∝ p(new measurement | state) p(state), with the old posterior as the new prior.

What is a state?
Everything you need to know to make the best prediction about what happens next. Depends how you define the system you care about. States are called x or s. Dependence on time can be indicated by x(t). States can be discrete or continuous. AI researchers tend to say "state" when they mean some features derived from the state. This should be discouraged. A belief state is your knowledge about the state, which is typically a probability p(x). Processes with state are called Markov processes.

Dealing with time
The concept of state gives us a handy way of thinking about how things evolve over time. We will use discrete time, for example 0.001, 0.002, 0.003, ...
State at time t_k, x(t_k), will be written x_k or x[k].
Deterministic state transition function: x_{k+1} = f(x_k)
Stochastic state transition function: p(x_{k+1} | x_k)
Mildly stochastic state transition function: x_{k+1} = f(x_k) + ε, with ε being Gaussian.
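The BLUE weights fall straight out of the derivation above; a minimal sketch (function and variable names are mine):

```python
def fuse(m1, var1, m2, var2):
    """Best linear unbiased estimate of x from two independent,
    unbiased measurements m1, m2 with variances var1, var2."""
    k1 = var2 / (var1 + var2)               # weight on m1
    k2 = var1 / (var1 + var2)               # weight on m2 (= 1 - k1)
    x_hat = k1 * m1 + k2 * m2
    var_hat = var1 * var2 / (var1 + var2)   # always <= min(var1, var2)
    return x_hat, var_hat
```

Setting var2 = 0 drives the estimate to m2 exactly, and letting var2 grow without bound recovers m1, which answers the slide's two questions.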

Hidden state
Sometimes the state is directly measurable/observable. Sometimes it isn't. Then you have hidden state and a hidden Markov model, or HMM. Examples: Do you have a disease? What am I thinking about? What is wrong with the Mars rover? Where is the Mars rover?
(Figure: graphical model with hidden states x_{k−1} → x_k → x_{k+1}, each emitting a measurement y_{k−1}, y_k, y_{k+1}.)

Measurements
Measurements (y) are also called evidence (e) and observables (o). Measurements can be discrete or continuous.
Deterministic measurement function: y_k = g(x_k)
Stochastic measurement function: p(y_k | x_k)
Mildly stochastic measurement function: y_k = g(x_k) + υ, with υ being Gaussian.

Standard problems
Predict the future.
Estimate the current state (filtering).
Estimate what happened in the past (smoothing).
Find the most likely state trajectory (sequence/trajectory (speech) recognition).
Learn about the process (learn state transition and measurement models).

Prediction, Case 0
Deterministic state transition function x_{k+1} = f(x_k) and known state x_k: just apply f() n times to get x_{k+n}. When we worry about learning the state transition function and the fact that it will always have errors, the question will arise: to predict x_{k+n}, is it better to learn x_{k+1} = f_1(x_k) and iterate, or to learn x_{k+n} = f_n(x_k) directly?

Prediction, Case 1
Stochastic state transition function p(x_{k+1} | x_k), discrete states, belief state p(x_k). Use tables to represent p(x_k).
Propagate belief state: p(x_{k+1}) = Σ p(x_{k+1} | x_k) p(x_k)
Matrix notation: vector p_k, transition matrix M, M_ij = p(x_i | x_j); i, j are components, not times.
Propagate belief state: p_{k+1} = M p_k (see the sketch after this page)
Stationary distribution: M_∞ = lim(n→∞) M^n
Mixing time: the n for which M^n ≈ M_∞

Prediction, Case 2
Stochastic state transition function p(x_{k+1} | x_k), continuous states, belief state p(x_k).
Propagate belief state analytically if possible: p(x_{k+1}) = ∫ p(x_{k+1} | x_k) p(x_k) dx_k
Particle filtering (actually many ways to implement): sample p(x_k); for each sample, sample p(x_{k+1} | x_k); normalize/resample the resulting samples to get p(x_{k+1}). Iterate to get p(x_{k+n}).
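A tiny numerical illustration of Prediction, Case 1; the two-state chain and its numbers are hypothetical, not from the slides:

```python
import numpy as np

# Transition matrix M with M[i, j] = p(x_i | x_j); each column sums to 1.
# Hypothetical two-state chain (e.g., Kitchen/Den occupancy over time).
M = np.array([[0.9, 0.3],
              [0.1, 0.7]])

p = np.array([1.0, 0.0])   # belief state p(x_0): certainly in state 0
for _ in range(50):        # propagate: p_{k+1} = M p_k
    p = M @ p
print(p)                   # approaches the stationary distribution [0.75, 0.25]
```

Repeated multiplication is exactly the M^n of the stationary-distribution slide; the mixing time is how many iterations this loop needs before p stops changing appreciably.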

Prediction, Case 3
Mildly stochastic state transition function with p(x_k) being N(µ, Σ_x), x_{k+1} = f(x_k) + ε, with ε being N(0, Σ_ε) and independent of the process.
E(x_{k+1}) ≈ f(µ)
A = ∂f/∂x
Var(x_{k+1}) ≈ A Σ_x A^T + Σ_ε
p(x_{k+1}) is N(E(x_{k+1}), Var(x_{k+1})). Exact if f() is linear. Iterate to get p(x_{k+n}). Much simpler than particle filtering.

Filtering, in general
Start with the previous posterior p(x_{k−1}^+) (superscripts + and − denote the belief after and before incorporating the measurement).
Predict p(x_k^−).
Apply the measurement using Bayes rule to get p(x_k^+) = p(x_k | y_k):
p(x_k | y_k) = p(y_k | x_k) p(x_k) / p(y_k)
Sometimes we ignore p(y_k) and just renormalize as necessary, so all we have to do is p(x_k | y_k) = α p(y_k | x_k) p(x_k).

Filtering, Case 1
Stochastic state transition function p(x_{k+1} | x_k), discrete states, belief state p(x_k), measurement model p(y_k | x_k). Use tables to represent p(x_k).
Propagate belief state: p(x_{k+1}^−) = Σ p(x_{k+1} | x_k) p(x_k^+)
Weight each entry by the measurement likelihood: p(x_{k+1}^+) ∝ p(y_{k+1} | x_{k+1}) p(x_{k+1}^−)
Normalize so the probabilities sum to 1.
This is called a Discrete Bayes Filter. (A code sketch follows this page.)

Filtering, Case 2
Stochastic state transition function p(x_{k+1} | x_k), continuous states, belief state p(x_k).
Particle filtering (actually many ways to implement): sample p(x_k); for each sample, sample p(x_{k+1} | x_k); weight each sample by the measurement likelihood; normalize/resample the resulting samples to get p(x_{k+1}). Iterate to get p(x_{k+n}).

Particle Filter Algorithm
(From EEE 581 Lecture 16, "Particle Filters: a Gentle Introduction", http://www.fulton.asu.edu/~morrell/581/)
Create particles as samples from the initial state distribution p(x_0).
For k going from 1 to K:
- Update each particle using the state update equation.
- Compute weights for each particle using the observation value.
- (Optionally) resample particles.
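Returning to Filtering, Case 1: the discrete Bayes filter step fits in a few lines. A sketch; the array conventions are mine:

```python
import numpy as np

def discrete_bayes_filter(p, M, lik):
    """One step of the discrete Bayes filter from the slides.
    p   : belief state p(x_k^+) as a probability vector
    M   : transition table, M[i, j] = p(x_{k+1} = i | x_k = j)
    lik : likelihood vector, lik[i] = p(y_{k+1} | x_{k+1} = i)
    """
    p_pred = M @ p                 # predict: p(x_{k+1}^-)
    p_post = lik * p_pred          # weight by the measurement likelihood
    return p_post / p_post.sum()   # normalize so the probabilities sum to 1
```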

Initial State Distribution: Samples Only
(Figure: particles drawn from the initial state distribution.)

Samples and Weights
Each particle has a value and a weight. (Figure: weighted particles.)

Importance Sampling
Ideally, the particles would represent samples drawn from the distribution p(x). In practice, we usually cannot get p(x) in closed form; in any case, it would usually be difficult to draw samples from p(x). We use importance sampling: particles are drawn from an importance distribution and weighted by importance weights.

State Update
x_{k+1} = f_0(x_k) + noise
Things are more complicated if we have a multimodal p(x_{k+1} | x_k).

Compute Weights
Use p(y | x) to alter the weights. (Figure: particle weights before and after the measurement.)

Resampling
Can also draw samples with replacement using p(y | x)·weight as p(selection). In inference problems, most weights tend to zero except a few (from particles that closely match observations), which become large. We resample to concentrate particles in regions where p(x | y) is larger.
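Putting the slides above together, one predict/weight/resample cycle of a bootstrap particle filter might look like the following. A sketch under the slides' additive-Gaussian-noise state update, with the transition density itself as the importance distribution; the names are illustrative:

```python
import numpy as np

def particle_filter_step(particles, f, process_noise_std, likelihood, y):
    """One cycle of a bootstrap particle filter.
    particles  : (N,) array of state samples representing p(x_k)
    f          : deterministic part of the update x_{k+1} = f(x_k) + noise
    likelihood : function giving p(y | x) up to a constant, vectorized over x
    """
    N = len(particles)
    # State update: sample from p(x_{k+1} | x_k), the importance distribution
    particles = f(particles) + np.random.normal(0.0, process_noise_std, N)
    # Importance weights from the observation (assumed not all zero)
    w = likelihood(y, particles)
    w = w / w.sum()
    # Resample with replacement to concentrate particles where p(x|y) is large
    idx = np.random.choice(N, size=N, p=w)
    return particles[idx]
```

After resampling, all particles carry equal weight again, which is why the next cycle can start from the plain sample set.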

Advantages of Particle Filters
Under general conditions, the particle filter estimate becomes asymptotically optimal as the number of particles goes to infinity.
Non-linear, non-Gaussian state update and observation equations can be used.
Multi-modal distributions are not a problem.
Particle filter solutions to inference problems are often easy to formulate.

Disadvantages of Particle Filters
Naïve formulations of problems usually result in significant computation times.
It is hard to tell if you have enough particles.
The best importance distribution and/or resampling methods may be very problem specific.

Conclusions
Particle filters (and other Monte Carlo methods) are a powerful tool to solve difficult inference problems. Formulating a filter is now a tractable exercise for many previously difficult or impossible problems. Implementing a filter effectively may require significant creativity and expertise to keep the computational requirements tractable.

Particle Filtering Comments
Reinvented many times in many fields: sequential Monte Carlo, condensation, bootstrap filtering, interacting particle approximations, survival of the fittest, ...
Do you need R^d samples to cover the space? R is a crude measure of linear resolution, d is the dimensionality.
You maintain a belief state p(x). How do you answer the question "Where is the robot now?" Mean, best sample, robust mean, max likelihood, ... (see the sketch after this page). What happens if p(x) really is multimodal?

Return to our regularly scheduled programming: Filtering

Filtering, Case 3
Mildly stochastic state transition function with p(x_k) being N(µ, Σ_x), x_{k+1} = f(x_k) + ε, with ε being N(0, Σ_ε) and independent of the process.
Mildly stochastic measurement function y_k = g(x_k) + υ, with υ being N(0, Σ_υ) and independent of everything else.
This will lead to Kalman Filtering. Nonlinear f() or g() means you are doing Extended Kalman Filtering (EKF).
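The slides ask how to turn a particle belief state into an answer to "Where is the robot now?" A sketch of a few of the listed options for 1-D particles; the "robust mean" here is one ad hoc choice among many, and all names are mine:

```python
import numpy as np

def point_estimates(particles, w):
    """Turn a weighted particle set (weights w sum to 1) into point answers."""
    mean = np.sum(w * particles)          # posterior mean
    best = particles[np.argmax(w)]        # highest-weight sample
    # A crude "robust mean": average only the particles near the best sample,
    # which behaves better than the plain mean when p(x) is multimodal.
    near = np.abs(particles - best) <= np.std(particles)
    robust = np.average(particles[near], weights=w[near])
    return mean, best, robust
```

If p(x) really is multimodal, the mean can land between modes where the robot certainly is not, which is the slide's warning.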

Filtering, Case 3: Prediction Step
E(x_{k+1}^−) ≈ f(µ)
A = ∂f/∂x
Var(x_{k+1}^−) ≈ A Σ_x A^T + Σ_ε
p(x_{k+1}^−) is N(E(x_{k+1}^−), Var(x_{k+1}^−))

Filtering, Case 3: Measurement Update Step
E(x_k^+) ≈ E(x_k^−) + K_k (y_k − g(E(x_k^−)))
C = ∂g/∂x, Σ_k^− = Var(x_k^−)
Var(x_k^+) ≈ Σ_k^− − K_k C Σ_k^−
S_k = C Σ_k^− C^T + Σ_υ
K_k = Σ_k^− C^T S_k^{−1}
p(x_k^+) is N(E(x_k^+), Var(x_k^+))
This all comes from the Gaussian lecture. (A combined code sketch follows this page.)

Unscented Filter
Numerically find the best-fit Gaussian instead of doing the analytical computation. Good if f() or g() is strongly nonlinear.

What I would like to see
Combine a particle system and a Kalman filter, so each particle maintains a simple distribution instead of just a point estimate.

Smoothing, in general
Have y_{1:N}, want p(x_k | y_{1:N}). We know how to compute p(x_k | y_{1:k}) from the filtering slides.
p(x_k | y_{1:N}) = p(x_k | y_{1:k}, y_{k+1:N})
p(x_k | y_{1:N}) ∝ p(x_k | y_{1:k}) p(y_{k+1:N} | y_{1:k}, x_k)
p(x_k | y_{1:N}) ∝ p(x_k | y_{1:k}) p(y_{k+1:N} | x_k)
p(y_{k+1:N} | x_k) = Σ/∫ p(y_{k+1:N} | x_k, x_{k+1}) p(x_{k+1} | x_k) dx_{k+1}
= Σ/∫ p(y_{k+1:N} | x_{k+1}) p(x_{k+1} | x_k) dx_{k+1}
= Σ/∫ p(y_{k+1} | x_{k+1}) p(y_{k+2:N} | x_{k+1}) p(x_{k+1} | x_k) dx_{k+1}
(Σ/∫ means a sum for discrete states, an integral for continuous ones.)
Note the recursion implied by p(y_{k+i+1:N} | x_{k+i}).

Smoothing, general comments
Need to maintain distributional information at all time steps from the forward filter.
Case 1: discrete states: forward/backward algorithm.
Case 2: continuous states, nasty dynamics or noise: particle smoothing (expensive).
Case 3: continuous states, Gaussian noise: Kalman smoother.
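The two Case 3 slides combine into one EKF cycle. A minimal sketch of those equations; it assumes the Jacobians A and C are supplied already evaluated at the current estimates:

```python
import numpy as np

def ekf_step(mu, Sigma, f, A, g, C, Sigma_eps, Sigma_ups, y):
    """One extended Kalman filter cycle matching the slides' equations.
    f, g : state transition and measurement functions
    A, C : Jacobians df/dx and dg/dx at the current estimates
    """
    # Prediction step
    mu_pred = f(mu)                            # E(x_{k+1}^-) ~= f(mu)
    Sigma_pred = A @ Sigma @ A.T + Sigma_eps   # Var(x_{k+1}^-)
    # Measurement update step
    S = C @ Sigma_pred @ C.T + Sigma_ups       # innovation covariance S_k
    K = Sigma_pred @ C.T @ np.linalg.inv(S)    # Kalman gain K_k
    mu_new = mu_pred + K @ (y - g(mu_pred))
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return mu_new, Sigma_new
```

With linear f and g (constant A and C) this is the ordinary Kalman filter, and the Gaussian propagation is exact.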

Finding the most likely state trajectory
The goal in speech recognition:
p(x_1, x_2, ..., x_N | y_{1:N}) ≠ p(x_1 | y_{1:N}) p(x_2 | y_{1:N}) ... p(x_N | y_{1:N})
Are we screwed? Computing the joint probability is hard!

Viterbi Algorithm
max p(x_{1:k} | y_{1:k})
= max p(y_{1:k} | x_{1:k}) p(x_{1:k})  (up to normalization)
= max p(y_{1:k−1}, y_k | x_{1:k}) p(x_{1:k})
= max p(y_{1:k−1} | x_{1:k−1}) p(y_k | x_k) p(x_{1:k})
= max p(y_{1:k−1} | x_{1:k−1}) p(y_k | x_k) p(x_k | x_{1:k−1}) p(x_{1:k−1})
= max [p(y_{1:k−1} | x_{1:k−1}) p(x_{1:k−1})] p(y_k | x_k) p(x_k | x_{1:k−1})
= max p(x_{1:k−1} | y_{1:k−1}) p(y_k | x_k) p(x_k | x_{k−1})
Note the recursion. Do we evaluate this over all possible sequences?

Viterbi Algorithm (2)
Use dynamic programming. (Figure: trellis of states at times k−2, k−1, and k; each column stores p(x_{1:k−1} | y_{1:k−1}) for the best path ending in each state, and each transition applies p(y_k | x_k) p(x_k | x_{k−1}).)
Take max p(x_{1:k−1} | y_{1:k−1}) p(y_k | x_k) p(x_k | x_{k−1}) over predecessors. (A code sketch follows this page.)

Viterbi Algorithm (3)
Well, this still only really works for discrete states. Continuous states have too many possible states at each step: D dimensions with resolution R in each dimension implies R^D states at each time step. Ask me about local sequence maximization.

Learning
Given data, we want to learn the dynamics/transition and sensor models. Smooth, choose the most likely state at each time, learn the models, iterate. This is known as the EM algorithm. Discrete case: Baum-Welch Algorithm.
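For the discrete case, the Viterbi recursion above maps directly onto dynamic programming in log space. A sketch; the array conventions are mine:

```python
import numpy as np

def viterbi(p0, M, lik):
    """Most likely discrete state sequence via dynamic programming.
    p0  : initial state distribution, shape (n,)
    M   : transition matrix, M[i, j] = p(x_{k+1} = i | x_k = j)
    lik : lik[k, i] = p(y_k | x_k = i), shape (K, n)
    """
    K, n = lik.shape
    logd = np.log(p0) + np.log(lik[0])   # best log-prob of a path ending in each state
    back = np.zeros((K, n), dtype=int)   # backpointers to best predecessors
    for k in range(1, K):
        # scores[i, j]: best path ending in state j at k-1, then stepping j -> i
        scores = np.log(M) + logd[None, :]
        back[k] = np.argmax(scores, axis=1)
        logd = scores.max(axis=1) + np.log(lik[k])
    # Backtrack from the best final state
    path = [int(np.argmax(logd))]
    for k in range(K - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return path[::-1]
```

Each time step costs O(n²) rather than the naive n^K over all sequences, which is the whole point of the trellis picture in Viterbi Algorithm (2).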