Machine Learning 4771


Machine Learning 4771 Instructor: Tony Jebara

Kalman Filtering: Linear Dynamical Systems and Kalman Filtering; Structure from Motion

Linear Dynamical Systems. Examples of hidden state x and observed output y:
Audio: x = pitch, y = acoustic waveform
Vision: x = object pose, y = pixel coordinates
Industrial: x = state of steel, y = temperature/pressure
Military: x = target position, y = radar returns
Bio: x = protein levels, y = gene expression levels
Vision example (Rao)

Linear Dynamical Systems. A Linear Dynamical System (LDS) or Kalman Filter (KF) has linear state dynamics and linear outputs:

$x_t = A x_{t-1} + G w_t, \quad w_t \sim N(0, Q)$
$y_t = C x_t + v_t, \quad v_t \sim N(0, R)$

where $x_t \in \mathbb{R}^n$ is the state, $y_t \in \mathbb{R}^p$ is the observed output, $w_t \in \mathbb{R}^n$ is the process noise, and $v_t \in \mathbb{R}^p$ is the measurement noise. This is also called the Linear-Gauss-Markov model (see the Markov property?). Stationary LDS: $A, C, G, Q, R$ do not change with $t$. Non-stationary LDS: $A, C, G, Q, R$ depend on $t$.
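To make the generative model concrete, here is a minimal NumPy sketch (not from the lecture) that samples one trajectory from these equations. It assumes $G = I$, as the lecture does below, and the prior $x_0 \sim N(0, Q)$; the function name `sample_lds` is an illustrative choice.

```python
import numpy as np

def sample_lds(A, C, Q, R, T, seed=0):
    """Sample x_t = A x_{t-1} + w_t, y_t = C x_t + v_t,
    with w_t ~ N(0, Q), v_t ~ N(0, R).
    Assumes G = I and the prior x_0 ~ N(0, Q)."""
    rng = np.random.default_rng(seed)
    n, p = A.shape[0], C.shape[0]
    X = np.zeros((T + 1, n))
    Y = np.zeros((T + 1, p))
    X[0] = rng.multivariate_normal(np.zeros(n), Q)
    Y[0] = C @ X[0] + rng.multivariate_normal(np.zeros(p), R)
    for t in range(1, T + 1):
        X[t] = A @ X[t - 1] + rng.multivariate_normal(np.zeros(n), Q)
        Y[t] = C @ X[t] + rng.multivariate_normal(np.zeros(p), R)
    return X, Y
```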

Gaussian Marginals/Conditionals. Conditional & marginal from a joint Gaussian (see the derivation in Jordan 12.4). A Gaussian density is

$p(z \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(z - \mu)^\top \Sigma^{-1} (z - \mu)\right)$

Write the joint over $(x, y)$ in block form:

$p(x, y) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} \begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix}^\top \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}^{-1} \begin{bmatrix} x - \mu_x \\ y - \mu_y \end{bmatrix}\right)$

The marginal is

$p(x) = \frac{1}{(2\pi)^{D_x/2} |\Sigma_{xx}|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_x)^\top \Sigma_{xx}^{-1} (x - \mu_x)\right)$

and the conditional is

$p(y \mid x) = \frac{p(x, y)}{p(x)} = N\!\left(y \mid \mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x - \mu_x),\; \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\right)$

So $y = \mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x - \mu_x) + \text{noise}$, and if everything is zero-mean, $y = \Sigma_{yx}\Sigma_{xx}^{-1} x + \text{noise}$.
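As a quick sanity check of the conditioning formula, here is a small helper (the name `condition_gaussian` is my own) that computes the conditional mean and covariance directly from the joint's blocks:

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, Sxx, Sxy, Syy, x):
    """p(y | x) from the joint N([mu_x; mu_y], [[Sxx, Sxy], [Syx, Syy]]):
    mean = mu_y + Syx Sxx^{-1} (x - mu_x), cov = Syy - Syx Sxx^{-1} Sxy."""
    gain = Sxy.T @ np.linalg.inv(Sxx)          # Syx Sxx^{-1}
    return mu_y + gain @ (x - mu_x), Syy - gain @ Sxy
```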

LDS as a Gaussian. More statistically, the LDS is one large Gaussian distribution. All variables are related by linear-Gaussian conditionals:

$x_t = A x_{t-1} + G w_t$, $w_t \sim N(0, Q)$ gives $p(x_t \mid x_{t-1}) = N(x_t \mid A x_{t-1}, Q)$
$y_t = C x_t + v_t$, $v_t \sim N(0, R)$ gives $p(y_t \mid x_t) = N(y_t \mid C x_t, R)$

We are assuming $G = I$, to simplify (Jordan's approach). The first hidden state has no parents, but still needs a Gaussian distribution: $p(x_0) = N(x_0 \mid 0, Q)$. Without loss of generality, all are zero-mean Gaussians.

LDS as a Gaussian. The LDS has the following graphical model and conditionals:

$p(x_t \mid x_{t-1}) = N(x_t \mid A x_{t-1}, Q), \quad p(y_t \mid x_t) = N(y_t \mid C x_t, R), \quad p(x_0) = N(x_0 \mid 0, P_0)$

Note that (as with HMMs) this is stationary since $A, C, R, Q$ are constant. Products of Gaussian distributions form a Gaussian:

$p(X, Y) = p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}) \prod_{t=0}^{T} p(y_t \mid x_t) = N\!\left(\begin{bmatrix} X \\ Y \end{bmatrix} \mid 0, \Sigma\right)$

The graphical model describes a particular factorization of this large Gaussian over all variables. This factorization forms a huge yet sparse overall covariance $\Sigma$.

Properties of the LDS. Since all variables in the LDS are Gaussian, we summarize their conditional probabilities not by tables $p(q_t \mid q_{t-1})$ but by the mean & covariance (expectations or moments) of a Gaussian:

$p(x_t \mid y_0, \ldots, y_{t'}) = N(x_t \mid \hat{x}_{t|t'}, P_{t|t'})$

conditioning on the observations $y_0, \ldots, y_{t'}$, with conditional mean $\hat{x}_{t|t'} = E_{p(x_t \mid y_0 \ldots y_{t'})}[x_t]$ and conditional covariance $P_{t|t'} = E_{p(x_t \mid y_0 \ldots y_{t'})}[(x_t - \hat{x}_{t|t'})(x_t - \hat{x}_{t|t'})^\top]$.

The Lyapunov equation shows the evolution of the unconditional covariance:

$\Sigma_{t+1} = E[x_{t+1} x_{t+1}^\top] = E[(A x_t + w_t)(A x_t + w_t)^\top] = A E[x_t x_t^\top] A^\top + 2 A E[x_t w_t^\top] + E[w_t w_t^\top] = A \Sigma_t A^\top + 0 + Q$

If the eigenvalues of $A$ have magnitude $< 1$, this converges to a steady state satisfying $\Sigma = A \Sigma A^\top + Q$.

Steady-State Covariance. Consider the Lyapunov equation $\Sigma_{t+1} = A \Sigma_t A^\top + Q$ as $t$ goes to infinity. If the eigenvalues of $A$ have magnitude $< 1$, we can solve for the covariance at steady state ($t = \infty$): $\Sigma = A \Sigma A^\top + Q$.

Example (Boyd): pick $A$ and $Q$, solve for the steady-state covariance:

$A = \begin{bmatrix} 0.6 & -0.8 \\ 0.7 & 0.6 \end{bmatrix}, \quad Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \Rightarrow \quad \Sigma = \begin{bmatrix} 13.35 & -0.03 \\ -0.03 & 11.75 \end{bmatrix}$

Iterating from either $\Sigma_0 = 0 \cdot I$ or $\Sigma_0 = 100 \cdot I$ converges to the same steady state.
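A minimal sketch that reproduces this example numerically, assuming the $A$ and $Q$ above (the off-diagonal signs are reconstructed; with them, $|\lambda| \approx 0.96 < 1$ and the recursion converges):

```python
import numpy as np

# A and Q as in the Boyd example above (signs reconstructed).
A = np.array([[0.6, -0.8],
              [0.7,  0.6]])
Q = np.eye(2)

Sigma = np.zeros((2, 2))            # also converges from Sigma_0 = 100 * I
for _ in range(1000):
    Sigma = A @ Sigma @ A.T + Q     # Lyapunov recursion

print(np.round(Sigma, 2))           # [[13.35 -0.03], [-0.03 11.75]]
print(np.abs(np.linalg.eigvals(A))) # both ~0.96 < 1, so iteration converges
```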

LDS: Basic Operations. Recall we did 3 basic things with Bayesian networks (e.g., HMMs):
1) Evaluate: given $y_0, \ldots, y_T$ compute the likelihood $p(y_0, \ldots, y_T) = \sum_{q_0} \cdots \sum_{q_T} p(q, y)$
2) Decode: given $y_0, \ldots, y_T$ compute $p(q_t)$
3) Max likelihood: given $y_0, \ldots, y_T$ learn the parameters $\theta$

We want 3 basic things with our LDSs:
1) Evaluate: given $y_0, \ldots, y_T$ compute the likelihood $p(y) = \int_{x_0} \cdots \int_{x_T} p(x, y)\, dx_0 \cdots dx_T$
2) Filter/Smooth: given $y_0, \ldots, y_T$ compute $p(x_t)$
3) Max likelihood: given $y_0, \ldots, y_T$ learn the parameters $\theta$

We will derive filtering traditionally, but the Junction Tree algorithm does it all.

Kalman Filtering. Used in online tracking, with information up to time $t$. Filtering: given $y_0, \ldots, y_t$ compute $p(x_t)$ incrementally via

$\hat{x}_{t|t} = E_{p(x_t \mid y_0 \ldots y_t)}[x_t], \quad P_{t|t} = E_{p(x_t \mid y_0 \ldots y_t)}[(x_t - \hat{x}_{t|t})(x_t - \hat{x}_{t|t})^\top]$

Update with 2 steps: $(\hat{x}_{t|t}, P_{t|t}) \to (\hat{x}_{t+1|t}, P_{t+1|t}) \to (\hat{x}_{t+1|t+1}, P_{t+1|t+1})$.

Time update: the moments of $p(x_{t+1})$ at $t+1$ given the same $y$'s up to $t$:

$\hat{x}_{t+1|t} = E_{p(X \mid y_0 \ldots y_t)}[x_{t+1}] = E[A x_t + w_t] = A E[x_t] + 0 = A \hat{x}_{t|t}$
$P_{t+1|t} = E[(x_{t+1} - \hat{x}_{t+1|t})(x_{t+1} - \hat{x}_{t+1|t})^\top] = E[(A x_t + w_t - A \hat{x}_{t|t})(A x_t + w_t - A \hat{x}_{t|t})^\top] = A P_{t|t} A^\top + Q$

Kalman Filtering. Measurement update: the moments of $p(x_{t+1}, y_{t+1})$ given $y_0, \ldots, y_t$:

Conditional mean: $\hat{y}_{t+1|t} = E[y_{t+1} \mid y_0 \ldots y_t] = E[C x_{t+1} + v_{t+1}] = C \hat{x}_{t+1|t} + 0$
Conditional covariance: $E[(y_{t+1} - \hat{y}_{t+1|t})(y_{t+1} - \hat{y}_{t+1|t})^\top] = E[(C x_{t+1} + v_{t+1} - C \hat{x}_{t+1|t})(\cdots)^\top] = C P_{t+1|t} C^\top + R$
Conditional cross-correlation: $E[(x_{t+1} - \hat{x}_{t+1|t})(y_{t+1} - \hat{y}_{t+1|t})^\top] = P_{t+1|t} C^\top$

So:

$p\!\left(\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} \mid y_0 \ldots y_t\right) = N\!\left(\begin{bmatrix} x_{t+1} \\ y_{t+1} \end{bmatrix} \mid \begin{bmatrix} \hat{x}_{t+1|t} \\ C \hat{x}_{t+1|t} \end{bmatrix}, \begin{bmatrix} P_{t+1|t} & P_{t+1|t} C^\top \\ C P_{t+1|t} & C P_{t+1|t} C^\top + R \end{bmatrix}\right)$

Kalman Filtering. Measurement update, continued: we have the joint Gaussian above and want to condition it to get $p(x_{t+1} \mid y_{t+1}, y_0 \ldots y_t)$. Recall the conditioning formula:

$p(x \mid y) = N\!\left(x \mid \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y - \mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$

Next conditional mean: $\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(y_{t+1} - C \hat{x}_{t+1|t})$
Next conditional covariance: $P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}$

where the Kalman gain is $K_{t+1} = P_{t+1|t} C^\top (C P_{t+1|t} C^\top + R)^{-1}$, and from the time update $\hat{x}_{t+1|t} = A \hat{x}_{t|t}$ and $P_{t+1|t} = A P_{t|t} A^\top + Q$.

Kalman Filtering (summary). Start at time $t = 0$ and run forward until $t = T$:

$\hat{x}_{t+1|t} = A \hat{x}_{t|t}$
$P_{t+1|t} = A P_{t|t} A^\top + Q$
$K_{t+1} = P_{t+1|t} C^\top (C P_{t+1|t} C^\top + R)^{-1}$
$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(y_{t+1} - C \hat{x}_{t+1|t})$
$P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}$

Then the log-likelihood is (using the Gaussian model):

$\log\text{likelihood} = \sum_{t=0}^{T} \log N(y_t \mid C \hat{x}_{t|t-1},\; C P_{t|t-1} C^\top + R)$
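These five updates translate almost line for line into code. Below is a minimal NumPy/SciPy sketch of the forward pass (function and variable names are illustrative, not from the lecture); it returns the filtered moments and the log-likelihood from the last formula:

```python
import numpy as np
from scipy.stats import multivariate_normal

def kalman_filter(Y, A, C, Q, R, P0):
    """Forward pass over the summary equations above.
    Y has shape (T+1, p); the prior gives xhat_{0|-1} = 0, P_{0|-1} = P0.
    Returns filtered means/covariances and
    sum_t log N(y_t | C xhat_{t|t-1}, C P_{t|t-1} C' + R)."""
    n = A.shape[0]
    x_pred, P_pred = np.zeros(n), P0
    xs, Ps, loglik = [], [], 0.0
    for y in Y:
        # measurement update
        S = C @ P_pred @ C.T + R                 # innovation covariance
        loglik += multivariate_normal.logpdf(y, C @ x_pred, S)
        K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        x_filt = x_pred + K @ (y - C @ x_pred)
        P_filt = P_pred - K @ C @ P_pred
        xs.append(x_filt)
        Ps.append(P_filt)
        # time update
        x_pred = A @ x_filt
        P_pred = A @ P_filt @ A.T + Q
    return np.array(xs), np.array(Ps), loglik
```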

Filtering & Smoothing. We have a recursive formula for going from $(\hat{x}_{t|t}, P_{t|t}) \to (\hat{x}_{t+1|t}, P_{t+1|t}) \to (\hat{x}_{t+1|t+1}, P_{t+1|t+1})$. Initialize with $(\hat{x}_{0|-1} = 0, P_{0|-1} = P_0)$ and loop the updates.

Filtering lets us compute the distribution over the hidden states $X$ online as we get more measurements: given $y_0, \ldots, y_t$ compute $p(x_t)$. Note $p(x_t)$ does not get any information about future $y$'s.

Smoothing: assume we have access to the whole sequence at once, so we can get BETTER estimates for the hidden states $X$: given $y_0, \ldots, y_T$ compute $p(x_t)$. This is not real-time tracking, since we need all the observations.

RTS Smoothing. Smoothing goes back in time, propagating the expectations we know at the last observation back to earlier states. Procedure: $(\hat{x}_{t+1|T}, P_{t+1|T}) \to (\hat{x}_{t|T}, P_{t|T})$. Filtering already gave $(\hat{x}_{0|0}, P_{0|0}), \ldots, (\hat{x}_{t|t}, P_{t|t}), \ldots, (\hat{x}_{T|T}, P_{T|T})$.

Recall $\hat{x}_{t+1|t} = A \hat{x}_{t|t}$, so $E_{p(X \mid y_0, \ldots, y_t)}[(x_t - \hat{x}_{t|t})(x_{t+1} - \hat{x}_{t+1|t})^\top] = P_{t|t} A^\top$. Thus we have the joint probability

$p\!\left(\begin{bmatrix} x_t \\ x_{t+1} \end{bmatrix} \mid y_0 \ldots y_t\right) = N\!\left(\begin{bmatrix} x_t \\ x_{t+1} \end{bmatrix} \mid \begin{bmatrix} \hat{x}_{t|t} \\ \hat{x}_{t+1|t} \end{bmatrix}, \begin{bmatrix} P_{t|t} & P_{t|t} A^\top \\ A P_{t|t} & P_{t+1|t} \end{bmatrix}\right)$

Using the conditioning rule:

$p(x_t \mid x_{t+1}, y_0 \ldots y_t) = N\!\left(x_t \mid \hat{x}_{t|t} + L_t(x_{t+1} - \hat{x}_{t+1|t}),\; P_{t|t} - L_t P_{t+1|t} L_t^\top\right)$

where we define $L_t = P_{t|t} A^\top P_{t+1|t}^{-1}$.

RTS Smoothing. Iteratively get the conditional mean over the whole sequence:

$\hat{x}_{t|T} = E_{p(X \mid y_0, \ldots, y_T)}[x_t] = \int x_t\, p(x_t \mid y_0, \ldots, y_T)\, dx_t$
$= \int\!\!\int x_t\, p(x_t, x_{t+1} \mid y_0, \ldots, y_T)\, dx_t\, dx_{t+1}$
$= \int \left[\int x_t\, p(x_t \mid x_{t+1}, y_0, \ldots, y_t)\, dx_t\right] p(x_{t+1} \mid y_0, \ldots, y_T)\, dx_{t+1}$
$= \int \left(\hat{x}_{t|t} + L_t(x_{t+1} - \hat{x}_{t+1|t})\right) p(x_{t+1} \mid y_0, \ldots, y_T)\, dx_{t+1}$
$= \hat{x}_{t|t} + L_t(\hat{x}_{t+1|T} - \hat{x}_{t+1|t})$

Same for the conditional covariance:

$P_{t|T} = P_{t|t} + L_t(P_{t+1|T} - P_{t+1|t}) L_t^\top$

We end up with $(\hat{x}_{0|T}, P_{0|T}), \ldots, (\hat{x}_{t|T}, P_{t|T}), \ldots, (\hat{x}_{T|T}, P_{T|T})$.
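A matching backward-pass sketch (again with illustrative names), consuming the filtered moments produced by the forward pass above:

```python
import numpy as np

def rts_smooth(xs_filt, Ps_filt, A, Q):
    """Backward pass: xhat_{t|T} = xhat_{t|t} + L_t (xhat_{t+1|T} - xhat_{t+1|t}),
    P_{t|T} = P_{t|t} + L_t (P_{t+1|T} - P_{t+1|t}) L_t',
    with L_t = P_{t|t} A' P_{t+1|t}^{-1}."""
    T = len(xs_filt) - 1
    xs_sm, Ps_sm = xs_filt.copy(), Ps_filt.copy()
    for t in range(T - 1, -1, -1):
        x_pred = A @ xs_filt[t]                        # xhat_{t+1|t}
        P_pred = A @ Ps_filt[t] @ A.T + Q              # P_{t+1|t}
        L = Ps_filt[t] @ A.T @ np.linalg.inv(P_pred)   # smoother gain L_t
        xs_sm[t] = xs_filt[t] + L @ (xs_sm[t + 1] - x_pred)
        Ps_sm[t] = Ps_filt[t] + L @ (Ps_sm[t + 1] - P_pred) @ L.T
    return xs_sm, Ps_sm
```

Feeding the outputs of `kalman_filter` into `rts_smooth` yields the full smoothed sequence $(\hat{x}_{t|T}, P_{t|T})$.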

Junction Tree Approach. Instead of deriving filtering & smoothing separately, we could use the JTA: forward collect = filter, backward distribute = smooth. It is the same directed graphical model as the HMM, so we get the same junction tree: transition cliques $\psi(x_0, x_1), \psi(x_1, x_2), \ldots, \psi(x_{T-1}, x_T)$, emission cliques $\psi(x_0, y_0), \psi(x_1, y_1), \psi(x_2, y_2), \ldots, \psi(x_T, y_T)$, and separators $\phi(x_0), \phi(x_1), \ldots$ and $\zeta(x_1), \zeta(x_2), \ldots, \zeta(x_T)$. Except now the cliques & separators are not discrete tables but continuous functions of Gaussian form.

Junction Tree Approach. Rewrite the Gaussians in canonical or natural parameters $(\xi, K)$ instead of mean and covariance $(\mu, \Sigma)$:

$N(x \mid \mu, \Sigma) \propto \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right) \propto \exp\left(\xi^\top x - \tfrac{1}{2} x^\top K x\right), \quad \xi = \Sigma^{-1}\mu, \quad K = \Sigma^{-1}$

Key property: products of canonical-form Gaussians just add their parameters:

$\exp\left(\xi_1^\top x - \tfrac{1}{2} x^\top K_1 x\right)\exp\left(\xi_2^\top x - \tfrac{1}{2} x^\top K_2 x\right) = \exp\left(\xi_{1+2}^\top x - \tfrac{1}{2} x^\top K_{1+2} x\right)$

Initialize all cliques to the conditional Gaussians:

$\psi(x_{t+1}, x_t) = \exp\left(-\tfrac{1}{2}(x_{t+1} - A x_t)^\top Q^{-1}(x_{t+1} - A x_t)\right) \propto p(x_{t+1} \mid x_t)$
$\psi(x_t, y_t) = \exp\left(-\tfrac{1}{2}(y_t - C x_t)^\top R^{-1}(y_t - C x_t)\right) \propto p(y_t \mid x_t)$
$\zeta(x_t) = \phi(x_t) = 1$
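The "products add canonical parameters" property is literally one line of code. A small sketch (helper names are my own):

```python
import numpy as np

def to_canonical(mu, Sigma):
    """Convert moment form (mu, Sigma) to canonical form (xi, K)."""
    K = np.linalg.inv(Sigma)
    return K @ mu, K                   # xi = Sigma^{-1} mu, K = Sigma^{-1}

xi1, K1 = to_canonical(np.array([1.0, 0.0]), np.eye(2))
xi2, K2 = to_canonical(np.array([0.0, 2.0]), 2 * np.eye(2))

# Multiplying the two potentials = adding their canonical parameters:
xi12, K12 = xi1 + xi2, K1 + K2         # exp(xi' x - x' K x / 2)
mu12 = np.linalg.solve(K12, xi12)      # back to moment form if needed
```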

Junction Tree Algorithm. Canonical-form Gaussians make it easy to compute the continuous version of JTA message passing (sums become integrals):

Collect: $\phi_S^* = \int_{V \setminus S} \psi_V, \quad \psi_W^* = \dfrac{\phi_S^*}{\phi_S}\, \psi_W$
Distribute: $\phi_S^{**} = \int_{W \setminus S} \psi_W^*, \quad \psi_V^{**} = \dfrac{\phi_S^{**}}{\phi_S^*}\, \psi_V$

We get the same implementation as the previous filtering & smoothing. All cliques and separators become the marginals:

$\psi^{**}(x_{t+1}, x_t) \propto p(x_{t+1}, x_t \mid y_0, \ldots, y_T), \quad \zeta^{**}(x_t) \propto p(x_t \mid y_0, \ldots, y_T)$

Expectation Maximization. If we knew the states $X$, maximum complete likelihood would be easy:

$\log p(X, Y) = \log\left[p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}) \prod_{t=0}^{T} p(y_t \mid x_t)\right] = \log p(x_0) + \sum_{t=1}^{T} \log p(x_t \mid x_{t-1}) + \sum_{t=0}^{T} \log p(y_t \mid x_t)$

$= -\tfrac{1}{2} x_0^\top P_0^{-1} x_0 + \tfrac{1}{2}\log\left|P_0^{-1}\right| + \sum_{t=1}^{T}\left[-\tfrac{1}{2}(x_t - A x_{t-1})^\top Q^{-1}(x_t - A x_{t-1}) + \tfrac{1}{2}\log\left|Q^{-1}\right|\right] + \sum_{t=0}^{T}\left[-\tfrac{1}{2}(y_t - C x_t)^\top R^{-1}(y_t - C x_t) + \tfrac{1}{2}\log\left|R^{-1}\right|\right] + \text{const}$

For example, take the derivative over $C$ and set it to zero:

$\frac{\partial \log p}{\partial C} = \sum_{t=0}^{T} R^{-1}(y_t - C x_t)\, x_t^\top = 0 \quad \Rightarrow \quad C = \left(\sum_{t=0}^{T} y_t x_t^\top\right)\left(\sum_{t=0}^{T} x_t x_t^\top\right)^{-1}$

Expectation Maximization. But we don't know the hidden states, so we maximize the incomplete likelihood via EM.

E-step: use the current model $A, C, Q, R$ to get expectations; smoothing gives $(\hat{x}_0, P_0), \ldots, (\hat{x}_t, P_t), \ldots, (\hat{x}_T, P_T)$.

M-step: maximize the expected complete likelihood (replace the $x$'s & outer products with their expectations):

$C = \left(\sum_{t=0}^{T} y_t \hat{x}_t^\top\right)\left(\sum_{t=0}^{T} P_t\right)^{-1}, \quad A = \left(\sum_{t=1}^{T} E[x_t x_{t-1}^\top]\right)\left(\sum_{t=1}^{T} P_{t-1}\right)^{-1}, \quad \text{etc.}$

For $A$ we actually need the pairwise marginals (see the Ghahramani paper):

$p(x_{t-1}, x_t \mid y_0, \ldots, y_T) = N\!\left(\begin{bmatrix} x_{t-1} \\ x_t \end{bmatrix} \mid \begin{bmatrix} \hat{x}_{t-1} \\ \hat{x}_t \end{bmatrix}, \begin{bmatrix} P_{t-1} & P_{t-1,t} \\ P_{t,t-1} & P_t \end{bmatrix}\right)$

Iterate the E and M steps until convergence.
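As an illustration of one M-step, here is a sketch of the update for $C$ from smoothed statistics. It assumes (following the Ghahramani & Hinton convention, my assumption) that the second-moment term is $E[x_t x_t^\top] = P_{t|T} + \hat{x}_t \hat{x}_t^\top$; the name `m_step_C` is hypothetical:

```python
import numpy as np

def m_step_C(Y, x_sm, P_sm):
    """M-step for C: replace the outer products x_t x_t' in the
    complete-data solution with their smoothed expectations
    E[x_t x_t'] = P_{t|T} + xhat_t xhat_t' (assumed convention)."""
    Syx = sum(np.outer(y, x) for y, x in zip(Y, x_sm))
    Sxx = sum(P + np.outer(x, x) for P, x in zip(P_sm, x_sm))
    return Syx @ np.linalg.inv(Sxx)
```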

LDS Smoothing Example. Given some data $y_0, \ldots, y_T$: fit an LDS model with EM to get $A, C, Q, R, P_0$; then do smoothing to get $\ldots, (\hat{x}_t, P_t), \ldots$; and show the expected $y$'s: $\ldots, C \hat{x}_t, \ldots$

Nonlinear Dynamical Systems. What about nonlinear dynamics? The LDS is simplistic, since many real phenomena have nonlinear state evolution and nonlinear output given state:

$x_t = a(x_{t-1}) + w_t, \quad w_t \sim N(0, Q)$
$y_t = c(x_t) + v_t, \quad v_t \sim N(0, R)$

where $x_t \in \mathbb{R}^n$ is the state, $w_t \in \mathbb{R}^n$ is the process noise, $y_t \in \mathbb{R}^p$ is the observed output, and $v_t \in \mathbb{R}^p$ is the measurement noise. Typically we prespecify the functions $a(\cdot)$ and $c(\cdot)$. This setting is handled by the so-called Extended Kalman Filter. Good news: the basic filtering & smoothing algorithms stay the same and we can recover hidden states. Bad news: no longer guaranteed optimal, and EM learning is harder.

Extended Kalman Filtering. The regular Kalman filter becomes non-stationary: we approximate the nonlinear $a(\cdot)$ and $c(\cdot)$ functions at each step in time by their best constant (linear) estimates $A_t$ and $C_t$, given our current guess of the hidden state:

$A_t = \left.\frac{\partial a(x)}{\partial x}\right|_{x = \hat{x}_{t-1|t-1}}, \quad C_t = \left.\frac{\partial c(x)}{\partial x}\right|_{x = \hat{x}_{t-1|t-1}}$

Change the filtering equations as follows:

$\hat{x}_{t+1|t} = a(\hat{x}_{t|t})$
$P_{t+1|t} = A_t P_{t|t} A_t^\top + Q$
$K_{t+1} = P_{t+1|t} C_t^\top (C_t P_{t+1|t} C_t^\top + R)^{-1}$
$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}(y_{t+1} - c(\hat{x}_{t+1|t}))$
$P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C_t P_{t+1|t}$
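A one-step sketch of these EKF equations (names are illustrative; note that, as a common convention and an assumption on my part, the output Jacobian is evaluated at the predicted state $\hat{x}_{t+1|t}$):

```python
import numpy as np

def ekf_step(x_filt, P_filt, y_next, a, c, jac_a, jac_c, Q, R):
    """One EKF step following the equations above. a, c are the nonlinear
    dynamics/output functions; jac_a, jac_c return their Jacobians
    (the time-varying A_t, C_t) at a given state estimate."""
    A = jac_a(x_filt)                    # linearize dynamics at xhat_{t|t}
    x_pred = a(x_filt)                   # xhat_{t+1|t} = a(xhat_{t|t})
    P_pred = A @ P_filt @ A.T + Q
    C = jac_c(x_pred)                    # linearize output at the prediction
    S = C @ P_pred @ C.T + R             # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (y_next - c(x_pred))
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```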