Machine Learning 4771 Instructor: ony Jebara
Kalman Filtering Linear Dynamical Systems and Kalman Filtering Structure from Motion
Linear Dynamical Systems Audio: x=pitch y=acoustic waveform Vision: x=object pose y=pixel coordinates Industrial: x=state of steel y=temperature/pressure Military: x=target pos n y=radar returns Bio: x=protein levels y=gene expression levels Vision example (Rao)
Linear Dynamical Systems A Linear Dynamical System (LDS) or Kalman Filter (KF) has linear state dynamics and linear outputs: x t = Ax t 1 +Gw t w t N. 0,Q y t = Cx t + v t v t N. 0,R x t = state n y t = observed output p w v t = measurement noise p t = process noise n Called Linear-Gauss Markov Model (See Markov property?) Stationary LDS: A,C,G,Q,R do not change with t Non-Stationary LDS: they A,C,G,Q,R depend on t
Gaussian Marginals/Conditionals Conditional & marginal from joint: See derivation in Jordan 12.4: p( z µ,σ) 1 = exp 1 ( 2π) D/2 Σ z µ 2 p( x,y) 1 = exp 1 x ( 2π) D/2 Σ 2 y µ x µ y p( x) 1 = exp 1 ( 2π) x µ D x /2 Σ 2 x xx p y x = p ( x,y ) p( x) = p ( x,y ) p( x,y) p y x Σ 1 ( z µ ) Σ 1 xx Σ xx Σ xy x µ x Σ yx Σ yy ( 1 ( x µ x ),Σ yy Σ yx Σ 1 xx Σ ) xy = N y µ y + Σ yx Σ xx y = µ y + Σ yx Σ 1 xx y = Σ yx Σ 1 xx x + noise ( x µ ) x + noise 1 x y if zero-mean y µ x µ y
LDS as a Gaussian More statistically, the LDS is a large Gaussian distribution All variables are related by linear conditioned Gaussians: x t = Ax t 1 +Gw t w t N. 0,Q y t = Cx t + v t v t N (. 0,R ) = N ( x t Ax t 1,Q) p x t x t 1 = N ( y t Cx t,r) p y t x t We are assuming G=I, to simplify Jordan s approach he first hidden state has no parents, but still needs a Gaussian distribution: p( x ) 0 = N ( x t 0,Q ) Without loss of generality: all are zero-mean Gaussians
LDS as a Gaussian LDS has the following graphical model and conditionals: = N ( x t Ax t 1,Q) p x t x t 1 p( y t x ) t = N ( y t Cx t,r) p( x ) 0 = N ( x 0 0,P ) 0 Note (as with HMMs) stationary since A,C,R,Q constant Products of Gaussian distributions form Gaussian p( X ) U = p( x ) 0 p( x t x ) t 1 t=1 p( y t x ) t = N X U 0,Σ he graphical model describes a particular factorization of this large Gaussian over all variables. his factorization forms a huge yet sparse overall covariance Σ t=0
Properties of the LDS Since all variables in the LDS are Gaussian, we summarize their conditional probabilities not by tables p(q t q t-1 ) but by Mean & Covariance (Expectations or Moments) of Gaussian p( x t y 0, y ) t ' = N ( x t ˆx t t ',P ) t t ' Conditional Mean Conditional Cov Condition on Information until t observations y 0,,y t ˆx t t ' = E p ( xt y 0 y t ' ) x P t t t ' = E ( p xt x ( y 0 y t ' ) t ˆx )( t t ' x t ˆx ) t t ' Lyapunov Eqn shows evolution of unconditional covariance Σ t+1 = E x t+1 x t+1 = E ( Ax + w )( Ax t t t + w ) t = AE x t x t A + 2AE x t w t + E w w If evals of A <1 t t = AΣ t A + 0 +Q Σ = AΣ A +Q
Steady-State Covariance Consider Lyapunov at t going to infinity Σ t+1 = AΣ t A +Q If evals of A <1 Σ = AΣ A +Q Can solve for covariance at steady state (t = infinity) Example (Boyd): pick A and Q, solve for steady covariance A = 0.6 0.8 Q = 0.7 0.6 1 0 Σ 0 1 = 13.35 0.03 0.03 11.75 initialize Σ 0 = 0 I Σ 0 = 100 I
LDS: Basic Operations Recall we did 3 basic things with Bayesian networks: 1) Evaluate: given y 0,,y compute likelihood p(y 0,,y ) 2) Decode: given y 0,,y compute p(q t ) 3) Max Likelihood: given y 0,,y learn parameters θ = p( q,y) p y q0 q We want 3 basic things with our LDSs: 1) Evaluate: given y 0,,y compute likelihood p(y 0,,y ) 2) Filter/Smooth: given y 0,,y compute p(x t ) 3) Max Likelihood: given y 0,,y learn parameters θ p( y) = p( x,y)dx 0 dx x 0 x Will do Filtering traditionally but Junction ree does it all
Kalman Filtering Used in online tracking, information up to time t Filtering: given y 0,,y t compute p(x t ) incrementally via ˆx t t = E p ( X y0 y t ) x t P t t = E p X y0 x ( y t ) t ˆx t t Update with 2 steps: ( ˆx t t,p ) t t ˆx t+1 t,p t+1 t ( x t ˆx ) t t ( ˆx t+1 t+1,p ) t+1 t+1 ime Update: p(x t+1 ) moments at t+1 given same y s until t ˆx t+1 t = E p ( X y0 y t ) x t+1 = E p( X y 0 y t ) Ax + w t t = AE p ( X y0 y t ) x t + 0 = Aˆx t t P t+1 t = E ( p ( X y0 x y t ) t+1 ˆx )( t+1 t x t+1 ˆx ) t+1 t = E ( p ( X y0 Ax y t ) t + w t Aˆx )( t t ) = AP t t A +Q
Kalman Filtering Measurement Update: p(x t+1 ) moments given y t+1 E p X,yt+1 y 0 y t So: p Conditional Mean y t+1 = E p( X,y t+1 y 0 y t ) Cx + v t+1 t+1 = Cˆx + 0 t+1 t Conditional Covariance E ( p X,yt+1 y ( y 0 y t ) t+1 ŷ ) t+1 t ( ) = E p X,yt+1 Cx ( y 0 y t ) t+1 + v t+1 Cˆx t+1 t Conditional Correlation E p ( X,yt+1 y 0 y t ) x t+1 y t+1 ( y t+1 ŷ ) t+1 t x t+1 ˆx t+1 t y 0 y t = N x t+1 y t+1 ˆx t+1 t Cˆx t+1 t = CP t+1 t, P t+1 t CP t+1 t = CP C + R t+1 t P t+1 t C CP t+1 t C + R
Kalman Filtering Measurement Update: x p t+1 y y t+1 0 y t = N x t+1 y t+1 Want to condition this to get: Recall conditioning formula: p x y = N x µ x + Σ xy Σ yy ˆx t+1 t+1 = ˆx t+1 t + K t+1 ˆx t+1 t Cˆx t+1 t, P t+1 t CP t+1 t p x t+1,y t+1 P t+1 t C CP t+1 t C + R p( x t+1,y t+1 ) ( 1 ( y µ y ),Σ xx Σ xy Σ 1 yy Σ ) yx Next Conditional Mean y t+1 Cˆx t+1 t ˆx t+1 t = Aˆx t t P t+1 t = AP t t A +Q Next Conditional Cov where Kalman gain is K = P C t+1 t+1 t CP C t+1 t + R P t+1 t+1 = P t+1 t K t+1 CP t+1 t 1
Kalman Filtering (summary) Start at time t=0 and run forward until t= ˆx t+1 t = Aˆx t t P t+1 t = AP t t A +Q K t+1 = P t+1 t C ˆx t+1 t+1 = ˆx t+1 t + K t+1 ( CP t+1 t C + R) 1 P t+1 t+1 = P t+1 t K t+1 CP t+1 t ( y t+1 Cˆx ) t+1 t hen, the log-likelihood is (using the Gaussian model): loglikelihood = t=0 log N ( y t Cˆx t t 1,CP t t 1 C + R)
Filtering & Smoothing We have a recursive formula for going from ( ˆx t t,p ) t t ˆx t+1 t,p t+1 t ( ˆx t+1 t+1,p ) t+1 t+1 Initialize with: ( ˆx 0 1 = 0,P 0 1 = P ) 0 and loop updates Filtering: lets us compute the distribution over hidden states X online as we get more measurements given y 0,,y t compute p(x t ) p(x t ) does not get any information about future y s Smoothing: assume we have access to the whole sequence at once, can get BEER estimates for the hidden states X given y 0,,y compute p(x t ) Not real time tracking, since have all observations
RS Smoothing Smoothing goes back in time, propagates expectations we know at the last observation to earlier states Procedure: ( ˆx t+1,p ) t+1 ( ˆx t,p ) t Filtering already gave: ( ˆx 0 0,P 0 0 ) ( ˆx t t,p t t ) ( ˆx,P ) Recall: ˆx t+1 t = Aˆx t t so E ( P X y0 x (,,y t ) t t ˆx )( t t x t+1 t ˆx ) t+1 t = P t t A hus, have joint probability: x p t x y x t+1 0 y t = N t ˆx t t P t t P t t A, x t+1 ˆx t+1 t AP t t P t+1 t Using conditioning rule: p( x t x t+1,y 0 y t ) = N ( x t ˆx t t + L ( t x t+1 ˆx t+1 t ),P t t L t P t+1 t L ) t where we define L t = P t t A 1 P t+1 t
RS Smoothing Iteratively get the conditional mean on all sequence ˆx t = E P ( X y0,,y ) x t = x t Same for conditional cov: x p x y t (,,y t 0 ) x t = x t p x t,x t+1 y 0,,y x dx t dx xt+1 t+1 t = x t p( x t x t+1,y 0,,y ) dx t p( x t+1 y 0,,y )dx t+1 xt+1 xt+1 = x t p x t x t+1,y 0,,y t x t = ˆx t t + L t x t+1 ˆx xt+1 t+1 t = ˆx t t + L t ˆx t+1 ˆx t+1 t dx t dx t p( x t+1 y 0,,y )dx t+1 p ( x y,,y t+1 0 )dx t+1 P t = P t t + L ( t P t+1 P t+1 t )L t End up with: ( ˆx 0,P 0 ) ( ˆx t,p t ) ( ˆx,P )
Junction ree Approach Instead of deriving filtering & smoothing could use JA forward collect = filter, backward distribute = smooth Same directed graphical model as HMM Get same Junction tree as HMM ψ ( x,x 0 1) ψ ( x,x 1 2) ψ ( x,x 1 ) ψ x 0,y 0 φ ( x 1) ς( x 1 ) ς ( x 2) ς ( x ) φ x 0 ψ( x 2,y 2 ) ψ( x,y ) ψ x 1,y 1 Except now cliques & separators are not discrete tables but continuous functions of Gaussian form
Junction ree Approach Rewrite Gaussians in canonical or natural parameters ξ,k instead of mean and covariance µ,σ N ( x µ,σ) exp 1 ( x µ 2 )Σ 1 ( x µ ) ξ = Σ 1 µ exp( ξ x 1 2 x Kx) K = Σ 1 property: exp( ξ 1 x 1 2 x K 1 x)exp ξ 2 x 1 2 x K 2 x Initialize all cliques to the conditional Gaussians = exp 1 2 x t+1 Ax t ψ x t+1,x t = exp 1 y Cx 2 t t ζ( x ) t = φ( x ) t = 1 ψ x t,y t = exp( ξ 1+2 x 1 2 x K 1+2 x) ( Q 1 ( x t+1 Ax )) t p x t+1 x t ( R 1 ( x t Cx )) t p y t x t
Junction ree Algorithm Canonical-form Gaussians make it easy to compute the continuous version of JA message passing Collect Distribute φ * S = V \S ψ V ψ W * ψ V * = φ * S ψ φ W S = ψ V φ * S = ψ V \S V ψ W * ψ V * = φ * S ψ φ W S = ψ V Get same implementation as previous filtering & smoothing All cliques and separators become the marginals ψ( x t+1,x ) ** t p( x t+1,x t y 0,,y ) ζ( x ) ** t p( x t y 0,,y )
Expectation Maximization If we knew X states, maximum complete likelihood is easy: log p( X,Y ) = log p( x ) 0 p( x t x ) t=1 t 1 p( y t x ) t=0 t = log p( x ) 0 + log p( x t x ) t 1 + log p y t x t 1 1 y 2 t Cx t + t=0 2 For example, take derivatives over C and set to 0: log p = 1 ( y C C 2 t Cx ) t R 1 ( y t Cx ) t + t=0 2 t=1 1 x P 1 2 0 0 x 0 1 log P 2 0 1 = 1 2 1 2 t=1 2R 1 t=0 t=0 ( x t Ax ) t 1 Q 1 x t Ax t 1 R 1 ( y t Cx ) t + 1 2 log R 1 log R 1 ( y t Cx t )x t C = ( y t x ) t=0 t x t x t=0 t log Q 1 1
Expectation Maximization But, we don t know hidden states, maximize incomplete E-step: Use current model A,C,Q,R to get expectations smoothing gives: ( ˆx 0,P 0 ) ( ˆx t,p t ) ( ˆx,P ) M-step: Maximize the expected complete likelihood (replace x s & outerproducts with their expectations) C = ( y t x )( t=0 t x t x ) 1 t=0 t ( y t ˆx )( t=0 t P ) 1 t=0 t ( ) P t 1 1 A = E x t 1 x t etc... t=1 t=1 actually need pairwise marginals (see Ghahramani Paper) p( x t 1,x t y 0,,y ) = N x t 1 x t ˆx t 1 P t 1 P t 1,t, ˆx t P t,t 1 P t and iterate until convergence
LDS Smoothing Example Given some data y 0, y Fit LDS model with EM A,C,Q,R,P 0 hen do smoothing,( ˆx t,p t ), Show expected y s,( Cˆx t ),
Nonlinear Dynamical Systems What about nonlinear dynamics? LDS is simplistic since many real phenomena have nonlinear state evolution and nonlinear output given state x t = a x t 1 y t = c x t + w t w t N (. + v t v t N. 0,R 0,Q ) x t = state n w t = process noise n y t = observed output p v t = measurement noise p ypically we prespecify the functions a() and c() Also called an Extended Kalman Filter Good news: the basic filtering & smoothing algorithms stay the same and we can recover hidden states Bad news: not guaranteed optimal, EM learning harder
Extended Kalman Filtering he regular Kalman filter becomes non-stationary We need to approximate the nonlinear a() and c() functions at each step in time with their best constant (linear) estimate A and C given our guess of hidden state A t = a ( x ) C x ˆx t = c ( x ) t 1 t 1 x ˆx t 1 t 1 Change filtering equations as follows ˆx t+1 t = a ( ˆx ) t t P t+1 t = A t P t t A t +Q ˆx t+1 t+1 = ˆx t+1 t + K t+1 ( y t+1 c( ˆx )) t+1 t P t+1 t+1 = P t+1 t K t+1 C t P t+1 t K = P C t+1 t+1 t t ( C t P t+1 t C t + R) 1