Probabilistic Graphical Models

Probabilistic Graphical Models, Lecture 12: Dynamical Models. CS/CNS/EE 155, Andreas Krause

Announcements: Homework 3 is out tonight; start early! Project milestones are due today; please email them to the TAs.

Parameter learning for log-linear models. Feature functions φ_i(C_i) are defined over cliques of an undirected graph G; the domains C_i can overlap. The log-linear model defines the joint distribution P(x) = (1/Z(w)) exp( Σ_i w_i φ_i(C_i) ). How do we get the weights w_i?
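To make the setup concrete, here is a minimal sketch (not from the slides) of a log-linear model over three binary variables with two overlapping cliques; the feature functions phi1, phi2 and the weights are made-up illustrations, and the partition function is computed by brute-force enumeration, which is only feasible for tiny models.

```python
import itertools
import numpy as np

# Minimal sketch of a log-linear model over three binary variables with
# overlapping cliques C1 = {X1, X2} and C2 = {X2, X3}. The feature
# functions and weights below are made up purely for illustration.

def phi1(x1, x2):          # feature on clique C1: agreement of X1 and X2
    return 1.0 if x1 == x2 else 0.0

def phi2(x2, x3):          # feature on clique C2: agreement of X2 and X3
    return 1.0 if x2 == x3 else 0.0

w = np.array([1.2, 0.7])   # weights w_i (these are what learning must find)

def unnormalized(x):
    x1, x2, x3 = x
    return np.exp(w[0] * phi1(x1, x2) + w[1] * phi2(x2, x3))

# Partition function Z(w) by brute-force enumeration.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(x) for x in states)

def joint(x):
    return unnormalized(x) / Z

print(joint((1, 1, 1)), sum(joint(x) for x in states))  # the latter should be 1.0
```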

Log-linear conditional random field. Define a log-linear model over the outputs Y, making no assumptions about the inputs X. Feature functions φ_i(C_i, X) are defined over cliques and the inputs, giving the conditional distribution P(Y | X) = (1/Z(w, X)) exp( Σ_i w_i φ_i(C_i, X) ).

Example: CRFs in NLP. A chain CRF with outputs Y_1, ..., Y_12 and inputs X_1, ..., X_12 over the words of "Mrs. Greene spoke today in New York. Green chairs the finance committee." Classify each word as Person, Location, or Other.

Example: CRFs in vision.

Gradient of the conditional log-likelihood. The partial derivative with respect to w_i is the observed feature count minus its expectation, ∂/∂w_i log P(y | x, w) = φ_i(y, x) − E_{P(Y | x, w)}[ φ_i(Y, x) ]; evaluating it requires one run of inference per training example. The objective can be optimized using conjugate gradient.
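As an illustration (not part of the slides), the following sketch checks the "observed minus expected features" form of the gradient on a toy model with a single discrete output, where inference is exact enumeration; the feature map phi and the toy data are assumptions made purely for the example.

```python
import numpy as np

# Toy check of d/dw log P(y | x, w) = phi(y, x) - E_{P(Y|x,w)}[phi(Y, x)]
# for a single discrete output Y in {0, 1, 2}. The feature map below is a
# made-up example; inference is exact enumeration over the 3 labels.

rng = np.random.default_rng(0)
num_labels, num_feats = 3, 4
x = rng.normal(size=num_feats)

def phi(y, x):
    """Feature vector for (y, x): x placed in the block of label y."""
    f = np.zeros(num_labels * num_feats)
    f[y * num_feats:(y + 1) * num_feats] = x
    return f

def log_lik(w, y_obs, x):
    scores = np.array([w @ phi(y, x) for y in range(num_labels)])
    return scores[y_obs] - np.log(np.sum(np.exp(scores)))

def grad(w, y_obs, x):
    scores = np.array([w @ phi(y, x) for y in range(num_labels)])
    p = np.exp(scores - np.max(scores)); p /= p.sum()
    expected = sum(p[y] * phi(y, x) for y in range(num_labels))
    return phi(y_obs, x) - expected          # observed minus expected features

w = rng.normal(size=num_labels * num_feats)
g = grad(w, 1, x)

# Numerical gradient check
eps = 1e-6
g_num = np.array([(log_lik(w + eps * e, 1, x) - log_lik(w - eps * e, 1, x)) / (2 * eps)
                  for e in np.eye(len(w))])
print(np.max(np.abs(g - g_num)))             # should be ~1e-9
```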

Exponential family distributions. Distributions for log-linear models; more generally, exponential family distributions have the form P(x | w) = h(x) exp( w^T φ(x) − A(w) ), where h(x) is the base measure, w are the natural parameters, φ(x) are the sufficient statistics, and A(w) is the log-partition function. Here x can be continuous (defined over any set).

Examples. The Gaussian distribution is in the exponential family, with base measure h(x), natural parameters w, sufficient statistics φ(x), and log-partition function A(w). Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, ...
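A small sketch (assuming the standard parameterization of the 1-D Gaussian, not taken from the slides) showing the Gaussian written in exponential family form with φ(x) = (x, x²) and checking it against the usual density:

```python
import numpy as np

# Sketch: write the 1-D Gaussian N(mu, sigma^2) in exponential family form
#   p(x | w) = h(x) * exp(w . phi(x) - A(w))
# with sufficient statistics phi(x) = (x, x^2), natural parameters
# w = (mu/sigma^2, -1/(2 sigma^2)), base measure h(x) = 1/sqrt(2 pi), and
# log-partition function A(w) = mu^2/(2 sigma^2) + log(sigma).

mu, sigma = 1.5, 0.8

w = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])  # natural parameters
A = mu**2 / (2 * sigma**2) + np.log(sigma)            # log-partition function
h = 1.0 / np.sqrt(2 * np.pi)                          # base measure

def expfam_pdf(x):
    phi = np.array([x, x**2])                         # sufficient statistics
    return h * np.exp(w @ phi - A)

def standard_pdf(x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

x = 0.3
print(expfam_pdf(x), standard_pdf(x))                 # the two should agree
```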

Moments and gradients. There is a correspondence between moments and the log-partition function (just as in log-linear models): ∇_w A(w) = E[φ(x)]. We can compute moments from derivatives, and derivatives from moments; maximum likelihood estimation reduces to moment matching.
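For instance, here is a quick numerical check of the gradient/moment correspondence for a Bernoulli variable in exponential family form; the particular natural parameter is arbitrary:

```python
import numpy as np

# Sketch of the moment / log-partition correspondence for a Bernoulli
# variable in exponential family form: phi(x) = x, A(w) = log(1 + e^w),
# so dA/dw should equal E[x] = P(x = 1).

w = 0.7
A = lambda w: np.log1p(np.exp(w))                 # log-partition function
mean = np.exp(w) / (1 + np.exp(w))                # E[phi(x)] = P(x = 1)

eps = 1e-6
dA_dw = (A(w + eps) - A(w - eps)) / (2 * eps)     # numerical derivative of A
print(dA_dw, mean)                                # both ~0.668: gradient = moment
```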

Conjugate priors in the exponential family. Any exponential family likelihood has a conjugate prior.

Exponential family graphical models. So far, we only defined graphical models over discrete variables, but we can define GMs over continuous distributions as well. For exponential family distributions, much of what we discussed (VE, JT, parameter learning, etc.) carries over. An important example: Gaussian networks.

Multivariate Gaussian distribution. A joint distribution over n random variables P(X_1, ..., X_n) with mean µ and covariance Σ, where σ_jk = E[(X_j − µ_j)(X_k − µ_k)]. If X_j and X_k are independent, then σ_jk = 0.

Marginalization. Suppose (X_1, ..., X_n) ~ N(µ, Σ). What is P(X_1)? More generally, let A = {i_1, ..., i_k} ⊆ {1, ..., n} and write X_A = (X_{i_1}, ..., X_{i_k}). Then X_A ~ N(µ_A, Σ_AA).

Conditioning. Suppose (X_1, ..., X_n) ~ N(µ, Σ) and decompose it as (X_A, X_B). What is P(X_A | X_B)? We have P(X_A = x_A | X_B = x_B) = N(x_A; µ_{A|B}, Σ_{A|B}), where µ_{A|B} = µ_A + Σ_AB Σ_BB^{-1} (x_B − µ_B) and Σ_{A|B} = Σ_AA − Σ_AB Σ_BB^{-1} Σ_BA. Computable using linear algebra!
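A minimal numpy sketch of these conditioning formulas, with an arbitrary 3-dimensional example (the helper condition and the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Sketch of Gaussian conditioning in standard form. For (X_A, X_B) jointly
# Gaussian, conditioning on X_B = x_B gives
#   mu_{A|B}    = mu_A + Sigma_AB Sigma_BB^{-1} (x_B - mu_B)
#   Sigma_{A|B} = Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA

def condition(mu, Sigma, A, B, x_B):
    mu_A, mu_B = mu[A], mu[B]
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    K = S_AB @ np.linalg.inv(S_BB)        # maps the evidence into the mean
    mu_cond = mu_A + K @ (x_B - mu_B)
    Sigma_cond = S_AA - K @ S_AB.T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu_c, Sigma_c = condition(mu, Sigma, A=[0, 1], B=[2], x_B=np.array([0.5]))
print(mu_c)        # conditional mean of (X_1, X_2) given X_3 = 0.5
print(Sigma_c)     # conditional covariance (does not depend on x_B)
```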

Conditional linear Gaussians. A conditional linear Gaussian (CLG) models P(Y | X) as a Gaussian whose mean depends linearly on the parent: Y = A X + b + N(0, Σ).

Canonical representation. Multivariate Gaussians are in the exponential family! Standard vs. canonical form: the canonical parameters are the precision matrix Λ = Σ^{-1} and the potential vector η = Λ µ, so that µ = Λ^{-1} η and Σ = Λ^{-1}.
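A short sketch of the round trip between the two parameterizations, with an arbitrary example covariance:

```python
import numpy as np

# Sketch: convert a multivariate Gaussian between standard form (mu, Sigma)
# and canonical / information form (eta, Lambda), using
#   Lambda = Sigma^{-1},  eta = Lambda mu,
# and back: mu = Lambda^{-1} eta, Sigma = Lambda^{-1}.

mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

Lambda = np.linalg.inv(Sigma)              # precision matrix
eta = Lambda @ mu                          # potential vector

mu_back = np.linalg.solve(Lambda, eta)     # mu = Lambda^{-1} eta
Sigma_back = np.linalg.inv(Lambda)         # Sigma = Lambda^{-1}

print(np.allclose(mu, mu_back), np.allclose(Sigma, Sigma_back))  # True True
```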

Gaussian networks. Zeros in the precision matrix Λ indicate missing edges in the log-linear model!
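As an illustration (with arbitrary numbers), a chain X1 - X2 - X3 has a tridiagonal precision matrix, so Λ[0, 2] = 0 even though the covariance Σ = Λ^{-1} is dense:

```python
import numpy as np

# Sketch: for a chain-structured Gaussian network X1 - X2 - X3, the
# precision matrix Lambda is tridiagonal. Lambda[0, 2] = 0 encodes the
# missing edge (X1 is conditionally independent of X3 given X2), while the
# covariance Sigma = Lambda^{-1} is dense (X1 and X3 are still marginally
# dependent). The numerical values are arbitrary but positive definite.

Lambda = np.array([[ 2.0, -0.8,  0.0],
                   [-0.8,  2.0, -0.8],
                   [ 0.0, -0.8,  2.0]])

Sigma = np.linalg.inv(Lambda)
print(Lambda[0, 2])   # 0.0  -> no edge X1 - X3 in the graphical model
print(Sigma[0, 2])    # nonzero -> X1 and X3 are marginally correlated
```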

Inference in Gaussian networks. Marginal distributions can be computed in O(n^3), but for large numbers n of variables this is still intractable. If the Gaussian network has low treewidth, we can use variable elimination / junction tree inference; we only need to be able to multiply and marginalize factors.

Multiplying factors in Gaussians. In canonical form, multiplying two Gaussian factors simply adds their canonical parameters: (η_1, Λ_1) and (η_2, Λ_2) combine to (η_1 + η_2, Λ_1 + Λ_2), after extending each factor to a common set of variables.

Conditioning in canonical form. For a joint distribution over (X_A, X_B) with canonical parameters (η, Λ), conditioning on X_B = x_B gives P(X_A | X_B = x_B) = N(x_A; η_{A|B}, Λ_{A|B}) with Λ_{A|B} = Λ_AA and η_{A|B} = η_A − Λ_AB x_B.

Marginalizing in canonical form. Recall the conversion formulas µ = Λ^{-1} η and Σ = Λ^{-1}. The marginal distribution over X_A has canonical parameters Λ' = Λ_AA − Λ_AB Λ_BB^{-1} Λ_BA and η' = η_A − Λ_AB Λ_BB^{-1} η_B.

Standard vs. canonical form. The cost of marginalization and conditioning differs between the two forms: in standard form, marginalization is easy; in canonical form, conditioning is easy!

Variable elimination. In Gaussian Markov networks, variable elimination = Gaussian elimination (fast for low-bandwidth, i.e. low-treewidth, matrices).

Dynamical models

HMMs / Kalman filters. The most famous graphical models: the naïve Bayes model, the hidden Markov model, and the Kalman filter. Hidden Markov models: speech recognition, sequence analysis in computational biology. Kalman filters: control, cruise control in cars, GPS navigation devices, tracking missiles, ... Very simple models, but very powerful!

HMMs / Kalman filters. A chain of unobserved (hidden) variables X_1, ..., X_T with observations Y_1, ..., Y_T. HMMs: X_i multinomial, Y_i arbitrary. Kalman filters: X_i, Y_i Gaussian distributions. Non-linear KF: X_i Gaussian, Y_i arbitrary.

HMMs for speech recognition. Hidden words X_1, ..., X_T and observed phonemes Y_1, ..., Y_T (e.g., "He ate the cookies on the couch"): infer spoken words from audio signals.

Inference in hidden Markov models. In principle, we can use VE, JT, etc., but new variables X_t, Y_t arrive at each time step, so inference would have to be rerun. Bayesian filtering: suppose we have already computed P(X_t | y_{1:t}); we want to efficiently compute P(X_{t+1} | y_{1:t+1}).

Bayesian filtering. Start with P(X_1). At time t, assume we have P(X_t | y_{1:t-1}). Condition: P(X_t | y_{1:t}). Prediction: P(X_{t+1}, X_t | y_{1:t}). Marginalization: P(X_{t+1} | y_{1:t}).
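A minimal sketch of these three steps for a discrete HMM, with made-up transition and observation matrices:

```python
import numpy as np

# Sketch of one Bayesian filtering step for a discrete HMM: condition on
# the new observation, then predict (multiply by the transition model) and
# marginalize out the old state. T, O, and the belief below are toy values.

T = np.array([[0.9, 0.1],      # T[i, j] = P(X_{t+1} = j | X_t = i)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],      # O[i, k] = P(Y_t = k | X_t = i)
              [0.1, 0.9]])

def filter_step(belief, y):
    """belief = P(X_t | y_{1:t-1}); returns P(X_{t+1} | y_{1:t})."""
    conditioned = belief * O[:, y]          # condition on y_t (unnormalized)
    conditioned /= conditioned.sum()        # P(X_t | y_{1:t})
    return conditioned @ T                  # predict and marginalize over X_t

belief = np.array([0.5, 0.5])               # prior P(X_1)
for y in [0, 0, 1, 1]:                      # a toy observation sequence
    belief = filter_step(belief, y)
print(belief)                                # P(X_5 | y_{1:4})
```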

Parameter learning in HMMs. Assume we have labels for the hidden variables, and assume stationarity: P(X_{t+1} | X_t) is the same over all time steps, and P(Y_t | X_t) is the same over all time steps. This violates parameter independence ("parameter sharing"). Example: compute the parameters of P(X_{t+1} = x' | X_t = x). What if we don't have labels for the hidden variables? Use EM (later in this course).

Kalman filters (Gaussian HMMs). X_1, ..., X_T: location of the object being tracked; Y_1, ..., Y_T: observations. P(X_1): prior belief about the location at time 1. P(X_{t+1} | X_t): motion model (how do I expect my target to move in the environment?), represented as a CLG: X_{t+1} = A X_t + N(0, Σ_M). P(Y_t | X_t): sensor model (what do I observe if the target is at location X_t?), represented as a CLG: Y_t = H X_t + N(0, Σ_O).

Understanding the motion model

Understanding the sensor model

Bayesian filtering for KFs. We can use Gaussian elimination to perform inference in the unrolled model. Start with the prior belief P(X_1). At every time step, maintain the belief P(X_t | y_{1:t-1}). Condition on the observation: P(X_t | y_{1:t}). Predict (multiply in the motion model): P(X_{t+1}, X_t | y_{1:t}). Roll up (marginalize the previous time step): P(X_{t+1} | y_{1:t}).

Implementation. Current belief: P(x_t | y_{1:t-1}) = N(x_t; η_{X_t}, Λ_{X_t}). Multiply in the sensor and motion models, then marginalize.
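For concreteness, here is a sketch of one condition-predict cycle written in the standard (moment) form rather than the canonical form used on the slide; the toy 1-D constant-velocity model (the matrices A, H, Σ_M, Σ_O and the observations) is an assumption for illustration only.

```python
import numpy as np

# Sketch of one Kalman filter cycle in standard (moment) form.
# Motion model: X_{t+1} = A X_t + N(0, S_M); sensor model: Y_t = H X_t + N(0, S_O).
# Toy 1-D constant-velocity tracking example: state = [position, velocity].

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # motion model
S_M = 0.01 * np.eye(2)              # motion noise covariance
H = np.array([[1.0, 0.0]])          # we only observe the position
S_O = np.array([[0.25]])            # observation noise covariance

def kf_step(mu, P, y):
    """One condition + predict cycle: P(X_t | y_{1:t-1}) -> P(X_{t+1} | y_{1:t})."""
    # Condition on the observation y_t
    S = H @ P @ H.T + S_O                         # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    mu = mu + K @ (y - H @ mu)
    P = P - K @ H @ P
    # Predict / roll up to the next time step
    return A @ mu, A @ P @ A.T + S_M

mu, P = np.zeros(2), np.eye(2)                    # prior belief P(X_1)
for y in [np.array([1.1]), np.array([2.0]), np.array([2.9])]:
    mu, P = kf_step(mu, P, y)
print(mu)   # estimated [position, velocity] after three noisy position readings
```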

What if the observations are not linear? Linear observations: Y_t = H X_t + noise. Nonlinear observations: Y_t = g(X_t) + noise for some nonlinear function g.

Incorporating non-Gaussian observations. The nonlinear observation model P(Y_t | X_t) is not Gaussian. First approach: approximate P(Y_t | X_t) as a CLG by linearizing P(Y_t | X_t) around the current estimate E[X_t | y_{1:t-1}]; this is known as the extended Kalman filter (EKF) and can perform poorly if P(Y_t | X_t) is highly nonlinear. Second approach: approximate P(Y_t, X_t) as a Gaussian, which takes the correlation with X_t into account; after obtaining the approximation, condition on Y_t = y_t (now a linear observation).

Finding Gaussian approximations. We need to find a Gaussian approximation of P(X_t, Y_t). How? Gaussians are in the exponential family, so use moment matching: compute the moments E[Y_t], E[Y_t^2], and E[X_t Y_t] by integrating the observation model against the current belief over X_t.

Linearization by integration. We need to integrate the product of a Gaussian with an arbitrary function, which we can do by numerical integration: approximate the integral as a weighted sum over evaluation points, where Gaussian quadrature defines the locations and weights of the points. In one dimension, choosing D points via Gaussian quadrature is exact for polynomials of degree up to 2D − 1; in higher dimensions, exponentially many points are needed to achieve exact evaluation for polynomials. An application of this idea is known as the unscented Kalman filter (UKF).
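A 1-D sketch of the numerical-integration idea using Gauss-Hermite quadrature from numpy (the UKF itself uses its own deterministic choice of sigma points; the quadrature rule here is one standard option); the nonlinear function g and the parameters are arbitrary examples.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

# 1-D sketch of "linearization by integration": approximate E[g(X)] for
# X ~ N(mu, sigma^2) and a nonlinear g by Gauss-Hermite quadrature. After
# the change of variables x = mu + sqrt(2) * sigma * t,
#   E[g(X)] ~= (1/sqrt(pi)) * sum_i w_i * g(mu + sqrt(2) * sigma * t_i).

mu, sigma = 0.5, 1.2
g = np.sin                                  # made-up nonlinear observation function

t, w = hermgauss(10)                        # 10 quadrature points and weights
approx = np.sum(w * g(mu + np.sqrt(2) * sigma * t)) / np.sqrt(np.pi)

# Brute-force Monte Carlo check of the same expectation
samples = np.random.default_rng(0).normal(mu, sigma, size=200_000)
print(approx, g(samples).mean())            # the two should be close
```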

Factored dynamical models. So far: HMMs and Kalman filters. What if we have more than one variable at each time step, e.g., temperature at different locations, or road conditions in a road network? Spatio-temporal models.

Dynamic Bayesian networks. At every time step we have a Bayesian network (e.g., over variables A_t, B_t, C_t, D_t, E_t). The variables at each time step t are called a slice S_t; temporal edges connect slice S_t with S_{t+1}.

Tasks: read Koller & Friedman, Chapters 6.2.3 and 15.1.