
1 Dynamic Bayesian Networks A Whirlwind Tour Johannes Traa Computational Audio Lab, UIUC

2 Sequential data is everywhere: speech waveforms, Bush's approval rating, EEG brain signals, financial trends.

3 What's a DBN?
Dynamic Bayesian Network: a probabilistic graphical model for analyzing time series data.
Also called:
o Time series model
o Dynamic belief network
o State space model (SSM)
Useful for:
o Tracking (e.g. sound source tracking, control systems)
o Prediction (e.g. stock market forecasting, collision prevention)
o Interpolation (e.g. sample recovery in audio/video)
o Sequence classification (e.g. speech recognition)
o Sequence clustering
o And more

4 Roadmap
Some common DBN architectures:
o One layer: Markov model / autoregressive (AR) model
o Two layers: hidden Markov model (HMM) / linear dynamical system (LDS)
o Three layers: switching and factorial HMM/LDS
Problems we can solve for DBNs:
o Evaluation (data likelihood)
o Inference (hidden states): Viterbi algorithm, sequential inference (Kalman and particle filters), variational inference, Gibbs sampling
o Learning (DBN parameters)
o Structure learning (DBN architecture)

5 Common Network Architectures One Layer Two Layer Three Layer

6 Markov Model (Common DBNs)
Graphical Model: z_{t-1} → z_t → z_{t+1}
State transition diagram with transition matrix
A = [ a_11 a_12 a_13 ; a_21 a_22 a_23 ; a_31 a_32 a_33 ]
System equation (discrete state space):
z_t = A z_{t-1}        (DSP people like this)
or  z_t ~ P(A z_{t-1})   (ML people like this)
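A minimal Python/NumPy sketch of sampling a state sequence from such a discrete Markov model; the 3-state transition matrix, initial distribution, and chain length below are made-up illustration values, not taken from the slides.

```python
import numpy as np

def sample_markov_chain(A, pi, T, rng=None):
    """Sample a length-T state sequence from a discrete Markov model.

    A  : (K, K) transition matrix, A[i, j] = P(z_t = j | z_{t-1} = i)
    pi : (K,) initial state distribution
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(pi)
    z = np.empty(T, dtype=int)
    z[0] = rng.choice(K, p=pi)
    for t in range(1, T):
        z[t] = rng.choice(K, p=A[z[t - 1]])
    return z

# Illustrative sticky 3-state chain
A = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
pi = np.array([1.0, 0.0, 0.0])
print(sample_markov_chain(A, pi, T=20))
```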

7 Markov Model (Common DBNs)
State sequence = path through a trellis: the same transition matrix A = [ a_11 a_12 a_13 ; a_21 a_22 a_23 ; a_31 a_32 a_33 ] connects the states between every pair of consecutive time steps t-1, t, t+1, t+2.

8 Markov Model (Common DBNs)
Monophonic piano Markov model:
o State = piano note (1 = high, 2 = low, 3 = middle)
o Time = spectrogram frame
o Transition matrix A and initial distribution (numeric values shown on the slide; meaningless without units)
o Sampled state (note) sequence, and what it sounds like when the notes are played (audio example on the slide)

9 Vector Auto-Regressive (VAR) Model (Common DBNs)
Graphical Model: z_{t-1} → z_t → z_{t+1} (continuous state space)
System equation:
z_t = Σ_{i=1}^{p} A_i z_{t-i} + u_t,   u_t ~ N(0, Σ)
Stacking the p lags gives a first-order (companion) form:
[ z_t ; z_{t-1} ; … ; z_{t-p+1} ] = [ A_1 A_2 … A_p ; I 0 … 0 ; … ; 0 … I 0 ] [ z_{t-1} ; z_{t-2} ; … ; z_{t-p} ] + [ u_t ; 0 ; … ; 0 ]
i.e.  z̃_t = Ã z̃_{t-1} + ũ_t,   or equivalently   z̃_t ~ N(Ã z̃_{t-1}, Σ̃)
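A small NumPy sketch of simulating a VAR(p) process through its companion form; the coefficient matrices and noise covariance below are made-up illustration values.

```python
import numpy as np

def var_companion(A_list):
    """Stack VAR(p) coefficient matrices A_1..A_p into the companion matrix A_tilde."""
    p = len(A_list)
    d = A_list[0].shape[0]
    A_tilde = np.zeros((d * p, d * p))
    A_tilde[:d, :] = np.hstack(A_list)        # top block row: [A_1 A_2 ... A_p]
    A_tilde[d:, :-d] = np.eye(d * (p - 1))    # shifted identity carries the lags
    return A_tilde

def simulate_var(A_list, Sigma, T, rng=None):
    """Simulate z_1..z_T from z_t = sum_i A_i z_{t-i} + u_t, u_t ~ N(0, Sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    p, d = len(A_list), A_list[0].shape[0]
    A_tilde = var_companion(A_list)
    z_stack = np.zeros(d * p)                 # stacked state [z_{t-1}; ...; z_{t-p}]
    out = np.zeros((T, d))
    for t in range(T):
        u = rng.multivariate_normal(np.zeros(d), Sigma)
        z_stack = A_tilde @ z_stack
        z_stack[:d] += u                      # noise only enters the newest block
        out[t] = z_stack[:d]
    return out

# Illustrative 2-D VAR(2)
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
A2 = np.array([[0.2, 0.0], [0.1, 0.2]])
print(simulate_var([A1, A2], Sigma=0.1 * np.eye(2), T=5).round(3))
```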

10 Hidden Markov Model (HMM) / Linear Dynamical System (LDS) (Common DBNs)
Graphical Model: hidden state sequence z_{t-1} → z_t → z_{t+1}; each z_t emits an observation, giving the observation sequence x_{t-1}, x_t, x_{t+1}.
System equations:
z_t = A z_{t-1} + u_t      or   z_t ~ P(A z_{t-1}, Σ_u)
x_t = B z_t + v_t          or   x_t ~ P(B z_t, Σ_v)
Discrete states with (e.g. Gaussian) emissions give the HMM; Gaussian states with Gaussian emissions give the LDS.
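A minimal sketch of sampling from an HMM of this form with 1-D Gaussian emissions; the transition matrix, emission means, and noise scale are made-up illustration values.

```python
import numpy as np

def sample_hmm(A, pi, means, sigma, T, rng=None):
    """Sample (z_1..z_T, x_1..x_T) from an HMM with 1-D Gaussian emissions.

    A     : (K, K) transition matrix, rows sum to 1
    pi    : (K,)  initial state distribution
    means : (K,)  emission mean for each state
    sigma : emission standard deviation
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(pi)
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z[0] = rng.choice(K, p=pi)
    x[0] = rng.normal(means[z[0]], sigma)
    for t in range(1, T):
        z[t] = rng.choice(K, p=A[z[t - 1]])    # hidden Markov chain
        x[t] = rng.normal(means[z[t]], sigma)  # emission given the hidden state
    return z, x

A = np.array([[0.95, 0.05], [0.10, 0.90]])
z, x = sample_hmm(A, pi=[0.5, 0.5], means=[-2.0, 2.0], sigma=0.5, T=10)
print(z, x.round(2))
```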

11 Hidden Markov Model (HMM) (Common DBNs)
Piano HMM:
o State = piano note (3 notes)
o Observation = spectrum
o Time = spectrogram frame
o Transition matrix A and initial distribution (numeric values shown on the slide)
o Sampled state (note) sequence and observation sequence: (next slide)

12 Hidden Markov Model (HMM) (Common DBNs)
o Observation sequence. [Figure: spectrogram, frequency vs. time]

13 Switching HMM/LDS (Common DBNs)
Graphical Model: hidden regime sequence s_{t-1} → s_t → s_{t+1}, hidden state sequence z_{t-1} → z_t → z_{t+1} (each z_t also depends on s_t), and observed sequence x_{t-1}, x_t, x_{t+1}.
System equations:
s_t = C s_{t-1}                     or   s_t ~ P(C s_{t-1})
z_t = A_{s_t} z_{t-1} + u_{t,s_t}   or   z_t ~ P(A_{s_t} z_{t-1}, Σ_{u,s_t})
x_t = B_{s_t} z_t + v_{t,s_t}       or   x_t ~ P(B_{s_t} z_t, Σ_{v,s_t})

14 Switching HMM/LDS (Common DBNs)
Switching HMM = HMM with special structure on the joint (regime, state) space.
Global transition matrix (block (j, k) couples regime j to regime k):
Ã = [ C_11 A_1  C_12 A_2  …  C_1K A_K ;
      C_21 A_1  C_22 A_2  …  C_2K A_K ;
      … ;
      C_K1 A_1  C_K2 A_2  …  C_KK A_K ]
Global emission matrix: B̃ = [ B_1 B_2 … B_K ]
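Assuming the block convention above (block (j, k) of Ã equals C[j, k] · A_k), here is a short NumPy sketch that assembles the global transition matrix of a switching HMM; the regime and per-regime matrices are made-up illustration values.

```python
import numpy as np

def global_transition(C, A_list):
    """Build the joint (regime, state) transition matrix of a switching HMM.

    C      : (K, K) regime transition matrix
    A_list : list of K per-regime state transition matrices, each (M, M)
    Block (j, k) of the result is C[j, k] * A_list[k].
    """
    K = C.shape[0]
    M = A_list[0].shape[0]
    A_tilde = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            A_tilde[j * M:(j + 1) * M, k * M:(k + 1) * M] = C[j, k] * A_list[k]
    return A_tilde

C = np.array([[0.99, 0.01], [0.02, 0.98]])
A1 = np.array([[0.9, 0.1], [0.1, 0.9]])
A2 = np.array([[0.5, 0.5], [0.5, 0.5]])
A_tilde = global_transition(C, [A1, A2])
print(A_tilde.sum(axis=1))   # rows of the joint matrix still sum to 1
```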

15 Switching HMM (Common DBNs)
Piano-violin SHMM:
o Regime = instrument
o State = piano/violin note (8 each)
o Observation = spectrum
o Time = spectrogram frame
o Regime transition matrix C and per-regime state transition matrices A_k (numeric values shown on the slide)

16 Switching HMM (Common DBNs)
o Observation sequence. [Figure: spectrogram, frequency vs. time]

17 Factorial HMM (FHMM) (Common DBNs)
Graphical Model: two (or more) independent hidden state chains, z^1_{t-1} → z^1_t → z^1_{t+1} and z^2_{t-1} → z^2_t → z^2_{t+1}, jointly generating the observed sequence x_{t-1}, x_t, x_{t+1}.
System equations:
z^1_t = A^1 z^1_{t-1}                    or   z^1_t ~ P(A^1 z^1_{t-1})
z^2_t = A^2 z^2_{t-1}                    or   z^2_t ~ P(A^2 z^2_{t-1})
x_t = f(B^1 z^1_t, B^2 z^2_t) + u_t      or   x_t ~ P(f(B^1 z^1_t, B^2 z^2_t), Σ)
The emission density reflects the interaction between the state chains.
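A minimal sketch of sampling from a two-chain FHMM, assuming the additive Gaussian emission f(·) = B^1 z^1_t + B^2 z^2_t that appears later in the talk; all numeric values are made-up illustration choices.

```python
import numpy as np

def sample_fhmm(A_list, pi_list, B_list, Sigma, T, rng=None):
    """Sample from an FHMM with additive Gaussian emissions.

    A_list  : per-chain transition matrices, each (M_k, M_k)
    pi_list : per-chain initial distributions
    B_list  : per-chain emission matrices, each (D, M_k); column m is the
              mean contribution of chain k being in state m
    Sigma   : (D, D) shared emission covariance
    """
    rng = np.random.default_rng() if rng is None else rng
    K, D = len(A_list), B_list[0].shape[0]
    states = [np.empty(T, dtype=int) for _ in range(K)]
    X = np.empty((T, D))
    for t in range(T):
        mean = np.zeros(D)
        for k in range(K):
            if t == 0:
                states[k][t] = rng.choice(len(pi_list[k]), p=pi_list[k])
            else:
                states[k][t] = rng.choice(A_list[k].shape[0], p=A_list[k][states[k][t - 1]])
            mean += B_list[k][:, states[k][t]]   # chains combine additively
        X[t] = rng.multivariate_normal(mean, Sigma)
    return states, X

A = np.array([[0.9, 0.1], [0.1, 0.9]])
B1 = np.array([[0.0, 1.0], [0.0, 0.0]])
B2 = np.array([[0.0, 0.0], [0.0, 1.0]])
states, X = sample_fhmm([A, A], [[0.5, 0.5], [0.5, 0.5]], [B1, B2], 0.01 * np.eye(2), T=5)
print(states, X.round(2))
```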

18 Factorial HMM (FHMM) (Common DBNs)
Factorial HMM = HMM with special structure:
o The global transition matrix over all state pairs is M^2 x M^2
o A joint state sequence = a path through the big trellis of state pairs (1,1), (1,2), …, (3,3) across time steps t-1, t, t+1

19 Factorial HMM (FHMM) (Common DBNs)
o Observation sequence. [Figure: spectrogram, frequency vs. time]

20 Many more DBNs to choose from (Common DBNs)
o N-gram model: discrete version of the AR model
o Gaussian mixture HMM: GMM in the emission
o Gaussian sum LDS: GMM in the state
o Auto-regressive HMM, Input-Output HMM: emissions are inter-dependent
o Hierarchical HMM: each state contains another HMM
o Mixture of HMMs: samples are sequences
o Infinite HMM: number of states/emissions can increase over time
o Non-negative dynamical system: multiplicative noise in the system equations
Figures from Dynamic Bayesian Networks (Murphy)

21 Hierarchical HMM (Common DBNs). Figure from Murphy's thesis.

22 Common DBNs General network structure Components of a DBN: o Nodes (variables): Observed Hidden Parameters o Directed edges (interactions) o Probability distributions (of the variables) Model fully specified by system equations: o (1) State transition dynamics o (2) Emission dynamics o (3) Initial conditions

23 Problems we can solve for DBNs Evaluation Inference Learning

24 We got problems
Evaluation
o What is the likelihood that the data was generated by my DBN?
o Solution: sum-product (forward pass)
Inference
o What is (are) the most likely state sequence(s) given the data?
o Solutions: max-product (e.g. Viterbi), filtering (e.g. Kalman/particle filters), forward-backward (e.g. Kalman smoother), variational methods, sampling methods (e.g. Gibbs, Metropolis-Hastings)
Learning (the same machinery applies for a Bayesian treatment of the parameters)
o What are the most likely parameters given the data?
o Solutions: EM (e.g. Baum-Welch; uses inference as a sub-routine), method of moments (e.g. spectral learning)
Structure learning
o What is the most likely graph given the data?
o Solutions:

25 #1: Evaluation
Compute the data likelihood (recursively):
o Use the conditional independence properties of the directed graph (z_{t-1} → z_t → z_{t+1} with emissions x_t)
o Sum-product (forward pass; average over the states)
P(x_{1:T}) = Σ_{z_{1:T}} P(x_{1:T}, z_{1:T})
           = Σ_{z_{1:T}} P(x_T | z_T) P(z_T | z_{T-1}) P(x_{T-1} | z_{T-1}) ⋯ P(z_2 | z_1) P(x_1 | z_1) P(z_1)
           = Σ_{z_T} P(x_T | z_T) Σ_{z_{T-1}} [ P(z_T | z_{T-1}) P(x_{T-1} | z_{T-1}) Σ_{z_{T-2}} ⋯ Σ_{z_1} [ P(z_2 | z_1) P(x_1 | z_1) P(z_1) ] ]
In matrix form (discrete states): P(x_{1:T}) = 1^T diag(o_T) A diag(o_{T-1}) A ⋯ diag(o_1) π, where o_t is the vector of emission likelihoods P(x_t | z_t = k) and π is the initial state distribution.
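A minimal NumPy sketch of this forward pass for a discrete-state HMM with 1-D Gaussian emissions, with per-step rescaling so that it returns the log-likelihood; the model and data below are made-up illustration values.

```python
import numpy as np
from scipy.stats import norm

def hmm_log_likelihood(x, A, pi, means, sigma):
    """Forward pass (sum-product) for a discrete-state HMM with 1-D Gaussian emissions.

    Returns log P(x_1..x_T). alpha is rescaled at every step to avoid underflow.
    """
    T = len(x)
    loglik = 0.0
    alpha = np.asarray(pi, dtype=float)
    for t in range(T):
        o_t = norm.pdf(x[t], loc=means, scale=sigma)   # emission likelihoods P(x_t | z_t = k)
        if t > 0:
            alpha = A.T @ alpha                        # predict: sum over previous states
        alpha = alpha * o_t                            # correct: weight by the emission
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c                                     # rescale (keeps numbers well-behaved)
    return loglik

A = np.array([[0.95, 0.05], [0.10, 0.90]])
x = np.array([-2.1, -1.9, 1.8, 2.2])
print(hmm_log_likelihood(x, A, pi=np.array([0.5, 0.5]), means=np.array([-2.0, 2.0]), sigma=0.5))
```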

26 #2: Inference
Find the most likely state sequence, or find the posterior distribution of the state sequence (hard vs. soft decision). Figure from Murphy's thesis.

27 Off-line inference for the HMM
Off-line Viterbi algorithm:
o Find the most likely state sequence
o Two passes on the chain: max-product (forward pass; remember the most likely ancestors), then back-track (re-trace the steps of the ancestors)
ẑ_{1:T} = argmax_{z_{1:T}} P(x_{1:T}, z_{1:T})
P(x_{1:T}, ẑ_{1:T}) = max_{z_T} P(x_T | z_T) max_{z_{T-1}} [ P(z_T | z_{T-1}) P(x_{T-1} | z_{T-1}) max_{z_{T-2}} ⋯ [ max_{z_1} P(z_2 | z_1) P(x_1 | z_1) P(z_1) ] ]
The sum of the evaluation recursion is replaced with a max.
o Observations only enter as likelihoods: the emission model can be a GMM, a neural network, etc.
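A compact Viterbi sketch in log space for the same kind of Gaussian-emission HMM; the model and data are made-up illustration values.

```python
import numpy as np
from scipy.stats import norm

def viterbi(x, A, pi, means, sigma):
    """Most likely state sequence of a discrete-state HMM with 1-D Gaussian emissions."""
    T, K = len(x), len(pi)
    logA, logpi = np.log(A), np.log(pi)
    delta = np.zeros((T, K))                # best log joint prob ending in state k at time t
    psi = np.zeros((T, K), dtype=int)       # best ancestor of state k at time t
    logobs = norm.logpdf(x[:, None], loc=means, scale=sigma)   # (T, K) emission log-likelihoods
    delta[0] = logpi + logobs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # (previous state, next state)
        psi[t] = scores.argmax(axis=0)             # remember the most likely ancestors
        delta[t] = scores.max(axis=0) + logobs[t]  # max-product forward pass
    z = np.empty(T, dtype=int)
    z[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                 # back-track through the ancestors
        z[t] = psi[t + 1, z[t + 1]]
    return z

A = np.array([[0.95, 0.05], [0.10, 0.90]])
x = np.array([-2.1, -1.9, 1.8, 2.2])
print(viterbi(x, A, pi=np.array([0.5, 0.5]), means=np.array([-2.0, 2.0]), sigma=0.5))
```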

28 Sequential inference: filtering
Basic idea: use the system model and the observations to maintain a probabilistic estimate of the state.
Simple LDS:
o The object tends to move in straight lines at constant speed
o The measurement is the state plus noise
The Kalman filter cycles between a predict step (prior to prediction) and a correct step (prediction plus observation to posterior).

29 Sequential inference: filtering
Filtering equations:
Predict (the posterior at time t-1 becomes the prior at time t):
P(z_t | x_{1:t-1}) = ∫ P(z_t, z_{t-1} | x_{1:t-1}) dz_{t-1} = ∫ P(z_t | z_{t-1}) P(z_{t-1} | x_{1:t-1}) dz_{t-1}   (state transition × previous posterior)
Correct:
P(z_t | x_{1:t}) = P(z_t | x_t, x_{1:t-1}) = P(x_t | z_t, x_{1:t-1}) P(z_t | x_{1:t-1}) / P(x_t | x_{1:t-1}) ∝ P(x_t | z_t) P(z_t | x_{1:t-1})   (emission × prior at time t)

30 Filtering for the LDS: Kalman Filter
How can we track a noisy sinusoid?
o Brownian motion model: z_t = z_{t-1} + u_t, u_t ~ N(0, σ_u²);  x_t = z_t + v_t, v_t ~ N(0, σ_v²)
  Issue: the tracking will lag behind (wrong dynamics model)
o Sinusoidal emission model: z_t = a z_{t-1} + u_t, u_t ~ N(0, σ_u²);  x_t = sin(z_t) + v_t, v_t ~ N(0, σ_v²)
  Issue: non-linear dynamics (intractable filtering equations)

31 Filtering for the LDS: Kalman Filter
System equations: the state describes a rotating vector.
[z_{t,1}; z_{t,2}] = [ cos(θ) -sin(θ) ; sin(θ) cos(θ) ] [z_{t-1,1}; z_{t-1,2}] + [u_{t,1}; u_{t,2}],   u_t ~ N(0, Σ_u)
x_t = [1 0] [z_{t,1}; z_{t,2}] + v_t,   v_t ~ N(0, σ_v²)
The observation is the sinusoid value. We can apply the Kalman filter to a system with nonlinear-looking behavior by choosing the model wisely.
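A minimal Kalman filter sketch for this rotating-vector model; the angular frequency, noise levels, and initial state are made-up illustration values.

```python
import numpy as np

def kalman_filter(x, A, C, Q, R, mu0, P0):
    """Standard Kalman filter: returns the filtered means E[z_t | x_1..x_t]."""
    mu, P = mu0.copy(), P0.copy()
    means = []
    for xt in x:
        # Predict
        mu, P = A @ mu, A @ P @ A.T + Q
        # Correct
        S = C @ P @ C.T + R                    # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)         # Kalman gain
        mu = mu + K @ (np.atleast_1d(xt) - C @ mu)
        P = (np.eye(len(mu)) - K @ C) @ P
        means.append(mu.copy())
    return np.array(means)

theta = 2 * np.pi * 0.02                       # rotation per sample
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
C = np.array([[1.0, 0.0]])                     # observe the first coordinate (the sinusoid)
Q, R = 1e-4 * np.eye(2), np.array([[0.1]])

t = np.arange(200)
clean = np.sin(theta * t)
noisy = clean + np.sqrt(R[0, 0]) * np.random.default_rng(0).normal(size=len(t))
filtered = kalman_filter(noisy, A, C, Q, R, mu0=np.array([0.0, 1.0]), P0=np.eye(2))
print(filtered[:5, 0].round(3))                # filtered estimate of the sinusoid
```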

32 Filtering for the LDS: Kalman Filter
Optimal filtering for the sinusoidal model: the filter model matches the LDS. [Figure: clean sinusoid, measurement, filtered state]

33 Filtering for the LDS: Kalman Filter
The filter assumes the Brownian-motion LDS: no knowledge of the sinusoidal behavior causes lag. [Figure: clean sinusoid, measurement, filtered state]

34 Filtering for the LDS: Kalman Filter
Correct model assumption, but the state noise covariance is over-estimated: the filter pays most attention to the noisy measurements. [Figure: clean sinusoid, measurement, filtered state]

35 Filtering for the LDS: Kalman Filter
Correct model assumption, but the observation noise variance is over-estimated: the filter pays most attention to the state transition prior. [Figure: clean sinusoid, measurement, filtered state]

36 Filtering for the LDS: Kalman Filter
Harmonic model for a piano note:
o State = multiple decaying rotating vectors (stacked)
o Emission = sum of harmonically-related sinusoids
z̃_t = blkdiag(a_1 R_1, a_2 R_2, …, a_M R_M) z̃_{t-1} + ũ_t,   ũ_t ~ N(0, σ_u² I)
y_t = [1 0 1 0 ⋯ 1 0] z̃_t + v_t,   v_t ~ N(0, σ_v²)
where R_m rotates at the m-th harmonic frequency and a_m < 1 is a per-harmonic decay factor.
[Figure: clean note, noisy note, filtered (denoised) note]

37 Intractable filter equations (Filtering)
What if the system equations are crazy?
o Discretization: partition the state space into cells
o Linearization: approximate nonlinearities via a 1st-order Taylor series expansion; Extended Kalman filter (EKF)
o Moment-matching: approximate the filtered distribution with a Gaussian; Unscented Kalman filter (UKF), Switching Kalman Filter (SKF)
o Variational approximations (deterministic): approximate the DBN by breaking edges in the graph
o Gibbs sampling (stochastic): approximate inference with local sampling; very general, very awesome
o Particle filter (PF) (stochastic): approximate the #&@% filtered distribution with point masses (sequential importance sampling)

38 Switching Kalman Filter. Example: cockroach tracking
Switching LDS:
s_t ~ Mult(C s_{t-1})
z_t ~ N(A_{s_t} z_{t-1}, Σ_{s_t})
x_t ~ N(B z_t, Σ_{s_t})
State: z = [x, dx/dt, y, dy/dt]^T.
Switch values correspond to: (1) stay still, (2) Brownian motion, (3) ~constant velocity, (4) sudden dash.
Σ_k = σ_k² I, with σ_k² increasing across the four regimes (numeric values of C, A_k, B, and σ_k² shown on the slide).

39 Switching Kalman Filter. Example: cockroach tracking
Switching Kalman filter (aka mixture of Kalman filters): for each switch value, predict and correct the state given the observation, then collapse the resulting mixture before the next predict/correct cycle. [Figure: switch, state, and observation diagram]

40 Switching Kalman Filter. Example: cockroach tracking. [Figure: tracking results]

41 Sequential Inference for DBNs: Particle Filtering
Useful when:
o The filtering equations are analytically intractable
o Linear/Gaussian approximations fail
o Computation power is plentiful
Basic idea:
o Replace the filtered state distribution with weighted point masses
o Particle = guess for the state; weight = confidence in the guess
o As L → ∞, the approximation becomes perfect
o State statistics (μ, Σ, etc.) are easily computed from the particles
p̂(z_t | x_{1:t}) = Σ_{l=1}^{L} ω_t^(l) δ(z_t - z_t^(l))  →  P(z_t | x_{1:t})  as  L → ∞

42 Importance Sampling (Particle Filter)
o Tricky integral (an expectation with respect to a complicated distribution P(x))
o Approximate P(x) with a proposal Q(x); the weight compensates for the mismatch
o Sample from Q(x) to estimate the integral:
∫ f(x) P(x) dx = ∫ f(x) [P(x)/Q(x)] Q(x) dx = ∫ f(x) w(x) Q(x) dx ≈ Σ_{l=1}^{L} f(x^l) w(x^l),   x^l ~ Q(x)   (with normalized weights)
In the filtering setting, f plays the role of the filtered state distribution and the weights come from the state transition/emission densities.
o Apply this sequentially and we have the particle filter.
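A small NumPy sketch of self-normalized importance sampling; the target, proposal, and test function are made-up illustration choices (a Gaussian target rather than the von Mises example on the next slide).

```python
import numpy as np
from scipy.stats import norm

def importance_sampling_expectation(f, log_p, proposal_rvs, proposal_logpdf, L, rng):
    """Estimate E_P[f(x)] with samples from a proposal Q, using self-normalized weights."""
    x = proposal_rvs(L, rng)                  # x^l ~ Q(x)
    log_w = log_p(x) - proposal_logpdf(x)     # unnormalized log weights P/Q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                              # normalized weights
    return np.sum(w * f(x))

rng = np.random.default_rng(0)
# Target P: N(1, 0.5^2); proposal Q: a wider N(0, 2^2); f(x) = x^2
est = importance_sampling_expectation(
    f=lambda x: x ** 2,
    log_p=lambda x: norm.logpdf(x, loc=1.0, scale=0.5),
    proposal_rvs=lambda L, rng: rng.normal(0.0, 2.0, size=L),
    proposal_logpdf=lambda x: norm.logpdf(x, loc=0.0, scale=2.0),
    L=5000, rng=rng)
print(est)   # should be close to E[x^2] = 1^2 + 0.5^2 = 1.25
```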

43 Importance Sampling (Particle Filter)
Example: the n-th moment of a von Mises random variable.
o We want: E[x^n] = ∫ x^n vM(x; μ, κ) dx
o Monte Carlo estimate: E[x^n] ≈ (1/L) Σ_{l=1}^{L} (x^l)^n with x^l ~ vM(x; μ, κ), but we can't sample from the von Mises directly.
o Instead, find a similar wrapped Gaussian WN(x; μ, σ²) (the closer, the better) and approximate the integral with importance sampling:
E[x^n] ≈ Σ_{l=1}^{L} w^l (x^l)^n,   x^l ~ WN(x; μ, σ²),   w^l ∝ vM(x^l; μ, κ) / WN(x^l; μ, σ²)

44 Importance Sampling (Particle Filter)
Average absolute error in the mean estimate (L = 10), comparing the wrapped-Gaussian and uniform proposals for several concentrations κ of the von Mises target (up to κ = 3). [Table values shown on the slide]

45 Sequential Importance Sampling (Particle Filter)
o Merged filtering equations:
P(z_t | x_{1:t}) ∝ P(x_t | z_t) ∫ P(z_t | z_{t-1}) P(z_{t-1} | x_{1:t-1}) dz_{t-1}
o Approximate with importance sampling:
P(z_t | x_{1:t}) ≈ Σ_{l=1}^{L} w_t^l δ(z_t - z_t^l),   z_t^l ~ Q(z_t | z_{t-1}^l, x_t),   w_t^l ∝ [ P(x_t | z_t^l) P(z_t^l | z_{t-1}^l) / Q(z_t^l | z_{t-1}^l, x_t) ] w_{t-1}^l
o The optimal Q is typically hard to sample from.
o A common choice is the transition density: Q(z_t^l | z_{t-1}^l, x_t) = P(z_t^l | z_{t-1}^l), which gives w_t^l ∝ P(x_t | z_t^l) w_{t-1}^l
o We shouldn't ignore the emission density: use an EKF/UKF to approximate the optimal Q.

46 Basic Particle Filter
o Prior state representation: a set of weighted particles
o Predict: propagate the particles through the transition distribution
o Correct: update the weights via the emission density
o Compute statistics before resampling
o Resample: draw fresh particles from the updated set
o Posterior state representation
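A minimal bootstrap particle filter sketch for the noisy-sinusoid model from slide 30, using the transition density as the proposal so the weights reduce to the emission likelihood; all numeric values are made-up illustration choices.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_particle_filter(x, a, sigma_u, sigma_v, L, rng):
    """Bootstrap PF for z_t = a*z_{t-1} + u_t, x_t = sin(z_t) + v_t (illustrative model)."""
    particles = rng.normal(0.0, 1.0, size=L)     # prior state representation
    means = []
    for xt in x:
        # Predict: propagate particles through the transition distribution
        particles = a * particles + rng.normal(0.0, sigma_u, size=L)
        # Correct: weight by the emission density
        w = norm.pdf(xt, loc=np.sin(particles), scale=sigma_v)
        w /= w.sum()
        means.append(np.sum(w * particles))      # compute statistics before resampling
        # Resample: draw fresh particles from the updated (weighted) set
        particles = rng.choice(particles, size=L, p=w)
    return np.array(means)

rng = np.random.default_rng(0)
T, a, sigma_u, sigma_v = 100, 1.0, 0.05, 0.2
z = np.cumsum(rng.normal(0.0, sigma_u, size=T))  # simulate a phase trajectory
x = np.sin(z) + rng.normal(0.0, sigma_v, size=T)
print(bootstrap_particle_filter(x, a, sigma_u, sigma_v, L=500, rng=rng)[:5].round(3))
```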

47 Particle filtering (Particle Filter). Example: multiple DBN tracking with scrambled observations
∀ i ∈ [1, …, K]:   z^i_t = f(z^i_{t-1}, u^i_t),   x^i_t = g(z^i_t, v^i_t),   X_t = { x^1_t, …, x^K_t }
o Multiple DBNs are active, each DBN emits an observation, and we observe a permuted bag of emissions
o A nightmare to invert the generative model, but easy with a PF
o Particle filter with a mixture model in the state; Probabilistic Data Association (PDA): observation-to-cluster association and particle-to-cluster association

48 Particle filtering (Particle Filter). Example: multiple DBN tracking with scrambled observations
Multiple LDS tracking:
o The GMM captures the multimodal state distribution
o The Gaussians hold the particles in tight clusters
o Collisions are handled gracefully by the probabilistic assignments

49 Particle filtering (Particle Filter). Example: grasshopper tracking
LDS with indicator functions!
z_t = Φ(A z_{t-1} + u_t),   u_t ~ N(0, σ_u² I)
x_t = C z_t + v_t,   v_t ~ N(0, σ_v² I)
State: z = [x, dx/dt, y, dy/dt]^T; Φ(z) is a bounce function; C picks out the observed coordinates (matrix values shown on the slide).
Thanks to Taylan Cemgil (for the grasshopper DBN).

50 Particle filtering (Particle Filter). Example: grasshopper tracking. [Figure: tracking results]

51 Variational Inference
Basic idea:
o Inference is too hard (intractable, expensive)
o Approximate DBN_true with DBN_simple
o Set the (variational) parameters of DBN_simple to match DBN_true
o Amounts to breaking edges in the graphical model
The math:
o True joint distribution of observed and hidden variables: P(X, Z)
o Variational approximation of the joint of the hidden nodes (implies broken edges in the graph): Q(Z) = Π_{i=1}^{M} Q_i(Z_i)   (called the product density transform in statistics)

52 Variational Inference
The math (continued):
o For inference, we want to maximize the data likelihood:
ln P(X) = L(Q) + KL(Q ‖ P)
L(Q) = ∫ Q(Z) ln [ P(X, Z) / Q(Z) ] dZ         (lower bound)
KL(Q ‖ P) = - ∫ Q(Z) ln [ P(Z | X) / Q(Z) ] dZ   (extra stuff)
o The optimal variational distribution is the posterior (it gives the best lower bound): Q(Z) = P(Z | X)
o Doable when fitting a model with EM (ML estimate): ln P(X) ≥ ∫ Q(Z) ln P(X, Z) dZ + H(Q); only the first term depends on the parameters, and with Q(Z) = P(Z | X) it becomes ∫ P(Z | X) ln P(X, Z) dZ, the EM Q function.

53 Variational Inference
1-D, 3-component GMM: log likelihood and bounds given the current estimate of the 2nd mean. [Figure: log probability vs. 2nd mean value; curves: log likelihood, EM lower bound, variational lower bound]

54 Variational Inference
Variational lower bound:
L(Q) ∝ ∫ Q_j(Z_j) ln [ e^{E_{i≠j}[ln P(X, Z)]} / Q_j(Z_j) ] dZ_j    (expectation with respect to the product of the other factors)
The best bound is obtained when:  Q_j(Z_j) ∝ e^{E_{i≠j}[ln P(X, Z)]}
The j-th factor depends on all the others, so cycle through them:
for j = 1 : M
    Q_j(Z_j) ∝ e^{E_{i≠j}[ln P(X, Z)]}
end
If we use a conjugate prior for Z_j, Q_j has the form of the corresponding posterior.
Variational inference = setting the parameters of the Q_j's. Do this in the E step for variational EM.

55 Variational EM for the GMM
GMM with a Dirichlet prior on the weights:
o Joint distribution of all variables:
P(X, Z, π; μ, Σ, α) = P(X | Z; μ, Σ) P(Z | π) P(π; α)
o Variational factorization: Q(Z, π) = Q_1(Z) Q_2(π)
o E step (coupled variational parameters; iterate):
ln Q_1(Z) ∝ E_π[ ln P(X | Z; μ, Σ) + ln P(Z | π) ]
ln Q_2(π) ∝ E_Z[ ln P(Z | π) + ln P(π; α) ]
Q_1(z_i = j) = π̃_j N(x_i | μ_j, Σ_j) / Σ_k π̃_k N(x_i | μ_k, Σ_k),   π̃ = e^{E[ln π]}
Q_2(π) = Dir(π; α̃),   α̃ = α + Σ_i E[z_i]
o M step: regular EM update for μ, Σ.

56 Variational EM for the GMM
GMM with a Dirichlet prior on the weights:
o E step: update the variational assignment parameters, then update the variational mixing-weight parameters
o M step: update the model parameters μ, Σ
[Figure: graphical models for each update]
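A compact sketch of this variational EM loop for a 1-D GMM with a Dirichlet prior on the mixing weights and point estimates of the means and variances (as in the slides); the scalar-variance parameterization, hyperparameters, and data are my own illustrative simplifications.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import norm

def variational_em_gmm(x, K, alpha0=1.0, n_iter=50, rng=None):
    """Variational EM for a 1-D GMM: Dirichlet prior on weights, point estimates of means/variances."""
    rng = np.random.default_rng() if rng is None else rng
    mu = rng.choice(x, size=K, replace=False)      # initialize means at random data points
    var = np.full(K, x.var())
    alpha = np.full(K, alpha0)
    for _ in range(n_iter):
        # E step (coupled variational updates; one pass each per iteration)
        log_pi_tilde = digamma(alpha) - digamma(alpha.sum())    # E[ln pi_j]
        log_r = log_pi_tilde + norm.logpdf(x[:, None], loc=mu, scale=np.sqrt(var))
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)          # Q1(z_i = j), the responsibilities
        alpha = alpha0 + r.sum(axis=0)             # Q2(pi) = Dir(alpha_tilde)
        # M step: regular EM update for the means and variances
        Nk = r.sum(axis=0) + 1e-12
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    return mu, var, alpha

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 0.5, 300), rng.normal(2, 0.8, 300)])
mu, var, alpha = variational_em_gmm(x, K=5, alpha0=0.1, rng=rng)   # sparse prior shrinks unused components
print(mu.round(2), (alpha / alpha.sum()).round(2))
```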

57 Variational EM for the GMM. [Figure: samples; fit with 20 Gaussians]

58 Variational EM for the GMM, with a sparse Dirichlet prior on the weights. [Figure: samples; fit with 20 Gaussians]

59 Variational Inference for the FHMM
Full DBN: K hidden chains z^1, …, z^K and observations x_t.
Joint distribution of all variables:
P(X, Z^{1:K}; A^{1:K}, π^{1:K}, θ^{1:K}) = [ Π_{k=1}^{K} P(z^k_1; π^k) Π_{t=2}^{T} P(z^k_t | z^k_{t-1}; A^k) ] Π_{t=1}^{T} P(x_t | z^{1:K}_t; θ^{1:K})

60 Variational Inference for the FHMM
Gaussian emission model with additive means (Ghahramani & Jordan '97):
P(x_t | z^{1:K}_t) = N( x_t | Σ_{k=1}^{K} B^k z^k_t, Σ )
where B^k is the matrix of means for the k-th chain and Σ is a shared covariance.

61 Variational Inference for the FHMM
Fully factored variational approximation: Q(Z) = Π_{k=1}^{K} Π_{t=1}^{T} Q(z^k_t). The factors turn out to be multinomial.

62 Variational Inference for the FHMM
Induced variational parameters and dependencies:
o The hidden variables are de-coupled
o Variational parameters induced by the factorization (they act as means)
o The parameters are locally coupled
o Iterate: update the variational parameters using their neighbors

63 Variational Inference for the FHMM
Update the variational parameters. [Figure: variational parameter update diagram for the two chains]

64 Variational Inference for the FHMM
Structured variational approximation: Q(Z) = Π_{k=1}^{K} Q(z^k_1) Π_{t=2}^{T} Q(z^k_t | z^k_{t-1})
Choose these factors to be HMM-ish: an initial probability at t = 1 and a transition probability for t > 1.

65 Variational Inference for the FHMM
Induced variational parameters and dependencies:
o The hidden variables are de-coupled between chains
o Variational parameters induced by the factorization (they act as likelihoods)
o The parameters are locally coupled
o Iterate: run forward-backward on each chain using the fake likelihoods, then update the variational parameters using the other chains' posteriors

66 Variational Inference for the FHMM
Forward-backward on each chain, then update the variational parameters. [Figure: per-chain forward-backward and variational parameter update diagram]

67 Gibbs sampling
Basic idea:
o Exact inference is hard (takes too long, the math is #&@%)
o Any distribution can be described by its samples, so approximate inference by sampling
o Sampling from the full posterior is hard:  Z^s ~ P(Z | X),  where Z = { latent variables, parameters }
o Iteratively draw from the local conditionals instead:
for s = 1 : S            (samples)
    for j = 1 : J        (variables)
        Z^s_j ~ P(Z_j | Z^s_{-j}, X)     (draw from the conditional, keeping the other unknowns fixed)
    end
end
o The samples eventually resemble draws from the full posterior.

68 Gibbs EM for the GMM
GMM with a Dirichlet prior on the weights:
o Joint distribution of all variables: P(X, Z, π; μ, Σ, α) = P(X | Z; μ, Σ) P(Z | π) P(π; α)
o Gibbs conditionals (coupled; iterate):
∀ i ∈ [1, N]:  Z^s_i ~ P(Z_i | π^s, X_i) ∝ P(X_i | Z_i) P(Z_i | π^s)
π^s ~ P(π | Z^s) ∝ P(Z^s | π) P(π)
o E step:
Z^s_i ~ Mult(γ_i),   γ_ij = π^s_j N(X_i; μ_j, Σ_j) / Σ_{k=1}^{K} π^s_k N(X_i; μ_k, Σ_k)
π^s ~ Dir(α̃),   α̃ = α + Σ_{i=1}^{N} Z^s_i
o M step: regular EM update for μ, Σ.

69 Gibbs EM for the GMM
GMM with a Dirichlet prior on the weights:
o Conditional sampling = the updates only involve variables in the Markov blanket
o E step: sample the assignments, then sample the mixing weights
o M step: update the parameters μ, Σ
[Figure: graphical models for each update]
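A small sketch of this Gibbs-within-EM scheme for a 1-D GMM, with one Gibbs sweep over assignments and mixing weights per E step; the hyperparameters and data are made-up illustration values.

```python
import numpy as np
from scipy.stats import norm

def gibbs_em_gmm(x, K, alpha0=1.0, n_iter=100, rng=None):
    """Gibbs EM for a 1-D GMM: sample assignments and weights, point-update means/variances."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(x)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step, Gibbs part 1: sample assignments from their conditionals
        logp = np.log(pi) + norm.logpdf(x[:, None], loc=mu, scale=np.sqrt(var))
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[i]) for i in range(N)])
        # E step, Gibbs part 2: sample mixing weights given the assignments
        counts = np.bincount(z, minlength=K)
        pi = rng.dirichlet(alpha0 + counts)
        # M step: EM-style update of means and variances from the sampled assignments
        for k in range(K):
            if counts[k] > 0:
                mu[k] = x[z == k].mean()
                var[k] = x[z == k].var() + 1e-6
    return mu, var, pi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(2, 0.8, 200)])
print(*(a.round(2) for a in gibbs_em_gmm(x, K=2, rng=rng)))
```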

70 Gibbs inference for the FHMM
Possibly slow convergence of the Gibbs samples to the posterior:
o Strong correlations between state variables
o Gibbs draws samples along individual coordinates of the Z space
o In practice, it's quite fast
Sample the 1st chain's states, then the 2nd chain's states, for t = 1 : T.

71 Audio Source Separation with the FHMM
Set-up:
o 10 variational/Gibbs iterations per frame, M^K/5 particles
o Binary masks = max of the spectra for the most likely state combination: m^k_t = I( B^k ẑ^k_t > B^{k'} ẑ^{k'}_t ),  k' ≠ k
Piano-violin FHMM:
o 12 notes each
o Optimal / variational / Gibbs inference
o The particle filter is bad: the basic proposal gives bad tracking, and the optimal proposal is too slow
o [Audio examples: mix, optimal, basic PF]
Speech-speech FHMM:
o 30 speech bases each (pre-trained)
o Optimal / variational / Gibbs / PF: not bad!
o [Audio examples: mix, optimal, solo, Viterbi]

72 #3: Learning
Expectation-Maximization (e.g. Baum-Welch):
o Find the parameters that maximize the data likelihood: θ̂ = argmax_θ P(x_{1:T}; θ)
o If inference (the E step) is too hard, use variational methods or sampling methods
Method of moments (e.g. spectral learning):
o Express the moments in terms of the parameters and solve the non-linear system of equations f(θ) ≈ m

73 Baum-Welch for the HMM (#3: Learning)
Find the most likely parameters given the data:
θ̂ = argmax_θ P(x_{1:T}; θ)
  = argmax_θ Σ_{z_{1:T}} P(x_{1:T}, z_{1:T}; θ)
  = argmax_{A,O,π} Σ_{z_{1:T}} P(z_1; π) Π_{t=2}^{T} P(z_t | z_{t-1}; A) Π_{t=1}^{T} P(x_t | z_t; O)
This is hard, so use EM. Both the states and the parameters are unknown:
o Given the parameters, estimate the states (inference)
o Given the states, estimate the parameters
o Iterate

74 Baum-Welch for the HMM (#3: Learning)
E step: inference with the forward-backward algorithm. Estimate the hidden state probabilities (posteriors):
α(z_t) = P(x_{1:t}, z_t)
β(z_t) = P(x_{t+1:T} | z_t)
P(z_t | x_{1:T}) ∝ α(z_t) β(z_t)
P(z_{t-1}, z_t | x_{1:T}) ∝ α(z_{t-1}) P(x_t | z_t) P(z_t | z_{t-1}) β(z_t)
M step: update the parameters with weighted averages over the data, where the weights are the posteriors.
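A compact Baum-Welch sketch for a Gaussian-emission HMM, with a scaled forward-backward pass in the E step and posterior-weighted averages in the M step; the data and initialization are made-up illustration choices.

```python
import numpy as np
from scipy.stats import norm

def baum_welch(x, K, n_iter=30, rng=None):
    """EM (Baum-Welch) for an HMM with 1-D Gaussian emissions. Returns (A, pi, means, sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    A = np.full((K, K), 1.0 / K)
    pi = np.full(K, 1.0 / K)
    means = rng.choice(x, size=K, replace=False)
    sigma = x.std()
    for _ in range(n_iter):
        obs = norm.pdf(x[:, None], loc=means, scale=sigma)   # (T, K) emission likelihoods
        # E step: scaled forward-backward
        alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
        alpha[0] = pi * obs[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (A.T @ alpha[t - 1]) * obs[t]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (obs[t + 1] * beta[t + 1]) / c[t + 1]
        gamma = alpha * beta                                  # P(z_t | x_{1:T})
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = (alpha[:-1, :, None] * A[None] *                 # P(z_{t-1}, z_t | x_{1:T})
              (obs[1:] * beta[1:])[:, None, :] / c[1:, None, None])
        # M step: posterior-weighted updates
        pi = gamma[0]
        A = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
        means = (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)
        sigma = np.sqrt((gamma * (x[:, None] - means) ** 2).sum() / T)
    return A, pi, means, sigma

rng = np.random.default_rng(0)
# Simulate a 2-state chain with well-separated Gaussian emissions
z = np.zeros(300, dtype=int)
for t in range(1, 300):
    z[t] = z[t - 1] if rng.random() < 0.95 else 1 - z[t - 1]
x = np.where(z == 0, -2.0, 2.0) + 0.5 * rng.normal(size=300)
A, pi, means, sigma = baum_welch(x, K=2, rng=rng)
print(A.round(2), means.round(2), round(float(sigma), 2))
```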

75 Method of moments #3: Learning Say what?

76 #4: Structure Learning #4: Structure Learning Say what?

77 References
Books:
o Pattern Recognition and Machine Learning, Christopher Bishop, 2006
o Probabilistic Reasoning over Time, Stuart Russell and Peter Norvig, Chapter 15 in Artificial Intelligence: A Modern Approach, 2009
o Machine Learning: A Probabilistic Perspective, Kevin Murphy, 2013
o Bayesian Reasoning and Machine Learning, David Barber, 2013
Thesis:
o Dynamic Bayesian Networks: Representation, Inference and Learning, Kevin Murphy, 2002
Papers:
o Factorial Hidden Markov Models, Zoubin Ghahramani and Michael Jordan, Machine Learning, 1997
o An Introduction to Hidden Markov Models and Bayesian Networks, Zoubin Ghahramani, Journal of Pattern Recognition and AI, 2001
o A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking, Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp, IEEE Transactions on Signal Processing, 2002
o An Introduction to the Kalman Filter, Greg Welch and Gary Bishop, 2006
o Graphical Models for Time Series, David Barber and Taylan Cemgil, IEEE Signal Processing Magazine, 2010
