Probabilistic & Unsupervised Learning. Latent Variable Models for Time Series


1 Probabilistic & Unsupervised Learning: Latent Variable Models for Time Series. Maneesh Sahani, Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London. Term 1, Autumn 2016.

2-7 Modeling time series

Thus far, our data have been (marginally) iid. Now consider a sequence of observations

    x_1, x_2, x_3, ..., x_t

that are not independent. Examples:

- Sequences of images
- Stock prices
- Recorded speech or music, English sentences
- Kinematic variables in a robot
- Sensor readings from an industrial process
- Amino acids, DNA, etc.

Goal: to build a joint probabilistic model of the data, p(x_1, ..., x_t), which lets us:

- Predict p(x_t | x_1, ..., x_{t-1})
- Detect abnormal/changed behaviour (if p(x_t, x_{t+1}, ... | x_1, ..., x_{t-1}) is small)
- Recover underlying/latent/hidden causes linking the entire sequence

8-10 Markov models

In general:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_t | x_1, x_2, ..., x_{t-1})

First-order Markov model:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_t | x_{t-1})

[graphical model: chain x_1 -> x_2 -> x_3 -> ... -> x_T]

The term "Markov" refers to a conditional independence relationship. In this case, the Markov property is that, given the present observation (x_t), the future (x_{t+1}, ...) is independent of the past (x_1, ..., x_{t-1}).

Second-order Markov model:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_{t-1} | x_{t-3}, x_{t-2}) P(x_t | x_{t-2}, x_{t-1})
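The first-order factorization above is easy to exercise numerically. Below is a minimal sketch (the two-state transition matrix and initial distribution are illustrative values, not from the notes) that samples a chain and evaluates the factorized joint probability:

```python
import numpy as np

# A first-order Markov chain over K discrete symbols is fully specified by
# an initial distribution p1 and a transition matrix T, where
# T[i, j] = P(x_{t+1} = j | x_t = i).  (Illustrative values.)
rng = np.random.default_rng(0)
p1 = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def sample_chain(p1, T, length):
    """Draw x_1 ~ p1, then x_{t+1} ~ T[x_t, :] at each step."""
    x = [rng.choice(len(p1), p=p1)]
    for _ in range(length - 1):
        x.append(rng.choice(T.shape[1], p=T[x[-1]]))
    return np.array(x)

def log_joint(x, p1, T):
    """log P(x_1, ..., x_t) = log P(x_1) + sum_t log P(x_t | x_{t-1})."""
    return np.log(p1[x[0]]) + np.log(T[x[:-1], x[1:]]).sum()

x = sample_chain(p1, T, 1000)
print(log_joint(x, p1, T))
```

Working in log probabilities keeps the product of many factors from underflowing.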

11 Causal structure and latent variables

[graphical model: latent chain y_1 -> y_2 -> ... -> y_T, each y_t emitting an observation x_t]

Temporal dependence is captured by the latents, with observations conditionally independent given them.

Speech recognition: y = underlying phonemes or words; x = acoustic waveform.
Vision: y = object identities, poses, illumination; x = image pixel values.
Industrial monitoring: y = current state of molten steel in caster; x = temperature and pressure sensor readings.

12 Latent-chain models

The joint probability factorizes:

    P(y_{1:T}, x_{1:T}) = P(y_1) P(x_1 | y_1) prod_{t=2}^T P(y_t | y_{t-1}) P(x_t | y_t)

where y_t and x_t are both real-valued vectors, and 1:T denotes 1, ..., T.

Two frequently-used tractable models:

- Linear-Gaussian state-space models
- Hidden Markov models

13 Linear-Gaussian state-space models (SSMs)

In a linear-Gaussian SSM all conditional distributions are linear and Gaussian:

    Output equation:         x_t = C y_t + v_t
    State dynamics equation: y_t = A y_{t-1} + w_t

where v_t and w_t are uncorrelated zero-mean multivariate Gaussian noise vectors. We also assume y_1 is multivariate Gaussian. The joint distribution over all variables x_{1:T}, y_{1:T} is then (one big) multivariate Gaussian. These models are also known as stochastic linear dynamical systems or Kalman filter models.
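As a concrete illustration of this generative process, the following sketch simulates an LGSSM forward in time (all parameter values here are invented for the example, not taken from the lecture):

```python
import numpy as np

# Simulate y_t = A y_{t-1} + w_t, x_t = C y_t + v_t with Gaussian noise.
rng = np.random.default_rng(1)
K, D, T = 2, 3, 500                        # latent dim, obs dim, length
A = np.array([[0.99, -0.1],                # rotation-like latent dynamics
              [0.1,  0.99]])
C = rng.standard_normal((D, K))            # output loading matrix
Q = 0.01 * np.eye(K)                       # innovations covariance
R = 0.1 * np.eye(D)                        # output noise covariance

y = np.zeros((T, K))
x = np.zeros((T, D))
y[0] = rng.multivariate_normal(np.zeros(K), np.eye(K))   # Gaussian y_1
x[0] = C @ y[0] + rng.multivariate_normal(np.zeros(D), R)
for t in range(1, T):
    y[t] = A @ y[t-1] + rng.multivariate_normal(np.zeros(K), Q)
    x[t] = C @ y[t] + rng.multivariate_normal(np.zeros(D), R)
```

Note that K < D here only for convenience of the example; as the next slide notes, SSMs do not require it.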

14 From factor analysis to state space models

    Factor analysis: x_i = sum_{j=1}^K Λ_{ij} y_j + ε_i
    SSM output:      x_{t,i} = sum_{j=1}^K C_{ij} y_{t,j} + v_{t,i}

Interpretation 1: observations are confined near a low-dimensional subspace (as in FA/PCA), but successive observations are generated from correlated points in the latent space. However: FA requires K < D and Ψ diagonal; SSMs may have K ≥ D and arbitrary output noise. (Why?) Thus the ML estimates of the subspace found by FA and by an SSM may differ.

15 Linear dynamical systems

Interpretation 2: a Markov chain with linear dynamics y_t = A y_{t-1}, perturbed by Gaussian innovations noise, which may describe stochasticity, unknown control, or model mismatch. Observations are a linear projection of the dynamical state, with additive iid Gaussian noise. Notes:

- The dynamical process (y_t) may be higher dimensional than the observations (x_t).
- The observations do not form a Markov chain: longer-scale dependence reflects/reveals the latent dynamics.

16 State Space Models with Control Inputs

State space models can be used to model the input-output behaviour of controlled systems. The observed variables are divided into inputs (u_t) and outputs (x_t).

    State dynamics equation: y_t = A y_{t-1} + B u_{t-1} + w_t
    Output equation:         x_t = C y_t + D u_t + v_t

Note that there are many variants, e.g. y_t = A y_{t-1} + B u_t + w_t, or even y_t = A y_{t-1} + B x_{t-1} + w_t.

17 Hidden Markov models

Discrete hidden states s_t ∈ {1, ..., K}; outputs x_t can be discrete or continuous. The joint probability factorizes:

    P(s_{1:T}, x_{1:T}) = P(s_1) P(x_1 | s_1) prod_{t=2}^T P(s_t | s_{t-1}) P(x_t | s_t)

Generative process:

- A first-order Markov chain generates the hidden state sequence (path):
      initial state probs: π_j = P(s_1 = j)
      transition matrix:   Φ_{ij} = P(s_{t+1} = j | s_t = i)
- A set of emission (output) distributions A_j(·) (one per state) converts the state path into a sequence of observations x_t:
      A_j(x) = p(x_t = x | s_t = j)   (for continuous x_t)
      A_{jk} = P(x_t = k | s_t = j)   (for discrete x_t)
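The generative process can be sketched directly; the two-state transition matrix and three-symbol emission table below are illustrative, not from the lecture:

```python
import numpy as np

# Sample a discrete-output HMM: a Markov chain over states s_t,
# then one emission x_t per state.
rng = np.random.default_rng(2)
pi = np.array([1.0, 0.0])              # initial state probs pi_j
Phi = np.array([[0.95, 0.05],          # Phi[i, j] = P(s_{t+1}=j | s_t=i)
                [0.10, 0.90]])
A_em = np.array([[0.8, 0.1, 0.1],      # A_em[j, k] = P(x_t=k | s_t=j)
                 [0.1, 0.1, 0.8]])

T = 200
s = np.zeros(T, dtype=int)
x = np.zeros(T, dtype=int)
s[0] = rng.choice(2, p=pi)
x[0] = rng.choice(3, p=A_em[s[0]])
for t in range(1, T):
    s[t] = rng.choice(2, p=Phi[s[t-1]])   # hidden state path
    x[t] = rng.choice(3, p=A_em[s[t]])    # emission given state
```

The sticky transition matrix produces long runs of each state, visible as runs of the state's preferred symbol in x.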

18 Hidden Markov models

Two interpretations:

- a Markov chain with stochastic measurements, or
- a mixture model with states coupled across time.

[graphical model in both cases: s_1 -> s_2 -> ... -> s_T, each s_t emitting x_t]

Even though the hidden state sequence is first-order Markov, the output process may not be Markov of any order (for example: ).

Discrete-state, discrete-output models can approximate any continuous dynamics and observation mapping, even if nonlinear; however this is usually not practical. HMMs are related to stochastic finite state machines/automata.

19 Input-output hidden Markov models

Hidden Markov models can also be used to model sequential input-output behaviour:

    P(s_{1:T}, x_{1:T} | u_{1:T}) = P(s_1 | u_1) P(x_1 | s_1, u_1) prod_{t=2}^T P(s_t | s_{t-1}, u_{t-1}) P(x_t | s_t, u_t)

IOHMMs can capture arbitrarily complex input-output relationships; however, the number of states required is often impractical.

20 HMMs and SSMs

(Linear-Gaussian) state space models are the continuous-state analogue of hidden Markov models.

A continuous vector state is a very powerful representation: for an HMM to communicate N bits of information about the past, it needs 2^N states, but a real-valued state vector can store an arbitrary number of bits in principle.

Linear-Gaussian output/dynamics are very weak: the types of dynamics linear SSMs can capture are very limited. HMMs, by contrast, can in principle represent arbitrary stochastic dynamics and output mappings.

21 Many Extensions

- Constrained HMMs
- Continuous-state models with discrete outputs, for time series and static data
- Hierarchical models
- Hybrid systems: mixed continuous & discrete states, switching state-space models

22 Richer state representations

[diagram: several parallel state chains A_t, B_t, C_t, D_t, ... evolving over time]

- Factorial HMMs
- Dynamic Bayesian Networks

These are hidden Markov models with many state variables (i.e. a distributed representation of the state). The state can capture many more bits of information about the sequence (linear in the number of state variables).

23-26 Chain models: ML Learning with EM

    SSM: y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R)
    HMM: s_1 ~ π;  s_t | s_{t-1} ~ Φ_{s_{t-1}, ·};  x_t | s_t ~ A_{s_t}

The structure of learning and inference for both models is dictated by the factored structure:

    P(x_1, ..., x_T, y_1, ..., y_T) = P(y_1) prod_{t=2}^T P(y_t | y_{t-1}) prod_{t=1}^T P(x_t | y_t)

Learning (M-step):

    argmax <log P(x_1, ..., x_T, y_1, ..., y_T)>_{q(y_1,...,y_T)}
      = argmax [ <log P(y_1)>_{q(y_1)} + sum_{t=2}^T <log P(y_t | y_{t-1})>_{q(y_t, y_{t-1})} + sum_{t=1}^T <log P(x_t | y_t)>_{q(y_t)} ]

So the expectations needed in the E-step are derived from singleton and pairwise marginals.

27-29 Chain models: Inference

Three general inference problems:

    Filtering:  P(y_t | x_1, ..., x_t)
    Smoothing:  P(y_t | x_1, ..., x_T)   (also P(y_t, y_{t-1} | x_1, ..., x_T), for learning)
    Prediction: P(y_t | x_1, ..., x_{t-τ})   (for some lag τ > 0)

Naively, these marginal posteriors seem to require very large integrals (or sums):

    P(y_t | x_1, ..., x_t) = ∫ dy_1 ... dy_{t-1} P(y_1, ..., y_t | x_1, ..., x_t)

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating or message passing.

30-35 Crawling the HMM state-lattice

Consider an HMM, where we want to find

    P(s_t = k, x_1, ..., x_t) = sum_{k_1,...,k_{t-1}} P(s_1 = k_1, ..., s_{t-1} = k_{t-1}, s_t = k, x_1, ..., x_t)
                              = sum_{k_1,...,k_{t-1}} π_{k_1} A_{k_1}(x_1) Φ_{k_1 k_2} A_{k_2}(x_2) ... Φ_{k_{t-1} k} A_k(x_t)

Naive algorithm:

- start a "bug" at each of the K states at t = 1, holding value 1
- move each bug forward in time: make copies of each bug for each subsequent state, and multiply the value of each copy by (transition prob. × output emission prob.)
- repeat until all bugs have reached time t
- sum up the values of all K^{t-1} bugs that reach state s_t = k (one bug per state path)

Clever recursion: at every step, replace the bugs at each node with a single bug carrying the sum of their values.

36-41 Probability updating: Bayesian filtering

    P(y_t | x_{1:t}) = ∫ P(y_t, y_{t-1} | x_{1:t}) dy_{t-1}
                     = ∫ P(y_t, y_{t-1} | x_t, x_{1:t-1}) dy_{t-1}
                     = ∫ P(x_t, y_t, y_{t-1} | x_{1:t-1}) / P(x_t | x_{1:t-1}) dy_{t-1}
                     = (1 / P(x_t | x_{1:t-1})) ∫ P(x_t | y_t, y_{t-1}, x_{1:t-1}) P(y_t | y_{t-1}, x_{1:t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}
                     = (1 / P(x_t | x_{1:t-1})) ∫ P(x_t | y_t) P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}    [Markov property]

This is a forward recursion based on Bayes' rule.

42-47 The HMM: Forward pass

The forward recursion for the HMM is a form of dynamic programming. Define:

    α_t(i) = P(x_1, ..., x_t, s_t = i | θ)

Then, much like the Bayesian filtering updates, we have:

    α_1(i) = π_i A_i(x_1)
    α_{t+1}(i) = ( sum_{j=1}^K α_t(j) Φ_{ji} ) A_i(x_{t+1})

We've defined α_t(i) to be a joint rather than a posterior. It's easy to obtain the posterior by normalisation:

    P(s_t = i | x_1, ..., x_t, θ) = α_t(i) / sum_k α_t(k)

This form enables us to compute the likelihood for θ = {A, Φ, π} efficiently, in O(TK^2) time:

    P(x_1, ..., x_T | θ) = sum_{s_1,...,s_T} P(x_1, ..., x_T, s_1, ..., s_T | θ) = sum_{k=1}^K α_T(k)

avoiding the exponential number of paths in the naive sum (number of paths = K^T).
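A sketch of the forward recursion, working with per-step-normalised alphas so that long sequences do not underflow; the log-normalisers accumulate into the log-likelihood. The toy two-state, two-symbol parameters are illustrative:

```python
import numpy as np

def hmm_forward(x, pi, Phi, A_em):
    """Forward recursion: alpha_t(i) = P(x_1..x_t, s_t=i).
    Returns per-step-normalised alphas, i.e. P(s_t = i | x_1..x_t),
    and the log-likelihood log P(x_1..x_T) from the normalisers."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * A_em[:, x[0]]
    loglik = 0.0
    for t in range(T):
        if t > 0:
            alpha[t] = (alpha[t-1] @ Phi) * A_em[:, x[t]]
        z = alpha[t].sum()           # = P(x_t | x_1..x_{t-1})
        loglik += np.log(z)
        alpha[t] /= z                # alpha[t] is now the filtering posterior
    return alpha, loglik

# Toy model (illustrative parameters).
pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, loglik = hmm_forward([0, 1, 0], pi, Phi, A_em)
```

For this 3-step sequence the result can be checked against the brute-force sum over all K^T = 8 paths.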

48-57 The LGSSM: Kalman Filtering

Model: y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R).

For the SSM, the sums become integrals. In general, define

    ŷ_t^τ ≡ E[y_t | x_1, ..., x_τ]    and    V̂_t^τ ≡ V[y_t | x_1, ..., x_τ].

Let ŷ_1^0 = μ_0 and V̂_1^0 = Q_0; then (cf. FA)

    P(y_1 | x_1) = N( ŷ_1^0 + K_1 (x_1 - C ŷ_1^0),  V̂_1^0 - K_1 C V̂_1^0 ) =: N( ŷ_1^1, V̂_1^1 )

    with K_1 = V̂_1^0 C^T (C V̂_1^0 C^T + R)^{-1}.

The general recursion alternates a prediction step

    P(y_t | x_{1:t-1}) = ∫ dy_{t-1} P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1})
                       = N( A ŷ_{t-1}^{t-1},  A V̂_{t-1}^{t-1} A^T + Q ) =: N( ŷ_t^{t-1}, V̂_t^{t-1} )

and a measurement-update step

    P(y_t | x_{1:t}) = N( ŷ_t^{t-1} + K_t (x_t - C ŷ_t^{t-1}),  V̂_t^{t-1} - K_t C V̂_t^{t-1} ) =: N( ŷ_t^t, V̂_t^t )

    with Kalman gain K_t = V̂_t^{t-1} C^T (C V̂_t^{t-1} C^T + R)^{-1}   (a <y x^T> <x x^T>^{-1} form).

Compare FA: β = (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} = Λ^T (Λ Λ^T + Ψ)^{-1} (by the matrix inversion lemma); μ = β x_n; Σ = I - β Λ.
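The prediction and measurement-update steps can be sketched as a short filter loop; the toy one-dimensional random-walk model below uses invented parameters:

```python
import numpy as np

def kalman_filter(X, A, C, Q, R, mu0, Q0):
    """Filtering recursions: returns y_hat[t] = E[y_t | x_1..x_t]
    and V_hat[t] = V[y_t | x_1..x_t]."""
    T, _ = X.shape
    K = len(mu0)
    y_hat = np.zeros((T, K))
    V_hat = np.zeros((T, K, K))
    y_pred, V_pred = mu0, Q0                     # predictive moments at t=1
    for t in range(T):
        if t > 0:
            y_pred = A @ y_hat[t-1]              # prediction step
            V_pred = A @ V_hat[t-1] @ A.T + Q
        S = C @ V_pred @ C.T + R                 # innovation covariance
        Kt = V_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
        y_hat[t] = y_pred + Kt @ (X[t] - C @ y_pred)   # measurement update
        V_hat[t] = V_pred - Kt @ C @ V_pred
    return y_hat, V_hat

# Toy 1D example: a slowly drifting latent scalar under heavy output noise.
rng = np.random.default_rng(3)
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
y_true = np.cumsum(rng.normal(0, 0.1, 100))
X = (y_true + rng.normal(0, 1.0, 100))[:, None]
y_hat, V_hat = kalman_filter(X, A, C, Q, R, np.zeros(1), np.eye(1))
```

With Q much smaller than R, the filter averages over many observations and tracks y_true far better than the raw observations do.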

58-60 The marginal posterior: Bayesian smoothing

    P(y_t | x_{1:T}) = P(y_t, x_{t+1:T} | x_{1:t}) / P(x_{t+1:T} | x_{1:t})
                     = P(x_{t+1:T} | y_t) P(y_t | x_{1:t}) / P(x_{t+1:T} | x_{1:t})

The marginal combines a backward message with the forward message found by filtering.

61 The HMM: Forward-Backward Algorithm

State estimation: compute the marginal posterior distribution over the state at time t:

    γ_t(i) ≡ P(s_t = i | x_{1:T}) = P(s_t = i, x_{1:t}) P(x_{t+1:T} | s_t = i) / P(x_{1:T})
           = α_t(i) β_t(i) / sum_j α_t(j) β_t(j)

where there is a simple backward recursion for

    β_t(i) ≡ P(x_{t+1:T} | s_t = i) = sum_{j=1}^K P(s_{t+1} = j, x_{t+1}, x_{t+2:T} | s_t = i)
           = sum_{j=1}^K P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) P(x_{t+2:T} | s_{t+1} = j)
           = sum_{j=1}^K Φ_{ij} A_j(x_{t+1}) β_{t+1}(j)

α_t(i) gives the total inflow of probability to node (t, i); β_t(i) gives the total outflow of probability.

Bugs again: the bugs run forward from time 0 to t and backward from time T to t.
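A sketch of both passes, using locally normalised alphas and correspondingly rescaled betas (the scale factors cancel in γ). The toy model parameters are illustrative:

```python
import numpy as np

def forward_backward(x, pi, Phi, A_em):
    """Compute gamma_t(i) = P(s_t = i | x_1..x_T) via alpha/beta recursions.
    Alphas are normalised at each step for numerical stability; betas are
    divided by the same constants, which cancel in gamma."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * A_em[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ Phi) * A_em[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j Phi[i, j] * A_em[j, x[t+1]] * beta_{t+1}(j)
        beta[t] = (Phi @ (A_em[:, x[t+1]] * beta[t+1])) / c[t+1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
gamma = forward_backward([0, 1, 1, 0], pi, Phi, A_em)
```

On a length-4 sequence the result can be verified against the brute-force posterior over all 16 paths.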

62-68 Viterbi decoding

The numbers γ_t(i) computed by forward-backward give the marginal posterior distribution over states at each time. By choosing the state i*_t with the largest γ_t(i) at each time, we can make a "best" state path: the path with the maximum expected number of correct states.

But this is not the single path with the highest probability of generating the data. In fact it may be a path of probability zero!

To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman's dynamic programming algorithm applied to this problem. This is an inference algorithm which computes the most probable state sequence:

    argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}, θ)

The recursions look the same as forward-backward, except with max instead of sum.

Bugs once more: the same trick, except at each step kill all bugs but the one with the highest value at the node.

There is also a modified EM training based on the Viterbi decoder (assignment).
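A sketch of the max-product recursion, done in log space with backtracking pointers; the toy model parameters are illustrative:

```python
import numpy as np

def viterbi(x, pi, Phi, A_em):
    """delta_t(i) = max over paths ending in state i of
    log P(s_1..s_t, x_1..x_t); backtrack to recover the argmax path."""
    T, K = len(x), len(pi)
    logPhi = np.log(Phi)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(pi) + np.log(A_em[:, x[0]])
    for t in range(1, T):
        scores = delta[t-1][:, None] + logPhi     # scores[j, i]: from j to i
        back[t] = scores.argmax(axis=0)           # best predecessor per state
        delta[t] = scores.max(axis=0) + np.log(A_em[:, x[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                # follow pointers backwards
        path[t] = back[t+1, path[t+1]]
    return path

pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 0, 1, 1, 0], pi, Phi, A_em)
```

Replacing the max/argmax pair with a sum recovers the forward pass, which is the "max instead of sum" remark made concrete.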

69 The LGSSM: Kalman smoothing

We use a slightly different decomposition:

    P(y_t | x_{1:T}) = ∫ P(y_t, y_{t+1} | x_{1:T}) dy_{t+1}
                     = ∫ P(y_t | y_{t+1}, x_{1:T}) P(y_{t+1} | x_{1:T}) dy_{t+1}
                     = ∫ P(y_t | y_{t+1}, x_{1:t}) P(y_{t+1} | x_{1:T}) dy_{t+1}    [Markov property]

This gives the additional backward recursion:

    J_t = V̂_t^t A^T (V̂_{t+1}^t)^{-1}
    ŷ_t^T = ŷ_t^t + J_t (ŷ_{t+1}^T - A ŷ_t^t)
    V̂_t^T = V̂_t^t + J_t (V̂_{t+1}^T - V̂_{t+1}^t) J_t^T
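The filter pass and this backward recursion can be combined into a single routine; a minimal sketch on invented one-dimensional parameters:

```python
import numpy as np

def kalman_smoother(X, A, C, Q, R, mu0, Q0):
    """Forward (filter) pass, then the backward recursion above.
    Returns smoothed moments E[y_t | x_1..x_T] and V[y_t | x_1..x_T]."""
    T = len(X); K = len(mu0)
    yf = np.zeros((T, K)); Vf = np.zeros((T, K, K))   # filtered moments
    yp = np.zeros((T, K)); Vp = np.zeros((T, K, K))   # one-step predictions
    yp[0], Vp[0] = mu0, Q0
    for t in range(T):
        if t > 0:
            yp[t] = A @ yf[t-1]
            Vp[t] = A @ Vf[t-1] @ A.T + Q
        G = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)  # gain
        yf[t] = yp[t] + G @ (X[t] - C @ yp[t])
        Vf[t] = Vp[t] - G @ C @ Vp[t]
    ys = yf.copy(); Vs = Vf.copy()                    # smoothed moments
    for t in range(T - 2, -1, -1):
        J = Vf[t] @ A.T @ np.linalg.inv(Vp[t+1])      # J_t
        ys[t] = yf[t] + J @ (ys[t+1] - A @ yf[t])
        Vs[t] = Vf[t] + J @ (Vs[t+1] - Vp[t+1]) @ J.T
    return ys, Vs

rng = np.random.default_rng(4)
X = np.cumsum(rng.normal(0, 0.3, 50))[:, None]   # toy 1D observations
ys, Vs = kalman_smoother(X, np.eye(1), np.eye(1),
                         0.1 * np.eye(1), 0.5 * np.eye(1),
                         np.zeros(1), np.eye(1))
```

Unlike the filter, the smoothed estimate at each t conditions on the whole sequence, so its posterior variance is never larger than the filtered one.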

70 ML Learning for SSMs using batch EM

Parameters: θ = {μ_0, Q_0, A, Q, C, R}.

Free energy:

    F(q, θ) = ∫ dy_{1:T} q(y_{1:T}) ( log P(x_{1:T}, y_{1:T} | θ) - log q(y_{1:T}) )

E-step: maximise F w.r.t. q with θ fixed: q*(y) = p(y | x, θ). This can be achieved with a two-state extension of the Kalman smoother.

M-step: maximise F w.r.t. θ with q fixed. This boils down to solving a few weighted least-squares problems, since all the variables in

    p(y, x | θ) = p(y_1) p(x_1 | y_1) prod_{t=2}^T p(y_t | y_{t-1}) p(x_t | y_t)

form a multivariate Gaussian.

71-77 The M step for C

    p(x_t | y_t) ∝ exp[ -1/2 (x_t - C y_t)^T R^{-1} (x_t - C y_t) ]

    C_new = argmax_C sum_t <ln p(x_t | y_t)>_q
          = argmax_C -1/2 sum_t <(x_t - C y_t)^T R^{-1} (x_t - C y_t)>_q + const
          = argmax_C -1/2 sum_t { x_t^T R^{-1} x_t - 2 x_t^T R^{-1} C <y_t> + <y_t^T C^T R^{-1} C y_t> }
          = argmax_C { Tr[ C sum_t <y_t> x_t^T R^{-1} ] - 1/2 Tr[ C^T R^{-1} C sum_t <y_t y_t^T> ] }

Using ∂Tr[AB]/∂A = B^T, we set the derivative to zero:

    0 = R^{-1} sum_t x_t <y_t>^T - R^{-1} C_new sum_t <y_t y_t^T>

    C_new = ( sum_t x_t <y_t>^T ) ( sum_t <y_t y_t^T> )^{-1}

Notice that this is exactly the same equation as in factor analysis and linear regression!
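The closed-form update can be checked numerically. In this sketch the E-step moments are stand-ins: exact latents are used for <y_t>, as if the posterior were concentrated, purely to exercise the formula (all values are invented):

```python
import numpy as np

# C_new = (sum_t x_t <y_t>^T)(sum_t <y_t y_t^T>)^{-1}
rng = np.random.default_rng(5)
T, K, D = 2000, 2, 4
C_true = rng.standard_normal((D, K))
Y = rng.standard_normal((T, K))                  # stand-in E-step means <y_t>
X = Y @ C_true.T + 0.01 * rng.standard_normal((T, D))  # near-noiseless outputs

Eyy = sum(np.outer(y, y) for y in Y)             # sum_t <y_t y_t^T>
Exy = sum(np.outer(x, y) for x, y in zip(X, Y))  # sum_t x_t <y_t>^T
C_new = Exy @ np.linalg.inv(Eyy)
```

With low output noise and many time steps the update recovers C almost exactly, which is the least-squares/linear-regression analogy in action.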

84 The M step for A

p(y_{t+1} | y_t) ∝ exp[ -½ (y_{t+1} - A y_t)^T Q^{-1} (y_{t+1} - A y_t) ]

A^new = argmax_A Σ_t ⟨ln p(y_{t+1} | y_t)⟩_q
      = argmax_A -½ Σ_t ⟨(y_{t+1} - A y_t)^T Q^{-1} (y_{t+1} - A y_t)⟩_q + const
      = argmax_A -½ Σ_t { ⟨y_{t+1}^T Q^{-1} y_{t+1}⟩ - 2 ⟨y_{t+1}^T Q^{-1} A y_t⟩ + ⟨y_t^T A^T Q^{-1} A y_t⟩ }
      = argmax_A Tr[ A (Σ_t ⟨y_t y_{t+1}^T⟩) Q^{-1} ] - ½ Tr[ A^T Q^{-1} A (Σ_t ⟨y_t y_t^T⟩) ]

Using ∂Tr[AB]/∂A = B^T, setting the gradient to zero gives

0 = Q^{-1} (Σ_t ⟨y_{t+1} y_t^T⟩) - Q^{-1} A (Σ_t ⟨y_t y_t^T⟩)

⇒ A^new = ( Σ_t ⟨y_{t+1} y_t^T⟩ ) ( Σ_t ⟨y_t y_t^T⟩ )^{-1}

This is still analogous to factor analysis and linear regression, but with expected correlations.
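The A update has the same normal-equation form, now built from lagged latent statistics. A sketch with stand-in smoothed means (toy dynamics; posterior covariances again omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 300, 2

# Stand-in smoothed means from an AR(1)-like latent process
y = np.zeros((T, K))
A_true = np.array([[0.9, 0.1], [0.0, 0.7]])   # toy dynamics matrix
for t in range(1, T):
    y[t] = y[t - 1] @ A_true.T + 0.1 * rng.standard_normal(K)

S1 = sum(np.outer(y[t + 1], y[t]) for t in range(T - 1))   # sum_t <y_{t+1} y_t^T>
S0 = sum(np.outer(y[t], y[t]) for t in range(T - 1))       # sum_t <y_t y_t^T>

# M step: A_new = (sum_t <y_{t+1} y_t^T>)(sum_t <y_t y_t^T>)^{-1}
A_new = np.linalg.solve(S0.T, S1.T).T

assert np.allclose(A_new @ S0, S1)            # normal equations hold
```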

85 Learning (online gradient) Time series data must often be processed in real time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates. Consider the log likelihood contributed by each data point (l_t):

ℓ = Σ_{t=1}^T ln p(x_t | x_1, ..., x_{t-1}) = Σ_{t=1}^T l_t

l_t = -(D/2) ln 2π - ½ ln|Σ_t| - ½ (x_t - C ŷ_t^{t-1})^T Σ_t^{-1} (x_t - C ŷ_t^{t-1})

where D is the dimension of x, and:

ŷ_t^{t-1} = A ŷ_{t-1}^{t-1}        V̂_t^{t-1} = A V̂_{t-1}^{t-1} A^T + Q        Σ_t = C V̂_t^{t-1} C^T + R

We differentiate l_t to obtain gradient rules for A, C, Q, R. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
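The decomposition ℓ = Σ_t l_t can be verified numerically: for a scalar LGSSM the accumulated predictive terms must equal the joint Gaussian log density of the whole sequence. A sketch with toy parameter values (a, q, c, r and the data are all made up):

```python
import numpy as np

a, q, c, r = 0.8, 0.5, 1.2, 0.3           # toy scalar LGSSM parameters
Pi1 = q / (1 - a**2)                       # stationary prior variance for y_1
x = np.array([0.4, -1.1, 0.7])             # short observed sequence
T = len(x)

# Kalman filter, accumulating l_t = log p(x_t | x_{1:t-1})
yhat, V = 0.0, Pi1                         # predictive mean/variance for y_1
ll = 0.0
for t in range(T):
    S = c**2 * V + r                       # Sigma_t = C Vhat C^T + R
    e = x[t] - c * yhat                    # innovation
    ll += -0.5 * np.log(2 * np.pi * S) - 0.5 * e**2 / S
    K = V * c / S                          # Kalman gain
    yhat, V = yhat + K * e, (1 - K * c) * V      # measurement update
    yhat, V = a * yhat, a**2 * V + q             # predict step for t+1

# Joint covariance of x_{1:T}: Cov(x_s, x_t) = c^2 a^{|s-t|} Pi1 + r [s == t]
S_joint = np.array([[c**2 * a**abs(s - t) * Pi1 + (r if s == t else 0.0)
                     for t in range(T)] for s in range(T)])
ll_joint = (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(S_joint)[1]
            - 0.5 * x @ np.linalg.solve(S_joint, x))
assert np.isclose(ll, ll_joint)            # sum_t l_t = log p(x_{1:T})
```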

86 Learning HMMs using EM

[Figure: Markov chain s_1 → s_2 → s_3 → ... → s_T with emissions x_1, x_2, x_3, ..., x_T]

Parameters: θ = {π, Φ, A}

Free energy:

F(q, θ) = Σ_{s_{1:T}} q(s_{1:T}) ( log P(x_{1:T}, s_{1:T} | θ) - log q(s_{1:T}) )

E step: Maximise F w.r.t. q with θ fixed: q*(s_{1:T}) = P(s_{1:T} | x_{1:T}, θ). We will only need the marginal probabilities q(s_t, s_{t+1}), which can also be obtained from the forward-backward algorithm.

M step: Maximise F w.r.t. θ with q fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state i, emitted symbol k and transitioned to state j.

This is the Baum-Welch algorithm and it predates the (more general) EM algorithm.

90 M step: Parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of F w.r.t. θ.

The initial state distribution is the expected number of times in state i at t = 1:

π̂_i = γ_1(i)

The expected number of transitions from state i to j which begin at time t is:

ξ_t(i→j) ≡ P(s_t = i, s_{t+1} = j | x_{1:T}) = α_t(i) Φ_ij A_j(x_{t+1}) β_{t+1}(j) / P(x_{1:T})

so the estimated transition probabilities are:

Φ̂_ij = Σ_{t=1}^{T-1} ξ_t(i→j) / Σ_{t=1}^{T-1} γ_t(i)

The output distributions are the expected number of times we observe a particular symbol in a particular state:

Â_ik = Σ_{t: x_t = k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

(or the state-probability-weighted mean and variance for a Gaussian output model).

93 HMM practicalities

Numerical scaling: the conventional message definition is in terms of a large joint: α_t(i) = P(x_{1:t}, s_t = i) → 0 as t grows, and so can easily underflow. Rescale:

α̃_t(i) = A_i(x_t) Σ_j α̂_{t-1}(j) Φ_ji        ρ_t = Σ_{i=1}^K α̃_t(i)        α̂_t(i) = α̃_t(i) / ρ_t

Exercise: show that

ρ_t = P(x_t | x_{1:t-1}, θ)        Π_{t=1}^T ρ_t = P(x_{1:T} | θ)

What does this make α̂_t(i)?

Multiple observed sequences: average numerators and denominators in the ratios of updates.

Local optima (random restarts, annealing; see discussion later).

94 HMM pseudocode: inference (E step)

Forward-backward including scaling tricks. [∘ is the element-by-element (Hadamard/Schur) product: `.*` in Matlab.]

for t = 1:T, i = 1:K
    p_t(i) = A_i(x_t)
α_1 = π ∘ p_1 ;  ρ_1 = Σ_{i=1}^K α_1(i) ;  α_1 = α_1 / ρ_1
for t = 2:T
    α_t = (Φ^T α_{t-1}) ∘ p_t ;  ρ_t = Σ_{i=1}^K α_t(i) ;  α_t = α_t / ρ_t
β_T = 1
for t = T-1:1
    β_t = Φ (β_{t+1} ∘ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = Σ_{t=1}^T log(ρ_t)
for t = 1:T
    γ_t = α_t ∘ β_t
for t = 1:T-1
    ξ_t = Φ ∘ (α_t (β_{t+1} ∘ p_{t+1})^T) / ρ_{t+1}
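The pseudocode above translates almost line-for-line into numpy. A sketch on a toy discrete-output HMM (the parameter values are made up), checked against brute-force enumeration of all state paths:

```python
import numpy as np
from itertools import product

pi = np.array([0.6, 0.4])                     # initial state distribution
Phi = np.array([[0.7, 0.3], [0.2, 0.8]])      # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.5, 0.5], [0.9, 0.1]])        # A[i, k] = P(x_t=k | s_t=i)
x = [0, 1, 1, 0, 1]
T, K = len(x), len(pi)

p = A[:, x].T                                  # p[t, i] = A_i(x_t)

# Scaled forward pass
alpha, rho = np.zeros((T, K)), np.zeros(T)
alpha[0] = pi * p[0]; rho[0] = alpha[0].sum(); alpha[0] /= rho[0]
for t in range(1, T):
    alpha[t] = (Phi.T @ alpha[t - 1]) * p[t]
    rho[t] = alpha[t].sum(); alpha[t] /= rho[t]

# Scaled backward pass
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = Phi @ (beta[t + 1] * p[t + 1]) / rho[t + 1]

gamma = alpha * beta                           # P(s_t = i | x_{1:T})
logP = np.log(rho).sum()                       # log P(x_{1:T})

# Brute force: sum over all K^T state sequences
total = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * p[0, s[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * p[t, s[t]]
    total += pr
assert np.isclose(logP, np.log(total))
assert np.allclose(gamma.sum(axis=1), 1.0)
```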

95 HMM pseudocode: parameter re-estimation (M step)

Baum-Welch parameter updates: for each sequence l = 1:L, run forward-backward to get γ^(l) and ξ^(l), then

π̂_i = (1/L) Σ_{l=1}^L γ_1^(l)(i)

Φ̂_ij = Σ_{l=1}^L Σ_{t=1}^{T^(l)-1} ξ_t^(l)(ij)  /  Σ_{l=1}^L Σ_{t=1}^{T^(l)-1} γ_t^(l)(i)

Â_ik = Σ_{l=1}^L Σ_{t=1}^{T^(l)} δ(x_t = k) γ_t^(l)(i)  /  Σ_{l=1}^L Σ_{t=1}^{T^(l)} γ_t^(l)(i)
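The ratio-of-counts updates are easy to sanity-check: because Σ_j ξ_t(i,j) = γ_t(i), the re-estimated π, Φ and A are automatically normalised. A sketch using exact posteriors from brute-force enumeration on a toy HMM (all parameter values made up; for a single short sequence, so the L sums collapse):

```python
import numpy as np
from itertools import product

pi = np.array([0.5, 0.5])
Phi = np.array([[0.6, 0.4], [0.3, 0.7]])   # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.8, 0.2], [0.1, 0.9]])     # A[i, k] = P(x_t=k | s_t=i)
x = [0, 0, 1, 0]
T, K = len(x), len(pi)

# Exact posteriors gamma_t(i), xi_t(i, j) by enumeration (fine for tiny T)
gamma = np.zeros((T, K)); xi = np.zeros((T - 1, K, K)); Z = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * A[s[t], x[t]]
    Z += pr
    for t in range(T):
        gamma[t, s[t]] += pr
    for t in range(T - 1):
        xi[t, s[t], s[t + 1]] += pr
gamma /= Z; xi /= Z

# Baum-Welch M step: ratios of expected counts
pi_new = gamma[0]
Phi_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
A_new = np.zeros_like(A)
for k in range(A.shape[1]):
    A_new[:, k] = sum(gamma[t] for t in range(T) if x[t] == k)
A_new /= gamma.sum(0)[:, None]

assert np.isclose(pi_new.sum(), 1.0)
assert np.allclose(Phi_new.sum(1), 1.0)    # each row a valid transition distribution
assert np.allclose(A_new.sum(1), 1.0)      # each row a valid output distribution
```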

103 Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of y:

P(y) = N(0, I)            & ỹ = Uy    ⇒  P(ỹ) = N(U0, U I U^T) = N(0, I)
P(x | y) = N(Λy, Ψ)         Λ̃ = ΛU^T   ⇒  P(x | ỹ) = N(Λ U^T U y, Ψ) = N(Λ̃ ỹ, Ψ)

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

P(y_{t+1} | y_t) = N(A y_t, Q)     & ỹ = Gy    Ã = G A G^{-1}    Q̃ = G Q G^T    C̃ = C G^{-1}
P(x_t | y_t) = N(C y_t, R)

⇒  P(ỹ_{t+1} | ỹ_t) = N(G A G^{-1} G y_t, G Q G^T) = N(Ã ỹ_t, Q̃)
   P(x_t | ỹ_t) = N(C G^{-1} G y_t, R) = N(C̃ ỹ_t, R)

and the HMM is invariant to permutations (and to relaxations into something called an observable operator model).
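The LGSSM degeneracy can be verified numerically: the transformed model (Ã, Q̃, C̃) reproduces the stationary observation moments of the original exactly. A sketch with toy parameter values (A, C, G all made up):

```python
import numpy as np

rng = np.random.default_rng(2)
K, D = 2, 3

A = np.array([[0.6, 0.2], [-0.1, 0.5]])   # stable toy dynamics
Q = np.eye(K)
C = rng.standard_normal((D, K))
R = 0.1 * np.eye(D)

# Stationary latent covariance: Pi = A Pi A^T + Q (fixed-point iteration)
Pi = np.eye(K)
for _ in range(500):
    Pi = A @ Pi @ A.T + Q

def x_moments(A, C, R, Pi):
    """Stationary observation moments: Cov(x_t) and Cov(x_{t+1}, x_t)."""
    return C @ Pi @ C.T + R, C @ A @ Pi @ C.T

G = np.array([[2.0, 1.0], [0.5, 1.5]])     # an arbitrary invertible transform
Gi = np.linalg.inv(G)
At, Qt, Ct = G @ A @ Gi, G @ Q @ G.T, C @ Gi
Pit = G @ Pi @ G.T                         # stationary covariance of y~ = G y

# The transformed model produces identical observation statistics
for m, mt in zip(x_moments(A, C, R, Pi), x_moments(At, Ct, R, Pit)):
    assert np.allclose(m, mt)
```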

109 The likelihood landscape

Besides the degenerate solutions, both LGSSM and HMM likelihood functions tend to have multiple local maxima. This complicates any ML-based approach to learning (including EM and gradient methods).

Strategies:

- Restart EM/gradient ascent from different (often random) initial parameter values.
- Initialise output parameters with a stationary model: PCA → FA → LGSSM; k-means → mixture → HMM.
- Stochastic gradient methods and momentum. Overstepping EM.
- Deterministic annealing.
- Non-ML learning (spectral methods).

114 Slow Feature Analysis (SFA)

Is there a zero-noise limit for an LGSSM analogous to PCA? We can take:

A = diag[a_1 ... a_K] → I   (with a_1 ≥ a_2 ≥ ⋯ ≥ a_K)
Q = I - A A^T → 0   (setting the stationary latent covariance to I)
R → 0.

This limit yields a spectral algorithm called Slow Feature Analysis (or SFA).

SFA is conventionally defined as slowness pursuit (cf. PCA and variance pursuit): given a zero-mean series {x_1, x_2, ..., x_T}, find a K×D matrix W (= C^T) such that y_t = W x_t changes as slowly as possible. Specifically, find

W = argmin_W Σ_t ||y_t - y_{t-1}||²   subject to   Σ_t y_t y_t^T = I.

The variance constraint prevents the trivial solutions W = 0 and W = 1w^T.

W can be found by solving the generalised eigenvalue problem W A = Ω W B, where A = Σ_t (x_t - x_{t-1})(x_t - x_{t-1})^T and B = Σ_t x_t x_t^T. See maneesh/papers/turner-sahani-2007-sfa.pdf.
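The generalised eigenproblem can be solved by whitening with B and taking an ordinary symmetric eigendecomposition. A sketch on a toy mixture of a slow and a fast sinusoid (sources and mixing are made up; the slowest extracted feature should recover the slow source up to sign):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2000
t = np.arange(T)
slow = np.sin(2 * np.pi * t / 400)         # slowly varying source
fast = np.sin(2 * np.pi * t / 11)          # quickly varying source
S = np.stack([slow, fast], axis=1)
X = S @ rng.standard_normal((2, 2))        # mixed, zero-mean observations

# A = sum_t dx dx^T,  B = sum_t x x^T
dX = np.diff(X, axis=0)
Amat = dX.T @ dX
Bmat = X.T @ X

# Whiten: B = L L^T, then solve the symmetric problem in the whitened space
L = np.linalg.cholesky(Bmat)
Li = np.linalg.inv(L)
evals, evecs = np.linalg.eigh(Li @ Amat @ Li.T)
W = (Li.T @ evecs).T                       # rows ordered slowest first (eigh sorts ascending)

y = X @ W[0]                               # slowest extracted feature
corr = np.corrcoef(y, slow)[0, 1]
assert abs(corr) > 0.95                    # recovers the slow source up to sign
```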

119 (Generalised) Method of Moments estimators

What if we don't want the A → I limit? The limit is necessary to make the spectral estimate and the (limiting) ML estimates coincide. However, if we give up on ML, then there are general spectral estimates available.

The ML parameter estimates are defined:

θ_ML = argmax_θ log P(X | θ)

and as you found, if P ∈ ExpFam with sufficient statistic T then

⟨T(x)⟩_{θ_ML} = (1/N) Σ_i T(x_i)   (moment matching).

We can generalise this condition to arbitrary T even if P ∉ ExpFam. That is, we find the estimate θ* with

⟨T(x)⟩_{θ*} = (1/N) Σ_i T(x_i)   or   θ* = argmin_θ || ⟨T(x)⟩_θ - (1/N) Σ_i T(x_i) ||_C

Judicious choice of T and metric C might make the solution unique (no local optima) and consistent (correct given infinite within-model data).
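As a classical illustration of moment matching (not from the slides): for a Gamma(k, θ) model, matching T(x) = (x, x²) gives closed-form estimates with no likelihood optimisation, since the mean is kθ and the variance kθ²:

```python
import numpy as np

rng = np.random.default_rng(4)
k_true, theta_true = 3.0, 2.0              # toy shape/scale parameters
x = rng.gamma(k_true, theta_true, size=200_000)

# Match T(x) = (x, x^2): mean = k*theta, variance = k*theta^2
m, v = x.mean(), x.var()
theta_hat = v / m                          # scale estimate
k_hat = m * m / v                          # shape estimate

assert abs(k_hat - k_true) < 0.1
assert abs(theta_hat - theta_true) < 0.1
```

Unlike the ML estimate for the Gamma shape (which needs an iterative solve), this moment-matching estimator is unique and consistent, at some cost in efficiency.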

130 Ho-Kalman SSID for LGSSMs

[Figure: chain y_{t-2} → y_{t-1} → y_t → y_{t+1} → y_{t+2} with emissions x_{t-2}, ..., x_{t+2}]

Define the stacked vectors x_t^- = [x_{t-1:t-L}] and x_t^+ = [x_{t:t+L-1}], and the lagged moments

M_τ ≡ ⟨x_{t+τ} x_t^T⟩ = ⟨x_{t+τ} (C y_t + η)^T⟩ = ⟨x_{t+τ} y_t^T⟩ C^T = ⟨(C A^τ y_t + η) y_t^T⟩ C^T = C A^τ ⟨y_t y_t^T⟩ C^T = C A^τ Π C^T

Then the Hankel matrix of cross-moments factorises:

H ≡ ⟨x_t^+ (x_t^-)^T⟩ = [ M_1  M_2  ⋯  M_L
                          M_2  M_3      ⋮
                          ⋮
                          M_L  ⋯   M_{2L-1} ]
                      = [ C ; C A ; ⋯ ; C A^{L-1} ] [ A Π C^T  ⋯  A^L Π C^T ]
                      =  Ξ  ·  Υ

with H of size LD × LD, Ξ of size LD × K, and Υ of size K × LD.

- Off-diagonal correlation is unaffected by noise.
- SVD( (1/T) Σ_t x_t^+ (x_t^-)^T ) yields least-squares estimates of Ξ and Υ.
- Regression between blocks of Ξ yields Â and Ĉ.
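The whole pipeline can be exercised on exact moments: build H from M_τ = C A^τ Π C^T, truncate the SVD at rank K, and regress between shifted blocks of Ξ. The recovered Â is a similarity transform of A, so its eigenvalues match exactly. A sketch with toy matrices (dynamics eigenvalues fixed at 0.9 and 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)
K, D, L = 2, 2, 3

# Toy LGSSM with known eigenvalues, hidden behind a similarity transform
R_ = rng.standard_normal((K, K)) + 2 * np.eye(K)
A = R_ @ np.diag([0.9, 0.5]) @ np.linalg.inv(R_)
C = rng.standard_normal((D, K))
Q = np.eye(K)

# Stationary latent covariance: Pi = A Pi A^T + Q
Pi = np.eye(K)
for _ in range(500):
    Pi = A @ Pi @ A.T + Q

# Exact lagged moments M_tau = C A^tau Pi C^T, tau = 1..2L-1
M = [C @ np.linalg.matrix_power(A, tau) @ Pi @ C.T for tau in range(1, 2 * L)]

# Hankel matrix: block (i, j) holds M_{i+j+1}
H = np.block([[M[i + j] for j in range(L)] for i in range(L)])

# Rank-K SVD gives Xi (the extended observability matrix) up to similarity
U, s, Vt = np.linalg.svd(H)
Xi = U[:, :K] * np.sqrt(s[:K])

# Shift-invariance: Xi[:-D] @ A = Xi[D:], so regression recovers A up to similarity
A_hat = np.linalg.lstsq(Xi[:-D], Xi[D:], rcond=None)[0]

# Eigenvalues are similarity invariants
assert np.allclose(np.sort(np.linalg.eigvals(A_hat).real), [0.5, 0.9])
```

With finite data, H is replaced by the empirical average (1/T) Σ_t x_t^+ (x_t^-)^T, and the same steps give consistent (if not maximum-likelihood) estimates.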

133 HMMs → OOMs

[Figure: chain s_1 → s_2 → s_3 → ... → s_T with emissions x_1, x_2, x_3, ..., x_T]

Now consider an HMM with discrete output symbols. The likelihood can be written:

P(x_{1:T} | π, Φ, A) = Σ_i π_i A_i(x_1) Σ_j Φ_ji A_j(x_2) Σ_k Φ_kj A_k(x_3) ⋯
                     = π^T A_{x_1} Φ^T A_{x_2} Φ^T A_{x_3} ⋯        [A_{x_1} = diag[A_1(x_1), A_2(x_1), ...]]
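This product-of-operators form of the likelihood is easy to check numerically. A sketch using the row-stochastic convention Φ[i, j] = P(s_{t+1}=j | s_t=i) from the forward-backward pseudocode, under which the product reads π^T A_{x_1} Φ A_{x_2} ⋯ A_{x_T} 1 (toy parameters, verified against brute-force enumeration):

```python
import numpy as np
from itertools import product

pi = np.array([0.3, 0.7])
Phi = np.array([[0.9, 0.1], [0.4, 0.6]])   # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.6, 0.4], [0.2, 0.8]])     # A[i, k] = P(x_t=k | s_t=i)
x = [1, 0, 0, 1]
K, T = len(pi), len(x)

def Ax(k):
    """Observation operator: diagonal matrix of output probabilities for symbol k."""
    return np.diag(A[:, k])

# Operator product: P(x_{1:T}) = pi^T A_{x1} Phi A_{x2} ... Phi A_{xT} 1
v = pi @ Ax(x[0])
for k in x[1:]:
    v = v @ Phi @ Ax(k)
P_oom = v.sum()

# Brute-force check: sum over all state sequences
P_bf = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * A[s[t], x[t]]
    P_bf += pr
assert np.isclose(P_oom, P_bf)
```

The OOM relaxation keeps this operator product but drops the requirement that the operators come from nonnegative transition and emission probabilities.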


More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Lecture 11: Hidden Markov Models

Lecture 11: Hidden Markov Models Lecture 11: Hidden Markov Models Cognitive Systems - Machine Learning Cognitive Systems, Applied Computer Science, Bamberg University slides by Dr. Philip Jackson Centre for Vision, Speech & Signal Processing

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Lecture 6: April 19, 2002

Lecture 6: April 19, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models CI/CI(CS) UE, SS 2015 Christian Knoll Signal Processing and Speech Communication Laboratory Graz University of Technology June 23, 2015 CI/CI(CS) SS 2015 June 23, 2015 Slide 1/26 Content

More information

Chapter 3 - Temporal processes

Chapter 3 - Temporal processes STK4150 - Intro 1 Chapter 3 - Temporal processes Odd Kolbjørnsen and Geir Storvik January 23 2017 STK4150 - Intro 2 Temporal processes Data collected over time Past, present, future, change Temporal aspect

More information

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden

More information

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

More information

Graphical Models Seminar

Graphical Models Seminar Graphical Models Seminar Forward-Backward and Viterbi Algorithm for HMMs Bishop, PRML, Chapters 13.2.2, 13.2.3, 13.2.5 Dinu Kaufmann Departement Mathematik und Informatik Universität Basel April 8, 2013

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David

More information

1 What is a hidden Markov model?

1 What is a hidden Markov model? 1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1 Hidden Markov Models AIMA Chapter 15, Sections 1 5 AIMA Chapter 15, Sections 1 5 1 Consider a target tracking problem Time and uncertainty X t = set of unobservable state variables at time t e.g., Position

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017 1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow

More information

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter Recall: Modeling Time Series State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore

Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore Supervised Learning Hidden Markov Models Some of these slides were inspired by the tutorials of Andrew Moore A Markov System S 2 Has N states, called s 1, s 2.. s N There are discrete timesteps, t=0, t=1,.

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Towards a Bayesian model for Cyber Security

Towards a Bayesian model for Cyber Security Towards a Bayesian model for Cyber Security Mark Briers (mbriers@turing.ac.uk) Joint work with Henry Clausen and Prof. Niall Adams (Imperial College London) 27 September 2017 The Alan Turing Institute

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Hidden Markov Models,99,100! Markov, here I come!

Hidden Markov Models,99,100! Markov, here I come! Hidden Markov Models,99,100! Markov, here I come! 16.410/413 Principles of Autonomy and Decision-Making Pedro Santana (psantana@mit.edu) October 7 th, 2015. Based on material by Brian Williams and Emilio

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information