Probabilistic & Unsupervised Learning. Latent Variable Models for Time Series


1 Probabilistic & Unsupervised Learning: Latent Variable Models for Time Series. Maneesh Sahani, Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London. Term 1, Autumn 2016.

2-7 Modeling time series

Thus far, our data have been (marginally) iid. Now consider a sequence of observations

    x_1, x_2, x_3, ..., x_t

that are not independent. Examples:

- Sequences of images
- Stock prices
- Recorded speech or music, English sentences
- Kinematic variables in a robot
- Sensor readings from an industrial process
- Amino acids, DNA, etc.

Goal: to build a joint probabilistic model of the data, p(x_1, ..., x_t), which lets us:

- Predict p(x_t | x_1, ..., x_{t-1})
- Detect abnormal/changed behaviour (if p(x_t, x_{t+1}, ... | x_1, ..., x_{t-1}) is small)
- Recover underlying/latent/hidden causes linking the entire sequence

8-10 Markov models

In general:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_t | x_1, x_2, ..., x_{t-1})

First-order Markov model:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_t | x_{t-1})

[graphical model: chain x_1 -> x_2 -> x_3 -> ... -> x_T]

The term "Markov" refers to a conditional independence relationship. In this case, the Markov property is that, given the present observation (x_t), the future (x_{t+1}, ...) is independent of the past (x_1, ..., x_{t-1}).

Second-order Markov model:

    P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_{t-1} | x_{t-3}, x_{t-2}) P(x_t | x_{t-2}, x_{t-1})
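The first-order factorization above is easy to exercise numerically. Below is a minimal sketch (the two-state transition matrix and initial distribution are illustrative values, not from the notes) that samples a chain and evaluates the factorized joint probability:

```python
import numpy as np

# A first-order Markov chain over K discrete symbols is fully specified by
# an initial distribution p1 and a transition matrix T, where
# T[i, j] = P(x_{t+1} = j | x_t = i).  (Illustrative values.)
rng = np.random.default_rng(0)
p1 = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def sample_chain(p1, T, length):
    """Draw x_1 ~ p1, then x_{t+1} ~ T[x_t, :] at each step."""
    x = [rng.choice(len(p1), p=p1)]
    for _ in range(length - 1):
        x.append(rng.choice(T.shape[1], p=T[x[-1]]))
    return np.array(x)

def log_joint(x, p1, T):
    """log P(x_1, ..., x_t) = log P(x_1) + sum_t log P(x_t | x_{t-1})."""
    return np.log(p1[x[0]]) + np.log(T[x[:-1], x[1:]]).sum()

x = sample_chain(p1, T, 1000)
print(log_joint(x, p1, T))
```

Working in log probabilities keeps the product of many factors from underflowing.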

11 Causal structure and latent variables

[graphical model: latent chain y_1 -> y_2 -> ... -> y_T, each y_t emitting an observation x_t]

Temporal dependence is captured by the latents, with observations conditionally independent given them.

Speech recognition: y = underlying phonemes or words; x = acoustic waveform.
Vision: y = object identities, poses, illumination; x = image pixel values.
Industrial monitoring: y = current state of molten steel in caster; x = temperature and pressure sensor readings.

12 Latent-chain models

The joint probability factorizes:

    P(y_{1:T}, x_{1:T}) = P(y_1) P(x_1 | y_1) prod_{t=2}^T P(y_t | y_{t-1}) P(x_t | y_t)

where y_t and x_t are both real-valued vectors, and 1:T denotes 1, ..., T.

Two frequently-used tractable models:

- Linear-Gaussian state-space models
- Hidden Markov models

13 Linear-Gaussian state-space models (SSMs)

In a linear-Gaussian SSM all conditional distributions are linear and Gaussian:

    Output equation:         x_t = C y_t + v_t
    State dynamics equation: y_t = A y_{t-1} + w_t

where v_t and w_t are uncorrelated zero-mean multivariate Gaussian noise vectors. We also assume y_1 is multivariate Gaussian. The joint distribution over all variables x_{1:T}, y_{1:T} is then (one big) multivariate Gaussian. These models are also known as stochastic linear dynamical systems or Kalman filter models.
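As a concrete illustration of this generative process, the following sketch simulates an LGSSM forward in time (all parameter values here are invented for the example, not taken from the lecture):

```python
import numpy as np

# Simulate y_t = A y_{t-1} + w_t, x_t = C y_t + v_t with Gaussian noise.
rng = np.random.default_rng(1)
K, D, T = 2, 3, 500                        # latent dim, obs dim, length
A = np.array([[0.99, -0.1],                # rotation-like latent dynamics
              [0.1,  0.99]])
C = rng.standard_normal((D, K))            # output loading matrix
Q = 0.01 * np.eye(K)                       # innovations covariance
R = 0.1 * np.eye(D)                        # output noise covariance

y = np.zeros((T, K))
x = np.zeros((T, D))
y[0] = rng.multivariate_normal(np.zeros(K), np.eye(K))   # Gaussian y_1
x[0] = C @ y[0] + rng.multivariate_normal(np.zeros(D), R)
for t in range(1, T):
    y[t] = A @ y[t-1] + rng.multivariate_normal(np.zeros(K), Q)
    x[t] = C @ y[t] + rng.multivariate_normal(np.zeros(D), R)
```

Note that K < D here only for convenience of the example; as the next slide notes, SSMs do not require it.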

14 From factor analysis to state space models

    Factor analysis: x_i = sum_{j=1}^K Λ_{ij} y_j + ε_i
    SSM output:      x_{t,i} = sum_{j=1}^K C_{ij} y_{t,j} + v_{t,i}

Interpretation 1: observations are confined near a low-dimensional subspace (as in FA/PCA), but successive observations are generated from correlated points in the latent space. However: FA requires K < D and Ψ diagonal; SSMs may have K ≥ D and arbitrary output noise. (Why?) Thus the ML estimates of the subspace found by FA and by an SSM may differ.

15 Linear dynamical systems

Interpretation 2: a Markov chain with linear dynamics y_t = A y_{t-1}, perturbed by Gaussian innovations noise, which may describe stochasticity, unknown control, or model mismatch. Observations are a linear projection of the dynamical state, with additive iid Gaussian noise. Notes:

- The dynamical process (y_t) may be higher dimensional than the observations (x_t).
- The observations do not form a Markov chain: longer-scale dependence reflects/reveals the latent dynamics.

16 State Space Models with Control Inputs

State space models can be used to model the input-output behaviour of controlled systems. The observed variables are divided into inputs (u_t) and outputs (x_t).

    State dynamics equation: y_t = A y_{t-1} + B u_{t-1} + w_t
    Output equation:         x_t = C y_t + D u_t + v_t

Note that there are many variants, e.g. y_t = A y_{t-1} + B u_t + w_t, or even y_t = A y_{t-1} + B x_{t-1} + w_t.

17 Hidden Markov models

Discrete hidden states s_t ∈ {1, ..., K}; outputs x_t can be discrete or continuous. The joint probability factorizes:

    P(s_{1:T}, x_{1:T}) = P(s_1) P(x_1 | s_1) prod_{t=2}^T P(s_t | s_{t-1}) P(x_t | s_t)

Generative process:

- A first-order Markov chain generates the hidden state sequence (path):
      initial state probs: π_j = P(s_1 = j)
      transition matrix:   Φ_{ij} = P(s_{t+1} = j | s_t = i)
- A set of emission (output) distributions A_j(·) (one per state) converts the state path into a sequence of observations x_t:
      A_j(x) = p(x_t = x | s_t = j)   (for continuous x_t)
      A_{jk} = P(x_t = k | s_t = j)   (for discrete x_t)
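The generative process can be sketched directly; the two-state transition matrix and three-symbol emission table below are illustrative, not from the lecture:

```python
import numpy as np

# Sample a discrete-output HMM: a Markov chain over states s_t,
# then one emission x_t per state.
rng = np.random.default_rng(2)
pi = np.array([1.0, 0.0])              # initial state probs pi_j
Phi = np.array([[0.95, 0.05],          # Phi[i, j] = P(s_{t+1}=j | s_t=i)
                [0.10, 0.90]])
A_em = np.array([[0.8, 0.1, 0.1],      # A_em[j, k] = P(x_t=k | s_t=j)
                 [0.1, 0.1, 0.8]])

T = 200
s = np.zeros(T, dtype=int)
x = np.zeros(T, dtype=int)
s[0] = rng.choice(2, p=pi)
x[0] = rng.choice(3, p=A_em[s[0]])
for t in range(1, T):
    s[t] = rng.choice(2, p=Phi[s[t-1]])   # hidden state path
    x[t] = rng.choice(3, p=A_em[s[t]])    # emission given state
```

The sticky transition matrix produces long runs of each state, visible as runs of the state's preferred symbol in x.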

18 Hidden Markov models

Two interpretations:

- a Markov chain with stochastic measurements, or
- a mixture model with states coupled across time.

[graphical model in both cases: s_1 -> s_2 -> ... -> s_T, each s_t emitting x_t]

Even though the hidden state sequence is first-order Markov, the output process may not be Markov of any order (for example: ).

Discrete-state, discrete-output models can approximate any continuous dynamics and observation mapping, even if nonlinear; however this is usually not practical. HMMs are related to stochastic finite state machines/automata.

19 Input-output hidden Markov models

Hidden Markov models can also be used to model sequential input-output behaviour:

    P(s_{1:T}, x_{1:T} | u_{1:T}) = P(s_1 | u_1) P(x_1 | s_1, u_1) prod_{t=2}^T P(s_t | s_{t-1}, u_{t-1}) P(x_t | s_t, u_t)

IOHMMs can capture arbitrarily complex input-output relationships; however, the number of states required is often impractical.

20 HMMs and SSMs

(Linear-Gaussian) state space models are the continuous-state analogue of hidden Markov models.

A continuous vector state is a very powerful representation: for an HMM to communicate N bits of information about the past, it needs 2^N states, but a real-valued state vector can store an arbitrary number of bits in principle.

Linear-Gaussian output/dynamics are very weak: the types of dynamics linear SSMs can capture are very limited. HMMs, by contrast, can in principle represent arbitrary stochastic dynamics and output mappings.

21 Many Extensions

- Constrained HMMs
- Continuous-state models with discrete outputs, for time series and static data
- Hierarchical models
- Hybrid systems: mixed continuous & discrete states, switching state-space models

22 Richer state representations

[diagram: several parallel state chains A_t, B_t, C_t, D_t, ... evolving over time]

- Factorial HMMs
- Dynamic Bayesian Networks

These are hidden Markov models with many state variables (i.e. a distributed representation of the state). The state can capture many more bits of information about the sequence (linear in the number of state variables).

23-26 Chain models: ML Learning with EM

    SSM: y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R)
    HMM: s_1 ~ π;  s_t | s_{t-1} ~ Φ_{s_{t-1}, ·};  x_t | s_t ~ A_{s_t}

The structure of learning and inference for both models is dictated by the factored structure:

    P(x_1, ..., x_T, y_1, ..., y_T) = P(y_1) prod_{t=2}^T P(y_t | y_{t-1}) prod_{t=1}^T P(x_t | y_t)

Learning (M-step):

    argmax <log P(x_1, ..., x_T, y_1, ..., y_T)>_{q(y_1,...,y_T)}
      = argmax [ <log P(y_1)>_{q(y_1)} + sum_{t=2}^T <log P(y_t | y_{t-1})>_{q(y_t, y_{t-1})} + sum_{t=1}^T <log P(x_t | y_t)>_{q(y_t)} ]

So the expectations needed in the E-step are derived from singleton and pairwise marginals.

27-29 Chain models: Inference

Three general inference problems:

    Filtering:  P(y_t | x_1, ..., x_t)
    Smoothing:  P(y_t | x_1, ..., x_T)   (also P(y_t, y_{t-1} | x_1, ..., x_T), for learning)
    Prediction: P(y_t | x_1, ..., x_{t-τ})   (for some lag τ > 0)

Naively, these marginal posteriors seem to require very large integrals (or sums):

    P(y_t | x_1, ..., x_t) = ∫ dy_1 ... dy_{t-1} P(y_1, ..., y_t | x_1, ..., x_t)

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating or message passing.

30-35 Crawling the HMM state-lattice

Consider an HMM, where we want to find

    P(s_t = k, x_1, ..., x_t) = sum_{k_1,...,k_{t-1}} P(s_1 = k_1, ..., s_{t-1} = k_{t-1}, s_t = k, x_1, ..., x_t)
                              = sum_{k_1,...,k_{t-1}} π_{k_1} A_{k_1}(x_1) Φ_{k_1 k_2} A_{k_2}(x_2) ... Φ_{k_{t-1} k} A_k(x_t)

Naive algorithm:

- start a "bug" at each of the K states at t = 1, holding value 1
- move each bug forward in time: make copies of each bug for each subsequent state, and multiply the value of each copy by (transition prob. × output emission prob.)
- repeat until all bugs have reached time t
- sum up the values of all K^{t-1} bugs that reach state s_t = k (one bug per state path)

Clever recursion: at every step, replace the bugs at each node with a single bug carrying the sum of their values.

36-41 Probability updating: Bayesian filtering

    P(y_t | x_{1:t}) = ∫ P(y_t, y_{t-1} | x_{1:t}) dy_{t-1}
                     = ∫ P(y_t, y_{t-1} | x_t, x_{1:t-1}) dy_{t-1}
                     = ∫ P(x_t, y_t, y_{t-1} | x_{1:t-1}) / P(x_t | x_{1:t-1}) dy_{t-1}
                     = (1 / P(x_t | x_{1:t-1})) ∫ P(x_t | y_t, y_{t-1}, x_{1:t-1}) P(y_t | y_{t-1}, x_{1:t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}
                     = (1 / P(x_t | x_{1:t-1})) ∫ P(x_t | y_t) P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}    [Markov property]

This is a forward recursion based on Bayes' rule.

42-47 The HMM: Forward pass

The forward recursion for the HMM is a form of dynamic programming. Define:

    α_t(i) = P(x_1, ..., x_t, s_t = i | θ)

Then, much like the Bayesian filtering updates, we have:

    α_1(i) = π_i A_i(x_1)
    α_{t+1}(i) = ( sum_{j=1}^K α_t(j) Φ_{ji} ) A_i(x_{t+1})

We've defined α_t(i) to be a joint rather than a posterior. It's easy to obtain the posterior by normalisation:

    P(s_t = i | x_1, ..., x_t, θ) = α_t(i) / sum_k α_t(k)

This form enables us to compute the likelihood for θ = {A, Φ, π} efficiently, in O(TK^2) time:

    P(x_1, ..., x_T | θ) = sum_{s_1,...,s_T} P(x_1, ..., x_T, s_1, ..., s_T | θ) = sum_{k=1}^K α_T(k)

avoiding the exponential number of paths in the naive sum (number of paths = K^T).
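A sketch of the forward recursion, working with per-step-normalised alphas so that long sequences do not underflow; the log-normalisers accumulate into the log-likelihood. The toy two-state, two-symbol parameters are illustrative:

```python
import numpy as np

def hmm_forward(x, pi, Phi, A_em):
    """Forward recursion: alpha_t(i) = P(x_1..x_t, s_t=i).
    Returns per-step-normalised alphas, i.e. P(s_t = i | x_1..x_t),
    and the log-likelihood log P(x_1..x_T) from the normalisers."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * A_em[:, x[0]]
    loglik = 0.0
    for t in range(T):
        if t > 0:
            alpha[t] = (alpha[t-1] @ Phi) * A_em[:, x[t]]
        z = alpha[t].sum()           # = P(x_t | x_1..x_{t-1})
        loglik += np.log(z)
        alpha[t] /= z                # alpha[t] is now the filtering posterior
    return alpha, loglik

# Toy model (illustrative parameters).
pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, loglik = hmm_forward([0, 1, 0], pi, Phi, A_em)
```

For this 3-step sequence the result can be checked against the brute-force sum over all K^T = 8 paths.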

48-57 The LGSSM: Kalman Filtering

Model: y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R).

For the SSM, the sums become integrals. In general, define

    ŷ_t^τ ≡ E[y_t | x_1, ..., x_τ]    and    V̂_t^τ ≡ V[y_t | x_1, ..., x_τ].

Let ŷ_1^0 = μ_0 and V̂_1^0 = Q_0; then (cf. FA)

    P(y_1 | x_1) = N( ŷ_1^0 + K_1 (x_1 - C ŷ_1^0),  V̂_1^0 - K_1 C V̂_1^0 ) =: N( ŷ_1^1, V̂_1^1 )

    with K_1 = V̂_1^0 C^T (C V̂_1^0 C^T + R)^{-1}.

The general recursion alternates a prediction step

    P(y_t | x_{1:t-1}) = ∫ dy_{t-1} P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1})
                       = N( A ŷ_{t-1}^{t-1},  A V̂_{t-1}^{t-1} A^T + Q ) =: N( ŷ_t^{t-1}, V̂_t^{t-1} )

and a measurement-update step

    P(y_t | x_{1:t}) = N( ŷ_t^{t-1} + K_t (x_t - C ŷ_t^{t-1}),  V̂_t^{t-1} - K_t C V̂_t^{t-1} ) =: N( ŷ_t^t, V̂_t^t )

    with Kalman gain K_t = V̂_t^{t-1} C^T (C V̂_t^{t-1} C^T + R)^{-1}   (a <y x^T> <x x^T>^{-1} form).

Compare FA: β = (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} = Λ^T (Λ Λ^T + Ψ)^{-1} (by the matrix inversion lemma); μ = β x_n; Σ = I - β Λ.
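The prediction and measurement-update steps can be sketched as a short filter loop; the toy one-dimensional random-walk model below uses invented parameters:

```python
import numpy as np

def kalman_filter(X, A, C, Q, R, mu0, Q0):
    """Filtering recursions: returns y_hat[t] = E[y_t | x_1..x_t]
    and V_hat[t] = V[y_t | x_1..x_t]."""
    T, _ = X.shape
    K = len(mu0)
    y_hat = np.zeros((T, K))
    V_hat = np.zeros((T, K, K))
    y_pred, V_pred = mu0, Q0                     # predictive moments at t=1
    for t in range(T):
        if t > 0:
            y_pred = A @ y_hat[t-1]              # prediction step
            V_pred = A @ V_hat[t-1] @ A.T + Q
        S = C @ V_pred @ C.T + R                 # innovation covariance
        Kt = V_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
        y_hat[t] = y_pred + Kt @ (X[t] - C @ y_pred)   # measurement update
        V_hat[t] = V_pred - Kt @ C @ V_pred
    return y_hat, V_hat

# Toy 1D example: a slowly drifting latent scalar under heavy output noise.
rng = np.random.default_rng(3)
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.01]]); R = np.array([[1.0]])
y_true = np.cumsum(rng.normal(0, 0.1, 100))
X = (y_true + rng.normal(0, 1.0, 100))[:, None]
y_hat, V_hat = kalman_filter(X, A, C, Q, R, np.zeros(1), np.eye(1))
```

With Q much smaller than R, the filter averages over many observations and tracks y_true far better than the raw observations do.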

58-60 The marginal posterior: Bayesian smoothing

    P(y_t | x_{1:T}) = P(y_t, x_{t+1:T} | x_{1:t}) / P(x_{t+1:T} | x_{1:t})
                     = P(x_{t+1:T} | y_t) P(y_t | x_{1:t}) / P(x_{t+1:T} | x_{1:t})

The marginal combines a backward message with the forward message found by filtering.

61 The HMM: Forward-Backward Algorithm

State estimation: compute the marginal posterior distribution over the state at time t:

    γ_t(i) ≡ P(s_t = i | x_{1:T}) = P(s_t = i, x_{1:t}) P(x_{t+1:T} | s_t = i) / P(x_{1:T})
           = α_t(i) β_t(i) / sum_j α_t(j) β_t(j)

where there is a simple backward recursion for

    β_t(i) ≡ P(x_{t+1:T} | s_t = i) = sum_{j=1}^K P(s_{t+1} = j, x_{t+1}, x_{t+2:T} | s_t = i)
           = sum_{j=1}^K P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) P(x_{t+2:T} | s_{t+1} = j)
           = sum_{j=1}^K Φ_{ij} A_j(x_{t+1}) β_{t+1}(j)

α_t(i) gives the total inflow of probability to node (t, i); β_t(i) gives the total outflow of probability.

Bugs again: the bugs run forward from time 0 to t and backward from time T to t.
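A sketch of both passes, using locally normalised alphas and correspondingly rescaled betas (the scale factors cancel in γ). The toy model parameters are illustrative:

```python
import numpy as np

def forward_backward(x, pi, Phi, A_em):
    """Compute gamma_t(i) = P(s_t = i | x_1..x_T) via alpha/beta recursions.
    Alphas are normalised at each step for numerical stability; betas are
    divided by the same constants, which cancel in gamma."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = pi * A_em[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ Phi) * A_em[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j Phi[i, j] * A_em[j, x[t+1]] * beta_{t+1}(j)
        beta[t] = (Phi @ (A_em[:, x[t+1]] * beta[t+1])) / c[t+1]
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
gamma = forward_backward([0, 1, 1, 0], pi, Phi, A_em)
```

On a length-4 sequence the result can be verified against the brute-force posterior over all 16 paths.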

62-68 Viterbi decoding

The numbers γ_t(i) computed by forward-backward give the marginal posterior distribution over states at each time. By choosing the state i*_t with the largest γ_t(i) at each time, we can make a "best" state path: the path with the maximum expected number of correct states.

But this is not the single path with the highest probability of generating the data. In fact it may be a path of probability zero!

To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman's dynamic programming algorithm applied to this problem. This is an inference algorithm which computes the most probable state sequence:

    argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}, θ)

The recursions look the same as forward-backward, except with max instead of sum.

Bugs once more: the same trick, except at each step kill all bugs but the one with the highest value at the node.

There is also a modified EM training based on the Viterbi decoder (assignment).
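A sketch of the max-product recursion, done in log space with backtracking pointers; the toy model parameters are illustrative:

```python
import numpy as np

def viterbi(x, pi, Phi, A_em):
    """delta_t(i) = max over paths ending in state i of
    log P(s_1..s_t, x_1..x_t); backtrack to recover the argmax path."""
    T, K = len(x), len(pi)
    logPhi = np.log(Phi)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(pi) + np.log(A_em[:, x[0]])
    for t in range(1, T):
        scores = delta[t-1][:, None] + logPhi     # scores[j, i]: from j to i
        back[t] = scores.argmax(axis=0)           # best predecessor per state
        delta[t] = scores.max(axis=0) + np.log(A_em[:, x[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                # follow pointers backwards
        path[t] = back[t+1, path[t+1]]
    return path

pi = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A_em = np.array([[0.9, 0.1], [0.2, 0.8]])
path = viterbi([0, 0, 1, 1, 0], pi, Phi, A_em)
```

Replacing the max/argmax pair with a sum recovers the forward pass, which is the "max instead of sum" remark made concrete.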

69 The LGSSM: Kalman smoothing

We use a slightly different decomposition:

    P(y_t | x_{1:T}) = ∫ P(y_t, y_{t+1} | x_{1:T}) dy_{t+1}
                     = ∫ P(y_t | y_{t+1}, x_{1:T}) P(y_{t+1} | x_{1:T}) dy_{t+1}
                     = ∫ P(y_t | y_{t+1}, x_{1:t}) P(y_{t+1} | x_{1:T}) dy_{t+1}    [Markov property]

This gives the additional backward recursion:

    J_t = V̂_t^t A^T (V̂_{t+1}^t)^{-1}
    ŷ_t^T = ŷ_t^t + J_t (ŷ_{t+1}^T - A ŷ_t^t)
    V̂_t^T = V̂_t^t + J_t (V̂_{t+1}^T - V̂_{t+1}^t) J_t^T
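The filter pass and this backward recursion can be combined into a single routine; a minimal sketch on invented one-dimensional parameters:

```python
import numpy as np

def kalman_smoother(X, A, C, Q, R, mu0, Q0):
    """Forward (filter) pass, then the backward recursion above.
    Returns smoothed moments E[y_t | x_1..x_T] and V[y_t | x_1..x_T]."""
    T = len(X); K = len(mu0)
    yf = np.zeros((T, K)); Vf = np.zeros((T, K, K))   # filtered moments
    yp = np.zeros((T, K)); Vp = np.zeros((T, K, K))   # one-step predictions
    yp[0], Vp[0] = mu0, Q0
    for t in range(T):
        if t > 0:
            yp[t] = A @ yf[t-1]
            Vp[t] = A @ Vf[t-1] @ A.T + Q
        G = Vp[t] @ C.T @ np.linalg.inv(C @ Vp[t] @ C.T + R)  # gain
        yf[t] = yp[t] + G @ (X[t] - C @ yp[t])
        Vf[t] = Vp[t] - G @ C @ Vp[t]
    ys = yf.copy(); Vs = Vf.copy()                    # smoothed moments
    for t in range(T - 2, -1, -1):
        J = Vf[t] @ A.T @ np.linalg.inv(Vp[t+1])      # J_t
        ys[t] = yf[t] + J @ (ys[t+1] - A @ yf[t])
        Vs[t] = Vf[t] + J @ (Vs[t+1] - Vp[t+1]) @ J.T
    return ys, Vs

rng = np.random.default_rng(4)
X = np.cumsum(rng.normal(0, 0.3, 50))[:, None]   # toy 1D observations
ys, Vs = kalman_smoother(X, np.eye(1), np.eye(1),
                         0.1 * np.eye(1), 0.5 * np.eye(1),
                         np.zeros(1), np.eye(1))
```

Unlike the filter, the smoothed estimate at each t conditions on the whole sequence, so its posterior variance is never larger than the filtered one.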

70 ML Learning for SSMs using batch EM

Parameters: θ = {μ_0, Q_0, A, Q, C, R}.

Free energy:

    F(q, θ) = ∫ dy_{1:T} q(y_{1:T}) ( log P(x_{1:T}, y_{1:T} | θ) - log q(y_{1:T}) )

E-step: maximise F w.r.t. q with θ fixed: q*(y) = p(y | x, θ). This can be achieved with a two-state extension of the Kalman smoother.

M-step: maximise F w.r.t. θ with q fixed. This boils down to solving a few weighted least-squares problems, since all the variables in

    p(y, x | θ) = p(y_1) p(x_1 | y_1) prod_{t=2}^T p(y_t | y_{t-1}) p(x_t | y_t)

form a multivariate Gaussian.

71-77 The M step for C

    p(x_t | y_t) ∝ exp[ -1/2 (x_t - C y_t)^T R^{-1} (x_t - C y_t) ]

    C_new = argmax_C sum_t <ln p(x_t | y_t)>_q
          = argmax_C -1/2 sum_t <(x_t - C y_t)^T R^{-1} (x_t - C y_t)>_q + const
          = argmax_C -1/2 sum_t { x_t^T R^{-1} x_t - 2 x_t^T R^{-1} C <y_t> + <y_t^T C^T R^{-1} C y_t> }
          = argmax_C { Tr[ C sum_t <y_t> x_t^T R^{-1} ] - 1/2 Tr[ C^T R^{-1} C sum_t <y_t y_t^T> ] }

Using ∂Tr[AB]/∂A = B^T, we set the derivative to zero:

    0 = R^{-1} sum_t x_t <y_t>^T - R^{-1} C_new sum_t <y_t y_t^T>

    C_new = ( sum_t x_t <y_t>^T ) ( sum_t <y_t y_t^T> )^{-1}

Notice that this is exactly the same equation as in factor analysis and linear regression!
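The closed-form update can be checked numerically. In this sketch the E-step moments are stand-ins: exact latents are used for <y_t>, as if the posterior were concentrated, purely to exercise the formula (all values are invented):

```python
import numpy as np

# C_new = (sum_t x_t <y_t>^T)(sum_t <y_t y_t^T>)^{-1}
rng = np.random.default_rng(5)
T, K, D = 2000, 2, 4
C_true = rng.standard_normal((D, K))
Y = rng.standard_normal((T, K))                  # stand-in E-step means <y_t>
X = Y @ C_true.T + 0.01 * rng.standard_normal((T, D))  # near-noiseless outputs

Eyy = sum(np.outer(y, y) for y in Y)             # sum_t <y_t y_t^T>
Exy = sum(np.outer(x, y) for x, y in zip(X, Y))  # sum_t x_t <y_t>^T
C_new = Exy @ np.linalg.inv(Eyy)
```

With low output noise and many time steps the update recovers C almost exactly, which is the least-squares/linear-regression analogy in action.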

84 The M step for A

p(y_{t+1} | y_t) ∝ exp[ -½ (y_{t+1} - A y_t)^T Q^{-1} (y_{t+1} - A y_t) ]

A^new = argmax_A Σ_t ⟨ln p(y_{t+1} | y_t)⟩_q
      = argmax_A -½ Σ_t ⟨(y_{t+1} - A y_t)^T Q^{-1} (y_{t+1} - A y_t)⟩_q + const
      = argmax_A -½ Σ_t { ⟨y_{t+1}^T Q^{-1} y_{t+1}⟩ - 2 ⟨y_{t+1}^T Q^{-1} A y_t⟩ + ⟨y_t^T A^T Q^{-1} A y_t⟩ }
      = argmax_A Tr[ A (Σ_t ⟨y_t y_{t+1}^T⟩) Q^{-1} ] - ½ Tr[ A^T Q^{-1} A (Σ_t ⟨y_t y_t^T⟩) ]

Using ∂Tr[AB]/∂A = B^T, setting the gradient to zero gives

0 = Q^{-1} (Σ_t ⟨y_{t+1} y_t^T⟩) - Q^{-1} A (Σ_t ⟨y_t y_t^T⟩)

⇒ A^new = ( Σ_t ⟨y_{t+1} y_t^T⟩ ) ( Σ_t ⟨y_t y_t^T⟩ )^{-1}

This is still analogous to factor analysis and linear regression, but with expected correlations.
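The A update has the same normal-equation form, now built from lagged latent statistics. A sketch with stand-in smoothed means (toy dynamics; posterior covariances again omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 300, 2

# Stand-in smoothed means from an AR(1)-like latent process
y = np.zeros((T, K))
A_true = np.array([[0.9, 0.1], [0.0, 0.7]])   # toy dynamics matrix
for t in range(1, T):
    y[t] = y[t - 1] @ A_true.T + 0.1 * rng.standard_normal(K)

S1 = sum(np.outer(y[t + 1], y[t]) for t in range(T - 1))   # sum_t <y_{t+1} y_t^T>
S0 = sum(np.outer(y[t], y[t]) for t in range(T - 1))       # sum_t <y_t y_t^T>

# M step: A_new = (sum_t <y_{t+1} y_t^T>)(sum_t <y_t y_t^T>)^{-1}
A_new = np.linalg.solve(S0.T, S1.T).T

assert np.allclose(A_new @ S0, S1)            # normal equations hold
```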

85 Learning (online gradient) Time series data must often be processed in real time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates. Consider the log likelihood contributed by each data point (l_t):

ℓ = Σ_{t=1}^T ln p(x_t | x_1, ..., x_{t-1}) = Σ_{t=1}^T l_t

l_t = -(D/2) ln 2π - ½ ln|Σ_t| - ½ (x_t - C ŷ_t^{t-1})^T Σ_t^{-1} (x_t - C ŷ_t^{t-1})

where D is the dimension of x, and:

ŷ_t^{t-1} = A ŷ_{t-1}^{t-1}        V̂_t^{t-1} = A V̂_{t-1}^{t-1} A^T + Q        Σ_t = C V̂_t^{t-1} C^T + R

We differentiate l_t to obtain gradient rules for A, C, Q, R. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
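The decomposition ℓ = Σ_t l_t can be verified numerically: for a scalar LGSSM the accumulated predictive terms must equal the joint Gaussian log density of the whole sequence. A sketch with toy parameter values (a, q, c, r and the data are all made up):

```python
import numpy as np

a, q, c, r = 0.8, 0.5, 1.2, 0.3           # toy scalar LGSSM parameters
Pi1 = q / (1 - a**2)                       # stationary prior variance for y_1
x = np.array([0.4, -1.1, 0.7])             # short observed sequence
T = len(x)

# Kalman filter, accumulating l_t = log p(x_t | x_{1:t-1})
yhat, V = 0.0, Pi1                         # predictive mean/variance for y_1
ll = 0.0
for t in range(T):
    S = c**2 * V + r                       # Sigma_t = C Vhat C^T + R
    e = x[t] - c * yhat                    # innovation
    ll += -0.5 * np.log(2 * np.pi * S) - 0.5 * e**2 / S
    K = V * c / S                          # Kalman gain
    yhat, V = yhat + K * e, (1 - K * c) * V      # measurement update
    yhat, V = a * yhat, a**2 * V + q             # predict step for t+1

# Joint covariance of x_{1:T}: Cov(x_s, x_t) = c^2 a^{|s-t|} Pi1 + r [s == t]
S_joint = np.array([[c**2 * a**abs(s - t) * Pi1 + (r if s == t else 0.0)
                     for t in range(T)] for s in range(T)])
ll_joint = (-0.5 * T * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(S_joint)[1]
            - 0.5 * x @ np.linalg.solve(S_joint, x))
assert np.isclose(ll, ll_joint)            # sum_t l_t = log p(x_{1:T})
```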

86 Learning HMMs using EM

[Figure: Markov chain s_1 → s_2 → s_3 → ... → s_T with emissions x_1, x_2, x_3, ..., x_T]

Parameters: θ = {π, Φ, A}

Free energy:

F(q, θ) = Σ_{s_{1:T}} q(s_{1:T}) ( log P(x_{1:T}, s_{1:T} | θ) - log q(s_{1:T}) )

E step: Maximise F w.r.t. q with θ fixed: q*(s_{1:T}) = P(s_{1:T} | x_{1:T}, θ). We will only need the marginal probabilities q(s_t, s_{t+1}), which can also be obtained from the forward-backward algorithm.

M step: Maximise F w.r.t. θ with q fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state i, emitted symbol k and transitioned to state j.

This is the Baum-Welch algorithm and it predates the (more general) EM algorithm.

90 M step: Parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of F w.r.t. θ.

The initial state distribution is the expected number of times in state i at t = 1:

π̂_i = γ_1(i)

The expected number of transitions from state i to j which begin at time t is:

ξ_t(i→j) ≡ P(s_t = i, s_{t+1} = j | x_{1:T}) = α_t(i) Φ_ij A_j(x_{t+1}) β_{t+1}(j) / P(x_{1:T})

so the estimated transition probabilities are:

Φ̂_ij = Σ_{t=1}^{T-1} ξ_t(i→j) / Σ_{t=1}^{T-1} γ_t(i)

The output distributions are the expected number of times we observe a particular symbol in a particular state:

Â_ik = Σ_{t: x_t = k} γ_t(i) / Σ_{t=1}^{T} γ_t(i)

(or the state-probability-weighted mean and variance for a Gaussian output model).

93 HMM practicalities

Numerical scaling: the conventional message definition is in terms of a large joint: α_t(i) = P(x_{1:t}, s_t = i) → 0 as t grows, and so can easily underflow. Rescale:

α̃_t(i) = A_i(x_t) Σ_j α̂_{t-1}(j) Φ_ji        ρ_t = Σ_{i=1}^K α̃_t(i)        α̂_t(i) = α̃_t(i) / ρ_t

Exercise: show that

ρ_t = P(x_t | x_{1:t-1}, θ)        Π_{t=1}^T ρ_t = P(x_{1:T} | θ)

What does this make α̂_t(i)?

Multiple observed sequences: average numerators and denominators in the ratios of updates.

Local optima (random restarts, annealing; see discussion later).

94 HMM pseudocode: inference (E step)

Forward-backward including scaling tricks. [∘ is the element-by-element (Hadamard/Schur) product: `.*` in Matlab.]

for t = 1:T, i = 1:K
    p_t(i) = A_i(x_t)
α_1 = π ∘ p_1 ;  ρ_1 = Σ_{i=1}^K α_1(i) ;  α_1 = α_1 / ρ_1
for t = 2:T
    α_t = (Φ^T α_{t-1}) ∘ p_t ;  ρ_t = Σ_{i=1}^K α_t(i) ;  α_t = α_t / ρ_t
β_T = 1
for t = T-1:1
    β_t = Φ (β_{t+1} ∘ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = Σ_{t=1}^T log(ρ_t)
for t = 1:T
    γ_t = α_t ∘ β_t
for t = 1:T-1
    ξ_t = Φ ∘ (α_t (β_{t+1} ∘ p_{t+1})^T) / ρ_{t+1}
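The pseudocode above translates almost line-for-line into numpy. A sketch on a toy discrete-output HMM (the parameter values are made up), checked against brute-force enumeration of all state paths:

```python
import numpy as np
from itertools import product

pi = np.array([0.6, 0.4])                     # initial state distribution
Phi = np.array([[0.7, 0.3], [0.2, 0.8]])      # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.5, 0.5], [0.9, 0.1]])        # A[i, k] = P(x_t=k | s_t=i)
x = [0, 1, 1, 0, 1]
T, K = len(x), len(pi)

p = A[:, x].T                                  # p[t, i] = A_i(x_t)

# Scaled forward pass
alpha, rho = np.zeros((T, K)), np.zeros(T)
alpha[0] = pi * p[0]; rho[0] = alpha[0].sum(); alpha[0] /= rho[0]
for t in range(1, T):
    alpha[t] = (Phi.T @ alpha[t - 1]) * p[t]
    rho[t] = alpha[t].sum(); alpha[t] /= rho[t]

# Scaled backward pass
beta = np.ones((T, K))
for t in range(T - 2, -1, -1):
    beta[t] = Phi @ (beta[t + 1] * p[t + 1]) / rho[t + 1]

gamma = alpha * beta                           # P(s_t = i | x_{1:T})
logP = np.log(rho).sum()                       # log P(x_{1:T})

# Brute force: sum over all K^T state sequences
total = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * p[0, s[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * p[t, s[t]]
    total += pr
assert np.isclose(logP, np.log(total))
assert np.allclose(gamma.sum(axis=1), 1.0)
```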

95 HMM pseudocode: parameter re-estimation (M step)

Baum-Welch parameter updates: for each sequence l = 1:L, run forward-backward to get γ^(l) and ξ^(l), then

π̂_i = (1/L) Σ_{l=1}^L γ_1^(l)(i)

Φ̂_ij = Σ_{l=1}^L Σ_{t=1}^{T^(l)-1} ξ_t^(l)(ij)  /  Σ_{l=1}^L Σ_{t=1}^{T^(l)-1} γ_t^(l)(i)

Â_ik = Σ_{l=1}^L Σ_{t=1}^{T^(l)} δ(x_t = k) γ_t^(l)(i)  /  Σ_{l=1}^L Σ_{t=1}^{T^(l)} γ_t^(l)(i)
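The ratio-of-counts updates are easy to sanity-check: because Σ_j ξ_t(i,j) = γ_t(i), the re-estimated π, Φ and A are automatically normalised. A sketch using exact posteriors from brute-force enumeration on a toy HMM (all parameter values made up; for a single short sequence, so the L sums collapse):

```python
import numpy as np
from itertools import product

pi = np.array([0.5, 0.5])
Phi = np.array([[0.6, 0.4], [0.3, 0.7]])   # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.8, 0.2], [0.1, 0.9]])     # A[i, k] = P(x_t=k | s_t=i)
x = [0, 0, 1, 0]
T, K = len(x), len(pi)

# Exact posteriors gamma_t(i), xi_t(i, j) by enumeration (fine for tiny T)
gamma = np.zeros((T, K)); xi = np.zeros((T - 1, K, K)); Z = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * A[s[t], x[t]]
    Z += pr
    for t in range(T):
        gamma[t, s[t]] += pr
    for t in range(T - 1):
        xi[t, s[t], s[t + 1]] += pr
gamma /= Z; xi /= Z

# Baum-Welch M step: ratios of expected counts
pi_new = gamma[0]
Phi_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
A_new = np.zeros_like(A)
for k in range(A.shape[1]):
    A_new[:, k] = sum(gamma[t] for t in range(T) if x[t] == k)
A_new /= gamma.sum(0)[:, None]

assert np.isclose(pi_new.sum(), 1.0)
assert np.allclose(Phi_new.sum(1), 1.0)    # each row a valid transition distribution
assert np.allclose(A_new.sum(1), 1.0)      # each row a valid output distribution
```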

103 Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of y:

P(y) = N(0, I)            & ỹ = Uy    ⇒  P(ỹ) = N(U0, U I U^T) = N(0, I)
P(x | y) = N(Λy, Ψ)         Λ̃ = ΛU^T   ⇒  P(x | ỹ) = N(Λ U^T U y, Ψ) = N(Λ̃ ỹ, Ψ)

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

P(y_{t+1} | y_t) = N(A y_t, Q)     & ỹ = Gy    Ã = G A G^{-1}    Q̃ = G Q G^T    C̃ = C G^{-1}
P(x_t | y_t) = N(C y_t, R)

⇒  P(ỹ_{t+1} | ỹ_t) = N(G A G^{-1} G y_t, G Q G^T) = N(Ã ỹ_t, Q̃)
   P(x_t | ỹ_t) = N(C G^{-1} G y_t, R) = N(C̃ ỹ_t, R)

and the HMM is invariant to permutations (and to relaxations into something called an observable operator model).
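The LGSSM degeneracy can be verified numerically: the transformed model (Ã, Q̃, C̃) reproduces the stationary observation moments of the original exactly. A sketch with toy parameter values (A, C, G all made up):

```python
import numpy as np

rng = np.random.default_rng(2)
K, D = 2, 3

A = np.array([[0.6, 0.2], [-0.1, 0.5]])   # stable toy dynamics
Q = np.eye(K)
C = rng.standard_normal((D, K))
R = 0.1 * np.eye(D)

# Stationary latent covariance: Pi = A Pi A^T + Q (fixed-point iteration)
Pi = np.eye(K)
for _ in range(500):
    Pi = A @ Pi @ A.T + Q

def x_moments(A, C, R, Pi):
    """Stationary observation moments: Cov(x_t) and Cov(x_{t+1}, x_t)."""
    return C @ Pi @ C.T + R, C @ A @ Pi @ C.T

G = np.array([[2.0, 1.0], [0.5, 1.5]])     # an arbitrary invertible transform
Gi = np.linalg.inv(G)
At, Qt, Ct = G @ A @ Gi, G @ Q @ G.T, C @ Gi
Pit = G @ Pi @ G.T                         # stationary covariance of y~ = G y

# The transformed model produces identical observation statistics
for m, mt in zip(x_moments(A, C, R, Pi), x_moments(At, Ct, R, Pit)):
    assert np.allclose(m, mt)
```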

109 The likelihood landscape

Besides the degenerate solutions, both LGSSM and HMM likelihood functions tend to have multiple local maxima. This complicates any ML-based approach to learning (including EM and gradient methods).

Strategies:

- Restart EM/gradient ascent from different (often random) initial parameter values.
- Initialise output parameters with a stationary model: PCA → FA → LGSSM; k-means → mixture → HMM.
- Stochastic gradient methods and momentum. Overstepping EM.
- Deterministic annealing.
- Non-ML learning (spectral methods).

114 Slow Feature Analysis (SFA)

Is there a zero-noise limit for an LGSSM analogous to PCA? We can take:

A = diag[a_1 ... a_K] → I   (with a_1 ≥ a_2 ≥ ⋯ ≥ a_K)
Q = I - A A^T → 0   (setting the stationary latent covariance to I)
R → 0.

This limit yields a spectral algorithm called Slow Feature Analysis (or SFA).

SFA is conventionally defined as slowness pursuit (cf. PCA and variance pursuit): given a zero-mean series {x_1, x_2, ..., x_T}, find a K×D matrix W (= C^T) such that y_t = W x_t changes as slowly as possible. Specifically, find

W = argmin_W Σ_t ||y_t - y_{t-1}||²   subject to   Σ_t y_t y_t^T = I.

The variance constraint prevents the trivial solutions W = 0 and W = 1w^T.

W can be found by solving the generalised eigenvalue problem W A = Ω W B, where A = Σ_t (x_t - x_{t-1})(x_t - x_{t-1})^T and B = Σ_t x_t x_t^T. See maneesh/papers/turner-sahani-2007-sfa.pdf.
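The generalised eigenproblem can be solved by whitening with B and taking an ordinary symmetric eigendecomposition. A sketch on a toy mixture of a slow and a fast sinusoid (sources and mixing are made up; the slowest extracted feature should recover the slow source up to sign):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 2000
t = np.arange(T)
slow = np.sin(2 * np.pi * t / 400)         # slowly varying source
fast = np.sin(2 * np.pi * t / 11)          # quickly varying source
S = np.stack([slow, fast], axis=1)
X = S @ rng.standard_normal((2, 2))        # mixed, zero-mean observations

# A = sum_t dx dx^T,  B = sum_t x x^T
dX = np.diff(X, axis=0)
Amat = dX.T @ dX
Bmat = X.T @ X

# Whiten: B = L L^T, then solve the symmetric problem in the whitened space
L = np.linalg.cholesky(Bmat)
Li = np.linalg.inv(L)
evals, evecs = np.linalg.eigh(Li @ Amat @ Li.T)
W = (Li.T @ evecs).T                       # rows ordered slowest first (eigh sorts ascending)

y = X @ W[0]                               # slowest extracted feature
corr = np.corrcoef(y, slow)[0, 1]
assert abs(corr) > 0.95                    # recovers the slow source up to sign
```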

119 (Generalised) Method of Moments estimators

What if we don't want the A → I limit? The limit is necessary to make the spectral estimate and the (limiting) ML estimates coincide. However, if we give up on ML, then there are general spectral estimates available.

The ML parameter estimates are defined:

θ_ML = argmax_θ log P(X | θ)

and as you found, if P ∈ ExpFam with sufficient statistic T then

⟨T(x)⟩_{θ_ML} = (1/N) Σ_i T(x_i)   (moment matching).

We can generalise this condition to arbitrary T even if P ∉ ExpFam. That is, we find the estimate θ* with

⟨T(x)⟩_{θ*} = (1/N) Σ_i T(x_i)   or   θ* = argmin_θ || ⟨T(x)⟩_θ - (1/N) Σ_i T(x_i) ||_C

Judicious choice of T and metric C might make the solution unique (no local optima) and consistent (correct given infinite within-model data).
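As a classical illustration of moment matching (not from the slides): for a Gamma(k, θ) model, matching T(x) = (x, x²) gives closed-form estimates with no likelihood optimisation, since the mean is kθ and the variance kθ²:

```python
import numpy as np

rng = np.random.default_rng(4)
k_true, theta_true = 3.0, 2.0              # toy shape/scale parameters
x = rng.gamma(k_true, theta_true, size=200_000)

# Match T(x) = (x, x^2): mean = k*theta, variance = k*theta^2
m, v = x.mean(), x.var()
theta_hat = v / m                          # scale estimate
k_hat = m * m / v                          # shape estimate

assert abs(k_hat - k_true) < 0.1
assert abs(theta_hat - theta_true) < 0.1
```

Unlike the ML estimate for the Gamma shape (which needs an iterative solve), this moment-matching estimator is unique and consistent, at some cost in efficiency.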

130 Ho-Kalman SSID for LGSSMs

[Figure: chain y_{t-2} → y_{t-1} → y_t → y_{t+1} → y_{t+2} with emissions x_{t-2}, ..., x_{t+2}]

Define the stacked vectors x_t^- = [x_{t-1:t-L}] and x_t^+ = [x_{t:t+L-1}], and the lagged moments

M_τ ≡ ⟨x_{t+τ} x_t^T⟩ = ⟨x_{t+τ} (C y_t + η)^T⟩ = ⟨x_{t+τ} y_t^T⟩ C^T = ⟨(C A^τ y_t + η) y_t^T⟩ C^T = C A^τ ⟨y_t y_t^T⟩ C^T = C A^τ Π C^T

Then the Hankel matrix of cross-moments factorises:

H ≡ ⟨x_t^+ (x_t^-)^T⟩ = [ M_1  M_2  ⋯  M_L
                          M_2  M_3      ⋮
                          ⋮
                          M_L  ⋯   M_{2L-1} ]
                      = [ C ; C A ; ⋯ ; C A^{L-1} ] [ A Π C^T  ⋯  A^L Π C^T ]
                      =  Ξ  ·  Υ

with H of size LD × LD, Ξ of size LD × K, and Υ of size K × LD.

- Off-diagonal correlation is unaffected by noise.
- SVD( (1/T) Σ_t x_t^+ (x_t^-)^T ) yields least-squares estimates of Ξ and Υ.
- Regression between blocks of Ξ yields Â and Ĉ.
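The whole pipeline can be exercised on exact moments: build H from M_τ = C A^τ Π C^T, truncate the SVD at rank K, and regress between shifted blocks of Ξ. The recovered Â is a similarity transform of A, so its eigenvalues match exactly. A sketch with toy matrices (dynamics eigenvalues fixed at 0.9 and 0.5):

```python
import numpy as np

rng = np.random.default_rng(5)
K, D, L = 2, 2, 3

# Toy LGSSM with known eigenvalues, hidden behind a similarity transform
R_ = rng.standard_normal((K, K)) + 2 * np.eye(K)
A = R_ @ np.diag([0.9, 0.5]) @ np.linalg.inv(R_)
C = rng.standard_normal((D, K))
Q = np.eye(K)

# Stationary latent covariance: Pi = A Pi A^T + Q
Pi = np.eye(K)
for _ in range(500):
    Pi = A @ Pi @ A.T + Q

# Exact lagged moments M_tau = C A^tau Pi C^T, tau = 1..2L-1
M = [C @ np.linalg.matrix_power(A, tau) @ Pi @ C.T for tau in range(1, 2 * L)]

# Hankel matrix: block (i, j) holds M_{i+j+1}
H = np.block([[M[i + j] for j in range(L)] for i in range(L)])

# Rank-K SVD gives Xi (the extended observability matrix) up to similarity
U, s, Vt = np.linalg.svd(H)
Xi = U[:, :K] * np.sqrt(s[:K])

# Shift-invariance: Xi[:-D] @ A = Xi[D:], so regression recovers A up to similarity
A_hat = np.linalg.lstsq(Xi[:-D], Xi[D:], rcond=None)[0]

# Eigenvalues are similarity invariants
assert np.allclose(np.sort(np.linalg.eigvals(A_hat).real), [0.5, 0.9])
```

With finite data, H is replaced by the empirical average (1/T) Σ_t x_t^+ (x_t^-)^T, and the same steps give consistent (if not maximum-likelihood) estimates.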

133 HMMs → OOMs

[Figure: chain s_1 → s_2 → s_3 → ... → s_T with emissions x_1, x_2, x_3, ..., x_T]

Now consider an HMM with discrete output symbols. The likelihood can be written:

P(x_{1:T} | π, Φ, A) = Σ_i π_i A_i(x_1) Σ_j Φ_ji A_j(x_2) Σ_k Φ_kj A_k(x_3) ⋯
                     = π^T A_{x_1} Φ^T A_{x_2} Φ^T A_{x_3} ⋯        [A_{x_1} = diag[A_1(x_1), A_2(x_1), ...]]
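This product-of-operators form of the likelihood is easy to check numerically. A sketch using the row-stochastic convention Φ[i, j] = P(s_{t+1}=j | s_t=i) from the forward-backward pseudocode, under which the product reads π^T A_{x_1} Φ A_{x_2} ⋯ A_{x_T} 1 (toy parameters, verified against brute-force enumeration):

```python
import numpy as np
from itertools import product

pi = np.array([0.3, 0.7])
Phi = np.array([[0.9, 0.1], [0.4, 0.6]])   # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = np.array([[0.6, 0.4], [0.2, 0.8]])     # A[i, k] = P(x_t=k | s_t=i)
x = [1, 0, 0, 1]
K, T = len(pi), len(x)

def Ax(k):
    """Observation operator: diagonal matrix of output probabilities for symbol k."""
    return np.diag(A[:, k])

# Operator product: P(x_{1:T}) = pi^T A_{x1} Phi A_{x2} ... Phi A_{xT} 1
v = pi @ Ax(x[0])
for k in x[1:]:
    v = v @ Phi @ Ax(k)
P_oom = v.sum()

# Brute-force check: sum over all state sequences
P_bf = 0.0
for s in product(range(K), repeat=T):
    pr = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, T):
        pr *= Phi[s[t - 1], s[t]] * A[s[t], x[t]]
    P_bf += pr
assert np.isclose(P_oom, P_bf)
```

The OOM relaxation keeps this operator product but drops the requirement that the operators come from nonnegative transition and emission probabilities.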


More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Lecture 11: Hidden Markov Models

Lecture 11: Hidden Markov Models Lecture 11: Hidden Markov Models Cognitive Systems - Machine Learning Cognitive Systems, Applied Computer Science, Bamberg University slides by Dr. Philip Jackson Centre for Vision, Speech & Signal Processing

More information

Hidden Markov Models

Hidden Markov Models CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each

More information

Lecture 6: April 19, 2002

Lecture 6: April 19, 2002 EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models CI/CI(CS) UE, SS 2015 Christian Knoll Signal Processing and Speech Communication Laboratory Graz University of Technology June 23, 2015 CI/CI(CS) SS 2015 June 23, 2015 Slide 1/26 Content

More information

Chapter 3 - Temporal processes

Chapter 3 - Temporal processes STK4150 - Intro 1 Chapter 3 - Temporal processes Odd Kolbjørnsen and Geir Storvik January 23 2017 STK4150 - Intro 2 Temporal processes Data collected over time Past, present, future, change Temporal aspect

More information

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden

More information

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

More information

Graphical Models Seminar

Graphical Models Seminar Graphical Models Seminar Forward-Backward and Viterbi Algorithm for HMMs Bishop, PRML, Chapters 13.2.2, 13.2.3, 13.2.5 Dinu Kaufmann Departement Mathematik und Informatik Universität Basel April 8, 2013

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David

More information

1 What is a hidden Markov model?

1 What is a hidden Markov model? 1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1

Hidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1 Hidden Markov Models AIMA Chapter 15, Sections 1 5 AIMA Chapter 15, Sections 1 5 1 Consider a target tracking problem Time and uncertainty X t = set of unobservable state variables at time t e.g., Position

More information

1. Markov models. 1.1 Markov-chain

1. Markov models. 1.1 Markov-chain 1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010

More information

Statistical Sequence Recognition and Training: An Introduction to HMMs

Statistical Sequence Recognition and Training: An Introduction to HMMs Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with

More information

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017 1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow

More information

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II

Robert Collins CSE586 CSE 586, Spring 2015 Computer Vision II CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter Recall: Modeling Time Series State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Bayesian Machine Learning - Lecture 7

Bayesian Machine Learning - Lecture 7 Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore

Supervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore Supervised Learning Hidden Markov Models Some of these slides were inspired by the tutorials of Andrew Moore A Markov System S 2 Has N states, called s 1, s 2.. s N There are discrete timesteps, t=0, t=1,.

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Towards a Bayesian model for Cyber Security

Towards a Bayesian model for Cyber Security Towards a Bayesian model for Cyber Security Mark Briers (mbriers@turing.ac.uk) Joint work with Henry Clausen and Prof. Niall Adams (Imperial College London) 27 September 2017 The Alan Turing Institute

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Hidden Markov Models,99,100! Markov, here I come!

Hidden Markov Models,99,100! Markov, here I come! Hidden Markov Models,99,100! Markov, here I come! 16.410/413 Principles of Autonomy and Decision-Making Pedro Santana (psantana@mit.edu) October 7 th, 2015. Based on material by Brian Williams and Emilio

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information