Probabilistic & Unsupervised Learning: Latent Variable Models for Time Series
Probabilistic & Unsupervised Learning
Latent Variable Models for Time Series

Maneesh Sahani
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2016
Modeling time series

Thus far, our data have been (marginally) iid. Now consider a sequence of observations

  x_1, x_2, x_3, ..., x_t

that are not independent.

Examples:
- Sequence of images
- Stock prices
- Recorded speech or music, English sentences
- Kinematic variables in a robot
- Sensor readings from an industrial process
- Amino acids, DNA, etc.

Goal: build a joint probabilistic model of the data, p(x_1, ..., x_t), in order to:
- predict p(x_t | x_1, ..., x_{t-1});
- detect abnormal/changed behaviour (when p(x_t, x_{t+1}, ... | x_1, ..., x_{t-1}) is small);
- recover underlying/latent/hidden causes linking the entire sequence.
Markov models

In general:
  P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) ... P(x_t | x_1, x_2, ..., x_{t-1})

First-order Markov model:
  P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_t | x_{t-1})

[graphical model: chain x_1 -> x_2 -> x_3 -> ... -> x_T]

The term "Markov" refers to a conditional independence relationship. In this case, the Markov property is that, given the present observation (x_t), the future (x_{t+1}, ...) is independent of the past (x_1, ..., x_{t-1}).

Second-order Markov model:
  P(x_1, ..., x_t) = P(x_1) P(x_2 | x_1) ... P(x_{t-1} | x_{t-3}, x_{t-2}) P(x_t | x_{t-2}, x_{t-1})
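For concreteness, the first-order factorization can be checked numerically. A minimal NumPy sketch with an arbitrary 3-state discrete chain (the transition probabilities are illustrative, not from the lecture):

```python
import numpy as np

# A small first-order Markov chain over K = 3 discrete states.
p0 = np.array([0.5, 0.3, 0.2])            # P(x_1)
T = np.array([[0.9, 0.05, 0.05],          # T[i, j] = P(x_{t+1} = j | x_t = i)
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])

def joint_prob(seq):
    """P(x_1, ..., x_t) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    p = p0[seq[0]]
    for a, b in zip(seq[:-1], seq[1:]):
        p *= T[a, b]
    return p

# The chain-rule factorization defines a valid joint: probabilities of all
# length-3 sequences sum to one.
total = sum(joint_prob([i, j, k])
            for i in range(3) for j in range(3) for k in range(3))
```

Note the model needs only K + K^2 numbers however long the sequence, which is exactly the saving the Markov assumption buys.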
Causal structure and latent variables

[graphical model: latent chain y_1 -> y_2 -> ... -> y_T, with each x_t emitted from y_t]

Temporal dependence is captured by the latents, with observations conditionally independent given them.

Speech recognition: y = underlying phonemes or words; x = acoustic waveform.
Vision: y = object identities, poses, illumination; x = image pixel values.
Industrial monitoring: y = current state of molten steel in caster; x = temperature and pressure sensor readings.
Latent-chain models

Joint probability factorizes:
  P(y_{1:T}, x_{1:T}) = P(y_1) P(x_1 | y_1) prod_{t=2}^T P(y_t | y_{t-1}) P(x_t | y_t)

where y_t and x_t are vectors (real-valued or discrete), and 1:T denotes 1, ..., T.

Two frequently used tractable models:
- Linear-Gaussian state-space models
- Hidden Markov models
Linear-Gaussian state-space models (SSMs)

In a linear-Gaussian SSM all conditional distributions are linear and Gaussian:

  Output equation:         x_t = C y_t + v_t
  State dynamics equation: y_t = A y_{t-1} + w_t

where v_t and w_t are uncorrelated zero-mean multivariate Gaussian noise vectors. Also assume y_1 is multivariate Gaussian. The joint distribution over all variables x_{1:T}, y_{1:T} is then (one big) multivariate Gaussian.

These models are also known as stochastic linear dynamical systems, or Kalman filter models.
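The two equations above translate directly into a sampler. A short NumPy sketch (the particular A, C, Q, R below are arbitrary illustrative choices, not parameters from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D latent, 3-D observed linear-Gaussian SSM.
A = np.array([[0.99, -0.1], [0.1, 0.99]])   # state dynamics matrix
C = rng.standard_normal((3, 2))             # output (loading) matrix
Q = 0.01 * np.eye(2)                        # innovations covariance
R = 0.1 * np.eye(3)                         # output noise covariance
mu0, Q0 = np.zeros(2), np.eye(2)            # prior on y_1

def sample_ssm(T):
    y = np.empty((T, 2)); x = np.empty((T, 3))
    y[0] = rng.multivariate_normal(mu0, Q0)
    for t in range(T):
        if t > 0:
            # state dynamics: y_t = A y_{t-1} + w_t
            y[t] = A @ y[t-1] + rng.multivariate_normal(np.zeros(2), Q)
        # output equation: x_t = C y_t + v_t
        x[t] = C @ y[t] + rng.multivariate_normal(np.zeros(3), R)
    return y, x

y, x = sample_ssm(100)
```

Because every step is a linear map plus Gaussian noise, stacking (y, x) over all t indeed gives one big jointly Gaussian vector.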
From factor analysis to state-space models

  Factor analysis: x_i = sum_{j=1}^K Λ_ij y_j + ε_i
  SSM output:      x_{t,i} = sum_{j=1}^K C_ij y_{t,j} + v_{t,i}

Interpretation 1: observations are confined near a low-dimensional subspace (as in FA/PCA); successive observations are generated from correlated points in the latent space.

However: FA requires K < D and Ψ diagonal; SSMs may have K ≥ D and arbitrary output noise. Why? Thus ML estimates of the subspace by FA and by an SSM may differ.
Linear dynamical systems

Interpretation 2: a Markov chain with linear dynamics y_t = A y_{t-1}, perturbed by Gaussian innovations noise, which may describe stochasticity, unknown control, or model mismatch. Observations are a linear projection of the dynamical state, with additive iid Gaussian noise.

Note:
- The dynamical process (y_t) may be higher dimensional than the observations (x_t).
- The observations do not form a Markov chain: longer-scale dependence reflects/reveals the latent dynamics.
State-space models with control inputs

State-space models can be used to model the input-output behaviour of controlled systems. The observed variables are divided into inputs (u_t) and outputs (x_t).

  State dynamics equation: y_t = A y_{t-1} + B u_{t-1} + w_t
  Output equation:         x_t = C y_t + D u_t + v_t

Note that we can have many variants, e.g. y_t = A y_{t-1} + B u_t + w_t, or even y_t = A y_{t-1} + B x_{t-1} + w_t.
Hidden Markov models

Discrete hidden states s_t ∈ {1, ..., K}; outputs x_t can be discrete or continuous.

Joint probability factorizes:
  P(s_{1:T}, x_{1:T}) = P(s_1) P(x_1 | s_1) prod_{t=2}^T P(s_t | s_{t-1}) P(x_t | s_t)

Generative process:
- A first-order Markov chain generates the hidden state sequence (path):
    initial state probs: π_j = P(s_1 = j)
    transition matrix:   Φ_ij = P(s_{t+1} = j | s_t = i)
- A set of emission (output) distributions A_j(·) (one per state) converts the state path into a sequence of observations x_t:
    A_j(x) = P(x_t = x | s_t = j)  (for continuous x_t)
    A_jk   = P(x_t = k | s_t = j)  (for discrete x_t)
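The generative process above can be sketched directly in code. A toy 2-state, 3-symbol HMM in NumPy (π, Φ, A values are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

pi  = np.array([0.6, 0.4])                 # pi_j = P(s_1 = j)
Phi = np.array([[0.95, 0.05],              # Phi[i, j] = P(s_{t+1} = j | s_t = i)
                [0.10, 0.90]])
A   = np.array([[0.7, 0.2, 0.1],           # A[j, k] = P(x_t = k | s_t = j)
                [0.1, 0.3, 0.6]])

def sample_hmm(T):
    s = np.empty(T, dtype=int); x = np.empty(T, dtype=int)
    s[0] = rng.choice(2, p=pi)
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(2, p=Phi[s[t-1]])   # first-order Markov state path
        x[t] = rng.choice(3, p=A[s[t]])           # emission from the current state
    return s, x

s, x = sample_hmm(50)
```

With the sticky transition matrix above, sampled state paths tend to stay in one state for long runs, which is what makes the output sequence non-iid even though each emission depends only on the current state.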
Hidden Markov models

Two interpretations:
- a Markov chain with stochastic measurements, or
- a mixture model with states coupled across time.

Even though the hidden state sequence is first-order Markov, the output process may not be Markov of any order.

Discrete-state, discrete-output models can approximate any continuous dynamics and observation mapping, even nonlinear ones; however this is usually not practical.

HMMs are related to stochastic finite state machines/automata.
Input-output hidden Markov models

Hidden Markov models can also be used to model sequential input-output behaviour:
  P(s_{1:T}, x_{1:T} | u_{1:T}) = P(s_1 | u_1) P(x_1 | s_1, u_1) prod_{t=2}^T P(s_t | s_{t-1}, u_{t-1}) P(x_t | s_t, u_t)

IOHMMs can capture arbitrarily complex input-output relationships; however, the number of states required is often impractical.
HMMs and SSMs

(Linear-Gaussian) state-space models are the continuous-state analogue of hidden Markov models.

A continuous vector state is a very powerful representation. For an HMM to communicate N bits of information about the past, it needs 2^N states; but a real-valued state vector can store an arbitrary number of bits in principle.

On the other hand, linear-Gaussian output/dynamics are very weak: the types of dynamics linear SSMs can capture are very limited. HMMs can in principle represent arbitrary stochastic dynamics and output mappings.
Many extensions

- Constrained HMMs
- Continuous state models with discrete outputs, for time series and static data
- Hierarchical models
- Hybrid systems: mixed continuous & discrete states, switching state-space models
Richer state representations: factorial HMMs and dynamic Bayesian networks

These are hidden Markov models with many state variables (i.e. a distributed representation of the state). The state can then capture many more bits of information about the sequence (linear in the number of state variables).
Chain models: ML learning with EM

  SSM: y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R)
  HMM: s_1 ~ π;  s_t | s_{t-1} ~ Φ_{s_{t-1}};  x_t | s_t ~ A_{s_t}

The structure of learning and inference for both models is dictated by the factored structure:

  P(x_1, ..., x_T, y_1, ..., y_T) = P(y_1) prod_{t=2}^T P(y_t | y_{t-1}) prod_{t=1}^T P(x_t | y_t)

Learning (M-step):
  argmax ⟨log P(x_1, ..., x_T, y_1, ..., y_T)⟩_{q(y_1,...,y_T)}
  = argmax [ ⟨log P(y_1)⟩_{q(y_1)} + sum_{t=2}^T ⟨log P(y_t | y_{t-1})⟩_{q(y_t, y_{t-1})} + sum_{t=1}^T ⟨log P(x_t | y_t)⟩_{q(y_t)} ]

So the expectations needed in the E-step are derived from singleton and pairwise marginals.
Chain models: inference

Three general inference problems:
  Filtering:  P(y_t | x_1, ..., x_t)
  Smoothing:  P(y_t | x_1, ..., x_T)  (also P(y_t, y_{t-1} | x_1, ..., x_T) for learning)
  Prediction: P(y_t | x_1, ..., x_{t-Δ})

Naively, these marginal posteriors seem to require very large integrals (or sums):
  P(y_t | x_1, ..., x_t) = ∫ dy_1 ... dy_{t-1} P(y_1, ..., y_t | x_1, ..., x_t)

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating, or message passing.
Crawling the HMM state lattice

Consider an HMM, where we want to find
  P(s_t = k | x_1, ..., x_t) = sum_{k_1,...,k_{t-1}} P(s_1 = k_1, ..., s_t = k | x_1, ..., x_t)
                             ∝ sum_{k_1,...,k_{t-1}} π_{k_1} A_{k_1}(x_1) Φ_{k_1 k_2} A_{k_2}(x_2) ... Φ_{k_{t-1} k} A_k(x_t)

Naive algorithm:
- start a bug at each of the K states at t = 1, holding value 1
- move each bug forward in time: make copies of each bug to each subsequent state, and multiply the value of each copy by (transition prob × output emission prob)
- repeat until all bugs have reached time t
- sum up the values of all K^{t-1} bugs that reach state s_t = k (one bug per state path)

Clever recursion: at every step, replace the bugs at each node with a single bug carrying the sum of their values.
Probability updating: Bayesian filtering

  P(y_t | x_{1:t}) = ∫ P(y_t, y_{t-1} | x_t, x_{1:t-1}) dy_{t-1}
                   = ∫ P(x_t, y_t, y_{t-1} | x_{1:t-1}) / P(x_t | x_{1:t-1}) dy_{t-1}
                   ∝ ∫ P(x_t | y_t, y_{t-1}, x_{1:t-1}) P(y_t | y_{t-1}, x_{1:t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}
                   = ∫ P(x_t | y_t) P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1}) dy_{t-1}   [Markov property]

This is a forward recursion based on Bayes rule.
The HMM: forward pass

The forward recursion for the HMM is a form of dynamic programming. Define:
  α_t(i) = P(x_1, ..., x_t, s_t = i | θ)

Then, much like the Bayesian filtering updates, we have:
  α_1(i) = π_i A_i(x_1)
  α_{t+1}(i) = ( sum_{j=1}^K α_t(j) Φ_ji ) A_i(x_{t+1})

We've defined α_t(i) to be a joint rather than a posterior. It's easy to obtain the posterior by normalisation:
  P(s_t = i | x_1, ..., x_t, θ) = α_t(i) / sum_k α_t(k)

This form enables us to compute the likelihood for θ = {A, Φ, π} efficiently, in O(TK^2) time:
  P(x_1, ..., x_T | θ) = sum_{s_1,...,s_T} P(x_1, ..., x_T, s_1, ..., s_T | θ) = sum_{k=1}^K α_T(k)

avoiding the exponential number of paths in the naive sum (number of paths = K^T).
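The α recursion above is a few lines of NumPy. As a sanity check, the sketch below compares the O(TK^2) recursion against the naive sum over all K^T paths on a tiny example (parameter values are illustrative):

```python
from itertools import product
import numpy as np

def forward(pi, Phi, A, x):
    """alpha_t(i) = P(x_1..x_t, s_t = i); returns all alphas and P(x_1..x_T)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * A[:, x[0]]                      # alpha_1(i) = pi_i A_i(x_1)
    for t in range(1, T):
        # alpha_{t+1}(i) = (sum_j alpha_t(j) Phi_ji) A_i(x_{t+1})
        alpha[t] = (alpha[t-1] @ Phi) * A[:, x[t]]
    return alpha, alpha[-1].sum()                   # likelihood = sum_k alpha_T(k)

pi  = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A   = np.array([[0.9, 0.1], [0.2, 0.8]])            # A[j, k] = P(x_t = k | s_t = j)
x   = [0, 1, 0, 0]

# Naive sum over all K^T state paths (the "bugs" of the previous slide).
naive = sum(
    pi[s[0]] * A[s[0], x[0]]
    * np.prod([Phi[s[t-1], s[t]] * A[s[t], x[t]] for t in range(1, len(x))])
    for s in product(range(2), repeat=len(x))
)
alpha, lik = forward(pi, Phi, A, x)
```

In practice the αs are usually rescaled (or kept in the log domain) at each step to avoid underflow on long sequences; the normalisers then multiply together to give the likelihood.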
The LGSSM: Kalman filtering

  y_1 ~ N(μ_0, Q_0);  y_t | y_{t-1} ~ N(A y_{t-1}, Q);  x_t | y_t ~ N(C y_t, R)

For the SSM, the sums become integrals. In general, define
  ŷ_t^τ ≡ E[y_t | x_1, ..., x_τ]  and  V̂_t^τ ≡ V[y_t | x_1, ..., x_τ].

Let ŷ_1^0 = μ_0 and V̂_1^0 = Q_0; then (cf. FA)
  P(y_1 | x_1) = N( ŷ_1^0 + K_1(x_1 − C ŷ_1^0),  V̂_1^0 − K_1 C V̂_1^0 ) = N(ŷ_1^1, V̂_1^1)
with
  K_1 = V̂_1^0 C^T (C V̂_1^0 C^T + R)^{-1}

Then:
  P(y_t | x_{1:t-1}) = ∫ dy_{t-1} P(y_t | y_{t-1}) P(y_{t-1} | x_{1:t-1})
                     = N( A ŷ_{t-1}^{t-1},  A V̂_{t-1}^{t-1} A^T + Q ) = N(ŷ_t^{t-1}, V̂_t^{t-1})
  P(y_t | x_{1:t}) = N( ŷ_t^{t-1} + K_t(x_t − C ŷ_t^{t-1}),  V̂_t^{t-1} − K_t C V̂_t^{t-1} ) = N(ŷ_t^t, V̂_t^t)
with the Kalman gain
  K_t = V̂_t^{t-1} C^T (C V̂_t^{t-1} C^T + R)^{-1}
(note that V̂_t^{t-1} C^T and C V̂_t^{t-1} C^T + R are the predictive ⟨y x^T⟩ and ⟨x x^T⟩ covariances).

Compare FA: β = (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} = Λ^T (ΛΛ^T + Ψ)^{-1} (matrix inversion lemma); μ = β x_n; Σ = I − βΛ.
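The predict/update recursions translate directly into code. A minimal NumPy sketch of the filter (the toy scalar model at the bottom is an illustrative assumption):

```python
import numpy as np

def kalman_filter(x, A, C, Q, R, mu0, Q0):
    """Return the filtered moments yhat_t^t and Vhat_t^t for each t."""
    T, K = len(x), len(mu0)
    y_f = np.zeros((T, K)); V_f = np.zeros((T, K, K))
    y_pred, V_pred = mu0, Q0                       # prediction for t = 1 is the prior
    for t in range(T):
        if t > 0:
            y_pred = A @ y_f[t-1]                  # yhat_t^{t-1} = A yhat_{t-1}^{t-1}
            V_pred = A @ V_f[t-1] @ A.T + Q        # Vhat_t^{t-1} = A Vhat A^T + Q
        S = C @ V_pred @ C.T + R                   # predictive <x x^T> covariance
        Kt = V_pred @ C.T @ np.linalg.inv(S)       # Kalman gain
        y_f[t] = y_pred + Kt @ (x[t] - C @ y_pred) # measurement update
        V_f[t] = V_pred - Kt @ C @ V_pred
    return y_f, V_f

# Toy scalar example.
rng = np.random.default_rng(0)
A  = np.array([[0.9]]); C = np.array([[1.0]])
Q  = np.array([[0.2]]); R = np.array([[1.0]])
x  = rng.standard_normal((50, 1))
y_f, V_f = kalman_filter(x, A, C, Q, R, np.zeros(1), np.eye(1))
```

Note that the covariance recursion never looks at the data, so the V̂s (and hence the gains K_t) can be precomputed; only the means depend on x.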
The marginal posterior: Bayesian smoothing

  P(y_t | x_{1:T}) = P(y_t, x_{t+1:T} | x_{1:t}) / P(x_{t+1:T} | x_{1:t})
                   = P(x_{t+1:T} | y_t) P(y_t | x_{1:t}) / P(x_{t+1:T} | x_{1:t})

The marginal combines a backward message with the forward message found by filtering.
The HMM: forward-backward algorithm

State estimation: compute the marginal posterior distribution over the state at time t:
  γ_t(i) ≡ P(s_t = i | x_{1:T}) = P(s_t = i, x_{1:t}) P(x_{t+1:T} | s_t = i) / P(x_{1:T})
          = α_t(i) β_t(i) / sum_j α_t(j) β_t(j)

where there is a simple backward recursion for
  β_t(i) ≡ P(x_{t+1:T} | s_t = i) = sum_{j=1}^K P(s_{t+1} = j, x_{t+1}, x_{t+2:T} | s_t = i)
          = sum_{j=1}^K P(s_{t+1} = j | s_t = i) P(x_{t+1} | s_{t+1} = j) P(x_{t+2:T} | s_{t+1} = j)
          = sum_{j=1}^K Φ_ij A_j(x_{t+1}) β_{t+1}(j)

α_t(i) gives the total inflow of probability to node (t, i); β_t(i) gives the total outflow of probability.

Bugs again: the bugs run forward from time 0 to t and backward from time T to t.
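Combining the two recursions gives the smoothed marginals γ_t(i). A compact NumPy sketch (illustrative parameters; on long sequences one would rescale α and β at each step):

```python
import numpy as np

def forward_backward(pi, Phi, A, x):
    """gamma_t(i) = P(s_t = i | x_{1:T}), via the alpha/beta recursions."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.ones((T, K))   # beta_T(i) = 1
    alpha[0] = pi * A[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ Phi) * A[:, x[t]]
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j Phi_ij A_j(x_{t+1}) beta_{t+1}(j)
        beta[t] = Phi @ (A[:, x[t+1]] * beta[t+1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

pi  = np.array([0.6, 0.4])
Phi = np.array([[0.7, 0.3], [0.4, 0.6]])
A   = np.array([[0.9, 0.1], [0.2, 0.8]])
x   = [0, 1, 1, 0, 0]
gamma = forward_backward(pi, Phi, A, x)
```

At t = T, β is all ones, so γ_T reduces to the normalised filtering posterior from the forward pass alone; at earlier times γ_t also accounts for future observations.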
Viterbi decoding

The numbers γ_t(i) computed by forward-backward give the marginal posterior distribution over states at each time. By choosing the state i*_t with the largest γ_t(i) at each time, we can make a "best" state path: the path with the maximum expected number of correct states.

But this is not the single path with the highest probability of generating the data; in fact it may be a path of probability zero!

To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman's dynamic programming algorithm applied to this problem. It is an inference algorithm that computes the most probable state sequence:
  argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}, θ)

The recursions look the same as forward-backward, except with max instead of Σ.

Bugs once more: the same trick, except at each step kill all bugs but the one with the highest value at the node.

There is also a modified EM training based on the Viterbi decoder (assignment).
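A max-sum (log-domain) sketch of the Viterbi recursion, checked against brute-force enumeration of all paths on a toy model (parameter values are illustrative):

```python
from itertools import product
import numpy as np

def viterbi(pi, Phi, A, x):
    """argmax_{s_{1:T}} P(s_{1:T} | x_{1:T}): forward pass with max instead of sum."""
    T, K = len(x), len(pi)
    delta = np.zeros((T, K)); back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(pi) + np.log(A[:, x[0]])
    for t in range(1, T):
        scores = delta[t-1][:, None] + np.log(Phi)   # scores[j, i]: best path ending j -> i
        back[t] = scores.argmax(axis=0)              # remember the surviving "bug"
        delta[t] = scores.max(axis=0) + np.log(A[:, x[t]])
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                   # backtrack
        path[t] = back[t+1, path[t+1]]
    return path

pi  = np.array([0.5, 0.5])
Phi = np.array([[0.8, 0.2], [0.3, 0.7]])
A   = np.array([[0.9, 0.1], [0.2, 0.8]])
x   = [0, 0, 1, 1]

def path_prob(s):
    p = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, len(x)):
        p *= Phi[s[t-1], s[t]] * A[s[t], x[t]]
    return p

best = max(product(range(2), repeat=len(x)), key=path_prob)
```

Working in log probabilities avoids underflow and turns the products into sums; the backpointers replace the normalising backward pass of forward-backward.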
The LGSSM: Kalman smoothing

We use a slightly different decomposition:
  P(y_t | x_{1:T}) = ∫ P(y_t, y_{t+1} | x_{1:T}) dy_{t+1}
                   = ∫ P(y_t | y_{t+1}, x_{1:T}) P(y_{t+1} | x_{1:T}) dy_{t+1}
                   = ∫ P(y_t | y_{t+1}, x_{1:t}) P(y_{t+1} | x_{1:T}) dy_{t+1}   [Markov property]

This gives the additional backward recursion:
  J_t = V̂_t^t A^T (V̂_{t+1}^t)^{-1}
  ŷ_t^T = ŷ_t^t + J_t (ŷ_{t+1}^T − A ŷ_t^t)
  V̂_t^T = V̂_t^t + J_t (V̂_{t+1}^T − V̂_{t+1}^t) J_t^T
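A self-contained sketch of the smoother: it repeats the filtering recursions from the previous slide (keeping the predicted covariances V̂_{t+1}^t, which the backward pass needs), then runs the backward recursion above. The toy scalar model at the bottom is an illustrative assumption:

```python
import numpy as np

def kalman_smoother(x, A, C, Q, R, mu0, Q0):
    """Filter forward, then run the backward (RTS) recursion for yhat_t^T, Vhat_t^T."""
    T, K = len(x), len(mu0)
    y_f = np.zeros((T, K)); V_f = np.zeros((T, K, K)); V_p = np.zeros((T, K, K))
    y_pred, V_pred = mu0, Q0
    for t in range(T):                              # standard Kalman filter
        if t > 0:
            y_pred = A @ y_f[t-1]
            V_pred = A @ V_f[t-1] @ A.T + Q
        V_p[t] = V_pred                             # Vhat_t^{t-1}, reused below
        G = V_pred @ C.T @ np.linalg.inv(C @ V_pred @ C.T + R)
        y_f[t] = y_pred + G @ (x[t] - C @ y_pred)
        V_f[t] = V_pred - G @ C @ V_pred
    y_s, V_s = y_f.copy(), V_f.copy()               # at t = T, smoothed = filtered
    for t in range(T - 2, -1, -1):                  # backward recursion
        J = V_f[t] @ A.T @ np.linalg.inv(V_p[t+1])  # J_t = Vhat_t^t A^T (Vhat_{t+1}^t)^{-1}
        y_s[t] = y_f[t] + J @ (y_s[t+1] - A @ y_f[t])
        V_s[t] = V_f[t] + J @ (V_s[t+1] - V_p[t+1]) @ J.T
    return y_s, V_s

rng = np.random.default_rng(2)
A  = np.array([[0.95]]); C = np.array([[1.0]])
Q  = np.array([[0.1]]);  R = np.array([[0.5]])
x  = rng.standard_normal((30, 1))
y_s, V_s = kalman_smoother(x, A, C, Q, R, np.zeros(1), np.eye(1))
```

Because V̂_{t+1}^T ⪯ V̂_{t+1}^t, the correction term shrinks the covariance: smoothing is never less certain than filtering.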
ML learning for SSMs using batch EM

Parameters: θ = {μ_0, Q_0, A, Q, C, R}

Free energy:
  F(q, θ) = ∫ dy_{1:T} q(y_{1:T}) ( log P(x_{1:T}, y_{1:T} | θ) − log q(y_{1:T}) )

E-step: maximise F w.r.t. q with θ fixed: q*(y) = p(y | x, θ). This can be achieved with a two-state extension of the Kalman smoother.

M-step: maximise F w.r.t. θ with q fixed. This boils down to solving a few weighted least-squares problems, since all the variables in
  p(y, x | θ) = p(y_1) p(x_1 | y_1) prod_{t=2}^T p(y_t | y_{t-1}) p(x_t | y_t)
form a multivariate Gaussian.
The M-step for C

  p(x_t | y_t) ∝ exp[ −(1/2) (x_t − C y_t)^T R^{-1} (x_t − C y_t) ]

  C_new = argmax_C sum_t ⟨ln p(x_t | y_t)⟩_q
        = argmax_C sum_t ⟨ −(1/2) (x_t − C y_t)^T R^{-1} (x_t − C y_t) ⟩_q + const
        = argmax_C −(1/2) sum_t { x_t^T R^{-1} x_t − 2 x_t^T R^{-1} C ⟨y_t⟩ + ⟨y_t^T C^T R^{-1} C y_t⟩ }
        = argmax_C sum_t { Tr[ C ⟨y_t⟩ x_t^T R^{-1} ] − (1/2) Tr[ C^T R^{-1} C ⟨y_t y_t^T⟩ ] }

Using ∂Tr[AB]/∂A = B^T, we have
  ∂/∂C = R^{-1} sum_t x_t ⟨y_t⟩^T − R^{-1} C sum_t ⟨y_t y_t^T⟩ = 0
  ⇒ C_new = ( sum_t x_t ⟨y_t⟩^T ) ( sum_t ⟨y_t y_t^T⟩ )^{-1}

Notice that this is exactly the same equation as in factor analysis and linear regression!
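Given the posterior moments ⟨y_t⟩ and ⟨y_t y_t^T⟩ from the E-step, the update is one matrix solve. The sketch below verifies it on a degenerate case (point-mass posteriors and noise-free observations, an illustrative construction) where the update must recover the generating C exactly:

```python
import numpy as np

def m_step_C(xs, Ey, Eyy):
    """C_new = (sum_t x_t <y_t>^T)(sum_t <y_t y_t^T>)^{-1}."""
    num = sum(x[:, None] @ m[None, :] for x, m in zip(xs, Ey))
    den = sum(Eyy)                       # sum_t <y_t y_t^T> from the E-step
    return num @ np.linalg.inv(den)

rng = np.random.default_rng(3)
C0 = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
ys = rng.standard_normal((20, 2))
xs = ys @ C0.T                           # noise-free observations x_t = C0 y_t
Ey  = list(ys)                           # pretend the E-step returned point masses
Eyy = [np.outer(y, y) for y in ys]       # so <y y^T> = <y><y>^T
C_hat = m_step_C(xs, Ey, Eyy)
```

With a genuine posterior, ⟨y_t y_t^T⟩ = V̂_t^T + ŷ_t^T (ŷ_t^T)^T includes the smoothed covariance, which is what distinguishes EM from simply regressing x on point estimates of y.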
The M-step for A

  p(y_{t+1} | y_t) ∝ exp{ −(1/2) (y_{t+1} − A y_t)^T Q^{-1} (y_{t+1} − A y_t) }

  A_new = argmax_A sum_t ⟨ln p(y_{t+1} | y_t)⟩_q
        = argmax_A sum_t ⟨ −(1/2) (y_{t+1} − A y_t)^T Q^{-1} (y_{t+1} − A y_t) ⟩_q + const
        = argmax_A −(1/2) sum_t { ⟨y_{t+1}^T Q^{-1} y_{t+1}⟩ − 2 ⟨y_{t+1}^T Q^{-1} A y_t⟩ + ⟨y_t^T A^T Q^{-1} A y_t⟩ }
        = argmax_A sum_t { Tr[ A ⟨y_t y_{t+1}^T⟩ Q^{-1} ] − (1/2) Tr[ A^T Q^{-1} A ⟨y_t y_t^T⟩ ] }

Using ∂Tr[AB]/∂A = B^T, we have
  ∂/∂A = Q^{-1} sum_t ⟨y_{t+1} y_t^T⟩ − Q^{-1} A sum_t ⟨y_t y_t^T⟩ = 0
  ⇒ A_new = ( sum_t ⟨y_{t+1} y_t^T⟩ ) ( sum_t ⟨y_t y_t^T⟩ )^{-1}

This is still analogous to factor analysis and linear regression, but with expected correlations.
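The A update has the same shape as the C update, but uses the pairwise moment ⟨y_{t+1} y_t^T⟩ — which is why the E-step must return pairwise as well as singleton marginals. A sketch, again checked on an illustrative noise-free, point-mass case:

```python
import numpy as np

def m_step_A(Ey_pair, Eyy):
    """A_new = (sum_t <y_{t+1} y_t^T>)(sum_t <y_t y_t^T>)^{-1}, t = 1..T-1."""
    return sum(Ey_pair) @ np.linalg.inv(sum(Eyy))

rng = np.random.default_rng(4)
A0 = np.array([[0.9, -0.2], [0.1, 0.8]])
us = rng.standard_normal((20, 2))
# Pretend the E-step returned point masses for noise-free pairs (y_t, y_{t+1} = A0 y_t):
Ey_pair = [np.outer(A0 @ u, u) for u in us]   # <y_{t+1} y_t^T>
Eyy     = [np.outer(u, u) for u in us]        # <y_t y_t^T>
A_hat = m_step_A(Ey_pair, Eyy)
```

With genuine posteriors, the pairwise moment picks up the cross-covariance between ŷ_t^T and ŷ_{t+1}^T supplied by the smoother's backward recursion.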
85 Learning (online gradient)

Time series data must often be processed in real-time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates.

Consider the log likelihood contributed by each data point (\ell_t):

\ell = \sum_{t=1}^T \ln p(x_t | x_1, \ldots, x_{t-1}) = \sum_{t=1}^T \ell_t

\ell_t = -\tfrac{D}{2} \ln 2\pi - \tfrac{1}{2} \ln |\Sigma| - \tfrac{1}{2} (x_t - C \hat{y}_t^{t-1})^T \Sigma^{-1} (x_t - C \hat{y}_t^{t-1})

where D is the dimension of x, and:

\hat{y}_t^{t-1} = A \hat{y}_{t-1}^{t-1}
\hat{V}_t^{t-1} = A \hat{V}_{t-1}^{t-1} A^T + Q
\Sigma = C \hat{V}_t^{t-1} C^T + R

We differentiate \ell_t to obtain gradient rules for A, C, Q, R. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
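One step of this predictive decomposition can be sketched directly from the filter equations above (hypothetical parameter values; only the computation of ℓ_t is shown, not the gradients):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 3, 2

# Hypothetical LGSSM parameters and the filtered estimate from step t-1.
A = 0.9 * np.eye(K); Q = 0.1 * np.eye(K)
C = rng.standard_normal((D, K)); R = 0.5 * np.eye(D)
y_filt = rng.standard_normal(K)          # yhat_{t-1}^{t-1}
V_filt = 0.2 * np.eye(K)                 # Vhat_{t-1}^{t-1}
x_t = rng.standard_normal(D)             # new observation

# One-step prediction and innovation covariance.
y_pred = A @ y_filt                      # yhat_t^{t-1}
V_pred = A @ V_filt @ A.T + Q            # Vhat_t^{t-1}
Sigma = C @ V_pred @ C.T + R

# l_t = -D/2 log 2pi - 1/2 log|Sigma| - 1/2 innov^T Sigma^{-1} innov
innov = x_t - C @ y_pred
l_t = -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(Sigma)[1]
              + innov @ np.linalg.solve(Sigma, innov))
```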
86 Learning HMMs using EM

[Graphical model: hidden chain s_1 → s_2 → ... → s_T with emissions x_1, ..., x_T]

Parameters: θ = {π, Φ, A}

Free energy:

F(q, θ) = \sum_{s_{1:T}} q(s_{1:T}) \left( \log P(x_{1:T}, s_{1:T} | θ) - \log q(s_{1:T}) \right)

E-step: Maximise F w.r.t. q with θ fixed:

q^*(s_{1:T}) = P(s_{1:T} | x_{1:T}, θ)

We will only need the marginal probabilities q(s_t, s_{t+1}), which can also be obtained from the forward-backward algorithm.

M-step: Maximise F w.r.t. θ with q fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state i, emitted symbol k and transitioned to state j.

This is the Baum-Welch algorithm and it predates the (more general) EM algorithm.
90 M step: parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of F w.r.t. θ.

The initial state distribution is the expected number of times in state i at t = 1:

\hat{π}_i = γ_1(i)

The expected number of transitions from state i to j which begin at time t is:

ξ_t(i → j) ≡ P(s_t = i, s_{t+1} = j | x_{1:T}) = α_t(i)\, Φ_{ij}\, A_j(x_{t+1})\, β_{t+1}(j) / P(x_{1:T})

so the estimated transition probabilities are:

\hat{Φ}_{ij} = \sum_{t=1}^{T-1} ξ_t(i → j) \Big/ \sum_{t=1}^{T-1} γ_t(i)

The output distributions are the expected number of times we observe a particular symbol in a particular state:

\hat{A}_{ik} = \sum_{t: x_t = k} γ_t(i) \Big/ \sum_{t=1}^{T} γ_t(i)

(or the state-probability-weighted mean and variance for a Gaussian output model).
93 HMM practicalities

Numerical scaling: the conventional message definition is in terms of a large joint:

α_t(i) = P(x_{1:t}, s_t = i) → 0 as t grows, and so can easily underflow.

Rescale:

\tilde{α}_t(i) = A_i(x_t) \sum_j \hat{α}_{t-1}(j) Φ_{ji}
ρ_t = \sum_{i=1}^K \tilde{α}_t(i)
\hat{α}_t(i) = \tilde{α}_t(i) / ρ_t

Exercise: show that

ρ_t = P(x_t | x_{1:t-1}, θ)   and   \prod_{t=1}^T ρ_t = P(x_{1:T} | θ)

What does this make \hat{α}_t(i)?

Multiple observed sequences: average numerators and denominators in the ratios of updates.

Local optima (random restarts, annealing; see discussion later).
94 HMM pseudocode: inference (E step)

Forward-backward including scaling tricks. [⊙ is the element-by-element (Hadamard/Schur) product: `.*` in matlab.]

for t = 1:T, i = 1:K
    p_t(i) = A_i(x_t)
\tilde{α}_1 = π ⊙ p_1;   ρ_1 = \sum_{i=1}^K \tilde{α}_1(i);   \hat{α}_1 = \tilde{α}_1 / ρ_1
for t = 2:T
    \tilde{α}_t = (Φ^T \hat{α}_{t-1}) ⊙ p_t;   ρ_t = \sum_{i=1}^K \tilde{α}_t(i);   \hat{α}_t = \tilde{α}_t / ρ_t
\hat{β}_T = 1
for t = T-1:1
    \hat{β}_t = Φ (\hat{β}_{t+1} ⊙ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = \sum_{t=1}^T \log ρ_t
for t = 1:T
    γ_t = \hat{α}_t ⊙ \hat{β}_t
for t = 1:T-1
    ξ_t = Φ ⊙ (\hat{α}_t (\hat{β}_{t+1} ⊙ p_{t+1})^T) / ρ_{t+1}
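The pseudocode above translates almost line for line into numpy. A runnable sketch on a randomly generated HMM (the parameters are hypothetical; `p[t, i]` plays the role of A_i(x_t)):

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 3, 50

# A random HMM and per-state observation likelihoods p_t(i) = A_i(x_t).
pi = np.full(K, 1.0 / K)
Phi = rng.dirichlet(np.ones(K), size=K)        # Phi[i, j] = P(s_{t+1}=j | s_t=i)
p = rng.uniform(0.1, 1.0, size=(T, K))

alpha = np.zeros((T, K)); beta = np.ones((T, K)); rho = np.zeros(T)

a = pi * p[0]; rho[0] = a.sum(); alpha[0] = a / rho[0]
for t in range(1, T):                          # scaled forward pass
    a = (Phi.T @ alpha[t - 1]) * p[t]
    rho[t] = a.sum(); alpha[t] = a / rho[t]
for t in range(T - 2, -1, -1):                 # scaled backward pass
    beta[t] = Phi @ (beta[t + 1] * p[t + 1]) / rho[t + 1]

log_lik = np.log(rho).sum()                    # log P(x_{1:T})
gamma = alpha * beta                           # P(s_t | x_{1:T})
xi = np.stack([Phi * np.outer(alpha[t], beta[t + 1] * p[t + 1]) / rho[t + 1]
               for t in range(T - 1)])         # P(s_t, s_{t+1} | x_{1:T})
```

Note that with this rescaling, `alpha[t]` is the filtering distribution P(s_t | x_{1:t}), which answers the exercise above.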
95 HMM pseudocode: parameter re-estimation (M step)

Baum-Welch parameter updates. For each sequence l = 1:L, run forward-backward to get γ^{(l)} and ξ^{(l)}, then:

\hat{π}_i = \frac{1}{L} \sum_{l=1}^L γ_1^{(l)}(i)

\hat{Φ}_{ij} = \frac{\sum_{l=1}^L \sum_{t=1}^{T^{(l)}-1} ξ_t^{(l)}(ij)}{\sum_{l=1}^L \sum_{t=1}^{T^{(l)}-1} γ_t^{(l)}(i)}

\hat{A}_{ik} = \frac{\sum_{l=1}^L \sum_{t=1}^{T^{(l)}} δ(x_t^{(l)} = k)\, γ_t^{(l)}(i)}{\sum_{l=1}^L \sum_{t=1}^{T^{(l)}} γ_t^{(l)}(i)}
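The ratio-of-counts updates are a few lines of numpy given the E-step marginals. A sketch for a single sequence (γ and ξ are synthetic stand-ins, constructed so that γ_t(i) = Σ_j ξ_t(i, j) as a true posterior would satisfy):

```python
import numpy as np

rng = np.random.default_rng(8)
K, S, T = 3, 4, 50

# Stand-ins for E-step outputs: pairwise posteriors xi (T-1 x K x K),
# with the singleton marginals gamma (T x K) derived consistently.
xi = rng.dirichlet(np.ones(K * K), size=T - 1).reshape(T - 1, K, K)
gamma = np.vstack([xi.sum(axis=2), xi[-1].sum(axis=0)[None]])
x = rng.integers(0, S, size=T)        # observed symbols

pi_hat = gamma[0]
Phi_hat = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
onehot = np.eye(S)[x]                 # T x S indicators delta(x_t = k)
A_hat = (gamma.T @ onehot) / gamma.sum(axis=0)[:, None]
```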
103 Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of y:

P(y) = N(0, I)        &   \tilde{y} = U y      ⇒  P(\tilde{y}) = N(U 0, U I U^T) = N(0, I)
P(x | y) = N(Λ y, Ψ)      \tilde{Λ} = Λ U^T    ⇒  P(x | \tilde{y}) = N(Λ U^T U y, Ψ) = N(\tilde{Λ} \tilde{y}, Ψ)

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

P(y_{t+1} | y_t) = N(A y_t, Q)   &   \tilde{y} = G y,   \tilde{A} = G A G^{-1},   \tilde{Q} = G Q G^T,   \tilde{C} = C G^{-1}
P(x_t | y_t) = N(C y_t, R)

⇒  P(\tilde{y}_{t+1} | \tilde{y}_t) = N(G A G^{-1} G y_t, G Q G^T) = N(\tilde{A} \tilde{y}_t, \tilde{Q})
   P(x_t | \tilde{y}_t) = N(C G^{-1} G y_t, R) = N(\tilde{C} \tilde{y}_t, R)

and the HMM is invariant to permutations (and to relaxations into something called an observable operator model).
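The LGSSM degeneracy can be checked numerically: since the likelihood depends on the parameters only through the output moments M_τ = C A^τ Π C^T, any similarity transform G leaves these moments, and hence the likelihood, unchanged. A quick sketch (hypothetical parameters; the identity holds for any symmetric Π):

```python
import numpy as np

rng = np.random.default_rng(4)
K, D = 2, 3

# Hypothetical LGSSM parameters (Pi is a stand-in latent covariance).
A = 0.5 * rng.standard_normal((K, K))
C = rng.standard_normal((D, K))
Pi = rng.standard_normal((K, K)); Pi = Pi @ Pi.T + np.eye(K)

G = rng.standard_normal((K, K)) + 2 * np.eye(K)   # an invertible transform
Ginv = np.linalg.inv(G)
A2, C2, Pi2 = G @ A @ Ginv, C @ Ginv, G @ Pi @ G.T

# Output moments M_tau = C A^tau Pi C^T are unchanged by the transform.
M = [C @ np.linalg.matrix_power(A, tau) @ Pi @ C.T for tau in range(4)]
M2 = [C2 @ np.linalg.matrix_power(A2, tau) @ Pi2 @ C2.T for tau in range(4)]
```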
109 The likelihood landscape

Besides the degenerate solutions, both LGSSM and HMM likelihood functions tend to have multiple local maxima. This complicates any ML-based approach to learning (including EM and gradient methods).

Strategies:

Restart EM/gradient ascent from different (often random) initial parameter values.
Initialise output parameters with a stationary model: PCA → FA → LGSSM; k-means → mixture → HMM.
Stochastic gradient methods and momentum. Overstepping EM.
Deterministic annealing.
Non-ML learning (spectral methods).
114 Slow Feature Analysis (SFA)

Is there a zero-noise limit for an LGSSM analogous to PCA? We can take:

A = diag[a_1 ... a_K] → I   (with a_1 ≥ a_2 ≥ ... ≥ a_K → 1)
Q = I - A A^T → 0           (setting the stationary latent covariance to I)
R → 0.

This limit yields a spectral algorithm called Slow Feature Analysis (or SFA).

SFA is conventionally defined as slowness pursuit (cf PCA and variance pursuit): given a zero-mean series {x_1, x_2, ..., x_T}, find a K × D matrix W (= C^T) such that y_t = W x_t changes as slowly as possible. Specifically, find

W = \argmin_W \sum_t \| y_t - y_{t-1} \|^2   subject to   \sum_t y_t y_t^T = I.

The variance constraint prevents the trivial solutions W = 0 and W = 1 w^T.

W can be found by solving the generalised eigenvalue problem W A = Ω W B, where A = \sum_t (x_t - x_{t-1})(x_t - x_{t-1})^T and B = \sum_t x_t x_t^T.

See maneesh/papers/turner-sahani-2007-sfa.pdf.
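The generalised eigenproblem can be solved by whitening with a Cholesky factor of B and then using an ordinary symmetric eigendecomposition. A sketch on toy data (our own construction: two slow sinusoids mixed into five noisy channels):

```python
import numpy as np

rng = np.random.default_rng(5)
T, D, K = 500, 5, 2

# Toy zero-mean series: slow sinusoids mixed into D dimensions plus noise.
t = np.arange(T)
slow = np.stack([np.sin(2 * np.pi * t / 200), np.cos(2 * np.pi * t / 300)], axis=1)
X = slow @ rng.standard_normal((2, D)) + 0.05 * rng.standard_normal((T, D))
X -= X.mean(axis=0)

# A = sum_t (x_t - x_{t-1})(x_t - x_{t-1})^T,   B = sum_t x_t x_t^T
dX = np.diff(X, axis=0)
Amat = dX.T @ dX
Bmat = X.T @ X

# Solve A w = lambda B w via whitening with chol(B); smallest lambda = slowest.
L = np.linalg.cholesky(Bmat)
Linv = np.linalg.inv(L)
evals, evecs = np.linalg.eigh(Linv @ Amat @ Linv.T)   # ascending eigenvalues
W = (Linv.T @ evecs[:, :K]).T                         # K slowest directions

Y = X @ W.T                                           # the slow features
```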
119 (Generalised) Method of Moments estimators

What if we don't want the A → I limit? The limit is necessary to make the spectral estimate and the (limiting) ML estimates coincide. However, if we give up on ML, then there are general spectral estimates available.

The ML parameter estimates are defined:

θ_{ML} = \argmax_θ \log P(X | θ)

and as you found, if P ∈ ExpFam with sufficient statistic T, then

\langle T(x) \rangle_{θ_{ML}} = \frac{1}{N} \sum_i T(x_i)   (moment matching).

We can generalise this condition to arbitrary T, even if P ∉ ExpFam. That is, we find the estimate \hat{θ} with

\langle T(x) \rangle_{\hat{θ}} = \frac{1}{N} \sum_i T(x_i)   or   \hat{θ} = \argmin_θ \left\| \langle T(x) \rangle_θ - \frac{1}{N} \sum_i T(x_i) \right\|_C

Judicious choice of T and metric C might make the solution unique (no local optima) and consistent (correct given infinite within-model data).
130 Ho-Kalman SSID for LGSSMs

[Graphical model: latent chain y_{t-2}, ..., y_{t+2} with emissions x_{t-2}, ..., x_{t+2}]

Define the stacked past and future vectors x_t^- = [x_{t-1:t-L}] and x_t^+ = [x_{t:t+L-1}], and the lagged moments

M_τ ≡ \langle x_{t+τ} x_t^T \rangle = \langle x_{t+τ} (C y_t + η)^T \rangle = \langle x_{t+τ} y_t^T \rangle C^T
    = \langle (C A^τ y_t + η) y_t^T \rangle C^T = C A^τ \langle y_t y_t^T \rangle C^T = C A^τ Π C^T.

Then the Hankel matrix of future-past covariances factorises:

H ≡ \langle x_t^+ (x_t^-)^T \rangle =
[ M_1   M_2   ...   M_L     ]
[ M_2   M_3   ...           ]
[ ...                       ]
[ M_L   ...       M_{2L-1}  ]
= [ C; C A; ...; C A^{L-1} ] [ A Π C^T   ...   A^L Π C^T ]
= Ξ Υ,   with H: LD × LD, Ξ: LD × K, Υ: K × LD.

Off-diagonal correlation is unaffected by noise. SVD(\frac{1}{T} \sum_t x_t^+ (x_t^-)^T) yields least-squares estimates of Ξ and Υ. Regression between blocks of Ξ yields \hat{A} and \hat{C}.
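The whole procedure is a short numpy pipeline: estimate the lagged covariances, assemble the Hankel matrix, truncate its SVD, and recover A by shift regression between block rows of Ξ. A sketch on simulated data (hypothetical parameters; the recovered A and C are only determined up to a similarity transform, so only A's eigenvalues are checkable):

```python
import numpy as np

rng = np.random.default_rng(9)
K, D, L, T = 2, 3, 3, 20000

# Simulate a stable LGSSM with hypothetical parameters.
A = np.array([[0.9, 0.2], [0.0, 0.7]])           # eigenvalues 0.9 and 0.7
C = rng.standard_normal((D, K))
ey = np.sqrt(0.1) * rng.standard_normal((T, K))  # process noise, Q = 0.1 I
ex = np.sqrt(0.5) * rng.standard_normal((T, D))  # output noise,  R = 0.5 I
X = np.zeros((T, D)); y = np.zeros(K)
for t in range(T):
    y = A @ y + ey[t]
    X[t] = C @ y + ex[t]

# Empirical lagged moments M_tau ~ C A^tau Pi C^T (tau >= 1: noise-free).
def M(tau):
    return (X[tau:].T @ X[:T - tau]) / (T - tau)

# Hankel matrix with block (a, b) = M_{a+b+1}; rank K in expectation.
H = np.block([[M(a + b + 1) for b in range(L)] for a in range(L)])

# Rank-K SVD gives H ~ Xi @ Upsilon with Xi ~ [C; CA; ...; CA^{L-1}] G.
U, s, Vt = np.linalg.svd(H)
Xi = U[:, :K] * np.sqrt(s[:K])
C_hat = Xi[:D]                                   # first block row, up to G
A_hat = np.linalg.lstsq(Xi[:-D], Xi[D:], rcond=None)[0]   # shift regression
```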
133 HMMs → OOMs

[Graphical model: hidden chain s_1, ..., s_T with emissions x_1, ..., x_T]

Now consider an HMM with discrete output symbols. The likelihood can be written:

P(x_{1:T} | π, Φ, A) = \sum_i π_i A_i(x_1) \sum_j Φ_{ji} A_j(x_2) \sum_k Φ_{kj} A_k(x_3) \ldots
                     = π^T A_{x_1} Φ^T A_{x_2} Φ^T A_{x_3} \ldots

where A_{x_1} = diag[A_1(x_1), A_2(x_1), \ldots].
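The operator-chain view of the likelihood can be verified against brute-force enumeration over all state paths. A sketch with random parameters, using the convention Φ_ij = P(s_{t+1} = j | s_t = i) from the ξ_t formula (so the chain multiplies Φ on the right):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
K, S, T = 3, 4, 6          # states, symbols, sequence length

pi = rng.dirichlet(np.ones(K))
Phi = rng.dirichlet(np.ones(K), size=K)   # Phi[i, j] = P(s_{t+1}=j | s_t=i)
A = rng.dirichlet(np.ones(S), size=K)     # A[i, k] = P(x=k | s=i)
x = rng.integers(0, S, size=T)

# Operator chain: P(x_{1:T}) = pi^T A_{x_1} Phi A_{x_2} Phi ... A_{x_T} 1,
# where multiplying by diag(A[:, x_t]) is elementwise scaling.
v = pi * A[:, x[0]]
for t in range(1, T):
    v = (v @ Phi) * A[:, x[t]]
lik_chain = v.sum()

# Brute force over all K^T state paths.
lik_brute = 0.0
for s in product(range(K), repeat=T):
    p = pi[s[0]] * A[s[0], x[0]]
    for t in range(1, T):
        p *= Phi[s[t - 1], s[t]] * A[s[t], x[t]]
    lik_brute += p
```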
More informationGraphical Models Seminar
Graphical Models Seminar Forward-Backward and Viterbi Algorithm for HMMs Bishop, PRML, Chapters 13.2.2, 13.2.3, 13.2.5 Dinu Kaufmann Departement Mathematik und Informatik Universität Basel April 8, 2013
More informationEstimating Covariance Using Factorial Hidden Markov Models
Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu
More informationHuman-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg
Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning
More informationIntroduction to Graphical Models
Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Hidden Markov Models Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Additional References: David
More information1 What is a hidden Markov model?
1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationp L yi z n m x N n xi
y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationHidden Markov Models. AIMA Chapter 15, Sections 1 5. AIMA Chapter 15, Sections 1 5 1
Hidden Markov Models AIMA Chapter 15, Sections 1 5 AIMA Chapter 15, Sections 1 5 1 Consider a target tracking problem Time and uncertainty X t = set of unobservable state variables at time t e.g., Position
More information1. Markov models. 1.1 Markov-chain
1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationHidden Markov Models
Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010
More informationStatistical Sequence Recognition and Training: An Introduction to HMMs
Statistical Sequence Recognition and Training: An Introduction to HMMs EECS 225D Nikki Mirghafori nikki@icsi.berkeley.edu March 7, 2005 Credit: many of the HMM slides have been borrowed and adapted, with
More informationDynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models
6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories
More informationHidden Markov Modelling
Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models
More informationMachine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017
1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow
More informationRobert Collins CSE586 CSE 586, Spring 2015 Computer Vision II
CSE 586, Spring 2015 Computer Vision II Hidden Markov Model and Kalman Filter Recall: Modeling Time Series State-Space Model: You have a Markov chain of latent (unobserved) states Each state generates
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationBayesian Machine Learning - Lecture 7
Bayesian Machine Learning - Lecture 7 Guido Sanguinetti Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh gsanguin@inf.ed.ac.uk March 4, 2015 Today s lecture 1
More informationVariational Learning : From exponential families to multilinear systems
Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.
More informationSupervised Learning Hidden Markov Models. Some of these slides were inspired by the tutorials of Andrew Moore
Supervised Learning Hidden Markov Models Some of these slides were inspired by the tutorials of Andrew Moore A Markov System S 2 Has N states, called s 1, s 2.. s N There are discrete timesteps, t=0, t=1,.
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationMathematical Formulation of Our Example
Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationBioinformatics 2 - Lecture 4
Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what
More informationTowards a Bayesian model for Cyber Security
Towards a Bayesian model for Cyber Security Mark Briers (mbriers@turing.ac.uk) Joint work with Henry Clausen and Prof. Niall Adams (Imperial College London) 27 September 2017 The Alan Turing Institute
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationHidden Markov Models,99,100! Markov, here I come!
Hidden Markov Models,99,100! Markov, here I come! 16.410/413 Principles of Autonomy and Decision-Making Pedro Santana (psantana@mit.edu) October 7 th, 2015. Based on material by Brian Williams and Emilio
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationp(d θ ) l(θ ) 1.2 x x x
p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More information