Learning the Linear Dynamical System with ASOS (Approximated Second-Order Statistics)

James Martens
Computer Science, University of Toronto

June 24, 2010
The Linear Dynamical System model
- model of vector-valued time-series {y_t ∈ R^{N_y}}_{t=1}^T
- widely applied due to predictable behavior, easy inference, etc.
- vector-valued hidden states ({x_t ∈ R^{N_x}}_{t=1}^T) evolve via linear dynamics:
    x_{t+1} = A x_t + ε_t,    A ∈ R^{N_x × N_x},    ε_t ~ N(0, Q)
- linearly generated observations:
    y_t = C x_t + δ_t,    C ∈ R^{N_y × N_x},    δ_t ~ N(0, R)
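To make the generative model concrete, here is a minimal simulation sketch (not from the talk); the parameter values A, C, Q, R and the dimensions are arbitrary placeholders.

```python
# Minimal sketch: sampling a trajectory from the LDS generative model.
# All parameter values here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, T = 3, 2, 200                          # hypothetical dimensions / length

A = 0.9 * np.eye(Nx)                           # stable dynamics matrix
C = rng.standard_normal((Ny, Nx))              # observation matrix
Q = 0.1 * np.eye(Nx)                           # state-noise covariance
R = 0.1 * np.eye(Ny)                           # observation-noise covariance

x = rng.standard_normal(Nx)                    # initial hidden state
X, Y = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(Nx), Q)   # x_{t+1} = A x_t + eps_t
    y = C @ x + rng.multivariate_normal(np.zeros(Ny), R)   # y_t = C x_t + delta_t
    X.append(x); Y.append(y)
X, Y = np.array(X), np.array(Y)
```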
Learning the LDS

Expectation Maximization (EM)
- finds a local optimum of the log-likelihood
- pretty slow: convergence requires many iterations and the E-step is expensive

Subspace identification
- hidden states estimated directly from the data, and the parameters from these
- asymptotically unbiased / consistent
- non-iterative, but the solution does not optimize any particular objective
- good way to initialize EM or other iterative optimizers
Our contribution
- accelerate the EM algorithm by reducing its per-iteration cost to be constant w.r.t. T (the length of the time-series)
- key idea: approximate the inference done in the E-step
- the E-step approximation is unbiased and asymptotically consistent
- it also converges exponentially in L, where L is a meta-parameter that trades off approximation quality with speed (notation change: L is k_lim from the paper)
Learning via E.M.: the Algorithm

E.M. Objective Function
- at each iteration we maximize the following objective, where θ_n is the current parameter estimate:
    Q_n(θ) = E_{θ_n}[ log p(x, y | θ) | y ] = ∫ p(x | y, θ_n) log p(x, y | θ) dx

E-Step
- computes the expectation of log p(x, y | θ) under p(x | y, θ_n)
- uses the classical Kalman filtering/smoothing algorithm
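For reference, a textbook Kalman filter plus Rauch-Tung-Striebel smoother is sketched below; this is the standard E-step computation, not code from the paper. The initial-state parameters pi and V1 are assumed extras, and the cross-covariances V^T_{t+1,t} needed for the M-step are omitted to keep the sketch short.

```python
# Hedged sketch of the exact E-step: Kalman filter + RTS smoother.
# The O(N_x^3)-per-step matrix work (the inversions below) is what makes
# this expensive when T is large.
import numpy as np

def kalman_smoother(Y, A, C, Q, R, pi, V1):
    T, Ny = Y.shape
    Nx = A.shape[0]
    xf = np.zeros((T, Nx)); Pf = np.zeros((T, Nx, Nx))   # filtered means / covs
    xp = np.zeros((T, Nx)); Pp = np.zeros((T, Nx, Nx))   # predicted means / covs
    for t in range(T):
        xp[t] = pi if t == 0 else A @ xf[t - 1]
        Pp[t] = V1 if t == 0 else A @ Pf[t - 1] @ A.T + Q
        K = Pp[t] @ C.T @ np.linalg.inv(C @ Pp[t] @ C.T + R)   # Kalman gain K_t
        xf[t] = xp[t] + K @ (Y[t] - C @ xp[t])
        Pf[t] = (np.eye(Nx) - K @ C) @ Pp[t]
    xs = xf.copy(); Ps = Pf.copy()                        # smoothed means / covs
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])        # smoother gain J_t
        xs[t] = xf[t] + J @ (xs[t + 1] - A @ xf[t])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps
```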
Learning via E.M.: the Algorithm (cont.)

M-Step
- maximize the objective Q_n(θ) w.r.t. θ, producing a new estimate θ_{n+1}:
    θ_{n+1} = arg max_θ Q_n(θ)
- very easy: similar to linear regression

Problem
- EM can get very slow when we have lots of data
- mainly due to the call to the expensive Kalman filter/smoother in the E-step: O(N_x^3 T), where T = length of the training time-series and N_x = hidden-state dimension
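To illustrate why the M-step is "similar to linear regression", here is a hedged sketch of the standard closed-form updates for A and C (the updates for Q, R and the initial-state parameters have a similar flavour). The argument names are mine, and this is the textbook form of the update rather than anything specific to this paper.

```python
# Sketch of the M-step updates for A and C, given summed E-step expectations
# (the "M-statistics"). Standard LDS-EM form, assumed here for illustration.
import numpy as np

def m_step_AC(S_x1x, S_xx_head, S_xx_all, S_yx):
    """
    S_x1x     : sum_{t=1}^{T-1} E[x_{t+1} x_t' | y]
    S_xx_head : sum_{t=1}^{T-1} E[x_t x_t'     | y]
    S_xx_all  : sum_{t=1}^{T}   E[x_t x_t'     | y]
    S_yx      : sum_{t=1}^{T}   y_t E[x_t | y]'
    """
    A_new = S_x1x @ np.linalg.inv(S_xx_head)   # regress x_{t+1} on x_t
    C_new = S_yx @ np.linalg.inv(S_xx_all)     # regress y_t on x_t
    return A_new, C_new
```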
The Kalman filter/smoother
- estimates hidden-state means and covariances
    x_t^k ≡ E_{θ_n}[ x_t | y_{1:k} ]        V^k_{t,s} ≡ Cov_{θ_n}[ x_t, x_s | y_{1:k} ]
  for each t = 1, ..., T and s = t, t+1
- these are summed over time to obtain the statistics required for the M-step, e.g.
    Σ_{t=1}^{T-1} E_{θ_n}[ x_{t+1} x_t' | y ] = (x^T, x^T)_1 + Σ_{t=1}^{T-1} V^T_{t+1,t}
  where (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t'
- but we only care about these M-statistics, not the individual inferences for each time-step, so let's estimate them directly!
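The lagged sum-of-outer-products notation (a, b)_k is used throughout the rest of the talk; a small sketch of it (the function name is mine):

```python
# (a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t'  for sequences stored row-wise.
import numpy as np

def lagged_stat(a, b, k):
    """a: (T, d_a), b: (T, d_b); returns the d_a x d_b matrix (a, b)_k."""
    T = a.shape[0]
    return a[k:].T @ b[:T - k]
```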
Steady-state
- first we need a basic tool from linear systems/control theory: steady-state
- the covariance terms, and the filtering and smoothing matrices (denoted K_t and J_t), do not depend on the data y, only on the current parameters
- and they rapidly converge to steady-state values:
    V^T_{t,t}, V^T_{t,t-1}, J_t, K_t  →  Λ_0, Λ_1, J, K    as min(t, T − t) → ∞
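One standard way to obtain the steady-state quantities is to iterate the filtered-covariance (Riccati) recursion until it stops changing; the sketch below is textbook material and is not the paper's specific procedure.

```python
# Sketch: steady-state Kalman gain K and smoother gain J via a Riccati
# fixed-point iteration on the filtered covariance.
import numpy as np

def steady_state_gains(A, C, Q, R, n_iter=1000, tol=1e-10):
    Nx = A.shape[0]
    P = Q.copy()                               # filtered covariance P_{t|t}
    for _ in range(n_iter):
        P_pred = A @ P @ A.T + Q               # predicted covariance P_{t+1|t}
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)
        P_new = (np.eye(Nx) - K @ C) @ P_pred  # new filtered covariance
        if np.max(np.abs(P_new - P)) < tol:
            P = P_new
            break
        P = P_new
    P_pred = A @ P @ A.T + Q
    J = P @ A.T @ np.linalg.inv(P_pred)        # steady-state smoother gain
    return K, J
```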
- we can approximate the Kalman filter/smoother equations using the steady-state matrices
- this gives the highly simplified recurrences
    x̂_t = H x̂_{t-1} + K y_t        x_t^T = J x_{t+1}^T + P x̂_t
  where x̂_t ≡ x_t^t ≡ E_{θ_n}[ x_t | y_{1:t} ], H ≡ A − KCA and P ≡ I − JA
- these don't require any matrix-matrix multiplications or inversions
- we apply the approximate filter/smoother everywhere except the first and last i time-steps, which yields a run-time of O(N_x^2 T + N_x^3 i)
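A sketch of these steady-state recurrences, assuming the gains K and J are already available (e.g. from the Riccati sketch above). Per time-step it does only matrix-vector work; a real implementation would, as the slide says, treat the first and last i time-steps exactly.

```python
# Steady-state filter/smoother sketch: x̂_t = H x̂_{t-1} + K y_t and
# x_t^T = J x_{t+1}^T + P x̂_t, with H = A - KCA and P = I - JA.
import numpy as np

def ss_filter_smoother(Y, A, C, K, J):
    T = Y.shape[0]
    Nx = A.shape[0]
    H = A - K @ C @ A
    P = np.eye(Nx) - J @ A
    xf = np.zeros((T, Nx))
    xf[0] = K @ Y[0]                        # crude start; exact code handles early steps properly
    for t in range(1, T):
        xf[t] = H @ xf[t - 1] + K @ Y[t]    # steady-state filter
    xs = xf.copy()
    for t in range(T - 2, -1, -1):
        xs[t] = J @ xs[t + 1] + P @ xf[t]   # steady-state smoother
    return xf, xs
```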
ASOS: Approximated Second-Order Statistics
- steady-state makes the covariance terms easy to estimate in time independent of T
- we want something similar for the sums of products of means, i.e. terms like (x^T, x^T)_0 ≡ Σ_t x_t^T (x_t^T)'
- such sums we will call 2nd-order statistics; the ones needed for the M-step are the M-statistics
- idea #1: derive recursions and equations that relate the 2nd-order statistics of different time-lags
  (time-lag refers to the value of k in (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t')
- idea #2: evaluate these efficiently using approximations
Deriving the 2nd-order recursions/equations: an example
- suppose we wish to find the recursion for (x̂, y)_k
- the steady-state Kalman recursion for x̂_{t+k} is: x̂_{t+k} = H x̂_{t+k-1} + K y_{t+k}
- right-multiply both sides by y_t' and sum over t
- factor out the matrices H and K
- finally, re-write everything using our special notation for 2nd-order statistics, (a, b)_k ≡ Σ_{t=1}^{T-k} a_{t+k} b_t':

    (x̂, y)_k ≡ Σ_{t=1}^{T-k} x̂_{t+k} y_t'
             = Σ_{t=1}^{T-k} ( H x̂_{t+k-1} y_t' + K y_{t+k} y_t' )
             = H Σ_{t=1}^{T-k} x̂_{t+k-1} y_t' + K Σ_{t=1}^{T-k} y_{t+k} y_t'
             = H ( (x̂, y)_{k-1} − x̂_T y_{T-k+1}' ) + K (y, y)_k
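The identity above can be checked numerically on any sequence x̂ produced by the steady-state filter (e.g. the earlier sketch); a small sketch, with helper and function names of my own:

```python
# Numerical check of (x̂, y)_k = H((x̂, y)_{k-1} - x̂_T y_{T-k+1}') + K (y, y)_k,
# valid for k >= 1 when x̂ follows the steady-state filter recursion.
import numpy as np

def lag(a, b, k):
    # 0-based version of (a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t'
    return a[k:].T @ b[:a.shape[0] - k]

def check_example_recursion(Y, xf, H, K, k):
    lhs = lag(xf, Y, k)
    rhs = H @ (lag(xf, Y, k - 1) - np.outer(xf[-1], Y[-k])) + K @ lag(Y, Y, k)
    return np.allclose(lhs, rhs)
```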
The complete list (don't bother to memorize this)

The recursions:
    (y, x̂)_k = (y, x̂)_{k+1} H' + ( (y, y)_k − y_{1+k} y_1' ) K' + y_{1+k} x̂_1'
    (x̂, y)_k = H ( (x̂, y)_{k-1} − x̂_T y_{T-k+1}' ) + K (y, y)_k
    (x̂, x̂)_k = (x̂, x̂)_{k+1} H' + ( (x̂, y)_k − x̂_{1+k} y_1' ) K' + x̂_{1+k} x̂_1'
    (x̂, x̂)_k = H ( (x̂, x̂)_{k-1} − x̂_T x̂_{T-k+1}' ) + K (y, x̂)_k
    (x^T, y)_k = J (x^T, y)_{k+1} + P ( (x̂, y)_k − x̂_T y_{T-k}' ) + x_T^T y_{T-k}'
    (x^T, x̂)_k = J (x^T, x̂)_{k+1} + P ( (x̂, x̂)_k − x̂_T x̂_{T-k}' ) + x_T^T x̂_{T-k}'
    (x^T, x^T)_k = ( (x^T, x^T)_{k-1} − x_k^T (x_1^T)' ) J' + (x^T, x̂)_k P'
    (x^T, x^T)_k = J (x^T, x^T)_{k+1} + P ( (x̂, x^T)_k − x̂_T (x_{T-k}^T)' ) + x_T^T (x_{T-k}^T)'

The equations:
    (x̂, x̂)_k = H (x̂, x̂)_k H' + ( (x̂, y)_k − x̂_{1+k} y_1' ) K' − H x̂_T x̂_{T-k}' H' + K (y, x̂)_{k+1} H' + x̂_{1+k} x̂_1'
    (x^T, x^T)_k = J (x^T, x^T)_k J' + P ( (x̂, x^T)_k − x̂_T (x_{T-k}^T)' ) − J x_{k+1}^T (x_1^T)' J' + J (x^T, x̂)_{k+1} P' + x_T^T (x_{T-k}^T)'
- noting that statistics of time-lag T + 1 are 0 by definition, we could start the 2nd-order recursions at time-lag T
- but this doesn't get us anywhere: it would be even more expensive than running the usual Kalman recursions on the 1st-order terms
- instead, start the recursions at time-lag L with unbiased approximations (the "ASOS approximations"):
    (y, x̂)_{L+1} ≈ CA ( (x̂, x̂)_L − x̂_T x̂_{T-L}' )
    (x^T, x̂)_L ≈ (x̂, x̂)_L
    (x^T, y)_L ≈ (x̂, y)_L
- we also need x_t^T for t ∈ {1, 2, ..., L} ∪ {T − L, T − L + 1, ..., T}, but these can be approximated by a separate procedure (see paper)
Why might this be reasonable?
- 2nd-order statistics with a large time-lag quantify relationships between variables that are far apart in time
  - these are weaker and less important than relationships between variables that are close in time
- in the steady-state Kalman recursions, information is propagated via multiplication by H and J:
    x̂_t = H x̂_{t-1} + K y_t        x_t^T = J x_{t+1}^T + P x̂_t
- both of these have spectral radius (denoted σ(·)) less than 1, and so they decay the signal exponentially
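A quick way to see the exponential decay: compute the spectral radius of H or J and watch the norms of its powers shrink. A small sketch (function names are mine):

```python
# The influence of a time-step k lags away is damped roughly like sigma(H)^k.
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def power_norms(M, n_lags=20):
    """2-norms of M, M^2, ..., M^n_lags, illustrating the exponential decay."""
    norms, Mk = [], np.eye(M.shape[0])
    for _ in range(n_lags):
        Mk = Mk @ M
        norms.append(np.linalg.norm(Mk, 2))
    return norms
```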
Procedure for estimating the M-statistics
- how do we compute an estimate of the M-statistics consistent with the 2nd-order recursions/equations and the approximations?
- essentially it is just a large linear system of dimension O(N_x^2 L)
- but using a general solver would be far too expensive: O(N_x^6 L^3)
- fortunately, using the special structure of this system, we have developed a (non-trivial) algorithm which is much more efficient
  - the equations can be solved using an efficient iterative algorithm we developed for a generalization of the Sylvester equation
  - evaluating the recursions is then straightforward
- the cost is then just O(N_x^3 L), after (y, y)_k ≡ Σ_t y_{t+k} y_t' has been pre-computed for k = 0, ..., L
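For intuition only: the "equations" listed earlier have the fixed-point shape X = A X B + C (a Sylvester/Lyapunov-type equation), which a plain fixed-point iteration solves when σ(A)·σ(B) < 1. The paper's actual solver is a more refined algorithm for a generalization of this problem, so the sketch below is not the method itself.

```python
# Fixed-point solver for X = A X B + C; converges when sigma(A)*sigma(B) < 1.
import numpy as np

def solve_fixed_point(A, B, C, n_iter=1000, tol=1e-12):
    X = np.zeros_like(C)
    for _ in range(n_iter):
        X_new = A @ X @ B + C
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X
```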
First convergence result: our intuition confirmed
- First result: for a fixed θ, the l_2-error in the M-statistics is bounded by a quantity proportional to L^2 λ^{L-1}, where λ = σ(H) = σ(J) < 1 (σ(·) denotes the spectral radius)
- so as L grows, the estimation error for the M-statistics decays exponentially
- but λ might be close enough to 1 that L would have to be made impractically large
- fortunately we have a 2nd result which provides a very different type of guarantee
Second convergence result
- Second result: the updates produced by the ASOS procedure are asymptotically consistent with the usual EM updates in the limit as T → ∞
  - assumes the data is generated from the model
- the first result didn't make strong use of any property of the approximations (we could use 0 for each and the result would still hold)
- this second one is justified in the opposite way
  - it makes strong use of the approximations: it follows from the convergence of the (1/T)-scaled expected l_2 error of the approximations towards zero
  - the result holds for any value of L
Experiments
- we considered 3 real datasets of varying sizes and dimensionality
- each algorithm was initialized from the same random parameters
- the latent dimension N_x was determined by trial-and-error

Experimental parameters:

    Name             length (T)    N_y    N_x
    evaporator       6305          3      15
    motion capture   15300         10     40
    warship sounds   750000        1      20
Experimental results (cont.)
[Plot: negative log-likelihood vs. iterations. Legend with wall-clock times: ASOS-EM 5 (13.1667 s), ASOS-EM 10 (13.9436 s), ASOS-EM 20 (16.6538 s), ASOS-EM 35 (21.0141 s), ASOS-EM 75 (27.1616 s), SS-EM (56.804 s), EM (692.6135 s)]
Experimental results (cont.)
[Plot: negative log-likelihood vs. iterations. Legend with wall-clock times: ASOS-EM 20 (52.0094 s), ASOS-EM 30 (55.623 s), ASOS-EM 50 (59.9489 s), SS-EM (215.6504 s), EM (6018.4049 s)]
Experimental results (cont.)
[Plot: negative log-likelihood vs. iterations. Legend with wall-clock times: ASOS-EM 150 (49.6942 s), ASOS-EM 300 (84.4925 s), ASOS-EM 850 (214.039 s), ASOS-EM 1700 (409.0395 s), SS-EM (9179.0774 s)]
Conclusion
- we applied steady-state approximations to derive a set of 2nd-order recursions and equations
- approximated the statistics at time-lag L
- produced an efficient algorithm for solving the resulting system
- gave 2 formal convergence results
- demonstrated significant performance benefits for learning with long time-series

Per-iteration run-times:

    EM:        O(N_x^3 T)
    SS-EM:     O(N_x^2 T + N_x^3 i)
    ASOS-EM:   O(N_x^3 L)    (L is k_lim in the paper)
Thank you for your attention