Learning the Linear Dynamical System with ASOS (Approximated Second-Order Statistics)

Learning the Linear Dynamical System with ASOS (Approximated Second-Order Statistics)
James Martens, Department of Computer Science, University of Toronto
June 24, 2010

The Linear Dynamical System model

- a model of vector-valued time-series $\{y_t \in \mathbb{R}^{N_y}\}_{t=1}^T$
- widely applied due to its predictable behavior, easy inference, etc.
- vector-valued hidden states $\{x_t \in \mathbb{R}^{N_x}\}_{t=1}^T$ evolve via linear dynamics:
  $x_{t+1} = A x_t + \epsilon_t$, where $A \in \mathbb{R}^{N_x \times N_x}$ and $\epsilon_t \sim N(0, Q)$
- linearly generated observations:
  $y_t = C x_t + \delta_t$, where $C \in \mathbb{R}^{N_y \times N_x}$ and $\delta_t \sim N(0, R)$
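
To make the generative process above concrete, here is a minimal NumPy sketch (an addition, not from the slides) that samples a synthetic series from an LDS; the parameter values below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lds(A, C, Q, R, T, rng):
    """Sample hidden states x_t and observations y_t from the LDS
    x_{t+1} = A x_t + eps_t (eps_t ~ N(0, Q)),  y_t = C x_t + delta_t (delta_t ~ N(0, R))."""
    Nx, Ny = A.shape[0], C.shape[0]
    x = np.zeros((T, Nx))
    y = np.zeros((T, Ny))
    x[0] = rng.multivariate_normal(np.zeros(Nx), Q)   # simple choice of initial state
    for t in range(T):
        y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(Ny), R)
        if t + 1 < T:
            x[t + 1] = A @ x[t] + rng.multivariate_normal(np.zeros(Nx), Q)
    return x, y

# Made-up parameters (not from the paper); A is a scaled orthogonal matrix, so its
# spectral radius is exactly 0.9 and the system is stable.
Nx, Ny, T = 3, 2, 500
A = 0.9 * np.linalg.qr(rng.standard_normal((Nx, Nx)))[0]
C = rng.standard_normal((Ny, Nx))
Q, R = 0.1 * np.eye(Nx), 0.1 * np.eye(Ny)
x_true, y_obs = simulate_lds(A, C, Q, R, T, rng)
```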

Learning the LDS

Expectation Maximization (EM)
- finds a local optimum of the log-likelihood
- pretty slow: convergence requires many iterations, and the E-step is expensive

Subspace identification
- hidden states are estimated directly from the data, and the parameters from these estimates
- asymptotically unbiased / consistent
- a non-iterative algorithm, but the solution is not optimal with respect to any objective
- a good way to initialize EM or other iterative optimizers

Our contribution

- accelerate the EM algorithm by reducing its per-iteration cost to be constant w.r.t. T (the length of the time-series)
- key idea: approximate the inference done in the E-step
- the E-step approximation is unbiased and asymptotically consistent
- its error also decays exponentially with L, a meta-parameter that trades off approximation quality against speed (notation change: L is $k_{\lim}$ in the paper)

Learning via EM: the Algorithm

EM objective function
- at each iteration we maximize the following objective, where $\theta_n$ is the current parameter estimate:
  $Q_n(\theta) = E_{\theta_n}[\log p(x, y \mid \theta) \mid y] = \int p(x \mid y, \theta_n) \, \log p(x, y \mid \theta) \, dx$

E-step
- computes the expectation of $\log p(x, y \mid \theta)$ under $p(x \mid y, \theta_n)$
- uses the classical Kalman filtering/smoothing algorithm

Learning via EM: the Algorithm (cont.)

M-step
- maximize the objective $Q_n(\theta)$ w.r.t. $\theta$, producing the new estimate $\theta_{n+1} = \arg\max_\theta Q_n(\theta)$
- very easy: similar to linear regression (see the sketch below)

Problem
- EM can get very slow when we have lots of data
- this is mainly due to the expensive Kalman filter/smoother call in the E-step, which costs $O(N_x^3 T)$, where T is the length of the training time-series and $N_x$ is the hidden state dimension
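
Since the M-step is essentially linear regression on expected sufficient statistics, here is a hedged sketch of the update for the dynamics parameters A and Q in the standard LDS-EM form (the argument names are hypothetical placeholders; the paper's notation may differ):

```python
import numpy as np

def m_step_dynamics(S_xx_lag1, S_xx_head, S_xx_tail, T):
    """Sketch of the M-step update for the dynamics matrix A and process noise Q.

    S_xx_lag1 ~ sum_{t=1}^{T-1} E[x_{t+1} x_t^T | y]
    S_xx_head ~ sum_{t=1}^{T-1} E[x_t x_t^T     | y]
    S_xx_tail ~ sum_{t=2}^{T}   E[x_t x_t^T     | y]
    (expected sufficient statistics produced by the E-step)
    """
    A_new = S_xx_lag1 @ np.linalg.inv(S_xx_head)          # regression of x_{t+1} on x_t
    Q_new = (S_xx_tail - A_new @ S_xx_lag1.T) / (T - 1)   # residual covariance
    return A_new, Q_new
```

The updates for C and R take the same regression form, with the observations y_t regressed on the smoothed states.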

The Kalman filter/smoother

- estimates the hidden-state means and covariances
  $x_t^k \equiv E_{\theta_n}[x_t \mid y^k]$ and $V_{t,s}^k \equiv \mathrm{Cov}_{\theta_n}[x_t, x_s \mid y^k]$
  for each $t \in \{1, \ldots, T\}$ and $s = t, t+1$
- these are summed over time to obtain the statistics required for the M-step, e.g.
  $\sum_{t=1}^{T-1} E_{\theta_n}[x_{t+1} x_t^\top \mid y] = (x^T, x^T)_1 + \sum_{t=1}^{T-1} V_{t+1,t}^T$,
  where $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$ (a small code sketch of this notation follows below)
- but we only care about the M-statistics, not the individual inferences at each time-step, so let's estimate these directly!
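
The $(a, b)_k$ notation is just a sum of outer products between time-shifted copies of two sequences; a direct NumPy sketch of it, assuming each sequence is stored as a (T, dim) array:

```python
import numpy as np

def lag_stat(a, b, k):
    """Compute (a, b)_k = sum_{t=1}^{T-k} a_{t+k} b_t^T for arrays a (T, da) and b (T, db)."""
    T = a.shape[0]
    # In 0-based indexing, a[k:] stacks a_{k+1},...,a_T and b[:T-k] stacks b_1,...,b_{T-k}.
    return a[k:].T @ b[:T - k]

# Example: the observed-data statistics (y, y)_k for k = 0,...,L that ASOS pre-computes,
# using the y_obs array from the earlier sketch and an illustrative choice of L.
L = 20
yy = [lag_stat(y_obs, y_obs, k) for k in range(L + 1)]
```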

Steady-state

- first we need a basic tool from linear systems/control theory: steady-state
- the covariance terms and the filtering and smoothing matrices (denoted $K_t$ and $J_t$) do not depend on the data $y$, only on the current parameters
- and they rapidly converge to steady-state values:
  $V_{t,t}^T,\; V_{t,t-1}^T,\; J_t,\; K_t \;\to\; \Lambda_0,\; \Lambda_1,\; J,\; K$ as $\min(t, T - t) \to \infty$
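
One standard way to obtain the steady-state gain (shown only as an illustration; the paper may compute these quantities differently) is to iterate the Kalman filter's covariance/gain recursion until it stops changing:

```python
import numpy as np

def steady_state_gain(A, C, Q, R, tol=1e-10, max_iter=10_000):
    """Iterate the filtering Riccati recursion to (approximate) steady state.

    Returns the steady-state Kalman gain K and the steady-state filtered covariance."""
    Nx = A.shape[0]
    P = Q.copy()                                                  # any PSD initialization
    for _ in range(max_iter):
        P_pred = A @ P @ A.T + Q                                  # predicted covariance
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)    # Kalman gain
        P_new = (np.eye(Nx) - K @ C) @ P_pred                     # filtered covariance
        if np.max(np.abs(P_new - P)) < tol:
            break
        P = P_new
    return K, P_new

K, P_filt = steady_state_gain(A, C, Q, R)   # A, C, Q, R from the earlier sketch
```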

- we can approximate the Kalman filter/smoother equations using the steady-state matrices
- this gives the highly simplified recurrences (sketched in code below)
  $x_t = H x_{t-1} + K y_t$
  $x_t^T = J x_{t+1}^T + P x_t$
  where $x_t \equiv x_t^t \equiv E_{\theta_n}[x_t \mid y^t]$, $H \equiv A - KCA$ and $P \equiv I - JA$
- these recurrences require no matrix-matrix multiplications or inversions (only matrix-vector products)
- we apply the approximate filter/smoother everywhere except the first and last $i$ time-steps, which yields a run-time of $O(N_x^2 T + N_x^3 i)$
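
A minimal sketch of these simplified recurrences (an illustration only: it ignores the careful treatment of the first and last i time-steps that the full method applies):

```python
import numpy as np

def steady_state_smoother(y, A, C, K, J):
    """Run the simplified steady-state recurrences.

    Forward:  x_t  = H x_{t-1} + K y_t,   with H = A - K C A
    Backward: xT_t = J xT_{t+1} + P x_t,  with P = I - J A
    Boundary effects at the ends of the series are ignored here."""
    T, Nx = y.shape[0], A.shape[0]
    H = A - K @ C @ A
    P = np.eye(Nx) - J @ A
    x_filt = np.zeros((T, Nx))
    x_filt[0] = K @ y[0]                          # crude initialization
    for t in range(1, T):
        x_filt[t] = H @ x_filt[t - 1] + K @ y[t]
    x_smooth = np.zeros((T, Nx))
    x_smooth[-1] = x_filt[-1]
    for t in range(T - 2, -1, -1):
        x_smooth[t] = J @ x_smooth[t + 1] + P @ x_filt[t]
    return x_filt, x_smooth
```

Here J is the steady-state smoother gain, which can be computed from the steady-state filtered covariance as $J = P_{\mathrm{filt}} A^\top (A P_{\mathrm{filt}} A^\top + Q)^{-1}$, with P_filt from the previous sketch.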

ASOS: Approximated Second-Order Statistics

- steady-state makes the covariance terms easy to estimate in time independent of T
- we want something similar for the sums of products of means, i.e. terms like $(x^T, x^T)_0 \equiv \sum_t x_t^T (x_t^T)^\top$
- we will call such sums 2nd-order statistics; the ones needed for the M-step are the M-statistics
- idea #1: derive recursions and equations that relate the 2nd-order statistics of different time-lags
  (time-lag refers to the value of $k$ in $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$)
- idea #2: evaluate these efficiently using approximations

Deriving the 2nd-order recursions/equations: an example

- suppose we wish to find the recursion for $(x, y)_k$
- the steady-state Kalman recursion for $x_{t+k}$ is $x_{t+k} = H x_{t+k-1} + K y_{t+k}$
- right-multiply both sides by $y_t^\top$, sum over $t$, factor out the matrices $H$ and $K$, and finally re-write everything using our special notation for 2nd-order statistics, $(a, b)_k \equiv \sum_{t=1}^{T-k} a_{t+k} b_t^\top$:

  $(x, y)_k \equiv \sum_{t=1}^{T-k} x_{t+k} y_t^\top = \sum_{t=1}^{T-k} \left( H x_{t+k-1} y_t^\top + K y_{t+k} y_t^\top \right)$
  $\qquad = H \sum_{t=1}^{T-k} x_{t+k-1} y_t^\top + K \sum_{t=1}^{T-k} y_{t+k} y_t^\top$
  $\qquad = H \left( (x, y)_{k-1} - x_T y_{T-k+1}^\top \right) + K (y, y)_k$
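
As a sanity check (an addition, not in the slides): any sequence produced by the steady-state filter recursion satisfies this identity exactly, regardless of initialization, so it can be verified numerically by reusing the hypothetical helpers from the earlier sketches:

```python
import numpy as np

# A, C, Q, R, y_obs, K, P_filt come from the earlier sketches.
J = P_filt @ A.T @ np.linalg.inv(A @ P_filt @ A.T + Q)   # steady-state smoother gain
x_filt, _ = steady_state_smoother(y_obs, A, C, K, J)
H = A - K @ C @ A

k = 3
lhs = lag_stat(x_filt, y_obs, k)
rhs = (H @ (lag_stat(x_filt, y_obs, k - 1) - np.outer(x_filt[-1], y_obs[-k]))
       + K @ lag_stat(y_obs, y_obs, k))
print(np.max(np.abs(lhs - rhs)))   # ~ 0 (floating-point error only)
```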

The complete list (don't bother to memorize this)

The recursions:
  $(y, x)_k = (y, x)_{k+1} H^\top + ((y, y)_k - y_{1+k} y_1^\top) K^\top + y_{1+k} x_1^\top$
  $(x, y)_k = H((x, y)_{k-1} - x_T y_{T-k+1}^\top) + K (y, y)_k$
  $(x, x)_k = (x, x)_{k+1} H^\top + ((x, y)_k - x_{1+k} y_1^\top) K^\top + x_{1+k} x_1^\top$
  $(x, x)_k = H((x, x)_{k-1} - x_T x_{T-k+1}^\top) + K (y, x)_k$
  $(x^T, y)_k = J (x^T, y)_{k+1} + P((x, y)_k - x_T y_{T-k}^\top) + x_T^T y_{T-k}^\top$
  $(x^T, x)_k = J (x^T, x)_{k+1} + P((x, x)_k - x_T x_{T-k}^\top) + x_T^T x_{T-k}^\top$
  $(x^T, x^T)_k = ((x^T, x^T)_{k-1} - x_k^T (x_1^T)^\top) J^\top + (x^T, x)_k P^\top$
  $(x^T, x^T)_k = J (x^T, x^T)_{k+1} + P((x, x^T)_k - x_T (x_{T-k}^T)^\top) + x_T^T (x_{T-k}^T)^\top$

The equations:
  $(x, x)_k = H (x, x)_k H^\top + ((x, y)_k - x_{1+k} y_1^\top) K^\top - H x_T x_{T-k}^\top H^\top + K (y, x)_{k+1} H^\top + x_{1+k} x_1^\top$
  $(x^T, x^T)_k = J (x^T, x^T)_k J^\top + P((x, x^T)_k - x_T (x_{T-k}^T)^\top) - J x_{k+1}^T (x_1^T)^\top J^\top + J (x^T, x)_{k+1} P^\top + x_T^T (x_{T-k}^T)^\top$

- noting that statistics of time-lag T + 1 are 0 by definition, we could start the 2nd-order recursions at time-lag T
- but this doesn't get us anywhere: it would be even more expensive than the usual Kalman recursions on the 1st-order terms
- instead, start the recursions at time-lag L with unbiased approximations (the "ASOS approximations"):
  $(y, x)_{L+1} \approx CA \left( (x, x)_L - x_T x_{T-L}^\top \right)$, $\quad (x^T, x)_L \approx (x, x)_L$, $\quad (x^T, y)_L \approx (x, y)_L$
- we also need $x_t^T$ for $t \in \{1, 2, \ldots, L\} \cup \{T-L, T-L+1, \ldots, T\}$, but these can be approximated by a separate procedure (see the paper)

Why might this be reasonable?

- 2nd-order statistics with a large time-lag quantify relationships between variables that are far apart in time; these are weaker and less important than relationships between variables that are close in time
- in the steady-state Kalman recursions, information is propagated via multiplication by H and J:
  $x_t = H x_{t-1} + K y_t$
  $x_t^T = J x_{t+1}^T + P x_t$
- both of these matrices have spectral radius (denoted $\sigma(\cdot)$) less than 1, and so they decay the signal exponentially
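
To see this decay concretely, one can compute $\sigma(H)$ and watch the norms of powers of H shrink, using the made-up parameters and steady-state gain from the earlier sketches:

```python
import numpy as np

# A, C and the steady-state gain K come from the earlier sketches.
H = A - K @ C @ A
print("sigma(H) =", np.max(np.abs(np.linalg.eigvals(H))))   # < 1 for a stable, well-posed filter

# ||H^k|| shrinks roughly like sigma(H)^k, so lag-k information fades exponentially.
for k in (1, 5, 10, 20):
    print(k, np.linalg.norm(np.linalg.matrix_power(H, k), 2))
```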

Procedure for estimating the M-statistics

- how do we compute an estimate of the M-statistics consistent with the 2nd-order recursions/equations and the approximations?
- essentially it is just a large linear system of dimension $O(N_x^2 L)$
- but using a general solver would be far too expensive: $O(N_x^6 L^3)$
- fortunately, using the special structure of this system, we have developed a (non-trivial) algorithm which is much more efficient: the equations can be solved using an efficient iterative algorithm we developed for a generalization of the Sylvester equation (illustrated below), and evaluating the recursions is then straightforward
- the cost is then just $O(N_x^3 L)$ after $(y, y)_k \equiv \sum_t y_{t+k} y_t^\top$ has been pre-computed for $k = 0, \ldots, L$
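
The paper's solver targets a generalization of the Sylvester equation; purely to illustrate the kind of fixed-point iteration such structure permits (this is not the paper's algorithm), here is a sketch for the standard Stein / discrete-time Sylvester equation $X = A X B + C$, which converges whenever $\sigma(A)\,\sigma(B) < 1$:

```python
import numpy as np

def stein_fixed_point(A, B, C, tol=1e-12, max_iter=10_000):
    """Solve X = A X B + C by fixed-point iteration (converges when sigma(A)*sigma(B) < 1)."""
    X = C.copy()
    for _ in range(max_iter):
        X_new = A @ X @ B + C
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X

# Example with factors built to have spectral radii 0.6 and 0.7 (scaled orthogonal matrices).
rng = np.random.default_rng(1)
A_ = 0.6 * np.linalg.qr(rng.standard_normal((4, 4)))[0]
B_ = 0.7 * np.linalg.qr(rng.standard_normal((4, 4)))[0]
C_ = rng.standard_normal((4, 4))
X = stein_fixed_point(A_, B_, C_)
print(np.max(np.abs(X - (A_ @ X @ B_ + C_))))   # residual ~ 0
```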

First convergence result: our intuition confirmed

First result: for a fixed $\theta$, the $\ell_2$-error in the M-statistics is bounded by a quantity proportional to $L^2 \lambda^{L-1}$, where $\lambda = \sigma(H) = \sigma(J) < 1$ ($\sigma(\cdot)$ denotes the spectral radius)

- so as L grows, the estimation error for the M-statistics decays exponentially
- but $\lambda$ might be close enough to 1 that we would need to make L impractically large (see the small tabulation below)
- fortunately we have a second result which provides a very different type of guarantee
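
To get a feel for this trade-off, one can tabulate the $L^2 \lambda^{L-1}$ factor for a couple of λ values (only its shape matters, since the bound is proportional to it):

```python
# When lambda is close to 1, a much larger L is needed before the factor becomes small.
for lam in (0.9, 0.99):
    print(lam, [(L, round(L**2 * lam**(L - 1), 3)) for L in (10, 50, 100, 500)])
```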

Second convergence result

Second result: the updates produced by the ASOS procedure are asymptotically consistent with the usual EM updates in the limit as $T \to \infty$ (assumes the data is generated from the model)

- the first result didn't make strong use of any property of the approximations (we could use 0 for each and the result would still hold)
- this second one is justified in the opposite way: it makes strong use of the approximations, and follows from the convergence of the $\tfrac{1}{T}$-scaled expected $\ell_2$ error of the approximations towards zero
- the result holds for any value of L

Experiments

- we considered 3 real datasets of varying sizes and dimensionality
- each algorithm was initialized from the same random parameters
- the latent dimension $N_x$ was determined by trial-and-error

Experimental parameters:

  Name             length (T)   N_y   N_x
  evaporator       6305         3     15
  motion capture   15300        10    40
  warship sounds   750000       1     20

Experimental results

[Plot: negative log-likelihood vs. iterations, evaporator dataset; legend gives total run-times]
  ASOS-EM-5 (13.1667 s), ASOS-EM-10 (13.9436 s), ASOS-EM-20 (16.6538 s), ASOS-EM-35 (21.0141 s), ASOS-EM-75 (27.1616 s), SS-EM (56.804 s), EM (692.6135 s)

[Plot: negative log-likelihood vs. iterations, motion capture dataset; legend gives total run-times]
  ASOS-EM-20 (52.0094 s), ASOS-EM-30 (55.623 s), ASOS-EM-50 (59.9489 s), SS-EM (215.6504 s), EM (6018.4049 s)

[Plot: negative log-likelihood vs. iterations, warship sounds dataset; legend gives total run-times]
  ASOS-EM-150 (49.6942 s), ASOS-EM-300 (84.4925 s), ASOS-EM-850 (214.039 s), ASOS-EM-1700 (409.0395 s), SS-EM (9179.0774 s)

Conclusion

- we applied steady-state approximations to derive a set of 2nd-order recursions and equations
- approximated statistics of time-lag L
- produced an efficient algorithm for solving the resulting system
- gave 2 formal convergence results
- demonstrated significant performance benefits for learning with long time-series

Per-iteration run-times:

  EM: $O(N_x^3 T)$     SS-EM: $O(N_x^2 T + N_x^3 i)$     ASOS-EM: $O(N_x^3 k_{\lim})$

Thank you for your attention