An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems


1 An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. Byron Boots and Geoffrey J. Gordon. AAAI 2011. Select Lab, Carnegie Mellon University.

2 What is out there? ... $o_{t-2}$ $o_{t-1}$ $o_t$ $o_{t+1}$ $o_{t+2}$ ...

3 What is out there? Dynamical systems. Past | Future: ... $o_{t-2}$ $o_{t-1}$ $o_t$ $o_{t+1}$ $o_{t+2}$ ... State. A dynamical system is a recursive rule for updating state based on observations.

4 Learning a Dynamical System. Past | Future: ... $o_{t-2}$ $o_{t-1}$ $o_t$ $o_{t+1}$ $o_{t+2}$ ... State. We would like to learn a model of a dynamical system.

5 Learning a Dynamical System. Past | Future: ... $o_{t-2}$ $o_{t-1}$ $o_t$ $o_{t+1}$ $o_{t+2}$ ... State. We would like to learn a model of a dynamical system; today I will focus on spectral learning algorithms for Predictive State Representations.

6–12 Predictive State Representations (PSRs) are comprised of: a set of actions $A$, a set of observations $O$, an initial state $x_1 \in \mathbb{R}^d$, a set of transition matrices $M_{ao} \in \mathbb{R}^{d \times d}$, and a normalization vector $e \in \mathbb{R}^d$. At each time step we can predict $P(o \mid x_t, \mathrm{do}(a_t)) = e^\top M_{a_t,o}\, x_t$ and update the state: $x_{t+1} = M_{a_t,o_t}\, x_t \,/\, P(o_t \mid x_t, \mathrm{do}(a_t))$.
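To make these two equations concrete, here is a minimal NumPy sketch of PSR prediction and filtering. The parameters below are hypothetical random values (a real PSR would come from the learning algorithm later in the talk, and random matrices do not yield valid probabilities), so the printed numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, n_obs = 3, 2, 2

# Hypothetical toy parameters; a learned PSR would supply these instead.
x = rng.random(d)                          # initial state x_1
e = rng.random(d)                          # normalization vector
M = rng.random((n_actions, n_obs, d, d))   # transition matrices M_ao

def predict(x, a, o):
    """P(o | x_t, do(a_t)) = e^T M_{a,o} x_t"""
    return e @ M[a, o] @ x

def update(x, a, o):
    """x_{t+1} = M_{a,o} x_t / P(o | x_t, do(a_t))"""
    return M[a, o] @ x / predict(x, a, o)

# Filter through a short action-observation sequence.
for a, o in [(0, 1), (1, 0), (0, 0)]:
    print(f"P(o={o} | x, do(a={a})) = {predict(x, a, o):.4f}")
    x = update(x, a, o)
```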

13 Predictive State Representations: the parameters are only determined up to a similarity transform $S \in \mathbb{R}^{d \times d}$. If we replace $M_{ao} \to S^{-1} M_{ao} S$, $x_1 \to S^{-1} x_1$, and $e \to S^\top e$, the resulting PSR makes exactly the same predictions as the original one, e.g. $P(o \mid x_t, \mathrm{do}(a_t)) = e^\top S\, S^{-1} M_{a_t,o}\, S\, S^{-1} x_t$.
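A quick numerical check of this invariance, using arbitrary toy values for the parameters and the transform:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
x1, e = rng.random(d), rng.random(d)
M_ao = rng.random((d, d))
S = rng.random((d, d)) + d * np.eye(d)   # any invertible similarity transform

Si = np.linalg.inv(S)
x1_s, e_s, M_s = Si @ x1, S.T @ e, Si @ M_ao @ S   # transformed PSR parameters

# e^T M_ao x_1 equals (S^T e)^T (S^-1 M_ao S) (S^-1 x_1) up to round-off:
print(e @ M_ao @ x1, e_s @ M_s @ x1_s)
```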

14 PSRs Are Very Expressive. For a fixed latent dimension $d$: HMMs & POMDPs $\subset$ Reduced-Rank HMMs & Reduced-Rank POMDPs $\subset$ Predictive State Representations.

15 Learning PSRs: we can use fast, statistically consistent spectral methods to learn PSR parameters.

16–17 A General Principle: use data about the past (many samples) to predict data about the future (many samples), compressing through a bottleneck, the state, and then expanding. If the bottleneck is a rank constraint, we get a spectral method.
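As a self-contained illustration (with made-up data, and plain reduced-rank regression standing in for the PSR-specific moments on the next slides), the compress-bottleneck-expand picture looks like this:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p_dim, f_dim, d = 1000, 8, 8, 3

P = rng.random((p_dim, m))      # features of the past, one column per sample
F = rng.random((f_dim, m))      # features of the future

W = F @ np.linalg.pinv(P)       # unconstrained regression: past -> future
U, _, _ = np.linalg.svd(W @ P, full_matrices=False)

state = U[:, :d].T @ W @ P      # compress: d-dimensional state (the bottleneck)
F_hat = U[:, :d] @ state        # expand: rank-d prediction of the future
```

Truncating via the SVD of the fitted values is exactly the rank constraint the slide mentions, which is why this kind of bottleneck yields a spectral method.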

18 Why Spectral Methods? There are many ways to learn a dynamical system: maximum likelihood via Expectation Maximization, gradient descent, ...; Bayesian inference via Gibbs sampling, Metropolis-Hastings, .... In contrast to these methods, spectral learning algorithms give: no local optima, a huge gain in computational efficiency, and a slight loss in statistical efficiency.

19–21 Spectral Learning for PSRs: moments of directly observable features. $\Sigma_{T,AO,H}$: trivariance tensor of features of the future, present, and past. $\Sigma_{T,H}$: covariance matrix of features of the future and past. $\Sigma_{AO,AO}$: covariance matrix of features of the present. $U$: left $d$ singular vectors of $\Sigma_{T,H}$. Then

$S^{-1} M_{ao} S := \Sigma_{T,AO,H} \times_1 U^\top \times_2 \varphi(ao)^\top (\Sigma_{AO,AO})^{-1} \times_3 \left(\Sigma_{T,H}^\top U\right)^\dagger$

and the other parameters can be found analogously.

22 Spectral Learning for PSRs. The spectral learning algorithm: estimate $\Sigma_{T,AO,H}$, $\Sigma_{T,H}$, and $\Sigma_{AO,AO}$ from data; find $U$ by SVD; plug in to recover the PSR parameters. Learning is statistically consistent and only requires linear algebra. For details, see: B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
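For intuition, here is a schematic NumPy version of these three steps for the simplest special case: no actions and one-step indicator features (essentially the spectral HMM algorithm that this PSR method generalizes; with actions and general features, the trivariance tensor and the $\Sigma_{AO,AO}$ correction enter as on the previous slide). The data and dimensions are toy assumptions.

```python
import numpy as np

def spectral_learn(obs, n_sym, d):
    # Step 1: estimate moments of directly observable quantities.
    # Sigma_TH[i, j]     ~ P(o_{t+1} = i, o_t = j)             (future vs. past)
    # Sigma_ToH[o][i, j] ~ P(o_{t+2} = i, o_{t+1} = o, o_t = j)
    Sigma_TH = np.zeros((n_sym, n_sym))
    Sigma_ToH = np.zeros((n_sym, n_sym, n_sym))
    for t in range(len(obs) - 2):
        Sigma_TH[obs[t + 1], obs[t]] += 1.0
        Sigma_ToH[obs[t + 1]][obs[t + 2], obs[t]] += 1.0
    Sigma_TH /= len(obs) - 2
    Sigma_ToH /= len(obs) - 2

    # Step 2: U = left d singular vectors of Sigma_TH.
    U = np.linalg.svd(Sigma_TH)[0][:, :d]

    # Step 3: plug in to recover the transition operators (up to similarity).
    proj = np.linalg.pinv(U.T @ Sigma_TH)
    B = {o: U.T @ Sigma_ToH[o] @ proj for o in range(n_sym)}
    return U, B   # initial state and normalization follow analogously from P(o_t)

obs = np.random.default_rng(3).integers(0, 2, size=10_000)
U, B = spectral_learn(obs, n_sym=2, d=2)
```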

23 Infinite Features. We can extend the learning algorithm to infinite feature spaces using kernels: the learning algorithm we have seen is just linear algebra, which works fine in an arbitrary RKHS. We can rewrite all of the formulas in terms of Gram matrices, using kernel SVD instead of SVD. Result: Hilbert space embeddings of dynamical systems, which handle nearly arbitrary observation distributions and give good prediction performance. For details, see: L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010.

24 An Experiment: Slot Car. [Figure: (A) the slot car platform with an IMU (top) and the racetrack (bottom); (B) average squared prediction error vs. prediction horizon for HMM, RR-HMM, LDS, Embedded (HSE-HMM), Mean, and Last models. Data thanks to Dieter Fox's lab.] An IMU sensed this data while the slot car circled the track, controlled by a constant policy. The goal of this experiment was to learn a model of the noisy IMU data and, after filtering, to predict future IMU readings.

25 Batch Methods. Bottleneck: SVD of a Gram or covariance matrix. G: (# time steps)$^2$. C: (# features $\times$ window length) $\times$ (# time steps). E.g., for 1 hr of video at 24 fps (roughly $10^5$ time steps), with features of past and future being all pixels in 2 s windows, G is on the order of $10^5 \times 10^5$.

26 Making it Fast. Two techniques: online learning and random projections. Neither one is new, but the combination with spectral learning for PSRs is, and it makes a huge difference in price.

27–28 Online Learning. Recall: $U$ = left $d$ singular vectors of $\Sigma_{T,H}$, and $S^{-1} M_{ao} S := \Sigma_{T,AO,H} \times_1 U^\top \times_2 \varphi(ao)^\top (\Sigma_{AO,AO})^{-1} \times_3 (\Sigma_{T,H}^\top U)^\dagger$. With each new observation, perform a rank-1 update of: the SVD (Brand) and the inverse (Sherman-Morrison). With $n$ features, latent dimension $d$, and $T$ steps: space = $O(nd)$, which may fit in cache; time = $O(nd^2 T)$, i.e., bounded time per example. Problem: there is no rank-1 update of the kernel SVD! We can use random projections [Rahimi & Recht, 2007].
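A sketch of the Sherman-Morrison half of this update (illustrative code, not the paper's implementation; Brand's method plays the analogous bounded-time-per-step role for the SVD of $\Sigma_{T,H}$):

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Update A^-1 in O(n^2) after the rank-1 change A <- A + u v^T."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

# Verify against a direct inverse on toy data.
rng = np.random.default_rng(4)
n = 5
A = rng.random((n, n)) + n * np.eye(n)
u, v = rng.random(n), rng.random(n)
Ainv = np.linalg.inv(A)
assert np.allclose(sherman_morrison(Ainv, u, v),
                   np.linalg.inv(A + np.outer(u, v)))
```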

29 Random Projections. Since there is no rank-1 update of the kernel SVD, we approximate the kernel with random projections [Rahimi & Recht, 2007]: explicit random features whose inner products approximate the kernel, so the online linear-algebra updates above apply unchanged.
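A sketch of random Fourier features for a Gaussian RBF kernel, following Rahimi & Recht; the feature count, bandwidth, and data below are illustrative assumptions:

```python
import numpy as np

def rff(X, n_features, sigma, seed=0):
    """Map rows of X (m x p) to explicit features z(x) with
    z(x) . z(y) ~ exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Sanity check: feature inner products approximate the exact kernel.
rng = np.random.default_rng(5)
X = rng.random((5, 3))
Z = rff(X, n_features=20_000, sigma=1.0)
K_exact = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(np.abs(Z @ Z.T - K_exact).max())   # small approximation error
```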

30 An Experiment, Revisited: Hilbert Space Embeddings. [Figure: average squared prediction error vs. prediction horizon on the slot car IMU data, now comparing Mean, Last, LDS, HMM, RR-HMM, Embedded (HSE-HMM), and Online with Random Features. Data thanks to Dieter Fox's lab.]

31–36 Online Learning Example. [Figure: (A) a conference room with a table, observed with a range sensor; snapshots of the learned model after 100, 350, and 600 steps; (B) the final embedding after 600 samples (colors = 3rd dimension).] online+random: 100k features, 11k frames, limit = available data. offline: 2k frames, compressed & subsampled, compute-limited.

37 Paper Summary. We present spectral learning algorithms for PSR models of partially observable nonlinear dynamical systems. We show how to update the parameters of the estimated PSR model given new data, yielding an efficient online spectral learning algorithm. We show how to use random projections to approximate kernel-based learning algorithms.
