Learning about State. Geoff Gordon Machine Learning Department Carnegie Mellon University
1 Learning about State Geoff Gordon Machine Learning Department Carnegie Mellon University joint work with Byron Boots, Sajid Siddiqi, Le Song, Alex Smola
2 What's out there? o_{t-2} o_{t-1} o_t o_{t+1} o_{t+2} [video: steam rising from a grate]
3 What's out there? A dynamical system. [Diagram: past observations o_{t-2}, o_{t-1}, o_t and future observations o_{t+1}, o_{t+2}, connected through a latent state.] Dynamical system = recursive rule for updating state based on observations.
4 Learning a dynamical system. [Same past/future/state diagram.] Given past observations from a partially observable system, predict future observations.
5 Examples: Baum-Welch EM algorithm for HMMs; Tomasi-Kanade structure from motion; Black-Scholes model of stock prices; SLAM (from lidars, cameras, beacons, ...); system identification for Kalman filters.
6 A general principle: predict data about the future (many samples) from data about the past (many samples) by compressing through a bottleneck (the state) and then expanding. If the bottleneck is a rank constraint, we get a spectral method.
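To make the bottleneck idea concrete, here is a minimal NumPy sketch (not from the talk) of rank-constrained regression from past-window features to future-window features: fit the full linear predictor, then compress it through a rank bottleneck by truncating its SVD. All names are made up for illustration, and the statistically preferred reduced-rank regression would whiten the past features first.

```python
import numpy as np

def rank_bottleneck_regression(past, future, rank):
    """Predict future-window features from past-window features through a rank bottleneck.

    past, future: arrays of shape (T, n_past) and (T, n_future), one row per time step.
    Returns W with rank <= `rank` such that future[t] is approximated by W @ past[t].
    """
    # "Expand": ordinary least-squares map from past features to future features.
    W_full, *_ = np.linalg.lstsq(past, future, rcond=None)   # shape (n_past, n_future)
    W_full = W_full.T                                         # shape (n_future, n_past)

    # "Compress": keep only the top `rank` singular directions -- the spectral bottleneck.
    U, s, Vt = np.linalg.svd(W_full, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
```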
7 Why spectral methods? There are many ways to learn models of dynamical systems: max likelihood via EM or gradient descent, Bayesian inference via Gibbs, MH, ... In contrast to these, spectral methods have no local optima: a huge gain in computational efficiency, for a slight loss in statistical efficiency.
8 Example: SSID for a Kalman filter. x_{t+1} = A x_t + noise, o_t = C x_t + noise, with x in R^n and o in R^m (A is n x n, C is m x n). Past data = last k observations; future data = next k observations (both windows must be big enough). Prediction = linear regression: look at the empirical covariance of past & future. Spectral: bottleneck = SVD of the covariance. [van Overschee & de Moor, 1996]
9 Kalman SSID. x_{t+1} = A x_t + noise, o_t = C x_t + noise. Assume for simplicity m >= n, with both A and C full rank. For k >= 1, writing P = E[x_t x_t^T]:
E[o_{t+k} o_t^T] = E[E[o_{t+k} o_t^T | x_t]]
               = E[E[o_{t+k} | x_t] E[o_t | x_t]^T]
               = E[C A^k x_t (C x_t)^T]
               = C A^k E[x_t x_t^T] C^T = C A^k P C^T
10 Kalman SSID. Σ_k = E[o_{t+k} o_t^T] = C A^k P C^T. Let U = the n leading left singular vectors of Σ_1, and write S = U^T C A. Then
Â := U^T Σ_2 (U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
Ĉ := U Â^{-1} = U S A^{-1} S^{-1} = U (U^T C A) A^{-1} S^{-1} = C S^{-1}
11 Kalman SSID. Algorithm: estimate Σ_1 and Σ_2 from data, get Û by SVD, plug in for Â and Ĉ. Consistent: continuity of the formulas for Â and Ĉ, plus the law of large numbers for Σ_1 and Σ_2 (wrinkle: the SVD for Û isn't continuous, but range(Û) is). Can also recover the steady-state x.
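As a rough illustration of this algorithm, here is a NumPy sketch (not the authors' code): estimate Σ_1 and Σ_2 from a single zero-mean observation sequence, then plug them into the Â and Ĉ formulas above as reconstructed on the previous slide. The array layout, latent-dimension argument, and function name are assumptions for the example.

```python
import numpy as np

def kalman_ssid(obs, n):
    """Spectral subspace ID sketch: recover (A-hat, C-hat), equal to (S A S^-1, C S^-1)
    in the limit of infinite data, from empirical covariances of the observations.

    obs: array of shape (T, m), one (zero-mean) observation per row.
    n:   latent dimension.
    """
    T = obs.shape[0]
    Sigma1 = obs[1:T-1].T @ obs[:T-2] / (T - 2)   # estimate of E[o_{t+1} o_t^T]
    Sigma2 = obs[2:T].T   @ obs[:T-2] / (T - 2)   # estimate of E[o_{t+2} o_t^T]

    U = np.linalg.svd(Sigma1)[0][:, :n]           # n leading left singular vectors

    A_hat = U.T @ Sigma2 @ np.linalg.pinv(U.T @ Sigma1)   # = S A S^{-1}
    C_hat = U @ np.linalg.inv(A_hat)                      # = C S^{-1}; needs A-hat invertible
    return A_hat, C_hat
```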
12 Variations. Use arbitrary features of a length-k window of past and future observations, and work from the covariance of past and future features (good features make a big difference in practice). Impose constraints on the learned model (e.g., stability).
13 Kalman SSID: example. Works well for video textures: the steam-grate example above, and a fountain. Observation = raw pixels (a vector of reals over time).
14 Structure from motion. Track N features over T steps and collect them in a measurement matrix whose row i holds the image coordinates of feature i over time:
[ x_11 y_11  x_12 y_12  ...  x_1T y_1T ]
[ x_21 y_21  x_22 y_22  ...  x_2T y_2T ]
[            ...                       ]
[ x_N1 y_N1  x_N2 y_N2  ...  x_NT y_NT ]
[Tomasi & Kanade, 1992]
15 Structure from motion. In the measurement matrix above, x_it is the projection of feature i onto the camera's horizontal axis at time t (and y_it, the vertical). Notation: [u_i, v_i, w_i] = feature i coordinates; [h_1t, h_2t, h_3t] = camera horizontal axis; [v_1t, v_2t, v_3t] = camera vertical axis.
16 Structure from motion. The N x 2T measurement matrix factors as an N x 3 matrix of feature coordinates (rows [u_i, v_i, w_i]) times a 3 x 2T matrix of camera axes (columns [h_1t, h_2t, h_3t] and [v_1t, v_2t, v_3t], interleaved over time), so it has rank at most 3.
17 Structure from motion. The factorization is only determined up to an invertible transform (any invertible matrix can be inserted between the two factors).
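For reference, a small NumPy sketch of the Tomasi-Kanade-style factorization described above, under the affine-camera assumption. The measurement-matrix layout (N feature rows, 2T columns of interleaved x/y coordinates) follows the slides; the function name and centering convention are assumptions. Any invertible 3 x 3 matrix can be inserted between the returned factors, which is exactly the ambiguity noted on this slide.

```python
import numpy as np

def factorize_measurements(W):
    """Rank-3 factorization of an N x 2T tracked-feature matrix W (columns alternate
    x and y image coordinates of the N features over T frames).

    Returns (structure, cameras) with the centered W ~= structure @ cameras, where
    structure is N x 3 (feature coordinates) and cameras is 3 x 2T (camera axes).
    """
    # Affine camera: subtracting each column's mean over features removes the translation.
    W_centered = W - W.mean(axis=0, keepdims=True)

    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    root = np.sqrt(s[:3])
    structure = U[:, :3] * root           # N x 3
    cameras = root[:, None] * Vt[:3, :]   # 3 x 2T
    return structure, cameras
```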
18 SfM as SSID. The covariance is the measurement matrix above. Past data: an indicator of the time step & h/v axis, which means we get to memorize each time step (no attempt to learn dynamics). Future data: the observed screen coordinates (a column of the matrix).
19 Kalman SSID: failure. [Video comparison: HMM (Baum-Welch) vs. Kalman filter (SSID) vs. a preview of the method to come; all models have 10 latent dimensions.]
20 Can we generalize? x_{t+1} = A x_t + noise, o_t = C x_t + noise (A is n x n, C is m x n). Get rid of the Gaussian noise assumption. HMM: same form as the Kalman filter, but A >= 0 with 1^T A = 1^T, C >= 0 with 1^T C = 1^T, noise ~ multinomial, and x, o are indicator vectors: e.g., state 4 = [0 0 0 1 0 ...]^T.
21 Derivations for Kalman vs. HMM. The same derivation goes through in both cases (assume for simplicity m >= n, both A and C full rank):
E[o_{t+k} o_t^T] = E[E[o_{t+k} o_t^T | x_t]] = E[E[o_{t+k} | x_t] E[o_t | x_t]^T] = E[C A^k x_t (C x_t)^T] = C A^k E[x_t x_t^T] C^T = C A^k P C^T
22 HMM SSID: first try. As before, recover Â and Ĉ from E[o_{t+1} o_t^T] and E[o_{t+2} o_t^T]:
U^T Σ_2 (U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
But the recovered model doesn't satisfy A >= 0, 1^T A = 1^T, C >= 0, 1^T C = 1^T: is this a problem?
23 Merging A & C. HMM tracking (filtering): write b_t = P[x_t | o_{1:t}].
Predict:  P[x_t | o_{1:t-1}] = b_{t-0.5} = A b_{t-1}
Observe:  P[o_t | o_{1:t-1}] = C b_{t-0.5}
Update:   if o_t = o, then P(x_t = x | o_{1:t}) = P(o | x_t = x) P(x_t = x | o_{1:t-1}) / Z, where Z = P(o_t = o | b_{t-1}); i.e., b_t = diag(C_{o,:}) b_{t-0.5} / Z.
Write A_o = diag(C_{o,:}) A. Then b_t = A_o b_{t-1} / Z and P(o_t = o | b_{t-1}) = 1^T A_o b_{t-1}, so it's enough to estimate the A_o.
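The recursions on this slide are easy to state in code. Below is a small NumPy sketch (illustration only) of HMM filtering written purely in terms of the observable operators A_o = diag(C_{o,:}) A; the function names and the list-based interface are assumptions.

```python
import numpy as np

def make_operators(A, C):
    """A_o = diag(C[o, :]) @ A for a column-stochastic transition matrix A and an
    observation matrix C with C[o, x] = P(o | x)."""
    return [np.diag(C[o, :]) @ A for o in range(C.shape[0])]

def filter_beliefs(A_ops, b0, observations):
    """Track b_t = P[x_t | o_{1:t}] using only the operators A_o.

    Yields (b_t, Z_t), where Z_t = P(o_t | o_{1:t-1}) = 1^T A_o b_{t-1}.
    """
    b = np.asarray(b0, dtype=float)
    for o in observations:
        unnormalized = A_ops[o] @ b     # A_o b_{t-1}
        Z = unnormalized.sum()          # 1^T A_o b_{t-1}
        b = unnormalized / Z            # b_t = A_o b_{t-1} / Z
        yield b, Z
```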
24 HMM SSID: try #2. Define
Σ_2^o := E[o_{t+2} δ(o_{t+1} = o) o_t^T]
      = E[E[o_{t+2} δ(o_{t+1} = o) o_t^T | x_t]]
      = E[E[o_{t+2} δ(o_{t+1} = o) | x_t] E[o_t | x_t]^T]
      = E[E[o_{t+2} | x_t, o_{t+1} = o] P[o_{t+1} = o | x_t] (C x_t)^T]
      = E[E[o_{t+2} | x_t, o_{t+1} = o] (1^T A_o x_t) (C x_t)^T]
      = E[C A (A_o x_t / (1^T A_o x_t)) (1^T A_o x_t) (C x_t)^T]
      = E[C A A_o x_t (C x_t)^T]
      = C A A_o E[x_t x_t^T] C^T = C A A_o P C^T
25 HMM SSID: try #2.
Â_o := U^T Σ_2^o (U^T Σ_1)^+ = U^T C A A_o P C^T (U^T C A P C^T)^+ = (U^T C A) A_o P C^T (P C^T)^+ (U^T C A)^{-1} = S A_o S^{-1}
Model: x <- A_o x / P(o), o ~ C x, with C_{o,:} = e^T A_o.
Algorithm: estimate Σ_1 and Σ_2^o from data; get Û by SVD of Σ_1; plug in to get Â_o (for each o). Also need ê = S^{-T} 1 = the leading left eigenvector of Â_1 + Â_2 + ... (summing over observations o).
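Putting the last two slides together, here is a NumPy sketch (an illustration, not the authors' implementation) of the spectral estimator for discrete observations: build the empirical Σ_1 and Σ_2^o, take the SVD of Σ_1, and form Â_o = Û^T Σ_2^o (Û^T Σ_1)^+. The symbol encoding and function name are assumptions; recovering ê is only noted in a comment.

```python
import numpy as np

def spectral_hmm_operators(obs, n_symbols, n):
    """Estimate observable operators A_o-hat = U^T Sigma_2^o (U^T Sigma_1)^+
    (each equal to S A_o S^{-1} in the limit) from a sequence of discrete symbols.

    obs: integer array of observed symbols in {0, ..., n_symbols - 1}.
    n:   latent dimension.
    """
    obs = np.asarray(obs)
    T = len(obs)
    O = np.eye(n_symbols)[obs]                       # indicator vectors, shape (T, n_symbols)

    Sigma1 = O[1:T-1].T @ O[:T-2] / (T - 2)          # estimate of E[o_{t+1} o_t^T]
    Sigma2 = np.zeros((n_symbols, n_symbols, n_symbols))   # indexed by the middle symbol o
    for t in range(T - 2):
        Sigma2[obs[t+1]] += np.outer(O[t+2], O[t])   # E[o_{t+2} 1{o_{t+1}=o} o_t^T]
    Sigma2 /= (T - 2)

    U = np.linalg.svd(Sigma1)[0][:, :n]
    pinv_US1 = np.linalg.pinv(U.T @ Sigma1)
    A_ops = [U.T @ Sigma2[o] @ pinv_US1 for o in range(n_symbols)]
    # The normalization vector e-hat would be the leading left eigenvector of sum(A_ops).
    return U, A_ops
```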
26 Example: clock. Discrete observations: sampled frames from the training video; when tracking, use nearest neighbor or Parzen windows (a mixture-of-Gaussians HMM). 10 latent dimensions.
27 Can we generalize? HMMs had x in Δ (the probability simplex); intuition: the number of discrete states = the number of dimensions. We now have x in SΔ, which is essentially equally restrictive. Can we allow x in X for a general set X, so that # states > # dims?
28 # states > # dims: the picture. [Figure: random projections of an N-dimensional simplex for N = 3, 15, 100.]
29 SSID for OOMs (= PSRs without actions, multiplicity automata, ...). An OOM uses the same update equations as the HMM: x <- A_o x / P(o), o ~ C x, C_{o,:} = e^T A_o; it is defined by transition matrices A_o and a normalization vector e. Like an HMM, but lift the restriction X = SΔ: instead of requiring A_o x >= 0, require only A_o x = λ x' for some valid state x' and λ >= 0. Includes HMMs as a special case.
30 OOM SSID. No change!
31 OOM example. No change! Our HMM SSID was actually learning OOMs all along.
32 Can we generalize? We've allowed a finer discretization of the observation space. Can we allow continuous observations? Yes: featurize! Let ϕ(o) be a feature function.
33 Featurize.
Σ_2^φ := E[o_{t+2} φ(o_{t+1}) o_t^T] = ∑_o φ(o) E[o_{t+2} δ(o_{t+1} = o) o_t^T] = ∑_o φ(o) C A A_o P C^T
Â_φ := U^T Σ_2^φ (U^T Σ_1)^+ = ∑_o φ(o) S A_o S^{-1}
Store Â_ϕ for many different ϕ; recover Â_o as needed.
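In the continuous case, Σ_2^φ can be estimated directly by averaging over samples, without ever summing over discrete symbols. A minimal NumPy sketch (illustrative; the array shapes and names are assumptions) for a single feature function φ applied to the middle observation:

```python
import numpy as np

def featurized_operator(o_t, phi_mid, o_t2, U, Sigma1):
    """A_phi-hat = U^T Sigma_2^phi (U^T Sigma_1)^+ for one feature function phi.

    o_t, o_t2: arrays of shape (T, m) holding o_t and o_{t+2} for T aligned triples.
    phi_mid:   array of shape (T,) holding phi(o_{t+1}) for the same triples.
    """
    Sigma2_phi = (o_t2 * phi_mid[:, None]).T @ o_t / len(phi_mid)   # E[o_{t+2} phi(o_{t+1}) o_t^T]
    return U.T @ Sigma2_phi @ np.linalg.pinv(U.T @ Sigma1)
```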
34 Example: range-only SLAM. A robot measures distances to L landmarks as it moves, and wants to reconstruct its path and the landmark locations. T = 1000, L = 20, window = 1 observation, latent dimension = 15. Features = exp(-d^2 / (2σ^2)) of the measured distances d.
35 Can we generalize? If some features are good, more must be better: kernels! Everything above is linear algebra, so it works just fine in an arbitrary RKHS and can be rewritten in terms of the Gram matrix (no infinite-dimensional computations required). Caveat: regularization is now more important.
36 Prediction performance. [Figures: slot car inertial measurement data, showing the slot car platform with its IMU and the racetrack, and average prediction error vs. prediction horizon for HMM, RR-HMM, LDS, Mean, Last, and the Hilbert-space-embedded HMM; a second panel shows path prediction error. Data thanks to Dieter Fox's lab.]
37 Learning in the loop: option pricing. [Figure: price path p_{t-100}, ..., p_t.] Price a financial derivative, a "psychic call": the holder gets to say "I bought a call 100 days ago." The underlying stock follows Black-Scholes dynamics with unknown parameters. One solution: identify the B-S parameters, then plan; but planning is itself hard.
38 Option pricing. A better solution [Van Roy et al.]: use RL. 16 hand-picked features (e.g., polynomials of the price history); initialize the policy arbitrarily; use least-squares temporal differences (LSTD) to estimate the value function; set policy := greedy; repeat.
39 Option pricing. Still better: use SSID inside policy iteration. Take the 16 original features from Van Roy et al. plus 204 additional low-originality features (e.g., linear functions of the price history of the underlying). SSID picks the best 16-dimensional dynamics to explain the feature evolution; then solve for the value function in closed form.
40 Policy iteration with spectral learning. [Figure: expected payoff per $ invested vs. iterations of policy iteration, comparing Threshold, LSTD(16), LSTD, LARS-TD, and PSTD.] PSTD is 0.82¢/$ better than the best competitor and 1.33¢/$ better than the best previously published result. Data: 1,000,000 time steps.
41 Making it fast. Bottleneck: SVD of the Gram or Hankel matrix. G: (# time steps)^2; H: (# observations x window length) x (# time steps). E.g., 1 hr of video, 24 fps, ..., 2 s window: G: (...) x (...), H: (...) x (...).
>> k = 50; n = 2000;  % n^2 = 4,000,000
>> tic; x = randn(n,n); [u,s,v] = svds(x,k); toc
Elapsed time is ... seconds.
42 Making it fast. Two techniques: online learning and random embeddings. Neither one is new, but the combination with PSR SSID is, and it makes a huge difference in practice.
43 Online learning. With each new observation, do a rank-1 update of the SVD (Brand) and of the inverse (Sherman-Morrison). With n features, latent dimension d, and T steps: space = O(nd), which may fit in cache; time = O(nd^2 T), i.e., bounded time per example. There is a small loss in statistical efficiency (the estimated subspace rotates), but we can deal with it. Problem: there is no rank-1 update of the kernel SVD!
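The Sherman-Morrison step mentioned here is one line of algebra; the sketch below (illustrative, standard formula) shows the O(n^2) rank-1 update of a cached inverse that keeps the per-example cost bounded.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(n^2) instead of O(n^3).

    Follows (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u).
    """
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```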
44 Random embedding. Often k(x, y) = k(x - y); such a kernel is PD iff the Fourier transform of k(z) satisfies p(ω) >= 0 (e.g., Gaussian, Laplacian, Cauchy). Then
k(z) = ∫_{R^d} p(ω) e^{jω·z} dω = ∫_{R^d} p(ω) cos(ω·z) dω = (1/Z) E_{ω ~ Zp}[cos(ω·z)]
which looks like an expectation. [Rahimi & Recht, 2007]
45 Random embedding.
cos(ω·(x - y)) = cos(ω·x) cos(ω·y) + sin(ω·x) sin(ω·y)
Features: cos(ω_i·x)/√k and sin(ω_i·x)/√k for k random ω_i ~ p. Then
φ(x)·φ(y) = (1/k) ∑_{i=1}^k cos(ω_i·(x - y)) -> E[cos(ω·(x - y))]
and the convergence is uniform in z as k -> ∞.
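Here is a minimal NumPy sketch of the random-embedding (random Fourier features) construction for the Gaussian kernel exp(-||x - y||^2 / (2 σ^2)); the bandwidth parameterization and function name are assumptions for the example. After the map, φ(x)·φ(y) approximates k(x, y), so the kernel SVD can be replaced by an ordinary SVD on the features.

```python
import numpy as np

def random_fourier_features(X, k, sigma, seed=0):
    """Map X (shape (N, d)) to features of shape (N, 2k) whose inner products
    approximate the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the Gaussian kernel, p(omega) is itself Gaussian with std 1/sigma per dimension.
    Omega = rng.normal(scale=1.0 / sigma, size=(d, k))
    proj = X @ Omega                                   # omega_i . x for each i
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(k)
```

With the same seed (hence the same Ω) and k in the thousands, random_fourier_features(X, k, sigma) @ random_fourier_features(Y, k, sigma).T approaches the exact Gram matrix between X and Y.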
46 Random embedding example. [Figure: results for varying basis size.]
47 Results: closing the loop. Online + random features: 100k features, 11k frames, limited by available data. Offline: 2k frames, compressed & subsampled, compute-limited. [Figure panels: the table setup; learned embeddings after 100 and 600 samples; the first few steps; the final embedding (colors = 3rd dimension).]
48 Planning. [Diagram: image observations in, wheel velocities out.] Suppose we can predict future state; now choose actions to maximize reward.
49 Planning. Value iteration: exactly the same math as POMDP value iteration. Point-based methods are fast and accurate, but need a number of points exponential in the latent dimension: a possible big win for learned (low-dimensional) models.
50 Value iteration: data. [Figure: bird's-eye view with the goal image; 16x16 RGB observations rendered from a 3d view at t = 1 and t = 10.] Actions: 6 different noisy translations and rotations. Data: 10,000 random start positions; traces of 7 random actions, observations, and rewards (+1 for the goal, ϵ otherwise).
51 VI: learned subspace. [Figure: 2 dimensions of the learned 5D subspace; points take the average color of the next image.]
52 VI: plans. [Figure: two near-optimal plans reaching the goal.] Near-optimal plans with only a 5d latent space (compare to a 5-state POMDP).
53 Summary. Learn dynamical-system models with no local optima and fast online computation. A nonparametric (kernel-based) version handles near-arbitrary observation distributions. One general principle yields algorithms for HMMs, OOMs, SfM, range-only SLAM (known correspondences), Kalman-filter system ID, RL, ... Good results from a general-purpose algorithm on problems typically tackled with lots of problem-specific engineering.
54 Papers.
B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.
B. Boots and G. Gordon. Predictive state temporal difference learning. NIPS, 2010.
B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010. (Best paper)