Learning about State. Geoff Gordon, Machine Learning Department, Carnegie Mellon University

Learning about State. Geoff Gordon, Machine Learning Department, Carnegie Mellon University. Joint work with Byron Boots, Sajid Siddiqi, Le Song, and Alex Smola.

What's out there? A stream of observations ..., o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ... (running example: steam rising from a grate).

What's out there? A dynamical system: past observations ..., o_{t-2}, o_{t-1}, o_t are summarized in a state, which determines the future observations o_{t+1}, o_{t+2}, ... A dynamical system is a recursive rule for updating state based on observations.

Learning a dynamical system: given past observations ..., o_{t-2}, o_{t-1}, o_t from a partially observable system, infer a state and predict the future observations o_{t+1}, o_{t+2}, ...

Examples: the Baum-Welch EM algorithm for HMMs; Tomasi-Kanade structure from motion; the Black-Scholes model of stock prices; SLAM (from lidars, cameras, beacons, ...); system identification for Kalman filters.

A general principle: data about the past (many samples) is compressed through a bottleneck into a state, which is then expanded to predict data about the future (many samples). If the bottleneck is a rank constraint, we get a spectral method.
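To make the compress/bottleneck/expand picture concrete, here is a minimal sketch (not the talk's exact estimator): map past features to future features through a rank constraint by truncating the SVD of their empirical cross-covariance. The function name and the use of the raw, unwhitened cross-covariance are illustrative assumptions.

```python
import numpy as np

def rank_bottleneck(past, future, rank):
    """Sketch: low-rank map from past features to future features.
    past:   (T, p) array, row t = features of the past at time t
    future: (T, f) array, row t = features of the future at time t
    rank:   dimension of the learned state (the bottleneck)
    """
    T = past.shape[0]
    # Empirical cross-covariance between future and past features
    cov = future.T @ past / T                       # (f, p)
    # Rank constraint via SVD: keep the leading singular directions
    U, s, Vt = np.linalg.svd(cov, full_matrices=False)
    compress = Vt[:rank]                            # past features -> state
    expand = U[:, :rank] * s[:rank]                 # state -> future features
    state = past @ compress.T                       # (T, rank)
    predicted_future = state @ expand.T             # (T, f)
    return state, predicted_future
```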

Why spectral methods? There are many ways to learn models of dynamical systems: maximum likelihood via EM or gradient descent, Bayesian inference via Gibbs sampling or Metropolis-Hastings, ... In contrast to these, spectral methods have no local optima: a huge gain in computational efficiency, at a slight loss in statistical efficiency.

Example: SSID for the Kalman filter. Model: x_{t+1} = A x_t + noise, o_t = C x_t + noise, with A an n x n matrix and C an m x n matrix. Past data = the last k observations, future data = the next k observations (both windows must be long enough). Prediction is linear regression: look at the empirical covariance of past and future. Spectral: the bottleneck is an SVD of that covariance. [van Overschee & de Moor, 1996]

Kalman SSID. Model: x_{t+1} = A x_t + noise, o_t = C x_t + noise. Assume for simplicity that m >= n and that both A and C are full rank. Writing P = E[x_t x_t^T], for k >= 1:
E[o_{t+k} o_t^T] = E[ E[o_{t+k} o_t^T | x_t] ] = E[ E[o_{t+k} | x_t] E[o_t | x_t]^T ] = E[ C A^k x_t (C x_t)^T ] = C A^k E[x_t x_t^T] C^T = C A^k P C^T

Kalman SSID. So Σ_k := E[o_{t+k} o_t^T] = C A^k P C^T. Let U = the n leading left singular vectors of Σ_1, and write S = U^T C A. Then
Â := (U^T Σ_2)(U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
Ĉ := U Â^{-1} = U (S A S^{-1})^{-1} = U S A^{-1} S^{-1} = (U U^T C A) A^{-1} S^{-1} = C A A^{-1} S^{-1} = C S^{-1}

Kalman SSID. Algorithm: estimate Σ_1 and Σ_2 from data, get Û by SVD, and plug in to the formulas above for Â and Ĉ. Consistency follows from continuity of the formulas for Â and Ĉ together with the law of large numbers for Σ_1 and Σ_2; the one wrinkle is that the SVD giving Û isn't continuous, but range(Û) is. Can also recover the steady-state x.
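A hedged numpy sketch of that algorithm follows. The function name, the single-lag empirical covariances, and the pseudoinverse are illustrative choices; a production SSID implementation (e.g., following van Overschee & De Moor) handles windows and noise models more carefully.

```python
import numpy as np

def kalman_ssid(obs, n_states):
    """Sketch of the spectral SSID recipe from these slides.
    obs: (T, m) array of observations o_t; n_states: assumed latent dim n.
    Returns (A_hat, C_hat), correct only up to an unknown similarity S."""
    T = obs.shape[0]
    # Empirical one- and two-step covariances Sigma_1, Sigma_2
    Sigma1 = obs[1:T-1].T @ obs[0:T-2] / (T - 2)   # ~ E[o_{t+1} o_t^T]
    Sigma2 = obs[2:T].T @ obs[0:T-2] / (T - 2)     # ~ E[o_{t+2} o_t^T]
    # Bottleneck: leading n left singular vectors of Sigma_1
    U, _, _ = np.linalg.svd(Sigma1)
    U = U[:, :n_states]
    # A_hat = (U^T Sigma_2)(U^T Sigma_1)^+ = S A S^{-1}
    A_hat = (U.T @ Sigma2) @ np.linalg.pinv(U.T @ Sigma1)
    # C_hat = U A_hat^{-1} = C S^{-1}  (assumes A_hat is invertible)
    C_hat = U @ np.linalg.inv(A_hat)
    return A_hat, C_hat
```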

Variations: use arbitrary features of a length-k window of past and future observations, and work from the covariance of the past and future features; good features make a big difference in practice. Can also impose constraints on the learned model (e.g., stability). A sketch of building window features appears below.
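As a simple illustration of "features of a length-k window", the sketch below just stacks raw past and future windows; in practice one would substitute richer features. The function name and the choice of raw windows are assumptions, not the talk's prescription.

```python
import numpy as np

def window_features(obs, k):
    """Sketch: stack raw length-k windows as past/future feature vectors.
    obs: (T, m) array. Returns (past, future), one row per valid time t."""
    T, m = obs.shape
    ts = range(k, T - k + 1)
    past = np.stack([obs[t-k:t].reshape(-1) for t in ts])    # o_{t-k..t-1}
    future = np.stack([obs[t:t+k].reshape(-1) for t in ts])  # o_{t..t+k-1}
    return past, future
```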

Kalman SSID: example. Works well for video textures: the steam-grate example above, and a fountain. Observation = raw pixels (a vector of reals at each time step).

Structure from motion. Arrange the tracked image coordinates in a measurement matrix whose row i is (x_{i1}, y_{i1}, x_{i2}, y_{i2}, ..., x_{iT}, y_{iT}); for example, the entry labeled "feature 1, step 2" is (x_{12}, y_{12}). Track N features over T steps. [Tomasi & Kanade, 1992]

Structure from motion. Here x_{it} is the projection of feature i onto the camera's horizontal axis at time t (and y_{it} onto its vertical axis); [u_i, v_i, w_i] are the coordinates of feature i; [h_{1t}, h_{2t}, h_{3t}] is the camera's horizontal axis at time t; and [v_{1t}, v_{2t}, v_{3t}] is its vertical axis.

Structure from motion. The measurement matrix factors as the N x 3 matrix of feature coordinates [u_i, v_i, w_i] times the 3 x 2T matrix whose columns are the camera axes [h_{1t}, h_{2t}, h_{3t}] and [v_{1t}, v_{2t}, v_{3t}]: x_{it} = [u_i, v_i, w_i] · [h_{1t}, h_{2t}, h_{3t}] and y_{it} = [u_i, v_i, w_i] · [v_{1t}, v_{2t}, v_{3t}].

Structure from motion: the factorization is only determined up to an invertible transform.

SfM as SSID: treat the measurement matrix as the covariance of past and future data. Past data: an indicator of the time step and of the horizontal/vertical axis, which means we get to memorize each time step (no attempt to learn dynamics). Future data: the observed screen coordinates (a column of the measurement matrix). A rank-constrained factorization sketch follows.
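The corresponding rank-3 factorization can be sketched with one SVD. This is only the low-rank split (it omits the centering and metric-upgrade steps of the full Tomasi-Kanade method), and the function name is illustrative.

```python
import numpy as np

def factor_measurements(W, rank=3):
    """Sketch: split the N x 2T measurement matrix W into
    (structure) x (motion) under a rank constraint. The split is only
    determined up to an invertible rank x rank transform."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    structure = U[:, :rank] * s[:rank]   # ~ feature coordinates (N x rank)
    motion = Vt[:rank]                   # ~ camera axes over time (rank x 2T)
    return structure, motion
```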

Kalman SSID: failure. [Video panels: HMM (Baum-Welch), Kalman filter (SSID), and a preview; all models have 10 latent dimensions.]

Can we generalize? Model: x_{t+1} = A x_t + noise, o_t = C x_t + noise, with A an n x n matrix and C an m x n matrix. Goal: get rid of the Gaussian noise assumption. An HMM has the same form as the Kalman filter, but with A >= 0 and C >= 0 elementwise and columns summing to one (1^T A = 1^T, 1^T C = 1^T), noise ~ multinomial, and x, o indicator vectors (e.g., state 4 corresponds to [0 0 0 1 0 ...]^T).
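To illustrate the point that an HMM can be written in exactly this "linear model plus noise" form, here is a small sampling sketch with indicator-vector states and observations. The function name and the column-stochastic convention are assumptions, chosen to be consistent with the filter equations later in the talk.

```python
import numpy as np

def sample_hmm(A, C, x0, T, rng=None):
    """Sketch: sample from an HMM written as x' = A x + noise, o = C x + noise.
    A: (n, n) column-stochastic transition matrix; C: (m, n) column-stochastic
    observation matrix; x0: length-n indicator vector of the initial state."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, m = A.shape[0], C.shape[0]
    x, obs = x0, []
    for _ in range(T):
        # o has mean C x; the multinomial "noise" makes it an indicator vector
        o = np.eye(m)[rng.choice(m, p=C @ x)]
        obs.append(o)
        # x' has mean A x; again the noise snaps it back to an indicator
        x = np.eye(n)[rng.choice(n, p=A @ x)]
    return np.array(obs)
```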

Derivations for Kalman vs. HMM. The same chain of equalities holds for both models (assume for simplicity that m >= n and that both A and C are full rank):
E[o_{t+k} o_t^T] = E[ E[o_{t+k} o_t^T | x_t] ] = E[ E[o_{t+k} | x_t] E[o_t | x_t]^T ] = E[ C A^k x_t (C x_t)^T ] = C A^k E[x_t x_t^T] C^T = C A^k P C^T

HMM SSID: first try. As before, recover Â and Ĉ from E[o_{t+1} o_t^T] = C A P C^T and E[o_{t+2} o_t^T] = C A^2 P C^T:
Â = (U^T Σ_2)(U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
But the result doesn't satisfy A >= 0, 1^T A = 1^T, C >= 0, 1^T C = 1^T. Is this a problem?

Merging A and C. HMM tracking: write b_t = P[x_t | o_{1:t}]. Then
P[x_t | o_{1:t-1}] = b_{t-1/2} = A b_{t-1}
P[o_t | o_{1:t-1}] = C b_{t-1/2}
and if o_t = o: P(x_t = x | o_{1:t}) = P(o | x_t = x) P(x_t = x | o_{1:t-1}) / Z, i.e. b_t = diag(C_{o,:}) b_{t-1/2} / Z.
Writing A_o = diag(C_{o,:}) A, the whole update is b_t = A_o b_{t-1} / Z, where Z = P(o_t = o | b_{t-1}).
So it's enough to estimate the A_o; in particular, P(o_t = o | b_{t-1}) = 1^T A_o b_{t-1}. (A filtering sketch in code follows.)
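A minimal sketch of this merged-matrix filter, assuming discrete observations indexed 0..m-1 and matrices built as A_o = diag(C[o, :]) A; the function and variable names are illustrative.

```python
import numpy as np

def hmm_filter(A_obs, b0, observations):
    """Sketch: belief update b_t = A_o b_{t-1} / Z with Z = 1^T A_o b_{t-1}.
    A_obs: list/dict mapping each observation o to A_o = diag(C[o, :]) @ A.
    b0: initial belief (length-n probability vector)."""
    b, beliefs = b0, []
    for o in observations:
        unnormalized = A_obs[o] @ b
        Z = unnormalized.sum()      # = P(o_t = o | b_{t-1}) = 1^T A_o b_{t-1}
        b = unnormalized / Z
        beliefs.append(b)
    return beliefs

# Building the merged matrices from (A, C), as on the slide:
# A_obs = {o: np.diag(C[o, :]) @ A for o in range(C.shape[0])}
```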

HMM SSID: try #2. For each observation o, define
Σ_2^o := E[o_{t+2} δ(o_{t+1} = o) o_t^T]
= E[ E[o_{t+2} δ(o_{t+1} = o) o_t^T | x_t] ]
= E[ E[o_{t+2} δ(o_{t+1} = o) | x_t] E[o_t | x_t]^T ]
= E[ E[o_{t+2} | x_t, o_{t+1} = o] P[o_{t+1} = o | x_t] (C x_t)^T ]
= E[ C A (A_o x_t / (1^T A_o x_t)) (1^T A_o x_t) (C x_t)^T ]
= E[ C A A_o x_t (C x_t)^T ]
= C A A_o E[x_t x_t^T] C^T = C A A_o P C^T

HMM SSID: try #2. With S = U^T C A as before,
Â_o := (U^T Σ_2^o)(U^T Σ_1)^+ = U^T C A A_o P C^T (U^T C A P C^T)^+ = (U^T C A) A_o P C^T (P C^T)^+ (U^T C A)^{-1} = S A_o S^{-1}
The learned model is used as x ← A_o x / P(o), o ~ C x, with C_{o,:} = e^T A_o.
Algorithm: estimate Σ_1 and Σ_2^o from data; get Û by SVD of Σ_1; plug in to get Â_o (for each o). Also need e^T = 1^T S^{-1}, the leading left eigenvector of Â_1 + Â_2 + ...
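A hedged numpy sketch of this procedure for a sequence of discrete observations. The function name, the single-lag covariances, and the eigenvector computation of e are assumptions; the recovered Â_o are only correct up to the similarity S, and e is only determined up to scale.

```python
import numpy as np

def hmm_spectral_ssid(obs_idx, m, n_states):
    """Sketch of HMM SSID 'try #2': estimate Sigma_1 and Sigma_2^o,
    then recover A_o-hat ~ S A_o S^{-1} for each observation o.
    obs_idx: 1-D array of observation indices in {0, ..., m-1}."""
    T = len(obs_idx)
    O = np.eye(m)[obs_idx]                        # indicator vectors, (T, m)
    Sigma1 = O[1:T-1].T @ O[0:T-2] / (T - 2)      # ~ E[o_{t+1} o_t^T]
    U, _, _ = np.linalg.svd(Sigma1)
    U = U[:, :n_states]
    pinv = np.linalg.pinv(U.T @ Sigma1)
    A_hat = {}
    for o in range(m):
        delta = (obs_idx[1:T-1] == o).astype(float)           # delta(o_{t+1} = o)
        Sigma2_o = (O[2:T] * delta[:, None]).T @ O[0:T-2] / (T - 2)
        A_hat[o] = (U.T @ Sigma2_o) @ pinv                    # ~ S A_o S^{-1}
    # e: leading left eigenvector of sum_o A_hat[o] (rescale in practice so
    # that the predicted probabilities e^T A_hat[o] b sum to one over o)
    vals, vecs = np.linalg.eig(sum(A_hat.values()).T)
    e = np.real(vecs[:, np.argmax(np.real(vals))])
    return A_hat, e, U
```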

Example: clock. Discrete observations: sampled frames from the training video; when tracking, use nearest neighbor or Parzen windows (i.e., a mixture-of-Gaussians HMM). 10 latent dimensions.

Can we generalize? HMMs had x ∈ Δ (the probability simplex); the intuition is that the number of discrete states equals the number of dimensions. We now have x ∈ SΔ, which is essentially equally restrictive. Can we allow x ∈ X for a general set X, so that the number of states exceeds the number of dimensions?

# states > # dims: the picture. [Figure: random projections of the N-dimensional simplex for N = 3, 15, 100.]

SSID for OOMs (also known as PSRs without actions, multiplicity automata, ...). Both OOMs and HMMs track via x ← A_o x / P(o) and predict via o ~ C x with C_{o,:} = e^T A_o. An OOM is defined by its transition matrices A_o and normalization vector e; it is like an HMM, but lifts the restriction X = SΔ: instead of requiring A_o x >= 0, we only require A_o x ∈ λX for some λ >= 0. OOMs include HMMs as a special case.

OOM SSID: no change!

OOM example: no change; our HMM SSID was actually learning OOMs all along.
