Spectral learning algorithms for dynamical systems


Spectral learning algorithms for dynamical systems. Geoff Gordon, Machine Learning Department, Carnegie Mellon University. Joint work with Byron Boots, Sajid Siddiqi, Le Song, and Alex Smola.

What's out there? ot-2, ot-1, ot, ot+1, ot+2: video frames, as vectors of pixel intensities.

A dynamical system. Given past observations (ot-2, ot-1, ot) from a partially observable system, use the latent state to predict future observations (ot+1, ot+2).

This talk: a general class of models for dynamical systems; a fast, statistically consistent learning algorithm with no local optima; includes many well-known models and algorithms as special cases; also includes new models and algorithms that give state-of-the-art performance on interesting tasks.

Includes models: hidden Markov models (n-grams, regexes, k-order HMMs); PSRs, OOMs, multiplicity automata, RR-HMMs; LTI systems (Kalman filter, AR, ARMA); kernel and manifold versions of the above.

Includes algorithms: subspace identification for LTI systems, with recent extensions to HMMs and RR-HMMs; Tomasi-Kanade structure from motion; principal components analysis (e.g., eigenfaces) and Laplacian eigenmaps; new algorithms for learning PSRs, OOMs, etc.; a new way to reduce noise in manifold learning; a new range-only SLAM algorithm.

Interesting applications: structure from motion; simultaneous localization and mapping (range-only SLAM, SLAM from inertial sensors, very simple vision-based SLAM so far); video textures; opponent modeling, option pricing, audio event classification, ...

Bayes filter (HMM, Kalman, etc.). Belief: Pt(st) = P(st | o1:t). Goal: given ot, update Pt-1(st-1) to Pt(st). Extend: Pt-1(st-1, ot, st) = Pt-1(st-1) P(st | st-1) P(ot | st). Marginalize: Pt-1(ot, st) = ∫ Pt-1(st-1, ot, st) dst-1. Condition: Pt(st) = Pt-1(ot, st) / Pt-1(ot).
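The extend/marginalize/condition steps above can be sketched for a discrete HMM. The transition matrix T, observation matrix O, and initial belief below are made-up illustrative values, not from the talk:

```python
import numpy as np

# T[i, j] = P(s_t = i | s_t-1 = j); O[o, i] = P(o_t = o | s_t = i)
# (both matrices are illustrative, made-up values)
T = np.array([[0.9, 0.2],
              [0.1, 0.8]])
O = np.array([[0.7, 0.1],
              [0.3, 0.9]])

def bayes_filter_step(belief, obs):
    """One update: P_t(s_t) from P_t-1(s_t-1) and the new observation."""
    joint = O[obs][:, None] * T * belief[None, :]  # extend: P(s_t-1, o_t, s_t)
    marg = joint.sum(axis=1)                       # marginalize out s_t-1
    return marg / marg.sum()                       # condition on o_t

belief = np.array([0.5, 0.5])
for obs in [0, 0, 1]:
    belief = bayes_filter_step(belief, obs)
```

Each step stays a proper probability distribution over states, since the final division renormalizes by Pt-1(ot).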

nb: a common form for Bayes filters. Find covariances as (linear) functions of the previous state: Σo(st-1) = E[φ(ot) φ(ot)ᵀ | st-1] and Σso(st-1) = E[st φ(ot)ᵀ | st-1] (uncentered covariances; st and φ(ot) include a constant). Then a linear regression gives the current state: st = Σso Σo^-1 φ(ot). Exact if discrete (HMM, PSR), Gaussian (Kalman, AR), or an RKHS with a characteristic kernel [Fukumizu et al.]; approximates many more.
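As a numerical check of this form, a sketch on a made-up 2-state, 2-observation HMM: with one-hot features φ(o), the regression update Σso Σo^-1 φ(ot) reproduces the exact Bayes posterior:

```python
import numpy as np

# Made-up HMM: T[i, j] = P(s_t = i | s_t-1 = j), O[o, i] = P(o_t = o | s_t = i)
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.7, 0.1], [0.3, 0.9]])
b = np.array([0.6, 0.4])            # belief over s_t-1 (arbitrary)

p_next = T @ b                      # P(s_t)
p_obs = O @ p_next                  # P(o_t)
Sigma_o = np.diag(p_obs)            # E[phi(o) phi(o)^T] for one-hot phi(o)
Sigma_so = O.T * p_next[:, None]    # entry (i, o) = P(s_t = i, o_t = o)

obs = 1
phi = np.eye(2)[obs]
s_linear = Sigma_so @ np.linalg.inv(Sigma_o) @ phi

# exact Bayes posterior P(s_t | o_t = obs) for comparison
s_bayes = O[obs] * p_next
s_bayes /= s_bayes.sum()
```

Dividing the joint column P(st, ot = obs) by P(ot = obs) is exactly the condition step, which is what the matrix product computes.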

Why this form? If the Bayes filter takes (approximately) the above form, we can design a simple spectral algorithm to identify the system. Intuitions: predictive state; rank bottleneck; observable representation.

Predictive state. If the Bayes filter takes the above form, we can use a vector of predictions of observables as our state: E[φ(ot+k) | st] is a linear function of st, and for big enough k, E[(φ(ot+1), ..., φ(ot+k)) | st] is an invertible linear function of st; so take E[(φ(ot+1), ..., φ(ot+k)) | st] as our state.

Predictive state: example.

Predictive state: minimal example. P(st | st-1) = T st-1 and P(ot | st-1) = O T st-1, for transition matrix T and observation matrix O. For big enough k, E[(φ(ot+1), ..., φ(ot+k)) | st] = W st, an invertible linear function (if the system is observable).
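One way to see the invertibility claim, sketched with made-up T and O matrices: stack the expected one-hot future observations E[φ(ot+j) | st] = O Tʲ st for j = 1..k into W and check that W has full column rank, so the predictive state W st determines st:

```python
import numpy as np

# Made-up 2-state, 2-observation HMM (columns of T and O are distributions)
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.7, 0.1], [0.3, 0.9]])

# W stacks the linear maps s_t -> E[phi(o_t+j) | s_t] = O T^j s_t, j = 1..k
k = 2
W = np.vstack([O @ np.linalg.matrix_power(T, j) for j in range(1, k + 1)])
rank = np.linalg.matrix_rank(W)   # full column rank => W s_t is invertible
```

For this system even k = 1 suffices (O T is already invertible); k only needs to grow when single-step observation distributions fail to separate states.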

Predictive state: summary. E[(φ(ot+1), ..., φ(ot+k)) | st] is our state representation: interpretable, observable, and a natural target for learning. To find st, predict φ(ot+1), ..., φ(ot+k) from o1:t.

Rank bottleneck. Predict data about the future (many samples) from data about the past (many samples): compress through a state bottleneck, then expand. We can find the best rank-k bottleneck via matrix factorization, a spectral method.
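The best rank-k bottleneck can be illustrated with a truncated SVD (the Eckart-Young best rank-k approximation); the matrix here is synthetic:

```python
import numpy as np

# Synthetic near-rank-k matrix: rank-k signal plus small noise
rng = np.random.default_rng(0)
k = 3
U_true = rng.normal(size=(50, k))
V_true = rng.normal(size=(k, 40))
M = U_true @ V_true + 0.01 * rng.normal(size=(50, 40))

# Compress through a rank-k bottleneck, then expand: truncated SVD
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = U[:, :k] * s[:k] @ Vt[:k]
err = np.linalg.norm(M - M_k) / np.linalg.norm(M)
```

The truncated reconstruction recovers essentially all of the signal while discarding the noise that lives outside the top-k subspace.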

Finding a compact predictive state. A linear function W predicts the future features Yt from the past features Xt, Y ≈ W X, subject to the constraint rank(W) ≤ k.
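A minimal sketch of the rank-constrained regression on synthetic data: fit ordinary least squares, then truncate the weight matrix to rank k by SVD. (This is the simplest illustrative variant, not necessarily the exact estimator used in the talk.)

```python
import numpy as np

# Synthetic past features X, future features Y = W_true X + noise,
# where W_true has rank k by construction
rng = np.random.default_rng(1)
n, dx, dy, k = 2000, 20, 15, 4
X = rng.normal(size=(dx, n))
W_true = rng.normal(size=(dy, k)) @ rng.normal(size=(k, dx))
Y = W_true @ X + 0.1 * rng.normal(size=(dy, n))

# OLS fit, then enforce rank(W) <= k via truncated SVD
W_ols = Y @ X.T @ np.linalg.inv(X @ X.T)
U, s, Vt = np.linalg.svd(W_ols, full_matrices=False)
W_k = U[:, :k] * s[:k] @ Vt[:k]
rel_err = np.linalg.norm(W_k - W_true) / np.linalg.norm(W_true)
```

With enough samples the truncated estimate recovers the true low-rank map closely; the rank constraint is what forces the learned predictor through a k-dimensional state.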

Observable representation. Now that we have a compact predictive state, we need to estimate the functions Σo(st) and Σso(st). Insight: the parameters are now observable; the problem is just to estimate some covariances from data. To get Σo, regress (φ(o) ⊗ φ(o)) on the state; to get Σso, regress (φ(o) ⊗ future+) on the state.

Algorithm. Find a compact predictive state: regress future on past, using any features of past/future, or a kernel, or even a learned kernel (manifold case), and constrain the rank of the prediction weight matrix. Extract the model: regress (o ⊗ future+) and (o ⊗ o) on the predictive state; for future+, use the same rank-constrained basis as for the predictive state.

Does it work? Simple HMM: 5 states, 7 observations; training-set sizes N = 300 through N = 900k. [Figure: transition and observation matrices.]

Does it work? Mean absolute error of the estimated P(o1, o2, o3) against the true model (10^0 = predict uniform) falls off as c/sqrt(N) with training data, shown for two runs.
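The c/sqrt(N) trend can be sanity-checked with plain empirical frequencies (not the spectral algorithm itself) on a made-up 2-state, 2-observation HMM:

```python
import numpy as np

# Made-up HMM: T[i, j] = P(s' = i | s = j), O[o, i] = P(o | s = i)
rng = np.random.default_rng(2)
T = np.array([[0.9, 0.2], [0.1, 0.8]])
O = np.array([[0.7, 0.1], [0.3, 0.9]])
pi = np.array([0.5, 0.5])

# Exact P(o1, o2, o3), marginalizing over the state sequence (a, b, c)
P = np.einsum('a,ba,cb,ia,jb,kc->ijk', pi, T, T, O, O, O)

def empirical(n):
    """Estimate P(o1, o2, o3) from n sampled length-3 sequences."""
    counts = np.zeros((2, 2, 2))
    for _ in range(n):
        s, obs = rng.choice(2, p=pi), []
        for _ in range(3):
            obs.append(rng.choice(2, p=O[:, s]))
            s = rng.choice(2, p=T[:, s])
        counts[tuple(obs)] += 1
    return counts / n

err_small = np.abs(empirical(300) - P).mean()
err_big = np.abs(empirical(30000) - P).mean()
```

A 100x larger sample should shrink the error by roughly a factor of 10, matching the c/sqrt(N) slope in the plot.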

Discussion. Impossibility: learning DFAs in polynomial time would break cryptographic primitives [Kearns & Valiant 89], so clearly we can't always be statistically efficient; but see McDonald [11], HKZ [09], and us [09]: convergence depends on the mixing rate. Nonlinearity: the Bayes filter update is highly nonlinear in the state (matrix inverse), even though we use a linear regression to identify the model; this nonlinearity is essential for expressiveness.

Range-only SLAM. The robot measures the distance from its current position to fixed beacons (e.g., by time of flight or signal strength), and may also have odometry. Goal: recover the robot path and the landmark locations.

Typical solution: EKF. [Figure: prior, and posterior after one range measurement to a known landmark; Bayes filter vs. EKF.] Problem: the linear/Gaussian approximation is very bad, since the true posterior after a range measurement is an annulus, not a Gaussian.

Spectral solution (simple version). Collect the halved squared distances d²nt from each landmark n to the robot at each time step t into an N x T matrix. With landmark positions (xn, yn) and robot positions (ut, vt), d²nt = (xn - ut)² + (yn - vt)², so (1/2) d²nt = [1, xn, yn, (xn² + yn²)/2] · [(ut² + vt²)/2, -ut, -vt, 1]ᵀ, and the matrix factors through rank 4.
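The rank-4 structure can be verified numerically; the landmarks and path below are synthetic:

```python
import numpy as np

# Random synthetic landmarks (x, y) and robot path (u, v)
rng = np.random.default_rng(3)
N, Tn = 6, 50
land = rng.uniform(-5, 5, size=(N, 2))
path = rng.uniform(-5, 5, size=(Tn, 2))

# N x T matrix of halved squared landmark-to-robot distances
D2 = ((land[:, None, :] - path[None, :, :]) ** 2).sum(axis=-1)
M = 0.5 * D2
rank = np.linalg.matrix_rank(M)   # rank 4, not full

# the explicit rank-4 factorization from the slide
L = np.column_stack([np.ones(N), -land[:, 0], -land[:, 1],
                     0.5 * (land ** 2).sum(axis=1)])
R = np.column_stack([0.5 * (path ** 2).sum(axis=1),
                     path[:, 0], path[:, 1], np.ones(Tn)]).T
```

Factoring M (e.g., by SVD) therefore recovers landmark and robot coordinates up to a shared linear transform, which is the basis of the spectral SLAM algorithm.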

Experiments (full version). Autonomous lawn mower, with range measurements collected from radio nodes; datasets Plaza 1 and Plaza 2 [Djugash & Singh, 2008]. On one dataset the robot drove a pattern balancing left and right turns (commonly used for lawn mowing), receiving 3,529 range measurements; on the other it traveled 1.3 km heading the same direction repeatedly, receiving 1,816 range measurements. Spectral SLAM recovers the path and landmarks in both. [Figures: dead reckoning, true and estimated paths, true and estimated landmarks.]

Comparison of range-only SLAM algorithms (localization RMSE):
Method                          Plaza 1   Plaza 2
Dead reckoning (full path)      15.92 m   27.28 m
Cartesian EKF (last 10%)         0.94 m    0.92 m
Batch optimization (last 10%)    0.68 m    0.96 m
FastSLAM (last 10%)              0.73 m    1.14 m
ROP EKF (last 10%)               0.65 m    0.87 m
Spectral SLAM (worst 10%)        1.17 m    0.51 m
Spectral SLAM (best 10%)         0.28 m    0.22 m
Spectral SLAM (full path)        0.39 m    0.35 m

Structure from motion [Tomasi & Kanade, 1992].
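A sketch of the underlying Tomasi-Kanade observation, on a synthetic scene with orthographic cameras: the centered 2F x P measurement matrix of P points tracked over F frames has rank at most 3:

```python
import numpy as np

# Synthetic 3-D scene points and random orthographic views
rng = np.random.default_rng(4)
F, P = 8, 30
points = rng.normal(size=(3, P))

W_rows = []
for _ in range(F):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # random rotation
    proj = Q[:2] @ points + rng.normal(size=(2, 1))  # orthographic view + translation
    W_rows.append(proj)

W = np.vstack(W_rows)                 # 2F x P measurement matrix
W_centered = W - W.mean(axis=1, keepdims=True)  # centering removes translation
rank = np.linalg.matrix_rank(W_centered)
```

Factoring W_centered (again by SVD) splits it into camera rows and 3-D structure columns, which is the spectral step of the algorithm.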

Video textures. A Kalman filter works for some video textures (the steam grate example above); fountain: observation = raw pixels (a vector of reals over time).

Video textures, redux. Original vs. Kalman filter vs. PSR; both models use 10 latent dimensions.

Learning in the loop: option pricing. Price a financial derivative, a "psychic call": the holder gets to say "I bought a call 100 days ago," so the payoff depends on pt versus pt-100. The underlying stock follows Black-Scholes (with unknown parameters).

Option pricing. Solution [Van Roy et al.]: use policy iteration with 16 hand-picked features (e.g., polynomials of the history); initialize the policy arbitrarily; estimate the value function with least-squares temporal differences (LSTD); set the policy to be greedy; repeat.
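LSTD itself can be sketched on a tiny synthetic Markov reward process; the chain, rewards, and features below are made up, and with one-hot features LSTD recovers the exact value function:

```python
import numpy as np

# Tiny made-up Markov reward process: P[j, i] = P(next = j | current = i)
rng = np.random.default_rng(5)
P = np.array([[0.8, 0.3], [0.2, 0.7]])
r = np.array([1.0, 0.0])
gamma = 0.9
Phi = np.eye(2)                      # one-hot features (tabular case)

# LSTD: accumulate A = sum phi (phi - gamma phi')^T and b = sum phi r
A = np.zeros((2, 2))
b = np.zeros(2)
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[:, s])
    phi, phi_next = Phi[s], Phi[s_next]
    A += np.outer(phi, phi - gamma * phi_next)
    b += phi * r[s]
    s = s_next

w = np.linalg.solve(A, b)            # LSTD value-function weights

# exact solution of the Bellman equation v = r + gamma P^T v for comparison
v_exact = np.linalg.solve(np.eye(2) - gamma * P.T, r)
```

With richer feature sets, as in the option-pricing setup, w gives the best linear approximation of the value function in those features rather than the exact values.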

Option pricing. Better solution: spectral SSID inside policy iteration. Use the 16 original features from Van Roy et al. plus 204 additional low-originality features, e.g., linear functions of the price history of the underlying; SSID picks the best 16-d dynamics to explain the feature evolution; then solve for the value function in closed form.

Policy iteration with spectral learning. [Plot: expected payoff per $ of underlying vs. iterations of policy iteration, for Threshold, LSTD (16 features), LSTD, LARS-TD, and PSTD.] PSTD is 0.82¢/$ better than the best competitor and 1.33¢/$ better than the best previously published result. Data: 1,000,000 time steps.

Slot-car IMU data [Ko & Fox, 09; Boots et al., 10]. [Figure: (A) the slot car platform with the IMU (top) and the racetrack (bottom); (B) average squared prediction error vs. prediction horizon for the estimated models (HMM, RR-HMM, LDS, Hilbert space embedding) and baselines (mean, last observation).]

Kalman < HSE-HMM < manifold. [Figure: RMS prediction error (XY position) vs. prediction horizon on the slot-car IMU data, comparing mean observation, previous observation, Kalman filter, HMM (EM), HSE-HMM, and Manifold-HMM, plus a comparison of kernels: linear, Gaussian RBF, Laplacian eigenmap.]

Algorithm intuition. Use separate one-manifold learners to estimate structure in the past and in the future; the result is a pair of noisy Gram matrices GX and GY. The signal is correlated between GX and GY, while the noise is independent. So look at the eigensystem of GXGY: it suppresses the noise and leaves the signal.
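The noise-suppression claim can be illustrated on synthetic Gram matrices; all sizes, spectra, and noise levels below are made up:

```python
import numpy as np

# Low-rank "signal" Gram matrix S shared by both views
rng = np.random.default_rng(6)
n, k = 200, 2
U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # true signal subspace
S = U @ np.diag([20.0, 15.0]) @ U.T

def sym_noise():
    """Independent symmetric noise with spectral norm around 1.4."""
    B = rng.normal(size=(n, n))
    return (B + B.T) / (2.0 * np.sqrt(n))

GX = S + sym_noise()   # noisy Gram matrix for the past
GY = S + sym_noise()   # noisy Gram matrix for the future

# top-k subspace of the product GX GY stays aligned with the signal,
# because the two independent noise terms do not reinforce each other
V = np.linalg.svd(GX @ GY)[0][:, :k]
alignment = np.linalg.svd(U.T @ V, compute_uv=False).min()
```

The alignment score is the cosine of the largest principal angle between the recovered and true signal subspaces; it stays near 1 because only the shared signal survives the product.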

Summary. A general class of models for dynamical systems, with a fast, statistically consistent learning method. Includes many well-known models and algorithms as special cases: HMMs, Kalman filters, n-grams, PSRs, and kernel versions; SfM, subspace ID, and Kalman-filter video textures. Also includes new models and algorithms that give state-of-the-art performance on interesting tasks: range-only SLAM, PSR video textures, HSE-HMM.

Papers. B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011. B. Boots and G. Gordon. Predictive State Temporal Difference Learning. NIPS, 2010. B. Boots, S. M. Siddiqi, and G. Gordon. Closing the Learning-Planning Loop with Predictive State Representations. RSS, 2010. L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert Space Embeddings of Hidden Markov Models. ICML, 2010. (Best paper.)


Dimensionality Reduction by Unsupervised Regression Dimensionality Reduction by Unsupervised Regression Miguel Á. Carreira-Perpiñán, EECS, UC Merced http://faculty.ucmerced.edu/mcarreira-perpinan Zhengdong Lu, CSEE, OGI http://www.csee.ogi.edu/~zhengdon

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date

More information

Spectral Probabilistic Modeling and Applications to Natural Language Processing

Spectral Probabilistic Modeling and Applications to Natural Language Processing School of Computer Science Spectral Probabilistic Modeling and Applications to Natural Language Processing Doctoral Thesis Proposal Ankur Parikh Thesis Committee: Eric Xing, Geoff Gordon, Noah Smith, Le

More information

Latent Tree Approximation in Linear Model

Latent Tree Approximation in Linear Model Latent Tree Approximation in Linear Model Navid Tafaghodi Khajavi Dept. of Electrical Engineering, University of Hawaii, Honolulu, HI 96822 Email: navidt@hawaii.edu ariv:1710.01838v1 [cs.it] 5 Oct 2017

More information

UAV Navigation: Airborne Inertial SLAM

UAV Navigation: Airborne Inertial SLAM Introduction UAV Navigation: Airborne Inertial SLAM Jonghyuk Kim Faculty of Engineering and Information Technology Australian National University, Australia Salah Sukkarieh ARC Centre of Excellence in

More information

Simultaneous Localization and Mapping

Simultaneous Localization and Mapping Simultaneous Localization and Mapping Miroslav Kulich Intelligent and Mobile Robotics Group Czech Institute of Informatics, Robotics and Cybernetics Czech Technical University in Prague Winter semester

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Analysis of Spectral Kernel Design based Semi-supervised Learning

Analysis of Spectral Kernel Design based Semi-supervised Learning Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,

More information

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino ROBOTICS 01PEEQW Basilio Bona DAUIN Politecnico di Torino Probabilistic Fundamentals in Robotics Gaussian Filters Course Outline Basic mathematical framework Probabilistic models of mobile robots Mobile

More information

arxiv: v1 [cs.lg] 12 Dec 2009

arxiv: v1 [cs.lg] 12 Dec 2009 Closing the Learning-Planning Loop with Predictive State Representations Byron Boots Machine Learning Department Carnegie Mellon University Pittsburgh, PA 15213 beb@cs.cmu.edu Sajid M. Siddiqi Robotics

More information

Restricted Boltzmann Machines for Collaborative Filtering

Restricted Boltzmann Machines for Collaborative Filtering Restricted Boltzmann Machines for Collaborative Filtering Authors: Ruslan Salakhutdinov Andriy Mnih Geoffrey Hinton Benjamin Schwehn Presentation by: Ioan Stanculescu 1 Overview The Netflix prize problem

More information

PROBABILISTIC REASONING OVER TIME

PROBABILISTIC REASONING OVER TIME PROBABILISTIC REASONING OVER TIME In which we try to interpret the present, understand the past, and perhaps predict the future, even when very little is crystal clear. Outline Time and uncertainty Inference:

More information

Chapter 3. Tohru Katayama

Chapter 3. Tohru Katayama Subspace Methods for System Identification Chapter 3 Tohru Katayama Subspace Methods Reading Group UofA, Edmonton Barnabás Póczos May 14, 2009 Preliminaries before Linear Dynamical Systems Hidden Markov

More information

CS 532: 3D Computer Vision 6 th Set of Notes

CS 532: 3D Computer Vision 6 th Set of Notes 1 CS 532: 3D Computer Vision 6 th Set of Notes Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Office: Lieb 215 Lecture Outline Intro to Covariance

More information

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression

CSC2515 Winter 2015 Introduction to Machine Learning. Lecture 2: Linear regression CSC2515 Winter 2015 Introduction to Machine Learning Lecture 2: Linear regression All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html

More information

TSRT14: Sensor Fusion Lecture 9

TSRT14: Sensor Fusion Lecture 9 TSRT14: Sensor Fusion Lecture 9 Simultaneous localization and mapping (SLAM) Gustaf Hendeby gustaf.hendeby@liu.se TSRT14 Lecture 9 Gustaf Hendeby Spring 2018 1 / 28 Le 9: simultaneous localization and

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Laplacian Agent Learning: Representation Policy Iteration

Laplacian Agent Learning: Representation Policy Iteration Laplacian Agent Learning: Representation Policy Iteration Sridhar Mahadevan Example of a Markov Decision Process a1: $0 Heaven $1 Earth What should the agent do? a2: $100 Hell $-1 V a1 ( Earth ) = f(0,1,1,1,1,...)

More information

Learning SVM Classifiers with Indefinite Kernels

Learning SVM Classifiers with Indefinite Kernels Learning SVM Classifiers with Indefinite Kernels Suicheng Gu and Yuhong Guo Dept. of Computer and Information Sciences Temple University Support Vector Machines (SVMs) (Kernel) SVMs are widely used in

More information

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Hidden Markov Models Barnabás Póczos & Aarti Singh Slides courtesy: Eric Xing i.i.d to sequential data So far we assumed independent, identically distributed

More information

Dimension Reduction. David M. Blei. April 23, 2012

Dimension Reduction. David M. Blei. April 23, 2012 Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

November 28 th, Carlos Guestrin 1. Lower dimensional projections

November 28 th, Carlos Guestrin 1. Lower dimensional projections PCA Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 28 th, 2007 1 Lower dimensional projections Rather than picking a subset of the features, we can new features that are

More information

Localization. Howie Choset Adapted from slides by Humphrey Hu, Trevor Decker, and Brad Neuman

Localization. Howie Choset Adapted from slides by Humphrey Hu, Trevor Decker, and Brad Neuman Localization Howie Choset Adapted from slides by Humphrey Hu, Trevor Decker, and Brad Neuman Localization General robotic task Where am I? Techniques generalize to many estimation tasks System parameter

More information

Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

More information

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with

More information

Learning Predictive Models of a Depth Camera & Manipulator from Raw Execution Traces

Learning Predictive Models of a Depth Camera & Manipulator from Raw Execution Traces Learning Predictive Models of a Depth Camera & Manipulator from Raw Execution Traces Byron Boots Arunkumar Byravan Dieter Fox Abstract In this paper, we attack the problem of learning a predictive model

More information

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF) Case Study 4: Collaborative Filtering Review: Probabilistic Matrix Factorization Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 2 th, 214 Emily Fox 214 1 Probabilistic

More information

Kernel-Based Contrast Functions for Sufficient Dimension Reduction

Kernel-Based Contrast Functions for Sufficient Dimension Reduction Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach

More information

CS491/691: Introduction to Aerial Robotics

CS491/691: Introduction to Aerial Robotics CS491/691: Introduction to Aerial Robotics Topic: State Estimation Dr. Kostas Alexis (CSE) World state (or system state) Belief state: Our belief/estimate of the world state World state: Real state of

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient

More information

Modeling Multiple-mode Systems with Predictive State Representations

Modeling Multiple-mode Systems with Predictive State Representations Modeling Multiple-mode Systems with Predictive State Representations Britton Wolfe Computer Science Indiana University-Purdue University Fort Wayne wolfeb@ipfw.edu Michael R. James AI and Robotics Group

More information

1 Bayesian Linear Regression (BLR)

1 Bayesian Linear Regression (BLR) Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,

More information

Rao-Blackwellized Particle Filtering for 6-DOF Estimation of Attitude and Position via GPS and Inertial Sensors

Rao-Blackwellized Particle Filtering for 6-DOF Estimation of Attitude and Position via GPS and Inertial Sensors Rao-Blackwellized Particle Filtering for 6-DOF Estimation of Attitude and Position via GPS and Inertial Sensors GRASP Laboratory University of Pennsylvania June 6, 06 Outline Motivation Motivation 3 Problem

More information