Spectral learning algorithms for dynamical systems
Geoff Gordon
http://www.cs.cmu.edu/~ggordon/
Machine Learning Department, Carnegie Mellon University
Joint work with Byron Boots, Sajid Siddiqi, Le Song, Alex Smola
What's out there?
..., o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ...
video frames, as vectors of pixel intensities
A dynamical system
Past | Future: ..., o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ...
Given past observations from a partially observable system (summarized in a state), predict future observations.
This talk
- A general class of models for dynamical systems
- A fast, statistically consistent learning algorithm: no local optima
- Includes many well-known models & algorithms as special cases
- Also includes new models & algorithms that give state-of-the-art performance on interesting tasks
Includes models
- hidden Markov model; n-grams, regexes, k-order HMMs
- PSR, OOM, multiplicity automaton, RR-HMM
- LTI system: Kalman filter, AR, ARMA
- kernel versions and manifold versions of the above
Includes algorithms
- subspace identification for LTI systems, with recent extensions to HMMs and RR-HMMs
- Tomasi-Kanade structure from motion
- principal components analysis (e.g., eigenfaces); Laplacian eigenmaps
- new algorithms for learning PSRs, OOMs, etc.
- a new way to reduce noise in manifold learning
- a new range-only SLAM algorithm
Interesting applications
- structure from motion
- simultaneous localization and mapping: range-only SLAM, SLAM from inertial sensors, very simple vision-based SLAM (so far)
- video textures
- opponent modeling, option pricing, audio event classification, ...
Bayes filter (HMM, Kalman, etc.)
observations ..., o_{t-1}, o_t, o_{t+1}, ...; hidden states ..., s_{t-1}, s_t, s_{t+1}, ...
Belief: P_t(s_t) = P(s_t | o_{1:t})
Goal: given o_t, update P_{t-1}(s_{t-1}) to P_t(s_t)
- Extend: P_{t-1}(s_{t-1}, o_t, s_t) = P_{t-1}(s_{t-1}) P(s_t | s_{t-1}) P(o_t | s_t)
- Marginalize: P_{t-1}(o_t, s_t) = ∫ P_{t-1}(s_{t-1}, o_t, s_t) ds_{t-1}
- Condition: P_t(s_t) = P_{t-1}(o_t, s_t) / P_{t-1}(o_t)
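For concreteness, here is a minimal sketch of the extend/marginalize/condition recursion for a discrete HMM (the function and variable names are illustrative, not from the talk):

```python
import numpy as np

def bayes_filter_step(belief, obs, T, O):
    """One Bayes filter update: P(s_{t-1} | o_{1:t-1}) -> P(s_t | o_{1:t}).

    belief: length-n vector, belief[i] = P(s_{t-1} = i | o_{1:t-1})
    obs:    the observed symbol o_t (an integer in 0..m-1)
    T:      n x n transition matrix, T[i, j] = P(s_t = i | s_{t-1} = j)
    O:      m x n observation matrix, O[o, i] = P(o_t = o | s_t = i)
    """
    # Extend and marginalize: P(o_t = obs, s_t | o_{1:t-1})
    joint = O[obs, :] * (T @ belief)
    # Condition: normalize by P(o_t = obs | o_{1:t-1})
    return joint / joint.sum()
```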
nb: A common form for Bayes filters
Find covariances as (linear) functions of the previous state:
Σ_o(s_{t-1}) = E(φ(o_t) φ(o_t)^T | s_{t-1})
Σ_{so}(s_{t-1}) = E(s_t φ(o_t)^T | s_{t-1})
(uncentered covariances; s_t and φ(o_t) include a constant)
Linear regression to get the current state: s_t = Σ_{so} Σ_o^{-1} φ(o_t)
Exact if discrete (HMM, PSR), Gaussian (Kalman, AR), or RKHS with a characteristic kernel [Fukumizu et al.]; approximates many more.
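As a sanity check on this form, the sketch below (a hypothetical discrete example with indicator features, so s_t and φ(o_t) are one-hot) builds Σ_o and Σ_{so} as linear functions of the previous belief and verifies that the regression update s_t = Σ_{so} Σ_o^{-1} φ(o_t) reproduces the ordinary forward update:

```python
import numpy as np

T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])      # T[i, j] = P(s_t = i | s_{t-1} = j)
O = np.array([[0.9, 0.4, 0.1],
              [0.1, 0.6, 0.9]])      # O[o, i] = P(o_t = o | s_t = i)
belief = np.array([0.5, 0.3, 0.2])   # P(s_{t-1} | o_{1:t-1})
obs = 0                              # the observed symbol o_t

# Uncentered conditional covariances, linear in the belief:
p_next   = T @ belief                # predicted state distribution
Sigma_o  = np.diag(O @ p_next)       # E[phi(o_t) phi(o_t)^T | belief]
Sigma_so = np.diag(p_next) @ O.T     # E[s_t phi(o_t)^T | belief]

phi = np.eye(2)[obs]                 # indicator feature of o_t
s_regression = Sigma_so @ np.linalg.solve(Sigma_o, phi)

# The standard forward update gives the same posterior:
s_forward = O[obs, :] * p_next
s_forward = s_forward / s_forward.sum()
assert np.allclose(s_regression, s_forward)
```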
Why this form?
If the Bayes filter takes (approximately) the above form, we can design a simple spectral algorithm to identify the system.
Intuitions:
- predictive state
- rank bottleneck
- observable representation
Predictive state
If the Bayes filter takes the above form, we can use a vector of predictions of observables as our state:
E(φ(o_{t+k}) | s_t) = linear function of s_t
for big enough k, E([φ(o_{t+1}); ...; φ(o_{t+k})] | s_t) = invertible linear function of s_t
so, take E([φ(o_{t+1}); ...; φ(o_{t+k})] | s_t) as our state
Predictive state: example
[figure: animation illustrating the predictive state]
Predictive state: minimal example
P(s_t | s_{t-1}) = T = [2/3 0 1/4; 1/3 2/3 0; 0 1/3 3/4]
P(o_t | s_t) = O = [1/2 1 0; 1/2 0 1]
For big enough k, E([φ(o_{t+1}); ...; φ(o_{t+k})] | s_t) = W s_t, an invertible linear function (if the system is observable). Here W stacks the blocks OT, OT², ...; e.g. OT = [2/3 2/3 1/8; 1/3 1/3 7/8].
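A quick numerical check of the invertibility claim, using T and O as reconstructed above (so treat the numbers as illustrative):

```python
import numpy as np

T = np.array([[2/3, 0,   1/4],
              [1/3, 2/3, 0  ],
              [0,   1/3, 3/4]])   # P(s_t | s_{t-1}); columns sum to 1
O = np.array([[1/2, 1, 0],
              [1/2, 0, 1]])       # P(o_t | s_t)

# E(phi(o_{t+j}) | s_t) = O T^j s_t, so stacking j = 1, 2 gives W for k = 2:
W = np.vstack([O @ T, O @ T @ T])  # 4 x 3

# Rank 3: W is injective, i.e., an invertible linear map onto its range,
# so the k-step predictions determine the latent state.
print(np.linalg.matrix_rank(W))    # -> 3
```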
Predictive state: summary
E([φ(o_{t+1}); ...; φ(o_{t+k})] | s_t) is our state representation:
- interpretable
- observable
- a natural target for learning
To find s_t, predict φ(o_{t+1}), ..., φ(o_{t+k}) from o_{1:t}.
Rank bottleneck
predict: data about past (many samples) → compress → bottleneck (state) → expand → data about future (many samples)
Can find the best rank-k bottleneck via matrix factorization: a spectral method.
Finding a compact predictive state
Find W such that Y ≈ W X: a linear function that predicts the future features Y_t from the past features X_t, subject to rank(W) ≤ k (sketched below).
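One common way to solve this rank-constrained regression is via an SVD of the least-squares predictions; a minimal sketch (the helper name, ridge term, and data layout are assumptions):

```python
import numpy as np

def reduced_rank_regression(X, Y, k, ridge=1e-6):
    """Find a rank-k W with Y ~ W X.
    X: d_x x n past features; Y: d_y x n future features (columns = time)."""
    # Unconstrained ridge regression first
    W_full = Y @ X.T @ np.linalg.inv(X @ X.T + ridge * np.eye(X.shape[0]))
    # Project the fitted values onto their top-k left singular vectors
    U, _, _ = np.linalg.svd(W_full @ X, full_matrices=False)
    U = U[:, :k]                    # basis for the k-dim predictive state
    W = U @ (U.T @ W_full)          # rank-k prediction weights
    states = U.T @ (W_full @ X)     # compact predictive states, k x n
    return W, U, states
```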
Observable representation
Now that we have a compact predictive state, we need to estimate the functions Σ_o(s_t) and Σ_{so}(s_t).
Insight: the parameters are now observable; the problem is just to estimate some covariances from data:
- to get Σ_o, regress (φ(o) ⊗ φ(o)) onto the state
- to get Σ_{so}, regress (φ(o) ⊗ future+) onto the state
[diagram: past ..., o_{t-1} | state | future o_t, o_{t+1}, ... | future+]
Algorithm
1. Find a compact predictive state: regress future onto past
   - use any features of past/future, or a kernel, or even a learned kernel (manifold case)
   - constrain the rank of the prediction weight matrix
2. Extract the model: regress (o ⊗ future+) and (o ⊗ o) onto the [predictive state]
   - for future+, use the same rank-constrained basis as for the predictive state
(a schematic sketch of both stages follows)
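Putting the two stages together, a schematic sketch (all names and the exact feature/indexing conventions are assumptions; real implementations differ across the papers):

```python
import numpy as np

def ridge(Y, X, lam=1e-6):
    """Least squares: W with Y ~ W X."""
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(X.shape[0]))

def khatri_rao(A, B):
    """Columnwise tensor product: column t is A[:, t] (x) B[:, t]."""
    return (A[:, None, :] * B[None, :, :]).reshape(A.shape[0] * B.shape[0], -1)

def spectral_learn(phi_past, phi_future, phi_obs, k, lam=1e-6):
    """phi_past, phi_future, phi_obs: feature matrices (features x time);
    column t holds features of the past through t, the future after t,
    and the observation at t+1, respectively (indexing details glossed)."""
    # Stage 1: rank-k regression of future on past -> predictive state
    W_full = ridge(phi_future, phi_past, lam)
    U, _, _ = np.linalg.svd(W_full @ phi_past, full_matrices=False)
    U = U[:, :k]
    state = U.T @ (W_full @ phi_past)            # k x n predictive states

    # Stage 2: the model parameters are regression coefficients that
    # predict observation statistics from the state; future+ reuses the
    # same rank-k basis U via the one-step-shifted state
    C_oo = ridge(khatri_rao(phi_obs, phi_obs), state, lam)   # -> Sigma_o(s)
    C_so = ridge(khatri_rao(phi_obs[:, :-1], state[:, 1:]),
                 state[:, :-1], lam)                         # -> Sigma_so(s)
    return U, C_oo, C_so
```

At filtering time, reshaping C_oo @ s and C_so @ s recovers Σ_o(s) and Σ_{so}(s) for the update on the earlier slide.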
Does it work?
Simple HMM: 5 states, 7 observations; training-set sizes N = 300 through N = 900k.
[figures: transition matrix T and observation matrix]
Does it work?
[figure: mean absolute error between the learned model's and the true P(o_1, o_2, o_3) vs. training-set size, log-log scale, for two runs; 10^0 = predict uniform; the error tracks a c/sqrt(n) reference line]
Discussion
Impossibility: learning DFAs in polynomial time would break cryptographic primitives [Kearns & Valiant '89]
- so, clearly, we can't always be statistically efficient
- but see McDonald ['11], HKZ ['09], us ['09]: convergence depends on the mixing rate
Nonlinearity: the Bayes filter update is highly nonlinear in the state (matrix inverse), even though we use a linear regression to identify the model
- nonlinearity is essential for expressiveness
Range-only SLAM
The robot measures distances from its current position to fixed beacons (e.g., by time of flight or signal strength); it may also have odometry.
Goal: recover the robot path and the landmark locations.
Typical solution: EKF
[figure: prior, and posterior after one range measurement to a known landmark; exact Bayes filter vs. EKF]
Problem: the linear/Gaussian approximation is very bad.
Spectral solution (simple version)
Collect the squared ranges d²_{nt} from each landmark n to the robot pose at each time step t into an N x T matrix. Then
(1/2) [d²_{11} ... d²_{1T}; ...; d²_{N1} ... d²_{NT}]
  = [1 x_1 y_1 (x_1²+y_1²)/2; ...; 1 x_N y_N (x_N²+y_N²)/2]
    · [(u_1²+v_1²)/2 ... (u_T²+v_T²)/2; -u_1 ... -u_T; -v_1 ... -v_T; 1 ... 1]
with landmark positions (x_n, y_n) and robot positions (u_t, v_t): a rank-4 factorization.
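A small synthetic check of the rank property that makes this work (hypothetical setup; the factors are only recovered up to an invertible 4 x 4 transform, which the full method must then resolve):

```python
import numpy as np

rng = np.random.default_rng(0)
landmarks = rng.uniform(-10, 10, size=(6, 2))    # N = 6 landmark positions (x, y)
path      = rng.uniform(-10, 10, size=(50, 2))   # T = 50 robot positions (u, v)

# D2[n, t] = squared range from landmark n to the robot at time t
D2 = ((landmarks[:, None, :] - path[None, :, :]) ** 2).sum(axis=-1)

print(np.linalg.matrix_rank(D2))                 # -> 4

# A rank-4 SVD of D2 / 2 therefore recovers the landmark factor
# [1, x, y, (x^2 + y^2)/2] and the path factor up to a shared
# invertible linear transform.
U, s, Vt = np.linalg.svd(D2 / 2)
approx = U[:, :4] @ np.diag(s[:4]) @ Vt[:4]
assert np.allclose(approx, D2 / 2)
```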
Experiments (full version)
Autonomous lawn mower; Plaza 1 and Plaza 2 datasets [Djugash & Singh, 2008].
[figures: dead reckoning vs. true path, estimated path, true landmarks, and estimated landmarks for Plaza 1 and Plaza 2]
Experiments (full version)
Comparison of range-only SLAM algorithms (localization RMSE; best in column shown in bold on the slide):

Method | Plaza 1 | Plaza 2
Dead Reckoning (full path) | 15.92 m | 27.28 m
Cartesian EKF (last 10%) | 0.94 m | 0.92 m
Batch Optimization (last 10%) | 0.68 m | 0.96 m
FastSLAM (last 10%) | 0.73 m | 1.14 m
ROP EKF (last 10%) | 0.65 m | 0.87 m
Spectral SLAM (worst 10%) | 1.17 m | 0.51 m
Spectral SLAM (best 10%) | 0.28 m | 0.22 m
Spectral SLAM (full path) | 0.39 m | 0.35 m

In one experiment the robot mowed in a back-and-forth pattern (which minimizes heading error), receiving 3,529 range measurements; in the other it traveled 1.3 km in the same direction repeatedly, receiving 1,816 range measurements. Spectral SLAM recovers the path and landmarks in both.
Structure from motion [Tomasi & Kanade, 1992]
Video textures
A Kalman filter works for some video textures (the steam grate example above).
Fountain: observation = raw pixels (a vector of reals over time).
Video textures, redux
[videos: original vs. Kalman filter vs. PSR]
Both models: 10 latent dimensions.
Learning in the loop: option pricing
[figure: price series p_t and the 100-day lag p_{t-100}]
Price a financial derivative, the "psychic call": its holder gets to declare "I bought a call 100 days ago"; the underlying stock follows Black-Scholes dynamics (with unknown parameters).
Option pricing
Solution [Van Roy et al.]: use policy iteration (the LSTD step is sketched below)
- 16 hand-picked features (e.g., polynomials of the history)
- initialize the policy arbitrarily
- least-squares temporal differences (LSTD) to estimate the value function
- policy := greedy; repeat
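For reference, a generic LSTD sketch (not necessarily Van Roy et al.'s exact formulation): estimate a linear value function V(s) ≈ wᵀφ(s) from one trajectory of features and rewards.

```python
import numpy as np

def lstd(Phi, rewards, gamma=0.99, ridge=1e-6):
    """Phi: (T+1) x d features along a trajectory; rewards: length-T array."""
    cur, nxt = Phi[:-1], Phi[1:]
    # Solve the TD(0) fixed-point conditions  A w = b
    A = cur.T @ (cur - gamma * nxt) + ridge * np.eye(Phi.shape[1])
    b = cur.T @ rewards
    return np.linalg.solve(A, b)       # value-function weights w
```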
Option pricing
Better solution: spectral SSID inside policy iteration
- 16 original features from Van Roy et al.
- 204 additional low-originality features, e.g., linear functions of the price history of the underlying
- SSID picks the best 16-d dynamics to explain the feature evolution
- solve for the value function in closed form
Policy iteration w/ spectral learning
[figure: expected payoff per $ of underlying (0.95 to 1.30) vs. iterations of policy iteration, comparing Threshold, LSTD (16), LSTD, LARS-TD, and PSTD]
PSTD: 0.82¢/$ better than the best competitor; 1.33¢/$ better than the best previously published. Data: 1,000,000 time steps.
Slot-car IMU data [Ko & Fox, '09; Boots et al., '10]
[figure: (A) the slot car platform with the IMU (top) and the racetrack (bottom); (B) average squared prediction error vs. prediction horizon for the HMM, RR-HMM, LDS, HSE-HMM (Hilbert space embedding), and mean/last-observation baselines]
Kalman < HSE-HMM < manifold
[figure: RMS prediction error (XY position) vs. prediction horizon, comparing Mean Obs., Previous Obs., HMM (EM), Kalman Filter, HSE-HMM, and Manifold-HMM; different state spaces from Linear, Gaussian RBF, and Laplacian eigenmap kernels]
Algorithm intuition
Use separate one-manifold learners to estimate structure in the past and in the future
- result: noisy Gram matrices G_X, G_Y
- the signal is correlated between G_X and G_Y; the noise is independent
Look at the eigensystem of G_X G_Y: suppresses noise, leaves signal (toy sketch below).
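A toy sketch of this intuition (synthetic data, illustrative only): two Gram matrices share a low-rank signal but carry independent noise, and the shared signal dominates the top eigensystem of their product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3
Z = rng.normal(size=(n, k))
S = Z @ Z.T                              # shared low-rank signal

def noisy_gram(S):
    E = rng.normal(scale=5.0, size=S.shape)
    return S + (E + E.T) / 2             # independent symmetric noise

GX, GY = noisy_gram(S), noisy_gram(S)    # "past" and "future" Gram matrices

# Top-k eigenvectors of the product GX @ GY vs. the true signal subspace
vals, vecs = np.linalg.eig(GX @ GY)
U = np.real(vecs[:, np.argsort(-np.abs(vals))[:k]])
U_true = np.linalg.svd(Z, full_matrices=False)[0]

# Smallest principal-angle cosine between subspaces (1.0 = perfect recovery)
print(np.linalg.svd(U_true.T @ U, compute_uv=False).min())
```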
Summary
- A general class of models for dynamical systems; a fast, statistically consistent learning method
- Includes many well-known models & algorithms as special cases: HMM, Kalman filter, n-gram, PSR, kernel versions; SfM, subspace ID, Kalman video textures
- Also includes new models & algorithms that give state-of-the-art performance on interesting tasks: range-only SLAM, PSR video textures, HSE-HMM
Papers
- B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.
- B. Boots and G. Gordon. Predictive state temporal difference learning. NIPS, 2010.
- B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
- L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010. (Best paper)