Spectral learning algorithms for dynamical systems


Spectral learning algorithms for dynamical systems. Geoff Gordon, http://www.cs.cmu.edu/~ggordon/, Machine Learning Department, Carnegie Mellon University. Joint work with Byron Boots, Sajid Siddiqi, Le Song, and Alex Smola.

What's out there? ... o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ... (video frames, as vectors of pixel intensities).

A dynamical system. Past | Future: ... o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ... Given past observations from a partially observable system (with internal state), predict future observations.

This talk: a general class of models for dynamical systems; a fast, statistically consistent learning algorithm with no local optima; includes many well-known models and algorithms as special cases; also includes new models and algorithms that give state-of-the-art performance on interesting tasks.

Models included: hidden Markov models (n-grams, regexes, k-order HMMs); PSRs, OOMs, multiplicity automata, RR-HMMs; LTI systems (Kalman filter, AR, ARMA); kernel and manifold versions of the above.

Algorithms included: subspace identification for LTI systems (with recent extensions to HMMs and RR-HMMs); Tomasi-Kanade structure from motion; principal components analysis (e.g., eigenfaces) and Laplacian eigenmaps; new algorithms for learning PSRs, OOMs, etc.; a new way to reduce noise in manifold learning; a new range-only SLAM algorithm.

Interesting applications: structure from motion; simultaneous localization and mapping (range-only SLAM, SLAM from inertial sensors, very simple vision-based SLAM so far); video textures; opponent modeling, option pricing, audio event classification, ...

Bayes filter (HMM, Kalman, etc.). Observations ... o_{t-2}, o_{t-1}, o_t, o_{t+1}, o_{t+2}, ... and states ... s_{t-2}, s_{t-1}, s_t, s_{t+1}, s_{t+2}, ... Belief: P_t(s_t) = P(s_t | o_{1:t}). Goal: given o_t, update P_{t-1}(s_{t-1}) to P_t(s_t). Extend: P_{t-1}(s_{t-1}, o_t, s_t) = P_{t-1}(s_{t-1}) P(s_t | s_{t-1}) P(o_t | s_t). Marginalize: P_{t-1}(o_t, s_t) = ∫ P_{t-1}(s_{t-1}, o_t, s_t) ds_{t-1}. Condition: P_t(s_t) = P_{t-1}(o_t, s_t) / P_{t-1}(o_t).

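To make the extend/marginalize/condition steps concrete, here is a minimal sketch of one filter update for a discrete HMM. The transition matrix T, observation matrix O, and belief are made-up illustrative values, not numbers from the talk.

```python
import numpy as np

def bayes_filter_step(b_prev, o_t, T, O):
    """One Bayes filter update for a discrete HMM.

    b_prev : belief over s_{t-1}, shape (n,)
    o_t    : index of the observation at time t
    T      : transition matrix, T[i, j] = P(s_t = i | s_{t-1} = j)
    O      : observation matrix, O[k, i] = P(o_t = k | s_t = i)
    """
    # Extend: joint P(s_{t-1}, s_t, o_t) for the observed o_t
    joint = T * b_prev[np.newaxis, :] * O[o_t, :][:, np.newaxis]
    # Marginalize: sum out s_{t-1} to get P(o_t, s_t)
    p_ot_st = joint.sum(axis=1)
    # Condition: divide by P(o_t) to get P_t(s_t)
    return p_ot_st / p_ot_st.sum()

# Illustrative 3-state, 2-observation system (made-up numbers)
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])
O = np.array([[0.9, 0.3, 0.1],
              [0.1, 0.7, 0.9]])
b = np.array([1.0, 0.0, 0.0])
for o in [0, 1, 1, 0]:
    b = bayes_filter_step(b, o, T, O)
print(b)
```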

NB: a common form for Bayes filters. Find covariances as (linear) functions of the previous state: Σ_o(s_{t-1}) = E(φ(o_t) φ(o_t)^T | s_{t-1}) and Σ_so(s_{t-1}) = E(s_t φ(o_t)^T | s_{t-1}) (uncentered covariances; s_t and φ(o_t) include a constant). Then linear regression gives the current state: s_t = Σ_so Σ_o^{-1} φ(o_t). Exact if discrete (HMM, PSR), Gaussian (Kalman, AR), or RKHS with a characteristic kernel [Fukumizu et al.]; approximates many more.
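
As a sanity check on this form, the following sketch (again with made-up T and O, and indicator features φ) builds Σ_o and Σ_so as linear functions of the previous belief for a discrete HMM and verifies that s_t = Σ_so Σ_o^{-1} φ(o_t) matches the classic forward update.

```python
import numpy as np

# Illustrative 3-state, 2-observation HMM (made-up numbers, not from the talk)
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])   # T[i, j] = P(s_t = i | s_{t-1} = j)
O = np.array([[0.9, 0.3, 0.1],
              [0.1, 0.7, 0.9]])   # O[k, i] = P(o_t = k | s_t = i)
b_prev = np.array([0.5, 0.3, 0.2])   # belief over s_{t-1}
o_t = 1                              # observed symbol at time t

pred = T @ b_prev                    # predicted state distribution P(s_t | o_{1:t-1})
# Uncentered covariances, both linear functions of b_prev:
Sigma_o  = np.diag(O @ pred)         # E[phi(o_t) phi(o_t)^T | b_prev]
Sigma_so = np.diag(pred) @ O.T       # E[s_t phi(o_t)^T | b_prev]

phi = np.eye(O.shape[0])[o_t]        # indicator feature of the observation
s_t = Sigma_so @ np.linalg.pinv(Sigma_o) @ phi

# Classic forward update for comparison: predict, then correct and renormalize
forward = pred * O[o_t]
forward /= forward.sum()
assert np.allclose(s_t, forward)
```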

Why this form? If the Bayes filter takes (approximately) the above form, we can design a simple spectral algorithm to identify the system. Intuitions: predictive state, rank bottleneck, observable representation.

Predictive state. If the Bayes filter takes the above form, we can use a vector of predictions of observables as our state: E(φ(o_{t+k}) | s_t) is a linear function of s_t, and for big enough k, E([φ(o_{t+1}) ... φ(o_{t+k})] | s_t) is an invertible linear function of s_t. So take E([φ(o_{t+1}) ... φ(o_{t+k})] | s_t) as our state.

Predictive state: example. [Figure only.]

Predictive state: minimal example. A small HMM with transition matrix T (giving P(s_t | s_{t-1})) and observation matrix O, so that P(o_t | s_{t-1}) = OT; the numeric entries shown on the slide did not survive transcription. For big enough k, E([φ(o_{t+1}) ... φ(o_{t+k})] | s_t) = W s_t, an invertible linear function (if the system is observable).

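Since the slide's numeric matrices did not survive transcription, the sketch below uses made-up T and O of the same flavor. It stacks the blocks E[φ(o_{t+j}) | s_t] = O T^j s_t to form W and checks when W has full column rank, i.e., when the predictive state is an invertible (injective) function of the latent state.

```python
import numpy as np

# Made-up 3-state, 2-observation HMM (not the slide's numbers);
# T[i, j] = P(s_t = i | s_{t-1} = j), O[k, i] = P(o_t = k | s_t = i)
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.3],
              [0.1, 0.2, 0.6]])
O = np.array([[0.9, 0.3, 0.1],
              [0.1, 0.7, 0.9]])

def predictive_map(T, O, k):
    """W such that E[[phi(o_{t+1}); ...; phi(o_{t+k})] | s_t] = W s_t."""
    blocks = [O @ np.linalg.matrix_power(T, j) for j in range(1, k + 1)]
    return np.vstack(blocks)

n = T.shape[0]
for k in (1, 2, 3):
    W = predictive_map(T, O, k)
    # rank n means W is injective: the predictive state determines s_t
    print(k, np.linalg.matrix_rank(W))
```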

Predictive state: summary. E([φ(o_{t+1}) ... φ(o_{t+k})] | s_t) is our state representation: it is interpretable, observable, and a natural target for learning. To find s_t, predict φ(o_{t+1}) ... φ(o_{t+k}) from o_{1:t}.

Rank bottleneck. Predict: data about the past (many samples) is compressed to a state (the bottleneck), then expanded to data about the future (many samples). We can find the best rank-k bottleneck via matrix factorization: a spectral method.
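
A minimal sketch of the compress/expand picture on synthetic data: regress future features on past features, then truncate the fitted map with an SVD to get the best rank-k bottleneck (reduced-rank regression). The feature matrices here are simulated; in practice X and Y would hold features of past and future windows of the observation sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy, k = 2000, 20, 20, 5

# Synthetic data: futures depend on pasts only through a k-dimensional state
state = rng.standard_normal((n, k))
X = state @ rng.standard_normal((k, dx)) + 0.1 * rng.standard_normal((n, dx))
Y = state @ rng.standard_normal((k, dy)) + 0.1 * rng.standard_normal((n, dy))

# Unconstrained regression of future on past, then rank-k truncation via SVD
W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)        # dx x dy
U, s, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
Vk = Vt[:k].T                                        # dy x k
compress = W_ols @ Vk                                # past features -> k-dim state
expand = Vk.T                                        # k-dim state -> future features

state_hat = X @ compress                             # estimated predictive states
Y_hat = state_hat @ expand

err_full = np.linalg.norm(Y - X @ W_ols)
err_rrr = np.linalg.norm(Y - Y_hat)
print(err_full, err_rrr)   # nearly equal here: the rank-k bottleneck loses little
```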

Finding a compact predictive state. Y ≈ W X: a linear function W predicts the future features Y_t from the past features X_t, with rank(W) ≤ k.

Observable representation. Now that we have a compact predictive state, we need to estimate the functions Σ_o(s_t) and Σ_so(s_t). Insight: the parameters are now observable; the problem is just to estimate some covariances from data. To get Σ_o, regress (φ(o) ⊗ φ(o)) on the state; to get Σ_so, regress (φ(o) ⊗ future+) on the state. [Diagram: past ... o_{t-2}, o_{t-1} | future o_t, o_{t+1}, o_{t+2} ...; state; future+.]

Algorithm. Find a compact predictive state: regress future on past; use any features of past/future, or a kernel, or even a learned kernel (manifold case); constrain the rank of the prediction weight matrix. Extract the model: regress (o ⊗ future+) on [predictive state] and (o ⊗ o) on [predictive state]; for future+, use the same rank-constrained basis as for the predictive state.
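
The second-stage regressions can be written down schematically as below. This is only a shapes sketch with placeholder random feature matrices, not a full PSR learner: it shows how the (o ⊗ o) and (o ⊗ future+) statistics are regressed on the predictive state and then reshaped into the covariances the filter needs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, do, df = 5000, 5, 4, 6     # samples, state dim, obs features, future+ features

# Placeholder per-timestep feature matrices; in practice these come from data:
#   state[t] = estimated predictive state at time t (from the first regression)
#   phi_o[t] = features of the observation at time t
#   fplus[t] = features of the "future+" window starting after time t
state = rng.standard_normal((n, k))
phi_o = rng.standard_normal((n, do))
fplus = rng.standard_normal((n, df))

# Second-stage regression targets: outer products, flattened per time step
oo_targets = np.einsum('ti,tj->tij', phi_o, phi_o).reshape(n, -1)    # (o ⊗ o)
of_targets = np.einsum('ti,tj->tij', phi_o, fplus).reshape(n, -1)    # (o ⊗ future+)

C_o,  *_ = np.linalg.lstsq(state, oo_targets, rcond=None)   # k x (do*do)
C_so, *_ = np.linalg.lstsq(state, of_targets, rcond=None)   # k x (do*df)

# Evaluated at a state s, these give the covariances used in the filter update
s = state[0]
Sigma_o  = (s @ C_o).reshape(do, do)       # E[phi(o) phi(o)^T | s]
Sigma_so = (s @ C_so).reshape(do, df).T    # E[psi(future+) phi(o)^T | s]
```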

Does it work? A simple HMM: 5 states, 7 observations; training set sizes from N = 300 through N = 900k. [Figure: the transition and observation matrices.]

Does it work? [Plot: mean absolute error between the learned model and the true P(o_1, o_2, o_3), where 10^0 corresponds to predicting uniform, versus the amount of training data (log scale), for two runs; the error falls off roughly as c/sqrt(n).]

Discussion. Impossibility: learning DFAs in polynomial time would break cryptographic primitives [Kearns & Valiant 89], so clearly we can't always be statistically efficient; but see McDonald [11], HKZ [09], and our work [09]: convergence depends on the mixing rate. Nonlinearity: the Bayes filter update is highly nonlinear in the state (a matrix inverse), even though we use linear regression to identify the model; this nonlinearity is essential for expressiveness.

Range-only SLAM. The robot measures the distance from its current position to fixed beacons (e.g., by time of flight or signal strength), and may also have odometry. Goal: recover the robot path and the landmark locations.

Typical solution: EKF. [Figure: the prior, and the belief after one range measurement to a known landmark; exact Bayes filter vs. EKF.] Problem: the linear/Gaussian approximation is very bad.

Spectral solution (simple version). Build the N x T matrix of halved squared distances, with entry (1/2) d_{ij}^2 for landmark i (at position x_i, y_i) and the robot position at time step j (u_j, v_j). This matrix factors as the product of a landmark matrix with rows [1, x_i, y_i, (x_i^2 + y_i^2)/2] and a robot-position matrix with columns [(u_j^2 + v_j^2)/2, -u_j, -v_j, 1]^T, so it has rank at most 4.
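
A small synthetic check of the low-rank structure (made-up landmark positions and robot path, not the talk's data): the matrix of halved squared distances factors exactly as above, so it has rank 4 and an SVD exposes that structure. Turning the SVD factors into metric coordinates takes an additional linear "upgrade" step not shown here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_lm, n_steps = 6, 200                          # landmarks, time steps
lm = rng.uniform(-10, 10, size=(n_lm, 2))       # landmark positions (x_i, y_i)
path = np.cumsum(rng.normal(0, 0.5, size=(n_steps, 2)), axis=0)   # robot path (u_j, v_j)

# Halved squared distances between every landmark and every robot position
D = 0.5 * ((lm[:, None, :] - path[None, :, :]) ** 2).sum(axis=-1)   # n_lm x n_steps

# The factorization from the slide: D = A @ B with only 4 inner dimensions
A = np.column_stack([np.ones(n_lm), lm[:, 0], lm[:, 1], 0.5 * (lm ** 2).sum(axis=1)])
B = np.vstack([0.5 * (path ** 2).sum(axis=1), -path[:, 0], -path[:, 1], np.ones(n_steps)])
assert np.allclose(D, A @ B)

print(np.linalg.matrix_rank(D))                 # 4
U, s, Vt = np.linalg.svd(D, full_matrices=False)
print(s[:6])                                    # only four non-negligible singular values
```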

Experiments (full version). Autonomous lawn mower on the Plaza 1 and Plaza 2 datasets [Djugash & Singh, 2008]. [Figure: dead reckoning, true path, estimated path, true landmarks, and estimated landmarks for each dataset.]

Experiments (full version). Comparison of range-only SLAM algorithms, localization RMSE on Plaza 1 and Plaza 2 (best entries per column in boldface on the original slide):
Dead reckoning (full path): 15.92 m, 27.28 m
Cartesian EKF (last 10%): 0.94 m, 0.92 m
Batch optimization (last 10%): 0.68 m, 0.96 m
FastSLAM (last 10%): 0.73 m, 1.14 m
ROP EKF (last 10%): 0.65 m, 0.87 m
Spectral SLAM (worst 10%): 1.17 m, 0.51 m
Spectral SLAM (best 10%): 0.28 m, 0.22 m
Spectral SLAM (full path): 0.39 m, 0.35 m
[Figure caption fragments: the robotic lawn mower platform; on one dataset the robot received 3,529 range measurements along a path that minimizes turns (a common mowing pattern), on the other it traveled 1.3 km and received 1,816 range measurements along a path repeating the same direction; spectral SLAM recovers the path and landmarks in both cases.]

Structure from motion [Tomasi & Kanade, 1992].

Video textures. A Kalman filter works for some video textures (the steam grate example above); for the fountain, the observation is the raw pixels (a vector of reals over time).

Video textures, redux. [Video: original vs. Kalman filter vs. PSR.] Both models use 10 latent dimensions.

Learning in the loop: option pricing. [Plot: price p_t against p_{t-100}.] Price a financial derivative, a "psychic call": the holder gets to say "I bought a call 100 days ago"; the underlying stock follows Black-Scholes (with unknown parameters).

Option pricing. Solution [Van Roy et al.]: use policy iteration with 16 hand-picked features (e.g., polynomials of the price history): initialize the policy arbitrarily; use least-squares temporal differences (LSTD) to estimate the value function; set the policy to be greedy; repeat.
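
For reference, a minimal LSTD sketch: it solves the standard least-squares temporal-difference system A w = b on whatever features are supplied. The feature and reward arrays below are placeholders, not the option-pricing setup from the talk.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma=0.99, ridge=1e-6):
    """Least-squares temporal differences: solve A w = b with
    A = Phi^T (Phi - gamma * Phi'), b = Phi^T r."""
    d = phi.shape[1]
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A + ridge * np.eye(d), b)

# Placeholder trajectory features (in the talk these would be the 16 hand-picked
# features of the price history) and placeholder rewards
rng = np.random.default_rng(3)
n, d = 10000, 16
phi = rng.standard_normal((n, d))
phi_next = np.roll(phi, -1, axis=0)
rewards = rng.standard_normal(n)

w = lstd(phi, phi_next, rewards)
value_estimates = phi @ w
```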

Option pricing. Better solution: spectral subspace identification (SSID) inside policy iteration. Start from the 16 original features from Van Roy et al. plus 204 additional low-originality features (e.g., linear functions of the price history of the underlying); SSID picks the best 16-dimensional dynamics to explain the feature evolution; then solve for the value function in closed form.

Policy iteration with spectral learning. [Plot: expected payoff per $ of underlying (roughly 0.95 to 1.30) vs. iterations of policy iteration, comparing Threshold, LSTD (16 features), LSTD, LARS-TD, and PSTD.] The spectral approach is 0.82¢/$ better than the best competitor and 1.33¢/$ better than the best previously published result; data: 1,000,000 time steps.

Slot car IMU data [Ko & Fox, 09; Boots et al., 10]. [Figure: (A) the slot car platform with the IMU (top) and the racetrack (bottom); (B) average squared prediction error vs. prediction horizon for the estimated models (HMM, RR-HMM, LDS, embedded HSE-HMM) and baselines (mean, last observation).]

Kalman < HSE-HMM < manifold. [Figure, from a manuscript under review at AISTATS 2012: RMS prediction error (XY position) vs. prediction horizon on the slot car IMU data, comparing mean observation, previous observation, HMM (EM), Kalman filter, HSE-HMM, and manifold HMM; also a comparison of state spaces/kernels: linear, Gaussian RBF, Laplacian eigenmap.]

Algorithm intuition. Use separate one-manifold learners to estimate structure in the past and in the future; the result is a pair of noisy Gram matrices G_X and G_Y. The signal is correlated between G_X and G_Y, while the noise is independent. So look at the eigensystem of G_X G_Y: it suppresses the noise and leaves the signal.
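
A rough numerical illustration of the "signal correlated, noise independent" intuition, using plain covariance and cross-covariance matrices rather than the manifold learners' Gram matrices from the talk: two views share a low-dimensional signal but have independent noise, and the cross-view statistic recovers the signal subspace where the single-view statistic is pulled toward its own noise.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 5000, 20, 3
Z = rng.standard_normal((n, k))                      # shared low-dimensional signal
Ax, Ay = rng.standard_normal((d, k)), rng.standard_normal((d, k))

# View X has large noise in its first three coordinates; view Y's noise is independent
X = Z @ Ax.T + rng.standard_normal((n, d)) * np.where(np.arange(d) < 3, 8.0, 1.0)
Y = Z @ Ay.T + rng.standard_normal((n, d))

def residual(U, basis):
    """How much of span(U) lies outside span(basis) (0 = perfectly contained)."""
    Q, _ = np.linalg.qr(basis)
    return np.linalg.norm(U - Q @ (Q.T @ U))

# Single-view statistic: top eigenvectors of Cov(X) typically chase the noisy directions
_, vecs = np.linalg.eigh(X.T @ X / n)
print("X alone:", residual(vecs[:, -k:], Ax))

# Cross-view statistic: noise is independent across views, so it averages out
U, s, Vt = np.linalg.svd(X.T @ Y / n)
print("cross:  ", residual(U[:, :k], Ax))            # typically much smaller
```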

Summary. A general class of models for dynamical systems; a fast, statistically consistent learning method; includes many well-known models and algorithms as special cases (HMM, Kalman filter, n-gram, PSR, and kernel versions; SfM, subspace ID, Kalman video textures); also includes new models and algorithms that give state-of-the-art performance on interesting tasks (range-only SLAM, PSR video textures, HSE-HMM).

Papers:
B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.
B. Boots and G. Gordon. Predictive State Temporal Difference Learning. NIPS, 2010.
B. Boots, S. M. Siddiqi, and G. Gordon. Closing the Learning-Planning Loop with Predictive State Representations. RSS, 2010.
L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert Space Embeddings of Hidden Markov Models. ICML, 2010. (Best paper.)