Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics


1 / 38 Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics Chris Williams with Kian Ming A. Chai, Stefan Klanke, Sethu Vijayakumar December 2009

Motivation 2 / 38 Examples of multi-task learning: co-occurrence of ores (geostatistics); object recognition for multiple object classes; personalization (personalizing spam filters, speaker adaptation in speech recognition); compiler optimization of many computer programs; robot inverse dynamics (multiple loads). Gain strength by sharing information across tasks. More general questions: meta-learning, learning about learning.

Outline 3 / 38 1. Gaussian process prediction 2. Co-kriging 3. Intrinsic Correlation Model 4. Multi-task learning: A. MTL as Hierarchical Modelling B. MTL as Input-space Transformation C. MTL as Shared Feature Extraction 5. Theory for the Intrinsic Correlation Model 6. Multi-task learning in Robot Inverse Dynamics

4 / 38 1. What is a Gaussian process? A Gaussian process (GP) is a generalization of a multivariate Gaussian distribution to infinitely many variables. Informally: an infinitely long vector ≃ a function. Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions. A Gaussian distribution is fully specified by a mean vector µ and covariance matrix Σ: f ~ N(µ, Σ). A Gaussian process is fully specified by a mean function m(x) and a covariance function k(x, x'): f(x) ~ GP(m(x), k(x, x')).

5 / 38 The Marginalization Property Thinking of a GP as an infinitely long vector may seem impractical. Fortunately we are saved by the marginalization property. So generally we need only consider the n locations where data is observed, together with the test point x_*; the remainder are marginalized out.

Drawing Random Functions from a Gaussian Process 6 / 38 Example one-dimensional Gaussian process: f(x) ~ GP(m(x) = 0, k(x, x') = exp(-(x - x')^2 / 2)). To get an indication of what this distribution over functions looks like, focus on a finite subset of x-values, f = (f(x_1), f(x_2), ..., f(x_n))^T, for which f ~ N(0, Σ), where Σ_ij = k(x_i, x_j).
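As an illustration of the finite-subset view, here is a minimal NumPy sketch (not from the slides; the grid size and number of draws are arbitrary) that samples functions from this zero-mean GP prior with the squared-exponential kernel:

```python
# Minimal sketch: draw sample functions from a zero-mean GP prior with
# k(x, x') = exp(-(x - x')^2 / 2), evaluated on a finite grid of x-values.
import numpy as np

def se_kernel(xa, xb):
    # Squared-exponential covariance between two 1-d input vectors.
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

x = np.linspace(-5, 5, 200)                  # finite subset of x-values
Sigma = se_kernel(x, x)                      # Sigma_ij = k(x_i, x_j)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(x)))  # jitter for stability
f_samples = L @ np.random.randn(len(x), 3)   # three draws f ~ N(0, Sigma)
```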

7 / 38 [Figure: sample functions drawn from the GP prior; x-axis: input x, y-axis: output f(x).]

8 / 38

From Prior to Posterior 9 / 38 [Figures: samples from the prior (left) and the predictive distribution after observing data (right); x-axis: input x, y-axis: output f(x).] Predictive distribution: p(y_* | x_*, X, y, M) = N( k^T(x_*, X) [K + σ_n^2 I]^{-1} y, k(x_*, x_*) + σ_n^2 - k^T(x_*, X) [K + σ_n^2 I]^{-1} k(x_*, X) )

10 / 38 Marginal Likelihood: log p(y | X, M) = -(1/2) y^T K_y^{-1} y - (1/2) log|K_y| - (n/2) log(2π), where K_y = K + σ_n^2 I. This can be used to adjust the free parameters (hyperparameters) of a kernel. There can be multiple local optima of the marginal likelihood, corresponding to different interpretations of the data.
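A compact sketch of how these two formulas are typically computed (the SE kernel, data and noise level below are illustrative assumptions, not from the talk):

```python
# Sketch of GP prediction and log marginal likelihood for noisy 1-d data,
# following the formulas above with K_y = K + sigma_n^2 I.
import numpy as np

def se_kernel(xa, xb):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

def gp_predict(X, y, Xstar, noise_var):
    K_y = se_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K_y^{-1} y
    Ks = se_kernel(Xstar, X)                                 # k(x*, X)
    mean = Ks @ alpha                                        # predictive mean
    v = np.linalg.solve(L, Ks.T)
    var = 1.0 + noise_var - np.sum(v * v, axis=0)            # k(x*, x*) = 1 for this kernel
    log_ml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
              - 0.5 * len(X) * np.log(2 * np.pi))            # log marginal likelihood
    return mean, var, log_ml

X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
mean, var, log_ml = gp_predict(X, np.sin(X), np.linspace(-5, 5, 50), 0.01)
```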

2. Co-kriging 11 / 38 Consider M tasks and N distinct inputs x_1, ..., x_N; f_il is the response for the l-th task on the i-th input x_i. Gaussian process with covariance function k(x, l; x', m) = <f_l(x) f_m(x')>. Goal: given noisy observations y of f, make predictions of the unobserved values f_* at locations X_*. Solution: use the usual GP prediction equations.

12 / 38 [Figure: several task functions f plotted against input x.]

13 / 38 Some questions: What kinds of (cross-)covariance structures match different ideas of multi-task learning? Are there multi-task relationships that don't fit well with co-kriging?

14 / 38 3. Intrinsic Correlation Model (ICM) <f_l(x) f_m(x')> = K^f_{lm} k^x(x, x'), y_il ~ N(f_l(x_i), σ_l^2). K^f: PSD matrix that specifies the inter-task similarities (could depend parametrically on task descriptors if these are available). k^x: covariance function over inputs. σ_l^2: noise variance for the l-th task. The Linear Model of Coregionalization is a sum of ICMs.
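For a finite set of inputs shared across all tasks, the ICM covariance over every (task, input) pair is the Kronecker product of K^f and the input covariance matrix. A small illustrative NumPy sketch (sizes and kernels are arbitrary assumptions):

```python
# Build the joint ICM covariance K_f (kron) K_x for M tasks at N shared inputs.
import numpy as np

M, N = 3, 5
rho = np.random.randn(M, 2)                  # task features; K_f = rho rho^T is PSD
K_f = rho @ rho.T                            # inter-task similarity matrix
x = np.linspace(0.0, 1.0, N)
K_x = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # covariance over inputs
Cov = np.kron(K_f, K_x)                      # (M*N) x (M*N) joint covariance
# Entry [l*N + i, m*N + j] equals K_f[l, m] * K_x[i, j], i.e. <f_l(x_i) f_m(x_j)>.
```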

ICM as a linear combination of independent GPs 15 / 38 Independent GP priors over the functions z_j(x) give a multi-task GP prior over the f_m(x): <f_l(x) f_m(x')> = K^f_{lm} k^x(x, x'), where K^f ∈ R^{M×M} is a task (or context) similarity matrix with K^f_{lm} = (ρ^m)^T ρ^l. [Graphical model: latent functions z_1, ..., z_M are mixed with weights ρ^m_1, ..., ρ^m_M to give f_m, for m = 1, ..., M.]
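A generative sketch of this construction (sizes are arbitrary assumptions), mixing independent GP draws z_j with weights ρ^m so that the resulting task functions have exactly the covariance above:

```python
# Mix independent GP samples z_j(x) into correlated task functions f_m(x),
# giving <f_l(x) f_m(x')> = ((rho^m)^T rho^l) k_x(x, x').
import numpy as np

N, M, J = 100, 4, 2                          # grid points, tasks, latent functions
x = np.linspace(-5, 5, N)
K_x = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
L = np.linalg.cholesky(K_x + 1e-10 * np.eye(N))
Z = L @ np.random.randn(N, J)                # J independent GP draws over x
rho = np.random.randn(J, M)                  # column m is rho^m; K_f = rho^T rho
F = Z @ rho                                  # column m is the task function f_m(x)
```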

16 / 38 Some problems conform nicely to the ICM setup, e.g. robot inverse dynamics (Chai, Williams, Klanke, Vijayakumar 2009; see later). The semiparametric latent factor model (SLFM) of Teh et al (2005) has P latent processes, each with its own covariance function; noiseless outputs are obtained by linear mixing of these latent functions.

4 A. Multi-task Learning as Hierarchical Modelling 17 / 38 e.g. Baxter (JAIR, 2000), Goldstein (2003). [Graphical model: shared parameter θ with task functions f_1, f_2, f_3 generating observations y_1, y_2, y_3.]

18 / 38 The prior on θ may be generic (e.g. isotropic Gaussian) or more structured. A mixture model on θ gives task clustering; task clustering can be implemented in the ICM model using a block-diagonal K^f, where each block is a cluster. A manifold model for θ, e.g. a linear subspace, gives low-rank structure of K^f (e.g. linear regression with correlated priors). A combination of the above ideas gives a mixture of linear subspaces. If task descriptors are available then one can take K^f_{lm} = k^f(t_l, t_m). Regularization framework: Evgeniou et al (JMLR, 2005).

GP view 19 / 38 Integrate out θ. [Graphical model: the task functions f_1, f_2, f_3 with observations y_1, y_2, y_3, now coupled directly once θ has been integrated out.]

4 B. MTL as Input-space Transformation 20 / 38 Ben-David and Schuller (COLT, 2003): f_2(x) is related to f_1(x) by an X-space transformation f: X → X. Suppose f_2(x) is related to f_1(x) by a shift a in x-space, i.e. f_2(x) = f_1(x - a). Then <f_1(x) f_2(x')> = <f_1(x) f_1(x' - a)> = k_1(x, x' - a).
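A toy NumPy illustration of this shifted cross-covariance (the kernel, grid and shift are arbitrary assumptions):

```python
# Cross-covariance between f_1 and f_2(x) = f_1(x - a) under an SE kernel k_1:
# <f_1(x) f_2(x')> = k_1(x, x' - a).
import numpy as np

def k1(xa, xb):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

a = 0.7                                   # shift in x-space
x = np.linspace(-3.0, 3.0, 7)
cross_cov = k1(x, x - a)                  # Cov[f_1(x_i), f_2(x_j)]
auto_cov = k1(x, x)                       # Cov[f_1(x_i), f_1(x_j)] (same for f_2)
```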

21 / 38 More generally one can consider convolutions, e.g. f_i(x) = ∫ h_i(x - x') g(x') dx', to generate dependent f's (e.g. Ver Hoef and Barry, 1998; Higdon, 2002; Boyle and Frean, 2005). Taking h_i(x) = δ(x - a) recovers the shift example as a special case. Alvarez and Lawrence (2009) generalize this to allow a linear combination of several latent processes: f_i(x) = Σ_{r=1}^R ∫ h_{ir}(x - x') g_r(x') dx'. ICM and SLFM are special cases using the δ(·) kernel.

22 / 38 4 C. Shared Feature Extraction Intuition: multiple tasks can depend on the same extracted features; all tasks can be used to help learn these features. If data is scarce for each task this should help learn the features. Bakker and Heskes (2003): neural network setup. [Figure: neural network with input layer (x), hidden layers 1 and 2, and an output layer.]

23 / 38 Minka and Picard (1999): assume that the multiple tasks are independent GPs but with shared hyperparameters. Yu, Tresp and Schwaighofer (2005) extend this so that all tasks share the same kernel hyperparameters, but can have different kernels. Could also have inter-task correlations. An interesting case arises if different tasks have different x-spaces: convert from each task-dependent x-space to the same feature space?

5. Some Theory for the ICM 24 / 38 Kian Ming A. Chai, NIPS 2009. Primary task T and secondary task S, with correlation ρ between the tasks; a proportion π_S of the total data belongs to the secondary task S. For GPs and squared loss, we can compute analytically the generalization error ε_T(ρ, X_T, X_S) given p(x). Average this over X_T, X_S to get ε_T^avg(ρ, π_S, n), the learning curve for the primary task T given a total of n observations for both tasks. A lower bound on ε_T^avg can be computed; the theory bounds the benefit of multi-task learning in terms of this lower bound.

Theory: Result on lower bound 25 / 38 Average generalization error for T, plotted against n: single-task ε_T^avg(0, π_S, n) and multi-task ε_T^avg(ρ, π_S, n), with the lower bound ε_T^avg(ρ, π_S, n) ≥ (1 - ρ^2 π_S) ε_T^avg(0, π_S, n). The bound has been demonstrated on 1-d problems, and on the input distribution corresponding to the SARCOS robot arm data.
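Under the reconstruction above (multi-task error bounded below by (1 - ρ^2 π_S) times the single-task error), a tiny arithmetic illustration with arbitrary values:

```python
# Tiny illustration of the bound factor (values are arbitrary assumptions).
rho, pi_S = 0.8, 0.5
factor = 1.0 - rho**2 * pi_S   # multi-task error >= factor * single-task error
print(factor)                  # 0.68: at most ~32% error reduction in this case
```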

26 / 38 Discussion Three types of multi-task learning setup: ICM and convolutional cross-covariance functions, shared feature extraction. Are there multi-task relationships that don't fit well with a co-kriging framework?

27 / 38 Multi-task Learning in Robot Inverse Dynamics Joint variables q. Apply torque τ_i to joint i to trace a trajectory. Inverse dynamics: predict τ_i(q, q̇, q̈). [Figure: robot arm with base, link 0, link 1 (joint angle q_1), link 2 (joint angle q_2) and end effector.]

Inverse Dynamics Characteristics of τ 28 / 38 Torques are non-linear functions of x := (q, q̇, q̈). (One) idealized rigid-body control law: τ_i(x) = b_i^T(q) q̈ + q̇^T H_i(q) q̇ + g_i(q) + f_i^v q̇_i + f_i^c sgn(q̇_i), where the first two terms are kinetic, g_i(q) is the potential term, and the last two terms are the viscous and Coulomb frictions. Physics-based modelling can be hard due to factors like unknown parameters, friction and contact forces, and joint elasticity, making analytical predictions infeasible. This is particularly true for compliant, lightweight humanoid robots.

29 / 38 Inverse Dynamics Characteristics of τ The functions change with the loads handled at the end effector; loads have different masses, shapes and sizes. Bad news (1): need a different inverse dynamics model for each load. Bad news (2): different loads may go through different trajectories in the data-collection phase and may explore different portions of the x-space.

30 / 38 Good news: the changes enter through changes in the dynamic parameters of the last link. Good news: the changes are linear w.r.t. the dynamic parameters: τ_i^m(x) = y_i^T(x) π^m, where π^m ∈ R^11 (e.g. Petkos and Vijayakumar, 2007). Reparameterization: τ_i^m(x) = y_i^T(x) π^m = y_i^T(x) A_i^{-1} A_i π^m = z_i^T(x) ρ_i^m, where A_i is a non-singular 11 × 11 matrix.
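A quick numerical check of this reparameterization (a toy sketch; the matrix and vectors are random stand-ins, not robot quantities):

```python
# Verify that tau = y^T pi = z^T rho with z = A^{-T} y and rho = A pi.
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(11)              # y_i(x): load-independent feature vector
pi = rng.standard_normal(11)             # pi^m: dynamic parameters of load m
A = rng.standard_normal((11, 11))        # any non-singular 11 x 11 matrix

z = np.linalg.solve(A.T, y)              # z_i(x) = A_i^{-T} y_i(x)
rho = A @ pi                             # rho_i^m = A_i pi^m
print(np.allclose(y @ pi, z @ rho))      # True: the torque is unchanged
```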

GP prior for Inverse Dynamics for multiple loads 31 / 38 Independent GP priors over the functions z_ij(x) give a multi-task GP prior over the τ_i^m: <τ_i^l(x) τ_i^m(x')> = (K_i^ρ)_{lm} k_i^x(x, x'), where K_i^ρ ∈ R^{M×M} is a task (or context) similarity matrix with (K_i^ρ)_{lm} = (ρ_i^m)^T ρ_i^l. [Graphical model: for each joint i = 1, ..., J, latent functions z_{i,1}, ..., z_{i,s} are mixed with weights ρ^m_{i,1}, ..., ρ^m_{i,s} to give τ_i^m, for m = 1, ..., M.]

32 / 38 GP prior for k(x, x'): k(x, x') = bias + [linear with ARD](x, x') + [squared exponential with ARD](x, x') + [linear with ARD](sgn(q̇), sgn(q̇')). Domain knowledge relates to the last term (Coulomb friction).
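A sketch of what this composite covariance might look like in code (an assumed rendering; the exact ARD parameterization used in the talk is not shown here):

```python
# Composite kernel: bias + linear-ARD(x, x') + SE-ARD(x, x')
#                   + linear-ARD(sgn(qdot), sgn(qdot')).
import numpy as np

def linear_ard(A, B, w):
    # Weighted dot product: sum_d w_d * a_d * b_d.
    return (A * w) @ B.T

def se_ard(A, B, lengthscales):
    d2 = (((A[:, None, :] - B[None, :, :]) / lengthscales) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2)

def composite_kernel(Xa, Xb, qdot_a, qdot_b, bias, w_lin, ell, w_sgn):
    return (bias
            + linear_ard(Xa, Xb, w_lin)
            + se_ard(Xa, Xb, ell)
            + linear_ard(np.sign(qdot_a), np.sign(qdot_b), w_sgn))

# Toy usage: 4 points for a 6-joint arm, x = (q, qdot, qddot) has 18 dimensions.
Xa = np.random.randn(4, 18)
K = composite_kernel(Xa, Xa, Xa[:, 6:12], Xa[:, 6:12], 1.0,
                     np.ones(18), np.ones(18), np.ones(6))
```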

Data 33 / 38 Puma 560 robot arm manipulator: 6 degrees of freedom. Realistic simulator (Corke, 1996), including viscous and asymmetric-Coulomb frictions. 4 paths × 4 speeds = 16 different trajectories; speeds: 5s, 10s, 15s and 20s completion times. 15 loads (contexts): 0.2kg ... 3.0kg, various shapes and sizes. [Figures: the Puma 560 joint layout (Joint 1 waist, Joint 2 shoulder, Joint 3 elbow, Joint 4 wrist rotation, Joint 5 wrist bend, Joint 6 flange) and the paths p_3, p_4, ... plotted in workspace coordinates x, y, z (m).]

34 / 38 Data Training data: 1 reference trajectory common to the handling of all loads; 14 unique training trajectories, one for each context (load); 1 trajectory has no data for any context, so it is always novel. Test data: interpolation data sets for testing on the reference trajectory and the unique trajectory for each load; extrapolation data sets for testing on all trajectories.

Methods 35 / 38
sgp: single-task GP, trained separately for each load.
igp: independent GP, trained independently for each load but tying parameters across loads.
pgp: pooled GP, one single GP trained by pooling data across loads.
mgp: multi-task GP with BIC, sharing latent functions across loads and selecting the similarity matrix using BIC.
For mgp, the rank of K^f is determined using the BIC criterion.

36 / 38 Results [Figure: average nMSEs of sgp, igp, pgp and mgp-bic for interpolation (j = 5) and extrapolation (j = 5), plotted against n ∈ {280, 448, 728, 1400}; error bars: 90% confidence interval of the mean (5 replications).]

37 / 38 [Figure: paired difference of average nMSEs for interpolation (j = 5) and extrapolation (j = 5), plotted against n ∈ {280, 448, 728, 1400}, where paired difference = best average nMSE of (p, i, s)gp − average nMSE of mgp-bic; the curve shows the mean of the paired differences.]

38 / 38 Conclusions and Discussion GP formulation of MTL with a factorization into k^x(x, x') and K^f, and an encoding of task similarity. This model fits exactly for multi-context inverse dynamics. Results show that MTL can be effective. This is one model for MTL, but what about others, e.g. covariance functions that don't factorize?