Gaussian with mean (µ) and standard deviation (σ)


Slide from Pieter Abbeel: Gaussian with mean (µ) and standard deviation (σ). (CSE-571: Robotics, 10/6/16)

Linear transformation of a Gaussian: if X ~ N(µ, σ²) and Y = aX + b, then Y ~ N(aµ + b, a²σ²).
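A quick Monte Carlo check of this property (not from the slides; the numeric values are illustrative):

```python
import numpy as np

# Minimal sketch: verify that Y = aX + b with X ~ N(mu, sigma^2)
# gives Y ~ N(a*mu + b, a^2 * sigma^2) by sampling.
rng = np.random.default_rng(0)
mu, sigma, a, b = 1.0, 2.0, 3.0, -1.0

x = rng.normal(mu, sigma, size=100_000)
y = a * x + b

print(y.mean(), a * mu + b)          # both approximately 2.0
print(y.var(), (a * sigma) ** 2)     # both approximately 36.0
```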

Multiplying two Gaussian densities over the same variable again yields a Gaussian (up to normalization):
X ~ N(µ, σ²), X ~ N(µ₁, σ₁²)
p(X) · p(X) ∝ N( (σ₁²µ + σ²µ₁) / (σ² + σ₁²), (σ²σ₁²) / (σ² + σ₁²) )

Picture from [Bishop: Pattern Recognition and Machine Learning, 2006]
Multivariate Gaussian: p(x) = N(µ, Σ), with the variables partitioned into blocks
x = [x_a; x_b], µ = [µ_a; µ_b], Σ = [Σ_aa Σ_ab; Σ_ba Σ_bb]
p(x) = 1 / ( (2π)^(d/2) |Σ|^(1/2) ) · exp( -½ (x − µ)ᵀ Σ⁻¹ (x − µ) )
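A small sketch of this density in numpy (function name and test values are illustrative, not from the slides):

```python
import numpy as np

# Multivariate Gaussian density p(x) = N(mu, Sigma) in d dimensions.
def gaussian_pdf(x, mu, Sigma):
    d = mu.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([1.0, 0.0])
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
print(gaussian_pdf(np.array([1.0, 0.0]), mu, Sigma))  # peak value 1/(2*pi) ≈ 0.159
```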

Slide from Pieter Abbeel " µ = [1; 0] " Σ = [1 0; 0 1] " µ = [-.5; 0] " Σ = [1 0; 0 1] " µ = [-1; -1.5] " Σ = [1 0; 0 1] 10/6/16 CSE-571: Robotics 6

Slide from Pieter Abbeel. Zero mean, varying isotropic covariance scale:
- µ = [0; 0], Σ = [1 0; 0 1]
- µ = [0; 0], Σ = [0.6 0; 0 0.6]
- µ = [0; 0], Σ = [2 0; 0 2]

Slide from Pieter Abbeel " µ = [0; 0] " Σ = [1 0; 0 1] " µ = [0; 0] " Σ = [1 0.5; 0.5 1] " µ = [0; 0] " Σ = [1 0.8; 0.8 1] 10/6/16 CSE-571: Robotics 8

Slide from Pieter Abbeel " µ = [0; 0] " Σ = [1-0.5 ; -0.5 1] " µ = [0; 0] " Σ = [1-0.8 ; -0.8 1] " µ = [0; 0] " Σ = [31 0.8 ; 0.8 31] 10/6/16 CSE-571: Robotics 9

Pictures from [Bishop: PRML, 2006]
Marginalizing a joint Gaussian results in a Gaussian:
p(x_a, x_b) = N( [µ_a; µ_b], [Σ_aa Σ_ab; Σ_ba Σ_bb] )
p(x_a) = ∫ p(x_a, x_b) dx_b = N(µ_a, Σ_aa)
Conditioning also leads to a Gaussian:
p(x_a | x_b) = N(µ_{a|b}, Σ_{a|b})
µ_{a|b} = µ_a + Σ_ab Σ_bb⁻¹ (x_b − µ_b)   (prior mean of a, shifted via the cross-covariance and the observed value x_b)
Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba   (prior variance of a, reduced by a shrink term ≥ 0 that involves the prior variance of b)
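A minimal sketch of the conditioning formulas above (block names follow the slide; the numeric example is illustrative):

```python
import numpy as np

# Condition a joint Gaussian over (x_a, x_b) on an observed x_b.
def condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b):
    """Return mean and covariance of p(x_a | x_b)."""
    K = S_ab @ np.linalg.inv(S_bb)      # "gain" Sigma_ab Sigma_bb^-1
    mu_cond = mu_a + K @ (x_b - mu_b)   # prior mean shifted by the observation
    S_cond = S_aa - K @ S_ab.T          # prior variance minus the shrink term
    return mu_cond, S_cond

mu_a, mu_b = np.array([0.0]), np.array([0.0])
S_aa = np.array([[1.0]]); S_bb = np.array([[1.0]]); S_ab = np.array([[0.8]])
print(condition_gaussian(mu_a, mu_b, S_aa, S_ab, S_bb, x_b=np.array([1.0])))
# mean ≈ [0.8], covariance ≈ [[0.36]]
```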


Regression: modeling the relationship between real-valued variables in data, e.g. sensor models, dynamics models, the stock market, etc.
Two broad classes of models:
- Parametric: learn a model of the data and use the model to make new predictions, e.g. linear models, non-linear models, neural networks, etc.
- Non-parametric: keep the data around and use it to make new predictions, e.g. nearest neighbor methods, locally weighted regression, Gaussian processes, etc.

Parametric models. Idea: summarize the data using a learned model (linear, polynomial, neural networks, etc.).
Computationally efficient; trade off model complexity against generalization.
[Figure: training set with linear and polynomial fits of increasing degree]

Non-parametric models. Idea: use the nearest neighbor's prediction (with some interpolation).
Non-parametric: keeps all the data. Examples: 1-NN, NN with linear interpolation.
Easy, but needs a lot of data; the best you can do in the limit of infinite data; computationally expensive in high dimensions.
[Figure: training set with 1-NN and NN-with-linear-interpolation fits]
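As a minimal illustration of the nearest-neighbor idea (not from the slides; data and names are illustrative):

```python
import numpy as np

# 1-nearest-neighbour regression: predict the target of the closest training input.
def one_nn_predict(x_query, X_train, y_train):
    idx = np.argmin(np.abs(X_train - x_query))   # index of the nearest neighbour
    return y_train[idx]

X_train = np.linspace(0, 10, 20)
y_train = np.sin(X_train)
print(one_nn_predict(5.2, X_train, y_train))     # target of the nearest training point
```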

Smooth non-parametric models. Idea: interpolate based on close training data, with closeness defined by a kernel function; the test output is a weighted interpolation of the training outputs. Examples: locally weighted regression, Gaussian processes.
Can model arbitrary (smooth) functions, but need to keep around some (maybe all) of the training data.
[Figure: training set with LWR and GP fits, including the GP predictive variance]
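A minimal kernel-weighted interpolation sketch of this idea (Nadaraya-Watson style smoothing rather than a full locally weighted linear fit; bandwidth and data are illustrative):

```python
import numpy as np

# Predict at a query point as a kernel-weighted average of nearby training targets.
def kernel_smooth_predict(x_query, X_train, y_train, bandwidth=0.3):
    w = np.exp(-0.5 * ((X_train - x_query) / bandwidth) ** 2)  # closeness via a kernel
    return np.sum(w * y_train) / np.sum(w)                     # weighted interpolation

X_train = np.linspace(0, 10, 20)
y_train = np.sin(X_train)
print(kernel_smooth_predict(5.0, X_train, y_train))  # roughly sin(5) ≈ -0.96
```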


Gaussian processes: a non-parametric regression model, a distribution over functions.
Fully specified by the training data together with mean and covariance functions; the covariance is given by a kernel that measures the distance between inputs in kernel space.

Given inputs (x) and targets (y): D = {(x₁, y₁), (x₂, y₂), …, (x_n, y_n)} = (X, y)
GPs model the targets as a noisy function of the inputs: y_i = f(x_i) + ε, ε ~ N(0, σ_n²)
Formally, a GP is a collection of random variables, any finite number of which have a joint Gaussian distribution:
f(x) ~ GP(m(x), k(x, x'))
m(x) = E[f(x)]
k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]

Given a (finite) set of inputs X, a GP models the outputs y as jointly Gaussian:
P(y | X) = N( m(X), K(X, X) + σ_n² I )   (σ_n² I is the noise term)
m(X) = [m(x₁), m(x₂), …, m(x_n)]ᵀ
K(X, X) is the matrix with entries K_ij = k(x_i, x_j), from k(x₁, x₁) … k(x₁, x_n) in the first row through k(x_n, x₁) … k(x_n, x_n) in the last.
Usually we assume a zero-mean prior; other mean functions (constant, polynomials, etc.) can be defined.

The covariance matrix K is defined through the kernel function, which specifies the covariance of the outputs as a function of the inputs: similar input points will have similar outputs.
Example: squared exponential kernel, whose covariance decays with distance in input space:
k(x, x') = σ_f² exp( −½ (x − x') W (x − x')ᵀ )
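A sketch of this kernel with W taken as (1/length_scale²) · I (function name and values are illustrative):

```python
import numpy as np

# Squared exponential kernel; sigma_f is the signal standard deviation.
def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    sq_dist = np.sum((np.atleast_1d(x1) - np.atleast_1d(x2)) ** 2)
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

print(se_kernel(0.0, 0.0))   # 1.0: identical inputs -> maximum covariance
print(se_kernel(0.0, 3.0))   # ~0.011: distant inputs -> nearly uncorrelated outputs
```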

Pictures from [Bishop: PRML, 2006]
GP prior: the outputs are jointly zero-mean Gaussian: P(y | X) = N(0, K + σ_n² I)
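A sketch of drawing function samples from this zero-mean prior, using an SE kernel as above (values and the jitter term are illustrative):

```python
import numpy as np

# Build the SE kernel matrix over a set of inputs and sample prior functions.
def se_kernel_matrix(X, sigma_f=1.0, length_scale=1.0):
    sq_dist = (X[:, None] - X[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100)
K = se_kernel_matrix(X) + 1e-6 * np.eye(len(X))   # small jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)  # three prior functions
print(samples.shape)  # (3, 100)
```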

Training data: D = {(x₁, y₁), (x₂, y₂), …, (x_n, y_n)} = (X, y); test pair with y* unknown: (x*, y*).
The GP outputs are jointly Gaussian:
P(y, y* | X, x*) = N(µ, Σ), with P(y | X) = N(0, K + σ_n² I)
Conditioning on y (recall the Gaussian conditional: p(x_a | x_b) = N(µ_{a|b}, Σ_{a|b}), µ_{a|b} = µ_a + Σ_ab Σ_bb⁻¹ (x_b − µ_b), Σ_{a|b} = Σ_aa − Σ_ab Σ_bb⁻¹ Σ_ba) gives:
P(y* | x*, y, X) = N(µ*, σ*²)
µ* = k*ᵀ (K + σ_n² I)⁻¹ y
σ*² = k** − k*ᵀ (K + σ_n² I)⁻¹ k*
where k*[i] = k(x*, x_i) and k** = k(x*, x*).
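A compact sketch of these predictive equations (kernel, data, and names are illustrative, not from the slides):

```python
import numpy as np

# Squared exponential kernel (broadcasts over arrays).
def se_kernel(a, b, sigma_f=1.0, length_scale=1.0):
    return sigma_f ** 2 * np.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)

# GP posterior mean and variance at a single test input x*.
def gp_predict(x_star, X, y, sigma_n=0.1):
    K = se_kernel(X[:, None], X[None, :]) + sigma_n ** 2 * np.eye(len(X))
    k_star = se_kernel(X, x_star)              # vector k*[i] = k(x*, x_i)
    alpha = np.linalg.solve(K, y)              # (K + sigma_n^2 I)^-1 y
    mu_star = k_star @ alpha
    var_star = se_kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
    return mu_star, var_star

X = np.linspace(0, 10, 30)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=30)
print(gp_predict(5.0, X, y))   # mean near sin(5), small predictive variance
```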


Hyperparameters:
- Noise standard deviation (σ_n): affects how strongly a new observation changes the predictions (and the covariance).
- Kernel (choose based on the data): SE, exponential, Matérn, etc.
- Kernel hyperparameters, e.g. for the SE kernel k(x, x') = σ_f² exp( −½ (x − x') W (x − x')ᵀ ): the length scale (how fast the function changes) and the scale factor σ_f (how large the function variance is).

Pictures from [Bishop: PRML, 2006]
A more general kernel: k(x, x') = θ₀ exp( −(θ₁/2) ‖x − x'‖² ) + θ₂ + θ₃ xᵀx'

Learning the hyperparameters θ = {σ_n, l, σ_f}: maximize the data log likelihood
θ* = argmax_θ p(y | X, θ)
log p(y | X, θ) = −½ yᵀ (K + σ_n² I)⁻¹ y − ½ log |K + σ_n² I| − (n/2) log 2π
Compute derivatives with respect to the parameters and optimize using conjugate gradient descent.
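A sketch of this fit using a general-purpose SciPy optimizer as a stand-in for conjugate gradient (parameters are log-transformed to stay positive; data and starting values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Negative log marginal likelihood of a GP with an SE kernel plus noise.
def neg_log_marginal_likelihood(log_theta, X, y):
    sigma_f, length_scale, sigma_n = np.exp(log_theta)
    sq_dist = (X[:, None] - X[None, :]) ** 2
    K = (sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2)
         + sigma_n ** 2 * np.eye(len(X)) + 1e-9 * np.eye(len(X)))  # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))            # (K)^-1 y via Cholesky
    # 0.5 y^T K^-1 y + 0.5 log|K| + (n/2) log 2*pi
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

X = np.linspace(0, 10, 40)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=40)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
print(np.exp(res.x))   # fitted [sigma_f, length_scale, sigma_n]
```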


Learn the hyperparameters via numerical methods, and learn the noise model at the same time.

System: commercial blimp envelope with a custom gondola.
- XScale-based computer with Bluetooth connectivity
- Two main motors plus a tail motor (3D control)
- Ground truth obtained via a VICON motion capture system

Parametric dynamics model:
- 12-D state = [pos, rot, transvel, rotvel]
- Describes the evolution of the state as an ODE:
  d/dt [p; ξ; v; ω] = [ R_e^b v; H(ξ) ω; M⁻¹(Σ Forces − ω × Mv); J⁻¹(Σ Torques − ω × Jω) ]
- Forces / torques considered: buoyancy, gravity, drag, thrust
- 16 parameters are learned by optimization on ground-truth motion capture data

[Figure: sequence of states s₁, s₂, s₃ with controls c₁, c₂, observations o, and state changes Δs₁, Δs₂]
Use the ground-truth state to extract dynamics data: D_s = { ([s₁, c₁], Δs₁), ([s₂, c₂], Δs₂), … }
Learn the model using Gaussian process regression; this also learns the process noise inherent in the system.

Enhanced GP: combine the GP model with the parametric model by training the GP on the residuals of the parametric prediction f:
D_X = { ([s₁, c₁], Δs₁ − f([s₁, c₁])), … }
Advantages:
- Captures aspects of the system not considered by the parametric model
- Learns the noise model in the same way as GP-only models
- Higher accuracy for the same amount of training data
A minimal sketch of this residual-learning idea follows.
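The sketch below uses a toy 1-D state, a hypothetical parametric model, and synthetic data, none of which come from the slides; a GP (e.g. the gp_predict sketch above) would then be fit to the residual targets.

```python
import numpy as np

# Fit the GP to residuals between observed state change and a parametric model f([s, c]);
# at test time, add the parametric prediction back to the GP mean.
def parametric_model(s, c):
    return 0.5 * c                      # hypothetical crude dynamics model

rng = np.random.default_rng(0)
s = rng.normal(size=200)                # states (toy, 1-D)
c = rng.normal(size=200)                # controls
delta_s = 0.5 * c + 0.1 * np.sin(s) + 0.01 * rng.normal(size=200)  # "true" state changes

residuals = delta_s - parametric_model(s, c)   # GP targets: what the parametric model misses
print(delta_s.std(), residuals.std())          # residuals have much smaller spread
# ... fit a GP with inputs [s, c] and targets `residuals`;
#     prediction = parametric_model(s, c) + GP mean.
```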

Dynamics model error (1800 training points, mean error over 900 test points; 0.5 s predictions):

Propagation method | pos (mm) | rot (deg) | vel (mm/s) | rotvel (deg/s)
Param              | 3.3      | 0.5       | 14.6       | 1.5
GPonly             | 1.8      | 0.2       | 9.8        | 1.1
EGP                | 1.6      | 0.2       | 9.6        | 1.3

Extensions:
- Heteroscedastic (state-dependent) noise
- Non-stationary GPs
- Coupled outputs
- Sparse GPs
- Online learning: decide whether or not to accept a new point, remove points, optimize a small set of points
- Classification: Laplace approximation; no closed-form solution, so sampling

Summary: GPs provide a flexible modeling framework that takes data noise and the uncertainty due to data sparsity into account. Combining them with parametric models increases accuracy and reduces the need for training data. Computational complexity is a key problem.

Website: http://www.gaussianprocess.org/
GP book: http://www.gaussianprocess.org/gpml/
GPLVM: http://inverseprobability.com/fgplvm/
GPDM: http://www.dgp.toronto.edu/~jmwang/gpdm/
Bishop book: http://research.microsoft.com/en-us/um/people/cmbishop/prml/