Introduction to Gaussian Processes

Raquel Urtasun, TTI Chicago. August 2, 2013.

Motivation for Non-Linear Dimensionality Reduction. USPS handwritten digit data set: 3648 dimensions (64 rows by 57 columns). The space contains more than just this digit: even if we sample every nanosecond from now until the end of the universe, you won't see the original six!

Simple Model of Digit: rotate a prototype.

MATLAB Demo: demdigitsmanifold([1 2], 'all'). [Figure: data projected onto principal components 1 and 2.]

MATLAB Demo: demdigitsmanifold([1 2], 'sixnine'). [Figure: data projected onto principal components 1 and 2.]
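The demo above relies on Neil Lawrence's MATLAB toolbox, which is not reproduced here. As a rough, hypothetical stand-in (synthetic prototype, my own function names), the following Python sketch rotates a simple prototype image through 360 degrees and projects the resulting images onto their first two principal components, tracing out the one-dimensional rotation manifold the demo illustrates.

```python
import numpy as np
from scipy.ndimage import rotate

# Synthetic stand-in for a digit prototype: a bright off-centre bar on a 64 x 57 grid.
proto = np.zeros((64, 57))
proto[20:44, 26:31] = 1.0

# Generate the "rotation manifold": one image per degree of rotation.
angles = np.arange(0, 360)
Y = np.stack([rotate(proto, a, reshape=False, order=1).ravel() for a in angles])

# Project onto the first two principal components (via SVD of the centred data).
Yc = Y - Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
pcs = Yc @ Vt[:2].T          # 360 x 2 matrix of principal-component scores

print(pcs.shape)             # plotting pcs traces out a closed loop, as in the demo
```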

Low Dimensional Manifolds. Pure rotation is too simple: in practice the data may undergo several distortions, e.g. digits undergo thinning, translation and rotation. For data with structure we expect fewer distortions than dimensions, and therefore expect the data to live on a lower dimensional manifold. Conclusion: deal with high dimensional data by looking for a lower dimensional non-linear embedding.

Feature Selection. Figure: demrotationdist. Feature selection via distance preservation.

Feature Extraction. Figure: demrotationdist. Rotation preserves interpoint distances; the residuals are much reduced.

Which Rotation? We need the rotation that will minimise the residual error, so we retain the features/directions with maximum variance. The error is then given by the sum of the residual variances,

E(X) = 2 \sum_{k=q+1}^{p} \sigma_k^2.

Rotations of the data matrix do not affect this analysis: rotate the data so that the largest variance directions are retained.

Reminder: Principal Component Analysis. How do we find these directions? Find the directions in the data with maximal variance; that's what PCA does. PCA rotates the data to extract these directions, working on the sample covariance matrix S = n^{-1} \hat{Y}^\top \hat{Y}.

Principal Coordinates Analysis. The rotation which finds the directions of maximum variance is given by the eigenvectors of the covariance matrix; the variance in each direction is given by the eigenvalues. Problem: working directly with the sample covariance, S, may be impossible. Why?

Equivalent Eigenvalue Problems. Principal coordinate analysis operates on \hat{Y}^\top \hat{Y} \in R^{p \times p}. Can we compute \hat{Y} \hat{Y}^\top instead? When p < n it is easier to solve for the rotation, R_q; when p > n we solve for the embedding (principal coordinate analysis). The two eigenvalue problems are equivalent: one solves for the rotation, the other solves for the location of the rotated points.

The Covariance Interpretation. n^{-1} \hat{Y}^\top \hat{Y} is the data covariance; \hat{Y} \hat{Y}^\top is a centred inner product matrix, which also has an interpretation as a covariance matrix (Gaussian processes). It expresses correlation and anti-correlation between data points, whereas the standard covariance expresses correlation and anti-correlation between data dimensions.

Mapping of Points. Mapping points to higher dimensions is easy. Figure: a two dimensional Gaussian mapped to three dimensions.

Linear Dimensionality Reduction: Linear Latent Variable Model. Represent the data, Y, with a lower dimensional set of latent variables X. Assume a linear relationship of the form

y_{i,:} = W x_{i,:} + \epsilon_{i,:}, where \epsilon_{i,:} \sim \mathcal{N}(0, \sigma^2 I).

Linear Latent Variable Model: Probabilistic PCA. Define a linear-Gaussian relationship between latent variables and data (graphical model with nodes X, W, Y). Standard latent variable approach: define a Gaussian prior over the latent space X, then integrate out the latent variables.

p(Y | X, W) = \prod_{i=1}^{n} \mathcal{N}(y_{i,:} | W x_{i,:}, \sigma^2 I)

p(X) = \prod_{i=1}^{n} \mathcal{N}(x_{i,:} | 0, I)

p(Y | W) = \prod_{i=1}^{n} \mathcal{N}(y_{i,:} | 0, W W^\top + \sigma^2 I)
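As a quick numerical sanity check (an illustrative sketch, not part of the original slides), the snippet below draws many samples from the generative model y = W x + \epsilon with x ~ N(0, I) and compares their empirical covariance against the analytic marginal covariance W W^\top + \sigma^2 I obtained by integrating out the latent variables.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, sigma2, n = 5, 2, 0.1, 200_000      # data dim, latent dim, noise variance, samples

W = rng.normal(size=(p, q))
X = rng.normal(size=(n, q))                                     # x_i ~ N(0, I)
Y = X @ W.T + rng.normal(scale=np.sqrt(sigma2), size=(n, p))    # y_i = W x_i + eps_i

emp_cov = Y.T @ Y / n                          # empirical second moment (model mean is zero)
marg_cov = W @ W.T + sigma2 * np.eye(p)        # analytic marginal covariance

print(np.max(np.abs(emp_cov - marg_cov)))      # small, and shrinks as n grows
```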

Linear Latent Variable Model II: Probabilistic PCA maximum likelihood solution (Tipping and Bishop 99).

p(Y | W) = \prod_{i=1}^{n} \mathcal{N}(y_{i,:} | 0, C), \quad C = W W^\top + \sigma^2 I

\log p(Y | W) = -\frac{n}{2} \log |C| - \frac{1}{2} \mathrm{tr}(C^{-1} Y^\top Y) + \text{const.}

If U_q are the first q principal eigenvectors of n^{-1} Y^\top Y and the corresponding eigenvalues are \Lambda_q, then

W = U_q L R, \quad L = (\Lambda_q - \sigma^2 I)^{1/2},

where R is an arbitrary rotation matrix.
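A minimal NumPy sketch of this closed-form solution (my own illustration; it additionally sets \sigma^2 to the average of the discarded eigenvalues, the Tipping and Bishop maximum likelihood estimate, a detail not spelled out on the slide, and takes R = I):

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form probabilistic PCA. Y is n x p and assumed zero-mean; q is the latent dimension."""
    n, p = Y.shape
    S = Y.T @ Y / n                              # sample covariance, p x p
    evals, evecs = np.linalg.eigh(S)             # returned in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]   # reorder to descending
    sigma2 = evals[q:].mean()                    # ML noise variance: mean of discarded eigenvalues
    L = np.sqrt(np.diag(evals[:q]) - sigma2 * np.eye(q))
    W = evecs[:, :q] @ L                         # arbitrary rotation R taken as the identity
    return W, sigma2

rng = np.random.default_rng(1)
Y = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(100, 6))
Y -= Y.mean(axis=0)
W, sigma2 = ppca_ml(Y, q=2)
print(W.shape, sigma2)
```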

Linear Latent Variable Model III: Dual Probabilistic PCA. Define a linear-Gaussian relationship between latent variables and data. Novel latent variable approach: define a Gaussian prior over the parameters W, then integrate out the parameters.

p(Y | X, W) = \prod_{i=1}^{n} \mathcal{N}(y_{i,:} | W x_{i,:}, \sigma^2 I)

p(W) = \prod_{i=1}^{p} \mathcal{N}(w_{i,:} | 0, I)

p(Y | X) = \prod_{j=1}^{p} \mathcal{N}(y_{:,j} | 0, X X^\top + \sigma^2 I)

Linear Latent Variable Model IV: Dual Probabilistic PCA maximum likelihood solution (Lawrence 03, Lawrence 05).

p(Y | X) = \prod_{j=1}^{p} \mathcal{N}(y_{:,j} | 0, K), \quad K = X X^\top + \sigma^2 I

\log p(Y | X) = -\frac{p}{2} \log |K| - \frac{1}{2} \mathrm{tr}(K^{-1} Y Y^\top) + \text{const.}

If U'_q are the first q principal eigenvectors of p^{-1} Y Y^\top and the corresponding eigenvalues are \Lambda_q, then

X = U'_q L R, \quad L = (\Lambda_q - \sigma^2 I)^{1/2},

where R is an arbitrary rotation matrix. The solution has exactly the same form as the Probabilistic PCA solution above, with the roles of W and X, and of Y^\top Y and Y Y^\top, exchanged.

Equivalence of Formulations. The two eigenvalue problems are equivalent. Solution for Probabilistic PCA (solves for the mapping):

Y^\top Y U_q = U_q \Lambda_q, \quad W = U_q L R.

Solution for Dual Probabilistic PCA (solves for the latent positions):

Y Y^\top U'_q = U'_q \Lambda_q, \quad X = U'_q L R.

The equivalence comes from U_q = Y^\top U'_q \Lambda_q^{-1/2}. You have probably used this trick to compute PCA efficiently when the number of dimensions is much higher than the number of points.
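The equivalence is easy to check numerically. The sketch below (illustrative only) solves the small n x n eigenvalue problem, recovers the eigenvectors of the large p x p problem through U_q = Y^\top U'_q \Lambda_q^{-1/2}, and compares them with a direct solution.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 30, 300, 3                       # few points, many dimensions
Y = rng.normal(size=(n, p))
Y -= Y.mean(axis=0)

# Small eigenvalue problem: Y Y^T (n x n), solves for the latent positions.
lam, Uprime = np.linalg.eigh(Y @ Y.T)
lam, Uprime = lam[::-1][:q], Uprime[:, ::-1][:, :q]

# Recover the top eigenvectors of the big problem Y^T Y (p x p) without forming it.
U_recovered = Y.T @ Uprime / np.sqrt(lam)

# Direct (expensive) solution of the big problem, for comparison.
_, U_big = np.linalg.eigh(Y.T @ Y)
U_big = U_big[:, ::-1][:, :q]

# The columns agree up to sign.
print(np.allclose(np.abs(U_recovered.T @ U_big), np.eye(q), atol=1e-8))
```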

Non-Linear Latent Variable Model. Inspection of the Dual Probabilistic PCA marginal likelihood,

p(Y | X) = \prod_{j=1}^{p} \mathcal{N}(y_{:,j} | 0, K), \quad K = X X^\top + \sigma^2 I,

shows that the covariance matrix is a covariance function, and we recognise it as the linear kernel. We call this the Gaussian Process Latent Variable Model (GP-LVM): it is a product of Gaussian processes with linear kernels. Replacing the linear kernel with a non-linear kernel gives a non-linear model.

Non-linear Latent Variable Models: Exponentiated Quadratic (EQ) Covariance. The EQ covariance has the form k_{i,j} = k(x_{i,:}, x_{j,:}), where

k(x_{i,:}, x_{j,:}) = \alpha \exp\left( -\frac{\| x_{i,:} - x_{j,:} \|_2^2}{2 l^2} \right).

It is no longer possible to optimise with respect to X via an eigenvalue problem. Instead, find gradients with respect to X, \alpha, l and \sigma^2 and optimise using conjugate gradients:

\arg\min_{X, \alpha, l, \sigma} \; \frac{p}{2} \log \left| K(X, X) + \sigma^2 I \right| + \frac{1}{2} \mathrm{tr}\left( (K(X, X) + \sigma^2 I)^{-1} Y Y^\top \right).
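A compact sketch of this optimisation in Python (illustrative only: it uses L-BFGS with numerical gradients from scipy rather than the hand-derived gradients and conjugate gradients mentioned above, and it initialises X with PCA, a common but not mandated choice):

```python
import numpy as np
from scipy.optimize import minimize

def eq_kernel(X, alpha, l):
    """Exponentiated quadratic covariance between the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-sq / (2.0 * l ** 2))

def neg_log_lik(params, Y, q):
    """Negative log marginal likelihood of the GP-LVM (up to an additive constant)."""
    n, p = Y.shape
    X = params[: n * q].reshape(n, q)
    alpha, l, sigma2 = np.exp(params[n * q:])          # positivity via log parameters
    K = eq_kernel(X, alpha, l) + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * p * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

# Toy data: noisy observations of a 1-D curve embedded in 5-D.
rng = np.random.default_rng(3)
t = np.linspace(0, 2 * np.pi, 40)[:, None]
Y = np.hstack([np.sin(t), np.cos(t), np.sin(2 * t), np.cos(2 * t), t / 5.0])
Y = Y + 0.05 * rng.normal(size=Y.shape)
Y -= Y.mean(axis=0)

q = 1
_, _, Vt = np.linalg.svd(Y, full_matrices=False)       # PCA initialisation of X
X0 = Y @ Vt[:q].T
params0 = np.concatenate([X0.ravel(), np.log([1.0, 1.0, 0.01])])

res = minimize(neg_log_lik, params0, args=(Y, q), method="L-BFGS-B")
X_learned = res.x[: Y.shape[0] * q].reshape(-1, q)
print(res.fun, X_learned.shape)
```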

Let's look at some applications.

1) GPLVM for Character Animation [K. Grochow, S. Martin, A. Hertzmann and Z. Popovic, Siggraph 2004]. Learn a GPLVM from a small mocap sequence, and smooth the latent space by adding noise in order to reduce the number of local minima. Let's replay the same motion. Figure: Style-IK.

Pose synthesis is performed by solving an optimization problem,

\arg\min_{x, y} -\log p(y | x) \quad \text{such that} \quad C(y) = 0,

where the constraints come from a user in an interactive session or from a mocap system. Figure: Style-IK.

2) Shape Priors in Level Set Segmentation. Represent contours with elliptic Fourier descriptors and learn a GPLVM on the parameters of those descriptors. We can then generate closed contours from the latent space. Segmentation is done by non-linear minimization of an image-driven energy which is a function of the latent space.

GPLVM on Contours [V. Prisacariu and I. Reid, ICCV 2011].

Segmentation Results [V. Prisacariu and I. Reid, ICCV 2011].

3) Non-rigid Shape Deformation. Monocular 3D shape recovery is severely under-constrained: complex deformations and low-texture objects. Deformation models are required to disambiguate. Building realistic physics-based models is very complex, so learning the models is a popular alternative.

Global deformation models. State-of-the-art techniques learn global models that require large amounts of training data and must be learned for each new object.

Key observations: (1) locally, all parts of a physically homogeneous surface obey the same deformation rules; (2) deformations of small patches are much simpler than those of a global surface, and thus can be learned from fewer examples. Therefore, learn local deformation models and combine them into a global one representing the particular shape of the object of interest.

Overview of the method. [Diagram: a generative approach combining an image model, mocap data, a deformation model and the pose.]

Combining the deformations. Use a Product of Experts (PoE) paradigm (Hinton 99): high dimensional data subject to low dimensional constraints. A global deformation should be composed of highly probable local ones. For homogeneous materials, all local patches follow the same deformation rules, so we learn a single local model and replicate it to cover the whole object. The same deformation model represents arbitrary shapes and topologies.

Tracking. For each image I_t we have to estimate the state \phi_t = (y_t, x_t). Bayesian formulation of the tracking problem:

p(\phi_t | I_t, X, Y) \propto p(I_t | \phi_t) \, p(y_t | x_t, X, Y) \, p(x_t).

The image likelihood is composed of texture (template matching) and edge information,

p(I_t | \phi_t) = p(T_t | \phi_t) \, p(E_t | \phi_t).

Tracking is done by maximizing the posterior (minimizing its negative logarithm).

Shape deformation estimation [M. Salzmann, R. Urtasun and P. Fua, CVPR 2008].

Incorporating dynamics. The mapping from the latent space to the high dimensional space is

y_{i,:} = W \psi(x_{i,:}) + \eta_{i,:}, \quad \eta_{i,:} \sim \mathcal{N}(0, \sigma^2 I).

We can augment the model with ARMA dynamics,

x_{t+1,:} = P \phi(x_{t:t-\tau,:}) + \gamma_{t,:}, \quad \gamma_{t,:} \sim \mathcal{N}(0, \sigma_d^2 I).

This is called the Gaussian Process Dynamical Model (GPDM) (Wang et al., 05). [Graphical models: the GPLVM with independent latent points X, and the GPDM with a Markov chain X_1 → X_2 → … → X_T generating Y_1, …, Y_T.]
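To make the latent dynamics concrete, here is a small illustrative sketch (not the GPDM learning algorithm itself): given latent positions X, it forms first-order training pairs (x_t, x_{t+1}) and uses standard GP regression with an EQ kernel to predict the next latent state. The kernel parameters and the toy trajectory are made up for the example.

```python
import numpy as np

def eq_kernel(A, B, alpha=1.0, l=0.5):
    """Exponentiated quadratic covariance between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-sq / (2.0 * l ** 2))

# Pretend these are latent positions recovered by a GP-LVM: a noisy circle traced over time.
T = 60
t = np.linspace(0, 4 * np.pi, T)
X = np.stack([np.cos(t), np.sin(t)], axis=1)
X = X + 0.02 * np.random.default_rng(4).normal(size=(T, 2))

# First-order dynamics: inputs are x_t, targets are x_{t+1}.
Xin, Xout = X[:-1], X[1:]
sigma_d2 = 1e-3
K = eq_kernel(Xin, Xin) + sigma_d2 * np.eye(T - 1)

# GP predictive mean of the next latent state given the last observed one.
k_star = eq_kernel(X[-1:], Xin)                 # 1 x (T-1)
x_next = k_star @ np.linalg.solve(K, Xout)      # 1 x 2 predicted latent position
print(x_next)
```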

Model learned for tracking. The model is learned from 6 walking subjects, 1 gait cycle each, on a treadmill at the same speed, with a 20 DOF joint parameterization (no global pose). Figure: density. Figure: randomly generated trajectories.

Tracking results [R. Urtasun, D. Fleet and P. Fua, CVPR 2006].

Estimated latent trajectories [R. Urtasun, D. Fleet and P. Fua, CVPR 2006]. Figure: estimated latent trajectories; (cyan) training data, (black) exaggerated walk, (blue) occlusion.

Visualization of Knee Pathology. Two subjects, four walk gait cycles at each of 9 speeds (3-7 km/hr). Two subjects with a knee pathology.

Does it work all the time? Is training with so little data a bug or a feature?

Problems with the GPLVM. It relies on the optimization of a non-convex function,

L = \frac{p}{2} \ln |K| + \frac{1}{2} \mathrm{tr}(K^{-1} Y Y^\top).

Even with the right dimensionality, these models can result in poor representations if initialized far from the optimum. [Figure: a poorly structured 2D latent space obtained from a bad initialization.] This is even worse if the dimensionality of the latent space is small. As a consequence these models have only been applied to small databases of a single activity.

Solutions. Back-constraints: constrain the inverse mapping to be smooth [Lawrence et al. 06]. Topologies: add smoothness and topological priors, e.g., style-content separation [Urtasun et al. 08]. Dynamics: smooth the latent space [Wang et al. 06]. Continuous dimensionality reduction: add rank priors and reduce the dimensionality as you do the optimization [Geiger et al. 09]. Stochastic learning algorithms [Lawrence et al. 09]. Etc.

Continuous dimensionality reduction [A. Geiger, R. Urtasun and T. Darrell, CVPR 2009].

Stochastic Algorithms [A. Yao, J. Gall, L. Van Gool and R. Urtasun, NIPS 2011]. [Figure: distance matrices and PCA, GPLVM and stochastic GPLVM embeddings for basketball signals, exercise stretching, jumping and walking sequences.]

Humaneva Results [A. Yao, J. Gall, L. Van Gool and R. Urtasun, NIPS 2011]. [Figure: camera C1 and C3 frames for S1 Boxing and S3 Walking.]

Pose estimation error (mean ± std):

Train    Test   [Xu07]   [Li10]         GPLVM           CRBM           imCRBM         Ours
S1       S1     -        -              57.6 ± 11.6     48.8 ± 3.7     58.6 ± 3.9     44.0 ± 1.8
S1,2,3   S1     140.3    -              64.3 ± 19.2     55.4 ± 0.8     54.3 ± 0.5     41.6 ± 0.8
S2       S2     -        68.7 ± 24.7    98.2 ± 15.8     47.4 ± 2.9     67.0 ± 0.7     54.4 ± 1.8
S1,2,3   S2     149.4    -              155.9 ± 48.8    99.1 ± 23.0    69.3 ± 3.3     64.0 ± 2.9
S3       S3     -        69.6 ± 22.2    71.6 ± 10.0     49.8 ± 2.2     51.4 ± 0.9     45.4 ± 1.1
S1,2,3   S3     156.3    -              123.8 ± 16.7    70.9 ± 2.1     43.4 ± 4.1     46.5 ± 1.4

Model                                     Tracking Error
[Pavlovic00] as reported in [Li07]        569.90 ± 209.18
[Lin06] as reported in [Li07]             380.02 ± 74.97
GPLVM                                     121.44 ± 30.7
[Li07]                                    117.0 ± 5.5
Best CRBM [Taylor10]                      75.4 ± 9.7
Ours                                      74.1 ± 3.3

Other extensions.

1) Priors for supervised learning. We introduce a prior based on the Fisher criterion,

p(X) \propto \exp\left\{ -\frac{1}{\sigma_d^2} \, \mathrm{tr}\left( S_w^{-1} S_b \right)^{-1} \right\},

with S_b the between-class scatter matrix and S_w the within-class scatter matrix,

S_b = \sum_{i=1}^{L} \frac{n_i}{N} (M_i - M_0)(M_i - M_0)^\top, \quad S_w = \sum_{i=1}^{L} \frac{n_i}{N} \left[ \frac{1}{n_i} \sum_{k=1}^{n_i} (x_k^{(i)} - M_i)(x_k^{(i)} - M_i)^\top \right],

where X^{(i)} = [x_1^{(i)}, \ldots, x_{n_i}^{(i)}] are the n_i training points of class i, M_i is the mean of the elements of class i, and M_0 is the mean of all the training points of all classes. As before, the model is learned by maximizing p(Y | X) p(X).

Figure: 2D latent spaces learned by D-GPLVM on the oil dataset for different values of \sigma_d [Urtasun et al. 07].
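For concreteness, here is a small NumPy sketch (my own illustration, not the authors' code) that computes S_b, S_w and the resulting log-prior term for a labelled latent configuration. It follows the reconstruction of the prior given above, so the exact functional form should be checked against [Urtasun et al. 07]; sigma_d2 and the toy data are placeholders.

```python
import numpy as np

def discriminative_log_prior(X, labels, sigma_d2=1.0):
    """Log of the Fisher-criterion prior (up to a constant) for latent positions X (n x q)."""
    N, q = X.shape
    M0 = X.mean(axis=0)
    S_b = np.zeros((q, q))
    S_w = np.zeros((q, q))
    for c in np.unique(labels):
        Xc = X[labels == c]
        n_c = len(Xc)
        Mc = Xc.mean(axis=0)
        S_b += (n_c / N) * np.outer(Mc - M0, Mc - M0)
        D = Xc - Mc
        S_w += (n_c / N) * (D.T @ D) / n_c
    J = np.trace(np.linalg.solve(S_w, S_b))     # the Fisher criterion tr(S_w^{-1} S_b)
    return -1.0 / (sigma_d2 * J)                # added to log p(Y|X) during learning

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2)),
               rng.normal(loc=[2.0, 1.0], scale=0.3, size=(20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(discriminative_log_prior(X, labels))      # closer to zero for well-separated classes
```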

2) Hierarchical GP-LVM: stacking Gaussian processes. Regressive dynamics provides a simple hierarchy in which the input space of the GP is governed by another GP; by stacking GPs we can consider more complex hierarchies. Ideally we should marginalise the latent spaces; in practice we seek MAP solutions.

Within-Subject Hierarchy: decomposition of the body. Figure: decomposition of a subject.

Single Subject Run/Walk [N. Lawrence and A. Moore, ICML 2007]. Figure: hierarchical model of a walk and a run.

3) Style-Content Separation and Multi-linear Models. Multiple aspects affect the input signal, and it is interesting to factorize them.

Multilinear models. Style-content separation (Tenenbaum & Freeman 00):

y = \sum_{ij} w_{ij} a_i b_j + \epsilon.

Multi-linear analysis (Vasilescu & Terzopoulos 02):

y = \sum_{ijk} w_{ijk} a_i b_j c_k + \epsilon.

Non-linear basis functions (Elgammal & Lee 2004):

y = \sum_{ij} w_{ij} a_i \phi_j(b) + \epsilon.

Multi (non)-linear models with GPs. In the GPLVM,

y = \sum_j w_j \phi_j(x) + \epsilon = w^\top \Phi(x) + \epsilon, \quad \text{with} \quad E[y \, y'] = \Phi(x)^\top \Phi(x') + \beta^{-1} \delta = k(x, x') + \beta^{-1} \delta.

Multifactor Gaussian process:

y = \sum_{i,j,k,\ldots} w_{ijk\ldots} \, \phi_i^{(1)} \phi_j^{(2)} \phi_k^{(3)} \cdots + \epsilon, \quad \text{with} \quad E[y \, y'] = \prod_i \Phi^{(i)\top} \Phi^{(i)} + \beta^{-1} \delta = \prod_i k_i(x^{(i)}, x'^{(i)}) + \beta^{-1} \delta.

Learning in this model is the same; just the kernel changes.
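A sketch of the multifactor covariance as a product of per-factor EQ kernels (an illustration of the construction above; the factor data for subject style and gait phase are made up):

```python
import numpy as np

def eq_kernel(A, B, alpha=1.0, l=1.0):
    """Exponentiated quadratic covariance between the rows of A and B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-sq / (2.0 * l ** 2))

def multifactor_kernel(factors, beta=100.0):
    """Product of per-factor kernels plus noise: prod_i k_i(x^(i), x'^(i)) + beta^{-1} I."""
    n = factors[0].shape[0]
    K = np.ones((n, n))
    for X_i in factors:
        K *= eq_kernel(X_i, X_i)
    return K + np.eye(n) / beta

rng = np.random.default_rng(6)
n = 50
style = rng.normal(size=(n, 2))       # e.g. a per-frame subject/style factor
phase = rng.normal(size=(n, 1))       # e.g. a per-frame gait-phase factor
K = multifactor_kernel([style, phase])
print(K.shape, bool(np.all(np.linalg.eigvalsh(K) > 0)))   # a valid (positive definite) covariance
```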

Training Data. Each training motion is a collection of poses sharing the same combination of subject (s) and gait (g). Training data: 6 sequences, 314 frames in total.

Results [J. Wang, D. Fleet and A. Hertzmann, ICML 2007].

4) Continuous Character Control. When employing the GPLVM, different motions get too far apart, which makes it difficult to generate animations that transition between motions; back-constraints or topologies are not enough. A new prior enforces connectivity in the graph,

\ln p(X) = w_c \sum_{i,j} \ln K^d_{ij},

with the graph diffusion kernel K^d obtained from

K^d = \exp(\beta H), \quad H = -T^{-1/2} L T^{-1/2},

where L is the graph Laplacian and T is a diagonal matrix with T_{ii} = \sum_j w(x_i, x_j),

L_{ij} = \sum_k w(x_i, x_k) \text{ if } i = j, \quad -w(x_i, x_j) \text{ otherwise},

and w(x_i, x_j) = \| x_i - x_j \|^{-p} measures similarity.
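A sketch of the connectivity term in Python (my own illustration; the inverse-distance weights and the sign of H follow the reconstruction above, so treat the details as assumptions rather than the authors' exact formulation):

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import cdist

def connectivity_log_prior(X, w_c=1.0, beta=1.0, p=2.0, eps=1e-12):
    """w_c * sum_ij ln K^d_ij, with K^d = expm(beta * H) a graph diffusion kernel over the latent points."""
    D = cdist(X, X)
    W = 1.0 / (D ** p + eps)                 # inverse-distance similarity weights
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                     # graph Laplacian
    T_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    H = -T_inv_sqrt @ L @ T_inv_sqrt         # negative normalized Laplacian
    Kd = expm(beta * H)                      # graph diffusion kernel
    return w_c * np.sum(np.log(np.maximum(Kd, eps)))

X = np.random.default_rng(7).normal(size=(30, 2))
print(connectivity_log_prior(X))
```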

Embeddings: Walking. Figure: walking embeddings learned (a) without the connectivity term, (b) with w_c = 0.1, and (c) with w_c = 1.0.

Embeddings: Punching. Figure: embeddings for the punching task (a) with and (b) without the connectivity term.

Video Results [S. Levine, J. Wang, A. Haraux, Z. Popovic and V. Koltun, Siggraph 2012].