Non-linear Regression Techniques: Part II
Regression Algorithms in this Course: support vector regression (Support Vector Machine), relevance vector regression (Relevance Vector Machine), Gaussian process regression, boosting random projections, boosting random Gaussians, gradient boosting, random forest regression, locally weighted projection regression (not covered; replaced by one hour to answer questions about the mini-project).
Regression Algorithms in this Course: Gaussian Process Regression.
Probabilistic Regression (PR). PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form $y = f(x, w) = w^T x$, with $w, x \in \mathbb{R}^N$. If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\epsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then $y = w^T x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$. Where have we seen this before?
Probabilistic Regression. Training set of M pairs of data points: $\{X, \mathbf{y}\} = \{x^i, y^i\}_{i=1}^M$. Likelihood of the regressive model $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$: since the data points are independently and identically distributed (i.i.d.),
$p(\mathbf{y} \mid X, w, \sigma) = \prod_{i=1}^M p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^M \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$.
Parameters of the model: $w$, $\sigma$.
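To make the likelihood concrete, here is a minimal NumPy sketch (not part of the original slides) that evaluates the log of this product for given $w$ and $\sigma$; the variable names are illustrative.

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    """Log of prod_i N(y^i | w^T x^i, sigma^2) for i.i.d. data.

    X is (N, M) with one datapoint per column, y is (M,), w is (N,).
    """
    residuals = y - w @ X                     # y^i - w^T x^i for every i
    M = y.shape[0]
    return (-0.5 * np.sum(residuals ** 2) / sigma ** 2
            - M * np.log(np.sqrt(2.0 * np.pi) * sigma))
```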
Probabilistic Regression. Training set of M pairs of data points: $\{X, \mathbf{y}\} = \{x^i, y^i\}_{i=1}^M$. Likelihood of the regressive model $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$: $\mathbf{y} \sim p(\mathbf{y} \mid X, w, \sigma)$. Prior distribution on the parameter $w$: $p(w) = N(0, \Sigma_w) \propto \exp\left(-\frac{1}{2} w^T \Sigma_w^{-1} w\right)$. Parameters of the model: $w$, $\sigma$. Hyperparameter given by the user: $\Sigma_w$.
Probabilistic Regression. Prior on $w$: $p(w) = N(0, \Sigma_w) \propto \exp\left(-\frac{1}{2} w^T \Sigma_w^{-1} w\right)$. We estimate the conditional distribution of $w$ given the data using Bayes' rule: posterior = (likelihood x prior) / marginal likelihood, i.e. $p(w \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, w)\, p(w)}{p(\mathbf{y} \mid X)}$, where the marginal likelihood $p(\mathbf{y} \mid X)$ can be dropped since it does not depend on $w$. The posterior distribution of $w$ is Gaussian: $p(w \mid X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X \mathbf{y},\; \left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1}\right)$.
The conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia). The posterior distribution of $w$ is Gaussian: $p(w \mid X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X \mathbf{y},\; \left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1}\right)$.
Probabilistic Regression. The expectation over the posterior distribution gives the best estimate: $\bar{w} = E\!\left[p(w \mid X, \mathbf{y})\right] = \frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X \mathbf{y}$. This is called the maximum a posteriori (MAP) estimate of $w$.
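A minimal sketch of this computation, assuming $X$ is stored with one datapoint per column (names are illustrative, not from the slides):

```python
import numpy as np

def posterior_w(X, y, sigma, Sigma_w):
    """Posterior p(w | X, y) = N(w_map, A^{-1}) of Bayesian linear regression.

    A = (1/sigma^2) X X^T + Sigma_w^{-1}; the posterior mean is the MAP estimate.
    """
    A = X @ X.T / sigma ** 2 + np.linalg.inv(Sigma_w)
    cov_w = np.linalg.inv(A)
    w_map = cov_w @ X @ y / sigma ** 2
    return w_map, cov_w
```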
Probabilistic Regression. We can now compute the predictive distribution of $y^*$ at a new input $x^*$: $p(y^* \mid x^*, X, \mathbf{y}) = \int p(y^* \mid x^*, w)\, p(w \mid X, \mathbf{y})\, dw = N\!\left(\frac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, with $A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$.
Probabilistic Regression. $p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, with $A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$, where $x^*$ is the testing point and $X, \mathbf{y}$ are the training datapoints. The estimate of $y^*$ given a test point $x^*$ is $\bar{y}^* = E\!\left[p(y^* \mid x^*)\right] = \frac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y}$.
Probabilistic Regression. $p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, $\bar{y}^* = E\!\left[p(y^* \mid x^*, X, \mathbf{y})\right] = \frac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y}$, with $A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$. The variance gives a measure of the uncertainty of the prediction: $\mathrm{var}\!\left[p(y^* \mid x^*)\right] = x^{*T} A^{-1} x^*$.
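Putting the last two slides together, here is a minimal, self-contained sketch of the predictive mean and variance at a test point $x^*$ (illustrative names, not from the slides):

```python
import numpy as np

def predict_linear(x_star, X, y, sigma, Sigma_w):
    """Mean and variance of p(y* | x*, X, y) for the linear Bayesian model."""
    A_inv = np.linalg.inv(X @ X.T / sigma ** 2 + np.linalg.inv(Sigma_w))
    mean = x_star @ A_inv @ X @ y / sigma ** 2   # E[y* | x*, X, y]
    var = x_star @ A_inv @ x_star                # uncertainty of the prediction
    return mean, var
```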
Gaussian Process Regression. How do we extend the simple linear Bayesian regressive model to nonlinear regression? Linear model: $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Replace $x$ by a nonlinear transformation $\phi(x)$: $y = w^T \phi(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. This defines a distribution over functions.
Gaussian Process Regression. With the nonlinear transformation $\phi$, the predictive distribution of the linear model,
$p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$,
becomes
$p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\tfrac{1}{\sigma^2} \phi(x^*)^T A^{-1} \Phi(X) \mathbf{y},\; \phi(x^*)^T A^{-1} \phi(x^*)\right)$, with $A = \tfrac{1}{\sigma^2} \Phi(X) \Phi(X)^T + \Sigma_w^{-1}$.
Gaussian Process Regression. Again, a Gaussian distribution:
$p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\tfrac{1}{\sigma^2} \phi(x^*)^T A^{-1} \Phi(X) \mathbf{y},\; \phi(x^*)^T A^{-1} \phi(x^*)\right)$, with $A = \tfrac{1}{\sigma^2} \Phi(X) \Phi(X)^T + \Sigma_w^{-1}$.
Gaussian Process Regression. Rewriting the predictive distribution so that $\phi$ only appears through inner products (see the supplement for the steps):
$p(y^* \mid x^*, X, \mathbf{y}) = N\!\left(\phi(x^*)^T \Sigma_w \Phi(X) \left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y},\; \phi(x^*)^T \Sigma_w \phi(x^*) - \phi(x^*)^T \Sigma_w \Phi(X) \left[K(X,X) + \sigma^2 I\right]^{-1} \Phi(X)^T \Sigma_w \phi(x^*)\right)$.
Define the kernel as $k(x, x') = \phi(x)^T \Sigma_w \phi(x')$: an inner product in feature space.
Gaussian Process Regression. The predictive mean can be written as a weighted sum of kernel evaluations at the training datapoints (see the supplement for the steps):
$\bar{y}^* = E\!\left[y^* \mid x^*, X, \mathbf{y}\right] = \sum_{i=1}^M \alpha_i\, k(x^*, x^i)$, with $\alpha = \left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y}$ and $k(x, x') = \phi(x)^T \Sigma_w \phi(x')$ an inner product in feature space.
Gaussian Process Regression. $\bar{y}^* = \sum_{i=1}^M \alpha_i\, k(x^*, x^i)$, with $\alpha = \left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y}$ and $\sigma^2 > 0$. All datapoints are used in the computation!
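A minimal sketch of this kernelized prediction, assuming k is any kernel function taking two points (names are illustrative):

```python
import numpy as np

def gpr_mean(x_star, X_train, y, sigma, k):
    """Predictive mean y*(x*) = sum_i alpha_i k(x*, x^i), alpha = (K + sigma^2 I)^{-1} y."""
    M = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    alpha = np.linalg.solve(K + sigma ** 2 * np.eye(M), y)
    return sum(a * k(x_star, xi) for a, xi in zip(alpha, X_train))
```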
Gaussian Process Regression. The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood; see the class supplement. Figures: fits with an RBF kernel of width 0.1 and of width 0.5.
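As a sketch of what "maximizing the marginal likelihood" means in practice, the quantity below can be evaluated for candidate kernel widths and noise levels and the best pair kept (assumed helper, not from the slides):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma):
    """log p(y | X) = -0.5 y^T (K + s^2 I)^{-1} y - 0.5 log|K + s^2 I| - (M/2) log 2 pi."""
    Ky = K + sigma ** 2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)                        # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))              # -0.5 log|Ky|
            - 0.5 * len(y) * np.log(2.0 * np.pi))
```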
Gaussian Process Regression. Sensitivity to the choice of kernel width (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential): $k(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{l^2}\right)$. Figures: kernel width = 0.1 vs. kernel width = 0.5.
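For reference, a one-line version of this kernel that could be passed to the prediction sketches above (the exact scaling of the exponent, e.g. an extra factor of 2, varies between textbooks):

```python
import numpy as np

def rbf(x, x_prime, l=0.1):
    """Squared-exponential (RBF) kernel with lengthscale l."""
    d = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return np.exp(-np.dot(d, d) / l ** 2)
```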
Gaussian Process Regression. $\bar{y}^* = \sum_{i=1}^M \alpha_i\, k(x^*, x^i)$, $\alpha = \left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y}$. The value of the noise $\sigma$ needs to be pre-set by hand. The predictive covariance is
$\mathrm{cov}\!\left[p(y^* \mid x^*)\right] = K(x^*, x^*) - K(x^*, X)\left[K(X,X) + \sigma^2 I\right]^{-1} K(X, x^*)$.
The larger the noise, the larger the uncertainty. The noise is $\leq 1$. Figures: $\sigma = 0.05$ vs. $\sigma = 0.01$.
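A minimal sketch of this predictive variance, matching the covariance expression above (illustrative names; k is any kernel function):

```python
import numpy as np

def gpr_variance(x_star, X_train, sigma, k):
    """Predictive variance k(x*,x*) - k(x*,X) [K + sigma^2 I]^{-1} k(X,x*)."""
    M = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    k_star = np.array([k(x_star, xi) for xi in X_train])
    return k(x_star, x_star) - k_star @ np.linalg.solve(
        K + sigma ** 2 * np.eye(M), k_star)
```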
Gaussian Process Regression. Figures: fit with low noise ($\sigma = 0.05$) vs. fit with high noise.
Gaussian Process Regression. $\bar{y}^* = \sum_{i=1}^M \alpha_i\, k(x^*, x^i)$, $\alpha = \left[K(X,X) + \sigma^2 I\right]^{-1} \mathbf{y}$. The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints, e.g. the Gibbs non-stationary covariance function, in which the lengthscale is a function of $x$:
$k(x, x') = \prod_{i=1}^{N} \left(\frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2}\right)^{1/2} \exp\!\left(-\sum_{i=1}^{N} \frac{(x_i - x_i')^2}{l_i(x)^2 + l_i(x')^2}\right)$.
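A minimal sketch of this Gibbs covariance, with an arbitrary toy choice of the lengthscale function l(x) (all names are illustrative):

```python
import numpy as np

def gibbs_kernel(x, x_prime, l):
    """Non-stationary Gibbs kernel; l maps a point to per-dimension lengthscales."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    xp = np.atleast_1d(np.asarray(x_prime, dtype=float))
    lx, lxp = np.atleast_1d(l(x)), np.atleast_1d(l(xp))
    denom = lx ** 2 + lxp ** 2
    prefactor = np.prod(np.sqrt(2.0 * lx * lxp / denom))
    return prefactor * np.exp(-np.sum((x - xp) ** 2 / denom))

# Toy lengthscale that grows away from the origin: smoother fit far from 0.
k_nonstat = lambda a, b: gibbs_kernel(a, b, l=lambda x: 0.1 + 0.5 * np.abs(x))
```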
Gaussian Process Regression. Linear model: $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Non-linear model: $y = w^T \phi(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Both models follow a zero-mean Gaussian distribution: $E[y] = E[w^T x + \epsilon] = E[w]^T x + 0 = 0$, and likewise $E[y] = E[w^T \phi(x) + \epsilon] = E[w]^T \phi(x) + 0 = 0$, since the prior on $w$ has zero mean. Hence GPR predicts $y = 0$ away from the datapoints! (SVR predicts $y = b$ away from the datapoints; see the exercise session.)
Examples of application of GPR. Striking a match is a task that requires careful control of the force in interaction: push enough to light the match, but not so much as to break it. Kronander, K. and Billard, A. (2013). Learning Compliant Manipulation through Kinesthetic and Tactile Human-Robot Interaction. IEEE Transactions on Haptics. doi: 10.1109/TOH.2013.54.
Examples of application of GPR. The stiffness profile is encoded as a function of time using GPR. The shaded area corresponds to the striking phase. We see that the stiffness must be decreased just before entering into contact, and again during contact when the match lights up; it can increase again when the robot moves back into free space.
Examples of application of GPR. Building a 3D model of an object from tactile information can be useful to guide manipulation when the object is no longer visible.
Examples of application of GPR. Distance to the surface: $y$. Inside the object: $y < 0$. On the surface: $y = 0$. Outside the object: $y > 0$. Point on the surface: $x \in \mathbb{R}^3$. Learn a mapping $y = f(x)$ with GPR to determine how far one is from the surface.
Examples of application of GPR. Normal to the surface: $z$. Learn a mapping $z = g(x)$ with GPR to determine the normal to the surface (this needs 3 GPRs, one for each coordinate of the vector $z$). The distance and normal to the surface can be used in an optimization framework to determine the optimal posture of the robot fingers on the object.
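As a rough illustration of these two mappings, here is a sketch using scikit-learn's GaussianProcessRegressor on toy data (a sphere); the dataset, kernel, and hyperparameters are placeholder assumptions, not the setup used in the cited work:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-in data: points on/inside/outside a unit sphere instead of real
# tactile or camera samples; y_dist is the signed distance, normals the surface normals.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(300, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
radii = np.repeat([0.8, 1.0, 1.2], 100)
X_train = dirs * radii[:, None]
y_dist = radii - 1.0            # < 0 inside, 0 on the surface, > 0 outside
normals = dirs                  # sphere normals point radially outward

kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-4)
gpr_dist = GaussianProcessRegressor(kernel=kernel).fit(X_train, y_dist)
gpr_norm = [GaussianProcessRegressor(kernel=kernel).fit(X_train, normals[:, i])
            for i in range(3)]  # one GPR per coordinate of the normal, as on the slide

x_query = np.array([[0.0, 0.0, 1.1]])
dist, std = gpr_dist.predict(x_query, return_std=True)   # distance + uncertainty
n_hat = np.array([g.predict(x_query)[0] for g in gpr_norm])
```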
Examples of application of GPR. GPR can be used to model the shape of objects. Top: 3D points sampled either from a camera or from tactile sensing. Bottom: 3D shape reconstructed by GPR. The arrows represent the predicted normals at the surface. (El Khoury, S., Li, M., and Billard, A. (2013). On the generation of a variety of grasps. Robotics and Autonomous Systems, 61(12):1335.)
Regression Algorithms in this Course: support vector regression (Support Vector Machine), relevance vector regression (Relevance Vector Machine), Gaussian process regression, boosting random projections, boosting random Gaussians, gradient boosting, random forest regression, locally weighted projection regression.
Regression Algorithms in this Course: Relevance Vector Machine, boosting random Gaussians, gradient boosting.
Gradient Boosting. Choose some regression technique (any we have seen so far). Apply boosting to train and combine the set of estimates $\hat{f}^1, \hat{f}^2, \ldots, \hat{f}^m$.
Gradient Boosting. Aggregate to get the final estimate $\hat{f} = \frac{1}{m} \sum_{i=1}^m \hat{f}^i$.
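A minimal sketch of one standard form of gradient boosting with squared loss: each new function is fit to the residuals of the current ensemble, and the estimates are then combined additively. The regression-tree base learner and the learning rate are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, m=50, lr=0.1):
    """Fit m regressors by gradient boosting with squared loss (residual fitting)."""
    f0 = float(np.mean(y))                    # constant initial estimate
    pred = np.full(len(y), f0)
    models = []
    for _ in range(m):
        h = DecisionTreeRegressor(max_depth=3).fit(X, y - pred)  # fit residuals
        pred += lr * h.predict(X)
        models.append(h)
    return f0, models

def boost_predict(X, f0, models, lr=0.1):
    """Combine the m estimates into the final prediction."""
    return f0 + lr * sum(h.predict(X) for h in models)
```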
Gradient Boosting using squared loss. Typical example of a dataset with imbalanced data: the data was hand-drawn, so there are more datapoints in regions where the hand slowed down. As a result, the fit is very good in the regions with a high density of datapoints and less good in other regions.
Gradient Boosting using squared loss and twice as many functions. Better results, i.e. a smoother fit, are obtained when increasing the number of functions used for the fit.
Summary. We have seen a few different techniques to perform non-linear regression in machine learning. The techniques differ in their algorithms and in the number of hyperparameters. Some techniques (GPR, RVR) provide a measure of the uncertainty of the model, which can be used to determine when the inference is trustworthy. Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few local models). Other techniques (GPR) are meant to provide very accurate estimates of the data, at the cost of retaining all datapoints for retrieval.