ADVANCED MACHINE LEARNING. Non-linear regression techniques Part II


1 Non-linear regression techniques Part II

2 Regression Algorithms in this Course: Support vector regression (SVM), Relevance vector regression (RVM), Gaussian process regression, Random forest regression, Boosting with random projections, Boosting with random Gaussians, Gradient boosting, Locally weighted projection regression (not covered; replaced by one hour to answer questions about the mini-project).

3 Regression Algorithms in this Course, covered in this lecture: Gaussian process regression, Random forest regression.

4 Probabilistic Regression (PR). PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables y and x by building a linear model of the form: $y = f(x; w) = w^T x$, with $w, x \in \mathbb{R}^N$. If one assumes that the observed values of y differ from f(x) by an additive noise that follows a zero-mean Gaussian distribution (such an assumption consists of putting a prior distribution over the noise), then: $y = w^T x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$. Where have we seen this before?

6 Probabilistic Regression. Training set of M pairs of data points: $\{X, \mathbf{y}\} = \{x^i, y^i\}_{i=1}^{M}$. Likelihood of the regressive model $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$: since the data points are independently and identically distributed (i.i.d.), $p(\mathbf{y} \mid X, w, \sigma) = \prod_{i=1}^{M} p(y^i \mid x^i, w, \sigma) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^i - w^T x^i)^2}{2\sigma^2}\right)$. The parameters of the model are $w$ and $\sigma$.
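As a small illustration of this i.i.d. likelihood (a sketch with assumed variable names: the columns of X are the inputs $x^i$, y holds the M targets), the log-likelihood could be evaluated as:

```python
import numpy as np

def log_likelihood(w, X, y, sigma):
    """log p(y | X, w, sigma) for y^i = w^T x^i + eps, eps ~ N(0, sigma^2).
    Columns of X are the training inputs x^i; y holds the M targets y^i."""
    resid = y - X.T @ w                     # residuals y^i - w^T x^i
    M = len(y)
    return -0.5 * M * np.log(2 * np.pi * sigma**2) - 0.5 * (resid**2).sum() / sigma**2
```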

7 Probabilistic Regression. Training set of M pairs of data points: $\{X, \mathbf{y}\} = \{x^i, y^i\}_{i=1}^{M}$. Likelihood of the regressive model: $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, $\mathbf{y} \sim p(\mathbf{y} \mid X, w, \sigma)$. Prior model on the distribution of the parameters $w$: $p(w) = N(0, \Sigma_w) \propto \exp\left(-\tfrac{1}{2} w^T \Sigma_w^{-1} w\right)$. Parameters of the model: $w$, $\sigma$. Hyperparameters (given by the user): $\Sigma_w$.

8 Probabilistic Regression. Prior on $w$: $p(w) = N(0, \Sigma_w) \propto \exp\left(-\tfrac{1}{2} w^T \Sigma_w^{-1} w\right)$. Estimate the conditional distribution of $w$ given the data using Bayes' rule: posterior = (likelihood x prior) / marginal likelihood, i.e. $p(w \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, w)\, p(w)}{p(\mathbf{y} \mid X)}$ (the marginal likelihood can be dropped, as it does not depend on $w$). The posterior distribution on $w$ is Gaussian: $p(w \mid X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$.
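A minimal numerical sketch of this posterior computation (the toy data, the choice $\Sigma_w = I$, and all variable names are assumptions for illustration, not from the slides):

```python
import numpy as np

# Toy 1-D regression data; each column of X is one input x^i (with a bias feature).
rng = np.random.default_rng(0)
M, sigma = 20, 0.1
x = rng.uniform(-1, 1, M)
X = np.vstack([x, np.ones(M)])                    # shape (N, M), here N = 2
y = X.T @ np.array([2.0, -0.5]) + sigma * rng.normal(size=M)

Sigma_w = np.eye(2)                               # prior covariance on w
A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)   # A = (1/sigma^2) X X^T + Sigma_w^{-1}
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma**2                 # posterior mean (also the MAP estimate)
w_cov = A_inv                                     # posterior covariance
```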

9 The conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia). Posterior distribution on $w$ is Gaussian: $p(w \mid X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$.

10 Probabilistic Regression. The expectation over the posterior distribution gives the best estimate: $\hat{w} = E_{p(w \mid X, \mathbf{y})}[w] = \tfrac{1}{\sigma^2} A^{-1} X \mathbf{y}$. This is called the maximum a posteriori (MAP) estimate of $w$. Recall $p(w \mid X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$.

11 Probabilistic Regression. We can now compute the posterior distribution of $y^*$ at a test point $x^*$: $p(y^* \mid x^*, X, \mathbf{y}) = \int p(y^* \mid x^*, w)\, p(w \mid X, \mathbf{y})\, dw = N\left(\tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$ and $p(w \mid X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right)$.

12 Probabilistic Regression. $p(y^* \mid x^*, X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$ ($x^*$: testing point; $X$: training datapoints). The estimate of $y^*$ given a test point $x^*$ is given by: $\hat{y}^* = E[p(y^* \mid x^*)] = \tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y}$.

14 Probabilistic Regression. $p(y^* \mid x^*, X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$, with $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$ and $\hat{y}^* = E[p(y^* \mid x^*, X, \mathbf{y})] = \tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y}$. The variance gives a measure of the uncertainty of the prediction: $\operatorname{var}[p(y^* \mid x^*)] = x^{*T} A^{-1} x^*$.
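Continuing the numerical sketch introduced after the posterior formula (the test point and its bias feature are again assumptions), the predictive mean and variance follow directly:

```python
x_star = np.array([0.3, 1.0])                 # test input, with the same bias feature
y_mean = x_star @ A_inv @ X @ y / sigma**2    # E[y* | x*, X, y] = (1/sigma^2) x*^T A^{-1} X y
y_var = x_star @ A_inv @ x_star               # var[y* | x*] = x*^T A^{-1} x*
```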

15 Gaussian Process Regression. How to extend the simple linear Bayesian regressive model for nonlinear regression? $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.

16 Gaussian Process Regression. How to extend the simple linear Bayesian regressive model for nonlinear regression? $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$ becomes $y = w^T \phi(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$.

17 Gaussian Process Regression. How to extend the simple linear Bayesian regressive model for nonlinear regression? $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$ becomes $y = w^T \phi(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$: a distribution over functions.

19 Gaussian Process Regression. How to extend the simple linear Bayesian regressive model for nonlinear regression? Linear case: $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$, with $p(y^* \mid x^*, X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} x^{*T} A^{-1} X \mathbf{y},\; x^{*T} A^{-1} x^*\right)$ and $A = \tfrac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$. Non-linear transformation $x \mapsto \phi(x)$: $p(y^* \mid x^*, X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} \phi(x^*)^T A^{-1} \Phi\, \mathbf{y},\; \phi(x^*)^T A^{-1} \phi(x^*)\right)$, with $A = \tfrac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$ and $\Phi = \phi(X)$.

20 Gaussian Process Regression. Again, a Gaussian distribution: $p(y^* \mid x^*, X, \mathbf{y}) = N\left(\tfrac{1}{\sigma^2} \phi(x^*)^T A^{-1} \Phi\, \mathbf{y},\; \phi(x^*)^T A^{-1} \phi(x^*)\right)$, with $A = \tfrac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$.

21 Gaussian Process Regression. Rewriting the predictive distribution (see supplement for the steps): $p(y^* \mid x^*, X, \mathbf{y}) = N\Big( \phi(x^*)^T \Sigma_w \Phi \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y},\; \phi(x^*)^T \Sigma_w \phi(x^*) - \phi(x^*)^T \Sigma_w \Phi \left(K(X,X) + \sigma^2 I\right)^{-1} \Phi^T \Sigma_w \phi(x^*) \Big)$. Define the kernel as: $k(x, x') = \phi(x)^T \Sigma_w \phi(x')$, an inner product in feature space.

22 Gaussian Process Regression. The predictive mean becomes (see supplement for the steps): $\hat{y}^* = E[y^* \mid x^*, X, \mathbf{y}] = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\boldsymbol{\alpha} = \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y}$ and kernel $k(x, x') = \phi(x)^T \Sigma_w \phi(x')$, an inner product in feature space.
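A minimal GPR sketch following these formulas, assuming the Gaussian/RBF kernel defined later on slide 26; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """k(x, x') = exp(-||x - x'||^2 / l); rows of A and B are input points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / lengthscale)

def gpr_predict(X_train, y_train, X_test, sigma=0.05, lengthscale=0.5):
    """Predictive mean and variance of GPR with noise level sigma."""
    K = rbf_kernel(X_train, X_train, lengthscale) + sigma**2 * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)                  # alpha = (K(X,X) + sigma^2 I)^{-1} y
    k_star = rbf_kernel(X_test, X_train, lengthscale)    # k(x*, x^i)
    mean = k_star @ alpha                                # sum_i alpha_i k(x*, x^i)
    v = np.linalg.solve(K, k_star.T)
    var = 1.0 - (k_star * v.T).sum(axis=1)               # k(x*, x*) = 1 for this kernel
    return mean, var
```

The variance line uses the predictive covariance formula that appears on slide 28.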

24 Gaussian Process Regression. $\hat{y}^* = E[y^* \mid x^*, X, \mathbf{y}] = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\boldsymbol{\alpha} = \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y}$ and $k(x^*, x^i) > 0$. All datapoints are used in the computation!

25 Gaussian Process Regression. $\hat{y}^* = E[y^* \mid x^*, X, \mathbf{y}] = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\boldsymbol{\alpha} = \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y}$. The kernel and its hyperparameters are given by the user. These can be optimized through maximum likelihood over the marginal likelihood (see the class supplement). [Figure: fits with an RBF kernel, width = 0.1 and width = 0.5]

26 Gaussian Process Regression. Sensitivity to the choice of kernel width (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential): $k(x, x') = e^{-\|x - x'\|^2 / l}$. [Figure: kernel width = 0.1]

27 Gaussian Process Regression. Sensitivity to the choice of kernel width (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential): $k(x, x') = e^{-\|x - x'\|^2 / l}$. [Figure: kernel width = 0.5]
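Using the gpr_predict sketch above, the sensitivity to the width could be reproduced by sweeping the lengthscale on toy data (the data and values here are assumptions for illustration):

```python
rng = np.random.default_rng(1)
X_train = np.linspace(-1, 1, 30)[:, None]
y_train = np.sin(3 * X_train[:, 0]) + 0.05 * rng.normal(size=30)
X_test = np.linspace(-1.5, 1.5, 200)[:, None]

for l in (0.1, 0.5):
    mean, var = gpr_predict(X_train, y_train, X_test, sigma=0.05, lengthscale=l)
    # Small l: wiggly fit, variance grows quickly away from the data;
    # large l: smoother fit, slower growth of the predictive variance.
```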

28 Gaussian Process Regression. $\hat{y}^* = E[y^* \mid x^*, X, \mathbf{y}] = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\boldsymbol{\alpha} = \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y}$. The value for the noise $\sigma$ needs to be pre-set by hand. The predictive covariance is $\operatorname{cov}[p(y^* \mid x^*)] = k(x^*, x^*) - k(x^*, X)\left(K(X,X) + \sigma^2 I\right)^{-1} k(X, x^*)$. The larger the noise, the more uncertainty. The noise is $\leq 1$. [Figure: fits with $\sigma$ = 0.05 and $\sigma$ = 0.01]

29 Gaussian Process Regression. [Figure: low noise, $\sigma$ = 0.05]

30 Gaussian Process Regression. [Figure: high noise]

31 Gaussian Process Regression. $\hat{y}^* = E[y^* \mid x^*, X, \mathbf{y}] = \sum_{i=1}^{M} \alpha_i\, k(x^*, x^i)$, with $\boldsymbol{\alpha} = \left(K(X,X) + \sigma^2 I\right)^{-1} \mathbf{y}$. The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints. Gibbs non-stationary covariance function (the lengthscale is a function of $x$): $k(x, x') = \prod_{i=1}^{N} \left( \frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2} \right)^{1/2} \exp\left( - \sum_{i=1}^{N} \frac{(x_i - x_i')^2}{l_i(x)^2 + l_i(x')^2} \right)$.
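A sketch of this Gibbs covariance function, assuming a user-supplied lengthscale function l(x); the example lengthscale function at the end is purely illustrative:

```python
import numpy as np

def gibbs_kernel(A, B, lengthscale_fn):
    """Gibbs non-stationary kernel; rows of A and B are points, lengthscale_fn(X) returns l_i(x) per dimension."""
    lA, lB = lengthscale_fn(A), lengthscale_fn(B)              # shapes (MA, N) and (MB, N)
    s = lA[:, None, :] ** 2 + lB[None, :, :] ** 2              # l_i(x)^2 + l_i(x')^2
    prefactor = np.prod(np.sqrt(2.0 * lA[:, None, :] * lB[None, :, :] / s), axis=-1)
    d2 = (A[:, None, :] - B[None, :, :]) ** 2                  # (x_i - x'_i)^2
    return prefactor * np.exp(-(d2 / s).sum(axis=-1))

# Example: lengthscale grows with |x|, so the fit can adapt to the local data density.
lengthscale_fn = lambda X: 0.2 + 0.5 * np.abs(X)
```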

32 Gaussian Process Regression. Linear model: $y = w^T x + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Non-linear model: $y = w^T \phi(x) + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. Both models follow a zero-mean Gaussian distribution: since $w \sim N(0, \Sigma_w)$, $E[y] = E[w^T x] = x^T E[w] = 0$ and $E[y] = E[w^T \phi(x)] = \phi(x)^T E[w] = 0$. Hence both predict $y = 0$ away from the datapoints! (SVR predicts $y = b$ away from the datapoints; see the exercise session.)

33 Examples of application of GPR. Striking a match is a task that requires careful control of the force of interaction: push enough to light the match, but not too much, so as not to break it. Kronander, K. and Billard, A. (2013). Learning Compliant Manipulation through Kinesthetic and Tactile Human-Robot Interaction. IEEE Transactions on Haptics. doi: 10.1109/TOH.2013.54.

34 Examples of application of GPR. The stiffness profile is encoded as a time-varying input using GPR. The shaded area corresponds to the striking phase. We see that the stiffness must be decreased just before entering into contact, and again during contact when the match lights up. The stiffness can increase again when the robot moves back into free space.

35 Examples of application of GPR. Building a 3D model of an object from tactile information can be useful to guide manipulation of the object when the object is no longer visible.

36 Examples of application of GPR. Distance to the surface $y$: inside the object, $y < 0$; on the surface, $y = 0$; outside the surface, $y > 0$. A point on the surface is $x \in \mathbb{R}^3$. Learn a mapping $y = f(x)$ with GPR to determine how far one is from the surface.
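A toy sketch of this idea, reusing the gpr_predict function from the earlier GPR sketch; the unit-sphere "object", the labels, and all names are assumptions for illustration:

```python
rng = np.random.default_rng(2)
X_surface = rng.normal(size=(100, 3))
X_surface /= np.linalg.norm(X_surface, axis=1, keepdims=True)    # toy object: unit sphere
X_inside, X_outside = 0.5 * X_surface, 1.5 * X_surface           # points inside / outside

X_train = np.vstack([X_inside, X_surface, X_outside])
y_train = np.concatenate([-np.ones(100),    # y < 0 inside the object
                          np.zeros(100),    # y = 0 on the surface
                          np.ones(100)])    # y > 0 outside
X_query = rng.normal(size=(5, 3))
dist, dist_var = gpr_predict(X_train, y_train, X_query, sigma=0.05, lengthscale=0.5)
```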

37 Examples of application of GPR. Normal to the surface: $z$. Learn a mapping $z = g(x)$ with GPR to determine the normal to the surface (this requires 3 GPRs, one for each coordinate of the vector $z$). The distance and normal to the surface can be used in an optimization framework to determine the optimal posture of the robot fingers on the object.

38 Examples of application of GPR. GPR can be used to model the shape of objects. Top: 3D points sampled either from a camera or from tactile sensing. Bottom: 3D shape reconstructed by GPR. The arrows represent the predicted normals at the surface. (El Khoury, S., Li, M., and Billard, A. (2013). On the generation of a variety of grasps. Robotics and Autonomous Systems, 61(12):1335-1349.)

39 Regression Algorithms in this Course: Support vector regression (SVM), Relevance vector regression (RVM), Gaussian process regression, Random forest regression, Boosting with random projections, Boosting with random Gaussians, Gradient boosting, Locally weighted projection regression.

40 Regression Algorithms in this Course: Relevance vector machine, Boosting random Gaussians, Gradient boosting.

41 Gradient Boosting. Choose some regressive technique (any we have seen so far). Apply boosting to train and combine the set of estimates $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_m$.

42 Gradient Boosting. Aggregate to get the final estimate: $\hat{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i$.
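The slides aggregate the estimates by averaging; one common realisation of gradient boosting with squared loss (used on the next slides) instead fits each new base learner to the residuals, i.e. the negative gradient, of the current ensemble. Below is a minimal sketch under that standard formulation, with depth-1 regression stumps as base learners and a learning rate; the stump learner and all names are assumptions, not necessarily the exact scheme used in the lecture:

```python
import numpy as np

def fit_stump(x, r):
    """Depth-1 regression tree fit to residuals r by minimizing the squared loss."""
    best = None
    for t in np.unique(x):
        left, right = r[x <= t], r[x > t]
        if len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, vl, vr = best
    return lambda q: np.where(q <= t, vl, vr)

def gradient_boost(x, y, n_estimators=50, lr=0.1):
    """Boosted ensemble: f_hat(q) = mean(y) + lr * sum_i f_hat_i(q), squared loss."""
    pred = np.full(len(y), y.mean())
    stumps = []
    for _ in range(n_estimators):
        residual = y - pred                  # negative gradient of the squared loss
        h = fit_stump(x, residual)
        stumps.append(h)
        pred = pred + lr * h(x)
    return lambda q: y.mean() + lr * sum(h(q) for h in stumps)
```

Increasing n_estimators corresponds to the "twice as many functions" comparison on the next slides.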

43 Gradient Boosting. Gradient boosting using squared loss. Typical example of a dataset with imbalanced data: the data was hand-drawn, so there are more datapoints in regions where the hand slowed down (regions with a high density of datapoints). As a result, the fit is very good in these regions and less good in the other regions.

44 Gradient Boosting. Gradient boosting using squared loss and using twice as many functions. Better results, i.e. a smoother fit, are obtained when increasing the number of functions used for the fit.

45 Summary. We have seen a few different techniques to perform non-linear regression in machine learning. The techniques differ in their algorithm and in the number of hyperparameters. Some techniques (GPR, RVR) provide a measure of the uncertainty of the model, which can be used to determine when inference is trustworthy. Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few models). Other techniques (GPR) are meant to provide a very accurate estimate of the data, at the cost of retaining all datapoints for retrieval.