Managing Uncertainty

Bayesian Linear Regression and Kalman Filter

December 4, 2017

Objectives

The goals of this lab are threefold:

1. First, it is a reminder of some central notions of Bayesian machine learning in the specific context of linear regression: Bayesian inference, MLE and MAP estimators, conjugate priors, the prior as a regularization factor, etc.

2. Second, it shows how classical Ordinary Least Squares (OLS) and ridge regression (with L2 regularization) can be interpreted as special cases of a more general Bayesian linear regression. In other words, it shows where the frequentist and Bayesian approaches to machine learning intersect.

3. Third, it introduces Recursive Least Squares, an original application of the Kalman filter that fits the parameters of a linear model in an online manner.

Notations

In the following, a matrix is denoted X, a column vector x and a scalar x. I is the identity matrix and X^T is the transpose of the matrix X. Given a vector c, c^(j) denotes the j-th component of c.

1 Ordinary Least Squares

Given a dataset D of samples (x_i, y_i), i = 1, ..., n (with x_i ∈ R^m and y_i ∈ R), regression consists in learning a model that predicts a value of y for a new x. Values of x are called inputs and values of y outputs. A parametric approach to regression consists in learning from the samples the parameters θ of a parameterized predictive function f_θ, so that y ≈ f_θ(x). A linear model assumes that the output is a linear combination of the input variables:

    f_θ(x) = Σ_{j=1}^{m} θ^(j) x^(j) = x^T θ                                    (1)

Ordinary Least Squares (OLS) consists in finding the parameters θ_OLS that minimize the empirical risk J_2(θ) for the quadratic loss:

    θ_OLS = argmin_θ J_2(θ)   with   J_2(θ) = (1/n) Σ_{i=1}^{n} (y_i − x_i^T θ)²   (2)
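For reference, the empirical risk of equation (2) takes only a couple of lines of NumPy. The sketch below is purely illustrative and written independently of linear.py (the function and variable names are not those of the provided script):

import numpy as np

def empirical_risk(theta, X, y):
    # Illustrative sketch, not part of linear.py.
    # J_2(theta) of equation (2): X is the (n, m) input matrix,
    # y the n-vector of outputs, theta the m-vector of parameters.
    residuals = y - X @ theta          # y_i - x_i^T theta for every sample
    return np.mean(residuals ** 2)     # average squared residual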

This can be rewritten in matrix form by introducing y, the n-vector containing the outputs y_i, and X, the n × m matrix whose i-th row is the input x_i^T:

    θ_OLS = argmin_θ J_2(θ)   with   J_2(θ) = (1/n) ‖y − Xθ‖²                   (3)

Question 1.1 Compute the gradient of J_2(θ). Deduce that:

    θ_OLS = (X^T X)^{−1} X^T y                                                  (4)

The matrix (X^T X)^{−1} X^T is called the pseudoinverse of X.

In order to learn and evaluate linear regressors, one considers an affine model in the plane R² and artificially generates n samples from that model, i.e. such that:

    y = θ_0 x^(1) + θ_1 x^(2) + θ_2 + ε = x^T θ   with   x^T = [x^(1), x^(2), 1]   (5)

The parameters will be fixed to some arbitrary values, for instance θ^T = [1, 2, 3]. The white noise ε is sampled from the distribution N(0, σ_ε²) and the (n, 2) input matrix [x_i^(1), x_i^(2)] is drawn uniformly from the unit square.

In the following, one will use Python 3 and the matrix operators of NumPy. Most functions are already provided in the script linear.py. Some NumPy functions that might be useful are recalled in the appendix. To run this script, you need to pass command-line arguments specifying which test function you want to call and the values of its main arguments. For instance, typing in a terminal

    python3 linear.py --test 1 --n 50 --sigma 0.1

calls test function number 1, i.e. the function testsimpleols(50, 0.1) (see the bottom of linear.py), which computes θ_OLS from 50 random samples drawn in the unit square with σ_ε = 0.1.

Question 1.2 Look at the provided source code linear.py and see what the function testregressor does. Implement the OLS estimator in the function computeols using NumPy and evaluate it by coding test function number 1, testsimpleols (for instance by calling testregressor).

One now wants to compare the evaluated risks of the OLS estimator for various numbers of samples n and noise levels σ_ε.

Question 1.3 Compare the empirical risk J_e obtained by OLS on the training dataset with the empirical risk J_r obtained on a test set. Plot both risks as functions of σ_ε and n. To do this, complete test function number 2, evaluatesimpleols, by calling the function evaluateregressor (see the documentation in the source code).

Question 1.4 OLS is undefined when X^T X is singular. When does this happen?

Question 1.5 Evaluate the OLS estimator when this case is almost reached (one then says that X^T X is ill-conditioned) by completing test function number 3, using the function drawinputsalmostaligned instead of drawinputsinsquare. To this end, use the functions testregressor and evaluateregressor to compare with the case where inputs are drawn uniformly in the unit square. To choose a suitable value for the alignment factor passed to drawinputsalmostaligned, one can check graphically how strongly the points are aligned.
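For concreteness, here is a minimal, self-contained sketch of the experimental setup described above: it generates n samples from the affine model (5) with inputs drawn in the unit square, then computes the closed-form OLS estimate of equation (4). It is written independently of linear.py, so the variable names are illustrative only:

import numpy as np

# Illustrative sketch, not part of linear.py.
rng = np.random.default_rng(0)
n, sigma_eps = 50, 0.1
theta_true = np.array([1.0, 2.0, 3.0])

# Inputs drawn uniformly in the unit square, with a constant 1 appended
# so that the last parameter plays the role of the intercept (equation 5).
X = np.hstack([rng.uniform(0.0, 1.0, size=(n, 2)), np.ones((n, 1))])
y = X @ theta_true + rng.normal(0.0, sigma_eps, size=n)

# Closed-form OLS estimate of equation (4); solving the normal equations
# with np.linalg.solve is numerically safer than inverting X^T X explicitly.
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_ols)   # close to [1, 2, 3] when the noise is small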

2 Regularized Linear Regression

2.1 Maximum Likelihood Estimation and OLS

OLS can be interpreted as a Bayesian estimation problem. To this end, a discriminative model must be defined, i.e. the distribution P(Y | X) of Y given X = [X_1, ..., X_m]^T.

Question 2.6 Choose the most natural model for P(Y | X) and specify its parameters.

Recall that the MLE estimator chooses the parameter values that maximize the likelihood:

    θ_MLE = argmax_θ P(D | θ)                                                   (6)

Let us first assume that the variance of the additive white noise is known.

Question 2.7 Give the expression of the log-likelihood. Deduce that θ_MLE = θ_OLS.

2.2 MAP estimator and ridge regression

MLE implicitly assumes a uniform prior on θ. We now relax this assumption and look for the MAP estimator θ_MAP that maximizes the posterior distribution P(θ | X, y):

    θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)                          (7)

One assumes that the prior is a multivariate normal distribution N(µ_0, Σ_0).

Question 2.8 Show that θ_MAP minimizes a regularized quadratic risk J_r(θ). Give its expression.

Question 2.9 Find the expression of the gradient of J_r(θ) and give the expression of θ_MAP.

Standard ridge regression is an L2-regularized version of OLS:

    θ_Ridge = argmin_θ J_Ridge(θ)   with   J_Ridge(θ) = J_2(θ) + λ ‖θ‖²          (8)

Question 2.10 Show that ridge regression can be interpreted as a special case of the MAP estimator for particular values of µ_0 and Σ_0. Interpret the expression of λ as a balance between confidence in the prior and in the observations.

Question 2.11 Implement the ridge estimator (or equivalently the MAP estimator) by completing the functions computeridge and testridgeregression (in a similar way as for Question 1.2). Compare its performance with that of OLS on almost-aligned inputs for various values of λ, using the function evaluateregressor.
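As a pointer to the kind of closed-form solution that Questions 2.9 and 2.11 lead to, the sketch below computes the ridge estimate under the convention of equations (2) and (8), in which J_2 carries a 1/n factor (so λ is scaled by n in the normal equations). It is independent of linear.py and the function name is illustrative:

import numpy as np

def ridge_estimate(X, y, lam):
    # Illustrative sketch, not the computeridge function of linear.py.
    # Minimizer of J_Ridge(theta) = (1/n) ||y - X theta||^2 + lam ||theta||^2.
    # Setting the gradient to zero gives (X^T X + n lam I) theta = X^T y,
    # which has a unique solution for lam > 0 even when X^T X is singular.
    n, m = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(m), X.T @ y)

For λ → 0 this reduces to the OLS estimate of equation (4), while larger λ pulls the estimate toward the zero prior mean, which is what makes it usable on the almost-aligned inputs of Question 1.5.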

3 Strong Bayesian Approach and Recursive Least Squares

We would now like to regress the model online, i.e. on the fly, every time a new sample is received. This is possible with a "strong" Bayesian approach in which the parameters θ are described by a distribution P(θ) that is updated each time a new observation o is received, i.e. by replacing the prior P(θ) with the posterior distribution P(θ | o).

3.1 Recursive Least Squares

Assuming the prior P(θ) is a multivariate normal distribution, the online estimation of P(θ | D) can be interpreted as a particular Kalman filter. Indeed, the output y can be seen as an observation that is linear in the parameters θ to be estimated.

Question 3.12 Specify the Kalman filter that estimates θ from a stream of data (i.e. a temporal sequence of input/output pairs (x_t, y_t)). Is it a stationary filter?

This online method is called Recursive Least Squares.

Question 3.13 Implement it in Python by completing the constructor __init__ and the method update of the provided class Filter. To test it, call the function testrecursiveleastsquare. At each iteration, this function updates the filter with a new sample (x, y) and saves the current state of the filter and the diagonal coefficients of the covariance matrix P in two separate matrices. Once all the data have been processed, these matrices are passed as arguments to the function plotfilterstate, which displays the convergence of the state (for each parameter θ_0, θ_1 and θ_2, it draws the estimated value θ̂ within the confidence interval [θ̂ − 2σ, θ̂ + 2σ]).

Question 3.14 Observe how the initial covariance of the filter state influences the convergence. Compare the final estimate of the filter with the OLS estimate computed on the set of samples used to update the filter. Explain.

3.2 Kalman Filter

In the previous Kalman filter, the parameters θ are assumed to be constant. Let us now assume that every parameter has a slow random drift over time. This drift is modelled as a random walk of given standard deviation σ_c per sample (i.e. every time a new sample is generated, every coefficient of θ is first perturbed by an additive normal noise N(0, σ_c²)). Suppose one wants to track the parameters of our model as observations arrive.

Question 3.15 Modify your filter to take the drift into account (i.e. implement the integrate method of the filter). Then observe the behaviour of your filter for θ = [1, 2, 3], σ_c = 0.5, σ_ε = 0.1 (using the function testkalman, which itself uses the function drawoutputwithdrift to generate data with a drift). Compare the final estimate of the parameters with the estimate provided by OLS and with the true parameter values [1, 2, 3].
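As a rough guide to what the update and integrate steps compute, here is a minimal sketch of a recursive least-squares / Kalman update for this model. It is written independently of the provided Filter class, so the class name and interface below are assumptions rather than those of linear.py. The state is the Gaussian belief (mean, covariance) over θ; update conditions it on a new sample (x, y), and integrate inflates the covariance to account for the random-walk drift of Section 3.2:

import numpy as np

class RLSSketch:
    # Illustrative sketch, not the Filter class of linear.py.
    # Gaussian belief N(mean, P) over the parameters theta.

    def __init__(self, m, prior_var=100.0, sigma_eps=0.1):
        self.mean = np.zeros(m)            # prior mean of theta
        self.P = prior_var * np.eye(m)     # prior covariance of theta
        self.R = sigma_eps ** 2            # observation noise variance

    def integrate(self, sigma_c):
        # Random-walk drift theta_t = theta_{t-1} + N(0, sigma_c^2 I):
        # the covariance grows before each measurement update.
        self.P = self.P + (sigma_c ** 2) * np.eye(len(self.mean))

    def update(self, x, y):
        # Observation model y = x^T theta + eps with eps ~ N(0, R).
        S = x @ self.P @ x + self.R            # innovation variance (scalar)
        K = self.P @ x / S                     # Kalman gain
        self.mean = self.mean + K * (y - x @ self.mean)
        self.P = self.P - np.outer(K, x) @ self.P

With σ_c = 0 (no integrate step) this is plain recursive least squares and, for a vague prior, its final mean is close to the batch OLS estimate (compare with Question 3.14); with σ_c > 0 older samples are progressively forgotten, which lets the filter track the drifting parameters of Question 3.15.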

3.3 Appendix

Here are some Python commands that can be useful:

import numpy as np

A = np.arange(0.1, 1, 0.2)     # Create a numpy array [0.1, 0.3, 0.5, 0.7, 0.9]
A /= 2                         # Divide the coefficients of A by 2
B = np.zeros((2, 4))           # Create a null matrix of size (2, 4)
print(B.T)                     # Print the transpose of B
C = np.eye(4)                  # Create a 2D array equal to the identity matrix of size 4 x 4
C * C                          # Be careful: on arrays, * is the multiplication component by component
D = np.asmatrix(C)             # Interpret C as a matrix
B * D                          # Works now as the standard matrix multiplication
E = np.array([1, 2, 3, 4])     # Create a 1D array
F = np.matrix(E)               # Create a row matrix
G = F.T                        # Create a column matrix
H = np.reshape(np.asarray(G), 4)   # Create a 1D array from a matrix
print(D.I)                     # Print the inverse of the matrix D
print(B.shape)                 # B.shape gives the size of B as the pair (2, 4)
for k in range(10):
    print(k)                   # Loop over k from 0 to 9
for v in A:
    print(v)                   # Iterate over the elements of A
for i, v in enumerate(A):
    print(str(i) + " -> " + str(v))   # Same thing, but with the element index i as well