Introduction to the regression problem. Luca Martino



2 Approximate outline of the course
1. Very basic introduction to regression
2. Gaussian Processes (GPs) and Relevance Vector Machines (RVMs); connections with kernel smoothing and splines
3. Monte Carlo methods for efficient learning

3 Course: evaluation - mark
Deliver two reports:
First report: deliver the .tex and .pdf files (no Word) of your notes on the theory classes (5 points). If you also provide good handwritten notes, you can reach at most 7 points.
Second report: deliver the .tex and .pdf files, and upload to RG (mandatory; after my revision but without my name), of a work related to the topic of the course (5 points).

4 Course: evaluation - mark
For the second report, you can study a paper (if possible regarding biomedical engineering) or analyze data that you like. We can discuss everything. If you have any idea, discuss it with me.

5 Rooms and Schedule 05-11, 06-11, and ===> 42E , 26-11, 03-12, and ===> INF 4SD , 04-12, and ===> INF 4SD ===> INF 11G03 DUAL ===> 42E03 (tutorial) and ===> also tutorials for sure

6 Inference problem
Suppose we observe a set of scalars $y_1, \dots, y_M$ (observations/measurements) related to some quantity that we want to infer. We want to summarize this information (e.g., point estimation, dispersion, etc.).
Estimator: $\hat f = \frac{1}{M} \sum_{m=1}^{M} y_m$.
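As a minimal sketch of the estimator above, the following Python snippet computes the sample mean as a point estimate and the sample standard deviation as one possible dispersion summary. The measurement values are made up for illustration, and the choice of the standard deviation as the dispersion measure is an assumption, not something fixed by the slides.

```python
import numpy as np

# Hypothetical measurements y_1, ..., y_M of the same quantity (made-up values).
y = np.array([2.1, 1.9, 2.4, 2.0, 1.6])
M = y.size

f_hat = y.sum() / M           # point estimate: (1/M) * sum_m y_m
dispersion = y.std(ddof=1)    # sample standard deviation (one possible dispersion summary)

print(f_hat, dispersion)
```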

7 Inference problem (2)
If $M = 1$ (we have only $y_1$) and no additional information is provided, the sample mean reduces to $\hat f = y_1$.

8 Inference problem (3)
Now suppose that the observations/measurements depend on a parameter $x$: $x \Longrightarrow y_1, \dots, y_M$.
[Figure: observations $y_1, \dots, y_6$ at a single input $x$, together with the estimate $\hat f$.]

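9 Different inference problems
In general, we can have
$x_1 \Longrightarrow y_{1,1}, \dots, y_{1,M_1}$,
...
$x_n \Longrightarrow y_{n,1}, \dots, y_{n,M_n}$,
...
$x_N \Longrightarrow y_{N,1}, \dots, y_{N,M_N}$,
that is, $x_n \Longrightarrow \{y_{n,m}\}$ with $n = 1, \dots, N$ and $m = 1, \dots, M_n$.
[Figure: groups of observations $y_{1,m}, \dots, y_{4,m}$ plotted against the inputs $x_1, \dots, x_4$.]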

10 Different inference problems (2)
We can solve the different inference problems independently...
[Figure: $y$ versus $x$, with the inputs $x_1, \dots, x_4$ on the horizontal axis.]

11 Filtering - Smoothing - Denoising
Key point: take into account the information at nearby inputs, solving a joint inference problem (in this case we speak of filtering, smoothing or denoising).
[Figure: $y$ versus $x$, with the inputs $x_1, \dots, x_4$ on the horizontal axis.]
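To illustrate why using nearby information can help, here is a small Python sketch that compares solving each problem independently (one observation per input, used as is) with a simple moving average over neighbouring inputs. The moving average is only a stand-in for a joint solution, not the method developed in the course; the true function, noise level and window length are assumptions chosen for the toy example.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Hypothetical smooth "true" function, used only to generate toy data.
    return np.sin(2.0 * np.pi * x)

N = 200
x = np.linspace(0.0, 1.0, N)
sigma = 0.3                                   # assumed noise standard deviation
y = f(x) + rng.normal(0.0, sigma, size=N)     # one noisy observation per input

# Independent solution: each problem uses only its own observation.
f_independent = y

# Joint solution (sketch): average each observation with its neighbours.
window = np.ones(9)
f_smooth = (np.convolve(y, window, mode="same")
            / np.convolve(np.ones(N), window, mode="same"))  # renormalise near the edges

mse_independent = np.mean((f_independent - f(x)) ** 2)
mse_smooth = np.mean((f_smooth - f(x)) ** 2)
print(mse_independent, mse_smooth)
```

In this toy setting the smoothed estimate should have a clearly smaller error, because averaging neighbours reduces the noise while the underlying function changes little inside the window.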

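12 Extrapolation - Prediction
...or predict the output $y$ at another input $x \notin \{x_n\}_{n=1}^N$... If $x \in \{x_n\}_{n=1}^N$, we come back to the filtering-smoothing case. Varying $x$, we have $\hat y = \hat f(x)$, where $\hat f(x)$ is an estimator that is a function of $x$.
[Figure: the curve $\hat f(x)$ in the $(x, y)$ plane, with the inputs $x_1, \dots, x_4$ and a new input $x$ on the horizontal axis.]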

13 Regression
Regression = Filtering + Prediction (there are methods only for filtering-smoothing and others only for prediction; we address both problems jointly).
$\hat f(x)$: regressor, regression function, smoothing function, predictive function, interpolator...
[Figure: the curve $\hat f(x)$ in the $(x, y)$ plane, with the inputs $x_1, \dots, x_4$ and a new input $x$ on the horizontal axis.]

14 Regression (2)
Let us denote the true unknown function by $f(x)$. We consider in general that
$y = f(x) + \epsilon$, (1)
where $E[\epsilon] = 0$. For a given $x$, $y$ is then a perturbed version of $f(x)$. Note that $E[y] = f(x)$. We want $\hat f(x) \approx f(x)$.
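The observation model (1) can be simulated directly. The sketch below draws many noisy observations at one fixed input and checks numerically that their average is close to $f(x)$. The specific function, the Gaussian noise and its standard deviation are assumptions made only for this example; the model itself only requires zero-mean noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

x0 = 0.3                # a fixed input
sigma = 0.5             # assumed noise standard deviation
eps = rng.normal(0.0, sigma, size=200_000)   # zero-mean noise samples
y = f(x0) + eps         # many perturbed versions of f(x0)

# The empirical mean of y should be close to f(x0), since E[y] = f(x).
print(y.mean(), f(x0))
```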

15 The famous special case $M_n = 1$ for all $n$
If we consider all the inference problems jointly, we can also have a single measurement/observation for each problem, and the results can still be non-trivial. In general, the best solution does not pass through the points: it depends on how much we trust the data.
[Figure: the curve $\hat f(x)$ and one observation at each input $x_1, \dots, x_4$.]

16 Interpolation as special case
In some applications, we trust the data completely: interpolation.
[Figure: the curve $\hat f(x)$ passing through the observations at the inputs $x_1, \dots, x_4$.]

17 Trade-off Fitting/Generalization
How much can you trust your data? Generally, we have a trade-off:
Overfitting: if we fit the observed data very well, the prediction error on new data is generally high (bad generalization). We obtain more complex functions/models $\hat f(x)$ (more oscillations).
Underfitting: if we penalize complexity too much, we obtain simpler functions/models $\hat f(x)$ that could generalize well, but the solution barely takes the observed data into account.
This is related to the so-called bias-variance trade-off.

18 Model Selection: Underfitting/Overfitting
It is related to the model selection problem (choosing the complexity of your model).
Underfitting occurs when the estimator $\hat f(x)$ is not flexible enough to capture the underlying trends in the observed data (model too simple).
Overfitting occurs when the estimator $\hat f(x)$ is too flexible, allowing it to capture illusory trends in the data (model too complex). These illusory trends are often the result of the noise in the observations $y_n$.
Non-parametric models try to adjust the complexity automatically.
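A toy experiment can make the two regimes concrete. The sketch below fits polynomials of increasing degree to ten noisy points and reports the error on the training data and on fresh data. Polynomial least squares is used here only as a convenient family with adjustable complexity; the true function, noise level and degrees are assumptions for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

def f(x):
    # Hypothetical "true" function for the toy experiment.
    return np.sin(2.0 * np.pi * x)

sigma = 0.2                                   # assumed noise standard deviation
x_train = np.linspace(0.0, 1.0, 10)
y_train = f(x_train) + rng.normal(0.0, sigma, size=x_train.size)
x_test = np.linspace(0.0, 1.0, 500)
y_test = f(x_test) + rng.normal(0.0, sigma, size=x_test.size)

train_mse, test_mse = {}, {}
for degree in (0, 3, 9):                      # too simple, moderate, interpolating
    f_hat = Polynomial.fit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = np.mean((y_train - f_hat(x_train)) ** 2)
    test_mse[degree] = np.mean((y_test - f_hat(x_test)) ** 2)
    print(degree, train_mse[degree], test_mse[degree])
```

The degree-0 model ignores the trend (underfitting), while the degree-9 model passes through all ten training points, so its training error is essentially zero and says nothing about its error on new data; that error is typically much larger.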

19 bias-variance trade-off
Suppose that we know the true (normally unknown) function $f(x)$, and that we can apply the regression model $\hat f(x)$ to different sets of observations $\{y_n\}_{n=1}^N$ (for simplicity, keeping the inputs $x_n$ fixed). We can then average and approximately compute the bias and the variance of the estimator $\hat f(x)$:
Bias: $f(x) - E[\hat f(x)]$
Variance: $E[(\hat f(x) - E[\hat f(x)])^2]$
All the integrals are with respect to $y$, using the pdf $p(y \mid x)$; note that the complete notation should be $p(y \mid f, x)$ or $p(y \mid f(x))$.

20 bias-variance trade-off (2)
All the integrals are with respect to $y$, using the pdf $p(y \mid x)$; note that the complete notation should be $\hat f(x \mid y)$, and $p(y \mid f, x)$ or $p(y \mid f(x))$:
$E[\hat f(x \mid y)] = \int \hat f(x \mid y) \, p(y \mid f, x) \, dy$
Bias: $f(x) - E[\hat f(x \mid y)]$
Variance: $E[(\hat f(x \mid y) - E[\hat f(x \mid y)])^2] = \int (\hat f(x \mid y) - E[\hat f(x \mid y)])^2 \, p(y \mid f, x) \, dy$
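The expectations above can be approximated by Monte Carlo: draw many datasets with the inputs kept fixed, refit the estimator on each, and average. The sketch below does this for a straight-line least-squares fit evaluated at one input. The linear model, the true function, the noise level and the evaluation point `x0` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Hypothetical true function (known here because the data are simulated).
    return np.sin(2.0 * np.pi * x)

N, R = 20, 5000                 # N fixed inputs, R independent datasets
sigma = 0.3                     # assumed noise standard deviation
degree = 1                      # a deliberately simple model
x = np.linspace(0.0, 1.0, N)
x0 = 0.25                       # input where bias and variance are evaluated

# Each column of Y is one dataset {y_n} drawn from p(y | f, x).
Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(N, R))

A = np.vander(x, degree + 1)                      # design matrix [x, 1]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)      # one least-squares fit per dataset
f_hat_x0 = (np.vander([x0], degree + 1) @ coef).ravel()   # f_hat(x0) for each dataset

bias = f(x0) - f_hat_x0.mean()      # f(x) - E[f_hat(x)]
variance = f_hat_x0.var()           # E[(f_hat(x) - E[f_hat(x)])^2]
print(bias, variance)
```

With this simple model the bias at `x0` should be large compared with the standard deviation of the estimator, which is the underfitting pattern.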

21 bias-variance trade-off (3)
Underfitting: high bias, small variance.
Overfitting: small bias, high variance.
Best model: a compromise between bias and variance, with the smallest Mean Square Error (MSE) in prediction (see later).
(Hereafter, we consider $x \in \mathbb{R}^D$.)

23 Overfitting: small bias, high variance
Recall that $E[y] = f(x)$. Let us fix an input $x$. Consider the interpolation case (maximum overfitting) $\Longrightarrow \hat f(x) = y$. Then $E[\hat f(x)] = E[y]$, so that $E[\hat f(x)] = f(x) \Longrightarrow$ bias $= 0$. Also, $\mathrm{var}[\hat f(x)] = \mathrm{var}[y] = \mathrm{var}[\epsilon]$ (we are including all the noise in our estimation).

24 Underfitting: high bias, small variance
Let us consider an extreme underfitting case: for a given $x$, $\hat f(x) = c$, independent of the observation $y$. Then $E[\hat f(x)] = c$ and the bias is $f(x) - c$, but $\mathrm{var}[\hat f(x)] = 0$.
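The two extreme cases (interpolation and a constant estimator) can be checked numerically at a single fixed input. In this sketch the value of the true function, the constant `c`, and the Gaussian noise are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

f_x = 1.5           # hypothetical value of the true function at the fixed input x
sigma = 0.4         # assumed noise standard deviation
R = 200_000         # number of simulated observations
y = f_x + rng.normal(0.0, sigma, size=R)

# Interpolation (maximum overfitting): f_hat(x) = y.
f_hat_interp = y
bias_interp = f_x - f_hat_interp.mean()     # close to 0
var_interp = f_hat_interp.var()             # close to var[eps] = sigma^2

# Constant estimator (extreme underfitting): f_hat(x) = c, whatever y is.
c = 0.0
f_hat_const = np.full(R, c)
bias_const = f_x - f_hat_const.mean()       # equals f(x) - c
var_const = f_hat_const.var()               # exactly 0

print(bias_interp, var_interp, bias_const, var_const)
```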

25 Generic Theoretical Approach
Let us consider the random variables $(Y, X)$ with joint pdf $p(y, x)$. Recall that if we know the joint pdf $p(y, x)$, then the two conditionals $p(y \mid x)$, $p(x \mid y)$ and the two marginals $p(y)$, $p(x)$ are determined. Let us consider the expected squared loss:
$L[f] = \iint (y - f(x))^2 \, p(y, x) \, dy \, dx$. (2)
We want to obtain
$\hat f(x) = \arg\min_{f(x)} L[f(x)]$. (3)

26 Generic optimal solution
Now consider $x$ fixed (to simplify, without loss of generality), differentiate $L[f]$ with respect to $f$ and find the stationary points ($\frac{dL[f]}{df} = 0$):
$\frac{dL[f]}{df} = -2 \int (y - f(x)) \, p(y, x) \, dy = 0$, (4)
$\Longrightarrow \int y \, p(y, x) \, dy - \int f(x) \, p(y, x) \, dy = 0$, (5)
$\Longrightarrow \int y \, p(y, x) \, dy - f(x) \, p(x) = 0$. (6)
Solving with respect to $f$, we get
$\hat f(x) = \frac{\int y \, p(y, x) \, dy}{p(x)} = \int y \, p(y \mid x) \, dy = E[Y \mid X = x]$. (7)
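The result that the conditional mean minimizes the expected squared loss can be checked on a small discrete example, where the integrals become sums. The joint probability table below is hypothetical, and the random perturbations are just one simple way to probe other candidate functions.

```python
import numpy as np

# Hypothetical joint pmf p(y, x): rows index y values, columns index x values.
y_vals = np.array([0.0, 1.0, 2.0])
p_yx = np.array([[0.10, 0.05],
                 [0.20, 0.15],
                 [0.10, 0.40]])               # entries sum to 1

p_x = p_yx.sum(axis=0)                                     # marginal p(x)
cond_mean = (y_vals[:, None] * p_yx).sum(axis=0) / p_x     # E[Y | X = x] for each x

def expected_loss(f):
    # L[f] = sum_x sum_y (y - f(x))^2 p(y, x), with f given as one value per x.
    return ((y_vals[:, None] - f[None, :]) ** 2 * p_yx).sum()

best = expected_loss(cond_mean)

# Randomly perturbed candidates should never achieve a smaller loss.
rng = np.random.default_rng(0)
others = [expected_loss(cond_mean + rng.normal(0.0, 0.5, size=2)) for _ in range(1000)]
print(cond_mean, best, min(others))
```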

27 Decomposition of MSE in prediction
Consider a new pair $(x, y)$:
$\mathrm{MSE} = E[(y - \hat f(x))^2] = E[y^2] + E[\hat f(x)^2] - 2E[y \hat f(x)]$.
Then we use that
$E[z^2] = E[(z - E[z])^2] + E[z]^2$ (recall the variance expression),
$E[wz] = E[w]E[z]$ (if $w$ and $z$ are independent),
$E[y] = f(x)$ (think of the model $y = f(x) + \epsilon$, where $E[\epsilon] = 0$).
Writing $f = f(x)$ and $\hat f = \hat f(x)$:
$\mathrm{MSE} = E[(y - E[y])^2] + E[y]^2 + E[(\hat f - E[\hat f])^2] + E[\hat f]^2 - 2fE[\hat f]$
$= E[(y - f)^2] + f^2 + E[(\hat f - E[\hat f])^2] + E[\hat f]^2 - 2fE[\hat f]$
$= E[(y - f)^2] + E[(\hat f - E[\hat f])^2] + (f^2 + E[\hat f]^2 - 2fE[\hat f])$
$= E[(y - f)^2] + E[(\hat f - E[\hat f])^2] + (f - E[\hat f])^2$.

28 Decomposition of MSE in prediction (2)
Then
$\mathrm{MSE} = E[(y - f(x))^2] + E[(\hat f(x) - E[\hat f(x)])^2] + (f(x) - E[\hat f(x)])^2$
= variance of the observation noise + variance of $\hat f$ + (bias of $\hat f$)$^2$.
MSE in prediction = estimator variance + squared estimator bias + power/variance of the observation noise.
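The decomposition can be verified by simulation. The sketch below refits a straight line on many datasets with fixed inputs, draws an independent new observation at a test input for each fit, and compares the empirical prediction MSE with the sum of the three terms. The linear estimator, the true function, the noise level and the test input are assumptions for this example; the two printed numbers should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    # Hypothetical true function (known because the data are simulated).
    return np.sin(2.0 * np.pi * x)

N, R = 20, 100_000              # N fixed inputs, R independent datasets
sigma = 0.3                     # assumed noise standard deviation
x = np.linspace(0.0, 1.0, N)
x0 = 0.25                       # input of the new pair (x, y)

# Training data: one dataset per column; estimator: straight-line least squares.
Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(N, R))
A = np.vander(x, 2)
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
f_hat = (np.vander([x0], 2) @ coef).ravel()     # f_hat(x0) for each dataset

# New observation at x0, independent of the training data.
y_new = f(x0) + rng.normal(0.0, sigma, size=R)

mse = np.mean((y_new - f_hat) ** 2)             # MSE in prediction
noise = sigma ** 2                              # variance of the observation noise
variance = f_hat.var()                          # variance of the estimator
bias_sq = (f(x0) - f_hat.mean()) ** 2           # squared bias of the estimator

print(mse, noise + variance + bias_sq)
```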

29 In this course...
We will consider $x \in \mathbb{R}^D$, multidimensional and continuous (in Kalman filtering/particle filtering, $x = t \in \mathbb{N}$ is discrete and unidimensional; BE CAREFUL WITH THE NOTATION!).
At the beginning, we will consider $y \in \mathbb{R}$, unidimensional and continuous. Then we will consider multidimensional extensions (multi-output regression).

30 Role of x and y in regression problems
$y_n$ - outputs: data, measurements, observations (that we want to filter or predict).
$x_n$ - inputs: they play a different role (sometimes they can be considered data, sometimes they play the role of parameters, as in the filtering of time series, where $x = t$).
In the problems that we tackle, the $x$'s are always given, i.e., they are always on the right side of the conditional density expressions $p(\cdot \mid x, \dots)$.
In classification, the $y_n$ are labels and the $x_n$ are data.

31 Questions? THANKS!
