Figure 1: Visualising the input features. Figure 2: Visualising the input-output data pairs.


Dina Potter
5 years ago

Regression

1. Data visualisation

(a) To plot the data in x_trn and x_tst, we can use the MATLAB function scatter. For example, the code for plotting x_trn is

    scatter(x_trn(:,1), x_trn(:,2));

Figure 1: Visualising the input features. Panels: (a) x_trn, (b) x_tst.

The ranges of the values of both features $x_1$ and $x_2$ are the same in x_trn and x_tst. A gap . < $x_1$ < . exists in x_trn but not in x_tst.

(b) To plot the input-output data in 3-d, we can use the function scatter3. For example, the code for plotting the training data pair x_trn-y_trn is

    scatter3(x_trn(:,1), x_trn(:,2), y_trn);

Figure 2: Visualising the input-output data pairs. Panels: (a) x_trn-y_trn pair, (b) x_tst-y_tst pair.

The data form a curved surface with a single peak around (, .). This indicates that it would be rather difficult to capture the x-y relation with a simple linear regression model.

2. Linear regression

(a) The maximum likelihood estimate (MLE) for the linear regression model is given (in matrix form) by

$$ w_{mle} = (X^T X)^{-1} X^T y_{trn}, \tag{1} $$

where $X$ is the design matrix

$$ X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ \vdots & \vdots & \vdots \\ 1 & x_{N1} & x_{N2} \end{pmatrix}. \tag{2} $$

The code that implements eq. (1) is (MATLAB supports matrix manipulation nicely):

    X = [ones(num_trn,1) x_trn];    % construct design matrix
    w_mle = (X'*X)\X'*y_trn;        % MLE

The maximum likelihood estimator $\hat{w} = (\hat{w}_0, \hat{w}_1, \hat{w}_2)$ is

$$ \hat{w}_0 = .8, \quad \hat{w}_1 = .6, \quad \hat{w}_2 = .6. \tag{3} $$

(b) The mean squared error (MSE) is defined by

$$ \mathrm{MSE} = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2, \tag{4} $$

where $\hat{y} = \hat{f}(x) = \hat{w}_0 + \hat{w}_1 x_1 + \hat{w}_2 x_2$ is the estimated value given by the model. The code for calculating the MSE on both the training set and the test set is

    mle_trn_mse = mean((y_trn - X * w_mle).^2);     % MSE on training set
    y_mle_tst = [ones(num_tst,1) x_tst] * w_mle;    % predicted value of y on the test set
    mle_tst_mse = mean((y_tst - y_mle_tst).^2);     % MSE on test set

and the result is

    mle_trn_mse = .896,  mle_tst_mse = [...].    (5)

To estimate and evaluate the dumb model $y = w_0$, we only need to estimate $w_0$ using the mean of the data y_trn, and calculate the MSEs:

    y_dumb = mean(y_trn);                        % estimate dumb model
    dumb_trn_mse = mean((y_trn - y_dumb).^2);    % MSE on training set
    dumb_tst_mse = mean((y_tst - y_dumb).^2);    % MSE on test set

The result is

    dumb_trn_mse = .9,  dumb_tst_mse = .6.    (6)

[You were not asked to comment on the results in the question, but it is interesting to do so. Note that mle_trn_mse is smaller than mle_tst_mse. This is unlikely to be due to overfitting, given that there are only 3 parameters to fit; more likely it is due to the different x-data distribution between the training and test sets. Also, as one would expect, the dumb predictor has higher training and test error than the more complex linear model; the dumb model is obtained by setting $w_1 = w_2 = 0$ in the linear regression model.]

(c) The maximum likelihood estimator for the noise variance $\sigma_\eta^2$ is

$$ \hat{\sigma}_\eta^2 = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2. \tag{7} $$

This is identical to the MSE on the training set. Therefore, we have

$$ \hat{\sigma}_\eta^2 = \text{mle\_trn\_mse} = .896. \tag{8} $$

This estimated noise variance is much larger than the true one ($\sigma_\eta^2$ = .), because the simple linear regression model cannot capture the curved surface of the data well.
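The comparison in this section can be tried end to end on made-up data. The following is a minimal sketch, not the assignment's data or results: the synthetic surface, the sample size and true_noise_std are assumptions chosen only so that the script runs on its own. It fits the linear model and the constant model and computes the ML noise-variance estimate as the training MSE.

```matlab
% Synthetic stand-in data (NOT the assignment data): a curved surface plus noise.
num_trn = 200;                            % assumed sample size
x_trn = rand(num_trn, 2);                 % two input features in [0, 1]
true_noise_std = 0.1;                     % assumed noise level for this demo
y_trn = exp(-8*sum((x_trn - 0.5).^2, 2)) + true_noise_std*randn(num_trn, 1);

% Linear model: design matrix with a column of ones, least-squares fit.
X = [ones(num_trn, 1) x_trn];
w_mle = X \ y_trn;                        % least-squares solution; matches (X'*X)\(X'*y_trn) for full-rank X
mle_trn_mse = mean((y_trn - X*w_mle).^2);

% Constant ("dumb") model: predict the training mean everywhere.
y_dumb = mean(y_trn);
dumb_trn_mse = mean((y_trn - y_dumb).^2);

% The ML noise-variance estimate equals the training MSE of the linear fit.
sigma2_hat = mle_trn_mse;

fprintf('linear MSE %.4f, constant MSE %.4f, true noise var %.4f\n', ...
        mle_trn_mse, dumb_trn_mse, true_noise_std^2);
```

Because the constant model is the linear model with two weights fixed at zero, its training MSE can never be lower than that of the linear fit. Writing X \ y_trn instead of forming X'*X is a design choice that avoids squaring the condition number of the problem.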

3. Bayesian linear regression

(a) The function value at the input point $x^* = (1, 2)$ is

$$ f(x^*) = f^* = w_0 + w_1 + 2 w_2. \tag{9} $$

Since $w_0, w_1, w_2$ are Gaussian random variables, a linear combination of them is also Gaussian. Therefore, we have $f(x^*) \sim \mathcal{N}(\mu_f, \sigma_f^2)$ with mean $\mu_f$ and variance $\sigma_f^2$

$$ \mu_f = \mu_0 + \mu_1 + 2\mu_2 = 0 + 0 + 0 = 0, \tag{10} $$

$$ \sigma_f^2 = \sigma_0^2 + \sigma_1^2 + (2\sigma_2)^2 = 1 + 1 + 4 = 6. \tag{11} $$

(b) To plot the sampled function $f(x|w)$, we first create a grid for plotting, then sample the weights. For each sample we evaluate the function value on the grid and draw a surf plot (without colour for better visualisation).

    % x1_range, x2_range: grid vectors for the two inputs (placeholder names; the numeric ranges are not legible in this copy)
    [gridx1, gridx2] = meshgrid(x1_range, x2_range);      % create grid
    gridy = randn + gridx1*randn + gridx2*randn;          % sample weights and evaluate f(x)
    surf(gridx1, gridx2, gridy, 'facecolor', 'none');     % plot

Figure 3: Function $f(x|w)$ given three different samples of $w \sim \mathcal{N}(0, I_3)$. Panels: (a) sample 1, (b) sample 2, (c) sample 3.

The function always gives a 2-d plane in the 3-d input-output space. Again, this property makes it difficult to use the linear regression model to capture our data. On the grid, the output values range around [, ], which is consistent with the characteristic variance of $f(x|w)$, as its variance at the point (2, 1.5) (the point on the grid with the largest absolute value in both $x_1$ and $x_2$) is

$$ \sigma_f^2 = \sigma_0^2 + (2\sigma_1)^2 + (1.5\sigma_2)^2 = 7.25. \tag{12} $$

4. RBF regression

(a) Drawing RBF functions is similar to drawing linear functions. The extra step we need is to work out the outputs $\phi(x)$ of all RBF bases for an input point $x$. This is done by the function eval_rbf_bases.

    % x1_range, x2_range: grid vectors for a larger plotting grid (placeholder names; the numeric ranges are not legible in this copy)
    [gridx1, gridx2] = meshgrid(x1_range, x2_range);                       % a larger grid for plot
    grid_dim = size(gridx1);
    grid_phi = eval_rbf_bases(rbf_net, [gridx1(:), gridx2(:)]);            % evaluate RBF outputs
    rbf_net.w = 1/sqrt(rbf_net.alpha) * randn(gridside^2, 1);              % sample weights
    rbf_net.b = 1/sqrt(rbf_net.alpha) * randn;
    gridy = rbf_net.b + grid_phi * rbf_net.w;                              % evaluate function y(x)
    surf(gridx1, gridx2, reshape(gridy, grid_dim), 'edgecolor', 'none');   % surf plot
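The text calls eval_rbf_bases without showing it. As a rough sketch of what such a function might compute, assuming isotropic Gaussian bases on a square grid of centres with one shared width, the prior sampling step can be reproduced as below. The names rbf_basis, sqdist, centres and width, and every numeric value, are illustrative assumptions and are not taken from the assignment code.

```matlab
% Hypothetical stand-in for eval_rbf_bases: Gaussian bumps on a gridside-by-gridside grid.
gridside = 5;                                   % assumed number of centres per side
width = 0.3;                                    % assumed common length-scale
[c1, c2] = meshgrid(linspace(0, 1, gridside));  % assumed centre locations in [0, 1]^2
centres = [c1(:), c2(:)];                       % one centre per row

% Squared distances between every input row and every centre, then the Gaussian bump.
sqdist = @(X, C) bsxfun(@plus, sum(X.^2, 2), sum(C.^2, 2)') - 2*X*C';
rbf_basis = @(X) exp(-sqdist(X, centres) / (2*width^2));

% Draw one function from the prior: weights and bias ~ N(0, 1/alpha).
alpha = 1;                                      % assumed prior precision
[gx1, gx2] = meshgrid(linspace(-0.5, 1.5, 41)); % assumed plotting grid
phi = rbf_basis([gx1(:), gx2(:)]);              % (41*41)-by-gridside^2 matrix of basis outputs
w = randn(gridside^2, 1) / sqrt(alpha);
b = randn / sqrt(alpha);
gy = reshape(b + phi*w, size(gx1));             % sampled function values on the grid
```

Calling surf(gx1, gx2, gy, 'edgecolor', 'none') afterwards displays the sampled surface. Increasing alpha shrinks w and b towards zero, which flattens the sampled surface.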

Figure 4: grid RBF function given three different samples of $w \sim \mathcal{N}(0, I_{26})$. Panels: (a) sample 1, (b) sample 2, (c) sample 3.

Three sample functions are shown in Figure 4. The RBF function is nonlinear. More specifically, it is a combination of a set of Gaussian bumps/RBFs (in our case there are 25) on a horizontal plane. The size of each bump/RBF is determined by the weight associated with it, and the height of the vertical offset is determined by the bias term $w_0$ (i.e. rbf_net.b in the rbf_net object). Because of this nonlinearity, it should be able to describe the data better than the linear regression model. Observe that the lengthscales of variation in the plots are rather shorter than in the data, but this arises from choosing each weight independently in the prior. Thus the RBF network should be able to model the given data.

(b) The variance of the Gaussian prior over the weights, $\mathcal{N}(0, \alpha^{-1} I_{26})$, is proportional to $1/\alpha$. As $\alpha$ increases, the variance decreases and the Gaussian prior is squeezed more tightly around 0. The sample of each weight will thus be closer to zero when $\alpha$ becomes larger. As a result, the vertical scaling of the plots is closer to zero, and the size of the Gaussian bumps decreases.

Figure 5: Change $\alpha$, resample $w \sim \mathcal{N}(0, \alpha^{-1} I_{26})$ and RBF function. Panels (a), (b), (c): three values of $\alpha$. Notice the scale on the y-axis.

(c) The MSE of the maximum a posteriori (MAP) estimator on both the training and the test sets can be calculated by

    rbf_map_trn_mse = mean((rbf_net.b + trn_phi * rbf_net.w - y_trn).^2);
    tst_phi = eval_rbf_bases(rbf_net, x_tst);    % RBF outputs for test points
    rbf_map_tst_mse = mean((rbf_net.b + tst_phi * rbf_net.w - y_tst).^2);

The result is

    rbf_map_trn_mse = .,  rbf_map_tst_mse = .7.    (13)

(d) With the new design matrix trn_phi, the MLE for the RBF model has the same form as the MLE for the linear regression model:

    w_rbf_mle = (trn_phi'*trn_phi)\trn_phi'*y_trn;    % MLE (given the design matrix)

The MSEs on the training and the test sets are

    rbf_mle_trn_mse = .9,  rbf_mle_tst_mse = .78.    (14)

The result is slightly better than that given by MAP. Given that there are [...] points in the training set and only 26 parameters to fit, we have a low risk of overfitting the data. However, doing MAP is not a bad choice on this problem, as it gives only a slightly worse MSE and it enables us to do a full probabilistic analysis of the model, such as providing the predictive variances. Both MSE scores of the RBF model are better than those of the linear regression model by an order of magnitude, because the RBF model successfully captures the nonlinearity. Note again the differences between training and test errors, which again may be due to the different x-data distribution between the training and test sets.

(e) Recall that the maximum likelihood estimate of the noise variance is identical to the MSE on the training set. Therefore,

$$ \hat{\sigma}_\eta^2 = \text{rbf\_mle\_trn\_mse} = .9. \tag{15} $$

This estimate is much smaller than the one given by the linear regression model ($\hat{\sigma}_\eta^2$ = .896) but is still larger than the true value $\sigma_\eta^2$ = ..

(f) To visualise the predictive variances in a region, we first need to specify a grid, then evaluate the predictive variance for every point on that grid, and finally make the plot.
    % x1_range, x2_range: grid vectors for the two inputs (placeholder names; the numeric ranges are not legible in this copy)
    [gridx1, gridx2] = meshgrid(x1_range, x2_range);               % create grid
    grid_dim = size(gridx1);
    grid_phi = eval_rbf_bases(rbf_net, [gridx1(:), gridx2(:)]);
    grid_phi = [ones(size(grid_phi,1),1) grid_phi];                % prepend the bias column
    var_pred = zeros(size(grid_phi,1),1);                          % predictive variance
    Vinv = trn_phi'*trn_phi/std_n^2 + rbf_net.alpha*eye(d);        % inverse of V_N
    for ii = 1:size(grid_phi,1)
        var_pred(ii) = grid_phi(ii,:) * inv(Vinv) * grid_phi(ii,:)' + std_n^2;
    end
    imagesc(x1_range, x2_range, reshape(var_pred, grid_dim));
    set(gca, 'YDir', 'normal')
    colorbar

It might be more convenient to visualise in units of standard deviation:

    std_pred = sqrt(var_pred);
    imagesc(x1_range, x2_range, reshape(std_pred, grid_dim));

In the assignment we were given $V_N = \sigma_\eta^2 (\alpha \sigma_\eta^2 I + \Phi^T \Phi)^{-1}$, which can be rewritten as $V_N = (\alpha I + \sigma_\eta^{-2} \Phi^T \Phi)^{-1}$. In the code we have used inv(Vinv), although one could also use the / operator.

The lowest predictive variance is in the two areas ( < $x_1$ < ., . < $x_2$ < .) and (. < $x_1$ < , . < $x_2$ < .), which are separated by the gap . < $x_1$ < .. The predictive variance in the gap area is relatively high. Comparing with Figure 1, we find that the two low-variance areas are where the data are, which helps to reduce the variance in prediction. In fact, the lowest predictive variance read from the figure is only slightly larger than ., which means that the uncertainty of the prediction at these low-variance points mainly results from the Gaussian noise $\eta$. There are five RBF centres in the gap, but little training data that can be used to reduce the prior uncertainty over the weights associated with them. Thus the uncertainty of these weights remains large, leading to a large predictive variance. Finally, for points that are far away from the RBFs, the RBF outputs are very small and thus the uncertainty in the weights cannot pass to the prediction, except through $w_0$:

$$ f(x) = w_0 + \sum_{k=1}^{25} w_k \phi_k(x) \approx w_0, \quad \text{when } \phi_k(x) \approx 0. \tag{16} $$

Therefore, we see a constant predictive variance of $\alpha^{-1} + \sigma_\eta^2$ far from the central area.
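The remark about the / operator can be made concrete. The sketch below uses random stand-in matrices (Phi_trn, Phi_grid, and the values of d, alpha and std_n are made up for illustration and are not the assignment's quantities). It shows that the per-point loop with inv can be replaced by a single matrix right-division, and that the two forms of $V_N$ quoted above agree numerically.

```matlab
% Stand-in quantities; in the assignment these come from the RBF network and the data.
d = 6;                                   % assumed number of parameters (bias + bases)
Phi_trn = [ones(50, 1) rand(50, d-1)];   % fake training design matrix
Phi_grid = [ones(20, 1) rand(20, d-1)];  % fake design matrix for the plotting grid
alpha = 1; std_n = 0.1;                  % assumed prior precision and noise std

Vinv = Phi_trn'*Phi_trn/std_n^2 + alpha*eye(d);   % posterior precision, i.e. inverse of V_N

% Loop version, as in the text.
var_loop = zeros(size(Phi_grid, 1), 1);
for ii = 1:size(Phi_grid, 1)
    var_loop(ii) = Phi_grid(ii,:) * inv(Vinv) * Phi_grid(ii,:)' + std_n^2;
end

% Vectorised version: Phi_grid / Vinv computes Phi_grid * inv(Vinv) without forming the inverse.
var_vec = sum((Phi_grid / Vinv) .* Phi_grid, 2) + std_n^2;

% The same posterior covariance, written in the form given in the assignment.
V_N = std_n^2 * inv(alpha*std_n^2*eye(d) + Phi_trn'*Phi_trn);
```

The vectorised line evaluates only the diagonal of Phi_grid * V_N * Phi_grid', which is all the plot needs, so it avoids building the full grid-by-grid covariance matrix.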

Figure 6: The predictive variance and standard deviation of the RBF model plotted on the grid. Panels: (a) predictive variance, (b) predictive standard deviation.
