Prediction problems 3: Validation and Model Checking


1 Prediction problems 3: Validation and Model Checking
Data Science 101 Team
May 17, 2018

2 Outline
- Validation
  - Why is it important?
  - How should we do it?
- Model checking
  - Checking whether your model is a good fit to the data
  - What to do if it is not?

3 Using a very powerful model
Let us do a small experiment predicting handwritten digits again (4 versus 9), where x records the pixel intensities of the image and y ∈ {-1, 1}.
Idea: why use just the pixel values? Why not squares or other powers of the pixel values? Use a more powerful prediction model with vectors β^(1), β^(2), and β^(3), which weight powers of pixel intensity:

    \hat{y} = \beta_0 + \sum_{j=1}^p \beta_j^{(1)} x_j
            + \underbrace{\sum_{j=1}^p \beta_j^{(2)} x_j^2}_{\text{quadratic terms}}
            + \underbrace{\sum_{j=1}^p \beta_j^{(3)} x_j^3}_{\text{cubic terms}}

4 Generating Polynomials in R

    Xtrain.powers = cbind(Xtrain, Xtrain * Xtrain, Xtrain * Xtrain * Xtrain)
    Xtest.powers = cbind(Xtest, Xtest * Xtest, Xtest * Xtest * Xtest)
    poly.reg = lm(ytrain ~ Xtrain.powers)

What is your hypothesis about how this model will do on
- the error on the training data?
- the error on the testing data?

5 Experiments with a powerful model on digit recognition

    ## Errors in training
    cat(paste("Training data: ", sum(sign(poly.reg$fitted.values) != ytrain),
              " mistakes of ", length(ytrain), " data points\n", sep = ""))

Training data: 0 mistakes of 1296 data points

Now, the moment of truth: how does our super classifier work?

    beta.0 = poly.reg$coefficients[1]
    beta = poly.reg$coefficients[2:length(poly.reg$coefficients)]
    test.pred = Xtest.powers %*% beta + beta.0
    ## Now, let's count the mistakes
    mistakes = which(sign(test.pred) != ytest)
    cat(paste("Test data: ", length(mistakes), " mistakes of ", length(ytest),
              " data points (", round(100 * length(mistakes)/length(ytest), digits = 1),
              "% error)\n", sep = ""))

Test data: 29 mistakes of 377 data points (7.7% error)

6 Experiments with a simple model on digit recognition

    linreg = lm(ytrain ~ Xtrain)
    ## Errors in training
    mistakes = which(sign(linreg$fitted.values) != ytrain)
    cat(paste("Training data: ", length(mistakes), " mistakes of ", length(ytrain),
              " data points (", round(100 * length(mistakes)/length(ytrain), digits = 1),
              "% error)\n", sep = ""))

Training data: 7 mistakes of 1296 data points (0.5% error)

7 What about our old standby simple classifier?

    beta.0 = linreg$coefficients[1]
    beta = linreg$coefficients[2:length(linreg$coefficients)]
    test.pred = Xtest %*% beta + beta.0
    ## Now, let's count the mistakes
    mistakes = which(sign(test.pred) != ytest)
    cat(paste("Test data: ", length(mistakes), " mistakes of ", length(ytest),
              " data points (", round(100 * length(mistakes)/length(ytest), digits = 1),
              "% error)\n", sep = ""))

Test data: 11 mistakes of 377 data points (2.9% error)

8 Validation
How can we check how good our classifier is? The basic goal in prediction (machine learning) is to do well on future data. Often, the best source of "future" data is to hold some data out of the training set, in a test set or validation set. Why do we do this?
- To avoid overfitting: forcing our model to match our training data too closely
- To confirm that we are making reasonable predictions
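
One simple way to form such a split is to sample a random subset of indices to hold out. A minimal sketch in R, where the names X and y and the 25% fraction are illustrative assumptions, not from the slides:

    set.seed(1)
    n = nrow(X)
    holdout = sample(n, size = round(0.25 * n))   # reserve 25% of the rows
    Xtrain = X[-holdout, ]; ytrain = y[-holdout]  # fit the model on these
    Xtest  = X[holdout, ];  ytest  = y[holdout]   # report error on these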

9 A little theory (the classification case)
Suppose we fit a model with parameters β on a training set. We keep out a validation (or test) set of size N, with pairs (x_i, y_i), independent of the training data. For a classification problem, with very high probability, the validation error rate

    \widehat{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^N 1\{\hat{y}_i \neq y_i\}

is an accurate measure of the true error rate of our classifier for all future data, at least to within

    \sqrt{\widehat{\mathrm{err}} / N} + 4/N
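
To make the guarantee concrete, here is the calculation in R for the fancier classifier from the earlier slides (29 mistakes out of N = 377):

    err.hat = 29 / 377            # observed validation error rate (7.7%)
    N = 377                       # validation set size
    sqrt(err.hat / N) + 4 / N     # accuracy of the estimate: about 0.025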

10 Classification continued...
[Figure: validation error as a function of validation set size, for several random validation sets]
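
The figure itself did not survive the transcription, but the phenomenon is easy to simulate. A sketch, assuming a classifier whose true error rate is 5% (an illustrative choice): each validation prediction is wrong independently with that probability, so the observed error rate concentrates around the truth as N grows.

    set.seed(1)
    true.err = 0.05
    sizes = seq(50, 2000, by = 50)
    plot(NULL, xlim = range(sizes), ylim = c(0, 0.15),
         xlab = "validation set size N", ylab = "validation error rate")
    for (trial in 1:5) {
      # each of the N validation predictions is wrong with probability true.err
      err.hat = sapply(sizes, function(N) mean(rbinom(N, 1, true.err)))
      lines(sizes, err.hat, col = trial)
    }
    abline(h = true.err, lty = 2)   # the true error rate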

11 Revisiting validation of our models
For the simple classifier:

Test data: 11 mistakes of 377 data points (2.9% error)

The true error rate (on future images) should be no more than

    \sqrt{.029 / 377} + 4/377 \approx .019

better or worse than 2.9% error. For the fancier classifier:

Test data: 29 mistakes of 377 data points (7.7% error)

So the true error rate (on future images) is (likely) no more than

    \sqrt{.077 / 377} + 4/377 \approx .025

better or worse than 7.7% error.

12 What is going on?
Overfitting: we have overfit to our training data. When we use a model that is too powerful for the amount of data we have, we fit spurious junk.

    N = 10
    # Generate data that is nothing but random Normal noise
    y = 0.25 * rnorm(N)
    x = seq(0, 1, length.out = N)

[Figure: scatter plot of the noise y against x]

Fit the noise with a model of the form

    \hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_9 x^9

13 Fitting a model that predicts the data perfectly

    X = cbind(x)
    for (ii in 2:(N - 1)) {
      X = cbind(X, x^ii)  # construct a design matrix with all powers up to x^9
    }
    polynomial.linreg = lm(y ~ X)

[Figure: the fitted degree-9 polynomial passes through every data point]

14 Generating a little more data
And yet, if we get a bit more data, it becomes clear we have overfit.

    y.additional = 0.25 * rnorm(100)
    plot(x, y, xlab = "x", pch = 21, ylim = c(-2, 2), ylab = "y", cex = 2)
    points(x, y, pch = 20)
    points(x.interp, y.additional, pch = 20)
    lines(c(0, 1), c(0, 0), col = "red", lwd = 2)   # the true mean, zero everywhere
    lines(x.interp, yhat, col = "blue", lwd = 2)    # the fitted polynomial

(Here x.interp, a finer grid of 100 x values, and yhat, the polynomial's predictions on it, come from code not shown in this transcription.)

[Figure: original points, additional points, the true mean (red line at 0), and the fitted polynomial (blue curve)]

15 Overfitting and the bias-variance tradeoff
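
The slide's figure is not in the transcription; as a stand-in, here is a self-contained sketch of the tradeoff (the sine-plus-noise data-generating model and all settings are assumptions). Training error falls monotonically with polynomial degree, while test error eventually rises:

    set.seed(1)
    n = 30
    x = runif(n); y = sin(2 * pi * x) + 0.3 * rnorm(n)
    x.test = runif(1000); y.test = sin(2 * pi * x.test) + 0.3 * rnorm(1000)
    errs = sapply(1:12, function(d) {
      fit = lm(y ~ poly(x, d))      # polynomial fit of degree d
      c(train = mean(fit$residuals^2),
        test = mean((y.test - predict(fit, data.frame(x = x.test)))^2))
    })
    matplot(1:12, t(errs), type = "l", lty = 1, col = c("blue", "red"),
            xlab = "polynomial degree", ylab = "mean squared error")
    legend("topleft", c("train", "test"), lty = 1, col = c("blue", "red"))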

16 Model checking
Defining the model (according to George Box):

"All models are wrong; some models are useful... Just as the ability to devise simple but evocative models is the signature of the great scientist, so overelaboration and overparameterization is often the mark of mediocrity."

What do we check for in a regression model?
- Are the assumptions reasonable?
- Have we chosen good features (x variables)?
We consider two diagnostics (also good for modeling):
- Residual plots
- Probability plots

17 Model checking: residual plots
Do we have good features? We make predictions

    \hat{y} = \beta_0 + \sum_{j=1}^p \beta_j x_j

Consider the errors (residuals) in each prediction:

    r = y - \hat{y} = y - \beta_0 - \sum_{j=1}^p \beta_j x_j

For each variable j ∈ {1, 2, ..., p}, we ask: is the assumption of linearity reasonable?

18 Model checking: what function should we use?
Idea: holding all other variables constant (but at their best fits), what is the right fit for variable j? Plot the prediction errors y - \hat{y}, with the predicted contribution of variable x_j added back in, versus the variable x_j. That is, plot

    \underbrace{y - \hat{y}}_{\text{residual}} + \beta_j x_j
      = y - \underbrace{\Big(\beta_0 + \sum_{k \neq j} \beta_k x_k\Big)}_{\text{prediction with component } j \text{ removed}}
      \quad \text{vs. } x_j

19 Model checking example: finding the right functions
Generate data from the model

    y = 2.1 x_1 + 3.5 x_2 - 1.9 x_3 + 2 x_4 + 1.5 x_1^2 + \varepsilon

where ε is normal with mean 0 and variance 1/4 (so standard deviation 1/2):

    p = 4
    n = 100
    betas = c(2.1, 3.5, -1.9, 2, 1.5)
    x.samples = rnorm(n * p)
    X = matrix(x.samples, nrow = n, ncol = p)
    X.with.quadratic = cbind(X, X[, 1]^2)
    y = X.with.quadratic %*% betas + 0.5 * rnorm(n)

We have generated n = 100 points from this distribution.

20 Model checking example: basic plots
Plot y against each coordinate of x:

    par(mfrow = c(2, 2))
    plot(X[, 1], y, xlab = "x1", ylab = "y")
    plot(X[, 2], y, xlab = "x2", ylab = "y")
    plot(X[, 3], y, xlab = "x3", ylab = "y")
    plot(X[, 4], y, xlab = "x4", ylab = "y")

[Figure: 2 x 2 panel of scatter plots of y against x1, x2, x3, and x4]

21 Model checking example: first diagnostic
Plot the partial residuals against each coordinate of x:

    linreg = lm(formula = y ~ X)
    betas = linreg$coefficients[2:(p + 1)]
    par(mfrow = c(2, 2))
    plot(X[, 1], linreg$residuals + X[, 1] * betas[1], xlab = "x1", ylab = "residual")
    plot(X[, 2], linreg$residuals + X[, 2] * betas[2], xlab = "x2", ylab = "residual")
    plot(X[, 3], linreg$residuals + X[, 3] * betas[3], xlab = "x3", ylab = "residual")
    plot(X[, 4], linreg$residuals + X[, 4] * betas[4], xlab = "x4", ylab = "residual")

[Figure: 2 x 2 panel of partial-residual plots against x1, x2, x3, and x4]
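
Base R can draw the same partial-residual pictures with stats::termplot. A sketch: termplot works term by term, so we repackage the matrix X into a data frame first (the names df and fit are ours, not the slides'):

    df = data.frame(x1 = X[, 1], x2 = X[, 2], x3 = X[, 3], x4 = X[, 4],
                    y = as.numeric(y))
    fit = lm(y ~ x1 + x2 + x3 + x4, data = df)
    par(mfrow = c(2, 2))
    termplot(fit, partial.resid = TRUE, smooth = panel.smooth)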

22 Model checking example: adding a quadratic in x1
Plot the partial residuals against each coordinate of x, after adding x1^2 as a feature:

    X.with.quadratic = cbind(X, X[, 1]^2)
    linreg = lm(formula = y ~ X.with.quadratic)
    betas = linreg$coefficients[2:(p + 1)]  # the four linear-term coefficients
    par(mfrow = c(2, 2))
    plot(X[, 1], linreg$residuals + X[, 1] * betas[1], xlab = "x1", ylab = "residual")
    plot(X[, 2], linreg$residuals + X[, 2] * betas[2], xlab = "x2", ylab = "residual")
    plot(X[, 3], linreg$residuals + X[, 3] * betas[3], xlab = "x3", ylab = "residual")
    plot(X[, 4], linreg$residuals + X[, 4] * betas[4], xlab = "x4", ylab = "residual")

[Figure: 2 x 2 panel of partial-residual plots against x1, x2, x3, and x4, with the quadratic term included in the model]

23 Model checking: do we have fidelity to the data?
The QQ ("quantile-quantile") plot is a plot of the quantiles of one distribution against those of another.
Quantile of a distribution: for α ∈ [0, 1],

    q_\alpha = q \text{ such that } P(Y \leq q) = \alpha

The function qqnorm plots the quantiles of a sample against those of a normal distribution. If things are normal, the qqnorm plot should look linear.
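
To see how the definition turns into the picture, here is a hand-rolled version of what qqnorm does (a sketch; the sample r is a placeholder for, say, standardized residuals):

    r = rnorm(100)                 # placeholder sample
    alpha = ppoints(length(r))     # evenly spaced quantile levels in (0, 1)
    plot(qnorm(alpha), sort(r),    # normal quantiles vs. sample quantiles
         xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
    abline(0, 1, col = "red", lwd = 2)   # points on this line look normal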

24 Model checking: do we have fidelity to the data?
Example 1: our simulation

    qqnorm(linreg$residuals/sd(linreg$residuals))
    lines(c(-2, 2), c(-2, 2), col = "red", lwd = 2)

[Figure: Normal Q-Q plot, sample quantiles vs. theoretical quantiles]

25 Model checking: does our model look good?
Example 2: the Boston housing data set

    library(MASS)
    data(Boston)
    ## Remove the $50,000 sale prices, as medv is top-coded at 50
    boston = Boston[Boston$medv != 50, ]
    fullregression = lm(formula = medv ~ ., data = boston)
    qqnorm(fullregression$residuals/sd(fullregression$residuals))
    lines(c(-3, 3), c(-3, 3), col = "red", lwd = 2)

[Figure: Normal Q-Q plot, sample quantiles vs. theoretical quantiles]
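
If the QQ plot shows clear departures from the line, a common next step (a sketch of one option, not from the slides, in the spirit of the outline's "what to do if it is not?") is to transform the response and re-check:

    ## refit with a log-transformed response and inspect the residuals again
    logregression = lm(formula = log(medv) ~ ., data = boston)
    qqnorm(logregression$residuals / sd(logregression$residuals))
    lines(c(-3, 3), c(-3, 3), col = "red", lwd = 2)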
