Statistical Prediction
P.R. Hahn
Fall 2017

Some terminology

The goal is to use data to find a pattern that we can exploit.

- y: response/outcome/dependent/left-hand-side
- x: predictor/covariate/feature/independent
- f(x): regression/function estimation/prediction/forecasting/classification/curve fitting

The pattern is statistical in the sense that it holds only approximately.

Signal + noise

[Figure: scatterplot of y against x]

$y = f(x) + \epsilon$

Minimize the average size of the noise

- among a certain function class (such as lines or polynomials)
- minimize the average deviation from the curve
- squared distance is common (mean squared error)
- as we get more data, our predictions get better

$y = \alpha + x\beta + \epsilon$

    f <- function(x, alpha, beta){
      fx <- alpha + beta*x
      return(fx)
    }

$\mathrm{MSE} = \frac{1}{n} \sum_i \left(y_i - \hat{f}(x_i)\right)^2$

    mse <- function(y, fhat){
      return(mean((y - fhat)^2))
    }

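To make the MSE concrete, here is a minimal sketch that scores two candidate lines on a tiny data set. The four data points and both parameter settings are made up for illustration; only `f()` and `mse()` come from the slide.

```r
# Linear prediction function and mean squared error, as defined on the slide.
f <- function(x, alpha, beta) alpha + beta * x
mse <- function(y, fhat) mean((y - fhat)^2)

# Toy data (invented for illustration): roughly y = 1 + 2x plus small errors.
x <- c(1, 2, 3, 4)
y <- c(3.1, 4.9, 7.2, 8.8)

mse_good <- mse(y, f(x, alpha = 1, beta = 2))  # a line close to the data
mse_bad  <- mse(y, f(x, alpha = 0, beta = 1))  # a line far from the data
print(c(good = mse_good, bad = mse_bad))
```

The line with the smaller MSE is the better fit within the class of lines; fitting means searching over (alpha, beta) for the smallest value.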
Load the auto data

    auto <- read.csv("auto.csv")
    str(auto)

    'data.frame':   397 obs. of  9 variables:
     $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
     $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
     $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
     $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
     $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
     $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
     $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
     $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
     $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Next, we need to clean the data.

    auto[auto == "?"] <- NA
    # as.numeric() applied directly to a factor returns the integer level codes,
    # so convert to character first to recover the actual horsepower values.
    # (The horsepower numbers printed on later slides, such as the 0.45
    # correlation with mpg, appear to come from the level codes.)
    auto$horsepower <- as.numeric(as.character(auto$horsepower))
    auto <- auto[complete.cases(auto),]
    auto$origin <- as.factor(auto$origin)

Examine the correlations

    names(auto)

    [1] "mpg"          "cylinders"    "displacement" "horsepower"
    [5] "weight"       "acceleration" "year"         "origin"
    [9] "name"

    cormat <- cor(auto[,-c(8,9)])
    print(round(matrix(as.numeric(cormat),7,7),2))

          [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
    [1,]  1.00 -0.78 -0.81  0.45 -0.83  0.42  0.58
    [2,] -0.78  1.00  0.95 -0.57  0.90 -0.50 -0.35
    [3,] -0.81  0.95  1.00 -0.51  0.93 -0.54 -0.37
    [4,]  0.45 -0.57 -0.51  1.00 -0.51  0.28  0.14
    [5,] -0.83  0.90  0.93 -0.51  1.00 -0.42 -0.31
    [6,]  0.42 -0.50 -0.54  0.28 -0.42  1.00  0.29
    [7,]  0.58 -0.35 -0.37  0.14 -0.31  0.29  1.00

Rows and columns are in the order mpg, cylinders, displacement, horsepower, weight, acceleration, year.

Examine the scatterplots

    plot(auto[,c("mpg","cylinders","displacement","weight")])

[Figure: scatterplot matrix of mpg, cylinders, displacement, and weight]

Write a function, optimize it

    tempf <- function(parms){
      alpha <- parms[1]
      beta <- parms[2]
      y <- auto$mpg
      fhat <- f(auto$weight, alpha, beta)
      return(mse(y, fhat))
    }
    optim(c(0,0), tempf)

    $par
    [1] 46.213730485 -0.007646194

    $value
    [1] 18.67662

    $counts
    function gradient
         129       NA

    $convergence
    [1] 0

    $message
    NULL

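For a single predictor the minimizer of the MSE also has a closed form, which gives a useful check on the numerical search. The sketch below assumes the built-in `mtcars` data as a stand-in for the auto data (an assumption, since `auto.csv` is not bundled here); `wt` plays the role of `weight`.

```r
# mtcars ships with R; it stands in for the auto data here (assumption).
x <- mtcars$wt
y <- mtcars$mpg

f <- function(x, alpha, beta) alpha + beta * x
mse <- function(y, fhat) mean((y - fhat)^2)

# Numerical minimization of the MSE, in the same style as the slide.
objective <- function(parms) mse(y, f(x, parms[1], parms[2]))
opt <- optim(c(0, 0), objective)

# Closed-form least-squares solution for one predictor.
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)
mse_closed <- mse(y, f(x, alpha_hat, beta_hat))

print(rbind(optim = opt$par, closed_form = c(alpha_hat, beta_hat)))
```

The two rows should agree to a few decimal places; `optim()` stops at a tolerance, so small differences are expected.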
Simple linear regression

We use the lm() function to predict mpg using weight.

    fit <- lm(mpg~weight, data = auto)
    summary(fit)

    Call:
    lm(formula = mpg ~ weight, data = auto)

    Residuals:
         Min       1Q   Median       3Q      Max
    -11.9736  -2.7556  -0.3358   2.1379  16.5194

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 46.216524   0.798673   57.87   <2e-16 ***
    weight      -0.007647   0.000258  -29.64   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.333 on 390 degrees of freedom
    Multiple R-squared:  0.6926,  Adjusted R-squared:  0.6918
    F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16

Now we plot the fit

    plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
    abline(a=fit$coefficients[1], b=fit$coefficients[2], col='red', lwd=3)

[Figure: scatterplot of mpg against weight with the fitted line in red]

Now it is your turn

Fit single-variable linear models using cylinders, displacement, and year. Which one looks best?

Can the linear trend be improved?

Here we add nonlinear features within a linear model.

    fit_nl <- lm(mpg ~ poly(weight,2,raw=TRUE), data=auto)
    plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
    points(auto$weight, fit_nl$fitted.values, col='red', pch=20)

[Figure: scatterplot of mpg against weight with the quadratic fitted values in red]

The quadratic model is better

    summary(fit_nl)

    Call:
    lm(formula = mpg ~ poly(weight, 2, raw = TRUE), data = auto)

    Residuals:
         Min       1Q   Median       3Q      Max
    -12.6246  -2.7134  -0.3485   1.8267  16.0866

    Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)                   6.226e+01  2.993e+00  20.800  < 2e-16 ***
    poly(weight, 2, raw = TRUE)1 -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
    poly(weight, 2, raw = TRUE)2  1.697e-06  3.059e-07   5.545 5.43e-08 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.176 on 389 degrees of freedom
    Multiple R-squared:  0.7151,  Adjusted R-squared:  0.7137
    F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

Now you: compare the degree 1, 2, and 3 polynomial models in terms of the residual standard error.

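One way to organize such a comparison is to loop over the degree and collect the residual standard error from each fit. This is only a sketch of the mechanics: it assumes the built-in `mtcars` data (with `wt` in place of `weight`) rather than the auto data, so the numbers are not the answer to the exercise.

```r
# mtcars (built into R) stands in for the auto data; this is an assumption.
degrees <- 1:3
sigma_by_degree <- sapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(wt, d, raw = TRUE), data = mtcars)
  summary(fit)$sigma   # residual standard error of this fit
})
names(sigma_by_degree) <- paste0("degree_", degrees)
print(round(sigma_by_degree, 3))
```

The residual standard error divides by the residual degrees of freedom, so unlike the in-sample sum of squares it can increase when an added term contributes little.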
The predictive impact of a variable now depends on the level

Here we use the predict() function.

    auto$weight[1]

    [1] 3504

    temp <- auto[1,]
    temp$weight <- temp$weight + 200
    predict(fit_nl, newdata = temp) - predict(fit_nl, newdata = auto[1,])

            1
    -1.253354

    temp <- auto[1,]
    temp$weight <- temp$weight - 200
    predict(fit_nl, newdata = auto[1,]) - predict(fit_nl, newdata = temp)

            1
    -1.389079

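The two differences can be checked by hand. Writing the quadratic fit as $\hat f(w) = \hat\beta_0 + \hat\beta_1 w + \hat\beta_2 w^2$ (the symbols are notation introduced here, not from the slides) and taking a step of size $\delta$:

```latex
\begin{aligned}
\hat f(w+\delta) - \hat f(w) &= \hat\beta_1\,\delta + \hat\beta_2\,(2w\delta + \delta^2),\\
\hat f(w) - \hat f(w-\delta) &= \hat\beta_1\,\delta + \hat\beta_2\,(2w\delta - \delta^2).
\end{aligned}
```

With the rounded estimates from the summary of fit_nl ($\hat\beta_1 \approx -1.850 \times 10^{-2}$, $\hat\beta_2 \approx 1.697 \times 10^{-6}$), $w = 3504$ and $\delta = 200$, the first expression is about $-3.70 + 2.45 \approx -1.25$ and the second about $-3.70 + 2.31 \approx -1.39$, matching the predict() output up to rounding. The $\pm\delta^2$ term is why the same 200-pound change has a different predicted effect depending on where it starts.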
What if more than one thing matters?

Figure 1: With multiple variables we are now fitting a (hyper)plane.

The lm() function handles this.

    fit_mlr <- lm(mpg ~ . - name - origin + as.factor(origin) + poly(weight,2) - weight, data=auto)
    summary(fit_mlr)

    Call:
    lm(formula = mpg ~ . - name - origin + as.factor(origin) + poly(weight,
        2) - weight, data = auto)

    Residuals:
        Min      1Q  Median      3Q     Max
    -8.9552 -1.6941  0.0041  1.7087 12.8701

    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
    (Intercept)        -4.629e+01  4.239e+00 -10.919  < 2e-16 ***
    cylinders          -1.059e-01  3.058e-01  -0.346 0.729387
    displacement        1.250e-02  6.712e-03   1.862 0.063341 .
    horsepower          6.799e-03  6.409e-03   1.061 0.289420
    acceleration        1.724e-01  6.991e-02   2.467 0.014077 *
    year                8.461e-01  4.614e-02  18.337  < 2e-16 ***
    as.factor(origin)2  1.756e+00  5.154e-01   3.407 0.000727 ***
    as.factor(origin)3  1.283e+00  5.080e-01   2.525 0.011959 *
    poly(weight, 2)1   -1.159e+02  8.934e+00 -12.978  < 2e-16 ***
    poly(weight, 2)2    2.964e+01  3.196e+00   9.273  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.986 on 382 degrees of freedom
    Multiple R-squared:  0.857,  Adjusted R-squared:  0.8536
    F-statistic: 254.3 on 9 and 382 DF,  p-value: < 2.2e-16

Actual-vs-Predicted plots

    plot(auto$mpg, fit_mlr$fitted.values, pch=20, main='',
         xlab='mpg', ylab='predicted mpg')
    abline(0, 1, col='red')

[Figure: predicted mpg against actual mpg, with the 45-degree line in red]

Interactions

    fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data=auto)
    round(coef(summary(fit_mlr2)),4)

                              Estimate Std. Error  t value Pr(>|t|)
    (Intercept)                46.3735    17.1780   2.6996   0.0073
    poly(weight, 2)1         -113.9304     4.5616 -24.9758   0.0000
    poly(weight, 2)2           31.4737     3.7640   8.3617   0.0000
    origin2                   -32.2517     9.3921  -3.4339   0.0007
    origin3                   -18.8180     8.7171  -2.1588   0.0315
    year                       -0.2936     0.2300  -1.2769   0.2024
    acceleration               -5.1336     1.1147  -4.6053   0.0000
    poly(weight, 2)1:origin2  -83.3332    29.7957  -2.7968   0.0054
    poly(weight, 2)2:origin2  -61.8395    19.2514  -3.2122   0.0014
    poly(weight, 2)1:origin3   69.2630    83.0722   0.8338   0.4049
    poly(weight, 2)2:origin3   25.2491    36.1687   0.6981   0.4856
    year:acceleration           0.0666     0.0148   4.4860   0.0000
    origin2:year                0.2497     0.1196   2.0884   0.0374
    origin3:year                0.2195     0.1007   2.1783   0.0300
    origin2:acceleration        0.7126     0.1361   5.2356   0.0000
    origin3:acceleration        0.3112     0.2080   1.4958   0.1356

    summary(fit_mlr2)$sigma

    [1] 2.711726

    summary(fit_mlr2)$adj.r.squared

    [1] 0.8792895

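The formula operators `*` and `^2` are shorthand, and it can help to see what they expand to. The sketch below uses `terms()` from base R, which only parses the formula, so no data is needed; the variable names `y`, `w`, `g`, `a`, `b` and the helper `expand_terms()` are placeholders introduced here.

```r
# expand_terms() is a hypothetical helper for this sketch: it lists the
# individual terms that a model formula expands to.
expand_terms <- function(formula) attr(terms(formula), "term.labels")

crossed  <- expand_terms(y ~ poly(w, 2) * g)   # main effects plus interaction
pairwise <- expand_terms(y ~ (a + b + g)^2)    # all main effects and pairs
print(crossed)
print(pairwise)
```

So `A*B` means `A + B + A:B`, and `(a + b + g)^2` means all main effects plus every two-way interaction. When a factor is involved, each such term becomes one coefficient per non-baseline level, which is how the coefficient table above ends up with 16 rows.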
Training and test (validation) sets

    n <- nrow(auto)
    ntest <- floor(0.2*n)
    ntrain <- n - ntest
    testrows <- sample(1:n, ntest, replace = FALSE)
    auto_test <- auto[testrows,]
    auto_train <- auto[-testrows,]

    fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = auto_train)
    summary(fit_mlr2)$sigma

    [1] 2.707457

    yhat <- predict(fit_mlr2, newdata = auto_test)
    sqrt(mse(auto_test$mpg, yhat))

    [1] 2.898672

    fit_mlr <- lm(mpg ~ poly(weight,2) + year, data=auto_train)
    summary(fit_mlr)$sigma

    [1] 3.0546

    yhat <- predict(fit_mlr, newdata = auto_test)
    sqrt(mse(auto_test$mpg, yhat))

    [1] 2.946425

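Because the split is random, the test error changes from run to run. A minimal sketch of one way to study that: wrap the split in a function and vary the seed. `holdout_rmse()` is a hypothetical helper written for this example, and the built-in `mtcars` data stands in for the auto data (both are assumptions, not part of the original slides).

```r
# mtcars stands in for the auto data (assumption); holdout_rmse is a
# hypothetical helper, not part of the original slides.
mse <- function(y, fhat) mean((y - fhat)^2)

holdout_rmse <- function(formula, data, test_frac = 0.2, seed = 1) {
  set.seed(seed)                        # makes the random split reproducible
  n <- nrow(data)
  testrows <- sample(seq_len(n), floor(test_frac * n))
  fit <- lm(formula, data = data[-testrows, ])
  response <- all.vars(formula)[1]      # name of the left-hand-side variable
  yhat <- predict(fit, newdata = data[testrows, ])
  sqrt(mse(data[testrows, response], yhat))
}

# Out-of-sample RMSE for the same model over five different splits.
rmse_by_seed <- sapply(1:5, function(s)
  holdout_rmse(mpg ~ poly(wt, 2) + hp, mtcars, seed = s))
print(round(rmse_by_seed, 2))
```

The spread across seeds gives a rough sense of how much of a difference between two models' test errors could be due to the split alone.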
Classification

    auto$highmpg <- auto$mpg > as.numeric(quantile(auto$mpg,0.25))
    plot(auto$weight, auto$highmpg, main='', xlab='weight', ylab='high MPG')
    a <- tapply(auto$weight, cut(auto$weight,seq(1500,5000,by=500)), mean)
    b <- tapply(auto$highmpg, cut(auto$weight,seq(1500,5000,by=500)), mean)
    points(a, b, col='red', pch=20, cex=2)

[Figure: high MPG indicator (0/1) against weight, with the proportion of high-MPG cars in each 500-pound weight bin in red]

Decision boundary

    plot(auto$weight[auto$highmpg], auto$year[auto$highmpg], pch=20,
         col='red', xlim=range(auto$weight), ylim = range(auto$year),
         xlab = "weight", ylab = "year")
    points(auto$weight[!auto$highmpg], auto$year[!auto$highmpg], pch=20,
           col='cyan')

[Figure: year against weight, high-MPG cars in red and the rest in cyan]

Logistic regression

These three expressions are all the same.

$$\Pr(y = 1 \mid x) = \frac{\exp(\alpha + x\beta)}{1 + \exp(\alpha + x\beta)}$$

$$\Pr(y = 1 \mid x) = \frac{1}{1 + \exp(-(\alpha + x\beta))}$$

$$\Pr(y = 1 \mid x) = \left(1 + \exp(-(\alpha + x\beta))\right)^{-1}$$

A model like this is a type of generalized linear model (GLM).

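The equivalence of the three forms is easy to confirm numerically. In this sketch the values of `alpha` and `beta` are arbitrary and not fitted to anything; `plogis()` is base R's logistic function.

```r
# Arbitrary illustrative values; alpha and beta are not fitted to any data.
alpha <- -1.5
beta  <- 0.8
x <- seq(-5, 5, by = 0.5)
eta <- alpha + x * beta               # the linear predictor

p1 <- exp(eta) / (1 + exp(eta))
p2 <- 1 / (1 + exp(-eta))
p3 <- (1 + exp(-eta))^(-1)

print(all.equal(p1, p2))              # equal up to floating-point tolerance
print(all.equal(p2, p3))
print(all.equal(p1, plogis(eta)))     # matches base R's logistic function
```

Whatever the linear predictor is, the result always lies strictly between 0 and 1, which is what makes it usable as a probability.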
Fitting a logistic regression

We use the glm() function.

    fit_glm <- glm(highmpg ~ weight + year, family = binomial, data = auto)
    summary(fit_glm)

    Call:
    glm(formula = highmpg ~ weight + year, family = binomial, data = auto)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.89256  -0.00817   0.01836   0.09080   2.44179

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.579e+01  6.610e+00  -2.388   0.0169 *
    weight      -5.699e-03  7.941e-04  -7.177 7.15e-13 ***
    year         4.852e-01  1.003e-01   4.839 1.31e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 443.05  on 391  degrees of freedom
    Residual deviance: 100.04  on 389  degrees of freedom
    AIC: 106.04

    Number of Fisher Scoring iterations: 8

Plot the fitted boundary

Let's do some algebra.

    w <- seq(1500, 5000)
    yr <- (fit_glm$coefficients[1] + fit_glm$coefficients[2]*w) /
          (-fit_glm$coefficients[3])
    lines(w, yr, col='magenta', lwd=3)

[Figure: year against weight with the fitted linear decision boundary in magenta]

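The algebra behind the code is short. Assuming the boundary is drawn where the fitted probability equals 1/2 (the slide does not state the cutoff, but this is what the expression for yr computes), and writing $\alpha$, $\beta_w$, $\beta_{yr}$ for the intercept, weight, and year coefficients (notation introduced here):

```latex
\begin{aligned}
\Pr(y = 1 \mid w, \mathrm{yr}) = \tfrac{1}{2}
  &\iff \alpha + \beta_w\, w + \beta_{yr}\, \mathrm{yr} = 0 \\
  &\iff \mathrm{yr} = \frac{\alpha + \beta_w\, w}{-\beta_{yr}} .
\end{aligned}
```

The first step uses the fact that the logistic function equals 1/2 exactly when its argument is zero. The boundary is a straight line in the (weight, year) plane because the linear predictor is linear in both variables.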
ROC curves: How sharp is the decision boundary?

The receiver operating characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the classification threshold varies.

FPR = #(INCORRECTLY PREDICTED POSITIVES)/(TOTAL #NEGATIVES)

TPR = #(CORRECTLY PREDICTED POSITIVES)/(TOTAL #POSITIVES)

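As a concrete instance of these two definitions, the sketch below computes FPR and TPR at a few thresholds. The labels and scores are made up for illustration, and `rates_at()` is a hypothetical helper, not part of the slides.

```r
# Made-up labels and scores, for illustration only.
labels <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
scores <- c(0.95, 0.80, 0.70, 0.65, 0.40, 0.30, 0.20, 0.05)

# rates_at() is a hypothetical helper: predict positive when score >= threshold.
rates_at <- function(threshold, labels, scores) {
  predicted <- scores >= threshold
  c(FPR = sum(predicted & !labels) / sum(!labels),
    TPR = sum(predicted &  labels) / sum(labels))
}

at_half <- rates_at(0.5, labels, scores)
print(at_half)
print(t(sapply(c(0.9, 0.5, 0.1), rates_at, labels = labels, scores = scores)))
```

Lowering the threshold predicts more positives, so both rates rise together; each threshold contributes one point of the ROC curve.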
ROC in R

    simple_roc <- function(labels, scores){
      # sort the labels from highest score to lowest
      labels <- labels[order(scores, decreasing=TRUE)]
      data.frame(TPR=cumsum(labels)/sum(labels),
                 FPR=cumsum(!labels)/sum(!labels),
                 labels)
    }
    temp <- simple_roc(auto$highmpg, fit_glm$fitted.values)
    plot(temp[,2:1], pch=20)

[Figure: ROC curve, TPR against FPR]

We can add nonlinear terms and interactions, too

    fit_glm2 <- glm(highmpg ~ poly(weight,5,raw=TRUE) + year,
                    family = binomial, data = auto)
    yr <- (fit_glm2$coefficients[1] +
           fit_glm2$coefficients[2:6] %*% t(cbind(w,w^2,w^3,w^4,w^5))) /
          (-fit_glm2$coefficients[7])
    lines(w, yr, col='magenta', lwd=3)

[Figure: year against weight with the fitted nonlinear decision boundary in magenta]

Feature design

The process of constructing transformations and interactions is commonly referred to as feature design.

It is more art than science, but can be really effective. I have heard it called the single most important part of applied predictive modeling.

It is also difficult and ad hoc.

Regression and classification trees

Figure 2: Regression and classification trees automate feature design.

Regression and classification trees

Figure 3: Regression and classification trees partition the feature space.

Regression and classification trees

Figure 4: They fit a piecewise constant response surface.

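The simplest piecewise constant fit is a single split: pick a cut point, then predict the mean response on each side. A minimal sketch, assuming made-up data and a hypothetical helper `best_split()`; tree software such as rpart applies the same idea greedily and recursively, with more machinery.

```r
# best_split() is a hypothetical helper for this sketch: it tries every
# midpoint between adjacent x values and keeps the cut with the smallest
# total squared error around the two side means.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(cut_point) {
    left  <- y[x <  cut_point]
    right <- y[x >= cut_point]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  cuts[which.min(sse)]
}

# Made-up data with a jump between x = 4 and x = 5.
x <- 1:8
y <- c(10, 11, 9, 10, 20, 21, 19, 20)

cut_point <- best_split(x, y)
fitted <- ifelse(x < cut_point, mean(y[x < cut_point]), mean(y[x >= cut_point]))
print(cut_point)
print(fitted)
```

Repeating the search inside each side of the split, and again inside the resulting pieces, produces the rectangular partition and step-shaped surface shown in the figures.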
Regression and classification trees

Figure 5: We grow the tree very deep, then prune it.

rpart()

    library(rpart); library(rpart.plot)
    auto <- auto[,-9] # get rid of names column
    fit_tree <- rpart(mpg ~ . - highmpg, data=auto)
    sqrt(mse(auto$mpg, predict(fit_tree)))

    [1] 2.922369

    rpart.plot(fit_tree)

[Figure: plot of the fitted regression tree, with the root split at displacement >= 190 and further splits on displacement, weight, and year]

Exercise

    fit_tree <- rpart(mpg~., data=auto_train)
    sqrt(mse(auto_test$mpg, predict(prune(fit_tree,cp=0),
                                    newdata = auto_test)))

    [1] 4.947657

    sqrt(mse(auto_train$mpg, predict(prune(fit_tree,cp=0),
                                     newdata = auto_train)))

    [1] 1.780435

Use the auto_test and auto_train subsets to try out different settings of the cp parameter. Try different splits of the data as well. Do the results change?

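One way to set up the cp experiment is to grow a deep tree once and prune it at several cp values, recording training and test error each time. This sketch assumes the built-in `mtcars` data in place of the auto data, a seeded split, and an arbitrary grid of cp values; none of these choices come from the slides.

```r
library(rpart)   # a recommended package that ships with most R installs

mse <- function(y, fhat) mean((y - fhat)^2)

# mtcars stands in for the auto data (assumption); the split is seeded.
set.seed(1)
testrows <- sample(seq_len(nrow(mtcars)), 8)
train <- mtcars[-testrows, ]
test  <- mtcars[testrows, ]

# Grow a deep tree once (cp = 0, small minsplit), then prune at several cp values.
deep_tree <- rpart(mpg ~ ., data = train,
                   control = rpart.control(cp = 0, minsplit = 4))
cps <- c(0, 0.01, 0.05, 0.1, 0.3)
results <- t(sapply(cps, function(cp) {
  pruned <- prune(deep_tree, cp = cp)
  c(cp = cp,
    train_rmse = sqrt(mse(train$mpg, predict(pruned, newdata = train))),
    test_rmse  = sqrt(mse(test$mpg,  predict(pruned, newdata = test))))
}))
print(round(results, 3))
```

Training error can only go up as cp increases, because larger cp values prune to smaller subtrees; test error need not follow the same pattern, which is the point of the exercise.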
Random forests and randomForest()

Regression trees are automated and interpretable. But sometimes they are not smooth enough.

Random forests is a method that combines many different tree fits to get a smoother response surface. It is harder to visualize, but often gives better predictions.

    library(randomForest)

    randomForest 4.6-12
    Type rfNews() to see new features/changes/bug fixes.

    auto_train <- auto_train[,-9] # get rid of names again
    fit_rf <- randomForest(mpg~., data=auto_train)
    sqrt(mse(auto_train$mpg, predict(fit_rf, newdata = auto_train)))

    [1] 1.393711

    sqrt(mse(auto_test$mpg, predict(fit_rf, newdata = auto_test)))

    [1] 2.60724

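To see why averaging trees gives a smoother surface, here is a minimal sketch of bagging: fit one tree per bootstrap resample and average the predictions. This is a simplified relative of random forests, not the randomForest algorithm itself (it omits the random selection of candidate features at each split), and it assumes the built-in `mtcars` data and arbitrary tuning values.

```r
library(rpart)

# mtcars stands in for the auto data (assumption). This is plain bagging,
# not the randomForest algorithm: there is no random feature subsetting.
set.seed(1)
n_trees <- 50
x_grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100))

# Each column holds one tree's predictions on the grid.
tree_preds <- sapply(seq_len(n_trees), function(b) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]   # bootstrap resample
  tree <- rpart(mpg ~ wt, data = boot,
                control = rpart.control(minsplit = 5, cp = 0.01))
  predict(tree, newdata = x_grid)
})

single_tree <- tree_preds[, 1]
bagged <- rowMeans(tree_preds)        # average over all trees

# Count distinct fitted levels: the averaged fit has many more, smaller steps.
print(c(single = length(unique(single_tree)), bagged = length(unique(bagged))))
```

Each tree places its jumps in slightly different locations, so the average replaces a few large steps with many small ones.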
Other models/algorithms

- boosted regression trees
- support vector machines
- neural networks (deep learning)
- Gaussian processes

There are R package implementations in many cases:

- svm() from the e1071 package
- nnet() from the nnet package
- gbm() from the gbm package
- etc.

These methods differ in the representation of the prediction function (response surface).

An aside about neural networks vs. tree methods

Recently, neural networks, specifically deep learning networks, have received a lot of press, setting new standards for classification tasks in speech and video domains.

It is still unclear what explains their huge success theoretically. In practice, however, the answer seems to be that they fit very complicated functions/surfaces. They seem to work well in low-noise settings with highly complicated decision boundaries.

For data with lots of unmeasured factors, leading to high measurement error, neural networks seem to work less well and tree methods seem to work better.

Just my two cents...

Concept and tool recap

Concepts

- in-sample vs. out-of-sample fits
- overfitting
- linear vs. nonlinear functions
- regression vs. classification
- feature design vs. non-parametric regression

Tools

- lm()
- optim()
- glm()
- rpart()
- randomForest()

Credits

I swiped several pictures from the ISLR book.