Statistical Prediction


1 Statistical Prediction
P.R. Hahn
Fall

2 Some terminology
The goal is to use data to find a pattern that we can exploit.
y: response / outcome / dependent / left-hand-side variable
x: predictor / covariate / feature / independent variable
f(x): regression / function estimation / prediction / forecasting / classification / curve fitting
The pattern is statistical in the sense that it holds only approximately.

3 Signal + noise
(Figure: scatterplot of y against x.)
y = f(x) + ε

4 Minimize the average size of the noise
Among a certain function class (such as lines or polynomials), minimize the average deviation from the curve. Squared distance is common (mean squared error). As we get more data, our predictions get better.
y = α + xβ + ε
f <- function(x, alpha, beta){
  fx <- alpha + beta*x
  return(fx)
}
MSE = (1/n) Σ_i (y_i − f̂(x_i))^2
mse <- function(y, fhat){
  return(mean((y - fhat)^2))
}
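A quick simulation (a sketch, not from the original slides; all names here are made up) illustrating the claim that predictions improve with more data: fit a line by least squares on growing samples and watch the out-of-sample root MSE shrink toward the noise level.
set.seed(1)
true_f <- function(x) 2 + 0.5*x              # the unknown f(x)
xtest <- runif(1000)
ytest <- true_f(xtest) + rnorm(1000)
for (n in c(20, 200, 2000)) {
  x <- runif(n)
  y <- true_f(x) + rnorm(n)                  # y = alpha + x*beta + noise
  fit <- lm(y ~ x)
  yhat <- predict(fit, newdata = data.frame(x = xtest))
  cat("n =", n, " test RMSE =", round(sqrt(mse(ytest, yhat)), 3), "\n")
}
The test RMSE approaches 1, the standard deviation of the simulated noise, as n grows.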

5 Load the auto data
auto <- read.csv("auto.csv")
str(auto)
'data.frame': 397 obs. of 9 variables:
 $ mpg         : num ...
 $ cylinders   : int ...
 $ displacement: num ...
 $ horsepower  : Factor w/ 94 levels "?","100","102",..: ...
 $ weight      : int ...
 $ acceleration: num ...
 $ year        : int ...
 $ origin      : int ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: ...
(The printed values were lost in transcription.)
Next, we need to clean the data.
auto[auto == "?"] <- NA
auto$horsepower <- as.numeric(as.character(auto$horsepower))  # as.character() first; as.numeric() on a factor returns level codes
auto <- auto[complete.cases(auto),]
auto$origin <- as.factor(auto$origin)
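A quick sanity check on the cleaning (a sketch, not in the original slides): after complete.cases(), no NAs should remain and 392 of the 397 rows survive.
sum(is.na(auto))        # 0 after dropping incomplete rows
nrow(auto)              # 392; five rows had horsepower == "?"
class(auto$horsepower)  # "numeric" after the conversion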

6 Examine the correlations
names(auto)
[1] "mpg" "cylinders" "displacement" "horsepower"
[5] "weight" "acceleration" "year" "origin"
[9] "name"
cormat <- cor(auto[,-c(8,9)])
print(round(matrix(as.numeric(cormat),7,7),2))
(The entries of the 7 x 7 correlation matrix were lost in transcription.)

7 Examine the scatterplots
plot(auto[,c("mpg","cylinders","displacement","weight")])
(Figure: pairwise scatterplot matrix of mpg, cylinders, displacement, and weight.)

8 Write a function, optimize it
tempf <- function(parms){
  alpha <- parms[1]
  beta <- parms[2]
  y <- auto$mpg
  fhat <- f(auto$weight, alpha, beta)
  return(mse(y, fhat))
}
optim(c(0,0), tempf)
(The $par and $value entries were lost in transcription; $counts reports 129 function evaluations and no gradient evaluations, and $convergence is 0, meaning the optimizer converged.)

9 Simple linear regression
We use the lm() function to predict mpg using weight.
fit <- lm(mpg ~ weight, data = auto)
summary(fit)
Call:
lm(formula = mpg ~ weight, data = auto)
(The numeric entries of the summary were lost in transcription. What survives: both the intercept and the weight coefficient are significant at p < 2e-16, the residual standard error sits on 390 degrees of freedom, and the F-statistic on 1 and 390 DF has p-value < 2.2e-16.)

10 Now we plot the fit
plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
abline(a=fit$coefficients[1], b=fit$coefficients[2], col='red', lwd=3)
(Figure: mpg against weight with the fitted line in red.)

11 Now it is your turn
Fit single-variable linear models using cylinders, displacement, and year. Which one looks best? One approach is sketched below.
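A sketch of one way to attack this exercise (here "best" is judged by R-squared, but you could also compare residual standard errors or plots):
for (v in c("cylinders", "displacement", "year")) {
  fit_v <- lm(reformulate(v, response = "mpg"), data = auto)
  cat(v, ": R^2 =", round(summary(fit_v)$r.squared, 3), "\n")
}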

12 Can the linear trend be improved?
Here we add nonlinear features within a linear model.
fit_nl <- lm(mpg ~ poly(weight, 2, raw = TRUE), data = auto)
plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
points(auto$weight, fit_nl$fitted.values, col='red', pch=20)
(Figure: mpg against weight with the quadratic fit overlaid in red.)

13 The quadratic model is better
summary(fit_nl)
Call:
lm(formula = mpg ~ poly(weight, 2, raw = TRUE), data = auto)
(Most numeric entries were lost in transcription. What survives: both polynomial terms are highly significant, the linear term at p < 2e-16 and the quadratic term at p on the order of 1e-08; the residual standard error sits on 389 degrees of freedom, and the F-statistic on 2 and 389 DF has p-value < 2.2e-16.)
Now you: compare the degree-1, degree-2, and degree-3 polynomial models in terms of the residual standard error.
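A sketch of the comparison asked for above, looping over the polynomial degree and printing the residual standard error (smaller means a better in-sample fit, but beware overfitting as the degree grows):
for (d in 1:3) {
  fit_d <- lm(mpg ~ poly(weight, d, raw = TRUE), data = auto)
  cat("degree", d, ": sigma =", round(summary(fit_d)$sigma, 3), "\n")
}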

14 The predictive impact of a variable now depends on the level
Here we use the predict() function.
auto$weight[1]
[1] 3504
temp <- auto[1,]
temp$weight <- temp$weight + 100  # the increment was lost in transcription; 100 is an assumed value
predict(fit_nl, newdata = temp) - predict(fit_nl, newdata = auto[1,])
temp <- auto[1,]
temp$weight <- temp$weight - 100  # likewise assumed
predict(fit_nl, newdata = auto[1,]) - predict(fit_nl, newdata = temp)
(The printed differences were lost in transcription; under the quadratic model they are not equal, which is the point of the slide.)

15 What if more than one thing matters?
Figure 1: With multiple variables we are now fitting a (hyper)plane.

16 The lm() function handles this.
fit_mlr <- lm(mpg ~ . - name - origin + as.factor(origin) + poly(weight,2) - weight, data = auto)
summary(fit_mlr)
Call:
lm(formula = mpg ~ . - name - origin + as.factor(origin) + poly(weight, 2) - weight, data = auto)
(Most numeric entries of the coefficient table were lost in transcription. What survives: year and both poly(weight, 2) terms are significant at p < 2e-16; acceleration and the two origin dummies are also significant; cylinders, displacement, and horsepower are not. The residual standard error sits on 382 degrees of freedom, Multiple R-squared is 0.857, and the F-statistic on 9 and 382 DF has p-value < 2.2e-16.)

17 Actual-vs-predicted plots
plot(auto$mpg, fit_mlr$fitted.values, pch=20, main='',
     xlab='mpg', ylab='predicted mpg')
abline(0, 1, col='red')
(Figure: predicted mpg against actual mpg, with the 45-degree line in red.)

18 Interactions
fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = auto)
round(coef(summary(fit_mlr2)), 4)
(The coefficient values were lost in transcription. The terms in the table: the two poly(weight, 2) columns, the origin dummies, year, acceleration, the weight-polynomial-by-origin interactions, year:acceleration, origin:year, and origin:acceleration.)
summary(fit_mlr2)$sigma
summary(fit_mlr2)$adj.r.squared
(Both printed values were lost in transcription.)

19 Training and test (validation) sets
n <- nrow(auto)
ntest <- floor(0.2*n)
ntrain <- n - ntest
testrows <- sample(1:n, ntest, replace = FALSE)
auto_test <- auto[testrows,]
auto_train <- auto[-testrows,]
fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = auto_train)
summary(fit_mlr2)$sigma
yhat <- predict(fit_mlr2, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
fit_mlr <- lm(mpg ~ poly(weight,2) + year, data = auto_train)
summary(fit_mlr)$sigma
yhat <- predict(fit_mlr, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
(The printed error values were lost in transcription; the comparison contrasts in-sample and out-of-sample error for a big model and a small one.)
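Because the test set is a single random draw, the comparison can flip from split to split. A sketch (not in the original slides) that repeats the split several times and averages the test RMSE:
set.seed(42)
rmse_reps <- replicate(20, {
  testrows <- sample(1:n, ntest)
  fit <- lm(mpg ~ poly(weight, 2) + year, data = auto[-testrows, ])
  sqrt(mse(auto$mpg[testrows], predict(fit, newdata = auto[testrows, ])))
})
mean(rmse_reps)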

20 Classification
auto$highmpg <- auto$mpg > as.numeric(quantile(auto$mpg, 0.25))
plot(auto$weight, auto$highmpg, main='', xlab='weight', ylab='high MPG')
a <- tapply(auto$weight, cut(auto$weight, seq(1500,5000,by=500)), mean)
b <- tapply(auto$highmpg, cut(auto$weight, seq(1500,5000,by=500)), mean)
points(a, b, col='red', pch=20, cex=2)
(Figure: the 0/1 outcome against weight, with binned proportions in red.)

21 Decision boundary
plot(auto$weight[auto$highmpg], auto$year[auto$highmpg], pch=20,
     col='red', xlim=range(auto$weight), ylim=range(auto$year),
     xlab="weight", ylab="year")
points(auto$weight[!auto$highmpg], auto$year[!auto$highmpg], pch=20,
       col='cyan')
(Figure: year against weight, high-MPG cars in red, the rest in cyan.)

22 Logistic regression
These three expressions are all the same:
Pr(y = 1 | x) = exp(α + xβ) / (1 + exp(α + xβ))
Pr(y = 1 | x) = 1 / (1 + exp(−(α + xβ)))
Pr(y = 1 | x) = (1 + exp(−(α + xβ)))^(−1)
A model like this is a type of generalized linear model (GLM).
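A quick numerical check (a sketch with made-up coefficient values) that the three expressions agree, and that they match R's built-in logistic CDF plogis():
alpha <- -1; beta <- 0.5; x <- 2
p1 <- exp(alpha + x*beta) / (1 + exp(alpha + x*beta))
p2 <- 1 / (1 + exp(-(alpha + x*beta)))
p3 <- (1 + exp(-(alpha + x*beta)))^(-1)
c(p1, p2, p3, plogis(alpha + x*beta))  # all four are identical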

23 Fitting a logistic regression
We use the glm() function.
fit_glm <- glm(highmpg ~ weight + year, family = binomial, data = auto)
summary(fit_glm)
Call:
glm(formula = highmpg ~ weight + year, family = binomial, data = auto)
(Most numeric entries were lost in transcription. What survives: weight is significant with p on the order of 1e-13, year with p on the order of 1e-06, and the intercept at the 0.05 level; the null deviance has 391 degrees of freedom and the residual deviance 389; the dispersion parameter for the binomial family is taken to be 1; 8 Fisher scoring iterations.)

24 Plot the fitted boundary
Let's do some algebra. The boundary is where the two classes are equally likely, i.e. where the log-odds are zero: α + β_weight·weight + β_year·year = 0, so year = −(α + β_weight·weight) / β_year.
w <- seq(1500, 5000)
yr <- (fit_glm$coefficients[1] + fit_glm$coefficients[2]*w) /
      (-fit_glm$coefficients[3])
lines(w, yr, col='magenta', lwd=3)
(Figure: the weight-year scatterplot from slide 21 with the fitted boundary in magenta.)

25 ROC curves: How sharp is the decision boundary?
The receiver operating characteristic (ROC) curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
FPR = #(incorrectly predicted positives) / (total # negatives)
TPR = #(correctly predicted positives) / (total # positives)
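At any single threshold these two rates come from a 2 x 2 confusion table; a sketch (assuming the fit_glm model from the previous slide and a 0.5 cutoff):
pred  <- fit_glm$fitted.values > 0.5
truth <- auto$highmpg
FPR <- sum(pred & !truth) / sum(!truth)  # false positives over all negatives
TPR <- sum(pred &  truth) / sum( truth)  # true positives over all positives
c(FPR = FPR, TPR = TPR)
The ROC curve is traced out by sweeping the cutoff from 1 down to 0.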

26 ROC in R
simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing = TRUE)]
  data.frame(TPR = cumsum(labels)/sum(labels),
             FPR = cumsum(!labels)/sum(!labels),
             labels)
}
temp <- simple_roc(auto$highmpg, fit_glm$fitted.values)
plot(temp[,2:1], pch=20)
(Figure: the ROC curve, TPR against FPR.)

27 We can add nonlinear terms and interactions, too
fit_glm2 <- glm(highmpg ~ poly(weight, 5, raw = TRUE) + year,
                family = binomial, data = auto)
yr <- (fit_glm2$coefficients[1] +
       fit_glm2$coefficients[2:6] %*% t(cbind(w, w^2, w^3, w^4, w^5))) /
      (-fit_glm2$coefficients[7])
lines(w, yr, col='magenta', lwd=3)
(Figure: the weight-year scatterplot with the degree-5 boundary in magenta.)

28 Feature design
The process of constructing transformations and interactions is commonly referred to as feature design. It is more art than science, but it can be really effective; I have heard it called the single most important part of applied predictive modeling. It is also difficult and ad hoc.

29 Regression and classification trees
Figure 2: Regression and classification trees automate feature design.

30 Regression and classification trees
Figure 3: Regression and classification trees partition the feature space.

31 Regression and classification trees
Figure 4: They fit a piecewise constant response surface.

32 Regression and classification trees
Figure 5: We grow the tree very deep, then prune it.

33 rpart()
library(rpart); library(rpart.plot)
auto <- auto[,-9]  # get rid of the name column
fit_tree <- rpart(mpg ~ . - highmpg, data = auto)
sqrt(mse(auto$mpg, predict(fit_tree)))
(The printed RMSE was lost in transcription.)
rpart.plot(fit_tree)
(Figure: the fitted tree; the top split is on displacement >= 190, with further splits on weight and year.)

34 Exercise
fit_tree <- rpart(mpg ~ ., data = auto_train)
sqrt(mse(auto_test$mpg, predict(prune(fit_tree, cp=0), newdata = auto_test)))
sqrt(mse(auto_train$mpg, predict(prune(fit_tree, cp=0), newdata = auto_train)))
(The printed error values were lost in transcription.)
Use the auto_test and auto_train subsets to try out different settings of the cp parameter. Try different splits of the data as well. Do the results change? One way to set this up is sketched below.
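A sketch of the experiment, sweeping cp over a grid and recomputing the test error for each pruned tree:
for (cp in c(0, 0.001, 0.01, 0.05, 0.1)) {
  pruned <- prune(fit_tree, cp = cp)
  rmse <- sqrt(mse(auto_test$mpg, predict(pruned, newdata = auto_test)))
  cat("cp =", cp, ": test RMSE =", round(rmse, 2), "\n")
}
To vary the split as well, wrap this in a loop that redraws testrows as on slide 19.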

35 Random forests and randomForest()
Regression trees are automated and interpretable, but sometimes they are not smooth enough. A random forest combines a bunch of different tree fits to get a smoother response surface. It is harder to visualize, but often gives better predictions.
library(randomForest)
randomForest (version number lost in transcription)
Type rfnews() to see new features/changes/bug fixes.
auto_train <- auto_train[,-9]  # get rid of the name column again
fit_rf <- randomForest(mpg ~ ., data = auto_train)
sqrt(mse(auto_train$mpg, predict(fit_rf, newdata = auto_train)))
sqrt(mse(auto_test$mpg, predict(fit_rf, newdata = auto_test)))
(The printed error values were lost in transcription.)

36 Other models/algorithms
boosted regression trees
support vector machines
neural networks (deep learning)
Gaussian processes
There are R package implementations in many cases:
svm() from the e1071 package
nnet() from the nnet package
gbm() from the gbm package
etc.
These methods differ in the representation of the prediction function (response surface). A taste of one interface follows.
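A minimal sketch (not from the original slides) fitting one of these, svm() from the e1071 package, on the same task with default settings:
library(e1071)
fit_svm <- svm(mpg ~ ., data = auto_train)  # defaults to eps-regression for a numeric response
sqrt(mse(auto_test$mpg, predict(fit_svm, newdata = auto_test)))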

37 An aside about neural networks vs. tree methods
Recently, neural networks, specifically deep learning networks, have received a lot of press, setting new standards for classification tasks in speech and video domains. It is still theoretically unclear what explains their huge success. In practice, the answer seems to be that they fit very complicated functions/surfaces. They seem to work well in low-noise settings with highly complicated decision boundaries. For data with lots of unmeasured factors, leading to high measurement error, neural networks seem to work less well and tree methods seem to work better. Just my two cents...

38 Concept and tool recap
Concepts:
in-sample vs. out-of-sample fits
overfitting
linear vs. nonlinear functions
regression vs. classification
feature design vs. non-parametric regression
Tools:
lm()
optim()
glm()
rpart()
randomForest()

39 Credits
I swiped several pictures from the ISLR book.
