Statistical Prediction


P.R. Hahn
Fall 2017

Some terminology

The goal is to use data to find a pattern that we can exploit.

- y: response / outcome / dependent variable / left-hand side
- x: predictor / covariate / feature / independent variable
- f(x): regression / function estimation / prediction / forecasting / classification / curve fitting

The pattern is statistical in the sense that it holds only approximately.

Signal + noise

[Scatterplot of y (roughly 10 to 40) against x (roughly 1500 to 5000) showing a noisy decreasing trend.]

y = f(x) + ε

Minimize the average size of the noise

- Among a certain function class (such as lines or polynomials), minimize the average deviation from the curve.
- Squared distance is common (mean squared error).
- As we get more data, our predictions get better.

y = α + xβ + ε

f <- function(x, alpha, beta){
  fx <- alpha + beta*x
  return(fx)
}

MSE = (1/n) Σᵢ (yᵢ − f̂(xᵢ))²

mse <- function(y, fhat){
  return(mean((y - fhat)^2))
}
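As a quick sanity check (not part of the original slides), the two helper functions above can be exercised on data simulated from the signal-plus-noise model; the alpha, beta, and noise values below are arbitrary illustrations.

# Simulate y = alpha + x*beta + noise and check that mse() is near the noise variance
set.seed(1)
xsim <- runif(100, 1500, 5000)                      # predictor values on a weight-like scale
ysim <- f(xsim, alpha = 45, beta = -0.008) + rnorm(100, sd = 4)
mse(ysim, f(xsim, alpha = 45, beta = -0.008))       # roughly sd^2 = 16 at the true parameters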

Load the auto data

auto <- read.csv("auto.csv")
str(auto)

'data.frame': 397 obs. of 9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : Factor w/ 94 levels "?","100","102",..: 17 35 29 29 24 42 47 46 48 40 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

Next, we need to clean the data.

auto[auto == "?"] <- NA
auto$horsepower <- as.numeric(auto$horsepower)
auto <- auto[complete.cases(auto),]
auto$origin <- as.factor(auto$origin)
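A hedged side note, not from the slides: as.numeric() applied directly to a factor returns the underlying level codes rather than the recorded horsepower values. The slide's downstream output is reproduced as shown, but if you want the literal horsepower numbers, one alternative (my assumption, not the lecture's code) is to replace the as.numeric(auto$horsepower) line above with:

# Alternative conversion that recovers the recorded horsepower values (not the slide's code)
auto$horsepower <- as.numeric(as.character(auto$horsepower))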

Examine the correlations

names(auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"

cormat <- cor(auto[,-c(8,9)])
print(round(matrix(as.numeric(cormat),7,7),2))

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  1.00 -0.78 -0.81  0.45 -0.83  0.42  0.58
[2,] -0.78  1.00  0.95 -0.57  0.90 -0.50 -0.35
[3,] -0.81  0.95  1.00 -0.51  0.93 -0.54 -0.37
[4,]  0.45 -0.57 -0.51  1.00 -0.51  0.28  0.14
[5,] -0.83  0.90  0.93 -0.51  1.00 -0.42 -0.31
[6,]  0.42 -0.50 -0.54  0.28 -0.42  1.00  0.29
[7,]  0.58 -0.35 -0.37  0.14 -0.31  0.29  1.00

Examine the scatterplots

plot(auto[,c("mpg","cylinders","displacement","weight")])

[Pairs plot of mpg, cylinders, displacement, and weight.]

Write a function, optimize it

tempf <- function(parms){
  alpha <- parms[1]
  beta <- parms[2]
  y <- auto$mpg
  fhat <- f(auto$weight, alpha, beta)
  return(mse(y, fhat))
}

optim(c(0,0), tempf)

$par
[1] 46.213730485 -0.007646194

$value
[1] 18.67662

$counts
function gradient
     129       NA

$convergence
[1] 0

$message
NULL

Simple linear regression

We use the lm() function to predict mpg using weight.

fit <- lm(mpg ~ weight, data = auto)
summary(fit)

Call:
lm(formula = mpg ~ weight, data = auto)

Residuals:
     Min       1Q   Median       3Q      Max
-11.9736  -2.7556  -0.3358   2.1379  16.5194

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.216524   0.798673   57.87   <2e-16 ***
weight      -0.007647   0.000258  -29.64   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.333 on 390 degrees of freedom
Multiple R-squared: 0.6926, Adjusted R-squared: 0.6918
F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16

Now we plot the fit

plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
abline(a=fit$coefficients[1], b=fit$coefficients[2], col='red', lwd=3)

[Scatterplot of mpg against weight with the fitted regression line in red.]

Now it is your turn

Fit single-variable linear models using cylinders, displacement, and year. Which one looks best?
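One possible sketch for this exercise, not part of the original slides: fit each single-variable model and compare residual standard errors (smaller is better on this scale).

# Fit one simple regression per candidate predictor and print its residual standard error
for (v in c("cylinders", "displacement", "year")) {
  fit_v <- lm(as.formula(paste("mpg ~", v)), data = auto)
  cat(v, ": RSE =", round(summary(fit_v)$sigma, 3), "\n")
}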

Can the linear trend be improved?

Here we add nonlinear features within a linear model.

fit_nl <- lm(mpg ~ poly(weight,2,raw=TRUE), data=auto)
plot(auto$weight, auto$mpg, pch=20, main='', xlab='weight', ylab='mpg')
points(auto$weight, fit_nl$fitted.values, col='red', pch=20)

[Scatterplot of mpg against weight with the quadratic fitted values overlaid in red.]

The quadratic model is better

summary(fit_nl)

Call:
lm(formula = mpg ~ poly(weight, 2, raw = TRUE), data = auto)

Residuals:
     Min       1Q   Median       3Q      Max
-12.6246  -2.7134  -0.3485   1.8267  16.0866

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   6.226e+01  2.993e+00  20.800  < 2e-16 ***
poly(weight, 2, raw = TRUE)1 -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
poly(weight, 2, raw = TRUE)2  1.697e-06  3.059e-07   5.545 5.43e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.176 on 389 degrees of freedom
Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16

Now you: compare the degree 1, 2, and 3 polynomial models in terms of the residual standard error.
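A minimal sketch for this comparison, assuming the same auto data frame (not part of the slides):

# Residual standard error for polynomial degrees 1 through 3
for (d in 1:3) {
  fit_d <- lm(mpg ~ poly(weight, d, raw = TRUE), data = auto)
  cat("degree", d, ": RSE =", round(summary(fit_d)$sigma, 3), "\n")
}

The degree-1 and degree-2 numbers should reproduce the residual standard errors shown in the two summaries above.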

The predictive impact of a variable now depends on the level

Here we use the predict() function.

auto$weight[1]
[1] 3504

temp <- auto[1,]
temp$weight <- temp$weight + 200
predict(fit_nl, newdata = temp) - predict(fit_nl, newdata = auto[1,])
        1
-1.253354

temp <- auto[1,]
temp$weight <- temp$weight - 200
predict(fit_nl, newdata = auto[1,]) - predict(fit_nl, newdata = temp)
        1
-1.389079

What if more than one thing matters?

Figure 1: With multiple variables we are now fitting a (hyper)plane.

The lm() function handles this

fit_mlr <- lm(mpg ~ . - name - origin + as.factor(origin) + poly(weight,2) - weight, data=auto)
summary(fit_mlr)

Call:
lm(formula = mpg ~ . - name - origin + as.factor(origin) + poly(weight, 2) - weight, data = auto)

Residuals:
    Min      1Q  Median      3Q     Max
-8.9552 -1.6941  0.0041  1.7087 12.8701

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        -4.629e+01  4.239e+00 -10.919  < 2e-16 ***
cylinders          -1.059e-01  3.058e-01  -0.346 0.729387
displacement        1.250e-02  6.712e-03   1.862 0.063341 .
horsepower          6.799e-03  6.409e-03   1.061 0.289420
acceleration        1.724e-01  6.991e-02   2.467 0.014077 *
year                8.461e-01  4.614e-02  18.337  < 2e-16 ***
as.factor(origin)2  1.756e+00  5.154e-01   3.407 0.000727 ***
as.factor(origin)3  1.283e+00  5.080e-01   2.525 0.011959 *
poly(weight, 2)1   -1.159e+02  8.934e+00 -12.978  < 2e-16 ***
poly(weight, 2)2    2.964e+01  3.196e+00   9.273  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.986 on 382 degrees of freedom
Multiple R-squared: 0.857, Adjusted R-squared: 0.8536
F-statistic: 254.3 on 9 and 382 DF, p-value: < 2.2e-16

Actual-vs-predicted plots

plot(auto$mpg, fit_mlr$fitted.values, pch=20, main='',
     xlab='mpg', ylab='predicted mpg')
abline(0, 1, col='red')

[Scatterplot of predicted mpg against actual mpg, with the 45-degree line in red.]

Interactions

fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data=auto)
round(coef(summary(fit_mlr2)), 4)

                           Estimate Std. Error  t value Pr(>|t|)
(Intercept)                 46.3735    17.1780   2.6996   0.0073
poly(weight, 2)1          -113.9304     4.5616 -24.9758   0.0000
poly(weight, 2)2            31.4737     3.7640   8.3617   0.0000
origin2                    -32.2517     9.3921  -3.4339   0.0007
origin3                    -18.8180     8.7171  -2.1588   0.0315
year                        -0.2936     0.2300  -1.2769   0.2024
acceleration                -5.1336     1.1147  -4.6053   0.0000
poly(weight, 2)1:origin2   -83.3332    29.7957  -2.7968   0.0054
poly(weight, 2)2:origin2   -61.8395    19.2514  -3.2122   0.0014
poly(weight, 2)1:origin3    69.2630    83.0722   0.8338   0.4049
poly(weight, 2)2:origin3    25.2491    36.1687   0.6981   0.4856
year:acceleration            0.0666     0.0148   4.4860   0.0000
origin2:year                 0.2497     0.1196   2.0884   0.0374
origin3:year                 0.2195     0.1007   2.1783   0.0300
origin2:acceleration         0.7126     0.1361   5.2356   0.0000
origin3:acceleration         0.3112     0.2080   1.4958   0.1356

summary(fit_mlr2)$sigma
[1] 2.711726

summary(fit_mlr2)$adj.r.squared
[1] 0.8792895

Training and test (validation) sets

n <- nrow(auto)
ntest <- floor(0.2*n)
ntrain <- n - ntest
testrows <- sample(1:n, ntest, replace = FALSE)
auto_test <- auto[testrows,]
auto_train <- auto[-testrows,]

fit_mlr2 <- lm(mpg ~ poly(weight,2)*origin + (year + acceleration + origin)^2, data = auto_train)
summary(fit_mlr2)$sigma
[1] 2.707457

yhat <- predict(fit_mlr2, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
[1] 2.898672

fit_mlr <- lm(mpg ~ poly(weight,2) + year, data=auto_train)
summary(fit_mlr)$sigma
[1] 3.0546

yhat <- predict(fit_mlr, newdata = auto_test)
sqrt(mse(auto_test$mpg, yhat))
[1] 2.946425
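Because testrows is drawn at random, the test RMSE is itself a noisy estimate. A small sketch, not in the original slides, repeats the split a few times to see how much the number moves (set.seed() is added only for reproducibility):

# Repeat the random train/test split to gauge variability in the test RMSE
set.seed(42)
rmses <- replicate(5, {
  rows <- sample(1:n, ntest)
  fit_split <- lm(mpg ~ poly(weight, 2) + year, data = auto[-rows, ])
  sqrt(mse(auto$mpg[rows], predict(fit_split, newdata = auto[rows, ])))
})
round(rmses, 2)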

Classification

auto$highmpg <- auto$mpg > as.numeric(quantile(auto$mpg, 0.25))
plot(auto$weight, auto$highmpg, main='', xlab='weight', ylab='high MPG')
a <- tapply(auto$weight, cut(auto$weight, seq(1500,5000,by=500)), mean)
b <- tapply(auto$highmpg, cut(auto$weight, seq(1500,5000,by=500)), mean)
points(a, b, col='red', pch=20, cex=2)

[Plot of the 0/1 high-MPG indicator against weight, with binned proportions overlaid in red.]

Decision boundary

plot(auto$weight[auto$highmpg], auto$year[auto$highmpg], pch=20,
     col='red', xlim=range(auto$weight), ylim = range(auto$year),
     xlab = "weight", ylab = "year")
points(auto$weight[!auto$highmpg], auto$year[!auto$highmpg], pch=20,
       col='cyan')

[Scatterplot of year against weight, with high-MPG cars in red and the rest in cyan.]

Logistic regression

These three expressions are all the same:

Pr(y = 1 | x) = exp(α + xβ) / (1 + exp(α + xβ))

Pr(y = 1 | x) = 1 / (1 + exp(−(α + xβ)))

Pr(y = 1 | x) = (1 + exp(−(α + xβ)))⁻¹

A model like this is a type of generalized linear model (GLM).
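A quick numerical check, not from the slides, that the three expressions agree; it also uses base R's plogis(), which computes the same inverse logit 1/(1 + exp(-q)). The parameter and x values are arbitrary.

# All three forms of the logistic probability give the same number
a0 <- -2; b1 <- 0.5; x0 <- 1.3   # illustrative values for alpha, beta, x
c(exp(a0 + x0*b1) / (1 + exp(a0 + x0*b1)),
  1 / (1 + exp(-(a0 + x0*b1))),
  plogis(a0 + x0*b1))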

Fitting a logistic regression

We use the glm() function.

fit_glm <- glm(highmpg ~ weight + year, family = binomial, data = auto)
summary(fit_glm)

Call:
glm(formula = highmpg ~ weight + year, family = binomial, data = auto)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.89256  -0.00817   0.01836   0.09080   2.44179

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.579e+01  6.610e+00  -2.388   0.0169 *
weight      -5.699e-03  7.941e-04  -7.177 7.15e-13 ***
year         4.852e-01  1.003e-01   4.839 1.31e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 443.05 on 391 degrees of freedom
Residual deviance: 100.04 on 389 degrees of freedom
AIC: 106.04

Number of Fisher Scoring iterations: 8
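To turn the fitted coefficients into a probability for a particular car, predict() with type = "response" applies the inverse logit; a brief illustration (the choice of the first row is arbitrary):

# Predicted probability that the first car in the data set has high MPG
predict(fit_glm, newdata = auto[1, ], type = "response")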

Plot the fitted boundary

[Year-vs-weight scatterplot from before, with the fitted decision boundary added in magenta.]

Let's do some algebra: the boundary is where the linear predictor equals zero (fitted probability 1/2), so solving α + β_weight·weight + β_year·year = 0 for year gives a line in the (weight, year) plane.

w <- seq(1500, 5000)
yr <- (fit_glm$coefficients[1] + fit_glm$coefficients[2]*w) /
      (-fit_glm$coefficients[3])
lines(w, yr, col='magenta', lwd=3)

ROC curves: How sharp is the decision boundary?

The receiver operating characteristic (ROC) curve plots the false positive rate (FPR) against the true positive rate (TPR):

FPR = #(incorrectly predicted positives) / (total # negatives)
TPR = #(correctly predicted positives) / (total # positives)
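Before tracing out the whole curve, the two rates can be computed at a single probability threshold; a minimal sketch, assuming a 0.5 cutoff on the fitted probabilities from fit_glm:

# FPR and TPR at one threshold (0.5)
pred  <- fit_glm$fitted.values > 0.5
truth <- auto$highmpg
c(FPR = sum(pred & !truth) / sum(!truth),
  TPR = sum(pred &  truth) / sum(truth))

Sweeping the threshold from 1 down to 0 traces out the ROC curve computed on the next slide.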

ROC in R

simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  data.frame(TPR=cumsum(labels)/sum(labels),
             FPR=cumsum(!labels)/sum(!labels),
             labels)
}

temp <- simple_roc(auto$highmpg, fit_glm$fitted.values)
plot(temp[,2:1], pch=20)

[ROC curve: TPR against FPR, both running from 0 to 1.]

We can add nonlinear terms and interactions, too

[Year-vs-weight scatterplot with the degree-5 polynomial decision boundary drawn in magenta.]

fit_glm2 <- glm(highmpg ~ poly(weight,5,raw=TRUE) + year,
                family = binomial, data = auto)
yr <- (fit_glm2$coefficients[1] +
       fit_glm2$coefficients[2:6] %*% t(cbind(w, w^2, w^3, w^4, w^5))) /
      (-fit_glm2$coefficients[7])
lines(w, yr, col='magenta', lwd=3)

Feature design

The process of constructing transformations and interactions is commonly referred to as feature design. It is more art than science, but it can be really effective. I have heard it called the single most important part of applied predictive modeling. It is also difficult and ad hoc.

Regression and classification trees

Figure 2: Regression and classification trees automate feature design.

Regression and classification trees

Figure 3: Regression and classification trees partition the feature space.

Regression and classification trees

Figure 4: They fit a piecewise constant response surface.

Regression and classification trees

Figure 5: We grow the tree very deep, then prune it.

rpart()

library(rpart); library(rpart.plot)
auto <- auto[,-9]  # get rid of the name column
fit_tree <- rpart(mpg ~ . - highmpg, data=auto)
sqrt(mse(auto$mpg, predict(fit_tree)))
[1] 2.922369

rpart.plot(fit_tree)

[Tree diagram: the first split is on displacement >= 190, with further splits on displacement, weight, and year; leaf predictions range from about 14 to 36 mpg.]

Exercise

fit_tree <- rpart(mpg ~ ., data=auto_train)
sqrt(mse(auto_test$mpg, predict(prune(fit_tree, cp=0), newdata = auto_test)))
[1] 4.947657

sqrt(mse(auto_train$mpg, predict(prune(fit_tree, cp=0), newdata = auto_train)))
[1] 1.780435

Use the auto_test and auto_train subsets to try out different settings of the cp parameter. Try different splits of the data as well. Do the results change?
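One way to sketch this exercise, not from the original slides: loop over a few cp values and compare train and test RMSE.

# Compare train/test RMSE across several complexity-parameter settings
for (cp_val in c(0, 0.001, 0.01, 0.05)) {
  pruned <- prune(fit_tree, cp = cp_val)
  cat("cp =", cp_val,
      " train RMSE:", round(sqrt(mse(auto_train$mpg, predict(pruned, newdata = auto_train))), 2),
      " test RMSE:",  round(sqrt(mse(auto_test$mpg,  predict(pruned, newdata = auto_test))), 2), "\n")
}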

Random forests and randomForest()

Regression trees are automated and interpretable, but sometimes they are not smooth enough. A random forest combines many different tree fits to get a smoother response surface. It is harder to visualize, but it often gives better predictions.

library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.

auto_train <- auto_train[,-9]  # get rid of the name column again
auto_train <- auto_train[,-9]
fit_rf <- randomForest(mpg ~ ., data=auto_train)
sqrt(mse(auto_train$mpg, predict(fit_rf, newdata = auto_train)))
[1] 1.393711

sqrt(mse(auto_test$mpg, predict(fit_rf, newdata = auto_test)))
[1] 2.60724
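To get a rough sense of which predictors the forest relies on, the randomForest package's importance() and varImpPlot() functions can be applied to the fitted object; a brief sketch:

# Variable importance measures for the fitted forest
importance(fit_rf)
varImpPlot(fit_rf)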

Other models/algorithms

- Boosted regression trees
- Support vector machines
- Neural networks (deep learning)
- Gaussian processes

There are R package implementations in many cases:

- svm() from the e1071 package
- nnet() from the nnet package
- gbm() from the gbm package
- etc.

These methods differ in the representation of the prediction function (response surface).
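As a hedged illustration that the same train/test workflow carries over, here is a sketch fitting a support vector regression with svm() from e1071 on the same split; this is not part of the original slides, default tuning parameters are used, and the e1071 package is assumed to be installed.

# Support vector regression on the training set, evaluated on the test set
library(e1071)
fit_svm <- svm(mpg ~ ., data = auto_train)
sqrt(mse(auto_test$mpg, predict(fit_svm, newdata = auto_test)))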

An aside about neural networks vs. tree methods

Recently neural networks, specifically deep learning networks, have received a lot of press, setting new standards for classification tasks in speech and video domains. It is still unclear what explains their huge success theoretically. In practice, however, the answer seems to be that they fit very complicated functions/surfaces. They seem to work well in low-noise settings with highly complicated decision boundaries. For data with lots of unmeasured factors, leading to high measurement error, neural networks seem to work less well and tree methods seem to work better. Just my two cents...

Concept and tool recap

Concepts:
- in-sample vs. out-of-sample fits
- overfitting
- linear vs. nonlinear functions
- regression vs. classification
- feature design vs. non-parametric regression

Tools:
- lm()
- optim()
- glm()
- rpart()
- randomForest()

Credits

I swiped several pictures from the ISLR book.