Multiple Regression: Mixed Predictor Types
Tim Frasier
Copyright Tim Frasier. This work is licensed under the Creative Commons Attribution 4.0 International license.
The Data
Data
Fuel economy data from 1999 and 2008 for 38 popular models of car*
I know, I know. It's neither biological nor that interesting, but it is hard to find good example data sets for this.
* As distributed with the ggplot2 package; original data from the EPA (http://fueleconomy.gov)
Data
Data
Predicted variable: hwy

Data
Two categorical predictors: manufacturer, class

Data
Two metric predictors*: displ, cyl
* I realize that cylinders is not really a metric variable, but we will treat it like one here for demonstration purposes
Data
Read the data into R and parse out just the fields in which we are interested

cardata <- read.table("mpg.csv", header = TRUE, sep = ",",
                      stringsAsFactors = TRUE)  # needed in R >= 4.0, where strings are no longer factors by default
carsub <- cardata[, c(2, 4, 6, 10, 12)]
Data
Use the summary function to get a feel for it

summary(carsub)

 manufacturer     displ            cyl             hwy             class
 dodge     :37   Min.   :1.600   Min.   :4.000   Min.   :12.00   2seater   : 5
 toyota    :34   1st Qu.:2.400   1st Qu.:4.000   1st Qu.:18.00   compact   :47
 volkswagen:27   Median :3.300   Median :6.000   Median :24.00   midsize   :41
 ford      :25   Mean   :3.472   Mean   :5.889   Mean   :23.44   minivan   :11
 chevrolet :19   3rd Qu.:4.600   3rd Qu.:8.000   3rd Qu.:27.00   pickup    :33
 audi      :18   Max.   :7.000   Max.   :8.000   Max.   :44.00   subcompact:35
 (Other)   :74                                                   suv       :62
Data
Plot the data to get a feel for it
But keep in mind these can be misleading!!!

pairs(carsub, pch = 16, col = rgb(0, 0, 1, 0.5))
Data
[Pairs plot of manufacturer, displ, cyl, hwy, and class]

Data
Positive relationship between engine displacement and the number of cylinders (makes sense)

Data
Negative relationship between engine displacement & highway mpg

Data
Negative relationship between number of cylinders & highway mpg

Data
Mostly positive relationship between engine displacement & vehicle class

Data
Some interesting patterns of relationships between class and highway mpg

Data
Some interesting patterns of relationships between manufacturer and highway mpg
Frequentist Approach
Frequentist Approach Mixed predictors can be analyzed with the lm function cartest <- lm(hwy ~ manufacturer + displ + cyl + class, data = carsub)
Frequentist Approach
summary(cartest)

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            36.65662    2.11270  17.351  < 2e-16 ***
manufacturerchevrolet   1.65228    1.15766   1.427 0.154984
manufacturerdodge       0.68563    1.11661   0.614 0.539857
manufacturerford        0.64843    1.04851   0.618 0.536962
manufacturerhonda       4.11170    1.32135   3.112 0.002117 **
manufacturerhyundai    -0.13343    1.06491  -0.125 0.900410
manufacturerjeep        0.84860    1.30195   0.652 0.515242
manufacturerland rover  0.54583    1.59905   0.341 0.733181
manufacturerlincoln     1.61904    1.81376   0.893 0.373067
manufacturermercury     0.81057    1.58446   0.512 0.609484
manufacturernissan      0.78152    1.04274   0.749 0.454401
manufacturerpontiac     2.16964    1.47980   1.466 0.144092
manufacturersubaru      0.08387    1.08103   0.078 0.938236
manufacturertoyota      1.41222    0.83692   1.687 0.093003 .
manufacturervolkswagen  1.81232    0.82522   2.196 0.029169 *
displ                  -0.52109    0.53766  -0.969 0.333562
cyl                    -1.28737    0.35135  -3.664 0.000314 ***
classcompact           -2.17130    1.66099  -1.307 0.192557
classmidsize           -2.12355    1.59408  -1.332 0.184250
classminivan           -5.72148    1.90221  -3.008 0.002951 **
classpickup            -9.25680    1.58293  -5.848 1.88e-08 ***
classsubcompact        -2.17163    1.65154  -1.315 0.189966
classsuv               -8.16278    1.43190  -5.701 3.99e-08 ***
---
Residual standard error: 2.583 on 211 degrees of freedom
Multiple R-squared: 0.8296,  Adjusted R-squared: 0.8118
F-statistic: 46.68 on 22 and 211 DF,  p-value: < 2.2e-16
Frequentist Approach
The intercept is the predicted value including the effect of being an audi (the reference level); all other manufacturer effects are differences from this reference.

Frequentist Approach
Manufacturer does not have a big impact, but a small one.

Frequentist Approach
Engine displacement has a negative, but not significant, effect.

Frequentist Approach
Cylinder number has a significant negative effect.

Frequentist Approach
The class category seems important.
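A side note, as a hedged sketch (not in the original slides): lm() uses whichever factor level sorts first alphabetically ("audi", "2seater") as the reference. If a different baseline is easier to interpret, the factor can be releveled before fitting; "toyota" below is an arbitrary illustrative choice.

```r
# Make toyota the reference level instead of audi, then refit;
# the other manufacturer coefficients become differences from toyota
carsub$manufacturer <- relevel(carsub$manufacturer, ref = "toyota")
cartest2 <- lm(hwy ~ manufacturer + displ + cyl + class, data = carsub)
summary(cartest2)
```

The fitted values and overall fit statistics are unchanged; only the parameterization of the manufacturer effects differs.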
Bayesian Approach
Load Libraries & Functions
library(runjags)
library(coda)
source("plotpost.r")
Organize the Data
#--- The y data ---#
y = carsub$hwy
N = length(y)
ymean = mean(y)
ysd = sd(y)
zy = (y - ymean) / ysd

Organize the Data
#--- The metric x data ---#
# displ
displ <- carsub$displ
displmean <- mean(displ)
displsd <- sd(displ)
zdispl <- (displ - displmean) / displsd

# cyl
cyl <- carsub$cyl
cylmean <- mean(cyl)
cylsd <- sd(cyl)
zcyl <- (cyl - cylmean) / cylsd

Organize the Data
#--- The nominal x data ---#
man <- as.numeric(carsub$manufacturer)
class <- as.numeric(carsub$class)
manlevels <- levels(carsub$manufacturer)
classlevels <- levels(carsub$class)
nmans <- length(unique(man))
nclass <- length(unique(class))

Organize the Data
datalist = list(
    y = zy,
    N = N,
    displ = zdispl,
    displmean = displmean,
    cyl = zcyl,
    cylmean = cylmean,
    man = man,
    class = class,
    nmans = nmans,
    nclass = nclass
)
Note that we need the means of the metric predictor variables here (we haven't in the past)
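As a quick sanity check (not in the original slides), each standardized variable should come out with mean ~0 and standard deviation exactly 1:

```r
# Each z-scored variable should have mean ~0 and sd 1
round(c(mean(zy),     sd(zy),
        mean(zdispl), sd(zdispl),
        mean(zcyl),   sd(zcyl)), 10)
```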
Define the Model
[Model diagram: yi ~ Normal(µ, τ = 1/σ²)]

Define the Model
Effect of being in each manufacturer category on mpg

Define the Model
Effect of engine displacement on mpg

Define the Model
Effect of being in each class category on mpg

Define the Model
Effect of # of cylinders on mpg

Define the Model
Note the multiple personalities of β0 now:
Metric predictors: the y value when all predictors are zero
Nominal predictors: the mean y value across all categories of all variables
What should it be now?
It makes sense to set it as the mean predicted value if the metric predictors are re-centred at their means

Define the Model
All coefficients are α because they will need to be standardized
Now the metric effects are centred around the mean

Define the Model
[Diagram: Normal(0, 10) priors placed on the coefficients]

Define the Model
We'll also make each nominal variable hierarchical...
[Diagram: Normal(0, 10) hyperpriors on the category means, and a Gamma(1.1, 0.11) hyperprior on the category SDs]
modelstring = "
model {
    #--- Likelihood ---#
    for (i in 1:N) {
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- a0 + a1[man[i]] + (a2 * (displ[i] - displmean)) +
                 a3[class[i]] + (a4 * (cyl[i] - cylmean))
    }

    #--- Priors ---#
    sigma ~ dgamma(1.1, 0.11)
    tau <- 1 / sigma^2
    a0 ~ dnorm(0, 1/10^2)
    a2 ~ dnorm(0, 1/10^2)
    a4 ~ dnorm(0, 1/10^2)

    # a1
    for (j in 1:nmans) {
        a1[j] ~ dnorm(manmeans, 1/mansd^2)
    }

    # a3
    for (j in 1:nclass) {
        a3[j] ~ dnorm(classmeans, 1/classsd^2)
    }

    #--- Hyperpriors ---#
    manmeans ~ dnorm(0, 1/10^2)
    mansd ~ dgamma(1.1, 0.11)
    classmeans ~ dnorm(0, 1/10^2)
    classsd ~ dgamma(1.1, 0.11)

    #--------------------------------------------------------------#
    # Convert a0,a[] to sum-to-zero b0,b[]                         #
    #--------------------------------------------------------------#
    m1 <- mean(a1[1:nmans])     # Mean across a1 categories
    m3 <- mean(a3[1:nclass])    # Mean across a3 categories

    #- b0 is a0 plus the mean of each nominal predictor, minus the mean  -#
    #- effect of the metric predictors. See Kruschke (2015) p. 570 for   -#
    #- the algebra                                                       -#
    b0 <- a0 + m1 + m3 - (a2 * displmean) - (a4 * cylmean)

    #- b1 is the uncorrected a1 minus the mean across categories for that nominal variable -#
    for (j in 1:nmans) {
        b1[j] <- a1[j] - m1
    }

    #- b3 is the uncorrected a3 minus the mean across categories for that nominal variable -#
    for (j in 1:nclass) {
        b3[j] <- a3[j] - m3
    }

    #- Coefficients for the metric variables stay the same -#
    b2 <- a2
    b4 <- a4
}
" # close quote for modelstring
writeLines(modelstring, con = "model.txt")
Specify Initial Values
initslist <- function() {
    list(
        sigma = rgamma(n = 1, shape = 1.1, rate = 0.11),
        a0 = rnorm(n = 1, mean = 0, sd = 10),
        a2 = rnorm(n = 1, mean = 0, sd = 10),   # initialize a2/a4, not b2/b4: the b's are derived quantities
        a4 = rnorm(n = 1, mean = 0, sd = 10),
        manmeans = rnorm(n = 1, mean = 0, sd = 10),
        mansd = rgamma(n = 1, shape = 1.1, rate = 0.11),
        classmeans = rnorm(n = 1, mean = 0, sd = 10),
        classsd = rgamma(n = 1, shape = 1.1, rate = 0.11)
    )
}
Specify MCMC Parameters and Run
runjagsout <- run.jags(
    method = "simple",
    model = "model.txt",
    monitor = c("b0", "b1", "b2", "b3", "b4", "sigma"),
    data = datalist,
    inits = initslist,
    n.chains = 3,
    adapt = 500,
    burnin = 1000,
    sample = 20000,
    thin = 1,
    summarise = TRUE,
    plots = FALSE)
Evaluate Performance of the Model
Testing Model Performance
Retrieve the data and take a peek at the structure

codasamples = as.mcmc.list(runjagsout)
head(codasamples[[1]])

Markov Chain Monte Carlo (MCMC) output:
Start = 1501
End = 1507
Thinning interval = 1
             b0      b1[1]      b1[2]      b1[3]      b1[4]      b1[5]       b1[6]
1501  0.1615540 -0.1428200  0.1109130  0.0269200 -0.1742880  0.1978230 -0.15071500
1502  0.1552670 -0.0826378  0.1884370 -0.0403636 -0.0872312  0.1875360 -0.06216070
1503  0.1001840 -0.0175505  0.1996980 -0.0569787 -0.0932590  0.0996678 -0.00883381
1504  0.0676603 -0.1242170  0.0388379 -0.0892582 -0.0544722  0.1295170 -0.05160610
1505  0.0574967  0.0596947  0.2072000  0.0547035 -0.0452218  0.1282700 -0.07768330
1506  0.1424630 -0.0746281  0.0516289 -0.0279351 -0.0471806 -0.0418495 -0.03025680
1507  0.1330610 -0.0468138 -0.0630458  0.1247140 -0.0365874  0.1287520 -0.09369140
          b1[7]       b1[8]       b1[9]     b1[10]     b1[11]      b1[12]      b1[13]
1501 -0.1281650  0.08830060  0.11898300  0.1815890 -0.0421282 -0.07847370 -0.10213300
1502 -0.0905903  0.08597800 -0.01362000  0.1636370 -0.1507060  0.04585780 -0.15745300
1503 -0.0052090 -0.00374356 -0.19175100  0.1362450 -0.0953489 -0.09058930 -0.04872230
1504  0.1106150  0.09045220  0.00229036 -0.0308884  0.0522702  0.00123141 -0.26743000
1505 -0.1661700 -0.14270000 -0.12711400  0.1074820  0.0087711 -0.00288126 -0.05062190
1506  0.1196960 -0.01552510 -0.03205860 -0.0868129  0.0830015  0.13673200 -0.10771300
1507 -0.0157871 -0.01230790 -0.11858900  0.0305756 -0.0964591 -0.01032590  0.00492213
...
Testing Model Performance
You can do this on your own
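One way to do that check, as a sketch using the coda package already loaded above: trace plots, the Gelman-Rubin statistic (values near 1 suggest convergence), and effective sample sizes.

```r
# Gelman-Rubin shrink factors for each monitored parameter (near 1 is good)
gelman.diag(codasamples, multivariate = FALSE)

# Effective sample sizes (how many independent draws the chains are worth)
effectiveSize(codasamples)

# Trace and density plots, one parameter at a time
plot(codasamples[, "b0"])
```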
Extract & Parse Results
mcmcchain = as.matrix(codasamples)

# b0
zb0 = mcmcchain[, "b0"]

# b1
chainlength = length(zb0)
zb1 = matrix(0, ncol = chainlength, nrow = nmans)
for (i in 1:nmans) {
    zb1[i, ] = mcmcchain[, paste("b1[", i, "]", sep = "")]
}

# b2
zb2 = mcmcchain[, "b2"]

# b3
zb3 = matrix(0, ncol = chainlength, nrow = nclass)
for (i in 1:nclass) {
    zb3[i, ] = mcmcchain[, paste("b3[", i, "]", sep = "")]
}

# b4
zb4 = mcmcchain[, "b4"]

# sigma
zsigma <- mcmcchain[, "sigma"]
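Because of the sum-to-zero conversion in the model, the deflections for each nominal variable should sum to ~0 at every MCMC step. A quick check (not in the original slides):

```r
# Per-step sums of the deflections across categories; all should be ~0
summary(colSums(zb1))   # manufacturer deflections
summary(colSums(zb3))   # class deflections
```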
Convert to Original Scale
b0 <- (zb0 * ysd) + ymean
b2 <- (zb2 * ysd) / displsd
b4 <- (zb4 * ysd) / cylsd
b1 <- zb1 * ysd
b3 <- zb3 * ysd
sigma <- zsigma * ysd
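Why these particular conversions: with zy = (y - ymean)/ysd and zdispl = (displ - displmean)/displsd, intercept-like terms (b0, b1, b3, sigma) scale by ysd (and b0 also shifts by ymean), while metric slopes scale by ysd/xsd. A quick consistency check, assuming the objects defined in the slides above:

```r
# A prediction assembled on the original scale should equal the same
# prediction made on the z scale and then un-standardized (step 1, car 1)
zpred <- zb0[1] + zb1[man[1], 1] + (zb2[1] * zdispl[1]) +
         zb3[class[1], 1] + (zb4[1] * zcyl[1])
pred  <- b0[1] + b1[man[1], 1] + (b2[1] * (displ[1] - displmean)) +
         b3[class[1], 1] + (b4[1] * (cyl[1] - cylmean))
all.equal(zpred * ysd + ymean, pred)   # should be TRUE
```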
View Posteriors
Plotting Posterior Distributions β0
par(mfrow = c(1, 1))
histinfo = plotpost(b0, xlab = "b0", main = "b0")

[Posterior of b0: mean = 24.099, 95% HDI from 23.569 to 24.661]
Plotting Posterior Distributions β1
par(mfrow = c(3, 3))
for (i in 1:nmans) {
    histinfo = plotpost(b1[i, ], xlab = bquote(b1[.(i)]),
                        main = paste("b1:", manlevels[i]))
}
Plotting Posterior Distributions β1
[3×3 panels: posteriors of b1 for audi, chevrolet, dodge, ford, honda, hyundai, jeep, land rover, and lincoln, each with mean and 95% HDI]

Plotting Posterior Distributions β1
[Remaining panels: posteriors of b1 for mercury, nissan, pontiac, subaru, toyota, and volkswagen, each with mean and 95% HDI]
Plotting Posterior Distributions β2
par(mfrow = c(1, 1))
histinfo = plotpost(b2, xlab = "b2", main = "Engine Displacement")

[Posterior of b2 (engine displacement): mean and 95% HDI shown]
Plotting Posterior Distributions β3
par(mfrow = c(2, 2))
for (i in 1:nclass) {
    histinfo = plotpost(b3[i, ], xlab = bquote(b3[.(i)]),
                        main = paste("b3:", classlevels[i]))
}
Plotting Posterior Distributions β3
[Panels: posteriors of b3 for 2seater, compact, midsize, and minivan, each with mean and 95% HDI]

Plotting Posterior Distributions β3
[Panels: posteriors of b3 for pickup, subcompact, and suv, each with mean and 95% HDI]
Plotting Posterior Distributions β4
par(mfrow = c(1, 1))
histinfo = plotpost(b4, xlab = "b4", main = "# of Cylinders")

[Posterior of b4 (# of cylinders): mean and 95% HDI shown]
Posterior Predictive Check
Posterior Predictive Check
Select a subset of the data on which to make predictions (let's pick 20)

npred = 20
newrows <- round(seq(from = 1, to = NROW(carsub), length = npred))
newdata <- carsub[newrows, ]
Posterior Predictive Check
Separate out just the x data, on which we will make predictions

x1 <- as.numeric(newdata$manufacturer)
x2 <- newdata$displ
x3 <- as.numeric(newdata$class)
x4 <- newdata$cyl
Posterior Predictive Check
Next, define a matrix that will hold all of the predicted y values
Number of rows is the number of x values for prediction
Number of columns is the number of y values generated from the MCMC process
We'll start with the matrix filled with zeros, and fill it in later

postsampsize = length(b0)
ynew = matrix(0, nrow = npred, ncol = postsampsize)
Posterior Predictive Check
Define a matrix for holding the HDI limits of the predicted y values
Same number of rows as above
Only two columns (one for each end of the HDI)

yhdilim = matrix(0, nrow = npred, ncol = 2)
Posterior Predictive Check
Now, populate the ynew matrix by generating one predicted y value for each step in the chain
Note that our coefficients for the metric predictors are centred around the mean, so we have to treat them this way here

for (i in 1:npred) {
    for (j in 1:postsampsize) {
        ynew[i, j] <- rnorm(1, mean = b0[j] + b1[x1[i], j] +
                               (b2[j] * (x2[i] - displmean)) +
                               b3[x3[i], j] +
                               (b4[j] * (x4[i] - cylmean)),
                            sd = sigma[j])
    }
}
Posterior Predictive Check
Calculate means for each prediction, and the associated low and high 95% HDI estimates

means <- rowMeans(ynew)

source("hdiofmcmc.r")
for (i in 1:npred) {
    yhdilim[i, ] <- HDIofMCMC(ynew[i, ])
}
Posterior Predictive Check
Combine into one table (cbind of numeric vectors returns a matrix)

predtable <- cbind(means, yhdilim)
Posterior Predictive Check
Plot predicted values

dotchart(means, labels = 1:npred,
         xlim = c(min(yhdilim), max(yhdilim)),
         xlab = "hwy mpg", pch = 16)
segments(yhdilim[, 1], 1:npred, yhdilim[, 2], 1:npred, lwd = 2)

Add the truth

points(x = newdata$hwy, y = 1:npred, pch = 16, col = rgb(1, 0, 0, 0.5))
Posterior Predictive Check
[Dot chart: predicted hwy mpg with 95% HDI segments for the 20 selected cars, with observed values overlaid in red]
Homework (last one!)
Homework
Get the DIC for the full model
Re-configure and run the model 4 more times, leaving a different predictor variable out each time, and get the DIC for each
Compare the DIC values to decide which predictors are most important for your model
You should explain your results and interpretation, but can do so as commented lines in your code (i.e., preceded by # so that your code will still run, but you also have written explanations in there for me to read)
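As a starting point for the DIC part, a hedged sketch: the runjags extract() method can compute DIC from a fitted runjags object (this re-runs the model briefly, needs the rjags package installed, and requires at least two chains; see ?extract.runjags for details).

```r
# DIC for the full model; lower DIC = better trade-off of fit and complexity
dicfull <- extract(runjagsout, what = "dic")
dicfull
```

Repeat this after refitting with each predictor dropped, and compare the resulting DIC values.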
Creative Commons License
Anyone is allowed to distribute, remix, tweak, and build upon this work, even commercially, as long as they credit me for the original creation. See the Creative Commons website for more information.