Multiple Regression: Mixed Predictor Types. Tim Frasier


1 Multiple Regression: Mixed Predictor Types. Tim Frasier. Copyright Tim Frasier. This work is licensed under the Creative Commons Attribution 4.0 International license.

2 The Data

3 Data
Fuel economy data from 1999 and 2008 for 38 popular models of car.*
I know, I know. It's neither biological nor that interesting, but it is hard to find good example data sets for this.
* As distributed with the ggplot2 package; original data from the EPA (

4 Data

5 Data Predicted variable hwy

6 Data Two categorical predictors man class

7 Data Two metric predictors* displ cyl * I realize that cylinders is not really a metric variable, but we will treat it like one here for demonstration purposes

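8 Data
Read the data into R and parse out just the fields in which we are interested
    # stringsAsFactors = TRUE keeps manufacturer and class as factors
    # (since R 4.0 the default is FALSE, which would break the later factor handling)
    cardata <- read.table("mpg.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
    # Columns kept: manufacturer, displ, cyl, hwy, class
    carsub <- cardata[, c(2, 4, 6, 10, 12)]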

9 Data
Use the summary function to get a feel for it
    summary(carsub)
        manufacturer      displ            cyl             hwy               class
     dodge     :37    Min.   :1.600   Min.   :4.000   Min.   :12.00   2seater   : 5
     toyota    :34    1st Qu.:2.400   1st Qu.:4.000   1st Qu.:18.00   compact   :47
     volkswagen:27    Median :3.300   Median :6.000   Median :24.00   midsize   :41
     ford      :25    Mean   :3.472   Mean   :5.889   Mean   :23.44   minivan   :11
     chevrolet :19    3rd Qu.:4.600   3rd Qu.:8.000   3rd Qu.:27.00   pickup    :33
     audi      :18    Max.   :7.000   Max.   :8.000   Max.   :44.00   subcompact:35
     (Other)   :74                                                    suv       :62

10 Data
Plot the data to get a feel for it. But keep in mind these can be misleading!!!
    pairs(carsub, pch = 16, col = rgb(0, 0, 1, 0.5))

11 Data [Figure: pairs plot of manufacturer, displ, cyl, hwy, and class]

12 Data Positive relationship between engine displacement and the number of cylinders (makes sense)

13 Data Negative relationship between engine displacement and highway mpg

14 Data Negative relationship between number of cylinders and highway mpg

15 Data Mostly positive relationship between engine displacement and vehicle class

16 Data Some interesting patterns of relationships between class and highway mpg

17 Data Some interesting patterns of relationships between manufacturer and highway mpg

18 Frequentist Approach

19 Frequentist Approach
Mixed predictors can be analyzed with the lm function
    cartest <- lm(hwy ~ manufacturer + displ + cyl + class, data = carsub)

20 Frequentist Approach
    summary(cartest)
Coefficient table (the estimates, standard errors, and t values did not survive in this copy; only the p-value fragments and significance codes remain):
    Term                      Pr(>|t|)
    (Intercept)               < 2e-16 ***
    manufacturerchevrolet
    manufacturerdodge
    manufacturerford
    manufacturerhonda         **
    manufacturerhyundai
    manufacturerjeep
    manufacturerland rover
    manufacturerlincoln
    manufacturermercury
    manufacturernissan
    manufacturerpontiac
    manufacturersubaru
    manufacturertoyota
    manufacturervolkswagen    *
    displ
    cyl                       ***
    classcompact
    classmidsize
    classminivan              **
    classpickup               e-08 ***
    classsubcompact
    classsuv                  e-08 ***
    ---
    Residual degrees of freedom: 211
    F-statistic on 22 and 211 DF, p-value: < 2.2e-16

21 Frequentist Approach (Intercept): this is the intercept plus the effect of being an audi (the reference manufacturer; the reference class is 2seater). All other effects are differences from this reference.

22 Frequentist Approach Manufacturer: not too big an impact, but a little (only honda and volkswagen are flagged as significant)

23 Frequentist Approach Engine displacement has a negative, but not significant, effect

24 Frequentist Approach Cylinder number has a significant negative effect

25 Frequentist Approach Class category seems important (minivan, pickup, and suv are flagged as significant)
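To make "differences from this reference" concrete, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical and are not taken from the mpg data. With R's default treatment contrasts, the intercept equals the mean of the first factor level, and every other coefficient equals that level's mean minus the reference mean.

```r
# Hypothetical toy data: three groups with known means (26, 18, 32)
toy <- data.frame(
  grp = factor(rep(c("audi", "dodge", "honda"), each = 4)),
  y   = c(26, 27, 25, 26,   # audi (reference: first level alphabetically)
          18, 19, 17, 18,   # dodge
          32, 33, 31, 32)   # honda
)

fit <- lm(y ~ grp, data = toy)
coefs <- coef(fit)
group_means <- tapply(toy$y, toy$grp, mean)

# Intercept = mean of the reference level
print(c(coefs["(Intercept)"], group_means["audi"]))

# Other coefficients = level mean minus reference mean
print(c(coefs["grpdodge"], group_means["dodge"] - group_means["audi"]))
print(c(coefs["grphonda"], group_means["honda"] - group_means["audi"]))
```

In the full model the same logic applies to both nominal predictors at once, so the reference is an audi 2seater.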

26 Bayesian Approach

27 Load Libraries & Functions
    library(runjags)
    library(coda)
    source("plotPost.R")   # helper script that defines plotPost()

28 Organize the Data
    #--- The y data ---#
    y = carsub$hwy
    N = length(y)
    ymean = mean(y)
    ysd = sd(y)
    zy = (y - ymean) / ysd

29 Organize the Data
    #--- The metric x data ---#
    # displ
    displ <- carsub$displ
    displmean <- mean(displ)
    displsd <- sd(displ)
    zdispl <- (displ - displmean) / displsd
    # cyl
    cyl <- carsub$cyl
    cylmean <- mean(cyl)
    cylsd <- sd(cyl)
    zcyl <- (cyl - cylmean) / cylsd

30 Organize the Data
    #--- The nominal x data ---#
    # as.numeric() on a factor returns the integer code of each level
    man <- as.numeric(carsub$manufacturer)
    class <- as.numeric(carsub$class)
    manlevels <- levels(carsub$manufacturer)
    classlevels <- levels(carsub$class)
    nmans <- length(unique(man))
    nclass <- length(unique(class))
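The nominal predictors are passed to JAGS as integer codes, which only works if the columns are factors. A small sketch of what that conversion does, using a made-up factor `f` as a stand-in for `carsub$manufacturer`:

```r
# Hypothetical factor standing in for carsub$manufacturer
f <- factor(c("toyota", "audi", "ford", "audi"))

codes <- as.numeric(f)   # integer code of each observation's level
lev   <- levels(f)       # level names, sorted alphabetically by default

print(codes)             # 3 1 2 1
print(lev[codes])        # indexing the levels by code recovers the labels

# A plain character vector has no level codes, so the conversion gives NA
chr_codes <- suppressWarnings(as.numeric(c("toyota", "audi")))
print(chr_codes)         # NA NA
```

This is why the columns need to be factors before `man` and `class` are created: with character columns, every code would be NA and `levels()` would return NULL.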

31 Organize the Data
    datalist = list(
      y = zy,
      N = N,
      displ = zdispl,
      displmean = displmean,
      cyl = zcyl,
      cylmean = cylmean,
      man = man,
      class = class,
      nmans = nmans,
      nclass = nclass
    )

32 Organize the Data Note that we need the means of the metric predictor variables (displmean and cylmean) in the data list here (we haven't in the past)

33 Define the Model [Diagram: y_i ~ norm(µ_i, τ = 1/σ²)]

34 Define the Model Effect of being in each manufacturer category on mpg

35 Define the Model Effect of engine displacement on mpg

36 Define the Model Effect of being in each class category on mpg

37 Define the Model Effect of # of cylinders on mpg

38 Define the Model Note the multiple personalities of β0 now. With metric predictors: the y value when all predictors are zero. With nominal predictors: the mean y value across all categories of all variables.

39 Define the Model What should it be now?

40 Define the Model It makes sense to set it as the mean predicted value if the metric predictors are re-centred at their mean.

41 Define the Model All coefficients are written as α because they will need to be standardized (converted to sum-to-zero β values)

42 Define the Model Now metric effects are centred around the mean
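Written out, the model these slides build up looks roughly as follows. This is a reconstruction from the JAGS code later in the deck, not a copy of the slide diagram. The subscripts man(i) and class(i) are notation introduced here for "the category that observation i belongs to", and the overbars denote the means of the metric predictors.

```latex
\begin{aligned}
y_i   &\sim \mathrm{norm}\!\left(\mu_i,\ \tau = 1/\sigma^2\right) \\
\mu_i &= \alpha_0
       + \alpha_{1[\mathrm{man}(i)]}
       + \alpha_2\,\bigl(\mathrm{displ}_i - \overline{\mathrm{displ}}\bigr)
       + \alpha_{3[\mathrm{class}(i)]}
       + \alpha_4\,\bigl(\mathrm{cyl}_i - \overline{\mathrm{cyl}}\bigr)
\end{aligned}
```

Each term matches one of the annotated effects: manufacturer (α1), engine displacement (α2), class (α3), and number of cylinders (α4).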

43 Define the Model [Diagram: norm(0, 10) priors on the coefficients, feeding y_i ~ norm(µ_i, τ = 1/σ²)]

45 Define the Model We'll also make each nominal variable hierarchical...

46 Define the Model [Diagram: gamma priors added for the standard deviations of the nominal-predictor coefficients]

47 Define the Model
    modelstring = "
    model {
      #--- Likelihood ---#
      for (i in 1:N) {
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- a0 + a1[man[i]] + (a2 * (displ[i] - displmean)) + a3[class[i]] + (a4 * (cyl[i] - cylmean))
      }
      #--- Priors ---#
      sigma ~ dgamma(1.1, 0.11)
      tau <- 1 / sigma^2
      a0 ~ dnorm(0, 1/10^2)
      a2 ~ dnorm(0, 1/10^2)
      a4 ~ dnorm(0, 1/10^2)
      # a1
      for (j in 1:nmans) {
        a1[j] ~ dnorm(manmeans, 1/mansd^2)
      }
      # a3
      for (j in 1:nclass) {
        a3[j] ~ dnorm(classmeans, 1/classsd^2)
      }
      #--- Hyperpriors ---#
      manmeans ~ dnorm(0, 1/10^2)
      mansd ~ dgamma(1.1, 0.11)
      classmeans ~ dnorm(0, 1/10^2)
      classsd ~ dgamma(1.1, 0.11)
      #--- Convert a0, a[] to sum-to-zero b0, b[] ---#
      m1 <- mean(a1[1:nmans])     # Mean across a1 categories
      m3 <- mean(a3[1:nclass])    # Mean across a3 categories
      #- b0 is a0 + mean of each nominal predictor, minus mean effect -#
      #- of metric predictors. See Kruschke (2015) p. 570 for algebra -#
      b0 <- a0 + m1 + m3 - (a2 * displmean) - (a4 * cylmean)
      #- b1 is the uncorrected a1 minus mean across categories for that nominal variable -#
      for (j in 1:nmans) {
        b1[j] <- a1[j] - m1
      }
      #- b3 is the uncorrected a3 minus mean across categories for that nominal variable -#
      for (j in 1:nclass) {
        b3[j] <- a3[j] - m3
      }
      #- Coefficients for metric variables stay the same -#
      b2 <- a2
      b4 <- a4
    }
    " # close quote for modelstring
    writeLines(modelstring, con = "model.txt")
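The sum-to-zero conversion only relabels the parameters; it does not change any prediction. A minimal sketch in plain R, using made-up numbers in place of one MCMC step (all values and the names `cls`, `mu_a`, `mu_b` are hypothetical), shows that the a and b parameterizations give the same mu and that the b deflections sum to zero:

```r
set.seed(1)
# Hypothetical values standing in for a single MCMC step
nmans <- 3; nclass <- 2
a0 <- 0.4
a1 <- rnorm(nmans)            # manufacturer effects (uncorrected)
a3 <- rnorm(nclass)           # class effects (uncorrected)
a2 <- -0.5; a4 <- -0.3        # metric slopes
displmean <- 3.5; cylmean <- 5.9

# Same conversion as in the model block
m1 <- mean(a1); m3 <- mean(a3)
b0 <- a0 + m1 + m3 - (a2 * displmean) - (a4 * cylmean)
b1 <- a1 - m1
b3 <- a3 - m3
b2 <- a2; b4 <- a4

# Prediction for one hypothetical car under both parameterizations
man <- 2; cls <- 1; displ <- 0.8; cyl <- -0.2
mu_a <- a0 + a1[man] + a2 * (displ - displmean) + a3[cls] + a4 * (cyl - cylmean)
mu_b <- b0 + b1[man] + b2 * displ + b3[cls] + b4 * cyl

print(c(mu_a = mu_a, mu_b = mu_b))           # identical
print(c(sum_b1 = sum(b1), sum_b3 = sum(b3))) # both zero up to rounding
```

Because b1 and b3 each sum to zero, b0 can be read as the prediction averaged over all categories, which is the interpretation the earlier slides argue for.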

50 Specify Initial Values
    initslist <- function() {
      list(
        sigma = rgamma(n = 1, shape = 1.1, rate = 0.11),
        a0 = rnorm(n = 1, mean = 0, sd = 10),
        # a2 and a4 are the stochastic nodes; b2 and b4 are derived from them
        # in the model, so initial values must be given for a2 and a4
        a2 = rnorm(n = 1, mean = 0, sd = 10),
        a4 = rnorm(n = 1, mean = 0, sd = 10),
        manmeans = rnorm(n = 1, mean = 0, sd = 10),
        mansd = rgamma(n = 1, shape = 1.1, rate = 0.11),
        classmeans = rnorm(n = 1, mean = 0, sd = 10),
        classsd = rgamma(n = 1, shape = 1.1, rate = 0.11)
      )
    }

51 Specify MCMC Parameters and Run
    runjagsout <- run.jags(
      method = "simple",
      model = "model.txt",
      monitor = c("b0", "b1", "b2", "b3", "b4", "sigma"),
      data = datalist,
      inits = initslist,
      n.chains = 3,
      adapt = 500,
      burnin = 1000,
      sample = 20000,
      thin = 1,
      summarise = TRUE,
      plots = FALSE)

52 Evaluate Performance of the Model

53 Testing Model Performance
Retrieve the data and take a peek at the structure
    codasamples = as.mcmc.list(runjagsout)
    head(codasamples[[1]])
Output header: Markov Chain Monte Carlo (MCMC) output: Start = 1501, End = 1507, Thinning interval = 1. Columns shown: b0, b1[1] through b1[13].

54 Testing Model Performance Can do this on your own
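One common convergence check is the Gelman-Rubin potential scale reduction factor, which coda reports via gelman.diag(codasamples). As a rough illustration of the idea only, here is a hand-rolled version of the basic statistic on simulated chains. The function name `rhat` and the simulated data are assumptions made for this sketch, and coda applies additional corrections, so its numbers will differ slightly.

```r
set.seed(42)
# Hypothetical stand-in for three well-mixed chains of one parameter
n <- 2000
chains <- cbind(rnorm(n), rnorm(n), rnorm(n))   # one column per chain

# Basic potential scale reduction factor (no extra corrections)
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_hat <- (n - 1) / n * W + B / n    # pooled estimate of posterior variance
  sqrt(var_hat / W)
}

good <- rhat(chains)
print(good)     # close to 1 when the chains agree

# Shifting one chain mimics chains stuck in different regions
bad_chains <- chains
bad_chains[, 3] <- bad_chains[, 3] + 3
bad <- rhat(bad_chains)
print(bad)      # clearly above 1
```

Values near 1 suggest the chains are sampling the same distribution; values well above 1 suggest they have not converged.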

55 Extract & Parse Results
    mcmcchain = as.matrix(codasamples)
    # b0
    zb0 = mcmcchain[, "b0"]
    # b1 (one row per manufacturer, one column per MCMC step)
    chainlength = length(zb0)
    zb1 = matrix(0, ncol = chainlength, nrow = nmans)
    for (i in 1:nmans) {
      zb1[i, ] = mcmcchain[, paste("b1[", i, "]", sep = "")]
    }
    # b2
    zb2 = mcmcchain[, "b2"]
    # b3 (one row per class, one column per MCMC step)
    zb3 = matrix(0, ncol = chainlength, nrow = nclass)
    for (i in 1:nclass) {
      zb3[i, ] = mcmcchain[, paste("b3[", i, "]", sep = "")]
    }
    # b4
    zb4 = mcmcchain[, "b4"]
    # sigma
    zsigma <- mcmcchain[, "sigma"]

56 Convert to Original Scale
    b0 <- (zb0 * ysd) + ymean
    b2 <- (zb2 * ysd) / displsd
    b4 <- (zb4 * ysd) / cylsd
    b1 <- zb1 * ysd
    b3 <- zb3 * ysd
    sigma <- zsigma * ysd
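The conversion rules can be checked on a case where the answer is known. The sketch below uses ordinary least squares on made-up data (not the Bayesian model, and the variable names are hypothetical): a slope fitted on standardized data, multiplied by sd(y) and divided by sd(x), matches the slope fitted on the raw data, and the rescaled intercept is the prediction at the mean of x.

```r
set.seed(7)
# Hypothetical data: y depends linearly on x
x <- runif(100, 1.5, 7)
y <- 35 - 3.5 * x + rnorm(100, sd = 2)

# Standardize both variables, as done for hwy, displ, and cyl
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

zfit   <- lm(zy ~ zx)   # fit on the standardized scale
rawfit <- lm(y ~ x)     # fit on the original scale

# Slope: multiply by sd(y), divide by sd(x)
slope_back <- unname(coef(zfit)["zx"]) * sd(y) / sd(x)

# Intercept: rescale and add back mean(y); this is the prediction at x = mean(x)
centre_back <- unname(coef(zfit)["(Intercept)"]) * sd(y) + mean(y)
centre_raw  <- unname(predict(rawfit, newdata = data.frame(x = mean(x))))

print(c(slope_back = slope_back, slope_raw = unname(coef(rawfit)["x"])))
print(c(centre_back = centre_back, centre_raw = centre_raw))
```

The nominal deflections (b1, b3) and sigma are only multiplied by sd(y) because they are differences or spreads in y, not locations.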

57 View Posteriors

58 Plotting Posterior Distributions β0
    par(mfrow = c(1, 1))
    histinfo = plotPost(b0, xlab = "b0", main = "b0")
[Figure: posterior histogram of b0 with mean and HDI]

59 Plotting Posterior Distributions β1
    par(mfrow = c(3, 3))
    for (i in 1:nmans) {
      histinfo = plotPost(b1[i, ], xlab = bquote(b1[.(i)]), main = paste("b1:", manlevels[i]))
    }

60 Plotting Posterior Distributions β1 [Figure: posterior histograms with mean and HDI for b1: audi, chevrolet, dodge, ford, honda, hyundai, jeep, land rover, lincoln]

61 Plotting Posterior Distributions β1 [Figure: posterior histograms with mean and HDI for b1: mercury, nissan, pontiac, subaru, toyota, volkswagen]

62 Plotting Posterior Distributions β2
    par(mfrow = c(1, 1))
    histinfo = plotPost(b2, xlab = "b2", main = "Engine Displacement")
[Figure: posterior histogram of b2 (Engine Displacement) with mean and HDI]

63 Plotting Posterior Distributions β3
    par(mfrow = c(2, 2))
    for (i in 1:nclass) {
      histinfo = plotPost(b3[i, ], xlab = bquote(b3[.(i)]), main = paste("b3:", classlevels[i]))
    }

64 Plotting Posterior Distributions β3 [Figure: posterior histograms with mean and HDI for b3: 2seater, compact, midsize, minivan]

65 Plotting Posterior Distributions β3 [Figure: posterior histograms with mean and HDI for b3: pickup, subcompact, suv]

66 Plotting Posterior Distributions β4
    par(mfrow = c(1, 1))
    histinfo = plotPost(b4, xlab = "b4", main = "# of Cylinders")
[Figure: posterior histogram of b4 (# of Cylinders) with mean and HDI]

67 Posterior Predictive Check

68 Posterior Predictive Check
Select a subset of the data on which to make predictions (let's pick 20)
    npred = 20
    # Evenly spaced row indices across the whole data set
    newrows <- round(seq(from = 1, to = NROW(carsub), length.out = npred))
    newdata <- carsub[newrows, ]

69 Posterior Predictive Check
Separate out just the x data, on which we will make predictions
    x1 <- as.numeric(newdata$manufacturer)
    x2 <- newdata$displ
    x3 <- as.numeric(newdata$class)
    x4 <- newdata$cyl

70 Posterior Predictive Check
Next, define a matrix that will hold all of the predicted y values.
Number of rows is the number of x values for prediction.
Number of columns is the number of y values generated from the MCMC process.
We'll start with the matrix filled with zeros, but will fill it in later.
    postsampsize = length(b0)
    ynew = matrix(0, nrow = npred, ncol = postsampsize)

71 Posterior Predictive Check
Define a matrix for holding the HDI limits of the predicted y values.
Same number of rows as above.
Only two columns (one for each end of the HDI).
    yhdilim = matrix(0, nrow = npred, ncol = 2)

72 Posterior Predictive Check
Now, populate the ynew matrix by generating one predicted y value for each step in the chain.
Note that our coefficients for the metric predictors are centred around the mean, so we have to treat them this way here.
    for (i in 1:npred) {
      for (j in 1:postsampsize) {
        ynew[i, j] <- rnorm(1,
                            mean = b0[j] + b1[x1[i], j] + (b2[j] * (x2[i] - displmean)) +
                                   b3[x3[i], j] + (b4[j] * (x4[i] - cylmean)),
                            sd = sigma[j])
      }
    }
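The double loop draws one value at a time, which can be slow with tens of thousands of MCMC steps. Because rnorm() is vectorized over mean and sd, the inner loop can be replaced by a single call per new case. The sketch below shows the pattern on a smaller made-up model with one nominal and one metric predictor; all posterior values here are hypothetical stand-ins, not output from the car model.

```r
set.seed(3)
# Hypothetical posterior draws (made-up values)
ndraws <- 5000
b0 <- rnorm(ndraws, 23, 0.2)
b1 <- rbind(rnorm(ndraws, 2, 0.3),       # rows = levels, columns = MCMC steps
            rnorm(ndraws, -2, 0.3))
b2 <- rnorm(ndraws, -1.5, 0.1)
sigma <- abs(rnorm(ndraws, 2, 0.1))
xmean <- 3.5                             # mean used to centre the metric predictor

x1 <- c(1, 2, 2)                         # level index for each new case
x2 <- c(2.0, 3.5, 5.0)                   # metric predictor for each new case

# One vectorized rnorm() call per new case: one draw per MCMC step
ynew <- t(sapply(seq_along(x1), function(i) {
  mu <- b0 + b1[x1[i], ] + b2 * (x2[i] - xmean)
  rnorm(ndraws, mean = mu, sd = sigma)
}))

print(dim(ynew))        # rows = new cases, columns = MCMC steps
print(rowMeans(ynew))   # predicted mean for each new case
```

The result has the same layout as the ynew matrix in the slides, so the later steps (row means and HDI limits) would apply unchanged.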

73 Posterior Predictive Check
Calculate means for each prediction, and the associated low and high 95% HDI estimates
    means <- rowMeans(ynew)
    source("HDIofMCMC.R")   # helper script that defines HDIofMCMC()
    for (i in 1:npred) {
      yhdilim[i, ] <- HDIofMCMC(ynew[i, ])
    }
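HDIofMCMC() comes from a sourced helper script that is not shown in these slides. As a rough sketch of what a highest density interval function typically computes (the shortest interval containing the requested share of the samples), here is a stand-alone version. The name `hdi_sketch` and its argument names are assumptions for illustration, not the helper's actual code.

```r
# Shortest interval that contains `credmass` of the samples
hdi_sketch <- function(samples, credmass = 0.95) {
  sorted <- sort(samples)
  n <- length(sorted)
  k <- ceiling(credmass * n)                 # index offset spanning the required mass
  ncand <- n - k                             # number of candidate intervals
  widths <- sorted[(1:ncand) + k] - sorted[1:ncand]
  best <- which.min(widths)                  # narrowest candidate
  c(lower = sorted[best], upper = sorted[best + k])
}

set.seed(11)
sym  <- rnorm(20000)    # symmetric: HDI is roughly the central 95% interval
skew <- rexp(20000)     # skewed: HDI hugs the high-density side near zero

hdi_sym  <- hdi_sketch(sym)
hdi_skew <- hdi_sketch(skew)
print(hdi_sym)
print(hdi_skew)
print(quantile(skew, c(0.025, 0.975)))       # equal-tailed interval, for comparison
```

For skewed posteriors the HDI and the equal-tailed interval differ noticeably, which is the main reason to prefer the HDI when summarising MCMC output.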

74 Posterior Predictive Check
Combine into one table (cbind returns a matrix with the mean and the two HDI limits as columns)
    predtable <- cbind(means, yhdilim)

75 Posterior Predictive Check
Plot predicted values
    dotchart(means, labels = 1:npred, xlim = c(min(yhdilim), max(yhdilim)), xlab = "hwy mpg", pch = 16)
    segments(yhdilim[, 1], 1:npred, yhdilim[, 2], 1:npred, lwd = 2)
Add the truth
    points(x = newdata$hwy, y = 1:npred, pch = 16, col = rgb(1, 0, 0, 0.5))

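76 Posterior Predictive Check [Figure: dot chart of the predicted means with HDI segments and the observed values overlaid in red; x-axis: hwy mpg]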

77 Homework (last one!)

78 Homework
Get the DIC for the full model.
Re-configure and run the model 4 more times, leaving a different predictor variable out each time, and get the DIC for each.
Compare the DIC values to decide which predictors are most important for your model.
You should explain your results and interpretation, but can do so as commented lines in your code (i.e., prefixed with # so that your code will still run, but also so that you have written explanations in there for me to read).

79 Creative Commons License Anyone is allowed to distribute, remix, tweak, and build upon this work, even commercially, as long as they credit me for the original creation. See the Creative Commons website for more information.
