Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation

Background Regression so far... Lecture 23 - Sta 111 Colin Rundel June 17, 2014 At this point we have covered: Simple linear regression Relationship between numerical response and a numerical or categorical predictor Multiple regression Relationship between numerical response and multiple numerical and/or categorical predictors What we haven t seen is what to do when the predictors are weird (nonlinear, complicated dependence structure, etc. or when the response is weird (categorical, count data, etc. Sta 111 (Colin Rundel Lec 23 June 17, 2014 2 / 36 Truck prices The scatterplot below shows the relationship between year and price of a random sample of 43 pickup trucks. Describe the relationship between these two variables. Remove unusual observations Let s remove trucks older than 20 years, and only focus on trucks made in 1992 or later. Now what can you say about the relationship? price 20000 15000 10000 5000 1980 1985 1990 year From: http:// faculty.chicagobooth.edu/ robert.gramacy/ teaching.html Sta 111 (Colin Rundel Lec 23 June 17, 2014 3 / 36 price 20000 15000 10000 5000 year Sta 111 (Colin Rundel Lec 23 June 17, 2014 4 / 36

Truck prices - linear model? Truck prices - transform of the response variable price residuals 20000 15000 10000 5000 10000 5000 0 5000 10000 year price = b 0 + b 1 year The linear model doesn t appear to be a good fit since the residuals have non-constant variance. In particular residuals for newer cars (to the right have a larger variance than the residuals for older cars (to the left. residuals (price 10.0 9.5 9.0 8.5 8.0 7.5 1.5 1.0 0.5 0.0 0.5 1.0 1.5 year (price = b 0 + b 1 year We have applied a transformation to the response variable. The relationship now seems linear, and the residuals have (more constant variance. Sta 111 (Colin Rundel Lec 23 June 17, 2014 5 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 6 / 36 Interpreting models with transformation Estimate Std. Error t value Pr(> t (Intercept -265.07 25.04-10.59 0.00 pu$year 0.14 0.01 10.94 0.00 (price = 265.07 + 0.14 year For each additional year the car is newer (for each year decrease in car s age we would expect the price of the car to increase on average by 0.14 dollars. which is not very useful... Interpreting models with transformation (cont. The slope coefficient for the transformed model is 0.14, meaning the price difference between cars that are one year apart is predicted to be 0.14 dollars. (price 1 = 265.07 + 0.14 y (price 2 = 265.07 + 0.14 (y + 1 (price 2 (price 1 = 0.14 ( price 2 = 0.14 price 1 price 2 = e 0.14 = 1.15 price 1 Sta 111 (Colin Rundel Lec 23 June 17, 2014 7 / 36 For each additional year the car is newer (for each year decrease in car s age we would expect the price of the car to increase on average by a factor of 1.15. Sta 111 (Colin Rundel Lec 23 June 17, 2014 8 / 36

Recap: dealing with non-constant variance Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y variable The most common variance stabilizing transform is the transformation: (y, especially useful when the response variable is (extremely right skewed. When using a transformation on the response variable the interpretation of the slope changes: For each unit increase in x, y is expected on average to decrease/increase by a factor of e b 1. Another useful transformation is the square root: y, especially useful when the response variable is counts. These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed. Odds Odds are another way of quantifying the probability of an event, commonly used in gambling (and istic regression. Odds For some event E, odds(e = P(E P(E c = P(E 1 P(E Similarly, if we are told the odds of E are x to y then which implies odds(e = x y = x/(x + y y/(x + y P(E = x/(x + y, P(E c = y/(x + y Sta 111 (Colin Rundel Lec 23 June 17, 2014 9 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 10 / 36 Example - Donner Party In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested rote to the Sacramento Valley. Having reached its full size of 87 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 40 of the 87 members had died from famine and exposure to extreme cold. Example - Donner Party - Data Age Sex Status 1 23.00 Male Died 2 40.00 Female Survived 3 40.00 Male Survived 4 30.00 Male Died 5 28.00 Male Died.... 43 23.00 Male Survived 44 24.00 Male Died 45 25.00 Female Survived From Ramsey, F.L. and Schafer, D.W. (2002. The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed Sta 111 (Colin Rundel Lec 23 June 17, 2014 11 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 12 / 36

Example - Donner Party - EDA Example - Donner Party -??? Status vs. Gender: Status vs. Age: Male Female Died 20 5 Survived 10 10 It seems clear that both age and gender have an effect on someone s survival, how do we come up with a model that will let us explore this relationship? Even if we set Died to 0 and Survived to 1, this isn t something we can transform our way out of - we need something more. Age 20 30 40 50 60 One way to think about the problem - we can treat Survived and Died as successes and failures arising from a binomial distribution where the probability of a success is given by a transformation of a linear model of the predictors. Died Survived Sta 111 (Colin Rundel Lec 23 June 17, 2014 13 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 14 / 36 Generalized linear models It turns out that this is a very general way of addressing this type of problem in regression, and the resulting models are called generalized linear models (. Logistic regression is just one example of this type of model. All generalized linear models have the following three characteristics: 1 A probability distribution describing the outcome variable 2 A linear model η = β 0 + β 1 X 1 + + β n X n 3 A link function that relates the linear model to the parameter of the outcome distribution g(p = η or p = g 1 (η Logistic regression is a GLM used to model a binary categorical variable using numerical and categorical predictors. We assume a binomial distribution produced the outcome variable and we therefore want to model p the probability of success for a given set of predictors. To finish specifying the Logistic model we just need to establish a reasonable link function that connects η to p. There are a variety of options but the most commonly used is the it function. Logit function ( p it(p =, for 0 p 1 Sta 111 (Colin Rundel Lec 23 June 17, 2014 15 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 16 / 36

Properties of the Logit The it function takes a value between 0 and 1 and maps it to a value between and. Inverse it (istic function g 1 (x = exp(x 1 + exp(x = 1 1 + exp( x The inverse it function takes a value between and and maps it to a value between 0 and 1. This formulation also has some use when it comes to interpreting the model as it can be interpreted as the odds of a success, more on this later. The istic regression model The three GLM criteria give us: From which we arrive at, y i Binom(p i η i = β 0 + β 1 x 1,i + + β n x n,i it(p = η p i = exp(β 0 + β 1 x 1,i + + β n x n,i 1 + exp(β 0 + β 1 x 1,i + + β n x n,i Sta 111 (Colin Rundel Lec 23 June 17, 2014 17 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 18 / 36 Example - Donner Party - Model In R we fit a GLM in the same was as a linear model except using glm instead of lm and we must also specify the type of GLM to fit using the family argument. summary(glm(status ~ Age, data=donner, family=binomial Call: glm(formula = Status ~ Age, family = binomial, data = donner Coefficients: Estimate Std. Error z value Pr(> z (Intercept 1.81852 0.99937 1.820 0.0688. Age -0.06647 0.03222-2.063 0.0391 * Null deviance: 61.827 on 44 degrees of freedom Residual deviance: 56.291 on 43 degrees of freedom AIC: 60.291 Number of Fisher Scoring iterations: 4 Sta 111 (Colin Rundel Lec 23 June 17, 2014 19 / 36 Example - Donner Party - Prediction Estimate Std. Error z value Pr(> z (Intercept 1.8185 0.9994 1.82 0.0688 Age -0.0665 0.0322-2.06 0.0391 ( p = 1.8185 0.0665 Age Odds / Probability of survival for a newborn (Age=0: ( p = 1.8185 0.0665 0 p = exp(1.8185 = 6.16 p = 6.16/7.16 = 0.86 Sta 111 (Colin Rundel Lec 23 June 17, 2014 20 / 36

Example - Donner Party - Prediction (cont. Example - Donner Party - Prediction (cont. ( p = 1.8185 0.0665 Age ( p = 1.8185 0.0665 Age Odds / Probability of survival for a 25 year old: ( p = 1.8185 0.0665 25 p = exp(0.156 = 1.17 p = 1.17/2.17 = 0.539 Odds / Probability of survival for a 50 year old: ( p = 1.8185 0.0665 0 p = exp( 1.5065 = 0.222 p = 0.222/1.222 = 0.181 Status 0.0 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 Sta 111 (Colin Rundel Lec 23 June 17, 2014 21 / 36 Age Sta 111 (Colin Rundel Lec 23 June 17, 2014 22 / 36 Example - Donner Party - Interpretation Example - Donner Party - Interpretation - Slope Estimate Std. Error z value Pr(> z (Intercept 1.8185 0.9994 1.82 0.0688 Age -0.0665 0.0322-2.06 0.0391 Simple interpretation is only possible in terms of odds and odds ratios for intercept and slope terms. Intercept: The odds of survival for a party member with an age of 0. From this we can calculate the odds or probability, but additional calculations are necessary. Slope: For a unit increase in age (being 1 year older how much will the odds ratio change, not particularly intuitive. More often than not we care only about sign and relative magnitude. ( 1 ( = 1.8185 0.0665(x + 1 1 = 1.8185 0.0665x 0.0665 ( p2 = 1.8185 0.0665x 2 ( p2 ( 1 p 1 1 2 / p2 = 0.0665 = 0.0665 2 / p2 = exp( 0.0665 = 0.94 2 Sta 111 (Colin Rundel Lec 23 June 17, 2014 23 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 24 / 36

Example - Donner Party - Age and Gender summary(glm(status ~ Age + Sex, data=donner, family=binomial Call: glm(formula = Status ~ Age + Sex, family = binomial, data = donner Coefficients: Estimate Std. Error z value Pr(> z (Intercept 1.63312 1.11018 1.471 0.1413 Age -0.07820 0.03728-2.097 0.0359 * SexFemale 1.59729 0.75547 2.114 0.0345 * --- (Dispersion parameter for binomial family taken to be 1 Null deviance: 61.827 on 44 degrees of freedom Residual deviance: 51.256 on 42 degrees of freedom AIC: 57.256 Number of Fisher Scoring iterations: 4 Gender slope: When the other predictors are held constant this is the odds ratio between the given level (Female and the reference level (Male. Sta 111 (Colin Rundel Lec 23 June 17, 2014 25 / 36 Example - Donner Party - Gender Models Just like MLR we can plug in gender to arrive at two status vs age models for men and women respectively. General model: Male model: Female model: ( 1 = 1.63312 + 0.07820 Age + 1.59729 Sex ( = 1.63312 + 0.07820 Age + 1.59729 0 1 = 1.63312 + 0.07820 Age ( 1 = 1.63312 + 0.07820 Age + 1.59729 1 = 3.23041 + 0.07820 Age Sta 111 (Colin Rundel Lec 23 June 17, 2014 26 / 36 Example - Donner Party - Gender Models (cont. Hypothesis test for the whole model Status 0.0 0.2 0.4 0.6 0.8 1.0 Males Females 0 20 40 60 80 Age Male Female Sta 111 (Colin Rundel Lec 23 June 17, 2014 27 / 36 summary(glm(status ~ Age + Sex, data=donner, family=binomial Call: glm(formula = Status ~ Age + Sex, family = binomial, data = donner Coefficients: Estimate Std. Error z value Pr(> z (Intercept 1.63312 1.11018 1.471 0.1413 Age -0.07820 0.03728-2.097 0.0359 * SexFemale 1.59729 0.75547 2.114 0.0345 * --- (Dispersion parameter for binomial family taken to be 1 Null deviance: 61.827 on 44 degrees of freedom Residual deviance: 51.256 on 42 degrees of freedom AIC: 57.256 Number of Fisher Scoring iterations: 4 Note that the model output does not include any F-statistic, as a general rule there are not single model hypothesis tests for GLM models. Sta 111 (Colin Rundel Lec 23 June 17, 2014 28 / 36

Hypothesis tests for a coefficient Testing for the slope of Age Estimate Std. Error z value Pr(> z (Intercept 1.6331 1.1102 1.47 0.1413 Age -0.0782 0.0373-2.10 0.0359 SexFemale 1.5973 0.7555 2.11 0.0345 We are however still able to perform inference on individual coefficients, the basic setup is exactly the same as what we ve seen before except we use a Z test. Note the only tricky bit, which is way beyond the scope of this course, is how the standard error is calculated. Estimate Std. Error z value Pr(> z (Intercept 1.6331 1.1102 1.47 0.1413 Age -0.0782 0.0373-2.10 0.0359 SexFemale 1.5973 0.7555 2.11 0.0345 Z = H 0 : β age = 0 H A : β age 0 β ˆ age β age = -0.0782 0 = -2.10 SE age 0.0373 p-value = P( Z > 2.10 = P(Z > 2.10 + P(Z < -2.10 = 2 0.0178 = 0.0359 Sta 111 (Colin Rundel Lec 23 June 17, 2014 29 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 30 / 36 Confidence interval for age slope coefficient Estimate Std. Error z value Pr(> z (Intercept 1.6331 1.1102 1.47 0.1413 Age -0.0782 0.0373-2.10 0.0359 SexFemale 1.5973 0.7555 2.11 0.0345 Remember, the interpretation for a slope is the change in odds ratio per unit change in the predictor. Log odds ratio: Additional Example Example - Birdkeeping and Lung Cancer A 1972-1981 health survey in The Hague, Netherlands, discovered an association between keeping pet birds and increased risk of lung cancer. To investigate birdkeeping as a risk factor, researchers conducted a case-control study of patients in 1985 at four hospitals in The Hague (population 450,000. They identified 49 cases of lung cancer among the patients who were registered with a general practice, who were age 65 or younger and who had resided in the city since 1965. They also selected 98 controls from a population of residents having the same general age structure. CI = PE ± CV SE = 0.0782 ± 1.96 0.0373 = ( 0.1513, 0.0051 Odds ratio: From Ramsey, F.L. and Schafer, D.W. (2002. The Statistical Sleuth: A Course in Methods of Data Analysis (2nd ed exp(ci = (exp( 0.1513, exp( 0.0051 = (0.85960.9949 Sta 111 (Colin Rundel Lec 23 June 17, 2014 31 / 36 Sta 111 (Colin Rundel Lec 23 June 17, 2014 32 / 36

Additional Example Example - Birdkeeping and Lung Cancer - Data Additional Example Example - Birdkeeping and Lung Cancer - EDA LC FM SS BK AG YR CD 1 LungCancer Male Low Bird 37.00 19.00 12.00 2 LungCancer Male Low Bird 41.00 22.00 15.00 3 LungCancer Male High NoBird 43.00 19.00 15.00........ 147 NoCancer Female Low NoBird 65.00 7.00 2.00 LC FM SS BK AG YR CD Whether subject has lung cancer Sex of subject Socioeconomic status Indicator for birdkeeping Age of subject (years Years of smoking prior to diagnosis or examination Average rate of smoking (cigarettes per day Note - NoCancer is the reference response (0 or failure, LungCancer is the non-reference response (1 or success - this matters for interpretation. Sta 111 (Colin Rundel Lec 23 June 17, 2014 33 / 36 Bird No Bird Lung Cancer No Lung Cancer Sta 111 (Colin Rundel Lec 23 June 17, 2014 34 / 36 Additional Example Example - Birdkeeping and Lung Cancer - Model summary(glm(lc ~ FM + SS + BK + AG + YR + CD, data=bird, family=binomial Call: glm(formula = LC ~ FM + SS + BK + AG + YR + CD, family = binomial, data = bird Coefficients: Estimate Std. Error z value Pr(> z (Intercept -1.93736 1.80425-1.074 0.282924 FMFemale 0.56127 0.53116 1.057 0.290653 SSHigh 0.10545 0.46885 0.225 0.822050 BKBird 1.36259 0.41128 3.313 0.000923 *** AG -0.03976 0.03548-1.120 0.262503 YR 0.07287 0.02649 2.751 0.005940 ** CD 0.02602 0.02552 1.019 0.308055 (Dispersion parameter for binomial family taken to be 1 Null deviance: 187.14 on 146 degrees of freedom Residual deviance: 154.20 on 140 degrees of freedom AIC: 168.2 Number of Fisher Scoring iterations: 5 Sta 111 (Colin Rundel Lec 23 June 17, 2014 35 / 36 Additional Example Example - Birdkeeping and Lung Cancer - Interpretation Estimate Std. Error z value Pr(> z (Intercept -1.9374 1.8043-1.07 0.2829 FMFemale 0.5613 0.5312 1.06 0.2907 SSHigh 0.1054 0.4688 0.22 0.8221 BKBird 1.3626 0.4113 3.31 0.0009 AG -0.0398 0.0355-1.12 0.2625 YR 0.0729 0.0265 2.75 0.0059 CD 0.0260 0.0255 1.02 0.3081 Keeping all other predictors constant then, The odds ratio of getting lung cancer for bird keepers vs non-bird keepers is exp(1.3626 = 3.91. The odds ratio of getting lung cancer for an additional year of smoking is exp(0.0729 = 1.08. Sta 111 (Colin Rundel Lec 23 June 17, 2014 36 / 36