Consider fitting a model using ordinary least squares (OLS) regression:


Meryl Barton
5 years ago

Example 1: Mating Success of African Elephants

In this study, 41 male African elephants were followed over a period of 8 years. The age of each elephant at the beginning of the study and the number of successful matings during the 8 years were recorded. The objective was to learn whether older animals are more successful at mating or whether they have diminished success after reaching a certain age. The data set Elephants.csv contains the following variables:

Y = Number of matings in the 8-year follow-up period
X = Age (yrs.) of elephant at the start of the study

> plot(Matings~Age)

Consider fitting a model using ordinary least squares (OLS) regression:

> ele.lm = lm(Matings~Age, data=Elephants)
> summary(ele.lm)

Call:
lm(formula = Matings ~ Age, data = Elephants)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) **
Age e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: on 39 degrees of freedom
Multiple R-squared: 0.343, Adjusted R-squared:
F-statistic: on 1 and 39 DF, p-value: 5.749e-05

> abline(ele.lm)
> resplot(ele.lm)

Do these plots suggest any violations of the OLS regression assumptions?

One approach to correcting the problem is to transform the response using a variance-stabilizing transformation.

> elesq.lm = lm(sqrt(Matings)~Age, data=Elephants)
> summary(elesq.lm)

Call:
lm(formula = sqrt(Matings) ~ Age, data = Elephants)

Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
Age ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: on 39 degrees of freedom
Multiple R-squared: 0.296, Adjusted R-squared:
F-statistic: 16.4 on 1 and 39 DF, p-value:

> resplot(elesq.lm)

While this may seem like a satisfactory model, the model coefficients are difficult to interpret because the response is now on the square-root scale. In this case, Poisson regression may be a better option.
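The variance-stabilizing effect of the square root can be checked by simulation. The sketch below uses simulated Poisson counts, not the elephant data; the means in `mu` and the sample size `n` are arbitrary choices made for illustration. On the raw scale the variance tracks the mean, while on the square-root scale it settles near 1/4 as the mean grows.

```r
# Simulated Poisson counts (illustrative only, not the Elephants data).
set.seed(1)
mu <- c(2, 5, 10, 20)   # arbitrary means chosen for illustration
n  <- 1e5               # draws per mean

# Variance of the counts themselves: should be close to mu.
raw_var  <- sapply(mu, function(m) var(rpois(n, m)))

# Variance after the square-root transformation: roughly constant
# (approaching 1/4) once the mean is not too small.
sqrt_var <- sapply(mu, function(m) var(sqrt(rpois(n, m))))

print(data.frame(mu = mu, raw_var = raw_var, sqrt_var = sqrt_var))
```

The transformation fixes the variance problem but, as noted above, leaves the coefficients on a scale that is awkward to interpret.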

Introduction to Poisson Regression

Recall the Poisson distribution is given by

$$P(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots, \quad \text{and } \mu > 0.$$

The response Y is a discrete random variable that represents the number of occurrences per unit of time or space. In Poisson regression, we seek a model for the mean of the response ($\mu$) as a function of terms based upon a set of predictors $x_1, x_2, \ldots, x_p$.

For a Poisson random variable, the mean and variance are both $\mu$, so the variance changes whenever the mean changes with the predictors. Traditional OLS regression will therefore not be adequate, because the constant error variance assumption is violated.

The logistic regression model that we have been studying is one type of a broader class of models called Generalized Linear Models. Generalized linear models are an extension of regular linear models that allow:

(1) the mean of a population to depend on a linear function of terms through a nonlinear link function, and
(2) the response probability distribution to be any member of a special class of distributions referred to as the exponential family.

The exponential family contains the normal distribution (used in OLS), the binomial distribution (used in logistic regression), and the Poisson distribution. The link function is a function that relates the mean of the response, $\mu_i = E(Y_i)$, linearly to a set of terms based on the explanatory variables (i.e., the predictors).

OLS Regression

For a normally distributed response, the link function is the identity function, $g(\mu) = \mu$; thus,

$$g(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

We typically write the model for the mean as follows:

$$E(Y \mid X) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Logistic Regression

For a binomial response we know that

$$g(\mu) = \ln\!\left(\frac{\mu}{1 - \mu}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$
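As a quick numerical check of the Poisson probability formula and the mean-equals-variance property, the following sketch compares the formula with R's built-in `dpois()`. The value `mu = 3.5` is an arbitrary choice for illustration.

```r
mu <- 3.5          # arbitrary illustrative mean
y  <- 0:10

# Probability mass function written out from the formula
pmf_formula <- exp(-mu) * mu^y / factorial(y)

# The same probabilities from R's built-in Poisson density
pmf_builtin <- dpois(y, lambda = mu)
print(all.equal(pmf_formula, pmf_builtin))

# Mean and variance computed from the pmf over a range wide enough
# that the omitted tail probability is negligible
yy     <- 0:200
pm     <- dpois(yy, lambda = mu)
mean_y <- sum(yy * pm)
var_y  <- sum((yy - mean_y)^2 * pm)
print(c(mean = mean_y, variance = var_y))
```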

We expressed this as:

$$\ln\!\left(\frac{\theta(x)}{1 - \theta(x)}\right) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Poisson Regression

For a Poisson distributed response variable, the link function is $g(\mu) = \ln(\mu)$; so,

$$\ln(\mu) = \eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}.$$

Thus,

$$\mu = \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1}).$$

Fitting the Poisson Regression Model in R

As the number of matings per 8-year period is likely to be well modeled by a Poisson distribution, we will now consider a Poisson regression model for the Elephants data.

> ele.glm = glm(Matings~Age,family="poisson")
> summary(ele.glm)

Call:
glm(formula = Matings ~ Age, family = "poisson")

Deviance Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) **
Age e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: on 40 degrees of freedom
Residual deviance: on 39 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 5
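To see the log link at work without the original data file, here is a minimal sketch on simulated data. The age range, the sample size, and the "true" coefficients in `eta_true` are all hypothetical values chosen for illustration; they are not estimates from Elephants.csv.

```r
set.seed(42)
n        <- 500
age      <- runif(n, 25, 55)        # hypothetical ages
eta_true <- c(-1.5, 0.07)           # hypothetical true coefficients

# Poisson mean through the log link: mu = exp(eta0 + eta1 * age)
mu      <- exp(eta_true[1] + eta_true[2] * age)
matings <- rpois(n, mu)
sim     <- data.frame(age = age, matings = matings)

# Fit the Poisson regression model
fit <- glm(matings ~ age, family = poisson(link = "log"), data = sim)
print(coef(fit))

# Fitted means are the exponentiated linear predictor
link_scale <- predict(fit, type = "link")
same_means <- all.equal(unname(fitted(fit)), unname(exp(link_scale)))
print(same_means)
```

With a sample this large, the estimated coefficients should land close to `eta_true`, which is the sense in which the model recovers the mean structure on the log scale.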

> par(mfrow=c(2,2))
> plot(ele.glm)
> par(mfrow=c(1,1))
> plot(Age,Matings,xlab="Age of Elephant",ylab="Num. of Matings")
> lines(Age,fitted(ele.glm))
> title(main="Plot of Matings vs. Age of Elephant w/ Poisson Fit")

Interpretation of Coefficients in the Poisson Regression Model

The coefficients in the Poisson regression model can be interpreted as follows. Assume that we change one of the explanatory terms (for example, the first one) by one unit from $u_1$ to $u_1 + 1$ while holding all other terms fixed. The percent increase (or decrease) in the mean response can then be calculated as follows:

$$100 \times \frac{\exp(\eta_0 + \eta_1 (u_1 + 1) + \cdots + \eta_{k-1} u_{k-1}) - \exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})} = 100\,[\exp(\eta_1) - 1]\%.$$

Alternatively, we can simply take the ratio

$$\frac{\exp(\eta_0 + \eta_1 (u_1 + 1) + \cdots + \eta_{k-1} u_{k-1})}{\exp(\eta_0 + \eta_1 u_1 + \cdots + \eta_{k-1} u_{k-1})} = e^{\eta_1},$$

which says the mean of the response is multiplied by a factor of $e^{\eta_1}$ per unit increase in the term $u_1$.

Interpretation of the estimated coefficient for age:

Let $\hat{\eta}_1$ denote the estimated coefficient for Age. We have a $100[e^{\hat{\eta}_1} - 1] = 7.11\%$ increase in the mean number of matings in the 8-year period per one year of age at the start of the study. Expressed as a multiplicative increase, this is a factor of 1.0711.

For a 5-year difference in initial age, we would expect a $100[e^{5\hat{\eta}_1} - 1] = 40.99\%$ increase in the mean number of matings in the following 8-year period. Expressed as a multiplicative increase, this is a factor of 1.4099.
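The percent-change arithmetic can be wrapped in two small helper functions. This is only a sketch: `pct_change()` and `rate_ratio()` are hypothetical helper names, and the coefficient value 0.0687 is an assumption, back-calculated from the 7.11% figure quoted above rather than read from the model output.

```r
# Hypothetical helpers: effect of a `step`-unit increase in a term
# whose coefficient on the log scale is `eta`.
rate_ratio <- function(eta, step = 1) exp(step * eta)
pct_change <- function(eta, step = 1) 100 * (exp(step * eta) - 1)

# Assumption: value back-calculated from the reported 7.11% increase.
eta_age <- 0.0687

pct_1yr <- pct_change(eta_age)            # effect of a 1-year difference
pct_5yr <- pct_change(eta_age, step = 5)  # effect of a 5-year difference
print(round(c(one_year = pct_1yr, five_years = pct_5yr), 2))

# The 5-year factor is the 1-year factor compounded five times,
# not five times the 1-year percent change.
print(all.equal(rate_ratio(eta_age, 5), rate_ratio(eta_age)^5))
```

Because the effect is multiplicative, the 5-year increase (about 41%) is larger than 5 × 7.11%.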

Wald Intervals and Tests for Parameters

95% CI for $\eta_i$: $\hat{\eta}_i \pm 1.96\, SE(\hat{\eta}_i)$

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) **
Age e-07 ***

> confint.default(ele.glm)
2.5 % 97.5 %
(Intercept)
Age

A confidence interval for the multiplicative increase in the response is then given as follows.

95% CI for $e^{\eta_i}$: $\exp\!\left(\hat{\eta}_i \pm 1.96\, SE(\hat{\eta}_i)\right)$

Questions:
1. Find a 95% CI for the effect of a 1-year Age difference.
2. Find a 95% CI for the effect of a 5-year Age difference.

Using a large-sample test for significance of the slope parameter ($\eta_i$):

$$H_0: \eta_i = 0 \qquad H_a: \eta_i \neq 0$$

$$z = \frac{\hat{\eta}_i}{SE(\hat{\eta}_i)} \sim N(0, 1), \qquad z^2 \sim \chi^2_1$$
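The Wald formulas can be verified against `confint.default()` on a small simulated fit. The data below are simulated with arbitrary coefficients (0.2 and 0.1) purely for illustration; the sketch uses `qnorm(0.975)`, which is the exact value that 1.96 approximates.

```r
set.seed(7)
x   <- runif(200, 0, 10)                 # simulated predictor
y   <- rpois(200, exp(0.2 + 0.1 * x))    # arbitrary true coefficients
fit <- glm(y ~ x, family = poisson)

est   <- coef(fit)
se    <- sqrt(diag(vcov(fit)))           # standard errors
zcrit <- qnorm(0.975)                    # about 1.96

# Wald interval on the log scale, computed by hand
wald_ci    <- cbind(est - zcrit * se, est + zcrit * se)
ci_default <- confint.default(fit)       # same interval from R

# Interval for the multiplicative effect: exponentiate the endpoints
ratio_ci <- exp(wald_ci)
print(ratio_ci)

# Wald z test and the equivalent chi-square test on z^2 (1 df)
z       <- est / se
p_z     <- 2 * pnorm(-abs(z))
p_chisq <- pchisq(z^2, df = 1, lower.tail = FALSE)
print(cbind(z = z, p_z = p_z, p_chisq = p_chisq))
```

Note that `confint()` on a `glm` object may report profile-likelihood intervals instead, which is why `confint.default()` is the one that matches the Wald formula.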

Example 2: Reproduction of Ceriodaphnia Organisms

In this study, the number of Ceriodaphnia organisms is counted in a controlled environment in which reproduction occurs among the organisms. Two different strains of organisms are involved, and the environment is changed by adding varying amounts of a chemical component intended to impair reproduction. Initial population sizes are the same. The data can be found in the file Ceriodaphnia.csv.

> head(ceriodaph)
Cerio Conc Strain

> cerio.glm = glm(Cerio~Conc+Strain,family="poisson")
> summary(cerio.glm)

Call:
glm(formula = Cerio ~ Conc + Strain, family = "poisson")

Deviance Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
Conc < 2e-16 ***
Strain e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: on 69 degrees of freedom
Residual deviance: on 67 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 4

Interpret the coefficients:

Poisson Regression in JMP

To fit a Poisson regression model for the number of Ceriodaphnia as a function of the concentration and strain, we again use Analyze > Fit Model to set up the model as shown below:

The results of the model fit are shown below:

Example 3: Caesarean Sections in Private vs. Public Hospitals

Births by caesarean section are said to be more frequent in private (fee-paying) hospitals (coded as Type=0) than in non-fee-paying public hospitals (coded as Type=1). Data on total annual births and the number of caesarean sections carried out were obtained from the records of 4 private hospitals and 16 public hospitals. These are tabulated in the file Caesarean_data.csv.

As the number of caesareans performed at a hospital is clearly a count of the number of occurrences, a Poisson regression for these data is appropriate. Our focus is on what role the type of hospital plays in the number of caesareans; however, the number of caesarean births clearly depends on the overall number of births at the hospital. We will therefore fit a Poisson regression model for the number of caesarean births using both the total number of births and hospital type as predictors in the model.

> cb.glm = glm(Caesareans~Type+Births,family="poisson")
> summary(cb.glm)

Call:
glm(formula = Caesareans ~ Type + Births, family = "poisson")

Deviance Residuals:
Min 1Q Median 3Q Max

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.351e e e-08 ***
Type 1.045e e ***
Births 3.261e e e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: on 19 degrees of freedom
Residual deviance: on 17 degrees of freedom
(50 observations deleted due to missingness)
AIC:

Number of Fisher Scoring iterations: 4

Questions:
1. Are type of hospital and number of births both significant predictors of the number of caesarean births? Explain.

2. Since Births is a continuous predictor, we need to pick an incremental value (c) to use when interpreting the parameter estimate. Suppose we use c = 1000 births. Use the appropriate parameter estimate to describe the predicted relationship between number of births and number of caesarean births (after adjusting for hospital type).
3. Use the appropriate parameter estimate to describe the predicted relationship between hospital type and the number of caesarean births (after adjusting for the total number of births).
4. Find and interpret confidence intervals to address the above questions, as well.
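When a predictor is interpreted in increments of c units, both the point estimate and the interval endpoints are multiplied by c on the log scale before exponentiating. The sketch below shows the mechanics on simulated data; the hospital counts, birth totals, and coefficients are all hypothetical and are not the values in Caesarean_data.csv.

```r
set.seed(11)
n      <- 200
type   <- rbinom(n, 1, 0.8)               # hypothetical 0/1 hospital type
births <- round(runif(n, 200, 4000))      # hypothetical annual births

# Hypothetical true coefficients on the log scale
mu         <- exp(1.0 + 0.8 * type + 0.0004 * births)
caesareans <- rpois(n, mu)
sim        <- data.frame(caesareans, type, births)

fit <- glm(caesareans ~ type + births, family = poisson, data = sim)

step <- 1000                              # increment c = 1000 births
b    <- coef(fit)[["births"]]
ci_b <- confint.default(fit)["births", ]  # Wald interval on the log scale

# Multiplicative effect of 1000 additional births, adjusting for type,
# with its 95% Wald interval
ratio_1000    <- exp(step * b)
ratio_1000_ci <- exp(step * ci_b)
print(c(estimate = ratio_1000, ratio_1000_ci))
```

Scaling inside the exponent matters: exponentiating first and then multiplying by 1000 would give a meaningless number.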

Appendix: Code for some useful R functions for OLS Regression

# Internally studentized residuals for a fitted lm object
Studresid = function (lm1, lms = summary(lm1), lmi = lm.influence(lm1))
{
    y <- resid(lm1)
    y2 <- y^2
    sy2 <- sum(y2)
    npred <- lm1$rank
    l <- length(resid(lm1))
    rse <- sy2/(l - npred)    # residual mean square
    rses <- sqrt(rse)
    h <- lmi$hat              # leverages
    (resid(lm1))/(rses * (1 - h)^0.5)
}

# Residual diagnostic plots for a fitted lm object
resplot = function (lm1, lms = summary(lm1))
{
    par(mfrow = c(2, 2), pty = "m")
    y <- resid(lm1)
    # Normal probability plot of the studentized residuals
    qqnorm(Studresid(lm1), main = "Normal Probability Plot", ylab = "Residuals")
    abline(0, sqrt(var(Studresid(lm1))))
    # Studentized residuals vs. fitted values, with lowess trend and spread bands
    plot(fitted(lm1), Studresid(lm1), xlab = "Fitted Values",
        ylab = "Studentized Residuals",
        main = "Plot of Studentized Residuals vs. Fitted", cex = 0.65)
    x <- fitted(lm1)
    y <- Studresid(lm1)
    f <- 0.5
    xs <- sort(x, index.return = TRUE)
    x <- xs$x
    ix <- xs$ix
    y <- y[ix]
    trend <- lowess(x, y, f)
    e2 <- (y - trend$y)^2
    scatter <- lowess(x, e2, f)
    uplim <- trend$y + sqrt(abs(scatter$y))
    lowlim <- trend$y - sqrt(abs(scatter$y))
    lines(trend$x, trend$y, col = "Blue")
    lines(scatter$x, uplim, col = "Red")
    lines(scatter$x, lowlim, col = "Red")
    abline(h = 0, lty = 2, col = 2)
    # Square root of absolute studentized residuals vs. fitted values
    plot(fitted(lm1), sqrt(abs(Studresid(lm1))), main = "Loess Fit of Residuals",
        ylab = "Absolute Stud. Residuals", xlab = "Fitted Values", cex = 0.7)
    lines(lowess(fitted(lm1), sqrt(abs(Studresid(lm1)))), lty = 1, col = 3)
    abline(h = mean(sqrt(abs(Studresid(lm1)))), col = "blue", lty = 3)
    # Side-by-side quantile plots of centered fitted values and residuals
    par(mfrow = c(1, 2))
    par(ask = TRUE)
    yl <- c(min(resid(lm1), fitted(lm1) - mean(fitted(lm1))),
        max(resid(lm1), fitted(lm1) - mean(fitted(lm1))))
    fit <- fitted(lm1)
    p <- sort(fit - mean(fit))
    pp <- ppoints(p)
    res <- resid(lm1)
    pr <- sort(res)
    ppr <- ppoints(pr)
    plot(pp, p, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Fitted values", cex = 0.7)
    plot(ppr, pr, pch = "o", xlim = c(0, 1), ylim = yl, xlab = "f-value",
        ylab = "", main = "Residuals", cex = 0.7)
    par(mfrow = c(1, 1))
    par(ask = FALSE)
    invisible()
}
