
Chapter 23
Poisson Regression

Contents

23.1 Introduction
23.2 Experimental design
23.3 Data structure
23.4 Single continuous X variable
23.5 Single continuous X variable - dealing with overdispersion
23.6 Single Continuous X variable with an OFFSET
23.7 ANCOVA models
23.8 Categorical X variables - a designed experiment
23.9 Log-linear models for multi-dimensional contingency tables
23.10 Variable selection methods
23.11 Summary

The suggested citation for this chapter of notes is:

Schwarz, C. J. (2015). Poisson Regression. In Course Notes for Beginning and Intermediate Statistics.

23.1 Introduction

In past chapters, multiple-regression methods were used to predict a continuous Y variable given a set of predictors, and logistic-regression methods were used to predict a dichotomous categorical variable given a set of predictors. In this chapter, we will explore Poisson-regression methods, which are typically used to predict counts of (rare) events given a set of predictors. Just as multiple regression implicitly assumed that the Y variable had a normal distribution, and logistic regression assumed that the choice of categories in Y was based on a binomial distribution, Poisson regression assumes that the observed counts are generated from a Poisson distribution.

The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ, with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) · · · (2)(1), and y = 0, 1, 2, . . . The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer. In the following graph, 1000 observations were generated from each of several Poisson distributions with differing means.

c 2015 Carl James Schwarz
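The pmf above is easy to evaluate directly. The following sketch (plain Python, not part of the original notes) computes it and checks numerically that the probabilities sum to 1 and that the mean equals µ:

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y! for non-negative integer y."""
    if y < 0:
        return 0.0
    return math.exp(-mu) * mu ** y / math.factorial(y)

# The probabilities over all non-negative counts sum to 1; here we sum over
# a truncated range that captures essentially all of the mass for mu = 3.
total = sum(poisson_pmf(y, 3.0) for y in range(100))
mean = sum(y * poisson_pmf(y, 3.0) for y in range(100))
print(round(total, 6), round(mean, 6))  # -> 1.0 3.0
```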


For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution; indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.

Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ, where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1,000 people and N = 100 (groups of 1,000).

Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ

Unlike the normal distribution, which has separate parameters for the mean and variance, the Poisson distribution has variance equal to the mean. This means that once you have estimated the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting, because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.

Another important property is that the Poisson distribution is additive. If Y1 is Poisson(µ1) and Y2 is Poisson(µ2), then Y = Y1 + Y2 is also Poisson(µ = µ1 + µ2). Lastly, the Poisson distribution is a limiting distribution of a Binomial distribution as n becomes large and p becomes very small.

Poisson regression is another example of a Generalized Linear Model (GLIM). 1 As in all GLIMs, the modeling process is a three-step affair:

Yi is assumed Poisson(µi)
φi = log(µi)
φi = β0 + β1 Xi1 + β2 Xi2 + ...

Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if the population size doubled, then the expected number of cancer cases should also double; as a population ages, the rate of cancer increases linearly on the log scale.
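The mean-variance and additivity properties above can be checked by simulation. This is a quick pure-Python sketch (the sampler and seed are illustrative choices, not from the notes):

```python
import math
import random

def rpois(mu, rng):
    """Draw one Poisson(mu) variate by inverting the CDF."""
    u, y = rng.random(), 0
    p = math.exp(-mu)      # P(Y = 0)
    cdf = p
    while u > cdf:
        y += 1
        p *= mu / y        # recurrence: P(y) = P(y - 1) * mu / y
        cdf += p
    return y

rng = random.Random(42)

# Sample mean and variance should both be close to mu.
draws = [rpois(4.0, rng) for _ in range(20000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)

# Additivity: Poisson(1.5) + Poisson(2.5) behaves like Poisson(4).
sums = [rpois(1.5, rng) + rpois(2.5, rng) for _ in range(20000)]
ms = sum(sums) / len(sums)

print(round(m, 2), round(v, 2), round(ms, 2))  # all three should be near 4
```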
Additionally, by modeling log(µi), it is impossible to get negative estimates of the mean. The linear part of the GLIM can consist of continuous X variables, categorical X variables, or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple and logistic regression. Unlike multiple regression, there are no closed-form solutions for the parameter estimates; standard maximum likelihood estimation (MLE) methods are used. 2 MLEs are guaranteed to be the best estimators (smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests, whose test statistics follow a chi-square distribution; this gives a p-value which is interpreted in the standard fashion. Predictions are done in the usual fashion; they initially appear on the log scale and must be anti-logged to provide estimates on the ordinary scale.

1 Logistic regression is another GLIM.
2 A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.
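A likelihood-ratio comparison between nested models can be sketched in a few lines. The log-likelihood values below are invented for illustration, and the survival function is the closed form that holds only for EVEN degrees of freedom (enough for a quick check):

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) for a chi-square with EVEN df,
    via the Poisson-series closed form for the Gamma(df/2, 2) tail."""
    assert df % 2 == 0 and df > 0
    lam = x / 2.0
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(df // 2))

# Likelihood-ratio test: twice the gap in maximized log-likelihoods,
# referred to a chi-square with df = number of constrained parameters.
loglik_full, loglik_reduced, df = -102.3, -108.9, 2   # hypothetical values
lr_stat = 2 * (loglik_full - loglik_reduced)
p_value = chi2_sf_even_df(lr_stat, df)
print(round(lr_stat, 1), round(p_value, 4))  # -> 13.2 0.0014
```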

23.2 Experimental design

In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication. The analysis of such designs in a generalized linear models framework is possible; please consult with a statistician if you have a complex experimental design.

23.3 Data structure

The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical; in the latter case, indicator variables will be created. As usual, the coding that a package uses for indicator variables is important if you want to interpret the estimates of the effect of the indicator variable directly. Consult the documentation for the package for details.

23.4 Single continuous X variable

The JMP file salamanders-burn.jmp, available in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory. A simple plot of the data:

shows an increasing relationship between the number of salamanders and the time since the forest understory burned.

Why can't a simple regression analysis using standard normal theory be used to fit the curve? First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom left part of the graph would require the normal distribution to be truncated at Y = 0. Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance at all X values made for standard regression models. Third, the fitted line from ordinary regression could go negative, and it is impossible to have a negative number of salamanders.

It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution. The process of fitting the model and interpreting the output is analogous to that used in logistic regression. The basic model is then:

Yi is Poisson(µi)
θi = log(µi)
θi = β0 + β1 Yearsi

As in the logistic model, the model specifies the distribution of the data about the mean (line 1), and a link function (line 2) connects the mean of each Y to the linear structural part of the model (line 3). In logistic regression,

the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) link is traditionally used to ensure that the mean is always positive. The model must be fit using maximum likelihood methods, just as in logistic regression. This model is fit in JMP using the Analyze->Fit Model platform: Be sure to specify the proper distribution and link function. This gives the output:

Most of the output parallels that seen in logistic regression. At the top of the output is a summary of the variable being analyzed, the distribution assumed for the raw data, the link used, and the total number of observations (rows in the dataset).

The Whole Model Test is analogous to that in multiple regression: is there evidence that the set of predictors (in this case there is only one) has any predictive ability beyond that seen by random chance? The test statistic is computed using a likelihood-ratio test comparing this model to a model with only the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because there is only 1 predictor, this test is equivalent to the Effect Test discussed below.]

The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to the degrees of freedom, and the p-value should be LARGE, i.e. much larger than 0.05. There is no evidence of a problem in the fit. 3 Later in this section, we will examine how to adjust for slight lack of fit.

The Effect Tests examine if each predictor (or, in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in multiple-regression models, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.

Finally, the Parameter Estimates section reports the estimated βs. So our fitted model is:

Yi is Poisson(µi)
θi = log(µi)
θi = β0 + β1 Yearsi, with the numerical estimates reported in the table

Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e.
not a category), the results of the parameter-estimate tests match the effect tests.

3 Remember that in goodness-of-fit tests, you DON'T want to find evidence against the null hypothesis.
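JMP maximizes the Poisson likelihood behind the scenes; since no closed form exists, an iterative scheme such as Newton-Raphson is used. A minimal sketch for one predictor follows; the data, function name, and starting values are made up for illustration (these are not the salamander counts):

```python
import math

def fit_poisson(x, y, iters=25):
    """Newton-Raphson for the model log(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the overall mean
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (derivatives of the log-likelihood) ...
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # ... and Fisher information for (b0, b1).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det     # 2x2 Newton step solved by hand
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up counts that grow with x on the log scale:
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 2, 4, 5, 8, 12, 18]
b0, b1 = fit_poisson(x, y)
eta = b0 + b1 * 4.5          # prediction on the log scale at x = 4.5
mu_hat = math.exp(eta)       # anti-log to get the predicted mean count
```

At the maximum, the score equations force the fitted means to reproduce the observed totals, which gives a handy convergence check.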

We can obtain predictions by following the drop-down menu: For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ1 = β0 + β1(12) = 1.12
µ1 = exp(1.12) = 3.08

which is the predicted value in the table. As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The above table shows the confidence interval for the mean response. Finally, a residual plot may also be constructed:

There is no evidence of a lack-of-fit.

23.5 Single continuous X variable - dealing with overdispersion

One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for it.

In the section on Logistic Regression, a dataset on nesting horseshoe crabs that is analyzed in Agresti's book was examined; the data are available from Agresti's web site. The design of the study is given in Brockmann, H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102. Again, it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.

Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:

crab color, where 2 = light medium, 3 = medium, 4 = dark medium, 5 = dark;
spine condition, where 1 = both good, 2 = one worn or broken, 3 = both worn or broken;
weight;
carapace width.

In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males. A JMP dataset crabsatellites.jmp is available from the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. A portion of the datafile is shown below:

Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males. As noted in the section on Logistic Regression, a preliminary scatter plot of the variables shows some interesting features.

There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:

There are three points whose recorded weights, given their carapace widths, appear to have a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than the recorded 21 cm, perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I've excluded these five data values.

To begin, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.

Yi distributed Poisson(µi)
λi = log(µi)
λi = β0 + β1 Weighti

The Generalized Linear Model platform of JMP is used:

This gives selected output:

There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero) 6, it does show substantial numbers of points outside the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model. This is an example of overdispersion.

6 The lines in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.

To see this overdispersion, divide the weight classes into

categories. [This has already been done in the dataset.] 7 Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform:

7 The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that there are enough observations in each class.

If the Poisson assumption were true, then the variance of the number of satellite males should be roughly

equal to the mean in each class. In fact, the variance in the number of satellite males appears to be roughly 3 times that of the mean.

With generalized linear models, there are two ways to adjust for over-dispersion. First, a different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.

Second, an ad hoc method, which nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ, where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability distribution that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods. The end result is that the over-dispersion factor is used to adjust the se and the test statistics. The adjusted se are obtained by multiplying the se from the Poisson model by the square root of ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-values are adjusted by looking up the adjusted test statistics in the appropriate table.

How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent.
These involve taking a goodness-of-fit statistic and dividing it by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, values of ĉ less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable; if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used. In JMP, the adjustment for over-dispersion occurs in the Analyze->Fit Model dialogue box:
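Both the variance-to-mean check and the ĉ computation above can be sketched directly. The counts, classes, and fitted means below are invented for illustration, and the estimator shown uses the Pearson chi-square as the goodness-of-fit statistic:

```python
import math
from collections import defaultdict

def variance_to_mean_by_class(classes, counts):
    """Group counts by class; Poisson data should give variance/mean near 1."""
    groups = defaultdict(list)
    for c, yi in zip(classes, counts):
        groups[c].append(yi)
    out = {}
    for c, ys in groups.items():
        m = sum(ys) / len(ys)
        v = sum((yi - m) ** 2 for yi in ys) / (len(ys) - 1)
        out[c] = (m, v)
    return out

def overdispersion_factor(y, mu, n_params):
    """c-hat = Pearson chi-square / residual df."""
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - n_params)

# Invented counts that are noisier than Poisson within each weight class:
classes = ["light"] * 6 + ["heavy"] * 6
counts = [0, 0, 1, 4, 0, 7, 2, 14, 3, 9, 1, 13]
for c, (m, v) in variance_to_mean_by_class(classes, counts).items():
    print(c, round(m, 2), round(v, 2), round(v / m, 2))

# c-hat from hypothetical counts and fitted means; SEs inflate by sqrt(c-hat),
# and chi-square test statistics deflate by c-hat.
yy = [0, 5, 1, 9, 2, 12, 4, 20]
mu = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
c_hat = overdispersion_factor(yy, mu, n_params=2)
se_adjusted = 0.10 * math.sqrt(c_hat)   # 0.10 is a hypothetical naive SE
print(round(c_hat, 2))  # -> 6.69
```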

The revised output is now:

Notice that the overdispersion factor has been estimated as

ĉ = chi-square / df = 3.13

This is very close to the guess we made by looking at the variance-to-mean ratio among the weight classes. The estimated intercept and slope are unchanged, and their interpretation is as before. For example, the estimated slope (about 0.000668 per gram) is the estimated increase in the log(number of satellite males) when the female crab's weight increases by 1 g. A 1000 g increase in body weight corresponds to a 1000 × 0.000668 = 0.668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^0.668 = 1.95, i.e. the mean number of satellite males almost doubles.

The estimated se have been inflated by √ĉ = √3.13 = 1.77. The confidence intervals for the slope and intercept are now wider. The chi-square test statistics have been deflated by ĉ, and the p-values have been adjusted accordingly.
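The slope interpretation can be replayed directly. On the log link, a d-unit increase in X multiplies the mean count by exp(β1 × d); the per-gram slope below is inferred from the chapter's 0.668-per-1000 g figure, so treat it as illustrative:

```python
import math

# Per-gram slope on the log scale, back-computed from 0.668 per 1000 g:
beta1 = 0.000668
d = 1000                  # a 1000 g increase in female weight
multiplier = math.exp(beta1 * d)
print(round(beta1 * d, 3), round(multiplier, 2))  # -> 0.668 1.95
```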

Finally, the residual plot has been rescaled by the factor √ĉ, and now most residuals lie between −2 and 2. Note that the pattern of the residual plot doesn't change; all the over-dispersion adjustment does is change the residual variance so that the standardization brings them closer to 0.

Predictions of the mean response at given levels of X are obtained in the usual fashion, giving (partial output):

The se of the predicted means will also have been adjusted for overdispersion, as will the confidence intervals for the mean number of satellite males at each weight value. However, notice that the menu item for a prediction interval for an INDIVIDUAL response is grayed out: it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution; in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio implicitly assumed by the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

We save the predicted values to the dataset and plot the final results on both the ordinary

scale:

and on the log-scale (the scale where the model is linear):


23.6 Single Continuous X variable with an OFFSET

In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes. For example, if the number of weeds is counted in a quadrat, then hopefully the size of the quadrat is constant; however, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or, if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different lengths.

Often these types of data are pre-standardized, i.e. converted to a per-m² or per-hour basis, and an analysis is then attempted on the standardized variable. However, standardization destroys the Poisson nature of the data, and it turns out to be unnecessary if the size of the sampling unit is also recorded.

The incidence of nonmelanoma skin cancer among women in the early 1970s in Minneapolis-St Paul, Minnesota, and Dallas-Fort Worth, Texas, is summarized below:

City  Age Class  Age Mid  Count  Pop Size
msp   15-24        20        1   172,675
msp   25-34        30       16   123,065
msp   35-44        40       30    96,216
msp   45-54        50       71    92,051
msp   55-64        60      102    72,159
msp   65-74        70      130    54,722
msp   75-84        80      133    32,185
msp   85+          90       40     8,328
dfw   15-24        20        4   181,343
dfw   25-34        30       38   146,207
dfw   35-44        40      119   121,374
dfw   45-54        50      221   111,353
dfw   55-64        60      259    83,004
dfw   65-74        70      310    55,932
dfw   75-84        80      226    29,007
dfw   85+          90       65     7,538

We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable, and only using the Minneapolis data (for now). The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms.

Is there a relationship between the age of a cohort and the cancer incidence rate? Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort. A plot of the standardized incidence rate by the mid-age of each cohort:

shows a curved relationship between the incidence rate and the mid-point of the age cohort. This suggests a theoretical model of the form:

Incidence = C e^(β × age)

i.e. an exponential increase in the cancer rates with age. This suggests that a log-transform be applied to BOTH sides, but a plot of the logarithm of the incidence rate against log(age midpoint):

is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form:

log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual

Is it possible to include the population size directly? Expand the above model:

log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
log(count / pop size) = β0 + β1 log(age) + β2 log(age)² + residual
log(count) − log(pop size) = β0 + β1 log(age) + β2 log(age)² + residual
log(count) = log(pop size) + β0 + β1 log(age) + β2 log(age)² + residual

Notice that log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size). Also notice that log(POPSIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables, and most packages expect the offset variable to be pre-transformed according to the link function used. In this case, the log link was used, so the offset is log(POPSIZE_age), as you will see in a minute.
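The offset mechanics can be sketched in code before writing the model formally. The offset log(POPSIZE) enters the linear predictor with a fixed coefficient of 1 and is never estimated; the cohort numbers and function name below are invented for illustration (they are not the Minnesota data):

```python
import math

def fit_poisson_offset(z, n, y, iters=25):
    """Newton-Raphson for log(mu_i) = log(n_i) + b0 + b1 * z_i.
    log(n_i) is the offset: fixed coefficient of 1, not estimated."""
    b0, b1 = math.log(sum(y) / sum(n)), 0.0   # start at the overall rate
    for _ in range(iters):
        mu = [ni * math.exp(b0 + b1 * zi) for ni, zi in zip(n, z)]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * zi for yi, mi, zi in zip(y, mu, z))
        h00 = sum(mu)
        h01 = sum(mi * zi for mi, zi in zip(mu, z))
        h11 = sum(mi * zi * zi for mi, zi in zip(mu, z))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical cohorts: z = log(age midpoint), n = population, y = cases.
ages = [20, 30, 40, 50, 60, 70]
z = [math.log(a) for a in ages]
n = [150000, 120000, 100000, 90000, 70000, 50000]
y = [2, 15, 31, 70, 100, 130]
b0, b1 = fit_poisson_offset(z, n, y)

# Per-person rate at age 40, and the expected COUNT for a cohort of 90000:
rate40 = math.exp(b0 + b1 * math.log(40))
expected40 = 90000 * rate40
```

Notice that the coefficients describe the rate, so a prediction of a raw count just multiplies the rate back up by the cohort's population size.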

Our GLIM model will then be:

Y_age distributed Poisson(µ_age)
φ_age = log(µ_age)
φ_age = log(POPSIZE_age) + β0 + β1 log(age) + β2 log(age)²

This can be rewritten slightly as:

φ_age = log(µ_age) = log(POPSIZE_age) + log(λ_age)

where

log(λ_age) = β0 + β1 log(age) + β2 log(age)²

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than the raw counts, as long as the offset variable log(POPSIZE_age) is known. To perform a Poisson regression, first create the offset variable log(POPSIZE_age) using the formula editor of JMP. The Analyze->Fit Model platform launches the analysis:

Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables. The output is:

The goodness-of-fit statistic indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion. Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age cohorts have very few counts, and so their actual incidence rates are very imprecisely estimated.

Finally, the Parameter Estimates section reports the estimated βs (remember these are on the log scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a category), the results of the parameter-estimate tests match the effect tests. Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit:

The final model is:

The predicted log(λ) for age 40 is found as:

log(λ)_age = β0 + β1 log(age)
log(λ)_40 = β0 + β1 log(40) = −8.04

This incidence rate is on the log scale, so the predicted incidence rate is found by taking anti-logs: e^(−8.04) = 0.000322, or 0.322 per thousand people, or 322 per million people. In order to make predictions about the expected number of cancers in each age cohort under this model, you would need to add back the log(POPSIZE) for the appropriate age class:

log(µ_40) = log(λ)_40 + log(POPSIZE_40) = 3.42

Finally, the predicted number of cases is simply the anti-log of this value:

Ŷ_40 = e^(log µ_40) = e^(3.42) = 30.6

Of course, this can be done automatically by the platform by requesting:

This also allows you to save the confidence limits for the average number of skin cancers expected for this age class, assuming the same population size (the mean confidence bounds), and the limits for the actual count (the individual confidence bounds). In this case, the expected number of skin cancer cases for this age class is about 30.6, with a 95% confidence interval for the mean number of cases reported alongside. The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases. By adding new data lines to the data table (before the model fit) with the Y variable missing but the age and offset variables present, you can make forecasts for any set of new X values. The residual plot:

isn't too bad; the large negative residual for the first age class (near where 0 skin cancers are predicted) is a bit worrisome. I suspect this is where the quadratic curve may provide a better fit. A plot of actual vs. predicted values can be obtained directly:

or by saving the predicted values to the data sheet and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line:

These plots show excellent agreement with the data.

Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and its confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor by subtracting log(POPSIZE) from the predicted skin cancer numbers (why?):

Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases:

Finally, use the Graph->Overlay Plot platform to plot the empirical estimates, the predicted values of λ, and the 95% confidence interval for λ on the same plot:

and fiddle 8 with the plot to join up the predictions and confidence bounds but leave the actual empirical points as is, to give the final plot:

8 I had to turn on the connect-through-missing option under the red-triangle menu.

Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.

23.7 ANCOVA models

Just like in regular multiple regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course, this parallelism is assessed on the link scale (in most cases for Poisson data, the log scale). There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

log(λ) = X
log(λ) = X + Cat
log(λ) = X + Cat + X·Cat

where X is the continuous predictor and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes but differing intercepts. The third model assumes separate lines for each category. Fitting would start with the most complex model (the third) and test for evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.

Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably Dallas, which receives more intense sun, would have a higher skin cancer rate. The data are available in the skincancer.jmp data set in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. Use all of the data. As before, log(population size) will be the offset variable. A preliminary data plot of the empirical cancer rate for the two cities:

shows roughly parallel responses, but the curvature is much more pronounced in Dallas. Perhaps a quadratic model should be fit first, with separate response curves for both cities. In shorthand model notation, this is:

log(λ) = City + log(age) + log(age)² + City·log(age) + City·log(age)²

where City is the effect of the two cities, log(age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. This is specified as:

As before, use the Generalized Linear Model option of the Analyze->Fit Model platform, and don't forget to specify log(POPSIZE) as the offset variable. This gives the output:

The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-value around .30). The Effect Test shows that perhaps both of the interaction terms can be dropped, but some care must be taken, as these are marginal tests and cannot simply be combined. A Chunk Test, similar to that seen in logistic regression, can be done to see if both interaction terms can be dropped simultaneously:


The p-value is just above α = 0.05, so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected. The simpler model with two parallel quadratic curves was then fit:

This simpler model also has no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped. The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator

variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE of all levels. So in this case, the estimate associated with City[dfw] = .401 represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2(.401) = .802 (SE = .052). This is a consistent difference for all age groups. This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects:
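Because JMP's effect coding makes each city's coefficient a deviation from the average over both cities, the dfw-minus-msp contrast is simply twice the reported coefficient. A minimal arithmetic sketch using the City[dfw] estimate quoted above:

```python
# Effect coding for a two-level factor: the fitted city term is
# +b for dfw (Dallas) and -b for msp (Minneapolis).
b_city = 0.401  # City[dfw] estimate reported in the text

effect = {"dfw": +b_city, "msp": -b_city}

# Contrast dfw - msp on the log scale = twice the reported coefficient:
diff_log = effect["dfw"] - effect["msp"]
print(round(diff_log, 3))   # 0.802
```

This is why the coefficient itself is only half the vertical distance between the two parallel curves.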

This gives the same results as above. This is a difference on the log scale. As seen in an earlier chapter, this can be converted to an estimate of the ratio of incidence rates by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.22 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the delta method.⁹ The delta method indicates that the SE of an exponentiated estimate is found as

SE(e^θ̂) = SE(θ̂) e^θ̂

⁹ A form of a Taylor series expansion. Consult most books on mathematical statistics for details.
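Numerically, with the rounded values reported above (θ̂ = .802, SE = .052), the delta-method calculation can be sketched as below; because the inputs are already rounded, the final digits may differ slightly from those printed in the text:

```python
import math

# Log-scale estimate and SE of the Dallas-vs-Minneapolis difference (from the text):
theta, se_theta = 0.802, 0.052

ratio = math.exp(theta)                  # estimated rate ratio (about 2.23 here)
se_ratio = se_theta * math.exp(theta)    # delta method: SE(e^theta) = SE(theta) * e^theta

# 95% CI: build it on the log scale, then anti-log the endpoints
lo, hi = theta - 2 * se_theta, theta + 2 * se_theta
ci_ratio = (math.exp(lo), math.exp(hi))
print(round(ratio, 2), round(se_ratio, 2),
      round(ci_ratio[0], 2), round(ci_ratio[1], 2))
```

Note the order of operations: the interval is computed on the log scale first and anti-logged afterwards, which is the preferred construction in the text.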

In this case, SE(ratio) = .052 × 2.22 ≈ .11. Confidence bounds are found by finding the usual confidence bounds on the log scale and then taking anti-logs of the endpoints. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052), .802 + 2(.052)), or (.698, .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01, 2.47). The residual plot (not shown) looks reasonable.

23.8 Categorical X variables - a designed experiment

Just as ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.

Consider an experiment to investigate 10 treatments (a control vs. a 3×3 factorial structure for two factors A and B) for controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees. On each tree, a trap was mounted, and the number of insects caught in each trap was recorded. Here is the raw data. (This example is from SAS for Linear Models, 4th Edition; data extracted from samples/a56655.)

Block  Treatment  A  B  Count

The data are available in the JMP data file insectcount.jmp in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. The RCB model was fit using a generalized linear model with a log link:

Count_i distributed Poisson(µ_i)
φ_i = log(µ_i)
φ_i = Block + Treatment

where the simplified syntax Block and Treatment refers to the block and treatment effects. Both Block and Treatment are categorical and will be translated to sets of indicator variables in the usual way. This model is fit in JMP using the Analyze->Fit Model platform. Note that the block and treatment variables must be nominally scaled. There is NO offset variable as the insect cages were all of equal size. This produces the output:

The Goodness-of-Fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block-by-treatment interactions is needed?), failure of the Poisson assumption, or using the wrong link function. The residual plot:

shows that the data are more variable than expected under a Poisson distribution (about 95% of the residuals should be within ±2). The base model and link function seem reasonable as there is no pattern to the residuals, merely an over-dispersion relative to a Poisson distribution. The adjustment for over-dispersion is made, as seen earlier, in the Analyze->Fit Model dialogue box:
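Numerically, the quasi-likelihood adjustment that this option applies amounts to estimating the over-dispersion factor ĉ from the Pearson statistic, inflating standard errors by √ĉ, and deflating chi-square test statistics by ĉ. The sketch below uses made-up Pearson and SE values chosen only so that ĉ works out to 3.5, matching the factor reported in the revised output; none of these numbers are taken from the actual JMP fit:

```python
import math

# Hypothetical lack-of-fit summary (illustration only):
pearson_chisq = 63.0   # sum of squared Pearson residuals
df = 18                # residual degrees of freedom
c_hat = pearson_chisq / df          # over-dispersion factor (3.5 here)

# Quasi-likelihood adjustments:
se_unadjusted = 0.21                        # a hypothetical reported SE
se_adjusted = se_unadjusted * math.sqrt(c_hat)   # SEs inflate by sqrt(c_hat)

chisq_unadjusted = 24.5                     # a hypothetical effect-test chi-square
chisq_adjusted = chisq_unadjusted / c_hat   # test statistics shrink by c_hat

print(round(c_hat, 2), round(se_adjusted, 3), round(chisq_adjusted, 2))
```

Point estimates are untouched by this adjustment; only the measures of uncertainty change.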

which gives the revised output:

Note that the over-dispersion factor ĉ = 3.5. The test statistics in the Effect Tests are adjusted by this factor (compare the chi-square for the treatment effects before and after adjusting for over-dispersion), and the p-values have been adjusted as well. The residuals have also been adjusted for the over-dispersion and now look more acceptable:

Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0. If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by a factor of √ĉ to account for over-dispersion.

As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package. JMP uses a coding where each indicator variable is compared to the mean response over all levels. Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution; in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance relationship implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

If comparisons among the treatment levels are of interest, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The contrast is estimated as:


The estimated difference in log(mean) is −.34 (se .39), which corresponds to a ratio of e^−.34 = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂)e^θ̂ = .39(.71) = .28. However, there is no evidence of a difference in trap counts, as the standard error is large relative to the estimate.

A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39), which gives (−1.12, .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32, 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts could be 1:1).

It is also correct to compute a 95% confidence interval for the ratio using the estimated ratio ± its se. This gives .71 ± 2(.28), or (.15, 1.27). In large samples, these confidence intervals are equivalent. In smaller samples, there is no real objective way to choose between them.

23.9 Log-linear models for multi-dimensional contingency tables

In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell into the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug). The use of logistic regression is a special case of the general r × c contingency table, where observations are classified by r levels of a factor and c levels of a response. In a separate chapter, χ² tests were used to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor.
This is also known as the test of independence of the response from the levels of the factor. This can be generalized to the analysis of multi-dimensional tables using Poisson regression. In more advanced courses, you can learn how the two previous cases are special cases of this more general modelling approach. Consult Agresti's book for a fuller account of this topic.

23.10 Variable selection methods

To be added later.

23.11 Summary

Poisson regression is the standard tool for the analysis of smallish count data. If the counts are large (say in the order of hundreds), you could likely use ordinary or weighted regression methods without difficulty.

This chapter only concerns itself with data collected under a simple random sample or a completely randomized design. If the data are collected under other designs, please consult with a statistician for the proper analysis.

A common problem that I have encountered is data that have been pre-standardized. For example, data may be recorded as the number of tree stems in 100 m² test plots. These data could likely be modeled using Poisson regression. But then the data are standardized to a per-hectare basis. These standardized data are NO LONGER distributed as a Poisson distribution. It would be preferable to analyze the data using the sampling units that were used to collect the data, with an offset variable used to adjust for differing sizes of survey units.

A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable. Very large overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is to use a different distribution, such as the negative binomial distribution.

Related models for this chapter are the zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts: the probability that an observation will be zero, and the distribution of the non-zero counts. There is a substantial base in the literature on this model.

This is the end of the chapter.


More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling Review for Final For a detailed review of Chapters 1 7, please see the review sheets for exam 1 and. The following only briefly covers these sections. The final exam could contain problems that are included

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

A Re-Introduction to General Linear Models

A Re-Introduction to General Linear Models A Re-Introduction to General Linear Models Today s Class: Big picture overview Why we are using restricted maximum likelihood within MIXED instead of least squares within GLM Linear model interpretation

More information

Regression Methods for Survey Data

Regression Methods for Survey Data Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear

More information

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION In this lab you will learn how to use Excel to display the relationship between two quantitative variables, measure the strength and direction of the

More information

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal Hypothesis testing, part 2 With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal 1 CATEGORICAL IV, NUMERIC DV 2 Independent samples, one IV # Conditions Normal/Parametric Non-parametric

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Chapter 9. Correlation and Regression

Chapter 9. Correlation and Regression Chapter 9 Correlation and Regression Lesson 9-1/9-2, Part 1 Correlation Registered Florida Pleasure Crafts and Watercraft Related Manatee Deaths 100 80 60 40 20 0 1991 1993 1995 1997 1999 Year Boats in

More information

STATISTICS 141 Final Review

STATISTICS 141 Final Review STATISTICS 141 Final Review Bin Zou bzou@ualberta.ca Department of Mathematical & Statistical Sciences University of Alberta Winter 2015 Bin Zou (bzou@ualberta.ca) STAT 141 Final Review Winter 2015 1 /

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION FOR SAMPLE OF RAW DATA (E.G. 4, 1, 7, 5, 11, 6, 9, 7, 11, 5, 4, 7) BE ABLE TO COMPUTE MEAN G / STANDARD DEVIATION MEDIAN AND QUARTILES Σ ( Σ) / 1 GROUPED DATA E.G. AGE FREQ. 0-9 53 10-19 4...... 80-89

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means 4.1 The Need for Analytical Comparisons...the between-groups sum of squares averages the differences

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

HOLLOMAN S AP STATISTICS BVD CHAPTER 08, PAGE 1 OF 11. Figure 1 - Variation in the Response Variable

HOLLOMAN S AP STATISTICS BVD CHAPTER 08, PAGE 1 OF 11. Figure 1 - Variation in the Response Variable Chapter 08: Linear Regression There are lots of ways to model the relationships between variables. It is important that you not think that what we do is the way. There are many paths to the summit We are

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and can be printed and given to the

More information

Package rsq. January 3, 2018

Package rsq. January 3, 2018 Title R-Squared and Related Measures Version 1.0.1 Date 2017-12-31 Author Dabao Zhang Package rsq January 3, 2018 Maintainer Dabao Zhang Calculate generalized R-squared, partial

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Bivariate data analysis

Bivariate data analysis Bivariate data analysis Categorical data - creating data set Upload the following data set to R Commander sex female male male male male female female male female female eye black black blue green green

More information

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in

More information

Business Statistics. Lecture 5: Confidence Intervals

Business Statistics. Lecture 5: Confidence Intervals Business Statistics Lecture 5: Confidence Intervals Goals for this Lecture Confidence intervals The t distribution 2 Welcome to Interval Estimation! Moments Mean 815.0340 Std Dev 0.8923 Std Error Mean

More information

CHAPTER 10. Regression and Correlation

CHAPTER 10. Regression and Correlation CHAPTER 10 Regression and Correlation In this Chapter we assess the strength of the linear relationship between two continuous variables. If a significant linear relationship is found, the next step would

More information

Loglikelihood and Confidence Intervals

Loglikelihood and Confidence Intervals Stat 504, Lecture 2 1 Loglikelihood and Confidence Intervals The loglikelihood function is defined to be the natural logarithm of the likelihood function, l(θ ; x) = log L(θ ; x). For a variety of reasons,

More information