
Chapter 23
Poisson Regression

Contents

23.1 Introduction
23.2 Experimental design
23.3 Data structure
23.4 Single continuous X variable
23.5 Single continuous X variable - dealing with overdispersion
23.6 Single Continuous X variable with an OFFSET
23.7 ANCOVA models
23.8 Categorical X variables - a designed experiment
23.9 Log-linear models for multi-dimensional contingency tables
23.10 Variable selection methods
23.11 Summary

The suggested citation for this chapter of notes is:

Schwarz, C. J. (2015). Poisson Regression. In Course Notes for Beginning and Intermediate Statistics.

23.1 Introduction

In past chapters, multiple-regression methods were used to predict a continuous Y variable given a set of predictors, and logistic-regression methods were used to predict a dichotomous categorical variable given a set of predictors. In this chapter, we will explore Poisson-regression methods, which are typically used to predict counts of (rare) events given a set of predictors. Just as multiple regression implicitly assumed that the Y variable had a normal distribution, and logistic regression assumed that the choice of categories in Y was based on a binomial distribution, Poisson regression assumes that the observed counts are generated from a Poisson distribution.

The Poisson distribution is often used to model count data when the events being counted are somewhat rare, e.g. cancer cases, the number of accidents, the number of satellite males around a female bird, etc. It is characterized by the expected number of events µ, with probability mass function:

P(Y = y | µ) = e^(−µ) µ^y / y!

where y! = y(y − 1)(y − 2) · · · (2)(1), and y = 0, 1, 2, . . . The probability mass function is available in tabular form, or can be computed by many statistical packages. While the values of Y are restricted to being non-negative integers, it is not necessary for µ to be an integer. In the following graph, 1000 observations were generated from each of several Poisson distributions with differing means.

c 2015 Carl James Schwarz
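The pmf above is easy to evaluate directly. The following sketch (plain Python, not part of the original notes) computes it and checks numerically that the probabilities sum to 1 and that the mean equals µ:

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y | mu) = exp(-mu) * mu**y / y! for non-negative integer y."""
    if y < 0:
        return 0.0
    return math.exp(-mu) * mu ** y / math.factorial(y)

# The probabilities over all non-negative counts sum to 1; here we sum over
# a truncated range that captures essentially all of the mass for mu = 3.
total = sum(poisson_pmf(y, 3.0) for y in range(100))
mean = sum(y * poisson_pmf(y, 3.0) for y in range(100))
print(round(total, 6), round(mean, 6))  # -> 1.0 3.0
```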


For very small values of µ, virtually all the counts are zero, with only a few counts that are positive. As µ increases, the shape of the distribution looks more and more like a normal distribution; indeed, for large µ, a normal distribution can be used as an approximation to the distribution of Y.

Sometimes µ is further parameterized by a rate parameter and a group size, i.e. µ = Nλ, where λ is the rate per unit and N is the group size. For example, the number of cancers in a group of 100,000 people could be modeled using λ as the rate per 1,000 people and N = 100 (groups of 1,000).

Two important properties of the Poisson distribution are:

E[Y] = µ
V[Y] = µ

Unlike the normal distribution, which has separate parameters for the mean and variance, the Poisson distribution has variance equal to the mean. This means that once you have estimated the mean, you have also estimated the variance, and so it is not necessary to have replicate counts to estimate the sample variance from data. As will be seen later, this can be quite limiting, because for many populations the data are over-dispersed, i.e. the variance is greater than you would expect from a simple Poisson distribution.

Another important property is that the Poisson distribution is additive. If Y1 is Poisson(µ1) and Y2 is Poisson(µ2), then Y = Y1 + Y2 is also Poisson(µ = µ1 + µ2). Lastly, the Poisson distribution is a limiting distribution of a Binomial distribution as n becomes large and p becomes very small.

Poisson regression is another example of a Generalized Linear Model (GLIM). 1 As in all GLIMs, the modeling process is a three-step affair:

Yi is assumed Poisson(µi)
φi = log(µi)
φi = β0 + β1 Xi1 + β2 Xi2 + ...

Here the link function is the natural logarithm, log. In many cases, the mean changes in a multiplicative fashion. For example, if the population size doubled, then the expected number of cancer cases should also double; as a population ages, the rate of cancer increases linearly on the log scale.
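The mean-variance and additivity properties above can be checked by simulation. This is a quick pure-Python sketch (the sampler and seed are illustrative choices, not from the notes):

```python
import math
import random

def rpois(mu, rng):
    """Draw one Poisson(mu) variate by inverting the CDF."""
    u, y = rng.random(), 0
    p = math.exp(-mu)      # P(Y = 0)
    cdf = p
    while u > cdf:
        y += 1
        p *= mu / y        # recurrence: P(y) = P(y - 1) * mu / y
        cdf += p
    return y

rng = random.Random(42)

# Sample mean and variance should both be close to mu.
draws = [rpois(4.0, rng) for _ in range(20000)]
m = sum(draws) / len(draws)
v = sum((d - m) ** 2 for d in draws) / (len(draws) - 1)

# Additivity: Poisson(1.5) + Poisson(2.5) behaves like Poisson(4).
sums = [rpois(1.5, rng) + rpois(2.5, rng) for _ in range(20000)]
ms = sum(sums) / len(sums)

print(round(m, 2), round(v, 2), round(ms, 2))  # all three should be near 4
```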
Additionally, by modeling log(µi), it is impossible to get negative estimates of the mean. The linear part of the GLIM can consist of continuous X variables, categorical X variables, or mixtures of both types of predictors. Categorical variables will be converted to indicator variables in exactly the same way as in multiple and logistic regression. Unlike multiple regression, there are no closed-form solutions for the parameter estimates; standard maximum likelihood estimation (MLE) methods are used. 2 MLEs are guaranteed to be the best estimators (smallest standard errors) as the sample size increases, and seem to work well even if the sample sizes are not large. Standard methods are used to estimate the standard errors of the estimates. Model comparisons are done using likelihood-ratio tests, whose test statistics follow a chi-square distribution; this gives a p-value which is interpreted in the standard fashion. Predictions are done in the usual fashion; they initially appear on the log scale and must be anti-logged to provide estimates on the ordinary scale.

1 Logistic regression is another GLIM.
2 A discussion of the theory of MLE is beyond the scope of this course, but is covered in Stat-330 and Stat-402.
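A likelihood-ratio comparison between nested models can be sketched in a few lines. The log-likelihood values below are invented for illustration, and the survival function is the closed form that holds only for EVEN degrees of freedom (enough for a quick check):

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) for a chi-square with EVEN df,
    via the Poisson-series closed form for the Gamma(df/2, 2) tail."""
    assert df % 2 == 0 and df > 0
    lam = x / 2.0
    return math.exp(-lam) * sum(lam ** k / math.factorial(k) for k in range(df // 2))

# Likelihood-ratio test: twice the gap in maximized log-likelihoods,
# referred to a chi-square with df = number of constrained parameters.
loglik_full, loglik_reduced, df = -102.3, -108.9, 2   # hypothetical values
lr_stat = 2 * (loglik_full - loglik_reduced)
p_value = chi2_sf_even_df(lr_stat, df)
print(round(lr_stat, 1), round(p_value, 4))  # -> 13.2 0.0014
```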

23.2 Experimental design

In this chapter, we will again assume that the data are collected under a completely randomized design. In some of the examples that follow, blocked designs will be analyzed, but we will not explore how to analyze split-plot or repeated-measures designs, or designs with pseudo-replication. The analysis of such designs in a generalized linear models framework is possible; please consult with a statistician if you have a complex experimental design.

23.3 Data structure

The data structure is straightforward. Columns represent variables and rows represent observations. The response variable, Y, will be a count of the number of events and will be set to a continuous scale. The predictor variables, X, can be either continuous or categorical; in the latter case, indicator variables will be created. As usual, the coding that a package uses for indicator variables is important if you want to interpret the estimates of the effect of the indicator variable directly. Consult the documentation for the package for details.

23.4 Single continuous X variable

The JMP file salamanders-burn.jmp, available in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms, contains data on the number of salamanders in a fixed-size quadrat at various locations in a large forest. The locations of the quadrats were chosen to represent a range of years since a forest fire burned the understory. A simple plot of the data:

shows an increasing relationship between the number of salamanders and the time since the forest understory burned.

Why can't a simple regression analysis using standard normal theory be used to fit the curve? First, the assumption of normality is suspect. The counts of the number of salamanders are discrete, with most under 10. It is impossible to get a negative number of salamanders, so the bottom left part of the graph would require the normal distribution to be truncated at Y = 0. Second, it appears that the variance of the counts at any particular age increases with age since burned. This violates the assumption of equal variance at all X values made for standard regression models. Third, the fitted line from ordinary regression could go negative, and it is impossible to have a negative number of salamanders.

It seems reasonable that a Poisson distribution could be used to model the number of salamanders. They are relatively rare and seem to forage independently of each other. These conditions are the underpinnings of a Poisson distribution. The process of fitting the model and interpreting the output is analogous to that used in logistic regression. The basic model is then:

Yi is Poisson(µi)
θi = log(µi)
θi = β0 + β1 Yearsi

As in the logistic model, the model specifies the distribution of the data about the mean (line 1), and a link function (line 2) connects the mean of each Y to the linear structural part of the model (line 3). In logistic regression,

the logit link was used to ensure that all values of p were between 0 and 1. In Poisson regression, the log (natural logarithm) link is traditionally used to ensure that the mean is always positive. The model must be fit using maximum likelihood methods, just as in logistic regression. This model is fit in JMP using the Analyze->Fit Model platform: Be sure to specify the proper distribution and link function. This gives the output:

Most of the output parallels that seen in logistic regression. At the top of the output is a summary of the variable being analyzed, the distribution assumed for the raw data, the link used, and the total number of observations (rows in the dataset).

The Whole Model Test is analogous to that in multiple regression: is there evidence that the set of predictors (in this case there is only one) has any predictive ability beyond that seen by random chance? The test statistic is computed using a likelihood-ratio test comparing this model to a model with only the intercept. The p-value is very small, indicating that the model has some predictive ability. [Because there is only 1 predictor, this test is equivalent to the Effect Test discussed below.]

The goodness-of-fit statistic compares the model with the intercept and the single predictor to a model where every observation is predicted individually. If the model fits well, the chi-square test statistic should be approximately equal to the degrees of freedom, and the p-value should be LARGE, i.e. much larger than 0.05. There is no evidence of a problem in the fit. 3 Later in this section, we will examine how to adjust for slight lack of fit.

The Effect Tests examine if each predictor (or, in the case of a categorical variable, the entire set of indicator variables) makes a statistically significant marginal contribution to the fit. As in multiple-regression models, these are MARGINAL contributions, i.e. assuming that all other variables remain in the model fixed at their current values. There is only one predictor, and there is strong evidence against the hypothesis of no marginal contribution.

Finally, the Parameter Estimates section reports the estimated βs. So our fitted model is:

Yi is Poisson(µi)
θi = log(µi)
θi = β0 + β1 Yearsi, with the numerical estimates reported in the table

Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e.
not a category), the results of the parameter-estimate tests match the effect tests.

3 Remember that in goodness-of-fit tests, you DON'T want to find evidence against the null hypothesis.
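JMP maximizes the Poisson likelihood behind the scenes; since no closed form exists, an iterative scheme such as Newton-Raphson is used. A minimal sketch for one predictor follows; the data, function name, and starting values are made up for illustration (these are not the salamander counts):

```python
import math

def fit_poisson(x, y, iters=25):
    """Newton-Raphson for the model log(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start at the overall mean
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (derivatives of the log-likelihood) ...
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # ... and Fisher information for (b0, b1).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det     # 2x2 Newton step solved by hand
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up counts that grow with x on the log scale:
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 2, 4, 5, 8, 12, 18]
b0, b1 = fit_poisson(x, y)
eta = b0 + b1 * 4.5          # prediction on the log scale at x = 4.5
mu_hat = math.exp(eta)       # anti-log to get the predicted mean count
```

At the maximum, the score equations force the fitted means to reproduce the observed totals, which gives a handy convergence check.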

We can obtain predictions by following the drop-down menu: For example, consider the first row of the data. At 12 years since the last burn, we estimate the mean response by starting at the bottom of the model and working upwards:

θ1 = β0 + β1(12) = 1.12
µ1 = exp(1.12) = 3.08

which is the predicted value in the table. As in ordinary normal-theory regression, confidence limits for the mean response and for an individual response may be found. The above table shows the confidence interval for the mean response. Finally, a residual plot may also be constructed:

There is no evidence of a lack-of-fit.

23.5 Single continuous X variable - dealing with overdispersion

One of the weaknesses of Poisson regression is the very restrictive assumption that the variance of a Poisson distribution is equal to its mean. In some cases, data are over-dispersed, i.e. the variance is greater than predicted by a simple Poisson distribution. In this section, we will illustrate how to detect overdispersion and how to adjust the analysis to account for it.

In the section on Logistic Regression, a dataset on nesting horseshoe crabs that is analyzed in Agresti's book was examined; the data are available from Agresti's web site. The design of the study is given in Brockmann, H.J. (1996). Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology, 102. Again, it is important to check that the design is a completely randomized design or a simple random sample. As in regression models, you do have some flexibility in the choice of the X settings, but for a particular weight and color, the data must be selected at random from the relevant population.

Each female horseshoe crab had a male resident in her nest. The study investigated other factors affecting whether the female had any other males, called satellites, residing nearby. These other factors include:

crab color, where 2 = light medium, 3 = medium, 4 = dark medium, 5 = dark;
spine condition, where 1 = both good, 2 = one worn or broken, 3 = both worn or broken;
weight;
carapace width.

In the section on Logistic Regression, a derived variable on the presence or absence of satellite males was examined. In this section, we will examine the actual number of satellite males. A JMP dataset crabsatellites.jmp is available from the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. A portion of the datafile is shown below:

Note that the color and spine condition variables should be declared with an ordinal scale despite having numerical codes. In this analysis we will use the actual number of satellite males. As noted in the section on Logistic Regression, a preliminary scatter plot of the variables shows some interesting features.

There is a very high positive relationship between carapace width and weight, but there are a few anomalous crabs that should be investigated further, as shown in this magnified plot:

There are three points whose recorded weights, given their carapace widths, appear to have a typographical error in the first digit. There is a single crab whose weight suggests a width of 24 cm rather than the recorded 21 cm, perhaps a typo in the last digit. Finally, there is one crab which is extremely large compared to the rest of the group. In the analysis that follows, I've excluded these five data values.

To begin, fit a model that attempts to predict the mean number of satellite crabs as a function of the weight of the female crab, i.e.

Yi distributed Poisson(µi)
λi = log(µi)
λi = β0 + β1 Weighti

The Generalized Linear Model platform of JMP is used:

This gives selected output:

There are two parts of the output which show that the fit is not very satisfactory. First, while the studentized residual plot does not show any structural defects (the residuals are scattered around zero) 6, it does show substantial numbers of points outside the (−2, 2) range. This suggests that the data are too variable relative to the Poisson assumption. Second, the goodness-of-fit statistic has a very small p-value, indicating that the data are not well fit by the model. This is an example of overdispersion.

6 The lines in the plot are artifacts of the discrete nature of the response. See the chapter on residual plots for more details.

To see this overdispersion, divide the weight classes into

categories. [This has already been done in the dataset.] 7 Now find the mean and variance of the number of satellite males for each weight class using the Tables->Summary platform:

7 The choice of 4 weight classes is somewhat arbitrary. I would usually try to subdivide the data into between 4 and 10 classes, ensuring that there are enough observations in each class.

If the Poisson assumption were true, then the variance of the number of satellite males should be roughly

equal to the mean in each class. In fact, the variance in the number of satellite males appears to be roughly 3 times that of the mean.

With generalized linear models, there are two ways to adjust for over-dispersion. First, a different distribution can be used that is more flexible in the mean-to-variance ratio. A common distribution used in these cases is the negative binomial distribution. In more advanced classes, you will learn that the negative binomial distribution can arise from a Poisson distribution with extra variation in the mean rates. JMP does not allow the fitting of a negative binomial distribution, but this option is available in SAS.

Second, an ad hoc method, which nevertheless has theoretical justification, is to allow some flexibility in the variance. For example, rather than restricting V[Y] = E[Y] = µ, perhaps V[Y] = cµ, where c is called the over-dispersion factor. Note that if this formulation is used, the data are no longer distributed as a Poisson distribution; in fact, there is NO actual probability distribution that has this property. Nevertheless, this quasi-distribution still has nice properties, and the over-dispersion factor can be estimated using quasi-likelihood methods that are analogous to regular likelihood methods. The end result is that the over-dispersion factor is used to adjust the se and the test statistics. The adjusted se are obtained by multiplying the se from the Poisson model by the square root of ĉ. The adjusted chi-square test statistics are found by dividing the test statistics from the Poisson model by ĉ, and the p-values are adjusted by looking up the adjusted test statistics in the appropriate table.

How is the over-dispersion factor c estimated? There are two methods, both of which are asymptotically equivalent.
These involve taking a goodness-of-fit statistic and dividing it by its degrees of freedom:

ĉ = goodness-of-fit statistic / df

Usually, values of ĉ less than 10 (corresponding to a potential inflation in the se by a factor of about 3) are acceptable; if the inflation factor is more than about 10, the lack-of-fit is so large that alternate methods should be used. In JMP, the adjustment for over-dispersion occurs in the Analyze->Fit Model dialogue box:
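Both the variance-to-mean check and the ĉ computation above can be sketched directly. The counts, classes, and fitted means below are invented for illustration, and the estimator shown uses the Pearson chi-square as the goodness-of-fit statistic:

```python
import math
from collections import defaultdict

def variance_to_mean_by_class(classes, counts):
    """Group counts by class; Poisson data should give variance/mean near 1."""
    groups = defaultdict(list)
    for c, yi in zip(classes, counts):
        groups[c].append(yi)
    out = {}
    for c, ys in groups.items():
        m = sum(ys) / len(ys)
        v = sum((yi - m) ** 2 for yi in ys) / (len(ys) - 1)
        out[c] = (m, v)
    return out

def overdispersion_factor(y, mu, n_params):
    """c-hat = Pearson chi-square / residual df."""
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - n_params)

# Invented counts that are noisier than Poisson within each weight class:
classes = ["light"] * 6 + ["heavy"] * 6
counts = [0, 0, 1, 4, 0, 7, 2, 14, 3, 9, 1, 13]
for c, (m, v) in variance_to_mean_by_class(classes, counts).items():
    print(c, round(m, 2), round(v, 2), round(v / m, 2))

# c-hat from hypothetical counts and fitted means; SEs inflate by sqrt(c-hat),
# and chi-square test statistics deflate by c-hat.
yy = [0, 5, 1, 9, 2, 12, 4, 20]
mu = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
c_hat = overdispersion_factor(yy, mu, n_params=2)
se_adjusted = 0.10 * math.sqrt(c_hat)   # 0.10 is a hypothetical naive SE
print(round(c_hat, 2))  # -> 6.69
```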

The revised output is now:

Notice that the overdispersion factor has been estimated as

ĉ = chi-square / df = 3.13

This is very close to the guess we made by looking at the variance-to-mean ratio among the weight classes. The estimated intercept and slope are unchanged, and their interpretation is as before. For example, the estimated slope (about 0.000668 per gram) is the estimated increase in the log(number of satellite males) when the female crab's weight increases by 1 g. A 1000 g increase in body weight corresponds to a 1000 × 0.000668 = 0.668 increase in the log(number of satellite males), which corresponds to an increase by a factor of e^0.668 = 1.95, i.e. the mean number of satellite males almost doubles.

The estimated se have been inflated by √ĉ = √3.13 = 1.77. The confidence intervals for the slope and intercept are now wider. The chi-square test statistics have been deflated by ĉ, and the p-values have been adjusted accordingly.
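The slope interpretation can be replayed directly. On the log link, a d-unit increase in X multiplies the mean count by exp(β1 × d); the per-gram slope below is inferred from the chapter's 0.668-per-1000 g figure, so treat it as illustrative:

```python
import math

# Per-gram slope on the log scale, back-computed from 0.668 per 1000 g:
beta1 = 0.000668
d = 1000                  # a 1000 g increase in female weight
multiplier = math.exp(beta1 * d)
print(round(beta1 * d, 3), round(multiplier, 2))  # -> 0.668 1.95
```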

Finally, the residual plot has been rescaled by the factor √ĉ, and now most residuals lie between −2 and 2. Note that the pattern of the residual plot doesn't change; all the over-dispersion adjustment does is change the residual variance so that the standardization brings them closer to 0.

Predictions of the mean response at given levels of X are obtained in the usual fashion, giving (partial output):

The se of the predicted means will also have been adjusted for overdispersion, as will the confidence intervals for the mean number of satellite males at each weight value. However, notice that the menu item for a prediction interval for an INDIVIDUAL response is grayed out: it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution; in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance ratio implicitly assumed by the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

We save the predicted values to the dataset and plot the final results on both the ordinary

scale:

and on the log-scale (the scale where the model is linear):


23.6 Single Continuous X variable with an OFFSET

In the previous examples, the sampling units (where the counts were obtained) were all the same size (e.g. the number of satellite males around a single female). In some cases, the sampling units are of different sizes. For example, if the number of weeds is counted in a quadrat, then hopefully the size of the quadrat is constant; however, it is conceivable that the size of the plot varies because different people collected different parts of the data. Or, if the number of events is counted in a time interval (e.g. the number of fish captured in a fishing trip), the time intervals could be of different lengths.

Often these types of data are pre-standardized, i.e. converted to a per-m² or per-hour basis, and an analysis is then attempted on the standardized variable. However, standardization destroys the Poisson nature of the data, and it turns out to be unnecessary if the size of the sampling unit is also recorded.

The incidence of nonmelanoma skin cancer among women in the early 1970s in Minneapolis-St Paul, Minnesota, and Dallas-Fort Worth, Texas, is summarized below:

City  Age Class  Age Mid  Count  Pop Size
msp   15-24        20        1   172,675
msp   25-34        30       16   123,065
msp   35-44        40       30    96,216
msp   45-54        50       71    92,051
msp   55-64        60      102    72,159
msp   65-74        70      130    54,722
msp   75-84        80      133    32,185
msp   85+          90       40     8,328
dfw   15-24        20        4   181,343
dfw   25-34        30       38   146,207
dfw   35-44        40      119   121,374
dfw   45-54        50      221   111,353
dfw   55-64        60      259    83,004
dfw   65-74        70      310    55,932
dfw   75-84        80      226    29,007
dfw   85+          90       65     7,538

We will first examine the relationship of cancer incidence to age by using the age midpoint as our continuous X variable, and only using the Minneapolis data (for now). The data set is available in the JMP data file skincancer.jmp from the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms.

Is there a relationship between the age of a cohort and the cancer incidence rate? Notice that a comparison of the raw counts is not very sensible because of the different sizes of the age cohorts. Most people would first STANDARDIZE the incidence rate, e.g. find the incidence per person by dividing the number of cancers by the number of people in each cohort. A plot of the standardized incidence rate by the mid-age of each cohort:

shows a curved relationship between the incidence rate and the mid-point of the age cohort. This suggests a theoretical model of the form:

Incidence = C e^(β × age)

i.e. an exponential increase in the cancer rates with age. This suggests that a log-transform be applied to BOTH sides, but a plot of the logarithm of the incidence rate against log(age midpoint):

is still not linear, with a dip for the youngest cohorts. There appears to be a strong relationship between the log(cancer rate) and log(age) that may not be linear, but a quadratic looks as if it could fit quite nicely, i.e. a model of the form:

log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual

Is it possible to include the population size directly? Expand the above model:

log(incidence) = β0 + β1 log(age) + β2 log(age)² + residual
log(count / pop size) = β0 + β1 log(age) + β2 log(age)² + residual
log(count) − log(pop size) = β0 + β1 log(age) + β2 log(age)² + residual
log(count) = log(pop size) + β0 + β1 log(age) + β2 log(age)² + residual

Notice that log(pop size) has a known coefficient of 1 associated with it, i.e. there is NO β coefficient associated with log(pop size). Also notice that log(POPSIZE) is known in advance and is NOT a parameter to be estimated. Variables such as population size are often called offset variables, and most packages expect the offset variable to be pre-transformed according to the link function used. In this case, the log link was used, so the offset is log(POPSIZE_age), as you will see in a minute.
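The offset mechanics can be sketched in code before writing the model formally. The offset log(POPSIZE) enters the linear predictor with a fixed coefficient of 1 and is never estimated; the cohort numbers and function name below are invented for illustration (they are not the Minnesota data):

```python
import math

def fit_poisson_offset(z, n, y, iters=25):
    """Newton-Raphson for log(mu_i) = log(n_i) + b0 + b1 * z_i.
    log(n_i) is the offset: fixed coefficient of 1, not estimated."""
    b0, b1 = math.log(sum(y) / sum(n)), 0.0   # start at the overall rate
    for _ in range(iters):
        mu = [ni * math.exp(b0 + b1 * zi) for ni, zi in zip(n, z)]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * zi for yi, mi, zi in zip(y, mu, z))
        h00 = sum(mu)
        h01 = sum(mi * zi for mi, zi in zip(mu, z))
        h11 = sum(mi * zi * zi for mi, zi in zip(mu, z))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical cohorts: z = log(age midpoint), n = population, y = cases.
ages = [20, 30, 40, 50, 60, 70]
z = [math.log(a) for a in ages]
n = [150000, 120000, 100000, 90000, 70000, 50000]
y = [2, 15, 31, 70, 100, 130]
b0, b1 = fit_poisson_offset(z, n, y)

# Per-person rate at age 40, and the expected COUNT for a cohort of 90000:
rate40 = math.exp(b0 + b1 * math.log(40))
expected40 = 90000 * rate40
```

Notice that the coefficients describe the rate, so a prediction of a raw count just multiplies the rate back up by the cohort's population size.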

Our GLIM model will then be:

Y_age distributed Poisson(µ_age)
φ_age = log(µ_age)
φ_age = log(POPSIZE_age) + β0 + β1 log(age) + β2 log(age)²

This can be rewritten slightly as:

φ_age = log(µ_age) = log(POPSIZE_age) + log(λ_age)

where

log(λ_age) = β0 + β1 log(age) + β2 log(age)²

So the modeling can be done in terms of estimating the effect of log(age) upon the incidence rate, rather than the raw counts, as long as the offset variable log(POPSIZE_age) is known. To perform a Poisson regression, first create the offset variable log(POPSIZE_age) using the formula editor of JMP. The Analyze->Fit Model platform launches the analysis:

Note that the raw count is the Y variable, and that the offset variable is specified separately from the X variables. The output is:

The goodness-of-fit statistic indicates no evidence of lack-of-fit, i.e. no need to adjust for over-dispersion. Based on the results of the Effect Test for the quadratic term, it appears that a linear fit may actually be sufficient, as the p-value for the quadratic term is almost 10%. The reason for this apparent non-need for the quadratic term is that the smaller age cohorts have very few counts, and so their actual incidence rates are very imprecisely estimated.

Finally, the Parameter Estimates section reports the estimated βs (remember these are on the log scale). Each line also tests if the corresponding population coefficient is zero. Because each of the X variables in the model is a single variable (i.e. not a category), the results of the parameter-estimate tests match the effect tests. Based on the output so far, it appears that we can drop the quadratic term. This term was dropped, and the model refit:

The final model is:

The predicted log(λ) for age 40 is found as:

log(λ)_age = β0 + β1 log(age)
log(λ)_40 = β0 + β1 log(40) = −8.04

This incidence rate is on the log scale, so the predicted incidence rate is found by taking anti-logs: e^(−8.04) = 0.000322, or 0.322 per thousand people, or 322 per million people. In order to make predictions about the expected number of cancers in each age cohort under this model, you would need to add back the log(POPSIZE) for the appropriate age class:

log(µ_40) = log(λ)_40 + log(POPSIZE_40) = 3.42

Finally, the predicted number of cases is simply the anti-log of this value:

Ŷ_40 = e^(log µ_40) = e^(3.42) = 30.6

Of course, this can be done automatically by the platform by requesting:

This also allows you to save the confidence limits for the average number of skin cancers expected for this age class, assuming the same population size (the mean confidence bounds), and the limits for the actual count (the individual confidence bounds). In this case, the expected number of skin cancer cases for this age class is about 30.6, with a 95% confidence interval for the mean number of cases reported alongside. The confidence bound for the actual number of cases (assuming the model is correct) is somewhere between 19 and 43 cases. By adding new data lines to the data table (before the model fit) with the Y variable missing but the age and offset variables present, you can make forecasts for any set of new X values. The residual plot:

isn't too bad; the large negative residual for the first age class (near where 0 skin cancers are predicted) is a bit worrisome. I suspect this is where the quadratic curve may provide a better fit. A plot of actual vs. predicted values can be obtained directly:

or by saving the predicted values to the data sheet and using the Analyze->Fit Y-by-X platform with Fit Special to add the reference line:

These plots show excellent agreement with the data.

Finally, it is nice to construct an overlay plot of the empirical log(rates) (the first plot constructed) with the estimated log(rate) and its confidence bounds as a function of log(age). Create the predicted log(rate) using the formula editor by subtracting log(POPSIZE) from the predicted skin cancer numbers (why?):

Repeat the same formula for the lower and upper bounds of the 95% confidence interval for the mean number of cases:

Finally, use the Graph->Overlay Plot platform to plot the empirical estimates, the predicted values of λ, and the 95% confidence interval for λ on the same plot:

and fiddle 8 with the plot to join up the predictions and confidence bounds but leave the actual empirical points as is, to give the final plot:

8 I had to turn on the connect-through-missing option under the red-triangle menu.

Remember that the point with the smallest log(rate) is based on a single skin cancer case and is not very reliable. That is why the quadratic fit was likely not selected.

23.7 ANCOVA models

Just like in regular multiple regression, it is possible to mix continuous and categorical variables and test for parallelism of the effects. Of course, this parallelism is assessed on the link scale (in most cases for Poisson data, the log scale). There is nothing new compared to what was seen with ordinary regression and logistic regression. The three appropriate models are:

log(λ) = X
log(λ) = X + Cat
log(λ) = X + Cat + X·Cat

where X is the continuous predictor and Cat is the categorical predictor. The first model assumes a common line for all categories of the Cat variable. The second model assumes parallel slopes but differing intercepts. The third model assumes separate lines for each category. Fitting would start with the most complex model (the third) and test for evidence of non-parallelism. If none were found, the second model would be examined, and a test would be made for common intercepts. Finally, the simplest model may be an adequate fit.

Let us return to the skin cancer data examined earlier in this chapter. It is of interest to see if there is a consistent difference in skin cancer rates between the two cities. Presumably Dallas, which receives more intense sun, would have a higher skin cancer rate. The data are available in the skincancer.jmp data set in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. Use all of the data. As before, log(population size) will be the offset variable. A preliminary data plot of the empirical cancer rate for the two cities:

shows roughly parallel responses, but the curvature is much more pronounced in Dallas. Perhaps a quadratic model should be fit first, with separate response curves for both cities. In shorthand model notation, this is:

log(λ) = City + log(age) + log(age)² + City·log(age) + City·log(age)²

where City is the effect of the two cities, log(age) is the continuous X variable, and the interaction terms represent the non-parallelism of the responses. This is specified as:

As before, use the Generalized Linear Model option of the Analyze->Fit Model platform, and don't forget to specify log(POPSIZE) as the offset variable. This gives the output:

The Whole Model Test shows evidence that the model has predictive ability. The Goodness-of-fit Test shows that this model is a reasonable fit (p-value around .30). The Effect Test shows that perhaps both of the interaction terms can be dropped, but some care must be taken, as these are marginal tests and cannot simply be combined. A Chunk Test, similar to that seen in logistic regression, can be done to see if both interaction terms can be dropped simultaneously:


The p-value is just above α = 0.05, so I would be a little hesitant to drop both interaction terms. On the other hand, some of the larger age classes have such large sample sizes and large count values that very minor differences in fit can likely be detected. The simpler model with two parallel quadratic curves was then fit:

This simpler model also has no strong evidence of lack-of-fit. Now, however, the quadratic term cannot be dropped. The parameter estimates must be interpreted carefully for categorical data. Every package codes indicator variables in different ways, and so the interpretation of the estimates associated with the indicator

variables differs among packages. JMP codes indicator variables so that estimates are the difference in response between the specified level and the AVERAGE of all levels. So in this case, the estimate associated with City[dfw] = .401 represents 1/2 the distance between the two parallel curves. Consequently, the difference in log(λ) between Minneapolis and Dallas is 2(.401) = .802 (SE = .052). This is a consistent difference for all age groups. This can also be estimated, without having to worry too much about the coding details, by doing a contrast between the estimates for the city effects:
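Because JMP's effect coding makes each city's coefficient a deviation from the average over both cities, the dfw-minus-msp contrast is simply twice the reported coefficient. A minimal arithmetic sketch using the City[dfw] estimate quoted above:

```python
# Effect coding for a two-level factor: the fitted city term is
# +b for dfw (Dallas) and -b for msp (Minneapolis).
b_city = 0.401  # City[dfw] estimate reported in the text

effect = {"dfw": +b_city, "msp": -b_city}

# Contrast dfw - msp on the log scale = twice the reported coefficient:
diff_log = effect["dfw"] - effect["msp"]
print(round(diff_log, 3))   # 0.802
```

This is why the coefficient itself is only half the vertical distance between the two parallel curves.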

This gives the same results as above. This is a difference on the log scale. As seen in an earlier chapter, this can be converted to an estimate of the ratio of incidence rates by taking anti-logs. In this case, Dallas is estimated to have e^.802 = 2.22 TIMES the skin cancer rate of Minneapolis. This is consistent with what is seen in the raw data. The SE of this ratio is found using an application of the delta method.⁹ The delta method indicates that the SE of an exponentiated estimate is found as

SE(e^θ̂) = SE(θ̂) e^θ̂

⁹ A form of a Taylor series expansion. Consult most books on mathematical statistics for details.
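Numerically, with the rounded values reported above (θ̂ = .802, SE = .052), the delta-method calculation can be sketched as below; because the inputs are already rounded, the final digits may differ slightly from those printed in the text:

```python
import math

# Log-scale estimate and SE of the Dallas-vs-Minneapolis difference (from the text):
theta, se_theta = 0.802, 0.052

ratio = math.exp(theta)                  # estimated rate ratio (about 2.23 here)
se_ratio = se_theta * math.exp(theta)    # delta method: SE(e^theta) = SE(theta) * e^theta

# 95% CI: build it on the log scale, then anti-log the endpoints
lo, hi = theta - 2 * se_theta, theta + 2 * se_theta
ci_ratio = (math.exp(lo), math.exp(hi))
print(round(ratio, 2), round(se_ratio, 2),
      round(ci_ratio[0], 2), round(ci_ratio[1], 2))
```

Note the order of operations: the interval is computed on the log scale first and anti-logged afterwards, which is the preferred construction in the text.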

In this case, SE(ratio) = .052 × 2.22 ≈ .11. Confidence bounds are found by finding the usual confidence bounds on the log scale and then taking anti-logs of the endpoints. In this case, the 95% confidence interval for the difference in log(λ) is (.802 − 2(.052), .802 + 2(.052)), or (.698, .906). Taking anti-logs gives a 95% confidence interval for the ratio of skin cancer rates of (2.01, 2.47). The residual plot (not shown) looks reasonable.

23.8 Categorical X variables - a designed experiment

Just as ANOVA is used to analyze data from designed experiments, generalized linear models can also be used to analyze count data from designed experiments. However, JMP is limited to designs without random effects, e.g. no GLIMs that involve split-plot designs.

Consider an experiment to investigate 10 treatments (a control vs. a 3×3 factorial structure for two factors A and B) for controlling insect numbers. The experiment was run in a randomized block design (see earlier chapters). In each block, the 10 treatments were randomized to 10 different trees. On each tree, a trap was mounted, and the number of insects caught in each trap was recorded. Here is the raw data. (This example is from SAS for Linear Models, 4th Edition; data extracted from samples/a56655.)

Block  Treatment  A  B  Count

The data are available in the JMP data file insectcount.jmp in the Sample Program Library at stat.sfu.ca/~cschwarz/stat-650/notes/myprograms. The RCB model was fit using a generalized linear model with a log link:

Count_i distributed Poisson(µ_i)
φ_i = log(µ_i)
φ_i = Block + Treatment

where the simplified syntax Block and Treatment refers to the block and treatment effects. Both Block and Treatment are categorical and will be translated to sets of indicator variables in the usual way. This model is fit in JMP using the Analyze->Fit Model platform. Note that the block and treatment variables must be nominally scaled. There is NO offset variable as the insect cages were all of equal size. This produces the output:

The Goodness-of-Fit test shows strong evidence that the model doesn't fit, as the p-values are very small. Lack-of-fit can be caused by inadequacies of the actual model (perhaps a more complex model with block-by-treatment interactions is needed?), failure of the Poisson assumption, or using the wrong link function. The residual plot:

shows that the data are more variable than expected under a Poisson distribution (about 95% of the residuals should be within ±2). The base model and link function seem reasonable as there is no pattern to the residuals, merely an over-dispersion relative to a Poisson distribution. The adjustment for over-dispersion is made, as seen earlier, in the Analyze->Fit Model dialogue box:
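Numerically, the quasi-likelihood adjustment that this option applies amounts to estimating the over-dispersion factor ĉ from the Pearson statistic, inflating standard errors by √ĉ, and deflating chi-square test statistics by ĉ. The sketch below uses made-up Pearson and SE values chosen only so that ĉ works out to 3.5, matching the factor reported in the revised output; none of these numbers are taken from the actual JMP fit:

```python
import math

# Hypothetical lack-of-fit summary (illustration only):
pearson_chisq = 63.0   # sum of squared Pearson residuals
df = 18                # residual degrees of freedom
c_hat = pearson_chisq / df          # over-dispersion factor (3.5 here)

# Quasi-likelihood adjustments:
se_unadjusted = 0.21                        # a hypothetical reported SE
se_adjusted = se_unadjusted * math.sqrt(c_hat)   # SEs inflate by sqrt(c_hat)

chisq_unadjusted = 24.5                     # a hypothetical effect-test chi-square
chisq_adjusted = chisq_unadjusted / c_hat   # test statistics shrink by c_hat

print(round(c_hat, 2), round(se_adjusted, 3), round(chisq_adjusted, 2))
```

Point estimates are untouched by this adjustment; only the measures of uncertainty change.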

which gives the revised output:

Note that the over-dispersion factor ĉ = 3.5. The test statistics in the Effect Tests are adjusted by this factor (compare the chi-square for the treatment effects before and after adjusting for over-dispersion), and the p-values have been adjusted as well. The residuals have also been adjusted for the over-dispersion and now look more acceptable:

Note that the pattern of the residual plot doesn't change; all that the over-dispersion adjustment does is change the residual variance so that the standardization brings the residuals closer to 0. If you compare the parameter estimates between the two models, you will find that the estimates are unchanged, but the reported se are increased by a factor of √ĉ to account for over-dispersion.

As is the case with all categorical X variables, the interpretation of the estimates for the indicator variables depends upon the coding used by the package. JMP uses a coding where each indicator variable is compared to the mean response over all levels. Predictions of the mean response at levels of X are obtained in the usual fashion. The se will also be adjusted for overdispersion. However, it is now impossible to obtain prediction intervals for the ACTUAL number of events. By using the overdispersion factor, you are no longer assuming that the counts are distributed as a Poisson distribution; in fact, there is NO REAL DISTRIBUTION that has the mean-to-variance relationship implicitly assumed when using the overdispersion factor. Without an actual distribution, it is impossible to make predictions for individual events.

If comparisons among the treatment levels are of interest, it is better to use the built-in Contrast facilities of the package to compute the estimates and standard errors rather than trying to do this by hand. For example, suppose we are interested in comparing treatment 0 (the control) to the treatment with factor A at level 1 and factor B at level 1 (corresponding to treatment 1). The contrast is estimated as:


The estimated difference in log(mean) is −.34 (se .39), which corresponds to a ratio of e^−.34 = .71 of treatment 1 to control, i.e. on average, the number of insects in the treatment 1 traps is 71% of the number of insects in the control traps. An application of the delta method shows that the se of the ratio is computed as se(e^θ̂) = se(θ̂)e^θ̂ = .39(.71) = .28. However, there is no evidence of a difference in trap counts, as the standard error is large relative to the estimate.

A 95% confidence interval for the difference in log(mean) is found as −.34 ± 2(.39), which gives (−1.12, .44). Because the p-value was larger than α = .05, this confidence interval includes zero. When this interval is anti-logged, the 95% confidence interval for the ratio of mean counts is (.32, 1.55), i.e. the true ratio of treatment counts to control counts is between .32 and 1.55. Because the p-value was greater than α = .05, this interval contains the value of 1 (indicating that the ratio of counts could be 1:1).

It is also correct to compute a 95% confidence interval for the ratio using the estimated ratio ± its se. This gives .71 ± 2(.28), or (.15, 1.27). In large samples, these confidence intervals are equivalent. In smaller samples, there is no real objective way to choose between them.

23.9 Log-linear models for multi-dimensional contingency tables

In the chapter on logistic regression, k × 2 contingency tables were analyzed to see if the proportions of responses in the population that fell into the two categories (e.g. survived or died) were the same across the k levels of the factor (e.g. sex, or passenger class, or dose of a drug). The use of logistic regression is a special case of the general r × c contingency table, where observations are classified by r levels of a factor and c levels of a response. In a separate chapter, χ² tests were used to test the hypothesis of equal population proportions in the c levels of the response across all levels of the factor.
This is also known as the test of independence of the response from the levels of the factor. This can be generalized to the analysis of multi-dimensional tables using Poisson regression. In more advanced courses, you can learn how the two previous cases are special cases of this more general modelling approach. Consult Agresti's book for a fuller account of this topic.

23.10 Variable selection methods

To be added later.

23.11 Summary

Poisson regression is the standard tool for the analysis of smallish count data. If the counts are large (say in the order of hundreds), you could likely use ordinary or weighted regression methods without difficulty.

This chapter only concerns itself with data collected under a simple random sample or a completely randomized design. If the data are collected under other designs, please consult with a statistician for the proper analysis.

A common problem that I have encountered is data that have been pre-standardized. For example, data may be recorded as the number of tree stems in 100 m² test plots. These data could likely be modeled using Poisson regression. But then the data are standardized to a per-hectare basis. These standardized data are NO LONGER distributed as a Poisson distribution. It would be preferable to analyze the data using the sampling units that were used to collect the data, with an offset variable used to adjust for differing sizes of survey units.

A common cause of overdispersion is non-independence in the data. For example, data may be collected using a cluster design rather than a simple random sample. Overdispersion can be accounted for using quasi-likelihood methods. As a rule of thumb, overdispersion factors ĉ of 10 or less are acceptable. Very large overdispersion factors indicate other serious problems in the model. An alternative to the use of the correction factor is to use a different distribution, such as the negative binomial distribution.

Related models for this chapter are the zero-inflated Poisson (ZIP) models. In these models there is an excess number of zeroes relative to what would be expected under a Poisson model. The ZIP model has two parts: the probability that an observation will be zero, and the distribution of the non-zero counts. There is a substantial base in the literature on this model.

This is the end of the chapter.


More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling

Review for Final. Chapter 1 Type of studies: anecdotal, observational, experimental Random sampling Review for Final For a detailed review of Chapters 1 7, please see the review sheets for exam 1 and. The following only briefly covers these sections. The final exam could contain problems that are included

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

A Re-Introduction to General Linear Models

A Re-Introduction to General Linear Models A Re-Introduction to General Linear Models Today s Class: Big picture overview Why we are using restricted maximum likelihood within MIXED instead of least squares within GLM Linear model interpretation

More information

Regression Methods for Survey Data

Regression Methods for Survey Data Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear

More information

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION

LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION LAB 5 INSTRUCTIONS LINEAR REGRESSION AND CORRELATION In this lab you will learn how to use Excel to display the relationship between two quantitative variables, measure the strength and direction of the

More information

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal

Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal Hypothesis testing, part 2 With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal 1 CATEGORICAL IV, NUMERIC DV 2 Independent samples, one IV # Conditions Normal/Parametric Non-parametric

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Chapter 9. Correlation and Regression

Chapter 9. Correlation and Regression Chapter 9 Correlation and Regression Lesson 9-1/9-2, Part 1 Correlation Registered Florida Pleasure Crafts and Watercraft Related Manatee Deaths 100 80 60 40 20 0 1991 1993 1995 1997 1999 Year Boats in

More information

STATISTICS 141 Final Review

STATISTICS 141 Final Review STATISTICS 141 Final Review Bin Zou bzou@ualberta.ca Department of Mathematical & Statistical Sciences University of Alberta Winter 2015 Bin Zou (bzou@ualberta.ca) STAT 141 Final Review Winter 2015 1 /

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION

GROUPED DATA E.G. FOR SAMPLE OF RAW DATA (E.G. 4, 12, 7, 5, MEAN G x / n STANDARD DEVIATION MEDIAN AND QUARTILES STANDARD DEVIATION FOR SAMPLE OF RAW DATA (E.G. 4, 1, 7, 5, 11, 6, 9, 7, 11, 5, 4, 7) BE ABLE TO COMPUTE MEAN G / STANDARD DEVIATION MEDIAN AND QUARTILES Σ ( Σ) / 1 GROUPED DATA E.G. AGE FREQ. 0-9 53 10-19 4...... 80-89

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means

Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means Keppel, G. & Wickens, T. D. Design and Analysis Chapter 4: Analytical Comparisons Among Treatment Means 4.1 The Need for Analytical Comparisons...the between-groups sum of squares averages the differences

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

HOLLOMAN S AP STATISTICS BVD CHAPTER 08, PAGE 1 OF 11. Figure 1 - Variation in the Response Variable

HOLLOMAN S AP STATISTICS BVD CHAPTER 08, PAGE 1 OF 11. Figure 1 - Variation in the Response Variable Chapter 08: Linear Regression There are lots of ways to model the relationships between variables. It is important that you not think that what we do is the way. There are many paths to the summit We are

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and can be printed and given to the

More information

Package rsq. January 3, 2018

Package rsq. January 3, 2018 Title R-Squared and Related Measures Version 1.0.1 Date 2017-12-31 Author Dabao Zhang Package rsq January 3, 2018 Maintainer Dabao Zhang Calculate generalized R-squared, partial

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Bivariate data analysis

Bivariate data analysis Bivariate data analysis Categorical data - creating data set Upload the following data set to R Commander sex female male male male male female female male female female eye black black blue green green

More information

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in

More information

Business Statistics. Lecture 5: Confidence Intervals

Business Statistics. Lecture 5: Confidence Intervals Business Statistics Lecture 5: Confidence Intervals Goals for this Lecture Confidence intervals The t distribution 2 Welcome to Interval Estimation! Moments Mean 815.0340 Std Dev 0.8923 Std Error Mean

More information

CHAPTER 10. Regression and Correlation

CHAPTER 10. Regression and Correlation CHAPTER 10 Regression and Correlation In this Chapter we assess the strength of the linear relationship between two continuous variables. If a significant linear relationship is found, the next step would

More information

Loglikelihood and Confidence Intervals

Loglikelihood and Confidence Intervals Stat 504, Lecture 2 1 Loglikelihood and Confidence Intervals The loglikelihood function is defined to be the natural logarithm of the likelihood function, l(θ ; x) = log L(θ ; x). For a variety of reasons,

More information