BOOTSTRAPPING WITH MODELS FOR COUNT DATA


Journal of Biopharmaceutical Statistics, 21, 2011
Copyright Taylor & Francis Group, LLC

Bryan F. J. Manly
Western EcoSystems Technology, Inc., Laramie, Wyoming, USA

Received March 30, 2011; Accepted June 21, 2011
Address correspondence to Bryan F. J. Manly, Western EcoSystems Technology, Inc., 200b South 2nd Street, Laramie, WY 82070, USA; bmanly@west-inc.com

Two methods of bootstrap resampling are discussed with log-linear models for count data. The first involves the resampling of observations and the second involves the resampling of Pearson residuals taking into account changes in the distribution of residuals associated with the expected values of counts. The use of both methods is illustrated on two data sets; one data set concerns the number of ear infections of swimmers related to whether they are frequent swimmers or not and three other variables, and the other data set concerns the number of visits to a doctor made in the last 2 weeks related to the age of subjects and 10 other variables. A third data set on the number of marine mammal interactions in different years and fishing areas is also used as an example. In this case only the second bootstrap method can be used because the nature of the data allows the bootstrap resampling of observations to produce sets of data that could not have occurred in practice. Simulation results indicate that the bootstrap results are slightly better than the results from a conventional analysis for the first data set, and much better than the results from a conventional analysis for the second data set, but a conventional analysis works well for the third data set while there are problems with bootstrap analyses.

Key Words: Bootstrap resampling; Computer-intensive methods; Generalized linear models; Log-linear models.

1. INTRODUCTION

Count data often occur in practice and the need to model the data through a generalized linear model is common. For example, suppose that $Y_i$, the annual number of deaths from a disease in part of a country, is recorded for a number of years, together with the estimated population size in each year and the values of certain variables $X_1, X_2, \ldots, X_p$ that are thought to possibly be related to the incidence of the disease. Then there might be interest in fitting a model of the form

$$E(Y_i) = N_i \exp(\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}) \tag{1}$$

to the data, where $E(Y_i)$ is the expected number of deaths in year $i$, $N_i$ is the population size in year $i$, and $X_{ij}$ is the value for the $j$th variable in year $i$. This is an example of a generalized linear model (McCullagh and Nelder, 1989) where the expected count for a year is proportional to the population size multiplied by an exponential function of the predictor variables. Models of this type are also called log-linear models, although for some sets of data there is no equivalent to the population size so that the expected value of the count is just an exponential function of the predictor variables.

A common way to fit a model of the form of Eq. (1) involves assuming that the dependent variable has a Poisson distribution, in which case the model can be fitted by maximum likelihood using many statistical packages. Alternatively, as often happens, there may be more or less variation in the Y variable than expected from the Poisson distribution, for which the variance of the observation $Y_i$ is equal to its expected value from Eq. (1). In that case a model allowing for this underdispersion or overdispersion can be fitted by assuming that the variance of $Y_i$ is the expected value multiplied by a constant using quasi-maximum likelihood (McCullagh and Nelder, 1989). Other possibilities involve assuming a particular distribution for the errors in the model, such as a negative binomial distribution, a zero-inflated Poisson distribution, or a zero-inflated negative binomial distribution as was done by Horton et al. (2007) when analyzing the data on alcohol consumption from a randomized clinical trial, and fitting the model using maximum likelihood.

In this article it is suggested that rather than search for an appropriate model for the error distribution in a log-linear model it is simpler to fit the Poisson model with an allowance for underdispersion or overdispersion using quasi-maximum likelihood and use bootstrap resampling to allow for a possibly non-Poisson error distribution. One bootstrap method that can then be used involves the resampling with replacement of the individual observations in the original data to get a bootstrap set of data that represents an alternative set of data that might have occurred instead of the observed data.
This is appropriate when the observations are effectively in a random order with no fixed structure in the X variables. The second bootstrap method considered here is more complicated. It involves determining bootstrap sets of data by fixing the X variables at their observed values and determining new Y values by stratified resampling of the residuals from the original fitted model that takes into account how the distribution of the residuals may depend on the expected values of counts. With both methods of resampling many bootstrap sets of data are generated and analyzed just like the original data in order to determine standard errors and significance levels for the estimates from the real data, and confidence limits for true parameter values. For the remainder of this article the estimation of regression parameters using quasi-maximum likelihood with an allowance for more or less variation in the counts than expected from the Poisson distribution is referred to as the conventional analysis. As well as the estimates of parameters, it provides standard errors for those estimates, which can be used with t-tests to determine whether a regression estimate is significantly different from zero or some other hypothetical value. The first bootstrap method, which will be called resampling of cases, then provides alternative estimates of the regression standard errors and the generated bootstrap distribution of t-statistics can be used to assess whether an observed t-statistic based on a bootstrap standard error is significantly different from zero. The second bootstrap method, which is called resampling of residuals, also provides alternative estimates of the regression standard errors and again the generated bootstrap distribution of t-statistics can be used to assess whether an observed t-statistic based on a bootstrap standard error is significantly different from zero.
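As an illustration of the conventional analysis and of the resampling of cases described above, the following is a minimal sketch in Python with NumPy. It is not the GenStat procedure used in this article. The function names `fit_loglinear` and `bootstrap_cases` are hypothetical, the fit uses a plain iteratively reweighted least squares loop, and the dispersion is taken as the residual deviance divided by the residual degrees of freedom.

```python
import numpy as np

def fit_loglinear(X, y, offset=None, n_iter=100, tol=1e-10):
    """Quasi-likelihood fit of E(y) = exp(offset + X @ beta) by IRLS.

    Returns (beta, std_err, dispersion). The dispersion is the residual
    deviance divided by the residual degrees of freedom (an assumption
    of this sketch), and the standard errors are scaled by it.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    offset = np.zeros(n) if offset is None else np.asarray(offset, dtype=float)
    mu = y + 0.5                        # crude positive starting values
    eta = np.log(mu)
    beta = np.zeros(k)
    for _ in range(n_iter):
        z = eta - offset + (y - mu) / mu        # working response
        xtw = X.T * mu                          # Poisson working weights are mu
        beta_new = np.linalg.solve(xtw @ X, xtw @ z)
        eta = np.clip(offset + X @ beta_new, -30.0, 30.0)
        mu = np.exp(eta)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # Poisson deviance, with y * log(y / mu) taken as 0 when y == 0
    safe_y = np.where(y > 0, y, 1.0)
    deviance = 2.0 * np.sum(y * np.log(safe_y / mu) - (y - mu))
    dispersion = deviance / (n - k)
    cov = dispersion * np.linalg.inv((X.T * mu) @ X)
    return beta, np.sqrt(np.diag(cov)), dispersion

def bootstrap_cases(X, y, n_boot=1000, seed=0):
    """Resample whole observations (rows) with replacement and refit."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    estimates = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # n draws from 0..n-1
        estimates[b] = fit_loglinear(X[idx], y[idx])[0]
    return estimates
```

The bootstrap standard errors are then the column standard deviations of the returned array, for example `bootstrap_cases(X, y).std(axis=0, ddof=1)`. The sketch has no safeguard for a resample in which a predictor has no variation, which is the failure described later for the marine mammal data.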

The next section of this article describes three example sets of data. The two bootstrap methods are then explained and illustrated on these data sets, and finally some general conclusions are presented. Model selection questions are not addressed with these examples. Rather, the emphasis is on the properties of estimates of the parameters for an assumed model.

2. EXAMPLE SETS OF DATA

The first example set of data concerns ear infections in swimmers and comes from the 1990 Pilot Surf/Health Study of the New South Wales Water Board in Australia. The results of a survey of 287 young swimmers are available (StatSci.Org., 2011). Each of the swimmers was asked how many ear infections they had experienced, which is the dependent count variable. The potential predictor variables recorded at the same time are: Swimmer, whether they were a frequent swimmer (1) or not (0); Beach, whether they usually swam at the beach (1) or not (0); Age, whether their age range was 15 to 19 (1), 20 to 24 (2), or 25 to 29 years (3); and Sex, whether they were female (0) or male (1). The question of interest was whether the frequency of ear infections is related to any of the other recorded variables. A summarized analysis provided with the data at the StatSci.org website suggests that the only important variables are Swimmer and Beach, with individuals tending to report fewer ear infections if they are frequent swimmers who usually swim at the beach.

The second example set of data is much larger, with 5190 observations and 12 predictor variables. These data were used as an example in chapter 3 of the book by Cameron and Trivedi (1998) and concern the number of visits to a medical doctor in the 2 weeks before individuals were interviewed for an Australian Health Survey.
The dependent variable is the number of doctor visits and the potential predictor variables are: Sex, 1 for female and 0 for male; Age, in years divided by 100; Age2, Age squared; Income, in Australian dollars divided by 1000; LevyPlus, 1 if the person was covered by private health insurance or otherwise 0; FreePoor, 1 if covered by the government because of low income or otherwise 0; FreeRepa, 1 if covered by the government because of old age or a disability otherwise 0; Illness, the number of illnesses in the previous 2 weeks with a maximum of 5; ActDays, number of days of reduced activity in the previous 2 weeks because of illness or injury; HScore, a general Goldberg health questionnaire score with a high score for poor health; ChCond1, 1 with a chronic condition not limiting activity or otherwise 0; and ChCond2, 1 with a chronic condition limiting activity or otherwise 0. The website provides the data. The interest in this case is in how the number of doctor visits is related to the 12 predictor variables, with an initial analysis indicating a significant relationship for six of these variables. The third set of data comes from a nonmedical area, and was chosen as an example of a set of data with a high proportion of zero counts. It concerns fisheries interactions with marine mammals for a long-line fishery in New Zealand. The dependent variable is the number of marine mammals observed to be caught in fishing nets by government observers (the fisheries bycatch) on fishing trips in different areas and different years. There are 34 observed counts, with the number of fishing days in an area and year varying from 1 to 192. The potential predictor

variables are: FMA, the Fisheries Management Area, of which there were five; and Year, the fishing year from 1997 to 2003. The data are available in Table 3.16 of Manly (2009) but for this example the results for different target species of fish have been combined. The interest with this example is whether the marine mammal bycatch rate per fishing day was particularly high in one or more of the fisheries management areas, or in one or more of the years. If so, the reasons for the high bycatch rates become of interest.

3. BOOTSTRAP RESAMPLING OF CASES

Bootstrap resampling of cases involves producing a bootstrap set of data by resampling the observations in the real data, with replacement, to get new sets of data that represent alternative sets of data that might have occurred instead of the data actually observed. For example, with the first set of data described earlier there are results for 287 young swimmers on the number of ear infections that they reported and four variables that the ear infections might be related to. Producing a bootstrap sample involves randomly selecting the results for one of the 287 swimmers to provide the first observation, randomly selecting one of the 287 swimmers to provide the second observation, and so on until 287 selections have been made. The bootstrap sample is then expected to contain some of the swimmers from the original sample no times, some one time, some two times, and so on. Repeating the bootstrap sampling process many times yields many bootstrap samples that are assumed to represent alternative samples that might have occurred instead of the observed sample, and these bootstrap samples can be used to estimate the properties of the estimation process, such as the standard errors of the estimated regression parameters.

4. BOOTSTRAP RESAMPLING OF RESIDUALS

An alternative to resampling the observations involves resampling the residuals from the fitted model for count data (Moulton and Zeger, 1991), with stratified sampling being used to allow for possible differences in the distribution of residuals with different expected counts (Davison and Hinkley, 1997, section 7.2). As described by Manly and Chotkowski (2006), this involves fitting the log-linear model of interest to the available data using the standard quasi-maximum likelihood approach with an allowance for overdispersion. The Pearson residuals are then calculated, with the residual for the $i$th observed count $Y_i$ being

$$R_i = \frac{Y_i - E(Y_i)}{\sqrt{E(Y_i)}} \tag{2}$$

where $E(Y_i)$ is the expected value of $Y_i$ from the fitted model. The $n$ residuals are then put in order based on the values for $E(Y_i)$ and divided into $m$ groups with approximately $n/m$ residuals in each group, with the first group containing residuals with the smallest values for $E(Y_i)$ and the last group containing the residuals with the largest values for $E(Y_i)$.

To generate a bootstrap value for the $i$th observation a residual is randomly selected from those in the group that includes the value of $E(Y_i)$. Assume that this

is $R_i^*$. This is then set equal to the residual for this observation in the bootstrap set of data so that

$$R_i^* = \frac{Y_i^* - E(Y_i)}{\sqrt{E(Y_i)}}$$

where $Y_i^*$ is the bootstrap value for the count. Rearranging this equation then gives the bootstrap count to be

$$Y_i^* = E(Y_i) + R_i^* \sqrt{E(Y_i)} \tag{3}$$

To make this a count it is replaced by the maximum of zero and the integer part of $Y_i^*$. Generating a count for all of the observations in the original data in this way results in a bootstrap set of data with the values of the predictor variables exactly the same as for the original data and only the count values changed.

A modification to this procedure is made for very low expected counts of 0.01 or less. For these a random number between zero and one is generated and the bootstrap count is set at one if the random number is less than the expected count, and at zero otherwise. This then gives the correct expected count.

5. RESULTS

5.1. The Ear Infection Data

The conventional analysis of the ear infection data results in the estimated equation

$$E(\text{Infections}) = \exp(\hat\beta_0 + \hat\beta_1\,\text{Swimmer} + \hat\beta_2\,\text{Beach} + \hat\beta_3\,\text{Age2} + \hat\beta_4\,\text{Age3} + \hat\beta_5\,\text{Sex}) \tag{4}$$

where E(Infections) is the expected number of infections, the coefficients are the estimates listed in the Est column of Table 1, Age2 and Age3 are indicators for the age ranges 20 to 24 and 25 to 29 years, and the other variables are as described earlier. This equation then makes the standard observation, with $E(\text{Infections}) = \exp(\hat\beta_0) = 2.78$, a casual swimmer, usually not swimming at the beach, with an age in the range 15 to 19 years, and female. Equation (4) includes all of the variables in the available data set and here only the estimation of this full equation is considered, without the removal of nonsignificant variables. This is because the interest with this set of data is in the properties of estimates with and without the use of bootstrapping, rather than in variable selection.
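The stratified resampling of Pearson residuals described in Section 4, that is Eqs. (2) and (3) together with the modification for very low expected counts, can be sketched as follows in Python with NumPy. This is an illustrative reading of the procedure rather than the author's code. The function name `stratified_residual_bootstrap` and its arguments are hypothetical, and the fitted expected counts `mu` are assumed to be strictly positive and supplied by a separate model fit.

```python
import numpy as np

def stratified_residual_bootstrap(y, mu, n_strata, rng, low=0.01):
    """Generate one bootstrap set of counts with the predictors held fixed.

    y        observed counts
    mu       expected counts from the fitted model (all > 0)
    n_strata number of groups m used to stratify the residuals
    rng      a numpy.random.Generator
    low      expected counts at or below this use the Bernoulli rule
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    resid = (y - mu) / np.sqrt(mu)                 # Pearson residuals, Eq. (2)

    # Order observations by expected count and cut into m groups of about n/m
    order = np.argsort(mu, kind="stable")
    strata = np.array_split(order, n_strata)

    y_star = np.empty(len(y))
    for members in strata:
        # Draw residuals with replacement from within the same stratum
        drawn = rng.choice(resid[members], size=len(members), replace=True)
        raw = mu[members] + drawn * np.sqrt(mu[members])   # Eq. (3)
        # Maximum of zero and the integer part
        y_star[members] = np.maximum(0.0, np.trunc(raw))

    # Very low expected counts: count is 1 with probability mu, else 0
    tiny = mu <= low
    y_star[tiny] = (rng.random(int(tiny.sum())) < mu[tiny]).astype(float)
    return y_star
```

Calling the function repeatedly with the same `rng` and refitting the model to each returned vector gives the bootstrap distribution of the estimates. `np.array_split` makes the group sizes differ by at most one, which matches the "approximately n/m residuals in each group" description.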
Equation (4) was obtained using the GenStat statistical package (VSN International, 2010) using quasi-maximum likelihood estimation for the standard log-linear model regression procedure for count data assuming a Poisson error model with an estimated overdispersion parameter (the residual deviance divided by the residual degrees of freedom) of 2.75 to allow for the variance of counts being larger than expected from Poisson distributions. With the stratified bootstrap sampling of residuals it is necessary to decide how many strata are needed to account for any changes in the distribution of residuals related to the expected values of the counts. Here the principle used is that

the number should be as small as possible while still taking into account how the residual distributions change with the expected values. A reasonable number in this respect can be determined from a plot of the observed residuals given by Eq. (4) against the values for the expected counts, as shown in Fig. 1. A logarithmic scale is used for the expected count in the figure because this often shows the changes in the distribution of residuals more clearly than the use of a linear scale. In the present example the residual distribution seems fairly constant except that the standardized residuals can be slightly more negative for expected counts above 1 than they can be for expected counts below 1. As the residual distribution is quite constant and there are 287 observations it was decided for the resampling of residuals to stratify the residuals into five strata based on their expected count, with about 57 residuals in each of the strata.

Figure 1 Standardized residuals plotted against the expected number of ear infections. (Color figure available online.)

Table 1 shows the estimated coefficients with their estimated standard errors, t-values (estimates divided by the standard errors) for testing whether the coefficients are significantly different from zero, and the significance of the t-values based on the t-distribution with 281 df from the conventional analysis. The table also shows the results from bootstrap resampling of the cases and bootstrap resampling of the residuals, with 5000 bootstrap samples used for both resampling methods. For the bootstrap analyses the table shows the bootstrap estimates of the standard errors of the coefficients, which are just the bootstrap standard deviations, and the significance of the t-values estimated as the proportion of bootstrap sets of data with absolute t-values as large as or larger than the observed absolute t-values.
The bootstrap t-distribution that was used to assess the significance of the t-values consisted of the values of (bootstrap estimate − bootstrap mean)/(bootstrap estimated standard error), to allow for the possibility of the bootstrap means of estimates differing from the estimates from the original data to some extent. Comparing the bootstrap results with those from the conventional analysis it is seen that the bootstrap standard errors for the estimated coefficients are all larger than the standard errors from the conventional analysis. As a result of the larger standard deviations from bootstrapping, the t-values from the original data are also found to be less significant from the bootstrap analyses than from the conventional analysis, with the two bootstrap analyses giving fairly similar results in this respect.
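The calculation of a bootstrap standard error and the bootstrap significance of a t-value can be sketched as below. This is a hedged illustration only. The function name `bootstrap_t_significance` is hypothetical, and the denominator of the bootstrap t-values is an assumption: by default the bootstrap standard deviation is used, while the optional `boot_se` argument allows the standard error estimated from each bootstrap set of data to be used instead, since the text can be read either way.

```python
import numpy as np

def bootstrap_t_significance(estimate, boot_estimates, boot_se=None):
    """Bootstrap standard error, observed t-value, and its significance.

    estimate        coefficient estimate from the original data
    boot_estimates  estimates of the same coefficient from the bootstrap sets
    boot_se         optional per-replicate standard errors (assumption: if
                    omitted, the bootstrap standard deviation is used)
    """
    boot_estimates = np.asarray(boot_estimates, dtype=float)
    sd = boot_estimates.std(ddof=1)          # bootstrap standard error
    denom = sd if boot_se is None else np.asarray(boot_se, dtype=float)
    t_obs = estimate / sd                    # tests a true value of zero
    # Centre on the bootstrap mean rather than on the original estimate
    t_star = (boot_estimates - boot_estimates.mean()) / denom
    # Proportion of bootstrap |t| values as large as or larger than observed
    signif = np.mean(np.abs(t_star) >= abs(t_obs))
    return sd, t_obs, signif
```

Centring on the bootstrap mean means that any small bias in the bootstrap estimates does not distort the reference distribution for the test.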

Table 1 Results from fitting Eq. (4) using the conventional log-linear method with an allowance for overdispersion and the results obtained from 5000 bootstrap resamples of cases (Bootstrap 1) and 5000 resamples of residuals (Bootstrap 2)

Columns: Conventional analysis (Est, Std Err, t-value, t-dist Signif); Bootstrap 1 (Std Err, Signif); Bootstrap 2 (Std Err, Signif)
Rows: Constant, Swimmer, Beach, Age (two indicator terms), Sex

Note. Values shown are the regression coefficient estimates from the original data (Est), the estimated standard errors (Std Err) of the estimates from the conventional analysis and the estimated values from the two bootstrap resampling methods, and the significance (Signif) of the t-values based on the t-tables and the two bootstrap resampling methods.

A simulation study was conducted to check these differences in the results from the different analyses. In total, 1000 sets of data similar to the observed data were simulated with the expected values of counts given by Eq. (4) and with the Poisson error inflated by the factor 2.75 using the zero-inflated count model where an observed count is either the value zero with probability $p$ or a random value from a Poisson distribution with probability $1 - p$ (Cameron and Trivedi, 1998, section 4.7.2). It was found that the standard errors from the conventional analysis were on average about 5% too low, the standard errors from bootstrap resampling of cases were on average about 3% too high, and the standard errors from bootstrap resampling of residuals were on average about equal to the standard deviations of the 1000 simulated regression estimates. It seems, therefore, that for the ear infection data the conventional analysis may tend to slightly underestimate the standard errors of regression estimates, and bootstrap resampling of cases may tend to slightly overestimate the standard errors, but bootstrap resampling of residuals shows little bias. The results shown in Table 1 are therefore what is expected from the simulation.
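One way to simulate counts of this kind is sketched below. The article does not state how the zero probability $p$ and the Poisson mean were chosen, so the parameterization here is an assumption: for a target mean $\mu$ and variance inflation factor $\phi \ge 1$, taking the Poisson mean as $\lambda = \mu + \phi - 1$ and $p = (\phi - 1)/\lambda$ gives a zero-inflated Poisson count with mean $(1 - p)\lambda = \mu$ and variance $\mu(1 + p\lambda) = \phi\mu$. The function name `simulate_zip` is hypothetical.

```python
import numpy as np

def simulate_zip(mu, phi, rng):
    """Zero-inflated Poisson counts with mean mu and variance phi * mu.

    Assumed parameterization (not taken from the article):
        lam    = mu + phi - 1      Poisson mean of the non-zero component
        p_zero = (phi - 1) / lam   probability of a structural zero
    Requires phi >= 1 and mu > 0.
    """
    mu = np.asarray(mu, dtype=float)
    lam = mu + phi - 1.0
    p_zero = (phi - 1.0) / lam
    counts = rng.poisson(lam)
    # Replace a random subset by structural zeros
    counts[rng.random(mu.shape) < p_zero] = 0
    return counts
```

With `mu` set to the expected counts from a fitted equation and `phi` set to the estimated dispersion, repeated calls give simulated data sets whose first two moments match the quasi-likelihood assumption.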
The simulated sets of data were also used to estimate the percentage of times that estimated regression coefficients would be within 95% confidence intervals. All the analyses performed well in that respect, with the observed coverage being 94.2% for the conventional analysis, 95.8% for bootstrap resampling of cases, and 94.2% for bootstrap resampling of residuals.

5.2. The Doctor Visits Data

The conventional analysis of the doctor visits data results in the estimated equation

$$E(\text{Visits}) = \exp(\hat\beta_0 + \hat\beta_1\,\text{Sex} + \hat\beta_2\,\text{Age} + \hat\beta_3\,\text{Age2} + \hat\beta_4\,\text{Income} + \hat\beta_5\,\text{LevyPlus} + \hat\beta_6\,\text{FreePoor} + \hat\beta_7\,\text{FreeRepa} + \hat\beta_8\,\text{Illness} + \hat\beta_9\,\text{ActDays} + \hat\beta_{10}\,\text{HScore} + \hat\beta_{11}\,\text{ChCond1} + \hat\beta_{12}\,\text{ChCond2}) \tag{5}$$

where E(Visits) is the expected number of doctor visits, the coefficients are the estimates listed in the Est column of Table 2, and the other variables are as described earlier. Cameron and Trivedi (1998) discussed the estimation of an equation relating the number of doctor visits to all of the variables shown in Eq. (5) even though some of them are not significantly different from zero at the 5% level, and here all of the variables are considered because the properties of estimation methods are of interest rather than the selection of variables.

The conventional analysis estimates that the variance of counts is what is expected from the Poisson distribution multiplied by a factor of less than one. If anything there is therefore less variation than expected from the Poisson distribution. This is not a problem with the quasi-maximum likelihood method but does mean that models like the negative binomial that allow for more variation than the Poisson distribution are not appropriate for these data.

For stratified resampling of standardized residuals it is necessary to decide on the number of strata to use for the 5190 residuals. Figure 2 shows that for these data there is considerable variation in the distribution of the residuals as the expected number of visits changes from about 0.07 to about 4.0. Given the relatively large number of observations and the large amount of variation in the residual distribution it was decided to use 20 strata, with about 260 residuals in each of these.

Table 2 has the same format as Table 1 and shows the estimated coefficients with their estimated standard errors, t-values for testing whether the coefficients are significantly different from zero, the significance of the t-values based on the t-distribution, the means and standard deviations of 5000 bootstrap estimates of the coefficients, and the significance of the t-values estimated from the bootstrap data.
Comparing the bootstrap results with those from the conventional analysis, it is seen that the bootstrap standard errors for the estimated coefficients are all larger than the standard errors from the conventional analysis, and the t-values are all less significant from the bootstrap analyses than from the conventional analysis, which is similar to the results that were obtained for the ear infection data. Figure 2 Standardized residuals plotted against the expected number of doctor visits. (Color figure available online.)

Table 2 Results from fitting Eq. (5) using the conventional log-linear method with an allowance for underdispersion and the results obtained from 5000 bootstrap resamples of cases (Bootstrap 1) and 5000 resamples of residuals (Bootstrap 2)

Columns: Conventional analysis (Est, Std Err, t-value, t-dist Signif); Bootstrap 1 (Std Err, Signif); Bootstrap 2 (Std Err, Signif)
Rows: Constant, Sex, Age, AgeSq, Income, LevyPlus, FreePoor, FreeRepa, Illness, ActDays, HScore, ChCond1, ChCond2

Note. Values shown are the regression estimates (Est), the estimated standard errors (Std Err) of the estimates from the conventional analysis and the estimated standard errors from the two bootstrap resampling methods, and the significance (Signif) of the t-values based on the t-tables and the two bootstrap resampling methods.

To examine whether these results occur with simulated data, 250 sets of data similar to the observed data were generated and analyzed in the same way as the observed data. Only 250 sets were generated, because the size of the original data set made the simulation of a data set and the conventional and bootstrap analyses a relatively slow process. The counts for the simulated data sets had expected values given by Eq. (5) with Poisson distributions because, if anything, the observed counts show less variation than is expected from this distribution. It was found that on average the regression standard errors from the conventional analysis were about 12% lower than the standard deviations of the simulated estimates while both bootstrap methods gave similar average estimates with little apparent bias. The simulation therefore indicates that the bootstrap analyses are more reliable than the conventional analysis in terms of the estimation of standard errors and the determination of the significance of regression coefficients.
These results are also reflected in the observed coverage of 95% confidence intervals for the simulated data, which was only 88% for the conventional analysis, 97% for bootstrap resampling of cases, and 94% for bootstrap resampling of residuals.

5.3. The Marine Mammal Interaction Data

The conventional analysis of the marine mammal interaction data results in the estimated equation

$$E(\text{Interactions}) = \text{Days} \times \exp(\hat\beta_0 + \hat\beta_1\,\text{FMA1} + \hat\beta_2\,\text{FMA3} + \hat\beta_3\,\text{FMA5} + \hat\beta_4\,\text{FMA6} + \hat\beta_5\,\text{Y1997} + \hat\beta_6\,\text{Y1998} + \hat\beta_7\,\text{Y1999} + \hat\beta_8\,\text{Y2000} + \hat\beta_9\,\text{Y2001} + \hat\beta_{10}\,\text{Y2002}) \tag{6}$$

where E(Interactions) is the expected number of marine mammal interactions, the coefficients are the estimates listed in the Est column of Table 3, Days is the number of fishing days, FMA1 is 1 for fishing in fisheries management area 1 or otherwise 0 (with similar definitions for FMA3, FMA5 and FMA6), and Y1997 is 1 for fishing in year 1997 or otherwise 0 (with similar definitions for Y1998 to Y2002). This then makes the standard observation, with $E(\text{Interactions}) = \text{Days} \times \exp(\hat\beta_0)$, apply for fishing in fisheries management area 7 in 2003, with the effects for the other fishing areas and years being estimated relative to the last fishing area and the last fishing year. The effect for fishing days was allowed for in the standard analysis by including the natural logarithm of Days as an offset in the argument for the exponential function. The conventional analysis estimates that the variances of the counts of the number of interactions are what is expected from the Poisson distribution multiplied by 3.00, so that there is considerable overdispersion in the data.

A problem arose when bootstrap resampling of the observations was attempted with this set of data because there are only 34 of these observations, with these being in five fisheries management areas and in seven years. Because of this the probability of a bootstrap set of data having no observations in one of the fisheries management areas or in one of the years is quite high. The estimation process will then fail because one of the predictor variables has no data. This happened immediately when bootstrap resampling of observations was attempted and hence no results with the resampling of observations are available. This problem did not occur with the first two sets of example data because of the much larger numbers of observations.

Figure 3 Standardized residuals plotted against the number of marine mammal interactions. (Color figure available online.)
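The size of this problem can be gauged with a short calculation. If a category (an area or a year) contains $k$ of the $n$ observations, the chance that a resample of $n$ cases contains none of them is $(1 - k/n)^n$, which for $k = 2$ and $n = 34$ is about 0.13. The sketch below estimates by simulation the chance that at least one category is empty. The allocation of the 34 observations to areas used in the usage note is hypothetical, because the actual allocation is not given here, and the function name `prob_empty_category` is also hypothetical.

```python
import numpy as np

def prob_empty_category(labels, n_boot, rng):
    """Monte Carlo estimate of the probability that a case-resampled data
    set of the same size has no observations in at least one category."""
    labels = np.asarray(labels)
    n = len(labels)
    # Each row holds the category labels of one bootstrap resample
    drawn = labels[rng.integers(0, n, size=(n_boot, n))]
    missing = np.zeros(n_boot, dtype=bool)
    for category in np.unique(labels):
        missing |= ~(drawn == category).any(axis=1)
    return missing.mean()
```

For example, with a hypothetical allocation of 2, 10, 3, 12, and 7 observations to five areas, `prob_empty_category(np.repeat(np.arange(5), [2, 10, 3, 12, 7]), 20000, np.random.default_rng(0))` estimates the chance that a resample drops at least one area. Applying the same check to the year labels, and allowing for either kind of category to be empty, can only increase the failure rate.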
For stratified resampling of standardized residuals it is necessary to decide on the number of strata to use for the 34 residuals. Figure 3 shows that for these data there is some variation in the distribution of the residuals as the expected count changes from about 0.1 to about 60, but given the relatively small number of observations it was decided to use just two strata, with 17 residuals in each of these.

Table 3 Results from fitting Eq. (6) using the conventional log-linear method with an allowance for overdispersion and the results obtained from 5000 resamples of residuals (Bootstrap 2)

Columns: Conventional analysis (Est, Std Err, t-value, t-dist Signif); Bootstrap 2 (Std Err, Signif)
Rows: Constant, FMA1, FMA3, FMA5, FMA6, Y1997, Y1998, Y1999, Y2000, Y2001, Y2002

Note. Values shown are the regression estimates (Est), the estimated standard errors (Std Err) of the estimates from the conventional analysis and the second bootstrap resampling method, and the significance (Signif) of the t-values based on the t-tables and the bootstrap resampling.

Table 3 shows the results obtained from the conventional analysis of the data and the analysis using stratified bootstrap resampling. The bootstrap standard errors are similar to those from the conventional analysis except for the estimates of the coefficients of FMA1 and FMA6. For these two fisheries management areas the bootstrap standard errors are higher than the estimates from the conventional analysis. The reason for this is that because of the small size of the data set and the relatively low expected values for these two fisheries management areas there were some bootstrap sets of data with zero counts for all of the observations in one of these areas. An equation was still estimated in these cases but the coefficient for the fisheries management area with zero counts was a large negative number, so that estimated expected frequencies for the bootstrap data were very close to zero for all the observations in this area. This is not unreasonable because the observed data suggest that zero counts for all observations in fisheries management area 1 or 6 could easily have occurred.

From the conventional analysis there are three coefficients in Eq. (6) that are significantly different from zero at the 5% level, for the variables FMA1, FMA3, and Y1999.
The bootstrap analysis also gives significance at this level for FMA1 and FMA3, but with considerably less significance, while Y1999 is no longer significant.

In total, 1000 sets of data similar to the observed data were simulated with the expected values of counts given by Eq. (6) and with the Poisson error inflated by the factor 3.00 using the same zero-inflated count model that was used for the simulation of the ear infection data. This showed that the conventional analysis works well with data like the mammal data. If anything it tends to be slightly conservative, with nominal 95% confidence limits giving 96.7% cover for the simulated data. The bootstrap method also gave reasonable results for the estimation of the regression coefficients other than those for FMA1, FMA3 and FMA6, with nominal 95% confidence intervals giving 95.8% cover for the simulated data. However, the bootstrap resampling did not give good results for the

coefficients of FMA1, FMA3 and FMA6 because of the large number of simulated sets of data and the large number of bootstrap sets of data generated for the simulated data sets where the estimated probability of an interaction in one or more of these areas was zero, with a resulting large negative coefficient for the variable. For these three variables the nominal 95% bootstrap confidence intervals gave 84% cover for FMA1, 89% cover for FMA3, and 85% cover for FMA6.

6. DISCUSSION

Three sets of count data have been analysed using the conventional method for fitting a log-linear model with variance inflation and using bootstrap analyses. With two of the data sets the results from bootstrap resampling of observations and stratified bootstrap resampling of Pearson residuals are available, while for the third set of data only stratified resampling of residuals could be used because of the small sample size.

For the first data set, on ear infections, the two bootstrap methods gave similar results in terms of the estimated standard deviation of the estimated regression parameters, and the significance of t-values for testing whether the true regression parameters are zero. However, the bootstrap standard errors are larger than those from the conventional analysis, and the t-values are all less significant for the bootstrap analyses than they are for the conventional analysis. A simulation study indicates that these results are what is expected with these data but in terms of confidence limits for true parameter values the three methods of analysis all have about the same performance overall.

For the second set of data, on the number of doctor visits, both bootstrap methods give similar results in terms of the estimated standard errors of estimated regression coefficients and the significance of the estimated coefficients from t-tests.
For both bootstrap methods the estimated standard errors of estimated regression coefficients are higher and the t-values are less significant than the conventional analysis suggests. A simulation study in this case indicates that the conventional analysis has some problems with data like these and that either of the bootstrap methods gives a more reliable analysis.

For the third set of data, on the number of fisheries interactions with marine mammals in different fisheries management areas in different years, bootstrap resampling of individual observations was not possible because the probability of getting no data for a fisheries management area or a year was quite high for a bootstrap set of data. Even if this were not the case, it can be argued that for the observed data there was at most one observation in a fisheries management area in a year, so that bootstrap sets of data with more than one observation in a fisheries management area in a year could not have occurred. In other words, bootstrap resampling of observations produces impossible data, and is therefore not a reasonable thing to do. In fact, any resampling of the data should maintain the structure of the observed data in terms of the sampling in fisheries management areas and years, and also keep the number of fishing days constant for an observation. This argument does not seem as strong for the examples on the number of ear infections and the number of doctor visits, because for those data there was presumably no structure on the predictor variables imposed by the way that the data were collected.
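The point about empty areas or years can be made concrete with a small calculation. The following is a minimal sketch only: the layout of six areas with one to three observations each is hypothetical and is not the marine mammal data, and the function names are introduced here for illustration. When n observations are resampled with replacement, a stratum holding n_k of them is missing from the bootstrap sample with probability (1 - n_k/n)^n, and the chance that at least one stratum is missing can be estimated by simulation.

```python
import numpy as np


def prob_stratum_missing(n_k, n):
    """Probability that a stratum with n_k of the n observations is absent
    from a bootstrap sample of size n drawn with replacement."""
    return (1.0 - n_k / n) ** n


def prob_any_stratum_missing(labels, n_boot=2000, seed=0):
    """Monte Carlo estimate of the probability that a bootstrap sample of
    observations leaves at least one stratum with no data."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = labels.size
    n_levels = np.unique(labels).size
    missing = 0
    for _ in range(n_boot):
        # Resample observation indices with replacement.
        sample = labels[rng.integers(0, n, size=n)]
        if np.unique(sample).size < n_levels:
            missing += 1
    return missing / n_boot


# Hypothetical layout (an assumption, not the paper's data):
# six areas with 1, 1, 2, 2, 3 and 3 observations.
labels = np.repeat(np.arange(6), [1, 1, 2, 2, 3, 3])

print("P(single-observation area missing):", prob_stratum_missing(1, labels.size))
print("P(at least one area missing):", prob_any_stratum_missing(labels))
```

With sparse layouts of this kind a large share of bootstrap samples cannot be fitted with the full set of area indicators, which is the practical difficulty described above.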

Bootstrap resampling of residuals keeps the structure of the observations constant in terms of the fisheries management areas, the years sampled, and the fishing effort in days, and sets of data with no observations for years or fisheries management areas cannot occur. This bootstrap method could therefore be applied with the data on the counts of marine mammal interactions with fishing. It was found that the results obtained by bootstrapping were very similar to those from the conventional analysis except for the coefficients of the variables FMA1 and FMA6, which indicate that fishing was in fisheries management areas 1 and 6. The bootstrap standard errors for these variables are much larger than the estimates from the conventional analysis because some bootstrap sets of data had zero counts in one or both of fisheries management areas 1 and 6, leading to large negative estimates for one or both of the coefficients of FMA1 and FMA6.

The simulation study for these data involved generating 1000 sets of data using Eq. (6) and analysing each set using the conventional analysis and the bootstrap resampling of residuals. The results were similar for seven of the variables in the equation, but the bootstrap results were poor for the three variables FMA1, FMA3, and FMA6 because so many of the bootstrap sets of data had all zero counts in one or more of these fishing areas. Therefore for this set of data there are problems with bootstrap analyses but the conventional analysis seems to have good properties.

Three sets of data were considered. For the first set the two bootstrap analyses appear to give slightly better results than a conventional analysis. For the second set of data the conventional analysis does not appear to give satisfactory results but the bootstrap methods seem to work well. For the third set of data there are problems with the bootstrap analyses but the conventional analysis appears to give reliable results.
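The reason that all-zero counts in an area give a large negative coefficient can be seen in a small sketch. This is an illustration under stated assumptions, not the analysis used for the paper: the counts are made up, the model has only an intercept and one area indicator, and the fitting routine is a plain Fisher scoring loop with a fixed number of iterations (the name fit_poisson_loglinear is introduced here). When every count in the area is zero the maximum likelihood estimate of the area coefficient is minus infinity, so the fitted value simply keeps decreasing until the iterations stop.

```python
import numpy as np


def fit_poisson_loglinear(X, y, n_iter=15):
    """Fit E(y) = exp(X @ beta) by Fisher scoring (iteratively reweighted
    least squares) with a fixed number of iterations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu      # working response for the log link
        XtW = X.T * mu               # Poisson working weights equal mu
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta


# Hypothetical design: four observations in a reference area and four in
# an area coded by an indicator variable (the second column).
X = np.column_stack([np.ones(8), [0, 0, 0, 0, 1, 1, 1, 1]])

y_some = [3, 5, 4, 6, 0, 1, 0, 0]   # one interaction in the indicated area
y_zero = [3, 5, 4, 6, 0, 0, 0, 0]   # no interactions in the indicated area

beta_some = fit_poisson_loglinear(X, y_some)
beta_zero = fit_poisson_loglinear(X, y_zero)

print("area coefficient, one non-zero count:", beta_some[1])
print("area coefficient, all zero counts:  ", beta_zero[1])
```

A bootstrap distribution that mixes finite estimates like the first with very large negative values like the second has a greatly inflated standard deviation, which is consistent with the large bootstrap standard errors reported for FMA1 and FMA6.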
The final conclusion suggested by these examples is therefore that a bootstrap analysis is more reliable than the conventional analysis for some sets of count data, but it can also be expected that in some cases, particularly with small data sets, a bootstrap analysis may give worse results than a conventional analysis.

REFERENCES

Cameron, A. C., Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge: Cambridge University Press.
Davison, A. C., Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.
Horton, N. J., Kim, E., Saitz, R. (2007). A cautionary note regarding count models of alcohol consumption in randomized clinical trials. BMC Medical Research Methodology 7:9.
Manly, B. F. J. (2009). Statistics for Environmental Science and Management. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC.
Manly, B. F. J., Chotkowski, M. (2006). Two new methods for regime change analyses. Archives in Hydrobiology 167:
McCullagh, P., Nelder, J. A. (1989). Generalized Linear Models. 2nd ed. London: Chapman and Hall.
Moulton, L. H., Zeger, S. L. (1991). Bootstrapping generalized linear models. Computational Statistics and Data Analysis 11:
StatSci.Org. (2011). OzDASL: Ear infections in swimmers. Data available at org/data/oz/earinf.html
VSN International. (2010). GenStat 13th edition. Available at


More information

STAT 526 Spring Final Exam. Thursday May 5, 2011

STAT 526 Spring Final Exam. Thursday May 5, 2011 STAT 526 Spring 2011 Final Exam Thursday May 5, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points will

More information

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators Multiple Regression Relating a response (dependent, input) y to a set of explanatory (independent, output, predictor) variables x, x 2, x 3,, x q. A technique for modeling the relationship between variables.

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics Linear, Generalized Linear, and Mixed-Effects Models in R John Fox McMaster University ICPSR 2018 John Fox (McMaster University) Statistical Models in R ICPSR 2018 1 / 19 Linear and Generalized Linear

More information

GLM models and OLS regression

GLM models and OLS regression GLM models and OLS regression Graeme Hutcheson, University of Manchester These lecture notes are based on material published in... Hutcheson, G. D. and Sofroniou, N. (1999). The Multivariate Social Scientist:

More information

Linear Mixed Models: Methodology and Algorithms

Linear Mixed Models: Methodology and Algorithms Linear Mixed Models: Methodology and Algorithms David M. Allen University of Kentucky January 8, 2018 1 The Linear Mixed Model This Chapter introduces some terminology and definitions relating to the main

More information

Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in the Niger Delta

Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in the Niger Delta International Journal of Science and Engineering Investigations vol. 7, issue 77, June 2018 ISSN: 2251-8843 Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in

More information

Testing and Model Selection

Testing and Model Selection Testing and Model Selection This is another digression on general statistics: see PE App C.8.4. The EViews output for least squares, probit and logit includes some statistics relevant to testing hypotheses

More information

USE OF STATISTICAL BOOTSTRAPPING FOR SAMPLE SIZE DETERMINATION TO ESTIMATE LENGTH-FREQUENCY DISTRIBUTIONS FOR PACIFIC ALBACORE TUNA (THUNNUS ALALUNGA)

USE OF STATISTICAL BOOTSTRAPPING FOR SAMPLE SIZE DETERMINATION TO ESTIMATE LENGTH-FREQUENCY DISTRIBUTIONS FOR PACIFIC ALBACORE TUNA (THUNNUS ALALUNGA) FRI-UW-992 March 1999 USE OF STATISTICAL BOOTSTRAPPING FOR SAMPLE SIZE DETERMINATION TO ESTIMATE LENGTH-FREQUENCY DISTRIBUTIONS FOR PACIFIC ALBACORE TUNA (THUNNUS ALALUNGA) M. GOMEZ-BUCKLEY, L. CONQUEST,

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

University, Tempe, Arizona, USA b Department of Mathematics and Statistics, University of New. Mexico, Albuquerque, New Mexico, USA

University, Tempe, Arizona, USA b Department of Mathematics and Statistics, University of New. Mexico, Albuquerque, New Mexico, USA This article was downloaded by: [University of New Mexico] On: 27 September 2012, At: 22:13 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered

More information

Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R

Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R 1 Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R Dimitris Karlis and Ioannis Ntzoufras Department of Statistics, Athens University of Economics and Business, 76, Patission

More information

Distribution Theory. Comparison Between Two Quantiles: The Normal and Exponential Cases

Distribution Theory. Comparison Between Two Quantiles: The Normal and Exponential Cases Communications in Statistics Simulation and Computation, 34: 43 5, 005 Copyright Taylor & Francis, Inc. ISSN: 0361-0918 print/153-4141 online DOI: 10.1081/SAC-00055639 Distribution Theory Comparison Between

More information

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/ Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/28.0018 Statistical Analysis in Ecology using R Linear Models/GLM Ing. Daniel Volařík, Ph.D. 13.

More information

PAPER 206 APPLIED STATISTICS

PAPER 206 APPLIED STATISTICS MATHEMATICAL TRIPOS Part III Thursday, 1 June, 2017 9:00 am to 12:00 pm PAPER 206 APPLIED STATISTICS Attempt no more than FOUR questions. There are SIX questions in total. The questions carry equal weight.

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

A Non-parametric bootstrap for multilevel models

A Non-parametric bootstrap for multilevel models A Non-parametric bootstrap for multilevel models By James Carpenter London School of Hygiene and ropical Medicine Harvey Goldstein and Jon asbash Institute of Education 1. Introduction Bootstrapping is

More information

Lecture (chapter 13): Association between variables measured at the interval-ratio level

Lecture (chapter 13): Association between variables measured at the interval-ratio level Lecture (chapter 13): Association between variables measured at the interval-ratio level Ernesto F. L. Amaral April 9 11, 2018 Advanced Methods of Social Research (SOCI 420) Source: Healey, Joseph F. 2015.

More information

In Defence of Score Intervals for Proportions and their Differences

In Defence of Score Intervals for Proportions and their Differences In Defence of Score Intervals for Proportions and their Differences Robert G. Newcombe a ; Markku M. Nurminen b a Department of Primary Care & Public Health, Cardiff University, Cardiff, United Kingdom

More information