Analysis of Proportions Data

Bradley Palmquist
Department of Political Science
Vanderbilt University
brad.palmquist@vanderbilt.edu

July

Prepared for the 1999 Annual Meeting of the Political Methodology Society, College Station, Texas, July 15-19, 1999.

Abstract

This paper discusses the modeling of over- and under-dispersion for grouped binary (proportions or frequency) data in binomial regression models (logit and probit). The sources of non-binomial variance in heterogeneity and dependence across or within units are described. Even when point predictions in conventional logit and probit analyses are only slightly affected, standard errors, and inferences based on them, will be incorrect. Particular attention is paid to the extended beta-binomial (EBB) model. Using the EBB distribution, the non-binomial variance can be explicitly modeled. Another benefit of modeling the non-binomial variance with the EBB distribution is that negative correlations, which lead to under-dispersion, can be accommodated. Although common in some disciplines, EBB in conjunction with binary regressions (logit, probit) has only recently been used in political science applications. Some example analyses are presented and the results are compared to those from other models and post-analysis fixes.

1 Introduction

It has been suggested (Lindsey 1995) that most applied statistical analysis involves dependent variables with discrete distributions and that therefore the textbook emphasis on continuous variables and standard regression techniques gives a student the wrong impression of what is ahead. Judging by recent methodological work, there may be some truth to this suggestion in political science. Much attention has been given lately to various logit and probit models of binary outcomes (ordered, multinomial, multivariate, conditional, nested) and to count models with non-negative integer dependent variables. The simplest count models are based on the Poisson distribution. Because it is a one-parameter distribution, the variance and the mean are necessarily functionally related, and in this case they are equal. Actual data often do not exhibit this mean-variance equality. The analyst may have theoretical reasons for expecting violations of the Poisson assumptions. Frequently it is simply observed that the variance either exceeds the expected value (overdispersion) or falls below it (underdispersion). Those who have studied count models have emphasized from the beginning the importance of paying attention to these phenomena and have developed a variety of more flexible models and estimation techniques to account for them (King 1989).

Political scientists have not paid as much attention to similar issues of over- and underdispersion in the case of sums of binary outcomes. There are a few exceptions. King (1989) briefly discusses extra-binomial dispersion and the use of the beta-binomial model. Cox and Katz (1999) use the beta-binomial model in a study of the effects of redistricting. Globetti (1997), Futter and Mebane (1996), and Baker and Scheiner (1999) are other recent examples. The object of this paper is to explain the issues and discuss the modeling and estimation techniques for non-binomial dispersion in the analysis of proportions (or frequency, or binary count) variables.
Of course, in some cases a straightforward binomial distribution may be assumed to characterize a sum of binary outcomes. But when there is parameter heterogeneity or dependence among the binary variables that make up the sums, a binomial model will not apply. Standard errors will be incorrectly estimated, comparisons of means or parameter values can be misleading, and estimates will not be efficient.

Previous research on count models can provide a guide. Mixture models, random-effects models, robust variance estimation, and heterogeneity factors have been used to extend the Poisson regression (log-linear) model. Analogous techniques exist for binomial regression models (typically logit and probit). Previous work in biostatistics provides much of the statistical machinery we require. For example, the (extended) beta-binomial model has been used in the toxicological research literature to model these non-binomial characteristics. The beta-binomial model is one of the central subjects of this paper.

But the received wisdom on count models can also provide a counterpoint. There are some unique features to the under- and over-dispersion in models of binary data. In particular the role of "heterogeneity" can be surprising. Another goal of this paper is to describe carefully the various sources of non-binomial variation and to clarify the way heterogeneity and dependence enter in. There are several possible connections between parameter heterogeneity and non-binomial variation, depending on the kind of heterogeneity being modeled. Accounting for heterogeneity across units involves a straightforward application of the beta-binomial model, where overdispersion is to be expected. Characterizing heterogeneity within units is a different problem. The consequence of this "heterogeneity" is underdispersion, not overdispersion.

2 Binomial or Grouped Binary Response Data

The data we are interested in are binary outcomes: success or failure, voting yes or no, elected or not elected, employed or unemployed, etc. When the outcomes can be modeled
individually, conventional logit, probit, and related techniques can be used. But when the individual binary outcomes are grouped, the usual assumptions of standard logit analysis may not be met. Instead of modeling the individual binary variables we focus on their sum, $Y$, which is the number of positive outcomes conditional on the number of trials, $n$, for that group. These sums of binary variables are indexed by the subscript $i$ ($i = 1, 2, \ldots, N$), so that $Y_i$ is the number of positive outcomes recorded for the $i$'th observational unit out of $n_i$ trials. If the binary variables recorded for the $i$'th unit have a common probability of success, $\pi_i$, and they are independent, then $Y_i$ is distributed binomially. This familiar distribution is

$$\Pr(Y_i = y_i \mid \pi_i, n_i) = \binom{n_i}{y_i} \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}.$$

The subscripts, $i$, are included to remind us from the beginning that we are not interested simply in a univariate distribution, but in a whole series of $Y_i$, each conditional on the modeling of their expectations.

The expectation of $Y_i$ is $n_i \pi_i$ and the binomial variance of $Y_i$ equals $n_i \pi_i (1 - \pi_i)$. Sometimes it is more natural to consider the proportion of successes, $p_i$. The expectation of this proportion is just $\pi_i$, with variance $\pi_i (1 - \pi_i)/n_i$. On occasion it makes sense to discuss the underlying components of the sum, the $n_i$ Bernoulli trials, which I label $U_{ij}$, but in most of what follows I focus on the $Y_i$ directly, since I assume that the data are produced or recorded in the grouped form and that the analyst does not have added information about the specific $U_{ij}$. For example, any potential independent variables have values that vary by the $i$'th unit and not by the $U_{ij}$ within each unit. An analogy to integer count data can be made. One typically models the Poisson sum, for example, not the individual outcomes which make up the sum.

If in fact the grouping is an artificial combination of independent cases that happen to
have the same values on measured independent variables, grouped binary variables provide no new challenges. If, however, the assumptions of independence and identical distribution are not met, $Y_i$ will not be distributed binomially. The issue of varying distributions will be taken up later. Lack of independence can result from two different mechanisms. The individual binary components of the observational unit can be directly correlated. Or this correlation can be induced as a result of heterogeneity across the $Y_i$. The next several sections deal mostly with this issue of heterogeneity.

3 Heterogeneity

The heterogeneity we consider in this section involves varying response probabilities for different individuals or observation units.[1] For the time being we maintain the assumption of a single response probability $\pi_i$, constant across each of the binary variables that make up the sum, $Y_i$. What can differ is $\pi_i$ and $\pi_{i'}$ for two different units. To understand the sense in which heterogeneity across cases can produce non-binomial variation in analyses of binary data, consider first the textbook linear regression setup of a continuous dependent variable.

[1] I emphasize again that I am not discussing the individual binary components, $U_{ij}$, of the binomial sums, $Y_i$. For our purposes the "individual" observation $i$ relates to the entire sum $Y_i$. To help avoid confusion I sometimes refer to this $i$'th "individual" as the $i$'th "observational unit."

In a regression analysis we attempt to account for varying $E(Y_i)$ by modeling these expectations as functions of the independent variables. Thus, to take the simplest example of a linear regression, we model the data as

$$E(Y_i \mid x_i) = x_i'\beta, \qquad V(Y_i \mid x_i) = \sigma^2,$$

where the variance of the $Y_i$ is assumed to be the same for different units. (Note that if the $x_i$ are regarded as fixed in repeated samples the expectations and variance need not be
stated in conditional form.) We can also rewrite the regression model in a form that isolates the variable part of $Y_i$ as a disturbance:

$$Y_i = x_i'\beta + e_i, \qquad E(e_i) = 0, \qquad V(e_i) = \sigma^2.$$

In order to do maximum likelihood analysis we would also need to assume a particular distribution for the $Y_i$, typically normal. At some points in the discussion we will do so, but for the moment this is not necessary. What we assume for the linear regression setup is that the distribution of $Y_i$ is continuous and unbounded.

There can be some ambiguity about just what the data generation process is in a regression model. If it makes sense to identify specific individuals in a population, or one can somehow distinguish the random outcomes by labeling, we can in principle distinguish two components of $e_i$. In addition to the fundamental variation (due to sampling, observation, or random process) for each individual, there can also be variation due to heterogeneity among the individuals or observational units, even after conditioning on the independent variables. Labeling the distinguishable individuals with the subscript $i$ and a sampling or outcome for that individual with the subscript $j$, we have

$$Y_{ij} = x_i'\beta + \nu_i + \epsilon_{ij}, \qquad E(\nu_i) = E(\epsilon_{ij}) = 0, \qquad V(\nu_i) = \sigma_\nu^2, \qquad V(\epsilon_{ij}) = \sigma_\epsilon^2.$$

Under this model each individual (or observed unit) has an expectation of $x_i'\beta + \nu_i$. The set of $Y_{ij}$ as a whole has expectation $x_i'\beta$. The $\nu_i$ can be considered individual-specific effects. This sort of setup is most familiar with panel data, where an individual (or country or firm) is observed on multiple occasions over time and therefore the subscript $j$ is conventionally $t$. The $\nu_i$ are then the familiar random effects. (We are ignoring here any questions of heteroskedasticity.) But even if all the data we have come from a single cross-section, the variation of the $Y_i$ includes that due to the $\nu_i$. For usual OLS and maximum likelihood
estimation with normal errors, this added source of variation is observationally equivalent to having a single disturbance term $e_i$ with variance equal to $\sigma_\nu^2 + \sigma_\epsilon^2$. And, importantly, it does not affect the way we estimate the main structural parameters of interest, $\beta$. The added uncertainty is naturally just absorbed into the estimate of $\sigma_e^2$.

The situation is different for the distribution of binary data. Additional variation across observation units on top of the fundamental binomial variation means that $V(Y_i)$ no longer equals $n_i \pi_i (1 - \pi_i)$.[2] Based on the previous discussion this should not be surprising. The problem is that the standard logit or probit setup does not provide a means for modeling this overdispersion due to heterogeneity. Rather than there being a free dispersion parameter like $\sigma_e^2$, as in the case of regression of a continuous and unbounded dependent variable, if we assume $Y_i$ to be distributed as binomial, then its variance is constrained to equal $n_i \pi_i (1 - \pi_i)$.

To model the dependence of $Y_i$ on $x_i$ we can still use the general form of a regression model. Typically, however, the expectation of $Y_i$ itself is not directly modeled. Rather, the systematic relationship with the regressors is between some function of $E(Y_i)$, here indicated by $\eta_i$:

$$\eta_i = x_i'\beta, \qquad \eta_i = g(\pi_i).$$

Consider the logit case:[3]

$$g(\pi_i) = \log\{E(Y_i)/(n_i - E(Y_i))\} = \log\{\pi_i/(1 - \pi_i)\}.$$

[2] This is not true if $n_i = 1$, that is, for ungrouped binary variables. More about this below.

[3] I discuss logit throughout, but the same points can be made for probit, complementary log-log, or other setups. Logit is the canonical link in the terminology of generalized linear models or exponential distributions and so allows simpler or more elegant derivations in some contexts.

As before, $E(Y_i) = n_i \pi_i$ and $V(Y_i) = n_i \pi_i (1 - \pi_i)$, and $E(p_i) = \pi_i$ and $V(p_i) = \pi_i (1 - \pi_i)/n_i$. In contrast to the linear regression setup, note that there is no disturbance term in the equation
for the systematic component, $\eta_i$. Nor is there a separate dispersion term in the expression for $V(Y_i)$. $Y_i$ varies only according to the variance function of $\pi_i$. In practice, however, many data sets of grouped binary variables display extra-binomial variance, meaning the variance of the $Y_i$ is observed to exceed the expected $n_i \hat\pi_i (1 - \hat\pi_i)$. If we continue to maintain the assumption that the individual components of the sum $Y_i$ are independently and identically distributed, conditional on $\pi_i$, then this extra-binomial variation, or overdispersion, must result from heterogeneity of the $\pi_i$ among the units. We have no place else to turn, since given fixed $n_i$, the binomial distribution has but a single parameter. The details of how this heterogeneity translates into overdispersion are in the next section.

In the (non-panel data) regression model of a continuous, unbounded dependent variable these additional sources of variation do not raise concerns, because we have no model-determined prior notion of a functional relationship between the mean and the conditional variance.

The same phenomenon of over- (and under-) dispersion does arise with regard to nonnegative integer count variables modeled by exponential Poisson regression.[4] The Poisson distribution is also a single-parameter distribution. The mean of $Y$ distributed as Poisson equals its variance. But again, in practice, many integer count variables do not match this model-determined constraint. The issues related to modeling such data have been intensively discussed by political scientists in recent years (see, for example, King).

[4] This is also referred to as the log-linear model (for example, McCullagh and Nelder 1989).

The difference between Poisson and binomial regressions on the one hand, and regressions of unbounded continuous variables on the other, is that if we believe we have modeled the conditional mean function appropriately, this has consequences for the expected conditional variance. This is true not only for panel data but also for single cross-sections. A similar
standard or benchmark does not exist for continuous variable regression unless panel data are available.

What are the sources of heterogeneity among the binomial variables, $Y_i$? The most commonly mentioned source is "omitted variables." The notion here is that $Y_i$ (in the continuous case) or $\eta_i = \log\{\pi_i/(1 - \pi_i)\}$ varies not just for random reasons, but with a systematic relationship to some additional variables not included in the model. In some tautological sense this must always be true. Presumably any model is a simplification. But the possible implication that the analyst should just go out and find those needed variables to eliminate overdispersion is of course not practical. Even if all of the systematic inter-individual variation could be conditioned on an increased list of independent variables, there might well remain random individual effects. In any case, simply including more variables will not always eliminate extra-binomial variance.

Let us assume that the analyst has included just those variables for which they want to estimate coefficients. What is not included are the unit-specific effects (random or fixed). The omission of unrelated variables in the continuous variable case (we are not discussing omitted variable bias here) neither changes the form of the likelihood equation (assuming normality) nor invalidates the estimated standard errors and techniques of statistical inference, since an estimate of the "total" $\sigma_e^2$ is made. For binary variables problems arise. A mixture of binomials is not in turn distributed binomially in the way that a mixture of random normals is in turn normal. The standard errors calculated by conventional logit assume a binomial error distribution. They will have incorrect magnitude, and testing procedures may be based on incorrect distributional assumptions. This can lead to incorrect comparisons of coefficients. Lastly, estimation of coefficients will be inefficient. Similarly, a mixture of Poisson variables is not in turn distributed as a Poisson.

We will see below that there are a number of "adjustments" that can be made to binomial (logit, probit) and Poisson regressions. But
in any case, the underlying probability distribution is changed and therefore any likelihood-based techniques must be altered.

These individual-specific effects arise because there are unmodeled influences that affect all the components of each binomial sum, but are independent of the same kind of effects on the other units. In the biological and agricultural experimental literature the canonical example is "litter" or "batch" effects (Kleinman 1973, Williams 1975, Crowder 1978, Haseman and Kupper 1979, Morgan 1992). Groups of animals (mice, cows, flour-beetles, chicken embryos), plants (plum-roots, Orobanche seeds), or insects (moths, houseflies) are subjected to some treatment. Litters or batches which are conceived or raised or handled together may have more in common than the levels of the experimental treatments. A "litter effect" is the tendency of members of a group to respond more alike than members of other groups (Haseman and Kupper 1978, 281).

In examples more directly related to political science, multiple responses forming a binomial sum (even if they are conditionally independent) recorded for a single politician, a single survey interviewee, a single administration, or a single country or other jurisdiction will often have something in common that responses for other units do not, namely a different, observation-specific probability. These individual effects (random or fixed) are in addition to the effects of $x_i$ included in the model for the expectation of $\pi_i$.

Other possible sources of overdispersion include measurement error, the wrong functional forms of independent variables, the wrong link function (e.g. complementary log-log instead of logit), and the presence of outliers. Although these are important practical concerns, this paper will focus primarily on heterogeneity among units and, in a later section, the logical dual of correlation between the components within units.
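The litter-effect mechanism described in this section can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: each unit's response probability is drawn from a beta distribution (one convenient choice among many), the parameter values are hypothetical, and the function name `simulate_group_counts` is invented here. It compares the observed variance of the grouped counts with the variance a binomial model would imply if every unit shared the average probability.

```python
import random
import statistics


def simulate_group_counts(num_units, n, a, b, rng):
    """Draw one success count per unit; each unit has its own probability.

    Assumption for illustration: unit probabilities follow a beta(a, b)
    distribution, and trials within a unit are conditionally independent.
    """
    counts = []
    for _ in range(num_units):
        pi_i = rng.betavariate(a, b)  # unit-specific ("litter") probability
        y_i = sum(rng.random() < pi_i for _ in range(n))  # n Bernoulli trials
        counts.append(y_i)
    return counts


rng = random.Random(1999)      # fixed seed so the run is repeatable
n, a, b = 10, 2.0, 2.0         # hypothetical trials per unit and beta parameters
counts = simulate_group_counts(20000, n, a, b, rng)

mean_pi = a / (a + b)
binomial_var = n * mean_pi * (1 - mean_pi)  # variance if all units shared mean_pi
observed_var = statistics.pvariance(counts)
dispersion_ratio = observed_var / binomial_var

print(f"binomial variance: {binomial_var:.2f}")
print(f"observed variance: {observed_var:.2f}")
print(f"ratio:             {dispersion_ratio:.2f}")
```

A ratio well above one indicates that the counts vary more than a single common probability would allow, even though every trial within a unit is conditionally independent.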

4 Overdispersion

In the previous section we noted that if the individual components, $U_{ij}$, of the sum $Y_i$ are independently and identically distributed, conditional on $\pi_i$, then overdispersion can only be generated through heterogeneity, by which we mean response probability variation among the binomial units (not within each). Let us begin to put more structure on the problem. Assume that each observed case, $i$, has an expected response probability of $\pi_i$. These $\pi_i$ are systematically related to the included $x_i$. To some extent, then, heterogeneity is accounted for in the model. Now, in addition, we allow the response probabilities to vary. One way to do this is formally just like the regression setup for continuous variables:

$$E(\pi_i) = F(x_i'\beta), \qquad V(\pi_i \mid x_i) = \sigma_i^2$$

or

$$\pi_i = F(x_i'\beta) + \epsilon_i, \qquad E(\epsilon_i) = 0, \qquad V(\epsilon_i) = \sigma_i^2.$$

(Note again that if the $x_i$ are nonstochastic, the variances need not be expressed in conditional form.) Allison has characterized this modeling of non-binomial variation as "including a disturbance into logit and probit regression models" (1987).[5]

[5] The disturbance introduced in this section is "external" in Allison's terminology. Below we consider a random effects model which introduces an "internal" disturbance term.

There are some contrasts with the setup for regression of a continuous variable. First, it is the response probabilities $\pi_i$, not $Y_i$ directly, that are modeled. Secondly, the functional relationship to the $x_i$ is not linear. $F(\cdot)$ is the inverse of the logit function, or the logistic distribution function:

$$F(x_i'\beta) = \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}}.$$
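As a concrete illustration of the logistic function $F$ and the logit link $g$, the following minimal sketch maps a linear predictor to a response probability and back. The coefficient and covariate values are hypothetical, and the function names `inv_logit` and `logit` are chosen here for illustration only.

```python
import math


def inv_logit(eta):
    """Logistic distribution function F(eta) = e^eta / (1 + e^eta).

    Written in two branches so that exp() never overflows for large |eta|.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def logit(pi):
    """Logit link g(pi) = log(pi / (1 - pi)), defined for pi strictly inside (0, 1)."""
    return math.log(pi / (1.0 - pi))


# Hypothetical coefficient vector and covariate row (intercept first).
beta = [-1.0, 0.5]
x_i = [1.0, 3.0]

eta_i = sum(b * x for b, x in zip(beta, x_i))  # linear predictor x_i' beta
pi_i = inv_logit(eta_i)                        # expected response probability

print(f"eta_i = {eta_i:.3f}, pi_i = {pi_i:.4f}")
```

Applying `logit` to `pi_i` recovers `eta_i`, which is the sense in which $F$ is the inverse of the logit function.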

Also, we allow for heteroskedasticity right from the start by subscripting $\sigma^2$. OLS regression is generally first presented with the assumption of homoskedasticity (i.e. constant $\sigma^2$ across individuals), but for these bounded random parameters $\pi_i$ that does not seem reasonable.

The realizations of the random variable $\pi_i$ are not observed. They are in this sense latent variables. What we do observe are the consequences for the $y_i$. Conditional on a specific realization of $\pi_i$,[6] the familiar expectation and variance of a binomial variable hold true:

$$E(Y_i \mid \pi_i) = n_i \pi_i, \qquad V(Y_i \mid \pi_i) = n_i \pi_i (1 - \pi_i).$$

[6] I am using $\pi_i$ to represent both the random variable and its realizations. One could use a capital $\Pi_i$ for the former and a lower-case $\pi_i$ for the latter, as for $Y_i$ and $y_i$, but that does not seem necessary.

What are the unconditional expectation and variance of $Y_i$? Standard results from conditional probability theory lead us to conclude

$$E(Y_i) = E[E(Y_i \mid \pi_i)] = E(n_i \pi_i) = n_i E(\pi_i) = n_i \tilde\pi_i,$$

where we have used $\tilde\pi_i$ to represent the expectation of the random response probability (it in turn is conditional on $x_i$). Again using standard conditional probability results, we note that the unconditional variance of $Y_i$ equals $E[V(Y_i \mid \pi_i)] + V[E(Y_i \mid \pi_i)]$. Deriving these two components separately we find that

$$\begin{aligned}
E[V(Y_i \mid \pi_i)] &= E[n_i \pi_i (1 - \pi_i)] \\
&= n_i E(\pi_i) - n_i E(\pi_i^2) \\
&= n_i \tilde\pi_i - n_i [\sigma_i^2 + \tilde\pi_i^2] \\
&= n_i \tilde\pi_i (1 - \tilde\pi_i) - n_i \sigma_i^2
\end{aligned}$$
and

$$V[E(Y_i \mid \pi_i)] = V(n_i \pi_i) = n_i^2 V(\pi_i) = n_i^2 \sigma_i^2.$$

Hence

$$V(Y_i) = n_i \tilde\pi_i (1 - \tilde\pi_i) - n_i \sigma_i^2 + n_i^2 \sigma_i^2 = n_i \tilde\pi_i (1 - \tilde\pi_i) + n_i (n_i - 1) \sigma_i^2.$$

We can reparameterize $\sigma_i^2$ as $\rho_i \tilde\pi_i (1 - \tilde\pi_i)$ to obtain

$$V(Y_i) = n_i \tilde\pi_i (1 - \tilde\pi_i) [1 + (n_i - 1) \rho_i].$$

Note that, for the moment, no assumptions are made about the individual $\rho_i$, except for the fact that $\rho_i$ cannot exceed 1.[7] Eventually, to estimate the parameters of this model, we will have to reduce their number either by assuming that all $\rho_i$ are equal or by modeling them as functions of a vector of independent variables which may or may not coincide with the variables in the logit equation for $\pi_i$.

[7] Maximal variance for the varying probabilities with expectation $\tilde\pi_i$ would be to have only the two possibilities of $\pi_i$ equal to 1 or 0, with probabilities $\tilde\pi_i$ and $1 - \tilde\pi_i$ respectively. (This is just like any Bernoulli variable.)

A couple of observations can be made about the variance of $Y_i$ as written. First, if the binomial sum has only one component Bernoulli variable, i.e. $n_i$ equals 1, then whether the response probability is a random $\pi_i$ with expectation $\tilde\pi_i$ or a fixed $\tilde\pi_i$, the variance of $Y_i$ is still just $n_i \tilde\pi_i (1 - \tilde\pi_i)$. This may seem surprising. How can the introduction of additional randomness not increase the variance of $Y_i$? The answer is that a single Bernoulli variable is fully characterized by its success probability, so a random $\pi_i$ with mean $\tilde\pi_i$ yields the same marginal distribution as a fixed $\tilde\pi_i$.

The second thing to be noted about the above expression is that it suggests an interpretation in terms of covariances among the component binary random variables, $U_{ij}$, within each
individual binomial sum, $Y_i$. There are $n_i$ contributions of $\tilde\pi_i (1 - \tilde\pi_i)$ and $n_i (n_i - 1)$ contributions of $\rho_i \tilde\pi_i (1 - \tilde\pi_i)$ to the variance of the sum, $Y_i$. This suggests the $n_i$ diagonal elements of an $n_i$ by $n_i$ covariance matrix representing $V(U_{ij})$ and the $n_i (n_i - 1)$ off-diagonal elements representing $\mathrm{Cov}(U_{ij}, U_{ik})$. On this interpretation, then, $\rho_i$ is the pairwise correlation between any two $U_{ij}$ components of $Y_i$. This is as far as we can go without further assumptions. In the next section we make some assumptions about $\rho_i$ and the distributional form of the $\pi_i$.

5 Beta-Binomial Model

The beta-binomial model is the first method we consider for explicitly accounting for overdispersion. It plays the same role for binomial regression that the negative binomial distribution does for Poisson regression. Both are mixture or compound distributions that accommodate variances which exceed the variance functions for the binomial ($n_i \pi_i (1 - \pi_i)$) and Poisson ($\lambda_i$).

Under the beta-binomial model, it is not assumed that all the $Y_i$ have the same binary response probability, $\pi_i$. Rather, the $\pi_i$ differ across observations even after any explanatory modeling has been done. The underlying probabilities, $\pi_i$, are postulated to follow a beta distribution. That is, for each binomial sum $Y_i$, $\pi_i$ is modeled as if drawn independently from a common beta$(a, b)$ distribution. Conditional on this $\pi_i$, then, each $Y_i$ is a binomial variable with response probability $\pi_i$. The probability $\pi_i$ varies across the observational or experimental units for which the binomial variables $Y_i$ are observed. Since $\pi_i$ must lie between 0 and 1, the beta distribution, which is also quite flexible, is a promising way to characterize this parameter heterogeneity:

$$f(\pi) = B(a, b)^{-1}\, \pi^{a-1} (1 - \pi)^{b-1},$$
where $a$ and $b$ are the conventional parameters of the beta distribution, and

$$B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b).$$

Prentice (1986) suggests a reparameterization where $\gamma = (a + b)^{-1}$ and $\pi = a(a + b)^{-1}$. This is also the parameterization that King (1989) uses. The standard expectation result for the beta distribution is

$$E(\pi) = \frac{a}{a + b} = \tilde\pi.$$

I replace Prentice's $\pi$ with $\tilde\pi$ as a reminder that this parameter is the expectation over the $\pi_i$. The variance of a beta distribution, using both parameterizations, is

$$V(\pi) = \frac{ab}{(a + b)^2}\,\frac{1}{a + b + 1} = \tilde\pi(1 - \tilde\pi)\,\frac{1}{1 + \gamma^{-1}} = \tilde\pi(1 - \tilde\pi)\,\gamma(1 + \gamma)^{-1}.$$

Our primary interest is in the consequences this parameter variation has on the observed $Y_i$. The unconditional, beta-binomial distribution of each $Y_i$ is

$$\Pr(Y_i = y_i \mid a_i, b_i, n_i) = \binom{n_i}{y_i} \frac{B(a_i + y_i,\; b_i + n_i - y_i)}{B(a_i, b_i)}$$

or

$$\Pr(Y_i = y_i \mid \tilde\pi_i, \gamma_i, n_i) = \binom{n_i}{y_i} \frac{\prod_{j=0}^{y_i - 1} (\tilde\pi_i + \gamma_i j)\; \prod_{j=0}^{n_i - y_i - 1} (1 - \tilde\pi_i + \gamma_i j)}{\prod_{j=0}^{n_i - 1} (1 + \gamma_i j)}.$$

The expectation and variance of $Y_i$ are $n_i a_i (a_i + b_i)^{-1}$, or $n_i \tilde\pi_i$, and $n_i a_i b_i (a_i + b_i)^{-2} [1 + (n_i - 1)(a_i + b_i + 1)^{-1}]$, or $n_i \tilde\pi_i (1 - \tilde\pi_i) [1 + (n_i - 1) \gamma_i (1 + \gamma_i)^{-1}]$. Defining $\rho_i = \gamma_i (1 + \gamma_i)^{-1}$ and rearranging, we can see that one interpretation of $\rho_i$ is as the within-unit pairwise correlation coefficient. Then $\rho_i n_i (n_i - 1)$ times the Bernoulli variance $\tilde\pi_i (1 - \tilde\pi_i)$ is the contribution from cross-terms that produces the extra-binomial variance. The expression for the variance in this form directly matches the general characterization of extra-binomial variation derived above.
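The product form of the distribution and its moments can be checked numerically. The sketch below is a minimal illustration, not a fitting routine: the function names are invented here, the parameter values are hypothetical, and the negative value of `gamma` is included only on the assumption that every factor in the products stays positive, which is the situation in which the extended form accommodates underdispersion.

```python
import math


def ebb_pmf(y, n, pi, gamma):
    """Pr(Y = y | pi, gamma, n) in the product (Prentice-style) parameterization."""
    numerator = 1.0
    for j in range(y):            # factors (pi + gamma * j), j = 0 .. y-1
        numerator *= pi + gamma * j
    for j in range(n - y):        # factors (1 - pi + gamma * j), j = 0 .. n-y-1
        numerator *= 1.0 - pi + gamma * j
    denominator = 1.0
    for j in range(n):            # factors (1 + gamma * j), j = 0 .. n-1
        denominator *= 1.0 + gamma * j
    return math.comb(n, y) * numerator / denominator


def ebb_moments(n, pi, gamma):
    """Total probability, mean, and variance by direct summation over 0..n."""
    probs = [ebb_pmf(y, n, pi, gamma) for y in range(n + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


def ebb_variance_formula(n, pi, gamma):
    """Closed form n*pi*(1-pi)*[1 + (n-1)*rho] with rho = gamma / (1 + gamma)."""
    rho = gamma / (1.0 + gamma)
    return n * pi * (1.0 - pi) * (1.0 + (n - 1) * rho)


# Hypothetical settings: overdispersed, underdispersed, and plain binomial.
for n, pi, gamma in [(10, 0.3, 0.25), (5, 0.5, -0.1), (8, 0.6, 0.0)]:
    total, mean, var = ebb_moments(n, pi, gamma)
    print(n, pi, gamma, round(total, 6), round(mean, 6), round(var, 6),
          round(ebb_variance_formula(n, pi, gamma), 6))
```

For each setting the summed variance can be compared with the closed-form expression; with `gamma` equal to zero the distribution reduces to the ordinary binomial.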

In this form the beta-binomial variance is

$$V(Y_i) = \tilde\pi_i (1 - \tilde\pi_i) [n_i + n_i (n_i - 1) \rho_i].$$

Thus the beta-binomial distribution provides a fully parameterized likelihood basis for modeling non-binomial variance. When $\gamma_i$ goes to 0 and the beta-binomial approaches the simple binomial, the correlation of pairs of response variables within the unit goes to 0. Conversely, when $\gamma_i$ goes to infinity and the distribution of $\pi_i$ approaches a binary random variable with no conditional variation in the $U_{ij}$, the correlation of pairs of response variables within the unit goes to 1, reflecting the fact that conditional on $\pi_i$ the $U_{ij}$ are either all 1s or all 0s with probability approaching 1. The effects of across-unit parameter heterogeneity cannot be separated from within-unit correlation, since they necessarily occur together in this model. In fact, capitalizing on the correlation induced by this two-step or hierarchical probability model, contagion effects can be modeled even though the simulation does not directly build in the correlation.

6 GLM and Quasi-Likelihood

Alternative approaches to modeling non-binomial variance for binary variables derive from the generalized linear models (GLM) literature. Rather than assume that we know the full likelihood for the distribution of the $Y_i$, as in the beta-binomial model, more limited assumptions are made, focusing on the relationship of the mean to the variance. The simplest version merely includes a "heterogeneity" factor which scales up the standard errors based on the estimated overdispersion.

The standard test for overdispersion is simply the conventional Pearson's $X^2$-statistic. For $N$ binomial sums, $Y_i$ ($i = 1, \ldots, N$), each with $n_i$ components, observed $p_i$, and predicted
$\hat\pi_i$ based on a model with $k$ parameters,

$$X^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \hat\pi_i)^2}{n_i \hat\pi_i (1 - \hat\pi_i)} = \sum_{i=1}^{N} \frac{n_i (p_i - \hat\pi_i)^2}{\hat\pi_i (1 - \hat\pi_i)},$$

which is distributed as a $\chi^2_{N-k}$ statistic.[8] To the extent that the estimated $X^2$ exceeds its expected value of $N - k$, and if other sources of this departure can be eliminated, this can provide a test of overdispersion. Furthermore the heterogeneity factor, $\phi$, is estimated by $X^2/(N - k)$. The estimate of the quantity we have called $\rho$ in the expression for $V(Y_i)$ in previous sections would then be

$$\hat\rho = \frac{X^2 - (N - k)}{\left(\sum n_i / N - 1\right)(N - k)}$$

(see Allison 1987). This is an approximate expression that does not account fully for different values of $n_i$ across the observations. Note that we are constrained to estimate a common value of $\rho$ for all $Y_i$. Williams (1982) presents a more elaborate iterated weighted least squares method that does account for the varying $n_i$. This technique does, however, also assume constant $\rho$ for all observations.

[8] See Collett (1991, 38) or McCullagh and Nelder (1989, 127).

7 Random Effects Models

In previous sections we discussed the addition of variation to the response probabilities after conditioning the $\pi_i$ on the $x_i$ or, in Allison's (1987) terminology, we introduced a disturbance which is "external" to the systematic relation between $\pi_i$ and the $x_i$. Instead, we might introduce an "internal" disturbance. Above, the disturbance was in the metric of the $\pi_i$:

$$\pi_i = F(x_i'\beta) + \epsilon_i, \qquad E(\epsilon_i) = 0, \qquad V(\epsilon_i) = \sigma_i^2.$$
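For the Pearson-based quantities of Section 6, the arithmetic can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `pearson_overdispersion` is invented here, the data are made up, and the fitted probabilities are taken to come from an intercept-only logit model (so that every fitted value equals the pooled proportion and $k = 1$).

```python
def pearson_overdispersion(y, n, pi_hat, k):
    """Return Pearson X^2, the heterogeneity factor, and the approximate rho estimate.

    y, n, pi_hat are equal-length sequences of successes, trials, and fitted
    probabilities; k is the number of estimated parameters in the mean model.
    """
    N = len(y)
    x2 = sum((yi - ni * pi) ** 2 / (ni * pi * (1.0 - pi))
             for yi, ni, pi in zip(y, n, pi_hat))
    df = N - k
    heterogeneity = x2 / df                      # X^2 / (N - k)
    n_bar = sum(n) / N                           # average number of trials
    rho_hat = (x2 - df) / ((n_bar - 1.0) * df)   # approximate, common rho
    return x2, heterogeneity, rho_hat


# Hypothetical grouped data: four units with ten trials each.
y = [2, 9, 5, 4]
n = [10, 10, 10, 10]
pooled = sum(y) / sum(n)          # intercept-only fit gives the pooled proportion
pi_hat = [pooled] * len(y)

x2, heterogeneity, rho_hat = pearson_overdispersion(y, n, pi_hat, k=1)
print(f"X^2 = {x2:.3f}, heterogeneity factor = {heterogeneity:.3f}, rho_hat = {rho_hat:.3f}")
```

Because the estimate of the common correlation uses only the average number of trials, it inherits the approximation noted above when the $n_i$ differ across units.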

An "internal" disturbance corresponds more closely to random effects models in other settings:

$$\pi_i = F(x_i'\beta + \nu_i), \qquad E(\nu_i) = 0, \qquad V(\nu_i) = \sigma_i^2.$$

Typically the $\nu_i$ are assumed to be normally distributed. Now it is the logit equation itself that resembles the standard regression setup, writing $\eta_i$ for $\log\{\pi_i/(1 - \pi_i)\}$:

$$\eta_i = x_i'\beta + \nu_i, \qquad E(\nu_i) = 0, \qquad V(\nu_i) = \sigma_i^2.$$

Allison (1987) describes an approximate method of estimating a random effects model for logistic regression. Williams (1982) discusses two other approximate methods. Lindsey (1995) describes a method based on Gauss-Hermite quadrature for estimating the random effects model.

8 Internally Variable Response Probabilities

Up to now, the discussion of heterogeneity has dealt only with variation in the response probabilities among the observational units. Another kind of heterogeneity we might consider is internal to, or within, each unit. This section highlights some previously unappreciated or misunderstood aspects of nonuniform probabilities for the component Bernoulli variables, the $U_{ij}$, that are summed to create $Y_i$. In this situation we have sums of binomials, not mixtures of them as in the previous discussion of the beta-binomial model. If the component response probabilities $\pi_{ij}$ vary, then the random sum $Y_i$ is no longer even conditionally binomial. A sum of binomials is in turn binomial only if the component probabilities $\pi_{ij}$ are equal (Feller 1968, 268, 282). The
result of this "internal heterogeneity" is to reduce, not increase, the unconditional variance of the sums $Y_i$. Note that these features of random sums of binomials should be distinguished from the Poisson case. The sum of Poisson variables with varying means, $\lambda_{ij}$, is in turn distributed as Poisson with expectation and variance equal to $\sum_{j=1}^{n_i} \lambda_{ij}$ (Feller 1968, 268). In the Poisson case, "internal heterogeneity" of this type produces neither over- nor under-dispersion. The more familiar result, that heterogeneity in Poisson models leads to a negative binomial distribution, applies to mixtures of Poisson distributions for modeling cross-unit heterogeneity.

We begin the discussion of sums of varying probabilities by considering the case of random, not fixed, $\pi_{ij}$. This might seem like a complication, but it extends the previous discussion in a natural manner. Our conclusion above was that if the $\pi_i$ had variance $\rho_i \tilde\pi_i (1 - \tilde\pi_i)$ then $V(Y_i) = n_i \tilde\pi_i (1 - \tilde\pi_i) [1 + (n_i - 1) \rho_i]$. We maintain the assumption of variable response probabilities with expectation $\tilde\pi_i$ and variance $\rho_i \tilde\pi_i (1 - \tilde\pi_i)$, but stipulate that each $Y_i$ is the sum of $J_i$ "clusters" of component binary variables. Each cluster now has its own response probability $\pi_{ij}$, distributed as before. To determine the unconditional mean and variance of $Y_i$ we again make use of standard conditional probability theory. For ease of explication we assume that the number of components in each of the $J_i$ "clusters" is the same. Then we can use $m = n_i/J_i$ for the cluster size. The average of the realized $\pi_{ij}$ for an observed $y_i$ is written as $\bar\pi_i$. The full set of $\pi_{ij}$ is denoted as $\pi_{i\cdot}$, not to be confused with the expected value of $\pi_{ij}$, which we continue to denote as $\tilde\pi_i$. First we need the conditional mean and variance:

$$E(Y_i \mid \pi_{i\cdot}) = \sum_{j=1}^{J_i} m \pi_{ij} = m J_i \bar\pi_i = n_i \bar\pi_i$$
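The conditional variance is

$$\begin{aligned}
V(Y_i \mid \pi_{i\cdot}) &= \sum_{j=1}^{J_i} m\pi_{ij}(1-\pi_{ij}) \\
&= \sum m\pi_{ij} - \sum m\pi_{ij}^2 \\
&= (mJ_i\bar\pi_{ij} - mJ_i\bar\pi_{ij}^2) - \Big(m\sum\pi_{ij}^2 - mJ_i\bar\pi_{ij}^2\Big) \\
&= mJ_i\bar\pi_{ij}(1-\bar\pi_{ij}) - m\sum(\pi_{ij}-\bar\pi_{ij})^2 \\
&= n_i\bar\pi_{ij}(1-\bar\pi_{ij}) - m(J_i-1)\sum(\pi_{ij}-\bar\pi_{ij})^2/(J_i-1).
\end{aligned}$$

Comparing this result to the conditional variance when there are no clusters, or more precisely where the whole sum is a single cluster, we see that this conditional variance is less by the amount of the second term. This expression is a more general result that includes the earlier one, since if the whole sum is a single cluster this term drops out. This in turn is an example of the general phenomenon that the variance of a sum of binomials with varying probabilities must in all cases be less than the variance of a sum of binomials each with probability equal to the average (Feller 1968, 231).

Now we can derive the unconditional mean and variance of $Y_i$. The unconditional mean is unchanged by this new structure:

$$E(Y_i) = E[E(Y_i \mid \pi_{i\cdot})] = E(n_i\bar\pi_{ij}) = n_i E(\bar\pi_{ij}) = n_i\tilde\pi_i.$$

The unconditional variance has two components. The first is

$$\begin{aligned}
E[V(Y_i \mid \pi_{i\cdot})] &= E\Big[mJ_i\bar\pi_{ij}(1-\bar\pi_{ij}) - m(J_i-1)\sum(\pi_{ij}-\bar\pi_{ij})^2/(J_i-1)\Big] \\
&= mJ_i\tilde\pi_i - mJ_i\big[\phi_i\tilde\pi_i(1-\tilde\pi_i)/J_i + \tilde\pi_i^2\big] - m(J_i-1)\phi_i\tilde\pi_i(1-\tilde\pi_i) \\
&= n_i\tilde\pi_i(1-\tilde\pi_i)(1-\phi_i).
\end{aligned}$$

The second line uses $E(\bar\pi_{ij}^2) = V(\bar\pi_{ij}) + \tilde\pi_i^2$, with $V(\bar\pi_{ij}) = \phi_i\tilde\pi_i(1-\tilde\pi_i)/J_i$ when the cluster probabilities are drawn independently, and the fact that the sample variance of the $\pi_{ij}$ has expectation $\phi_i\tilde\pi_i(1-\tilde\pi_i)$.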
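The decomposition of the conditional variance can be checked numerically. The sketch below computes the variance both directly, as a sum of per-cluster binomial variances, and through the decomposed form (the single-cluster binomial variance minus `m` times the sum of squared deviations of the cluster probabilities from their average). The function names and the example probabilities are made up for illustration.

```python
def conditional_variance_direct(m, cluster_probs):
    """Sum over clusters of m * pi_j * (1 - pi_j)."""
    return sum(m * p * (1.0 - p) for p in cluster_probs)


def conditional_variance_decomposed(m, cluster_probs):
    """n * pbar * (1 - pbar) minus m times the sum of squared deviations."""
    n_clusters = len(cluster_probs)
    n = m * n_clusters                       # total number of binary components
    p_bar = sum(cluster_probs) / n_clusters  # average realized probability
    squared_dev = sum((p - p_bar) ** 2 for p in cluster_probs)
    return n * p_bar * (1.0 - p_bar) - m * squared_dev


if __name__ == "__main__":
    probs = [0.2, 0.4, 0.6]  # made-up cluster probabilities
    m = 4                    # made-up cluster size
    print("direct:     ", conditional_variance_direct(m, probs))
    print("decomposed: ", conditional_variance_decomposed(m, probs))
    # With a single cluster the deviation term vanishes.
    print("one cluster:", conditional_variance_decomposed(12, [0.4]))
```

The two forms agree up to floating-point rounding, and the single-cluster call returns the ordinary binomial variance because there are no deviations to subtract.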

The second component of the unconditional variance is

$$V[E(Y_i \mid \pi_{i\cdot})] = V(mJ_i\bar\pi_{ij}) = m^2J_i^2\,\phi_i\tilde\pi_i(1-\tilde\pi_i)/J_i = mn_i\phi_i\tilde\pi_i(1-\tilde\pi_i).$$

Putting them together we get finally

$$V(Y_i) = n_i\tilde\pi_i(1-\tilde\pi_i)[1 + (m-1)\phi_i].$$

Again this is a more general result that subsumes our earlier one. If there is one uniform response probability for the $i$'th observation and, hence, $m$ equals $n_i$, we have the expression derived above, $V(Y_i) = n_i\tilde\pi_i(1-\tilde\pi_i)[1 + (n_i-1)\phi_i]$. Note also the inclusion of the implication made above, that if the number of "clusters", $J_i$, equals $n_i$, so that the effective "cluster size," $m$, equals one, then the term involving $\phi_i$ drops out and $Y_i$ is binomial again. So for example, if the data are produced by simple random sampling without clustering, then the variable $Y_i$ is indeed binomial and standard results follow.

9 Underdispersion as a Result of Nonuniform Probabilities

In the last section we considered response probabilities that varied within each $Y_i$ as a result of random draws of $\pi_{ij}$. Nonuniform probabilities might also, and more simply, be the result of what we might call fixed parameter heterogeneity, i.e. different $\pi_{ij}$ for each of the component $U_{ij}$. Using the same algebraic steps we used to derive the conditional variance of $Y_i$ in the previous section we get

$$\begin{aligned}
V(Y_i) &= \sum_{j=1}^{n_i} \pi_{ij}(1-\pi_{ij}) \\
&= n_i\bar\pi_{ij}(1-\bar\pi_{ij}) - \sum(\pi_{ij}-\bar\pi_{ij})^2 \\
&= n_i\bar\pi_{ij}(1-\bar\pi_{ij})[1-\psi_i],
\end{aligned}$$

where

$$\psi_i = \frac{\sum(\pi_{ij}-\bar\pi_{ij})^2}{n_i\bar\pi_{ij}(1-\bar\pi_{ij})}$$

is a relative measure of the internal parameter heterogeneity that can vary from 0 to 1. Since $\psi_i \ge 0$, $V(Y_i)$ will be less than the binomial variance if there is any internal heterogeneity. Note that $-\psi_i/(n_i-1) = \phi_i$, where $\phi_i$ equals the negative correlation which results between the component $U_{ij}$.

Inter-unit heterogeneity and positive dependence of the $U_{ij}$ components have long been recognized to be equivalent. As far as I know, no one has previously shown the connection between internal response probability variation (or intra-unit heterogeneity) and negative dependence among the component $U_{ij}$.
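The two variance results can be evaluated directly in a few lines. The sketch below is illustrative only: `clustered_variance` evaluates the Section 8 formula for random cluster probabilities, and `fixed_heterogeneity` computes the Section 9 quantities for a given set of fixed component probabilities (the variance, the relative heterogeneity measure, and the implied negative correlation). The function names, dictionary keys, and example numbers are hypothetical.

```python
def clustered_variance(n, m, pi_tilde, phi):
    """V(Y) = n * pi_tilde * (1 - pi_tilde) * [1 + (m - 1) * phi]."""
    return n * pi_tilde * (1.0 - pi_tilde) * (1.0 + (m - 1) * phi)


def fixed_heterogeneity(probs):
    """Variance of a sum of independent Bernoullis with fixed, unequal
    probabilities, plus the relative heterogeneity measure and the
    correlation that would reproduce the same variance."""
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two component probabilities")
    p_bar = sum(probs) / n
    binomial_var = n * p_bar * (1.0 - p_bar)
    if binomial_var <= 0.0:
        raise ValueError("average probability must lie strictly between 0 and 1")
    variance = sum(p * (1.0 - p) for p in probs)
    psi = sum((p - p_bar) ** 2 for p in probs) / binomial_var
    return {
        "variance": variance,
        "binomial_variance": binomial_var,
        "psi": psi,
        "implied_correlation": -psi / (n - 1),
    }


if __name__ == "__main__":
    # Random cluster probabilities: m = 1 recovers the binomial variance,
    # m = n recovers the single-cluster (beta-binomial type) variance.
    for m in (1, 4, 12):
        print("m =", m, "V(Y) =", clustered_variance(12, m, 0.4, 0.1))

    # Fixed heterogeneity: the variance falls below the binomial variance.
    print(fixed_heterogeneity([0.2, 0.4, 0.6, 0.8]))
```

With equal component probabilities the heterogeneity measure is zero and the variance equals the binomial variance; any spread in the probabilities makes the implied correlation negative.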

References

[1] Adams, Greg D. "Abortion: Evidence of an Issue Evolution." American Journal of Political Science. 41:

[2] Altham, Patricia M.E. "Two Generalizations of the Binomial Distribution." Applied Statistics. 27:

[3] Allison, Paul D. "Introducing a Disturbance into Logit and Probit Regression Models." Sociological Methods and Research. 15:

[4] Cameron, A. Colin, and Pravin K. Trivedi. Regression Analysis of Count Data. Cambridge University Press.

[5] Collett, David. Modeling Binary Data.

[6] Baker, Andrew, and Ethan Scheiner. "Smart Parties in a Rigged System: Party Strategy and the Effects of Malapportionment under Japanese STV." Working paper.

[7] Cox, Gary W., and Jonathan N. Katz. "The Reapportionment Revolution and Bias in U.S. Congressional Elections." American Journal of Political Science. 43:

[8] Crowder, Martin J. "Beta-binomial Anova for Proportions." Applied Statistics. 27:

[9] Feller, William. [1950] An Introduction to Probability Theory and Its Applications. New York: John Wiley & Sons.

[10] Globetti, Suzanne. "What We Know About 'Don't Knows': An Analysis of Seven Point Issue Placements." Paper presented at the poster session of the 1997 Political Methodology Meeting in Columbus, Ohio, July

[11] Haseman, J.K., and L.L. Kupper. "Analysis of Dichotomous Response Data from Certain Toxicological Experiments." Biometrics. 35:

[12] King, Gary. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge: Cambridge University Press.

[13] Kleinman, Joel C. "Proportions with Extraneous Variance: Single and Independent Samples." Journal of the American Statistical Association. 68:

[14] Lindsey, J.K. Modelling Frequency and Count Data. Oxford: Oxford University Press.

[15] Lindsey, J.K. Applying Generalized Linear Models. New York: Springer-Verlag.

[16] McCullagh, P., and J.A. Nelder. Generalized Linear Models. London: Chapman and Hall.

[17] Futter, Stacy, and Walter Mebane. "Developments in Rape Law Reform in the U.S. and Effects of Reform on Rape Reports and Arrests," Working paper.

[18] Prentice, R.L. "Binary Regression Using an Extended Beta-Binomial Distribution, With Discussion of Correlation Induced by Covariate Measurement Errors." Journal of the American Statistical Association. 81:

[19] von Mises, Richard. [1928] Probability, Statistics and Truth. New York: Dover Publications, Inc.

[20] Williams, D.A. "The Analysis of Binary Responses from Toxicological Experiments Involving Reproduction and Teratogenicity." Biometrics. 31:


More information

Hierarchical Linear Models. Jeff Gill. University of Florida

Hierarchical Linear Models. Jeff Gill. University of Florida Hierarchical Linear Models Jeff Gill University of Florida I. ESSENTIAL DESCRIPTION OF HIERARCHICAL LINEAR MODELS II. SPECIAL CASES OF THE HLM III. THE GENERAL STRUCTURE OF THE HLM IV. ESTIMATION OF THE

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis

Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis Power Calculations for Preclinical Studies Using a K-Sample Rank Test and the Lehmann Alternative Hypothesis Glenn Heller Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center,

More information

Births at Edendale Hospital

Births at Edendale Hospital CHAPTER 14 Births at Edendale Hospital 14.1 Introduction Haines, Munoz and van Gelderen (1989) have described the fitting of Gaussian ARIMA models to various discrete-valued time series related to births

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Generalized Linear Models for Count, Skewed, and If and How Much Outcomes

Generalized Linear Models for Count, Skewed, and If and How Much Outcomes Generalized Linear Models for Count, Skewed, and If and How Much Outcomes Today s Class: Review of 3 parts of a generalized model Models for discrete count or continuous skewed outcomes Models for two-part

More information

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling

Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling Simulating Longer Vectors of Correlated Binary Random Variables via Multinomial Sampling J. Shults a a Department of Biostatistics, University of Pennsylvania, PA 19104, USA (v4.0 released January 2015)

More information

multilevel modeling: concepts, applications and interpretations

multilevel modeling: concepts, applications and interpretations multilevel modeling: concepts, applications and interpretations lynne c. messer 27 october 2010 warning social and reproductive / perinatal epidemiologist concepts why context matters multilevel models

More information

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data

Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Journal of Modern Applied Statistical Methods Volume 4 Issue Article 8 --5 Testing Goodness Of Fit Of The Geometric Distribution: An Application To Human Fecundability Data Sudhir R. Paul University of

More information

BAYESIAN ANALYSIS OF DOSE-RESPONSE CALIBRATION CURVES

BAYESIAN ANALYSIS OF DOSE-RESPONSE CALIBRATION CURVES Libraries Annual Conference on Applied Statistics in Agriculture 2005-17th Annual Conference Proceedings BAYESIAN ANALYSIS OF DOSE-RESPONSE CALIBRATION CURVES William J. Price Bahman Shafii Follow this

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

on probabilities and neural networks Michael I. Jordan Massachusetts Institute of Technology Computational Cognitive Science Technical Report 9503

on probabilities and neural networks Michael I. Jordan Massachusetts Institute of Technology Computational Cognitive Science Technical Report 9503 ftp://psyche.mit.edu/pub/jordan/uai.ps Why the logistic function? A tutorial discussion on probabilities and neural networks Michael I. Jordan Massachusetts Institute of Technology Computational Cognitive

More information

Package dispmod. March 17, 2018

Package dispmod. March 17, 2018 Version 1.2 Date 2018-03-17 Title Modelling Dispersion in GLM Package dispmod March 17, 2018 Functions for estimating Gaussian dispersion regression models (Aitkin, 1987 ), overdispersed

More information

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 An Introduction to Multilevel Models PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 Today s Class Concepts in Longitudinal Modeling Between-Person vs. +Within-Person

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA

MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA J. Jpn. Soc. Comp. Statist., 26(2013), 53 69 DOI:10.5183/jjscs.1212002 204 MODEL SELECTION BASED ON QUASI-LIKELIHOOD WITH APPLICATION TO OVERDISPERSED DATA Yiping Tang ABSTRACT Overdispersion is a common

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

GIST 4302/5302: Spatial Analysis and Modeling

GIST 4302/5302: Spatial Analysis and Modeling GIST 4302/5302: Spatial Analysis and Modeling Basics of Statistics Guofeng Cao www.myweb.ttu.edu/gucao Department of Geosciences Texas Tech University guofeng.cao@ttu.edu Spring 2015 Outline of This Week

More information

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 7: Cluster Sampling. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 7: Cluster Sampling Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of roups and

More information