
Analysis of Proportions Data

Bradley Palmquist
Department of Political Science, Vanderbilt University
brad.palmquist@vanderbilt.edu

July 1999

Prepared for the 1999 Annual Meeting of the Political Methodology Society, College Station, Texas, July 15-19, 1999.

Abstract

This paper discusses the modeling of over- and under-dispersion for grouped binary (proportions or frequency) data in binomial regression models (logit and probit). The sources of non-binomial variance, namely heterogeneity and dependence across or within units, are described. Even when point predictions in conventional logit and probit analyses are only slightly affected, standard errors, and inferences based on them, will be incorrect. Particular attention is paid to the extended beta-binomial (EBB) model. Using the extended beta-binomial distribution, the non-binomial variance can be explicitly modeled. Another benefit of modeling the non-binomial variance with the EBB distribution is that negative correlations, which lead to under-dispersion, can be accommodated. Although common in some disciplines, EBB in conjunction with binary regressions (logit, probit) has only recently been used in political science applications. Some example analyses are presented and the results are compared to those from other models and post-analysis fixes.

1 Introduction

It has been suggested (Lindsey 1995) that most applied statistical analysis involves dependent variables with discrete distributions, and that therefore the textbook emphasis on continuous variables and standard regression techniques gives a student the wrong impression of what is ahead. Judging by recent methodological work, there may be some truth to this suggestion in political science. Much attention has been given lately to various logit and probit models of binary outcomes (ordered, multinomial, multivariate, conditional, nested) and to count models with non-negative integer dependent variables.

The simplest count models are based on the Poisson distribution. As a one-parameter distribution, the variance and the mean are necessarily functionally related, and in this case they are equal. Actual data often do not exhibit this mean-variance equality. The analyst may have theoretical reasons for expecting violations of the Poisson assumptions. Frequently it is simply observed that the variance either exceeds the expected value (overdispersion) or is less than it (underdispersion). Those who have studied count models have emphasized from the beginning the importance of paying attention to these phenomena and have developed a variety of more flexible models and estimation techniques to account for them (King 1989).

Political scientists have not paid as much attention to similar issues of over- and underdispersion in the case of sums of binary outcomes. There are a few exceptions. King (1989) briefly discusses extra-binomial dispersion and the use of the beta-binomial model. Cox and Katz (1999) use the beta-binomial model in a study of the effects of redistricting. Globetti (1997), Futter and Mebane (1996), and Baker and Scheiner (1999) are other recent examples. The object of this paper is to explain the issues and discuss the modeling and estimation techniques for non-binomial dispersion in the analysis of proportions (or frequency, or binary count) variables.
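The Poisson mean-variance constraint, and its failure under rate heterogeneity, can be illustrated with a small simulation (my own sketch, not from the paper; it assumes numpy is available, and the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Pure Poisson: the variance equals the mean.
pois = rng.poisson(lam=4.0, size=n)

# Gamma-mixed Poisson (a negative binomial): heterogeneity in the
# rate across units inflates the variance above the mean.
lam = rng.gamma(shape=2.0, scale=2.0, size=n)  # E(lambda) = 4
mixed = rng.poisson(lam=lam)

print(pois.mean(), pois.var())    # both roughly 4
print(mixed.mean(), mixed.var())  # mean roughly 4, variance roughly 12
```

The mixed sample has the same mean but roughly triple the variance, the overdispersion pattern discussed in the text.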
Of course, in some cases a straightforward binomial distribution may be assumed to characterize a sum of binary outcomes. But when there is parameter heterogeneity or dependence among the binary variables that make up the sums, a binomial model will not apply. Standard errors will be incorrectly estimated, comparisons of means or parameter values can be misleading, and estimates will not be efficient.

Previous research on count models can provide a guide. Mixture models, random-effects models, robust variance estimation, and heterogeneity factors have been used to extend the Poisson regression (log-linear) model. Analogous techniques exist for binomial regression models (typically logit and probit). Previous work in biostatistics provides much of the statistical machinery we require. For example, the (extended) beta-binomial model has been used in the toxicological research literature to model these non-binomial characteristics. The beta-binomial model is one of the central subjects of this paper.

But the received wisdom on count models can also provide a counterpoint. There are some unique features to the under- and over-dispersion in models of binary data. In particular, the role of "heterogeneity" can be surprising. Another goal of this paper is to describe carefully the various sources of non-binomial variation and to clarify the way heterogeneity and dependence enter in. There are several possible connections between parameter heterogeneity and non-binomial variation, depending on the kind of heterogeneity being modeled. Accounting for heterogeneity across units involves a straightforward application of the beta-binomial model, where overdispersion is to be expected. Characterizing heterogeneity within units is a different problem. The consequence of this "heterogeneity" is underdispersion, not overdispersion.

2 Binomial or Grouped Binary Response Data

The data we are interested in are binary outcomes: success or failure, voting yes or no, elected or not elected, employed or unemployed, etc. When the outcomes can be modeled

individually, conventional logit, probit, and related techniques can be used. But when the individual binary outcomes are grouped, the usual assumptions of standard logit analysis may not be met. Instead of modeling the individual binary variables, we focus on their sum, Y, which is the number of positive outcomes conditional on the number of trials, n, for that group. These sums of binary variables are indexed by the subscript i (i = 1, 2, ..., N), so that Y_i is the number of positive outcomes recorded for the i'th observational unit out of n_i trials. If the binary variables recorded for the i'th unit have a common probability of success, π_i, and they are independent, then Y_i is distributed binomially. This familiar distribution is

  Pr(Y_i = y_i | π_i, n_i) = (n_i choose y_i) π_i^{y_i} (1 − π_i)^{n_i − y_i}.

The subscripts i are included to remind us from the beginning that we are not interested simply in a univariate distribution, but in a whole series of Y_i, each conditional on the modeling of their expectations. The expectation of Y_i is n_i π_i, and the binomial variance of Y_i equals n_i π_i(1 − π_i). Sometimes it is more natural to consider the proportion of successes, p_i. The expectation of this proportion is just π_i, with variance π_i(1 − π_i)/n_i. On occasion it makes sense to discuss the underlying components of the sum, the n_i Bernoulli trials, which I label U_ij, but in most of what follows I focus on the Y_i directly, since I assume that the data are produced or recorded in the grouped form and that the analyst does not have added information about the specific U_ij. For example, any potential independent variables have values that vary by the i'th unit and not by the U_ij within each unit. An analogy to integer count data can be made. One typically models the Poisson sum, for example, not the individual outcomes which make up the sum.

If in fact the grouping is an artificial combination of independent cases that happen to have the same values on measured independent variables, grouped binary variables provide no new challenges. If, however, the assumptions of independence and identical distribution are not met, Y_i will not be distributed binomially. The issue of varying distributions will be taken up later. Lack of independence can result from two different mechanisms. The individual binary components of the observational unit can be directly correlated. Or this correlation can be induced as a result of heterogeneity across the Y_i. The next several sections deal mostly with this issue of heterogeneity.

3 Heterogeneity

The heterogeneity we consider in this section involves varying response probabilities for different individuals or observation units.[1]

[Footnote 1: I emphasize again that I am not discussing the individual binary components, U_ij, of the binomial sums, Y_i. For our purposes the "individual" observation i relates to the entire sum Y_i. To help avoid confusion I sometimes refer to this i'th "individual" as the i'th "observational unit."]

For the time being we maintain the assumption of a single response probability π_i constant across each of the binary variables that make up the sum, Y_i. What can differ is π_i and π_i′ for two different units. To understand the sense in which heterogeneity across cases can produce non-binomial variation in analyses of binary data, consider first the textbook linear regression setup of a continuous dependent variable. In a regression analysis we attempt to account for varying E(Y_i) by modeling these expectations as functions of the independent variables. Thus, to take the simplest example of a linear regression, we model the data as

  E(Y_i | x_i) = x_i′β,   V(Y_i | x_i) = σ²,

where the variance of the Y_i is assumed to be the same for different units. (Note that if the x_i are regarded as fixed in repeated samples, the expectations and variance need not be

stated in conditional form.) We can also rewrite the regression model in a form that isolates the variable part of Y_i as a disturbance:

  Y_i = x_i′β + e_i,   E(e_i) = 0,   V(e_i) = σ².

In order to do maximum likelihood analysis we would also need to assume a particular distribution for the Y_i, typically normal. At some points in the discussion we will do so, but for the moment this is not necessary. What we assume for the linear regression setup is that the distribution of Y_i is continuous and unbounded.

There can be some ambiguity about just what the data generation process in a regression model is. If it makes sense to identify specific individuals in a population, or one can somehow distinguish the random outcomes by labeling, we can in principle distinguish two components of e_i. In addition to the fundamental variation (due to sampling, observation, or random process) for each individual, there can also be variation due to heterogeneity among the individuals or observational units even after conditioning on the independent variables. Labeling the distinguishable individuals with the subscript i and a sampling or outcome for that individual with the subscript j, we have

  Y_ij = x_i′β + ν_i + ε_ij,   E(ν_i) = E(ε_ij) = 0,   V(ν_i) = σ_ν²,   V(ε_ij) = σ_ε².

Under this model each individual (or observed unit) has an expectation of x_i′β + ν_i. The set of Y_ij as a whole has expectation x_i′β. The ν_i can be considered individual-specific effects. This sort of setup is most familiar with panel data, where an individual (or country or firm) is observed on multiple occasions over time and therefore the subscript j is conventionally t. The ν_i are then the familiar random effects. (We are ignoring here any questions of heteroskedasticity.) But even if all the data we have come from a single cross-section, the variation of the Y_i includes that due to the ν_i. For usual OLS and maximum likelihood estimation with normal errors, this added source of variation is observationally equivalent to having a single disturbance term e_i with variance equal to σ_ν² + σ_ε². And, importantly, it does not affect the way we estimate the main structural parameters of interest, β. The added uncertainty is naturally just absorbed into the estimate of σ_e.

The situation is different for the distribution of binary data. Additional variation across observation units on top of the fundamental binomial variation means that V(Y_i) no longer equals n_i π_i(1 − π_i).[2] Based on the previous discussion this should not be surprising. The problem is that the standard logit or probit setup does not provide a means for modeling this overdispersion due to heterogeneity. Rather than there being a free dispersion parameter like σ_e² in the case of regression of a continuous and unbounded dependent variable, if we assume Y_i to be distributed as binomial, then its variance is constrained to equal n_i π_i(1 − π_i).

[Footnote 2: This is not true if n_i = 1, that is, for ungrouped binary variables. More about this below.]

To model the dependence of Y_i on x_i we can still use the general form of a regression model. Typically, however, the expectation of Y_i itself is not directly modeled. Rather, the systematic relationship with the regressors is between some function of E(Y_i), here indicated by η_i:

  η_i = x_i′β,   η_i = g(μ_i).

Consider the logit case:[3]

  g(μ_i) = log{μ_i/(n_i − μ_i)} = log{π_i/(1 − π_i)}.

[Footnote 3: I discuss logit throughout, but the same points can be made for probit, complementary log-log, or other setups. Logit is the canonical link in the terminology of generalized linear models or exponential distributions and so allows simpler or more elegant derivations in some contexts.]

As before, E(Y_i) = n_i π_i and V(Y_i) = n_i π_i(1 − π_i), and E(p_i) = π_i and V(p_i) = π_i(1 − π_i)/n_i. In contrast to the linear regression setup, note that there is no disturbance term in the equation

for the systematic component, η_i. Nor is there a separate dispersion term in the expression for V(Y_i). Y_i varies only according to the variance function of π_i. In practice, however, many data sets of grouped binary variables display extra-binomial variance, meaning the variance of the Y_i is observed to exceed the expected n_i π̂_i(1 − π̂_i). If we continue to maintain the assumption that the individual components of the sum Y_i are independently and identically distributed, conditional on π_i, then this extra-binomial variation, or overdispersion, must result from heterogeneity of the π_i among the units. We have no place else to turn, since given fixed n_i, the binomial distribution has but a single parameter. The details of how this heterogeneity translates into overdispersion are in the next section.

In the (non-panel-data) regression model of a continuous, unbounded dependent variable, these additional sources of variation do not raise concerns, because we have no model-determined prior notion of a functional relationship between the mean and the conditional variance. The same phenomenon of over- (and under-) dispersion does arise with regard to non-negative integer count variables modeled by exponential Poisson regression.[4]

[Footnote 4: This is also referred to as the log-linear model (for example, McCullagh and Nelder 1989).]

The Poisson distribution is also a single-parameter distribution. The mean of Y distributed as Poisson equals its variance. But again, in practice, many integer count variables do not match this model-determined constraint. The issues related to modeling such data have been intensively discussed by political scientists in recent years (see for example King). The difference between Poisson and binomial regressions on the one hand, and regressions of unbounded continuous variables on the other, is that if we believe we have modeled the conditional mean function appropriately, this has consequences for the expected conditional variance. This is not only true for panel data, but applies to single cross-sections. A similar

standard or benchmark does not exist for continuous variable regression unless panel data are available.

What are the sources of heterogeneity among the binomial variables, Y_i? The most commonly mentioned source is "omitted variables." The notion here is that Y_i (in the continuous case) or η_i = log{π_i/(1 − π_i)} varies not just for random reasons, but with a systematic relationship to some additional variables not included in the model. In some tautological sense this must always be true. Presumably any model is a simplification. But the possible implication that the analyst should just go out and find those needed variables to eliminate overdispersion is of course not practical. Even if all of the systematic interindividual variation could be conditioned on an increased list of independent variables, there might well remain random individual effects. In any case, simply including more variables will not always eliminate extra-binomial variance.

Let us assume that the analyst has included just those variables for which they want to estimate coefficients. What is not included are the unit-specific effects (random or fixed). The omission of unrelated variables in the continuous variable case (we are not discussing omitted variable bias here) neither changes the form of the likelihood equation (assuming normality) nor invalidates the estimated standard errors and techniques of statistical inference, since an estimate of the "total" σ_e² is made. For binary variables, problems arise. A mixture of binomials is not in turn distributed binomially, as a mixture of random normals is distributed normally. The standard errors calculated by conventional logit assume a binomial error distribution. They will have incorrect magnitude, and testing procedures may be based on incorrect distributional assumptions. This can lead to incorrect comparisons of coefficients. Lastly, estimation of coefficients will be inefficient. Similarly, a mixture of Poisson variables is not in turn distributed as a Poisson.

We will see below that there are a number of "adjustments" that can be made to binomial (logit, probit) and Poisson regressions. But

in any case, the underlying probability distribution is changed and therefore any likelihood-based techniques must be altered.

These individual-specific effects arise because there are unmodeled influences that affect all the components of each binomial sum, but are independent of the same kind of effects on the other units. In the biological and agricultural experimental literature the canonical example is "litter" or "batch" effects (Kleinman 1973, Williams 1975, Crowder 1978, Haseman and Kupper 1979, Morgan 1992). Groups of animals (mice, cows, flour-beetles, chicken embryos), or plants (plum-roots, Orobanche seeds), or insects (moths, houseflies) are subjected to some treatment. Litters or batches which are conceived or raised or handled together may have more in common than the levels of the experimental treatments. A "litter effect" is the tendency of members of a group to respond more alike than members of other groups (Haseman and Kupper 1978, 281). In examples more directly related to political science, multiple responses forming a binomial sum (even if they are conditionally independent) recorded for a single politician, a single survey interviewee, a single administration, or a single country or other jurisdiction will often have something in common that responses for other units do not, namely a different, observation-specific probability. These individual effects (random or fixed) are in addition to the effects of the x_i included in the model for the expectation of π_i. Other possible sources of overdispersion include measurement error, the wrong functional forms of independent variables, the wrong link function (e.g. complementary log-log instead of logit), and the presence of outliers. Although these are important practical concerns, this paper will focus primarily on heterogeneity among units and, in a later section, the logical dual of correlation between the components within units.
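A litter effect of this kind is easy to simulate (a hypothetical sketch of my own, assuming numpy; the Beta(3, 7) mixing distribution and litter size are arbitrary choices). Each litter draws its own response probability, members respond independently conditional on it, and the litter totals come out overdispersed relative to the binomial benchmark:

```python
import numpy as np

rng = np.random.default_rng(1)
n_litters, litter_size = 20_000, 8

# Litter effect: each litter's success probability is drawn around a
# common mean of 0.3 (here from Beta(3, 7), which has mean 0.3); the
# members then respond independently *conditional* on their litter's
# probability.
a, b = 3.0, 7.0
pi_i = rng.beta(a, b, size=n_litters)
y = rng.binomial(litter_size, pi_i)   # litter totals Y_i

pi_tilde = a / (a + b)                # 0.3
binom_var = litter_size * pi_tilde * (1 - pi_tilde)  # 8 * .3 * .7 = 1.68
print(y.var(), binom_var)  # observed variance clearly exceeds 1.68
```

The observed variance is close to the theoretical n π̃(1 − π̃)[1 + (n − 1)ρ] with ρ = 1/(a + b + 1), roughly 2.75 here rather than 1.68.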

4 Overdispersion

In the previous section we noted that if the individual components, U_ij, of the sum Y_i are independently and identically distributed, conditional on π_i, then overdispersion can only be generated through heterogeneity, by which we mean response probability variation among the binomial units (not within each). Let us begin to put more structure on the problem. Assume that each observed case, i, has an expected response probability of π_i. These π_i are systematically related to the included x_i. To some extent, then, heterogeneity is accounted for in the model. Now, in addition, we allow the response probabilities to vary. One way to do this is formally just like the regression setup for continuous variables:

  E(π_i) = F(x_i′β),   V(π_i | x_i) = σ_i²,

or

  π_i = F(x_i′β) + ε_i,   E(ε_i) = 0,   V(ε_i) = σ_i².

(Note again that if the x_i are nonstochastic, the variances need not be expressed in conditional form.) Allison has characterized this modeling of non-binomial variation as "including a disturbance into logit and probit regression models" (1987).[5]

[Footnote 5: The disturbance introduced in this section is "external" in Allison's terminology. Below we consider a random effects model which introduces an "internal" disturbance term.]

There are some contrasts with the setup for regression of a continuous variable. First, it is the response probabilities π_i, not Y_i directly, that are modeled. Secondly, the functional relationship to the x_i is not linear. F(·) is the inverse of the logit function, or the logistic distribution function:

  F(x_i′β) = e^{x_i′β} / (1 + e^{x_i′β}).
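The consequences of this external disturbance for the moments of Y_i can be checked by simulation. A minimal sketch of my own (assuming numpy; the uniform disturbance and x_i′β = 0 are illustrative choices that keep π_i inside the unit interval):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 200_000, 10

# External disturbance on the response probability:
#   pi_i = F(x'beta) + eps_i, with F the inverse logit.
# Here x'beta is fixed at 0, so F(x'beta) = 0.5, and eps_i is
# uniform on (-0.2, 0.2), keeping pi_i strictly inside (0, 1).
eta = 0.0
base = 1.0 / (1.0 + np.exp(-eta))      # F(eta) = 0.5
eps = rng.uniform(-0.2, 0.2, size=N)
pi = base + eps                        # E(pi) = 0.5
y = rng.binomial(n, pi)

sigma2 = 0.04 / 3                      # variance of U(-0.2, 0.2)
expected_var = n * 0.5 * 0.5 + n * (n - 1) * sigma2   # 2.5 + 1.2 = 3.7
print(y.mean())                        # close to n * 0.5 = 5.0
print(y.var(), expected_var)           # both close to 3.7
```

The simulated variance matches n π̃(1 − π̃) + n(n − 1)σ², the unconditional variance derived in the text, and clearly exceeds the binomial value of 2.5.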

Also, we allow for heteroskedasticity right from the start by subscripting σ². OLS regression is generally first presented with the assumption of homoskedasticity (i.e. constant σ² across individuals), but for these bounded random parameters π_i that does not seem reasonable.

The realizations of the random variable π_i are not observed. They are in this sense latent variables. What we do observe are the consequences for the y_i. Conditional on a specific realization of π_i,[6] the familiar expectation and variance of a binomial variable hold true:

  E(Y_i | π_i) = n_i π_i,   V(Y_i | π_i) = n_i π_i(1 − π_i).

[Footnote 6: I am using π_i to represent both the random variable and its realizations. One could use a capital letter for the former and lower case for the latter, as for Y_i and y_i, but that does not seem necessary.]

What are the unconditional expectation and variance of Y_i? Standard results from conditional probability theory lead us to conclude

  E(Y_i) = E[E(Y_i | π_i)] = E(n_i π_i) = n_i E(π_i) = n_i π̃_i,

where we have used π̃_i to represent the expectation of the random response probability (it in turn is conditional on x_i). Again using standard conditional probability results, we note that the unconditional variance of Y_i equals E[V(Y_i | π_i)] + V[E(Y_i | π_i)]. Deriving these two components separately, we find that

  E[V(Y_i | π_i)] = E[n_i π_i(1 − π_i)] = n_i E(π_i) − n_i E(π_i²) = n_i π̃_i − n_i[σ_i² + π̃_i²] = n_i π̃_i(1 − π̃_i) − n_i σ_i²

and

  V[E(Y_i | π_i)] = V(n_i π_i) = n_i² V(π_i) = n_i² σ_i².

Hence

  V(Y_i) = n_i π̃_i(1 − π̃_i) − n_i σ_i² + n_i² σ_i² = n_i π̃_i(1 − π̃_i) + n_i(n_i − 1) σ_i².

We can reparameterize σ_i² as ρ_i π̃_i(1 − π̃_i) to obtain

  V(Y_i) = n_i π̃_i(1 − π̃_i)[1 + (n_i − 1)ρ_i].

Note that, for the moment, no assumptions are made about the individual ρ_i, except for the fact that ρ_i cannot exceed 1.[7]

[Footnote 7: Maximal variance for the varying probabilities with expectation π̃_i would be to have only the two possibilities of π_i equal to 1 or 0, with probabilities π̃_i and 1 − π̃_i. (This is just like any Bernoulli variable.)]

Eventually, to estimate the parameters of this model, we will have to reduce their number, either by assuming that all ρ_i are equal or by modeling them as functions of a vector of independent variables which may or may not coincide with the variables in the logit equation for π_i.

A couple of observations can be made about the variance of Y_i as written. First, if the binomial sum has only one component Bernoulli variable, i.e. n_i equals 1, then whether the response probability is a random π_i with expectation π̃_i or a fixed π̃_i, the variance of Y_i is still just n_i π̃_i(1 − π̃_i). This may seem surprising. How can the introduction of additional randomness not increase the variance of Y_i? The second thing to be noted about the above expression is that it suggests an interpretation in terms of covariances among the component binary random variables, U_ij, within each individual binomial sum, Y_i. There are n_i contributions of π̃_i(1 − π̃_i) and n_i(n_i − 1) contributions of ρ_i π̃_i(1 − π̃_i) to the variance of the sum, Y_i. This suggests the n_i diagonal elements of an n_i by n_i covariance matrix representing V(U_i) and the n_i(n_i − 1) off-diagonal elements representing Cov(U_ij, U_ik). On this interpretation, then, ρ_i is the pairwise correlation between any two U_ij components of Y_i. This is as far as we can go without further assumptions. In the next section we make some assumptions about the ρ_i and the distributional form of the π_i.

5 Beta-Binomial Model

The beta-binomial model is the first method we consider for explicitly accounting for overdispersion. It plays the same role for binomial regression that the negative binomial distribution does for Poisson regression. Both are mixture or compound distributions that accommodate variances which exceed the variance functions of the binomial (n_i π_i(1 − π_i)) and Poisson (λ_i). Under the beta-binomial model, it is not assumed that all the Y_i have the same binary response probability π_i. Rather, the π_i differ across observations even after any explanatory modeling has been done. The underlying probabilities, π_i, are postulated to follow a beta distribution. That is, for each binomial sum Y_i, π_i is modeled as if drawn independently from a common Beta(a, b) distribution. Conditional on this π_i, each Y_i is then a binomial variable with response probability π_i. The probability π_i varies across the observational or experimental units for which the binomial variables Y_i are observed. Since π_i must lie within 0 and 1, and because of its flexibility, the beta distribution is a promising way to characterize this parameter heterogeneity:

  f(π) = B(a, b)⁻¹ π^{a−1} (1 − π)^{b−1},

where $a$ and $b$ are the conventional parameters of the beta distribution, and $B(a,b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$. Prentice (1986) suggests a reparameterization in which $\theta = (a+b)^{-1}$ and $\pi = a(a+b)^{-1}$. This is the parameterization that King (1989) uses as well. The standard expectation result for the beta distribution is
\[
E(\pi) = \frac{a}{a+b} = \pi = \tilde\pi .
\]
I replace Prentice's $\pi$ with $\tilde\pi$ as a reminder that this parameter is the expectation over the $\pi_i$. The variance of a beta distribution, using both parameterizations, is
\[
V(\pi) = \frac{ab}{(a+b)^2}\,\frac{1}{a+b+1}
       = \tilde\pi(1-\tilde\pi)\,\frac{1}{1+\theta^{-1}}
       = \tilde\pi(1-\tilde\pi)\,\theta(1+\theta)^{-1} .
\]
Our primary interest is in the consequences this parameter variation has for the observed $Y_i$. The unconditional, beta-binomial distribution of each $Y_i$ is
\[
\Pr(Y_i = y_i \mid a_i, b_i, n_i) = \binom{n_i}{y_i} \frac{B(a_i + y_i,\, b_i + n_i - y_i)}{B(a_i, b_i)}
\]
or
\[
\Pr(Y_i = y_i \mid \tilde\pi_i, \theta_i, n_i) = \binom{n_i}{y_i}
\prod_{j=0}^{y_i-1}(\tilde\pi_i + \theta_i j)
\prod_{j=0}^{n_i-y_i-1}(1 - \tilde\pi_i + \theta_i j) \bigg/
\prod_{j=0}^{n_i-1}(1 + \theta_i j) .
\]
The expectation and variance of $Y_i$ are $n_i a_i (a_i+b_i)^{-1}$, or $n_i \tilde\pi_i$, and $n_i a_i b_i (a_i+b_i)^{-2}\bigl[1 + (n_i-1)(a_i+b_i+1)^{-1}\bigr]$, or $n_i \tilde\pi_i(1-\tilde\pi_i)\bigl[1 + (n_i-1)\theta_i(1+\theta_i)^{-1}\bigr]$. Defining $\phi_i = \theta_i(1+\theta_i)^{-1}$ and rearranging, we can see that one interpretation of $\phi_i$ is as the within-unit pairwise correlation coefficient. Then $\phi_i n_i(n_i-1)$ times the Bernoulli variance $\tilde\pi_i(1-\tilde\pi_i)$ is the contribution from cross-terms that produces the extra-binomial variance. The expression for the variance in this form directly matches the general characterization of extra-binomial variation derived above.
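These moment results can be checked numerically. A minimal sketch using SciPy's `betabinom` distribution (available since SciPy 1.4); the parameter values here are purely illustrative:

```python
from scipy.stats import betabinom, binom

n, a, b = 10, 2.0, 6.0            # illustrative beta-binomial parameters
pi_tilde = a / (a + b)            # Prentice's mean parameter
theta = 1.0 / (a + b)             # Prentice's dispersion parameter
phi = theta / (1.0 + theta)       # within-unit pairwise correlation

bb = betabinom(n, a, b)
# The closed-form moments from the text match SciPy's:
print(bb.mean(), n * pi_tilde)
print(bb.var(), n * pi_tilde * (1 - pi_tilde) * (1 + (n - 1) * phi))
# The beta-binomial variance always exceeds the matching binomial variance:
print(bb.var() > binom(n, pi_tilde).var())
```

As $\theta$ is taken toward 0 the two variances printed above coincide, matching the observation that the beta-binomial collapses to the simple binomial.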

\[
V(Y_i) = \tilde\pi_i(1-\tilde\pi_i)\bigl[n_i + n_i(n_i-1)\phi_i\bigr]
\]
Thus the beta-binomial distribution provides a fully parameterized likelihood basis for modeling non-binomial variance. When $\theta_i$ goes to 0 and the beta-binomial approaches the simple binomial, the correlation between pairs of response variables within the unit goes to 0. Conversely, when $\theta_i$ goes to infinity and the distribution of $\pi_i$ approaches a binary random variable with no conditional variation in the $U_{ij}$, the correlation between pairs of response variables within the unit goes to 1, reflecting the fact that, conditional on $\pi_i$, the $U_{ij}$ are either all 1s or all 0s with probability approaching 1. The effects of across-unit parameter heterogeneity cannot be separated from within-unit correlation, since they necessarily occur together in this model. In fact, capitalizing on the correlation induced by this two-step or hierarchical probability model, contagion effects can be modeled even though the simulation does not directly build in the correlation.

6 GLM and Quasi-Likelihood

Alternative approaches to modeling non-binomial variance for binary variables derive from the generalized linear models (GLM) literature. Rather than assuming that we know the full likelihood for the distribution of the $Y_i$, as in the beta-binomial model, more limited assumptions are made, focusing on the relationship of the mean to the variance. The simplest version merely includes a "heterogeneity" factor which scales up the standard errors based on the estimated overdispersion. The standard test for overdispersion is simply the conventional Pearson $X^2$ statistic. For $N$ binomial sums, $Y_i$ $(i = 1, \ldots, N)$, each with $n_i$ components, observed $p_i$, and predicted

$\hat\pi_i$ based on a model with $k$ parameters,
\[
X^2 = \sum_{i=1}^{N} \frac{(y_i - n_i \hat\pi_i)^2}{n_i \hat\pi_i(1-\hat\pi_i)}
    = \sum_{i=1}^{N} \frac{n_i (p_i - \hat\pi_i)^2}{\hat\pi_i(1-\hat\pi_i)} ,
\]
which is distributed as a $\chi^2_{N-k}$ statistic.$^8$ To the extent that the estimated $X^2$ exceeds its expected value of $N-k$, and if other sources of this departure can be eliminated, this can provide a test of overdispersion. Furthermore, the heterogeneity factor, $\sigma^2$, is estimated by $X^2/(N-k)$. The estimate of the quantity we have called $\phi$ in the expression for $V(Y_i)$ in previous sections would then be
\[
\hat\phi = \frac{X^2 - (N-k)}{\bigl(\sum n_i / N - 1\bigr)(N-k)}
\]
(see Allison 1987). This is an approximate expression that does not fully account for different values of $n_i$ across the observations. Note that we are constrained to estimate a common $\phi$ for all $Y_i$. Williams (1982) presents a more elaborate iterated weighted least squares method that does account for the varying $n_i$. This technique does, however, also assume a constant $\phi$ for all observations.

$^8$ Collett 1991, 38, or McCullagh and Nelder 1989, 127.

7 Random Effects Models

In previous sections we discussed the addition of variation to the response probabilities after conditioning the $\pi_i$ on the $x_i$; or, in Allison's (1987) terminology, we introduced a disturbance which is "external" to the systematic relation between $\pi_i$ and the $x_i$. Instead, we might introduce an "internal" disturbance. Above, the disturbance was in the metric of the $\pi_i$:
\[
\pi_i = F(x_i'\beta) + \epsilon_i , \qquad E(\epsilon_i) = 0 , \qquad V(\epsilon_i) = \tau_i^2
\]
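An external disturbance of this kind is easy to simulate, and doing so verifies both the variance inflation derived in section 4 and the moment estimator $\hat\phi$ given above. A minimal sketch (the beta draw for $\pi_i$ and all numeric values are illustrative assumptions; the mean $\tilde\pi$ is treated as known, so $k = 0$):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative values: N units, common denominator n, mean pi_tilde, phi.
N, n, pi_tilde, phi = 20_000, 20, 0.3, 0.15

# One convenient external disturbance: draw pi_i from a Beta with mean
# pi_tilde and variance phi * pi_tilde * (1 - pi_tilde), which requires
# a + b = 1/phi - 1.
s = 1.0 / phi - 1.0
pi = rng.beta(pi_tilde * s, (1 - pi_tilde) * s, size=N)
y = rng.binomial(n, pi)

# Variance inflation relative to the binomial: near 1 + (n - 1)*phi.
print(y.var() / (n * pi_tilde * (1 - pi_tilde)), 1 + (n - 1) * phi)

# Pearson X^2 against the known mean, and the moment estimate of phi:
x2 = np.sum((y - n * pi_tilde) ** 2 / (n * pi_tilde * (1 - pi_tilde)))
phi_hat = (x2 - N) / ((n - 1) * N)
print(phi_hat)        # close to the phi used to simulate
```

In a real application $\hat\pi_i$ comes from a fitted model, $k > 0$, and the $n_i$ vary, so the approximation caveats noted above apply.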

An \internal" disturbance corresponds more closely to random eects models in other settings: i = F (x 0 i + i ); E( i ) = 0; V ( i ) = 2 i : Typically the i are assumed to be normally distributed. Now it is the logit equation itself that resembles the standard regression setup, writing i for logf i =(1? i )g: i = x 0 i) + i ; E( i ) = 0; V ( i ) = 2 i Allison (1987) describes an approximate method of estimating a random eects model for logistic regression. Williams (1982) discusses two other approximate methods. Lindsey (1995) describes a method based on Gauss-Hermite quadrature for estimating the random eects model. 8 Internally Variable Response Probabilities Up to now, the discussion of heterogeneity has dealt only with variation in the response probabilities among the observational units. Another kind of heterogeneity we might consider is internal to, or within, each unit. This section highlights some previously unappreciated or misunderstood aspects of nonuniform probabilities for the component Bernoulli variables, the U i, that are summed to create Y i. In this situation we have sums of binomials, not mixtures of them as in the previous discussion of the beta-binomial model. If the component response probabilities ij vary, then the random sum Y i is no longer even conditionally binomial. A sum of binomials is in turn binomial only if the component probabilities i j are equal (Feller 1968, 268, 282). The 17

result of this "internal heterogeneity" is to reduce, not increase, the unconditional variance of the sums $Y_i$.

Note that these features of random sums of binomials should be distinguished from the Poisson case. The sum of Poisson variables with varying means, $\lambda_{ij}$, is in turn distributed as Poisson, with expectation and variance equal to $\sum_{j=1}^{n_i} \lambda_{ij}$ (Feller 1968, 268). In the Poisson case, "internal heterogeneity" of this type produces neither over- nor under-dispersion. The more familiar result, that heterogeneity in Poisson models leads to a negative binomial distribution, applies to mixtures of Poisson distributions for modeling cross-unit heterogeneity.

We begin the discussion of sums of varying probabilities by considering the case of random, not fixed, $\pi_{ij}$. This might seem like a complication, but it extends the previous discussion in a natural manner. Our conclusion above was that if the $\pi_i$ had variance $\phi_i \tilde\pi_i(1-\tilde\pi_i)$ then $V(Y_i) = n_i \tilde\pi_i(1-\tilde\pi_i)[1 + (n_i-1)\phi_i]$. We maintain the assumption of variable response probabilities with expectation $\tilde\pi_i$ and variance $\phi_i \tilde\pi_i(1-\tilde\pi_i)$, but stipulate that each $Y_i$ is the sum of $J_i$ "clusters" of component binary variables. Each cluster now has its own response probability, $\pi_{ij}$, distributed as before. To determine the unconditional mean and variance of $Y_i$ we again make use of standard conditional probability theory. For ease of explication we assume that the number of components in each of the $J_i$ "clusters" is the same, so that we can use $m = n_i/J_i$ for the cluster size. The average of the realized $\pi_{ij}$ for an observed $y_i$ is written as $\bar\pi_{i\cdot}$. The full set of $\pi_{ij}$ is denoted $\pi_{i\cdot}$, not to be confused with the expected value of the $\pi_{ij}$, which we continue to denote $\tilde\pi_i$. First we need the conditional mean and variance:
\[
E(Y_i \mid \pi_{i\cdot}) = \sum_{j=1}^{J_i} m\,\pi_{ij} = m J_i \bar\pi_{i\cdot} = n_i \bar\pi_{i\cdot}
\]

and
\[
\begin{aligned}
V(Y_i \mid \pi_{i\cdot}) &= \sum_{j=1}^{J_i} m\,\pi_{ij}(1-\pi_{ij}) \\
 &= \sum_j m\,\pi_{ij} - \sum_j m\,\pi_{ij}^2 \\
 &= \bigl(m J_i \bar\pi_{i\cdot} - m J_i \bar\pi_{i\cdot}^2\bigr)
   - \Bigl(\sum_j m\,\pi_{ij}^2 - m J_i \bar\pi_{i\cdot}^2\Bigr) \\
 &= m J_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot}) - m \sum_j (\pi_{ij} - \bar\pi_{i\cdot})^2 \\
 &= n_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot})
   - m (J_i - 1) \sum_j (\pi_{ij} - \bar\pi_{i\cdot})^2 / (J_i - 1) .
\end{aligned}
\]
Comparing this result to the conditional variance when there are no clusters, or, more precisely, when the whole sum is a single cluster, we see that this conditional variance is smaller by the amount of the second term. This expression is a more general result that includes the earlier one, since if the whole sum is a single cluster this term drops out. This in turn is an example of the general phenomenon that the variance of a sum of binomials with varying probabilities must in all cases be less than the variance of a sum of binomials each with probability equal to the average (Feller 1968, 231).

Now we can derive the unconditional mean and variance of $Y_i$. The unconditional mean is unchanged by this new structure:
\[
E(Y_i) = E[E(Y_i \mid \pi_{i\cdot})] = E(n_i \bar\pi_{i\cdot}) = n_i E(\bar\pi_{i\cdot}) = n_i \tilde\pi_i .
\]
The two components of the unconditional variance are
\[
\begin{aligned}
E[V(Y_i \mid \pi_{i\cdot})]
 &= E\Bigl[ m J_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot})
    - m(J_i-1) \sum_j (\pi_{ij}-\bar\pi_{i\cdot})^2/(J_i-1) \Bigr] \\
 &= m J_i \tilde\pi_i
    - m J_i \bigl[\phi_i \tilde\pi_i(1-\tilde\pi_i)/J_i + \tilde\pi_i^2\bigr]
    - m(J_i-1)\,\phi_i \tilde\pi_i(1-\tilde\pi_i) \\
 &= n_i \tilde\pi_i(1-\tilde\pi_i)(1-\phi_i)
\end{aligned}
\]

and
\[
V[E(Y_i \mid \pi_{i\cdot})] = V(m J_i \bar\pi_{i\cdot})
 = m^2 J_i^2\, \phi_i \tilde\pi_i(1-\tilde\pi_i)/J_i
 = m n_i\, \phi_i \tilde\pi_i(1-\tilde\pi_i) .
\]
Putting them together, we finally get
\[
V(Y_i) = n_i \tilde\pi_i(1-\tilde\pi_i)\bigl[1 + (m-1)\phi_i\bigr].
\]
Again this is a more general result that subsumes our earlier one. If there is one uniform response probability for the $i$'th observation and, hence, $m$ equals $n_i$, we have the expression derived above, $V(Y_i) = n_i \tilde\pi_i(1-\tilde\pi_i)[1+(n_i-1)\phi_i]$. Note also the implication mentioned above: if the number of "clusters," $J_i$, equals $n_i$, so that the effective "cluster size," $m$, equals one, then the term involving $\phi_i$ drops out and $Y_i$ is binomial again. So, for example, if the data are produced by simple random sampling without clustering, then the variable $Y_i$ is indeed binomial and standard results follow.

9 Underdispersion as a Result of Nonuniform Probabilities

In the last section we considered response probabilities that varied within each $Y_i$ as a result of random draws of the $\pi_{ij}$. Nonuniform probabilities might also, and more simply, be the result of what we might call fixed parameter heterogeneity, i.e. a different $\pi_{ij}$ for each of the component $U_{ij}$. Using the same algebraic steps we used to derive the conditional variance of $Y_i$ in the previous section, we get

\[
\begin{aligned}
V(Y_i) &= \sum_{j=1}^{n_i} \pi_{ij}(1-\pi_{ij}) \\
 &= n_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot}) - \sum_j (\pi_{ij} - \bar\pi_{i\cdot})^2 \\
 &= n_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot})\bigl[1 - \delta_i\bigr],
\end{aligned}
\]
where
\[
\delta_i = \frac{\sum_j (\pi_{ij} - \bar\pi_{i\cdot})^2}{n_i \bar\pi_{i\cdot}(1-\bar\pi_{i\cdot})}
\]
is a relative measure of the internal parameter heterogeneity that can vary from 0 to 1. Since $\delta_i \geq 0$, $V(Y_i)$ will be less than the binomial variance if there is any internal heterogeneity. Note that $-\delta_i/(n_i-1) = \phi_i$, where $\phi_i$ now equals the negative correlation which results between the component $U_{ij}$. Inter-unit heterogeneity and positive dependence of the $U_{ij}$ components have long been recognized to be equivalent. As far as I know, no one has previously shown the connection between internal response probability variation (or intra-unit heterogeneity) and negative dependence among the component $U_{ij}$.
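The underdispersion identity can be checked numerically; the component probabilities below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed, nonuniform component probabilities for a single unit (illustrative).
pi_ij = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
n = pi_ij.size
pi_bar = pi_ij.mean()

# Exact variance of a sum of independent, non-identical Bernoulli variables...
var_sum = (pi_ij * (1 - pi_ij)).sum()
# ...equals the binomial variance at pi_bar shrunk by the factor (1 - delta):
delta = ((pi_ij - pi_bar) ** 2).sum() / (n * pi_bar * (1 - pi_bar))
print(var_sum, n * pi_bar * (1 - pi_bar) * (1 - delta))   # identical

# Monte Carlo confirmation that the sum is underdispersed:
y = rng.binomial(1, pi_ij, size=(100_000, n)).sum(axis=1)
print(y.var(), n * pi_bar * (1 - pi_bar))   # simulated variance < binomial
```

Making all the `pi_ij` equal sets `delta` to 0 and recovers the plain binomial variance.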

References

[1] Adams, Greg D. 1997. "Abortion: Evidence of an Issue Evolution." American Journal of Political Science 41: 718-737.

[2] Altham, Patricia M.E. 1978. "Two Generalizations of the Binomial Distribution." Applied Statistics 27: 162-167.

[3] Allison, Paul D. 1987. "Introducing a Disturbance into Logit and Probit Regression Models." Sociological Methods and Research 15: 355-374.

[4] Cameron, A. Colin, and Pravin K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

[5] Collett, David. 1993. Modeling Binary Data.

[6] Baker, Andrew, and Ethan Scheiner. 1999. "Smart Parties in a Rigged System: Party Strategy and the Effects of Malapportionment under Japanese STV." Working paper.

[7] Cox, Gary W., and Jonathan N. Katz. 1999. "The Reapportionment Revolution and Bias in U.S. Congressional Elections." American Journal of Political Science 43: 812-841.

[8] Crowder, Martin J. 1978. "Beta-binomial Anova for Proportions." Applied Statistics 27: 34-37.

[9] Feller, William. [1950] 1968. An Introduction to Probability Theory and Its Applications. New York: John Wiley & Sons.

[10] Globetti, Suzanne. 1997. "What We Know About 'Don't Knows': An Analysis of Seven Point Issue Placements." Paper presented at the poster session of the 1997 Political Methodology Meeting, Columbus, Ohio, July 1997.

[11] Haseman, J.K., and L.L. Kupper. 1979. "Analysis of Dichotomous Response Data from Certain Toxicological Experiments." Biometrics 35: 281-293.

[12] King, Gary. 1989. Unifying Political Methodology: The Likelihood Theory of Statistical Inference. Cambridge: Cambridge University Press.

[13] Kleinman, Joel C. 1978. "Proportions with Extraneous Variance: Single and Independent Samples." Journal of the American Statistical Association 68: 46-53.

[14] Lindsey, J.K. 1995. Modelling Frequency and Count Data. Oxford: Oxford University Press.

[15] Lindsey, J.K. 1997. Applying Generalized Linear Models. New York: Springer-Verlag.

[16] McCullagh, P., and J.A. Nelder. 1989. Generalized Linear Models. London: Chapman and Hall.

[17] Futter, Stacy, and Walter Mebane. 1996. "Developments in Rape Law Reform in the U.S. and Effects of Reform on Rape Reports and Arrests, 1970-1992." Working paper.

[18] Prentice, R.L. 1986. "Binary Regression Using an Extended Beta-Binomial Distribution, With Discussion of Correlation Induced by Covariate Measurement Errors." Journal of the American Statistical Association 81: 321-327.

[19] von Mises, Richard. [1928] 1957. Probability, Statistics and Truth. New York: Dover Publications, Inc.

[20] Williams, D.A. 1975. "The Analysis of Binary Responses from Toxicological Experiments Involving Reproduction and Teratogenicity." Biometrics 31: 949-952.