A Count Data Frontier Model

This is an incomplete draft. Cite only as a working paper.

Richard A. Hofler (rhofler@bus.ucf.edu)
David Scrogin
Both of the Department of Economics, University of Central Florida, Orlando, FL

There are many cases in which a count variable is being either maximized or minimized. We propose one method for estimating the extent of inefficiency of a maximizing process that produces a nonnegative integer variable. It is based on the beta-binomial/negative binomial distribution model of Schmittlein et al. (1985). We show how this model can estimate the unobserved frontier (maximum potential) number of items for each observed count value and, consequently, the extent of inefficiency for each observed count value. Estimation of the model by maximum likelihood is illustrated on a sample of data in which individuals attempt to maximize the number of items assembled.

I. Introduction

There are many cases in which a discrete variable is being either maximized or minimized. Examples of the former include the number of new patents by firms, the number of wins by a sports team, a person's years of education, and the number of weeks worked by an individual. Conversely, it is reasonable to believe that minimization behavior occurs in situations involving the number of accidents along a certain stretch of roadway, the number of patient accidents in a medical care facility, the number of failures in a new product, the number of incorrect results (false positives and false negatives) from a medical test, the number of errors in the air space around an airport (i.e., letting airplanes get too close to each other), and the number of times a person is arrested during his or her lifetime.

Given the plethora of such optimizing situations involving nonnegative discrete variables, it is natural to ask how close the decision-makers come to the very best performance that they can attain. Since 1977 (Aigner, Lovell, and Schmidt; Meeusen and van den Broeck) a large and continually expanding literature on stochastic frontier analysis has investigated the extents and causes of inefficient behaviors and developed many models for such investigations. [1] One feature common to all of the publications in this literature is that they address continuous variables that are being maximized or minimized. No one (to our knowledge) has proposed a model for count data frontier analysis when the count variable is being maximized. (Fe-Rodriguez (2007) has proposed a frontier model for minimizing a count.) It is into this unexplored territory that this paper ventures. Specifically, we propose one method for estimating the extent of inefficiency of a maximizing process that produces a nonnegative integer variable. In other words, we propose a count data frontier model.

[1] Data Envelopment Analysis (DEA) is an alternative method for exploring the extents and causes of inefficient behaviors. It is a mathematical programming approach, whereas stochastic frontiers are econometric in nature.

II. The Count Data Frontier Model

To make the rest of this paper easier to follow, we first explain our context: producing a count variable. By this we mean that an entity (individual or firm) is engaged in a process (a maximizing one, in this model) whose outcome is a count variable (e.g., number of patents, number of wins, number of weeks worked during a period). In this process, there is a latent maximum possible (frontier) number of items (patents, wins, etc.) that can be produced. However, due to inefficiency, some percentage of that unobserved frontier number is not produced. The items that are produced are observed. The shortfall between the frontier output and the observed output is the extent of inefficiency.

For instance, assume that a firm is attempting to generate as many patents as possible each year. Further, imagine that the maximum possible number of patents that it can produce in a year is 17, and suppose that the firm generates only 9 patents in that year. The frontier outcome is then 17 patents, the observed (produced) output is 9 patents, and the extent of inefficiency is 8 unobserved (not produced) patents.

The four foundations (below) of this count data frontier model follow those listed in Schmittlein et al. (1985). That paper belongs to a relatively small literature on underreporting (or undercounting) of discrete variables. We revise their context from imperfectly recording purchases into inefficiently maximizing a count variable. Thus, where this literature refers to recorded purchases, we translate that concept into observed or produced output. Where this literature talks about the actual number of purchases (the sum of those that are recorded and those that are not), we discuss the frontier output or outcome.

i. The unobserved (maximum potential = frontier) count for an entity during a specified time period is Poisson distributed with mean $\lambda$:

$$P(N = n \mid \lambda) = \frac{\lambda^{n} e^{-\lambda}}{n!}, \qquad n = 0, 1, 2, \ldots;\ \lambda > 0 \tag{1}$$

ii. The distribution of $\lambda$ is a two-parameter gamma with pdf

$$f(\lambda) = \frac{\alpha^{r} \lambda^{r-1} e^{-\alpha\lambda}}{\Gamma(r)}, \qquad \lambda > 0;\ \alpha, r > 0 \tag{2}$$

iii. With probability $p$, a count item is observed. That is, a specific count item (the first item, the second item, etc.) may be either produced (so it is observed) or not produced (so it is not observed). Thus, the number of observed counts ($x$) is distributed binomial with pmf

$$P(X = x \mid n, p) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \qquad x = 0, 1, 2, \ldots, n;\ 0 < p < 1 \tag{3}$$

iv. The distribution of $p$ is beta with pdf

$$g(p) = \frac{1}{B(a, b)}\, p^{a-1} (1 - p)^{b-1}, \qquad 0 < p < 1;\ a, b > 0 \tag{4}$$

Points iii. and iv. reflect the fact that this model inherently takes the view that production has a binary dimension. For each entity there exists a frontier number of (count) items that can be produced. Starting from the first of those items, each one can be either produced (observed) or not. As a result, point iii. is a plausible way to model the binary situation of either producing an item (e.g., a patent, a win) or not. Point iv. captures the heterogeneity in production success (efficiency) across entities.
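The four foundations describe a generative process, which can be sketched as a small simulation. This is only an illustration under stated assumptions: the parameter values are hypothetical (they are not estimates from this paper), the function names are invented for the example, and the Poisson sampler is a simple textbook method that is adequate only for modest values of $\lambda$.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (fine for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_entity(r, alpha, a, b, rng):
    """Return (frontier count n, observed count x) for one entity."""
    lam = rng.gammavariate(r, 1.0 / alpha)       # ii: shape r, rate alpha (scale 1/alpha)
    n = draw_poisson(lam, rng)                   # i: frontier count given lam
    p = rng.betavariate(a, b)                    # iv: entity-specific success probability
    x = sum(rng.random() < p for _ in range(n))  # iii: each of the n items produced w.p. p
    return n, x


# Hypothetical parameter values, chosen only for illustration.
R, ALPHA, A, B = 2.0, 0.25, 3.0, 1.0
rng = random.Random(12345)
draws = [simulate_entity(R, ALPHA, A, B, rng) for _ in range(20000)]

mean_n = sum(n for n, _ in draws) / len(draws)
mean_x = sum(x for _, x in draws) / len(draws)
print(f"simulated mean frontier count: {mean_n:.2f} (theory r/alpha = {R / ALPHA:.2f})")
print(f"simulated mean observed count: {mean_x:.2f} "
      f"(theory r*a/(alpha*(a+b)) = {R * A / (ALPHA * (A + B)):.2f})")
```

By construction every simulated observed count is at most its frontier count, which is the property a maximizing frontier model needs.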

As is standard, combining assumptions (i) and (ii) gives us a negative binomial distribution (NBD):

$$P(N = n) = \frac{\Gamma(r + n)}{\Gamma(r)\, n!} \left(\frac{\alpha}{\alpha + 1}\right)^{r} \left(\frac{1}{\alpha + 1}\right)^{n}, \qquad n = 0, 1, 2, \ldots;\ r, \alpha > 0 \tag{5}$$

For future reference, recall the usual result for the NBD that

$$E[N] = \frac{r}{\alpha} \tag{6}$$

Here the NBD describes the distribution of frontier counts, which are the sum of observed and unobserved counts. The unobserved counts are the result of a process that is attempting to maximize a count variable but falls short by the amount of the unobserved count value. In other words, the unobserved count value reflects the inefficiency of the production process.

Similarly, assumptions (iii) and (iv) give us a beta-binomial (BB) model for the distribution of the observed counts given the frontier counts:

$$P(X = x \mid n) = \binom{n}{x} \frac{B(a + x,\ b + n - x)}{B(a, b)}, \qquad x = 0, 1, 2, \ldots, n;\ n = 0, 1, 2, \ldots \tag{7}$$

Finally, Schmittlein et al. (1985) derive the marginal distribution of observed counts:

$$P(X = x) = \frac{\Gamma(r + x)}{\Gamma(r)\, x!} \left(\frac{\alpha}{\alpha + 1}\right)^{r} \left(\frac{1}{\alpha + 1}\right)^{x} \frac{\Gamma(a + x)\, \Gamma(a + b)}{\Gamma(a)\, \Gamma(a + b + x)}\; {}_2F_1\!\left(r + x,\ b;\ a + b + x;\ \frac{1}{\alpha + 1}\right), \qquad x = 0, 1, 2, \ldots;\ a, b, r, \alpha > 0 \tag{8}$$

where ${}_2F_1(\cdot)$ is the Gauss hypergeometric function.
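Equations (5) and (8) can be evaluated numerically with nothing more than the log-gamma function and a power series for ${}_2F_1$. The sketch below is one way to do so, not the authors' code: the function names and the parameter values are hypothetical, and the series for ${}_2F_1$ is used because its argument $1/(\alpha + 1)$ always lies strictly between 0 and 1.

```python
import math


def hyp2f1(A, B, C, z, tol=1e-15, max_terms=100000):
    """Gauss hypergeometric function by its power series (valid for |z| < 1)."""
    term, total = 1.0, 1.0
    for m in range(max_terms):
        term *= (A + m) * (B + m) / ((C + m) * (m + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def nbd_pmf(n, r, alpha):
    """Equation (5): distribution of frontier counts."""
    log_p = (math.lgamma(r + n) - math.lgamma(r) - math.lgamma(n + 1)
             + r * math.log(alpha / (alpha + 1.0)) - n * math.log(alpha + 1.0))
    return math.exp(log_p)


def bbnbd_pmf(x, a, b, r, alpha):
    """Equation (8): marginal distribution of observed counts."""
    log_front = (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
                 + r * math.log(alpha / (alpha + 1.0)) - x * math.log(alpha + 1.0)
                 + math.lgamma(a + x) + math.lgamma(a + b)
                 - math.lgamma(a) - math.lgamma(a + b + x))
    return math.exp(log_front) * hyp2f1(r + x, b, a + b + x, 1.0 / (alpha + 1.0))


# Hypothetical parameter values, chosen only for illustration.
a, b, r, alpha = 3.0, 1.0, 2.0, 0.25

total_n = sum(nbd_pmf(n, r, alpha) for n in range(600))
total_x = sum(bbnbd_pmf(x, a, b, r, alpha) for x in range(400))
mean_n = sum(n * nbd_pmf(n, r, alpha) for n in range(600))
print(f"sum of (5): {total_n:.6f}, sum of (8): {total_x:.6f}")
print(f"mean of (5): {mean_n:.4f}, r/alpha: {r / alpha:.4f}")
```

Working on the log scale avoids overflow in the gamma functions for large counts; both probability functions should sum to one up to truncation error.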

Schmittlein et al. (1985) recognize this as a beta-binomial/negative binomial distribution (BB/NBD). The mean of this distribution is

$$E[X] = \frac{r a}{\alpha (a + b)} \tag{9}$$

Finally, it can be shown (Fader and Hardie, 2000) that the distribution of unobserved (frontier) counts, conditional on the observed counts, is given by

$$P(N = n \mid X = x) = \frac{\Gamma(r + n)}{\Gamma(r + x)\, (n - x)!} \left(\frac{1}{\alpha + 1}\right)^{n - x} \frac{\Gamma(a + b + x)\, \Gamma(b + n - x)}{\Gamma(a + b + n)\, \Gamma(b)}\; \frac{1}{{}_2F_1\!\left(r + x,\ b;\ a + b + x;\ \frac{1}{\alpha + 1}\right)}, \qquad n = 0, 1, 2, \ldots;\ x = 0, 1, 2, \ldots, n;\ a, b, r, \alpha > 0 \tag{10}$$

The expected value of frontier counts, conditional on the observed counts, is given by

$$E[N \mid X = x] = x + \frac{r + x}{\alpha + 1} \cdot \frac{B(a + x,\ b + 1)}{B(a + x,\ b)} \cdot \frac{{}_2F_1\!\left(r + x + 1,\ b + 1;\ a + b + x + 1;\ \frac{1}{\alpha + 1}\right)}{{}_2F_1\!\left(r + x,\ b;\ a + b + x;\ \frac{1}{\alpha + 1}\right)} \tag{11}$$

A model of maximizing behavior should possess the characteristic that the observed outcomes can never be greater than the frontier outcomes. Recall from (6) and (9) that $E[N] = r/\alpha$ and $E[X] = ra / (\alpha(a + b))$. Thus, $E[X] = \frac{a}{a + b} E[N]$. Since $a > 0$ and $b > 0$, the factor $a/(a + b)$ is strictly less than one, so on average the observed outcomes are always less than the frontier outcomes. Furthermore, it can easily be shown by repeatedly evaluating (10) for values of $x > n$ that $P(N = n \mid X = x) = 0$ for all values of $x > n$. Based on these two pieces of evidence, it appears that this model possesses the required characteristic that the observed outcomes can never be greater than the frontier outcomes.

III. Estimating the Count Data Frontier Model

The parameters of this model can be estimated by maximum likelihood. Let us assume that we have data on the counts $x_i$, $i = 1, 2, \ldots, I$, where $x_i$ is the number of observed counts for entity $i$. Assuming that the observations are independent, the likelihood is the product of the probabilities $P(X = x_i)$ over all observations. Grouping the observations by count value, with $f_x$ denoting the number of entities whose observed count equals $x$, the log-likelihood is given by

$$\ln L(a, b, r, \alpha \mid X) = \sum_{x = 0}^{x^{*}} f_x \ln P(X = x \mid a, b, r, \alpha) \tag{12}$$

where $x^{*} = \max\{x_1, x_2, \ldots, x_I\}$. See Fader and Hardie (2000) for more.

IV. Empirical Illustration

An electronics firm in the South asks job applicants for its assembly operation to take a test as part of the application process. Applicants are given written instructions about how to assemble a certain item and are then taken into a test room where they are faced with a large number of those items, unassembled. They are told that they have a specified amount of time to assemble as many items as they can. Their performance will be assessed in two ways: (i) how many items they assemble and (ii) how well they complete each assembly. They are told that the more items they correctly assemble, the better their chance of getting a job offer. This phase of the application process is designed to test cognitive ability, dexterity, and the applicant's ability to handle pressure.
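The log-likelihood of Section III is straightforward to code once the BB/NBD probabilities in (8) are available. The sketch below groups observations by count value and weights each log-probability by its frequency. It is an illustration only: the counts are made up (they are not the applicant data), the parameter values are hypothetical, and the function names are invented here. Maximizing the function would additionally require a numerical optimizer, which is not shown.

```python
import math
from collections import Counter


def hyp2f1(A, B, C, z, tol=1e-15, max_terms=100000):
    """Gauss hypergeometric function by its power series (valid for |z| < 1)."""
    term, total = 1.0, 1.0
    for m in range(max_terms):
        term *= (A + m) * (B + m) / ((C + m) * (m + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def bbnbd_pmf(x, a, b, r, alpha):
    """Marginal probability of an observed count x, equation (8)."""
    log_front = (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
                 + r * math.log(alpha / (alpha + 1.0)) - x * math.log(alpha + 1.0)
                 + math.lgamma(a + x) + math.lgamma(a + b)
                 - math.lgamma(a) - math.lgamma(a + b + x))
    return math.exp(log_front) * hyp2f1(r + x, b, a + b + x, 1.0 / (alpha + 1.0))


def log_likelihood(counts, a, b, r, alpha):
    """Frequency-weighted sum of log-probabilities over the distinct count values."""
    freq = Counter(counts)  # count values with zero frequency contribute nothing
    return sum(f * math.log(bbnbd_pmf(x, a, b, r, alpha)) for x, f in freq.items())


# Made-up counts and hypothetical parameter values, for illustration only.
counts = [0, 2, 3, 5, 5, 5, 6, 7, 9, 12]
params = dict(a=3.0, b=1.0, r=2.0, alpha=0.25)

ll_grouped = log_likelihood(counts, **params)
ll_per_obs = sum(math.log(bbnbd_pmf(x, **params)) for x in counts)
print(f"log-likelihood (grouped):         {ll_grouped:.6f}")
print(f"log-likelihood (per observation): {ll_per_obs:.6f}")
```

The grouped and per-observation sums agree, which is why the frequency-weighted form is a convenient way to evaluate the likelihood when many entities share the same count.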

We have the counts of how many items were assembled by each of 80 randomly selected applicants. The sample mean is 6.63, the sample mode is five, and the sample variance is [...]. The number of assembled items ranges from zero (two occurrences) up to 17 (one person). Table 1 contains the estimation results.

Table 1. MLE Results for Item Assembly Count Data

Parameter         Estimate    Standard Error    Significance
a                 [...]       [...]             <.01
b                 [...]       [...]             <.01
r                 [...]       [...]             <.01
α                 [...]       [...]             <.01
log likelihood    [...]

One immediate use to which these estimates can be put is to calculate several sample-wide mean values. These are the mean frontier count, the mean shortfall of observed counts below the estimated frontier count, and the mean percentage inefficiency. First, the mean frontier count is obtained by evaluating (6) using the estimates. This gives a value of 8.49 items that could have been assembled by each applicant, on average. The actual mean number of items assembled is 6.63, yielding an average shortfall of assembled items equal to 1.86. Finally, the mean applicant was 21.92% inefficient, which corresponds to the shortfall of 1.86 divided by 8.49. In other words, that applicant could have assembled 1.86 more items than were actually assembled, a shortfall of nearly 22% of the frontier count.

These means, while informative, likely obscure deeper insights that can be gained by examining the numbers of frontier counts (N) for different observed count (X) values. Table 2 shows the values of X from 0 to 17 (the largest observed sample value), the value of E[N | X = x] corresponding to each observed value, and the percentage inefficiency for each X value, that is, the shortfall E[N | X = x] - x expressed as a percentage of E[N | X = x].

One feature is immediately apparent when looking at this table: inefficiency varies greatly across the number of items actually assembled. [2] The largest percentage inefficiency (other than the obvious 100% when no items are assembled) is 77.4% for those who assembled only one item. The smallest shortfall is 1.1 items at the upper end of the distribution, which is a 6.3% inefficiency rate. These applicants assembled the most items, yet still could have done better. Perhaps not surprisingly, the inefficiency rate declines monotonically as the actual number of items assembled rises.

Table 2. Values of X, E[N | X = x], and Percentage Inefficiency for Each X Value

x    E[N | X = x]    % Inefficiency
[...]

[2] % inefficiency is calculated from E[N | X = x] values to four decimals (not rounded).
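A table of this kind can be produced directly from (11). The sketch below computes E[N | X = x] and the corresponding percentage inefficiency for x = 0, ..., 17. It is a sketch under assumptions: the parameter values are hypothetical, since the estimates in Table 1 are not available here, so the printed numbers will not match the paper's Table 2. The function names are invented for the example, and the ratio of beta functions in (11) is simplified to $b / (a + b + x)$.

```python
import math


def hyp2f1(A, B, C, z, tol=1e-15, max_terms=100000):
    """Gauss hypergeometric function by its power series (valid for |z| < 1)."""
    term, total = 1.0, 1.0
    for m in range(max_terms):
        term *= (A + m) * (B + m) / ((C + m) * (m + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def bbnbd_pmf(x, a, b, r, alpha):
    """Marginal probability of an observed count x, equation (8)."""
    log_front = (math.lgamma(r + x) - math.lgamma(r) - math.lgamma(x + 1)
                 + r * math.log(alpha / (alpha + 1.0)) - x * math.log(alpha + 1.0)
                 + math.lgamma(a + x) + math.lgamma(a + b)
                 - math.lgamma(a) - math.lgamma(a + b + x))
    return math.exp(log_front) * hyp2f1(r + x, b, a + b + x, 1.0 / (alpha + 1.0))


def expected_frontier(x, a, b, r, alpha):
    """Equation (11): E[N | X = x], using B(a+x, b+1) / B(a+x, b) = b / (a+b+x)."""
    z = 1.0 / (alpha + 1.0)
    ratio = hyp2f1(r + x + 1, b + 1, a + b + x + 1, z) / hyp2f1(r + x, b, a + b + x, z)
    return x + (r + x) * z * b / (a + b + x) * ratio


def pct_inefficiency(x, a, b, r, alpha):
    """Shortfall below the expected frontier, as a percentage of that frontier."""
    frontier = expected_frontier(x, a, b, r, alpha)
    return 100.0 * (frontier - x) / frontier


# Hypothetical parameter values, chosen only for illustration.
a, b, r, alpha = 3.0, 1.0, 2.0, 0.25

print(" x   E[N|X=x]   % inefficiency")
for x in range(18):
    print(f"{x:2d}   {expected_frontier(x, a, b, r, alpha):8.4f}   "
          f"{pct_inefficiency(x, a, b, r, alpha):8.2f}")
```

One consistency check on an implementation like this is the law of total expectation: weighting E[N | X = x] by the probabilities in (8) and summing over x should return the unconditional mean r/α from (6).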

Even more can be learned by looking deeper into these results. The values for the frontier counts, given the actual number of items assembled (E[N | X = x]), are, after all, the means of a distribution of the potential number of items that each individual could have assembled. Figure 1, showing the conditional distributions for different observed item counts, contains two examples of additional information that can be gleaned from these data. We chose to display the distributions for an observed count of zero items assembled and for five items (the modal number of items assembled).

[Figure 1. Conditional Distributions for Two Observed Counts. Top panel: P[N = n | X = 0]. Bottom panel: P[N = n | X = 5].]
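Conditional distributions like those in Figure 1 follow from (10). The sketch below evaluates P[N = n | X = x] for x = 0 and x = 5 and reports the probability that the frontier equals the observed count, which is the share of entities performing at capability. The parameter values are hypothetical and the function names are invented for the example, so the printed values are not those of Figure 1. For n < x the function returns zero explicitly, because 1/(n - x)! = 1/Γ(n - x + 1) vanishes there while the log-gamma function is undefined at that point.

```python
import math


def hyp2f1(A, B, C, z, tol=1e-15, max_terms=100000):
    """Gauss hypergeometric function by its power series (valid for |z| < 1)."""
    term, total = 1.0, 1.0
    for m in range(max_terms):
        term *= (A + m) * (B + m) / ((C + m) * (m + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def frontier_given_observed(n, x, a, b, r, alpha):
    """Equation (10): P(N = n | X = x); zero whenever n < x."""
    if n < x:
        return 0.0
    z = 1.0 / (alpha + 1.0)
    log_front = (math.lgamma(r + n) - math.lgamma(r + x) - math.lgamma(n - x + 1)
                 + (n - x) * math.log(z)
                 + math.lgamma(a + b + x) + math.lgamma(b + n - x)
                 - math.lgamma(a + b + n) - math.lgamma(b))
    return math.exp(log_front) / hyp2f1(r + x, b, a + b + x, z)


# Hypothetical parameter values, chosen only for illustration.
a, b, r, alpha = 3.0, 1.0, 2.0, 0.25

for x in (0, 5):
    at_capability = frontier_given_observed(x, x, a, b, r, alpha)
    could_do_more = sum(frontier_given_observed(n, x, a, b, r, alpha)
                        for n in range(x + 1, 600))
    print(f"x = {x}: P[N = x | X = x] = {at_capability:.4f}, "
          f"P[N > x | X = x] = {could_do_more:.4f}")
```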

The top panel of Figure 1 shows the conditional distribution of potential completed items for those who assembled zero items. This distribution reveals that only a little under 12% of those who failed to complete any items were performing at their capability. That is, this conditional distribution has only approximately 12% of its mass at zero, the number actually assembled by these applicants. The remaining 88% of those who did not assemble even one item could have assembled at least one. In fact, more than one-quarter of them could have assembled two or three items. Furthermore, approximately 20% of those who assembled no items could have completed seven or more. The information contained in this one conditional distribution shows the extent of the underachievement (inefficiency) exhibited by many of those in this group.

Similarly, the conditional distribution for 5 items assembled shows a range of potential performances. About one-third of the applicants who completed five items performed up to their potential. Obviously, then, two-thirds did not. Fully 20% of those who completed this modal number of items (5) could potentially have assembled 9 or more.

V. Conclusion

This paper proposes one method for estimating the extent of inefficiency for cases in which a count variable is being maximized. We show how this model can estimate a number of values relating to inefficiency in producing counts. First, the researcher can calculate the sample-wide mean extent of inefficiency and the mean shortfall of actual counts below frontier (maximum potential) counts. Second, one can determine the extent of inefficiency for every observed value of the count variable being maximized. Beyond that, one can derive and examine the distribution of the number of frontier counts for each value that was actually produced. Thus, this model provides a rich and informative set of results about the frontier number of items that can be produced and various aspects of inefficiency.

This model omits covariates in favor of representing heterogeneity in production and efficiency through the assumption of specific distributions for frontier counts, observed (produced) counts, and the probability of producing a count item. However, it seems that covariates could be introduced in a straightforward manner if they would strengthen this model and add to its ability to inform researchers about efficiency in count variable maximizing processes.

References

Fader, P.S. and B.G.S. Hardie. 2000. A note on modeling underreported Poisson counts. Journal of Applied Statistics, 27(8).

Fe-Rodriguez, E. 2007. Exploring a stochastic frontier model when the dependent variable is a count. The School of Economics Discussion Paper Series, The University of Manchester.

Schmittlein, D.C., A.C. Bemmaor, and D.G. Morrison. 1985. Why does the NBD model work? Robustness in representing product purchases, brand purchases and imperfectly recorded purchases. Marketing Science, 4(3).
