The BLP Method of Demand Curve Estimation in Industrial Organization


27 February 2006

Eric Rasmusen

Abstract

This is an exposition of the BLP method of structural estimation.

Dan R. and Catherine M. Dalton Professor, Department of Business Economics and Public Policy, Kelley School of Business, Indiana University, BU 456, 1309 E. 10th Street, Bloomington, Indiana. Office: (812) ... Fax: ... Copies of this paper can be found at ... I thank Fei Lu for her comments.

I. Introduction

I came to know Professor Moriki Hosoe when he led the project of translating the first edition of my book, Games and Information, into Japanese in 1990, and I visited him at Kyushu University. He realized, as I did, that a great many useful new tools had been developed for economists, and that explaining the tools to applied modellers would be extraordinarily useful because of the difficulty of reading the original expositions in journal articles. Here, my objective is similar: to try to explain a new technique, but in this case a statistical one, the BLP Method of econometric estimation, named after Berry, Levinsohn & Pakes (1995). I hope it will be a useful contribution to the festschrift in honor of Professor Hosoe. I, alas, do not speak Japanese, so I thank xxx for their translation of this paper. I know from editing the 1997 Public Policy and Economic Analysis with Professor Hosoe how time-consuming it can be to edit a collection of articles, especially with authors of mixed linguistic backgrounds, and so I thank Isao Miura and Tohru Naito for their editing of this volume.

The BLP Method is a way to estimate demand curves, a way that lends itself to testing theories of industrial organization. It combines a variety of new econometric techniques of the 1980s and 90s. Philosophically, it is in the style of structural modelling, in which empirical work starts with a rigorous theoretical model in which players maximize utility and profit functions, and everything, including the disturbance terms, has an economic interpretation; this is the style for which McFadden won the Nobel Prize. That is in contrast to the older, reduced-form, approach, in which the economist essentially looks for conditional correlations consistent with his theory. Both approaches remain useful. What we get with the structural approach is the assurance that we do have a self-consistent theory and the ability to test much finer hypotheses about economic behavior. What we lose is simplicity and robustness to specification error.

Other people have written explanations of the BLP Method. Nevo (2000) is justly famous for doing this, an interesting case of a commentary on a paper being a leading article in the literature itself. (Recall, though, Bertrand's 1883 comment on Cournot or Hicks's 1937 "Mr. Keynes and the Classics.") The present paper will be something of a commentary on a

commentary, because I will use Nevo's article as my foundation. I will use his notation and equation numbering, and point out typos in his article as I go along. I hope that this paper, starting from the most basic problem of demand estimation, will be useful to those who, like myself, are trying to understand modern structural estimation.

II. The Problem

Suppose we are trying to estimate a market demand curve. We have available t = 1,..., 20 months of data on a product, data consisting of the quantity sold, $q_t$, and the price, $p_t$. Our theory is that demand is linear, with this equation:

$$q_t(p_t) = \alpha - \beta p_t + \varepsilon_t.  \quad (1)$$

Let's start with an industry subject to price controls. A mad dictator sets the price each month, changing it for entirely whimsical reasons. The result will be data that looks like Figure 1a. That data nicely fits our model in equation (1), as Figure 1b shows.
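To make the price-control case concrete, here is a minimal simulation of my own, not from the paper: the parameter values and the disturbance scale are made up, and OLS is computed directly from its matrix formula. With exogenous prices, the regression recovers the demand parameters of equation (1).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20
alpha, beta = 100.0, 2.0                    # made-up true demand parameters
p = rng.uniform(5, 25, T)                   # the dictator's whimsical prices: exogenous
q = alpha - beta * p + rng.normal(0, 3, T)  # demand, as in equation (1)

# OLS of q on a constant and p recovers (alpha, -beta) because p is exogenous
X = np.column_stack([np.ones(T), p])
print(np.linalg.solve(X.T @ X, X.T @ q))    # approximately [100, -2]
```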

Figure 1: Supply and Demand with Price Controls

Next, suppose we do not have price controls. Instead, we have a situation of normal supply and demand. The problem is that now we might observe data like that in Figure 2a. Quantity rises with price; the demand curve seems to slope the wrong way, if we use ordinary least squares (OLS) as in Figure 2b.

Figure 2: Supply and Demand without Price Controls

The solution to the paradox is shown in Figure 2c: OLS estimates the supply curve, not the demand curve. This is what Working (1927) pointed out.

It could be that the unobservable variables $\varepsilon_t$ are what are shifting the demand curve in Figure 2. Or, it could be that it is some observable variable that we have left out of our theory. So perhaps we should add income, $y_t$:

$$q_t(p_t) = \alpha - \beta p_t + \gamma y_t + \varepsilon_t.  \quad (2)$$

Note, however, that if the supply curve never shifted, we still wouldn't be able to estimate the effect of price on quantity demanded. We need some observed variable that shifts supply.
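A companion sketch, again with made-up supply and demand parameters, shows Working's point numerically: when demand shifts a lot and supply shifts little, OLS on equilibrium data traces out the supply slope, not the demand slope.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 500
alpha, beta = 100.0, 2.0     # demand: q = alpha - beta*p + eps (made-up values)
a, b = 10.0, 1.0             # supply: q = a + b*p + u
eps = rng.normal(0, 10, T)   # demand shifts a lot...
u = rng.normal(0, 1, T)      # ...while supply hardly shifts

p = (alpha - a + eps - u) / (beta + b)   # market-clearing price each period
q = a + b * p + u                        # equilibrium quantity, on the supply curve

X = np.column_stack([np.ones(T), p])
print(np.linalg.solve(X.T @ X, X.T @ q)[1])  # near +1, the supply slope, not -2
```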

Or, another approach would be to try to use a different kind of data. What if we could get data on individual purchases of product j? Our theory for the individual is different from our theory of the market. The demand curve looks much the same, except now we have a subscript for consumer i:

$$q_{it}(p_t) = \alpha_i - \beta_i p_t + \gamma_i y_{it} + \varepsilon_{it}.  \quad (3)$$

But what about the supply curve? From the point of view of any one small consumer, the supply curve is flat. He has only a trivial influence on the quantity supplied and the market price, an influence we ignore in theories of perfect competition, where the consumer is a price-taker. Thus, we are in the situation of Figure 3, which is very much like Figure 1.

Figure 3: The Individual's Demand Curve

There is indeed one important difference: in Figure 3, all we've done is estimate the demand curve for one consumer. That is enough, if we are willing to simplify our theory to assume that all consumers have identical demand functions:

$$q_{it}(p_t) = \alpha - \beta p_t + \gamma y_{it} + \varepsilon_{it}.  \quad (4)$$

This is not the same as assuming that all consumers have the same quantity demanded, since they will still differ in income $y_{it}$ and unobserved

influences, $\varepsilon_{it}$, but it does say that the effect of an increase in price on quantity is the same for all consumers. If we are willing to accept that, however, we can estimate the demand curve for one consumer, and we can use our estimate $\hat\beta$ for the market demand curve. Or, we can use data on n different consumers, to get more variance in income, and estimate $\hat\beta$ that way.

But we would not have to use the theory that consumers are identical. There are two alternatives. The first alternative is to use the demand of one or n consumers anyway, arriving at the same estimate $\hat\beta$ that we would under the theory of equation (4). The interpretation would be different, though: it would be that we have estimated the average value of $\beta$, and interpreting the standard errors would be harder, since they would be affected by the amount of heterogeneity in $\beta_i$ as well as in $\varepsilon_{it}$. (The estimates would be unbiased, though, unlike the estimates I criticize in Rasmusen (1989a,b), since $p_t$ is not under the control of the consumer.) Or, we could estimate the $\beta_i$ for all n consumers and then average the n estimates to get a market $\beta$, as opposed to running one big regression. One situation in which this would clearly be the best approach is if we had reason to believe that the n consumers in our sample were not representative of the entire population. In that case, running one regression on all the data would result in a biased estimate, as a simple consequence of starting with a biased sample. Instead, we could run n separate regressions, and then compute a weighted average of the estimates, weighting each type of consumer by how common his type was in the population.

Individual consumer data, however, is no panacea. For one thing, it is hard to get, especially since it is important to get a representative sample of the population. For another, its freedom from the endogeneity problem is deceptive. Recall that we assumed that each individual's demand had no effect on the market price. That is not literally correct, of course: every one of 900,000 buyers of toothbrushes has some positive if trivial effect on market sales. If one of them decides not to buy a toothbrush, sales fall to 899,999. That effect is so small that the one consumer can ignore it, and the econometrician could not possibly estimate it given even a small amount of noise in the data. The problem is that changes in individuals' demand are unlikely to be statistically independent of each other. When

the unobservable variable $\varepsilon_{it}$ for consumer $i = 900{,}000$ is unusually negative, so too in all likelihood is the unobservable variable $\varepsilon_{it}$ for consumer $i = 899{,}999$. Thus, they will both reduce their purchases at the same time, which will move the equilibrium to a new point on the supply curve, reducing the market price. Price is endogenous for the purposes of estimation, even though it is exogenous from the point of view of any one consumer.

So we are left with a big problem: identification for demand estimation. The analyst needs to use instrumental variables, finding some variable that is correlated with the price but not with anything else in the demand equation, or else he must find a situation like our initial price control example where prices are exogenous. But in fact even price controls might not lead to exogenous prices. A mad dictator is much more satisfactory, at least if he is truly mad. Suppose we have a political process setting the price controls, either a democratic one or a sane dictator who is making decisions with an eye to everything in the economy and public opinion too. When is politics going to result in a higher regulated price? Probably when demand is stronger and quantity is greater. If the supply curve would naturally slope up, both buyers and sellers will complain more if the demand curve shifts out and the regulated price does not change. Thus, even the regulated price will have a supply curve.

All of these problems arise in any method used to estimate a demand curve, whether it be the reduced-form methods just described or the BLP structural method. One thing about structural methods is that they force us to think more about the econometric problems. Your structural model will say what the demand disturbance is: unobserved variables that influence demand. If you must build a maximizing model of where regulated prices come from, you will realize that they might depend on those unobserved variables.

What does all this have to do with industrial organization? Isn't it just price theory, or even just consumer theory? Where are the firms? The reason this is so important in industrial organization is that price theory underlies it. Production starts because somebody demands something. An entrepreneur then discovers supply. That entrepreneur needs to organize

the supply, and so we have the firm. Other entrepreneurs compete, and we have an industry. How consumers react to price changes is fundamental to this. Natural extensions to this problem bring in most of how firms behave. Demand for a product depends on the prices of all firms in the industry, and so we bring in the theory of oligopoly. Demand depends on product characteristics, and so we have monopolistic competition and location theory. Demand depends on consumer information, and so we have search theory, advertising, adverse selection, and moral hazard. As we have seen, estimating demand inevitably brings in supply. Or, if you like, you could think of starting with the problem of estimating the supply curve in a perfectly competitive industry, a problem that can be approached in the same way as we approach demand here.

III. The Structural Approach

Let us now start again, but with a structural approach. We will not begin with a demand curve this time. Instead, we will start with consumer utility functions. The standard approach in microeconomic theory is to start with the primitives of utility functions (or even preference orderings) and production functions and then see how maximizing choices of consumers and firms result in observed behavior. Or, in game theory terms, we begin with players, actions, information, and payoffs, and see what equilibrium strategies result from players choosing actions to try to maximize their payoffs given their information.

Suppose we are trying to estimate a demand elasticity: how quantity demanded responds to price. We have observations from 20 months of cereal market data, the same 50 products of cereal for each month, which makes a total of 1,000 data points. We also have data on 6 characteristics of each cereal product, and we have demographic data on how 4 consumer characteristics are distributed in the population in each month. Each type of consumer decides which product of cereal to buy, buying either one or zero units. We do not observe individual decisions, but we will model them so we can aggregate them to obtain the cereal product market shares that we do observe. Let there be I = 400 consumers. It may seem strange to specify a number of consumers, since we do not observe their individual behavior, but it will allow us to build our theoretical model up

from the fundamentals of individual consumer choice rather than starting with aggregate consumption in each month. The number will be important in the estimation itself, when we will randomly sample consumers using our knowledge of the distribution of types of consumers in a given month, and our sample will be closer to the population the larger is I. The fact that the frequency of different consumer types changes over time is the variance in the data that allows us to estimate how each of the 4 consumer characteristics affects demand.

At this point, we could decide to estimate the elasticity of demand for each product and all the cross-elasticities directly, but with 50 products that would require estimating 2,500 numbers. Of course, theory tells us that many of the cross-partials equal each other, but the number of parameters would be too large to estimate even if we dropped half of them, particularly since we would not only need a large number of observations, but enough variability in the prices to see how each product's demand changes when the price of another product changes. Instead, we will focus on the product characteristics. There are only 6 of these, and there would only be 6 even if there were 500 products of cereal instead of 50. In effect, we will be estimating cross-elasticities between cereal characteristics, but if we know those numbers, we can build up to the cross-elasticities between products using the characteristic levels of each product.

The Consumer Decision

The utility of consumer i if he were to buy product j in month t is given by the following equation, denoted equation (1N) because it is equation (1) in Nevo (2001):

$$u_{ijt} = \alpha_i(y_i - p_{jt}) + \mathbf{x}_{jt}\boldsymbol{\beta}_i + \xi_{jt} + \varepsilon_{ijt}, \quad i = 1, \ldots, 400, \; j = 1, \ldots, 50, \; t = 1, \ldots, 20,  \quad (1N)$$

where $y_i$ is the income of consumer i (which is unobserved and which we will assume does not vary across time), $p_{jt}$ is the observed price of product j in month t, $\mathbf{x}_{jt}$ is a 6-dimensional vector of observed characteristics of product j in month t, $\xi_{jt}$ (the Greek letter xi) is a disturbance scalar summarizing unobserved characteristics of product j in month t, and $\varepsilon_{ijt}$ is the usual unobserved disturbance with mean zero. The parameters to be estimated

are consumer i's marginal utility of income, $\alpha_i$, and his marginal utility of product characteristics, the 6-vector $\boldsymbol{\beta}_i$. I have boldfaced the symbols for vectors and matrices above, and will do so throughout the article.

Consumer i also has the choice to not buy any product at all. We will model this outside option as buying product 0 and normalize by setting the j = 0 parameters equal to zero (or, if you like, by assuming that it has a zero price and zero values of the characteristics):

$$u_{i0t} \equiv \alpha_i y_i + \varepsilon_{i0t}.  \quad (5)$$

Equation (1N) is an indirect utility function, depending on income $y_i$ and price $p_{jt}$ as well as on the real variables $\mathbf{x}_{jt}$, $\xi_{jt}$, and $\varepsilon_{ijt}$. It is easily derived from a quasilinear utility function, however, in which the consumer's utility is the utility from his consumption of one (or zero) of the 50 products, plus utility which is linear in his consumption of the outside good. Quasilinear utility is not concave in income, so it lacks income effects, but if those are important, they can be modelled by indirect utility that is a function not of $(y_i - p_{jt})$ but of some concave function such as $\log(y_i - p_{jt})$, as in BLP (1995).

A consumer's utility depends both on a product's characteristics ($\mathbf{x}_{jt}$) and directly on the product ($\xi_{jt}$), but in the characteristics approach we are using, we implicitly assume that the direct marginal utilities are the result of unobserved characteristics that are uncorrelated across products. Consumer characteristics do not appear in the utility function. They play a role later in the model, in determining $\boldsymbol{\beta}_i$, the marginal utility of product characteristics. Consumer i's income does appear, but it does not change over time, which may seem strange. That is because we are not going to follow consumer i over time as his income changes; he is a type of consumer, and what will change over time is the number of consumers with a given income.

We will assume that $\varepsilon_{ijt}$ follows the Type I extreme-value distribution, which if it has location parameter zero and scale parameter one has the density and cumulative distribution

$$f(x) = e^{-x} e^{-e^{-x}}, \qquad F(x) = e^{-e^{-x}}.  \quad (6)$$

This is the limiting distribution of the maximum value of a series of draws of independent identically distributed random variables.
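A quick numerical check of equation (6) and of that maximum property, using NumPy's Gumbel sampler (the Type I extreme-value distribution is the Gumbel distribution); the numbers are arbitrary. The maximum of 50 standard draws is itself a Type I extreme-value variable shifted right by $\ln 50$, so its mean is $\ln 50$ plus Euler's constant.

```python
import numpy as np

rng = np.random.default_rng(2)

# Density and CDF from equation (6); f is shown for completeness
f = lambda x: np.exp(-x) * np.exp(-np.exp(-x))
F = lambda x: np.exp(-np.exp(-x))

x0 = 1.0
print(np.mean(rng.gumbel(size=100_000) <= x0), F(x0))   # empirical vs. exact CDF

# Max-stability: the max of 50 i.i.d. standard Gumbel draws is Gumbel too
draws = rng.gumbel(size=(100_000, 50)).max(axis=1)
print(draws.mean(), np.log(50) + np.euler_gamma)        # the two agree closely
```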

Figure 4 illustrates the density, which is not dissimilar to the normal distribution, except that it is slightly asymmetric and has thicker tails. This distribution is standardly used in logit models because its cumulative distribution is related to the probability of x being larger than any other of a number of draws, which is like the random utility from one choice being higher than that from a number of other choices. This leads to a convenient formula for the probability that a consumer makes a particular choice, and thus for a product's market share, as we will see below.

Figure 4: The Type I Extreme-Value Distribution (from the Engineering Statistics Handbook)

Consumer i buys product j in month t if it yields him the highest utility of any product. What we observe, though, is not consumer i's decision, but the market share of product j. Also, though we do not observe $\boldsymbol{\beta}_i$, consumer i's marginal utility of product characteristics, we do observe a sample of observable characteristics. Even if we did observe his decision, we would still have to choose between regular logit and BLP's random-coefficients logit, depending on whether we assumed that every

consumer had the same marginal utility of characteristics $\boldsymbol{\beta}$ or whether $\boldsymbol{\beta}_i$ depended on consumer characteristics.

Simple Logit

One way to proceed would be to assume that all consumers are identical in their taste parameters, i.e., that $\alpha_i = \alpha$ and $\boldsymbol{\beta}_i = \boldsymbol{\beta}$, and that the $\varepsilon_{ijt}$ disturbances are uncorrelated across i's. Then we have the simple multinomial logit model (multinomial because there are multiple choices, not just two). The utility function reduces to the following form, which is (1N) except that the parameters are no longer i-specific:

$$u_{ijt} = \alpha(y_i - p_{jt}) + \mathbf{x}_{jt}\boldsymbol{\beta} + \xi_{jt} + \varepsilon_{ijt}, \quad i = 1, \ldots, 400, \; j = 1, \ldots, 50, \; t = 1, \ldots, 20.  \quad (5N)$$

Now the coefficients are the same for all consumers, but incomes differ. We can aggregate by adding up the incomes, however, since the coefficient on each consumer has the same value, $\alpha$. Thus we obtain an aggregate utility function,

$$u_{jt} = \alpha(y - p_{jt}) + \mathbf{x}_{jt}\boldsymbol{\beta} + \xi_{jt} + \varepsilon_{jt}, \quad j = 1, \ldots, 50, \; t = 1, \ldots, 20.  \quad (7)$$

If we assume that $\varepsilon_{jt}$ follows the Type I extreme-value distribution, then this is the multinomial logit model. Since $\varepsilon_{jt}$ follows the Type I extreme-value distribution by assumption, it turns out that the market share of product j under our utility function is

$$s_{jt} = \frac{e^{\mathbf{x}_{jt}\boldsymbol{\beta} - \alpha p_{jt} + \xi_{jt}}}{1 + \sum_{k=1}^{50} e^{\mathbf{x}_{kt}\boldsymbol{\beta} - \alpha p_{kt} + \xi_{kt}}}.  \quad (6N)$$

Equation (6N) is by no means obvious. The market share of product j is the probability that j has the highest utility, which occurs if $\varepsilon_{jt}$ is high enough relative to the other disturbances.
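Equation (6N) is simple to compute, whatever its derivation. Here is a sketch with made-up numbers for three products and two characteristics; the outside good's share is whatever is left over.

```python
import numpy as np

def logit_shares(x, p, beta, alpha, xi):
    """Market shares from equation (6N); the outside good contributes
    the 1 in the denominator because its utility is normalized to zero."""
    v = x @ beta - alpha * p + xi          # mean utility of each product
    ev = np.exp(v)
    return ev / (1.0 + ev.sum())

# made-up example: 3 products, 2 characteristics
x = np.array([[1.0, 0.5], [1.0, 1.5], [0.0, 2.0]])
p = np.array([2.0, 3.0, 2.5])
s = logit_shares(x, p, beta=np.array([0.8, 0.4]), alpha=0.5, xi=np.zeros(3))
print(s, 1 - s.sum())   # shares of the 3 products and of the outside good
```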

The probability that product 1 has a higher utility than the other 49 products and the outside good (which has a utility normalized to zero) is thus

$$\Pr\left(\begin{aligned} \alpha(y - p_{1t}) + \mathbf{x}_{1t}\boldsymbol{\beta} + \xi_{1t} + \varepsilon_{1t} &> \alpha(y - p_{2t}) + \mathbf{x}_{2t}\boldsymbol{\beta} + \xi_{2t} + \varepsilon_{2t}, \\ &\;\;\vdots \\ \alpha(y - p_{1t}) + \mathbf{x}_{1t}\boldsymbol{\beta} + \xi_{1t} + \varepsilon_{1t} &> \alpha(y - p_{50t}) + \mathbf{x}_{50t}\boldsymbol{\beta} + \xi_{50t} + \varepsilon_{50t}, \\ \alpha(y - p_{1t}) + \mathbf{x}_{1t}\boldsymbol{\beta} + \xi_{1t} + \varepsilon_{1t} &> \alpha y + \varepsilon_{0t} \end{aligned}\right).  \quad (8)$$

Substituting the Type I extreme-value distribution into equation (8) and solving this out yields, after much algebra, equation (6N). Since $\alpha y$ appears on both sides of each inequality, it drops out. The 1 in equation (6N) is there because of the outside good, with its 0 utility, since $e^0 = 1$.

Elasticities of Demand

To find the elasticity of demand, we need to calculate $\partial s_{jt}/\partial p_{kt}$ for products k = 1,..., 50. It is helpful to rewrite equation (6N) by defining $M_j$ as

$$M_j \equiv e^{\mathbf{x}_{jt}\boldsymbol{\beta} - \alpha p_{jt} + \xi_{jt}},  \quad (9)$$

so

$$s_{jt} = \frac{M_j}{1 + \sum_{m=1}^{50} M_m}.  \quad (10)$$

Then

$$\frac{\partial s_{jt}}{\partial p_{kt}} = \frac{\partial M_j/\partial p_{kt}}{1 + \sum_{m=1}^{50} M_m} - \frac{M_j \left(\partial \sum_{m=1}^{50} M_m / \partial p_{kt}\right)}{\left(1 + \sum_{m=1}^{50} M_m\right)^2}.  \quad (11)$$

First, suppose $k \neq j$. Then

$$\frac{\partial s_{jt}}{\partial p_{kt}} = 0 - \frac{M_j(-\alpha M_k)}{\left(1 + \sum_{m=1}^{50} M_m\right)^2} = \alpha \left(\frac{M_j}{1 + \sum_{m=1}^{50} M_m}\right)\left(\frac{M_k}{1 + \sum_{m=1}^{50} M_m}\right) = \alpha s_{jt} s_{kt}.  \quad (12)$$

Second, suppose $k = j$. Then

$$\frac{\partial s_{jt}}{\partial p_{jt}} = \frac{-\alpha M_j}{1 + \sum_{m=1}^{50} M_m} - \frac{M_j(-\alpha M_j)}{\left(1 + \sum_{m=1}^{50} M_m\right)^2} = -\alpha s_{jt} + \alpha s_{jt}^2 = -\alpha s_{jt}(1 - s_{jt}).  \quad (13)$$

We can now calculate the elasticity of the market share: the percentage change in the market share of product j when the price of product k goes up:

$$\eta_{jkt} \equiv \frac{\% \Delta s_{jt}}{\% \Delta p_{kt}} = \frac{\partial s_{jt}}{\partial p_{kt}} \cdot \frac{p_{kt}}{s_{jt}} = \begin{cases} -\alpha p_{jt}(1 - s_{jt}) & \text{if } j = k \\ \alpha p_{kt} s_{kt} & \text{otherwise.} \end{cases}  \quad (14)$$

Problems with Multinomial Logit

The theoretical structure of the elasticities in equation (14) is unrealistic in two ways.

1. If market shares are small, as is frequently the case, then $\alpha(1 - s_{jt})$ is close to $\alpha$, so that own-price elasticities are close to $-\alpha p_{jt}$. This says that if the price is lower, demand is less elastic, less responsive to price, which in turn implies that the seller will charge a higher markup on products with low marginal cost. There is no particular reason why we want to assume this, and in reality we often see that markups are higher on products with higher marginal cost, e.g. luxury cars compared to cheap cars.

2. The cross-price elasticity of product j with respect to the price of product k is $\alpha p_{kt} s_{kt}$, which depends only on features of product k: its price and market share. If product k raises its price, it loses customers equally to each other product. (An apparent third feature, that a high price of product k means it has higher cross-elasticities with every other product, is actually the same problem as problem 1.) This is a standard defect of multinomial logit, which McFadden's famous red-bus/blue-bus example illustrates.
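A small numeric check of equation (14), with made-up prices and shares, makes the second defect visible: every off-diagonal entry in column k of the elasticity matrix is the same number, $\alpha p_{kt} s_{kt}$.

```python
import numpy as np

def logit_elasticities(alpha, p, s):
    """Elasticity matrix from equation (14): eta[j, k] is the percent change
    in share j when price k rises by one percent."""
    J = len(s)
    eta = np.empty((J, J))
    for j in range(J):
        for k in range(J):
            eta[j, k] = -alpha * p[j] * (1 - s[j]) if j == k else alpha * p[k] * s[k]
    return eta

p = np.array([2.0, 3.0, 2.5])    # made-up prices
s = np.array([0.30, 0.20, 0.25]) # made-up shares
print(logit_elasticities(0.5, p, s))
# Each column k has identical off-diagonal entries alpha*p_k*s_k: product k's
# lost customers spread proportionally over all rivals -- the IIA defect.
```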

If you can choose among going to work by riding the red bus, the blue bus, or a bicycle, and the price of riding the red bus rises, are you equally likely to shift to the blue bus and to riding a bicycle? Of course not, but the multinomial logit model says that you will.

Another way to proceed is to use nested logit. In the red-bus/blue-bus example, you would first decide whether the blue bus or the red bus had the highest utility, and then decide whether the best bus's utility was greater than the bicycle's. In such a model, if the price of the red bus rose, you might switch to the blue bus, but you would not switch to the bicycle. A problem with nested logit, however, is that you need to use prior information to decide how to construct the nesting. In the case of automobiles, we might want to make the first choice to be between a large car and a small car, but it is not clear that this makes more sense than making the first choice to be between a high-quality car and a low-quality car, especially if we are forcing all consumers to use the same nesting.

The Random Coefficients Logit Model

An alternative to simple logit or nested logit is to assume that the parameters, the marginal utilities of the product characteristics, are different across consumers, and are determined by the consumer characteristics. "Random coefficients" is the name used for this approach, though it is a somewhat misleading name. The approach does not say that the consumer behaves randomly. Rather, each consumer has fixed coefficients in his utility function, but these coefficients are a function both of fixed parameters that multiply his observed characteristics and of unobservable characteristics that might as well be random. Thus, "random coefficients" really means "individual coefficients." We will denote the average values of the parameters $\alpha_i$ and $\boldsymbol{\beta}_i$ across consumers as $\alpha$ and $\boldsymbol{\beta}$, and assume the following specification:

$$\begin{pmatrix} \alpha_i \\ \boldsymbol{\beta}_i \end{pmatrix} = \begin{pmatrix} \alpha \\ \boldsymbol{\beta} \end{pmatrix} + \boldsymbol{\Pi}\mathbf{D}_i + \boldsymbol{\Sigma}\boldsymbol{\nu}_i = \begin{pmatrix} \alpha \\ \boldsymbol{\beta} \end{pmatrix} + \begin{pmatrix} \boldsymbol{\Pi}_\alpha \\ \boldsymbol{\Pi}_\beta \end{pmatrix} \mathbf{D}_i + \begin{pmatrix} \boldsymbol{\Sigma}_\alpha \\ \boldsymbol{\Sigma}_\beta \end{pmatrix} \begin{pmatrix} \boldsymbol{\nu}_{i\alpha} \\ \boldsymbol{\nu}_{i\beta} \end{pmatrix},  \quad (2N)$$

where $\mathbf{D}_i$ is a $4 \times 1$ vector of consumer i's observable characteristics, $\boldsymbol{\nu}_i$ is a $7 \times 1$ vector of the effect of consumer i's unobservable characteristics on his $\alpha_i$ and $\boldsymbol{\beta}_i$ parameters, $\boldsymbol{\Pi}$ is a $7 \times 4$ matrix of how the parameters (the $\alpha_i$ and the 6 elements of $\boldsymbol{\beta}_i$) depend on consumer observables, $\boldsymbol{\Sigma}$ is a $7 \times 7$ matrix of how those 7 parameters depend on the unobservables, and $(\boldsymbol{\nu}_{i\alpha}, \boldsymbol{\nu}_{i\beta})$, $(\boldsymbol{\Pi}_\alpha, \boldsymbol{\Pi}_\beta)$ and $(\boldsymbol{\Sigma}_\alpha, \boldsymbol{\Sigma}_\beta)$ just split each vector or matrix into two parts. We will denote the distributions of $\mathbf{D}$ and $\boldsymbol{\nu}$ by $P_D(D)$ and $P_\nu(\nu)$. Since we'll be estimating the distribution of the consumer characteristics $\mathbf{D}$, you will see the notation $\hat{P}_D(D)$ show up too. We will assume that $P_\nu(\nu)$ is multivariate normal. (Nevo variously uses $P_D(D)$, $\hat{P}_D(D)$, $P_\nu(\nu)$, and $\hat{P}_\nu(\nu)$ in his exposition. I have tried to follow him, but I may simply have misunderstood what he is doing.)
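A sketch of the specification (2N), with invented values for the mean tastes, $\Pi$, and $\Sigma$: each simulated consumer's coefficients are the population means shifted by his demographics and by a normal draw.

```python
import numpy as np

rng = np.random.default_rng(3)
K, R = 7, 4                      # 7 taste parameters (alpha + 6 betas), 4 demographics
theta_bar = rng.normal(size=K)   # mean tastes (alpha, beta): made up
Pi = rng.normal(scale=0.2, size=(K, R))    # taste response to observed demographics
Sigma = np.diag(rng.uniform(0.1, 0.5, K))  # taste response to unobserved nu_i

def consumer_tastes(D_i):
    """Equation (2N): (alpha_i, beta_i) = mean + Pi D_i + Sigma nu_i."""
    nu_i = rng.normal(size=K)    # P_nu assumed multivariate standard normal
    return theta_bar + Pi @ D_i + Sigma @ nu_i

tastes = consumer_tastes(rng.normal(size=R))
alpha_i, beta_i = tastes[0], tastes[1:]
print(alpha_i, beta_i)
```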

Utility in the Random Coefficients Logit Model

Equation (1N) becomes (there is a typographical error in the Nevo paper on p. 520 in equation (3N): u appears instead of µ in each line):

$$\begin{aligned} u_{ijt} &= \alpha_i(y_i - p_{jt}) + \mathbf{x}_{jt}\boldsymbol{\beta}_i + \xi_{jt} + \varepsilon_{ijt} \\ &= \alpha_i y_i - (\alpha + \boldsymbol{\Pi}_\alpha \mathbf{D}_i + \boldsymbol{\Sigma}_\alpha \boldsymbol{\nu}_{i\alpha})p_{jt} + \mathbf{x}_{jt}(\boldsymbol{\beta} + \boldsymbol{\Pi}_\beta \mathbf{D}_i + \boldsymbol{\Sigma}_\beta \boldsymbol{\nu}_{i\beta}) + \xi_{jt} + \varepsilon_{ijt} \\ &= \alpha_i y_i + (-\alpha p_{jt} + \mathbf{x}_{jt}\boldsymbol{\beta} + \xi_{jt}) + \big[-(\boldsymbol{\Pi}_\alpha \mathbf{D}_i + \boldsymbol{\Sigma}_\alpha \boldsymbol{\nu}_{i\alpha})p_{jt} + \mathbf{x}_{jt}(\boldsymbol{\Pi}_\beta \mathbf{D}_i + \boldsymbol{\Sigma}_\beta \boldsymbol{\nu}_{i\beta})\big] + \varepsilon_{ijt} \\ &= \alpha_i y_i + (-\alpha p_{jt} + \mathbf{x}_{jt}\boldsymbol{\beta} + \xi_{jt}) + (-p_{jt}, \mathbf{x}_{jt})(\boldsymbol{\Pi}\mathbf{D}_i + \boldsymbol{\Sigma}\boldsymbol{\nu}_i) + \varepsilon_{ijt} \\ &= \alpha_i y_i + \delta_{jt} + \mu_{ijt} + \varepsilon_{ijt}, \quad j = 1, \ldots, 50, \; t = 1, \ldots, 20. \end{aligned}  \quad (15)$$

What I have done above is to reorganize the terms to separate them into four parts.

First, there is the utility from income, $\alpha_i y_i$. This plays no part in the consumer's choice, so it will drop out.

Second, there is the mean utility, $\delta_{jt}$, which is the component of utility from a consumer's choice of product j that is the same across all

consumers:

$$\delta_{jt} \equiv -\alpha p_{jt} + \mathbf{x}_{jt}\boldsymbol{\beta} + \xi_{jt}.  \quad (16)$$

Third and fourth, there is a heteroskedastic disturbance, $\mu_{ijt}$, and a homoskedastic i.i.d. disturbance, $\varepsilon_{ijt}$:

$$\mu_{ijt} \equiv (-p_{jt}, \mathbf{x}_{jt})(\boldsymbol{\Pi}\mathbf{D}_i + \boldsymbol{\Sigma}\boldsymbol{\nu}_i).  \quad (17)$$

Market Shares and Elasticities in the BLP Model

If we use the Type I extreme-value distribution for $\varepsilon_{ijt}$, then the market share of product j for a consumer of type i turns out to be

$$s_{ijt} = \frac{e^{\delta_{jt} + \mu_{ijt}}}{1 + \sum_{k=1}^{50} e^{\delta_{kt} + \mu_{ikt}}}.  \quad (18)$$

Recall that we denote the distributions of $\mathbf{D}$ and $\boldsymbol{\nu}$ by $P_D(D)$ and $P_\nu(\nu)$, and that since we will be estimating the distribution of the consumer characteristics $\mathbf{D}$, you will see the notation $\hat{P}_D(D)$ show up too. The overall market share of product j in month t is found by integrating the market shares picked by each consumer in equation (18) across the individual types, weighting each type by its probability in the population:

$$s_{jt} = \int_\nu \int_D s_{ijt} \, d\hat{P}_D(D) \, dP_\nu(\nu) = \int_\nu \int_D \left[\frac{e^{\delta_{jt} + \mu_{ijt}}}{1 + \sum_{k=1}^{50} e^{\delta_{kt} + \mu_{ikt}}}\right] d\hat{P}_D(D) \, dP_\nu(\nu).  \quad (19)$$

Equation (19) adds up the market shares of different types i of consumers based on how common that type i is in month t.
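Here is a minimal Monte Carlo version of equations (18) and (19), with stand-in draws of $\mu_{ijt}$ rather than the full $\Pi$, $\Sigma$ parametrization (that version appears with the estimation algorithm in the final section): the double integral is replaced by an average of individual logit shares over simulated consumers.

```python
import numpy as np

rng = np.random.default_rng(4)
J, ns = 50, 400
delta = rng.normal(size=J)             # mean utilities delta_jt (made up)
mu = 0.5 * rng.normal(size=(ns, J))    # stand-in draws of mu_ijt, one row per consumer

# Equation (18): logit share of each product for each simulated consumer type
ev = np.exp(delta + mu)
s_ij = ev / (1.0 + ev.sum(axis=1, keepdims=True))

# Equation (19): the double integral replaced by an average over the draws
s_j = s_ij.mean(axis=0)
print(s_j.sum())                       # below 1; the remainder buys good 0
```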

The price elasticity of the market share of product j with respect to the price of product k is

$$\eta_{jkt} \equiv \frac{\partial s_{jt}}{\partial p_{kt}} \cdot \frac{p_{kt}}{s_{jt}} = \begin{cases} -\dfrac{p_{jt}}{s_{jt}} \displaystyle\int_\nu \int_D \alpha_i s_{ijt}(1 - s_{ijt}) \, d\hat{P}_D(D) \, dP_\nu(\nu) & \text{if } j = k \\[2ex] \dfrac{p_{kt}}{s_{jt}} \displaystyle\int_\nu \int_D \alpha_i s_{ijt} s_{ikt} \, d\hat{P}_D(D) \, dP_\nu(\nu) & \text{otherwise.} \end{cases}  \quad (20)$$

This is harder to estimate than the ordinary logit model, whose analog of equation (19) is equation (6N), repeated below:

$$s_{jt} = \frac{e^{\mathbf{x}_{jt}\boldsymbol{\beta} - \alpha p_{jt} + \xi_{jt}}}{1 + \sum_{k=1}^{50} e^{\mathbf{x}_{kt}\boldsymbol{\beta} - \alpha p_{kt} + \xi_{kt}}}.  \quad (6N)$$

The difficulty comes from the integrals in (19). Usually these need to be calculated by simulation, starting with our real-world knowledge of the distribution of consumer types i in a given month t and the characteristics of product j, and combining that with estimates of how different consumer types value different product characteristics. This suggests that we might begin with an initial set of parameter estimates, calculate what market shares they generate for each month, see how those match the observed market shares, and then pick a new set of parameter estimates to try to get a closer match.

It is worth noting at this point that what is special about random-coefficients logit is not that it allows for interactions between product and consumer characteristics, but that it does so in a structural model. One non-structural approach would have been to use ordinary least squares to estimate the following equation, including product dummies to account for the $\xi_{jt}$ product fixed effects:

$$s_{jt} = \mathbf{x}_{jt}\boldsymbol{\beta} - \alpha p_{jt} + \xi_{jt}.  \quad (21)$$

Like simple logit, the method of equation (21) implies that the market share depends on the product characteristics and prices but not on any interaction between those things and consumer characteristics.

We can incorporate consumer characteristics by creating new variables in a vector $\mathbf{d}_t$ that represents the mean value of the 4 consumer characteristics in month t, and then interacting the 4 consumer variables in $\mathbf{d}_t$ with the 6 product variables in $\mathbf{x}_t$ to create a $1 \times 24$ vector $\mathbf{w}_t$. Then we could use least squares with product dummies to estimate

$$s_{jt} = \mathbf{x}_{jt}\boldsymbol{\beta} - \alpha p_{jt} + \mathbf{d}_t\boldsymbol{\theta}_1 + \mathbf{w}_t\boldsymbol{\theta}_w + \xi_{jt}.  \quad (22)$$

Equation (22) lets market shares depend directly on the 4 consumer characteristics and, less directly, on the 24 consumer-product characteristic interactions. Thus, it has some of the flexibility of the BLP model. It is not, however, the result of a consistent theoretical model. Even aside from whether rational consumer behavior could result in a reduced form like (21) or (22), those equations do not take account of the relationships between the market shares of the different products; the sum of all the predicted market shares should not add up to more than one, for example. The random-coefficients logit model has the advantage of consistency, though it is more difficult to perform the estimation.

Before going on to estimate the random coefficients model, however, recall that regardless of how we specify the utility function (logit, nested logit, random-coefficients logit, or something else), we face the basic simultaneity problem of estimating demand and supply functions. Market shares depend on prices and disturbance terms, but prices will also depend on the disturbance terms. If demand is unusually strong, prices will be higher too. This calls for some kind of instrumental variables estimation, and the appropriate kind here is the generalized method of moments.

The Method of Moments

The generalized method of moments (GMM) of Hansen (1982) combines instrumental variables, generalized least squares, and a nonlinear specification, with the added complexity of what to do when there are so many instruments that the equation is overidentified. It helps to separate these things out. We will go one step at a time, before circling back to the BLP method's use of GMM.

What we are trying to do in econometrics is to estimate the parameters $\beta$ of some theoretical model $y = f(x; \beta) + \varepsilon$ given the values of y and x that we observe. The two estimation approaches most used in economics are least squares and maximum likelihood. The principle of least squares is to find an equation that fits the data, in the sense of minimizing the distance between the estimated equation and the actual data. OLS uses the sum of squared errors as its measure of distance, but if we used the sum of absolute deviations we would be estimating in the same spirit. The principle of maximum likelihood is to assume that the disturbance follows a particular distribution and to choose the equation parameters to maximize the likelihood that we observe the data we actually do see.

The method of moments is closer to least squares in spirit. It does not make assumptions about the distribution-function shape of the disturbance (though it might make assumptions on whether variables are correlated). What it does is start with how the disturbance term relates to the exogenous variables.

The name "method of moments" is misleading. The first moment of a distribution is the average and the second moment is the variance. The typical method of moments in economics uses neither; it uses a covariance condition instead. A better name would be "method of analogy," because the method of moments works by trying to match sums of observed values to theoretical equations that the modeller specifies. (Even then, it is hard to differentiate this from the least squares approach. The obvious theoretical moment condition to use is $Ey = X\beta$, or $Ey - X\beta = 0$, the sample analog to which is $y - X\hat\beta = 0$, minimizing which is exactly the basic argument for OLS. So perhaps we can't make sense of the name.)

Jeffrey Wooldridge has an excellent example of how this works in his 2001 article in The Journal of Economic Perspectives. Suppose you were trying to estimate the mean of a population, µ. The method of moments would be to use the sample analog, the sample mean: $\hat\mu_1 = \bar{x}$. But suppose you had additional information: that the population variance is three times the population mean, so $\sigma^2 = 3\mu$. There would be an alternative way to use the method of moments, based on the sample variance $s^2$: making your estimate $\hat\mu_2 = s^2/3$. But these two estimates would be different, because each is computed differently from the sample data: $\hat\mu_1 \neq \hat\mu_2$. Which should you use?

Both estimates of µ are consistent; that is, in large samples they will give reliable answers. The generalized method of moments, however, gives a way to combine both estimates, a way to produce a superior estimator as a weighted average of $\hat\mu_1$ and $\hat\mu_2$. The optimal weight depends on the variance of each estimate; the more variable single estimator should be weighted less heavily.
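Here is Wooldridge's example in code, as I understand it. The data are generated so that $\sigma^2 = 3\mu$ holds in the population; the inverse-variance weights below are illustrative only, since they use the normality formula for the variance of $s^2$ and ignore the covariance between the two sample moments, which efficient GMM would not.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, n = 5.0, 10_000
x = rng.normal(mu, np.sqrt(3 * mu), n)   # population built so that sigma^2 = 3*mu

mu1 = x.mean()             # estimate from the first moment: E[x] = mu
mu2 = x.var(ddof=1) / 3.0  # estimate from the extra information sigma^2 = 3*mu

# Combine with inverse-variance weights (illustrative shortcut, not full GMM)
v1 = 3 * mu1 / n                          # Var(x-bar) = sigma^2 / n
v2 = 2 * (3 * mu1) ** 2 / (9 * (n - 1))   # Var(s^2)/9, assuming normality
w1 = (1 / v1) / (1 / v1 + 1 / v2)
print(mu1, mu2, w1 * mu1 + (1 - w1) * mu2)
```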

This is rather like the way in which generalized least squares weights individual observations to generate an estimator that is more efficient than ordinary least squares, even though OLS is consistent; thus, we have the generalized method of moments. We get the extra efficiency because GMM uses the extra information that $\sigma^2 = 3\mu$. If we had still more information, we would want to include that too. In the application to BLP, the extra information will not be a priori information linking the mean and variance, but the extra instruments over the minimum needed to do instrumental variables estimation: the overidentifying restrictions.

Let us set up some notation to go over this in detail. Suppose we have T observations and we are trying to explain the observations on y, a $T \times 1$ vector, in terms of M observed variables and one unobserved variable. The observed explanatory variables are $x_1, \ldots, x_M$, where each $x_m$ is a $T \times 1$ vector, and we can line them all up as X, a $T \times M$ matrix. The unobserved variable is the disturbance term $\varepsilon$, which is a $T \times 1$ vector. We need to assume something about the functional form and its parameters. Let's assume linearity, so

$$y = X\beta + \varepsilon,  \quad (23)$$

where $\beta$ is an $M \times 1$ vector of parameters. The expression $X\beta$ is $(T \times M)(M \times 1)$, so it comes out to be $T \times 1$, the same as y. Most commonly (and practically by definition) we assume that the observed and unobserved variables are independent, which implies (but is not equivalent to)

$$E x_m' \varepsilon = 0, \quad m = 1, \ldots, M.  \quad (24)$$

The sample analog to this is

$$X'\hat\varepsilon = 0,  \quad (25)$$

where 0 is an $M \times 1$ vector of zeroes. That gives us M equations (one for each $x_m$) for M unknowns (the M parameters $\beta_m$). The value of our estimate of the disturbance, $\hat\varepsilon$, is

$$\hat\varepsilon \equiv y - X\hat\beta,  \quad (26)$$

where the $M \times 1$ vector $\hat\beta$ contains our M estimated parameters. Substituting (26) into (25), we get

$$X'(y - X\hat\beta) = 0,  \quad (27)$$

so

$$X'y = X'X\hat\beta  \quad (28)$$

and

$$\hat\beta = (X'X)^{-1}X'y.  \quad (29)$$

Thus, we get the familiar OLS estimator, but from the sample-theory analog principle rather than from minimizing squared errors. The properties are the same as the OLS estimator's, since the estimator is identical, so it is the best linear unbiased estimator (BLUE) and so forth. It should not be too surprising that we can get the OLS estimator using different approaches. Recall that the OLS estimator is also the maximum likelihood estimator if the disturbances are normally distributed.

We solved for $\hat\beta$ analytically here, but that is not an essential part of the method of moments; fortunately not, because sometimes the moment condition (which was (25) here) has no exact solution. Something we could always do is to turn the problem (using (27)) into

$$\underset{\hat\beta}{\text{Minimize}} \;\; X'(y - X\hat\beta).  \quad (30)$$

We could, that is, try searching for the M values in $\hat\beta$ that minimize $X'\hat\varepsilon$. A computer could do that for us. We would give the computer an arbitrary starting value, $\hat\beta_1$, and calculate the size of $X'\hat\varepsilon_1$ using $\hat\beta_1$. We would also give the computer a rule for searching for a second value, $\hat\beta_2$, and a third value, telling the computer to stop searching when $X'\hat\varepsilon_i$ is small enough. You could construct your own program using a computer language such as Basic or C, or you could use a minimization command from a computer language such as Matlab or Gauss.
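Here is that pair of computations in code, on simulated data with made-up parameters: first the analytic solution (29), then the same answer found by numerical search over $\hat\beta$. The search minimizes the sum of squared elements of $X'\hat\varepsilon$, which is one of the "size" definitions discussed next; `scipy.optimize.minimize` stands in for the search rule just described.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
T, M = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, M - 1))])
beta_true = np.array([1.0, 2.0, -0.5])          # made-up parameters
y = X @ beta_true + rng.normal(size=T)

# Analytic solution of the sample moment condition X'(y - X beta) = 0:
beta_mom = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate found numerically, as in (30), by searching over beta
obj = lambda b: np.sum((X.T @ (y - X @ b)) ** 2)
beta_num = minimize(obj, np.zeros(M)).x
print(beta_mom, beta_num)                       # both approximately beta_true
```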

You would immediately run into a profound problem, however. What does "size" mean? $X'\hat\varepsilon$ is an $M \times 1$ vector, so it contains M numbers to be minimized, not just 1. What if one set of $\hat\beta$ parameters yields $X'\hat\varepsilon = (0, 0, 0, 50)'$ and another yields $X'\hat\varepsilon = (20, 20, 20, 20)'$? Which has minimized $X'\hat\varepsilon$? We need a definition of the size of a vector. One definition you could use is to add up all the elements of the vector, in which case the two vectors would have sizes 50 and 80. Another is that you could add up the squares of the elements, in which case the sizes are 2,500 and 1,600, the opposite ranking. Here, we avoided the problem of defining size because we could find an analytic perfect solution to the minimization problem; any sensible definition of size would say that $(0, 0, 0, 0, 0)'$ is the smallest $5 \times 1$ vector possible. In other method of moments situations, you will have to confront the problem directly.

Before we do that, however, let's look at two other layers of complexity that we can put into the method of moments: making the disturbances not be i.i.d., and requiring instrumental variables. We will deal with these separately, then return to the vector-size problem, and then combine everything at the end.

Generalized Least Squares and the Method of Moments

As I said above, when the method of moments approach gives us the same estimator as the least squares approach, it gives us the same properties, good and bad. In particular, if the disturbances are correlated across observations (serial correlation) or different observations have disturbances with different probability distributions (heteroskedasticity), OLS is still consistent, but it is inefficient and its standard errors are biased estimates of the standard deviations of the coefficient estimates, so hypothesis testing is unreliable. There is nothing in the logic of least squares that tells us how to get to the GLS estimator, which takes care of these problems. But the GLS estimator is, indeed, intuitive, both for serial correlation and for heteroskedasticity: it puts less weight on less informative data. If the disturbances in observations 1, 2, and 3 are highly correlated, then our estimator ought to weight those observations less, because they don't contain as much information as three observations with independent disturbances. If you're trying to estimate the temperature precisely, and

you measure it with 100 thermometers that were manufactured to be identical, so they all have the same measurement error, you don't get as precise an estimate as if you used 100 thermometers with independent errors. Similarly, if the disturbances in observations 1 to 40 are distributed with a variance of 10 and the disturbances in observations 41 to 80 are distributed with a variance of 500, then our estimator ought to weight observations 1 to 40 more heavily. They contain less noise. But these intuitions are not the intuitions of least squares, nor of the method of moments. They are statistical intuitions, justified by either a frequentist or Bayesian approach. We can nonetheless tack them on top of the least squares or method of moments estimator.

In the method of moments, we can alter our moment condition by inserting the inverse of the variance-covariance matrix, $\Phi^{-1}$, a $T \times T$ matrix since it shows the covariance between any two of the T disturbances $\varepsilon_t$. The theoretical moment condition becomes

$$EX'\Phi^{-1}\varepsilon = 0.  \quad (31)$$

The sample analog is

$$X'\hat\Phi^{-1}\hat\varepsilon = 0,  \quad (32)$$

or

$$X'\hat\Phi^{-1}(y - X\hat\beta) = 0,  \quad (33)$$

which solves to

$$\hat\beta = (X'\hat\Phi^{-1}X)^{-1}X'\hat\Phi^{-1}y.  \quad (34)$$

This estimator is identical to the GLS estimator used in the least-squares approach. I didn't say how to calculate the variance-covariance estimate $\hat\Phi$, but all we need is a consistent estimator for it, and we could get one as in GLS by iterating between calculating $\hat\beta$ to get estimates for $\hat\varepsilon$ and using those estimates to calculate $\hat\Phi$.
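A sketch of the moment condition (31)-(34) with the heteroskedastic example from the text: variance 10 for the first 40 observations and 500 for the rest. $\Phi$ is treated as known to keep the example short; in practice one would iterate to estimate it, as just described.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 80
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true = np.array([1.0, 2.0])                  # made-up parameters
sd = np.sqrt(np.where(np.arange(T) < 40, 10.0, 500.0))  # heteroskedastic sds
y = X @ beta_true + sd * rng.normal(size=T)

Phi_inv = np.diag(1.0 / sd**2)   # inverse variance-covariance matrix, known here
# Equation (34): the weighted moment condition solved exactly
beta_gls = np.linalg.solve(X.T @ Phi_inv @ X, X.T @ Phi_inv @ y)
print(beta_gls)                  # noisy later observations get little weight
```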

Instrumental Variables and the Method of Moments

Now let's put aside the issues concerning GLS and the variance-covariance estimates so we can look at a different problem: endogeneity of the explanatory variables. What this means is that $EX'\varepsilon$ does not equal zero even in our theoretical model. Some of the x's are caused by the y, or something else jointly causes them. The solution to this problem is instrumental variables. We need to find some instruments, variables that (a) do not cause y, (b) are correlated with x, and (c) are uncorrelated with $\varepsilon$. For each x that is endogenous, we need to find at least one instrument. Requirement (a) says that our theory must be able to rule out the instruments from being part of the original equation we are estimating; i.e., we can't use one $x_m$ to instrument for another $x_m$. The other two requirements can be interpreted as requiring us to set up a second theoretical equation, one in which $x_m$ is explained by the instruments, among other things, though it is not a problem if our second equation omits some relevant variables, since we won't care if the parameters we estimate in it are biased.

Let's define a new matrix Z consisting of T observations on all of the $x_m$ variables which are exogenous plus at least one instrument for each $x_m$ that is endogenous. Thus, Z will be $T \times N$, where $N \geq M$. For the moment, let's assume that N = M, which means we have one instrument for each endogenous $x_m$ and the system is exactly identified, not overidentified. Our system then consists of

$$y = X\beta + \varepsilon, \qquad X = Z\Gamma + \nu,  \quad (35)$$

where $\Gamma$ is an $N \times M$ matrix of coefficients, so that we have M equations for how the N exogenous variables affect the M explanatory variables in the first equation.

Our theoretical moment equation is different now. We are not assuming that the X are all independent of $\varepsilon$. Instead, we are assuming that the variables in Z are all independent of $\varepsilon$. Thus, the theoretical equation is

$$EZ'\varepsilon = 0.  \quad (36)$$

The sample analog is

$$Z'\hat\varepsilon = 0,  \quad (37)$$

where 0 is an $M \times 1$ vector of zeroes, or

$$Z'(y - X\hat\beta) = 0.  \quad (38)$$

As before, we can solve this. The first step is

$$Z'y = Z'X\hat\beta  \quad (39)$$

and the second step is

$$\hat\beta = (Z'X)^{-1}Z'y.  \quad (40)$$

Thus, we get the standard IV estimator, using the logic of the method of moments.

Overidentification, Defining Vector Size, and the Method of Moments

I purposely wrote out both algebra steps in the last two equations, because a problem comes up in going between them. We assumed that N = M. What if N > M? Well, then the second step won't work. It requires us to invert $Z'X$, which is an $(N \times T)(T \times M)$ matrix: an $N \times M$ matrix. We can invert that if it is a square $M \times M$ matrix, but not otherwise, because $N \neq M$. A solution in the least squares approach is to use two-stage least squares. In the first stage, regress X on Z to calculate the fitted values

$$\hat{X} \equiv Z\hat\Gamma.  \quad (41)$$

Then our estimator can be

$$\hat\beta = (\hat{X}'X)^{-1}\hat{X}'y,  \quad (42)$$

which is fine since $\hat{X}$ is a $T \times M$ matrix like X, even if Z is $T \times N$.
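A sketch of the IV estimator (40) and the 2SLS estimator (41)-(42) on simulated data with an endogenous regressor. The instrument, the coefficient values, and the extra instrument $z^2$ (added only to make the system overidentified) are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
T = 2000
z = rng.normal(size=T)                       # instrument: exogenous by construction
e = rng.normal(size=T)                       # structural disturbance
x = 0.8 * z + 0.6 * e + rng.normal(size=T)   # endogenous regressor, correlated with e
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z])

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)  # biased: slope pushed above 2
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)   # equation (40): consistent

# With more instruments than regressors, use the 2SLS fitted values (41)-(42)
Z2 = np.column_stack([np.ones(T), z, z**2])   # N = 3 > M = 2: overidentified
Xhat = Z2 @ np.linalg.lstsq(Z2, X, rcond=None)[0]   # first stage: X-hat = Z Gamma-hat
beta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)
print(beta_ols, beta_iv, beta_2sls)
```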

GMM takes a different approach to get to the same answer. The reader will recall our earlier discussion of how GMM can still apply even if there is no analytic solution to the equations, because we can still minimize the sample analog moment condition if we define vector size. So that is what we will now do. Let's define the size of a vector w as w'w, the sum of squared elements of the vector. Then the size of the $M \times 1$ vector $X'\varepsilon$ will be the $(1 \times T)(T \times M)(M \times T)(T \times 1)$ matrix (a scalar, actually) $\varepsilon'XX'\varepsilon$, and we will choose $\hat\beta$ so

$$\hat\beta = \underset{\beta}{\operatorname{argmin}} \; \hat\varepsilon'XX'\hat\varepsilon.  \quad (43)$$

Recall that earlier I said that any sensible size definition would yield the OLS estimator in the simple case, or the IV estimator for an exactly identified system where N = M, since the minimand equals zero at the solution then. We can verify that here, and find the OLS estimator analytically a different way: by using calculus to minimize the function

$$f(\hat\beta) = \hat\varepsilon'XX'\hat\varepsilon = (y - X\hat\beta)'XX'(y - X\hat\beta) = y'XX'y - \hat\beta'X'XX'y - y'XX'X\hat\beta + \hat\beta'X'XX'X\hat\beta.  \quad (44)$$

We can differentiate this with respect to $\hat\beta$ to get the first order condition

$$f'(\hat\beta) = -2X'XX'y + 2X'XX'X\hat\beta = 2X'X(-X'y + X'X\hat\beta) = 0,  \quad (45)$$

in which case

$$\hat\beta = (X'X)^{-1}X'y  \quad (46)$$

and we are back to OLS.

If we want to use GLS, we can weight our size by the inverse of the variance-covariance matrix, $\hat\Phi^{-1}$, like this:

$$\hat\beta = \underset{\beta}{\operatorname{argmin}} \; \hat\varepsilon'X\hat\Phi^{-1}X'\hat\varepsilon.  \quad (47)$$

If we want to use instrumental variables, then our GMM estimator will also incorporate a weighting matrix. This is analogous to two-stage least squares (2SLS), where we can regress the variables in X on a greater number of variables in Z. Now the weighting matrix, the size definition, will start to matter. We will make one final change to it: we will use $(Z'Z)^{-1}\sigma^2$. To understand what this means, imagine that the different $z_n$ vectors are uncorrelated with each other and that N = 3. Then

$$(Z'Z)^{-1}\sigma^2 = \begin{pmatrix} \frac{1}{z_1'z_1} & 0 & 0 \\ 0 & \frac{1}{z_2'z_2} & 0 \\ 0 & 0 & \frac{1}{z_3'z_3} \end{pmatrix} \sigma^2.  \quad (48)$$

The $\sigma^2$ makes no difference, since it applies equally to all the equations. But if $z_m$ varies a lot, it is going to get less weight. If we want to use GLS and instrumental variables, our GMM estimator becomes

$$\hat\beta = \underset{\beta}{\operatorname{argmin}} \; \hat\varepsilon'Z\hat\Phi^{-1}Z'\hat\varepsilon.  \quad (49)$$

We cannot solve this analytically if N > M, so we would use a computer to search for the best solution numerically. Thus, if our assumption on the population is that

$$E z_m'\omega(\theta^*) = 0, \quad m = 1, \ldots, M  \quad (8N)$$

(I think there is a typo in Nevo here, on page 531: $z_m$ should replace $Z_m$ in equation (8N)), then the GMM estimator is

$$\hat\theta = \underset{\theta}{\operatorname{argmin}} \; \omega(\theta)'Z\Phi^{-1}Z'\omega(\theta),  \quad (9N)$$

where $\Phi$ is a consistent estimator of $EZ'\varepsilon\varepsilon'Z$. We minimize the square of the sample analog because we want to minimize the magnitude, rather than generate large negative numbers by our choice, and squares are easier to deal with than absolute values (the same choice as in OLS versus minimizing absolute values of errors). We weight by $\Phi^{-1}$ because we want to make heavier use of observations that contain more independent information. This includes the serial correlation and heteroskedasticity corrections.

The method of moments, like ordinary least squares but unlike maximum likelihood, does not require us to know the distribution of the disturbances. In this context, though, we will still have to use the assumption that the $\varepsilon_{ijt}$ follow the extreme-value distribution, because we need it to calculate the integrals of market shares aggregated across consumer types.

Note that GMM does not rely on linearity. We may not be able to find analytic solutions if the theoretical equation is nonlinear, but we can still minimize the discrepancy between the sample moment condition and the theoretical one. (The least squares approach can handle nonlinearity too, actually. You could specify a nonlinear functional form, and minimize the sum of squared errors in estimating it.)
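Here is a minimal two-step GMM sketch of (9N) on simulated, overidentified IV data, with $\omega(\theta)$ being the linear residual $y - X\theta$: the first step weights by $(Z'Z)^{-1}$, and the second re-weights by the inverse of the estimated $Z'\omega\omega'Z$. All numbers are made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
T = 1000
z1, z2 = rng.normal(size=T), rng.normal(size=T)
e = rng.normal(size=T)
x = z1 + 0.5 * z2 + 0.6 * e + rng.normal(size=T)   # endogenous regressor
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(T), x])
Z = np.column_stack([np.ones(T), z1, z2])          # N = 3 > M = 2: overidentified

def gmm_obj(theta, Phi_inv):
    """Equation (9N): omega' Z Phi^{-1} Z' omega, with omega = y - X theta."""
    g = Z.T @ (y - X @ theta)
    return g @ Phi_inv @ g

# First step: weight by (Z'Z)^{-1}, as in two-stage least squares
step1 = minimize(gmm_obj, np.zeros(2), args=(np.linalg.inv(Z.T @ Z),)).x
# Second step: re-weight by the inverse of Z' omega omega' Z from step one
w = y - X @ step1
Phi = (Z * (w**2)[:, None]).T @ Z      # consistent estimate of E Z'ee'Z
step2 = minimize(gmm_obj, step1, args=(np.linalg.inv(Phi),)).x
print(step1, step2)                    # both close to (1, 2)
```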

Back to the Logit Model

We have now discussed the logit model of consumer behavior and the generalized method of moments way to estimate a model. Next we will combine them. The tricky part of the theory is in choosing the function $\omega(\theta^*)$ that goes into the moment condition. Here is the BLP approach (the numbers of the steps are from Nevo [2000], Appendix, p. 1).

(-1) Select arbitrary values for $(\boldsymbol\delta, \boldsymbol\Pi, \boldsymbol\Sigma)$ as a starting point. Recall that $\boldsymbol\delta$ from (16) is a vector of the mean utility from each of the products, and that $\boldsymbol\Pi$ and $\boldsymbol\Sigma$ are the matrices showing how consumer characteristics and product characteristics interact to generate utility.

(0) Draw random values for $(\boldsymbol\nu_i, \mathbf{D}_i)$ for $i = 1, \ldots, n_s$ from the distributions $P_\nu(\nu)$ and $\hat{P}_D(D)$ for a sample of size $n_s$, where the bigger you pick $n_s$, the more accurate your estimate will be.

(1) Using the starting values and the random values, and using the assumption that the $\varepsilon_{ijt}$ follow the extreme-value distribution, approximate the integral for market share that results from aggregating across i by the following "smooth simulator" (sketched in code after these steps):

$$s_{jt} = \left(\frac{1}{n_s}\right)\sum_{i=1}^{n_s} s_{ijt} = \left(\frac{1}{n_s}\right)\sum_{i=1}^{n_s} \left[\frac{e^{\delta_{jt} + \sum_{k=1}^{6} x_{jt}^k(\sigma_k \nu_i^k + \pi_{k1}D_{i1} + \cdots + \pi_{k4}D_{i4})}}{1 + \sum_{m=1}^{50} e^{\delta_{mt} + \sum_{k=1}^{6} x_{mt}^k(\sigma_k \nu_i^k + \pi_{k1}D_{i1} + \cdots + \pi_{k4}D_{i4})}}\right],  \quad (11N)$$

where $(\nu_i^1, \ldots, \nu_i^6)$ and $(D_{i1}, \ldots, D_{i4})$ for $i = 1, \ldots, n_s$ are those random draws from the previous step. Thus, in step (1) we obtain integrals with predicted market shares.

(2) Use the following contraction mapping, which, a bit surprisingly, converges. Keeping $(\boldsymbol\Pi, \boldsymbol\Sigma)$ fixed at their starting points, find values of $\boldsymbol\delta$ by the following iterative process:

$$\boldsymbol\delta_t^{h+1} = \boldsymbol\delta_t^h + (\ln(\mathbf{S}_t) - \ln(\mathbf{s}_t)),  \quad (12N)$$

where $\mathbf{S}_t$ is the observed market share and $\mathbf{s}_t$ is the predicted market share from step (1), computed using $\boldsymbol\delta_t^h$. Start with the arbitrary $\boldsymbol\delta^0$ of step (-1). If the observed and predicted market shares are equal, then $\boldsymbol\delta_t^{h+1} = \boldsymbol\delta_t^h$ and the series has converged. In practice, keep iterating until $(\ln(\mathbf{S}_t) - \ln(\mathbf{s}_t))$ is small enough for you to be satisfied with its accuracy. Thus, in step (2) we come out with values for $\boldsymbol\delta$.

(2.5) Pick some starting values for $(\alpha, \boldsymbol\beta)$.

(3) Figure out the value of the moment expression using the starting values and your $\boldsymbol\delta$ estimate. First, define

$$\omega_{jt} = \delta_{jt} - (\mathbf{x}_{jt}\boldsymbol\beta - \alpha p_{jt}).  \quad (13N)$$

Second, figure out the value of the moment expression,

$$\omega'Z\Phi^{-1}Z'\omega.  \quad (50)$$

You need the matrix $\Phi^{-1}$ to do this. Until step (5), just use $\Phi = Z'Z$ as a starting point. (Nevo's text says little about this problem, but he discusses it on page 5 of his appendix.)

(4) Do a minimization search, trying nearby values of $(\alpha, \boldsymbol\beta, \boldsymbol\delta, \boldsymbol\Pi, \boldsymbol\Sigma)$ until the value of the moment expression is close enough to zero.

(5) Take your converged estimates and use them to compute a new $\omega$. Use that to compute a new value for $\Phi$: $Z'\omega\omega'Z$. Then go back to step (1), using your latest parameter estimates as your starting values. Nevo notes that you could then iterate between estimating parameters (step 4) and estimating the weighting matrix (step 5). Both methods are consistent.
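The following sketch strings together steps (0) through (2) for one month, under invented data and starting values: the smooth simulator (11N), then the contraction mapping (12N) run until the predicted shares match the observed ones. The price interaction is omitted, as in the formula in step (1), and the observed shares are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(10)
J, K, R, ns = 50, 6, 4, 400    # products, characteristics, demographics, draws

# Made-up data and starting values, standing in for steps (-1) and (0)
x = rng.normal(size=(J, K))                 # product characteristics, one month
S_obs = rng.dirichlet(np.ones(J + 1))[:J]   # observed shares; outside good gets the rest
sigma = np.full(K, 0.3)                     # diagonal of Sigma, a starting guess
Pi = rng.normal(scale=0.1, size=(K, R))     # starting guess for Pi
nu = rng.normal(size=(ns, K))               # step (0): draws from P_nu
D = rng.normal(size=(ns, R))                # step (0): draws from P_D-hat

def simulated_shares(delta):
    """Step (1), the smooth simulator (11N): average the individual logit
    shares over the ns simulated consumers."""
    taste = sigma * nu + D @ Pi.T           # ns x K: sigma_k nu_i^k + pi_k. D_i
    mu = taste @ x.T                        # ns x J: mu_ijt
    ev = np.exp(delta + mu)
    s_ij = ev / (1.0 + ev.sum(axis=1, keepdims=True))
    return s_ij.mean(axis=0)

def contraction(S_obs, delta0, tol=1e-12, max_iter=5000):
    """Step (2), the contraction mapping (12N): iterate
    delta <- delta + ln(S_obs) - ln(s(delta)) until shares match."""
    delta = delta0.copy()
    for _ in range(max_iter):
        step = np.log(S_obs) - np.log(simulated_shares(delta))
        delta += step
        if np.abs(step).max() < tol:
            break
    return delta

delta_hat = contraction(S_obs, np.zeros(J))
print(np.abs(simulated_shares(delta_hat) - S_obs).max())   # ~0: shares matched
```

Steps (3) through (5) would then plug the recovered $\boldsymbol\delta$ into (13N) and minimize the quadratic form (50), exactly as in the two-step GMM sketch of the previous section.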

The BLP Method of Demand Curve Estimation in Industrial Organization

The BLP Method of Demand Curve Estimation in Industrial Organization The BLP Method of Demand Curve Estimation in Industrial Organization 27 February 2006 Eric Rasmusen 1 IDEAS USED 1. Instrumental variables. We use instruments to correct for the endogeneity of prices,

More information

Nevo on Random-Coefficient Logit

Nevo on Random-Coefficient Logit Nevo on Random-Coefficient Logit March 28, 2003 Eric Rasmusen Abstract Overheads for logit demand estimation for G604. These accompany the Nevo JEMS article. Indiana University Foundation Professor, Department

More information

The BLP Method of Demand Curve Estimation in Industrial Organization

The BLP Method of Demand Curve Estimation in Industrial Organization The BLP Method of Demand Curve Estimation in Industrial Organization 9 March 2006 Eric Rasmusen 1 IDEAS USED 1. Instrumental variables. We use instruments to correct for the endogeneity of prices, the

More information

Demand in Differentiated-Product Markets (part 2)

Demand in Differentiated-Product Markets (part 2) Demand in Differentiated-Product Markets (part 2) Spring 2009 1 Berry (1994): Estimating discrete-choice models of product differentiation Methodology for estimating differentiated-product discrete-choice

More information

Lecture 10 Demand for Autos (BLP) Bronwyn H. Hall Economics 220C, UC Berkeley Spring 2005

Lecture 10 Demand for Autos (BLP) Bronwyn H. Hall Economics 220C, UC Berkeley Spring 2005 Lecture 10 Demand for Autos (BLP) Bronwyn H. Hall Economics 220C, UC Berkeley Spring 2005 Outline BLP Spring 2005 Economics 220C 2 Why autos? Important industry studies of price indices and new goods (Court,

More information

Empirical Industrial Organization (ECO 310) University of Toronto. Department of Economics Fall Instructor: Victor Aguirregabiria

Empirical Industrial Organization (ECO 310) University of Toronto. Department of Economics Fall Instructor: Victor Aguirregabiria Empirical Industrial Organization (ECO 30) University of Toronto. Department of Economics Fall 208. Instructor: Victor Aguirregabiria FINAL EXAM Tuesday, December 8th, 208. From 7pm to 9pm (2 hours) Exam

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

SOLUTIONS Problem Set 2: Static Entry Games

SOLUTIONS Problem Set 2: Static Entry Games SOLUTIONS Problem Set 2: Static Entry Games Matt Grennan January 29, 2008 These are my attempt at the second problem set for the second year Ph.D. IO course at NYU with Heski Bar-Isaac and Allan Collard-Wexler

More information

Estimation of Static Discrete Choice Models Using Market Level Data

Estimation of Static Discrete Choice Models Using Market Level Data Estimation of Static Discrete Choice Models Using Market Level Data NBER Methods Lectures Aviv Nevo Northwestern University and NBER July 2012 Data Structures Market-level data cross section/time series/panel

More information

A Note on Demand Estimation with Supply Information. in Non-Linear Models

A Note on Demand Estimation with Supply Information. in Non-Linear Models A Note on Demand Estimation with Supply Information in Non-Linear Models Tongil TI Kim Emory University J. Miguel Villas-Boas University of California, Berkeley May, 2018 Keywords: demand estimation, limited

More information

Next, we discuss econometric methods that can be used to estimate panel data models.

Next, we discuss econometric methods that can be used to estimate panel data models. 1 Motivation Next, we discuss econometric methods that can be used to estimate panel data models. Panel data is a repeated observation of the same cross section Panel data is highly desirable when it is

More information

ECON 594: Lecture #6

ECON 594: Lecture #6 ECON 594: Lecture #6 Thomas Lemieux Vancouver School of Economics, UBC May 2018 1 Limited dependent variables: introduction Up to now, we have been implicitly assuming that the dependent variable, y, was

More information

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p

GMM and SMM. 1. Hansen, L Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p GMM and SMM Some useful references: 1. Hansen, L. 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, p. 1029-54. 2. Lee, B.S. and B. Ingram. 1991 Simulation estimation

More information

Economic modelling and forecasting

Economic modelling and forecasting Economic modelling and forecasting 2-6 February 2015 Bank of England he generalised method of moments Ole Rummel Adviser, CCBS at the Bank of England ole.rummel@bankofengland.co.uk Outline Classical estimation

More information

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16)

Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) Lecture: Simultaneous Equation Model (Wooldridge s Book Chapter 16) 1 2 Model Consider a system of two regressions y 1 = β 1 y 2 + u 1 (1) y 2 = β 2 y 1 + u 2 (2) This is a simultaneous equation model

More information

Instrumental Variables

Instrumental Variables Instrumental Variables Department of Economics University of Wisconsin-Madison September 27, 2016 Treatment Effects Throughout the course we will focus on the Treatment Effect Model For now take that to

More information

Linear Models in Econometrics

Linear Models in Econometrics Linear Models in Econometrics Nicky Grant At the most fundamental level econometrics is the development of statistical techniques suited primarily to answering economic questions and testing economic theories.

More information

An overview of applied econometrics

An overview of applied econometrics An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical

More information

Estimating Single-Agent Dynamic Models

Estimating Single-Agent Dynamic Models Estimating Single-Agent Dynamic Models Paul T. Scott New York University Empirical IO Course Fall 2016 1 / 34 Introduction Why dynamic estimation? External validity Famous example: Hendel and Nevo s (2006)

More information

ECO 310: Empirical Industrial Organization Lecture 2 - Estimation of Demand and Supply

ECO 310: Empirical Industrial Organization Lecture 2 - Estimation of Demand and Supply ECO 310: Empirical Industrial Organization Lecture 2 - Estimation of Demand and Supply Dimitri Dimitropoulos Fall 2014 UToronto 1 / 55 References RW Section 3. Wooldridge, J. (2008). Introductory Econometrics:

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

1 Teaching notes on structural VARs.

1 Teaching notes on structural VARs. Bent E. Sørensen February 22, 2007 1 Teaching notes on structural VARs. 1.1 Vector MA models: 1.1.1 Probability theory The simplest (to analyze, estimation is a different matter) time series models are

More information

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008

A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods. Jeff Wooldridge IRP Lectures, UW Madison, August 2008 A Course in Applied Econometrics Lecture 14: Control Functions and Related Methods Jeff Wooldridge IRP Lectures, UW Madison, August 2008 1. Linear-in-Parameters Models: IV versus Control Functions 2. Correlated

More information

EMERGING MARKETS - Lecture 2: Methodology refresher

EMERGING MARKETS - Lecture 2: Methodology refresher EMERGING MARKETS - Lecture 2: Methodology refresher Maria Perrotta April 4, 2013 SITE http://www.hhs.se/site/pages/default.aspx My contact: maria.perrotta@hhs.se Aim of this class There are many different

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

ECONOMICS 210C / ECONOMICS 236A MONETARY HISTORY

ECONOMICS 210C / ECONOMICS 236A MONETARY HISTORY Fall 018 University of California, Berkeley Christina Romer David Romer ECONOMICS 10C / ECONOMICS 36A MONETARY HISTORY A Little about GLS, Heteroskedasticity-Consistent Standard Errors, Clustering, and

More information

1 Introduction to Generalized Least Squares

1 Introduction to Generalized Least Squares ECONOMICS 7344, Spring 2017 Bent E. Sørensen April 12, 2017 1 Introduction to Generalized Least Squares Consider the model Y = Xβ + ɛ, where the N K matrix of regressors X is fixed, independent of the

More information

Industrial Organization II (ECO 2901) Winter Victor Aguirregabiria. Problem Set #1 Due of Friday, March 22, 2013

Industrial Organization II (ECO 2901) Winter Victor Aguirregabiria. Problem Set #1 Due of Friday, March 22, 2013 Industrial Organization II (ECO 2901) Winter 2013. Victor Aguirregabiria Problem Set #1 Due of Friday, March 22, 2013 TOTAL NUMBER OF POINTS: 200 PROBLEM 1 [30 points]. Considertheestimationofamodelofdemandofdifferentiated

More information

Department of Agricultural Economics. PhD Qualifier Examination. May 2009

Department of Agricultural Economics. PhD Qualifier Examination. May 2009 Department of Agricultural Economics PhD Qualifier Examination May 009 Instructions: The exam consists of six questions. You must answer all questions. If you need an assumption to complete a question,

More information

Lecture #11: Introduction to the New Empirical Industrial Organization (NEIO) -

Lecture #11: Introduction to the New Empirical Industrial Organization (NEIO) - Lecture #11: Introduction to the New Empirical Industrial Organization (NEIO) - What is the old empirical IO? The old empirical IO refers to studies that tried to draw inferences about the relationship

More information

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables

Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Applied Econometrics (MSc.) Lecture 3 Instrumental Variables Estimation - Theory Department of Economics University of Gothenburg December 4, 2014 1/28 Why IV estimation? So far, in OLS, we assumed independence.

More information

ECO 2901 EMPIRICAL INDUSTRIAL ORGANIZATION

ECO 2901 EMPIRICAL INDUSTRIAL ORGANIZATION ECO 2901 EMPIRICAL INDUSTRIAL ORGANIZATION Lecture 7 & 8: Models of Competition in Prices & Quantities Victor Aguirregabiria (University of Toronto) Toronto. Winter 2018 Victor Aguirregabiria () Empirical

More information

1. The OLS Estimator. 1.1 Population model and notation

1. The OLS Estimator. 1.1 Population model and notation 1. The OLS Estimator OLS stands for Ordinary Least Squares. There are 6 assumptions ordinarily made, and the method of fitting a line through data is by least-squares. OLS is a common estimation methodology

More information

WISE International Masters

WISE International Masters WISE International Masters ECONOMETRICS Instructor: Brett Graham INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This examination paper contains 32 questions. You are

More information

Advanced Microeconomics

Advanced Microeconomics Advanced Microeconomics Leonardo Felli EC441: Room D.106, Z.332, D.109 Lecture 8 bis: 24 November 2004 Monopoly Consider now the pricing behavior of a profit maximizing monopolist: a firm that is the only

More information

Berry et al. (2004), Differentiated products demand systems from a combination of micro and macro data: The new car market

Berry et al. (2004), Differentiated products demand systems from a combination of micro and macro data: The new car market Empirical IO Berry et al. (2004), Differentiated products demand systems from a combination of micro and macro data: The new car market Sergey Alexeev University of Technology Sydney April 13, 2017 SA

More information

Uncertainty. Michael Peters December 27, 2013

Uncertainty. Michael Peters December 27, 2013 Uncertainty Michael Peters December 27, 20 Lotteries In many problems in economics, people are forced to make decisions without knowing exactly what the consequences will be. For example, when you buy

More information

Bresnahan, JIE 87: Competition and Collusion in the American Automobile Industry: 1955 Price War

Bresnahan, JIE 87: Competition and Collusion in the American Automobile Industry: 1955 Price War Bresnahan, JIE 87: Competition and Collusion in the American Automobile Industry: 1955 Price War Spring 009 Main question: In 1955 quantities of autos sold were higher while prices were lower, relative

More information

1 Bewley Economies with Aggregate Uncertainty

1 Bewley Economies with Aggregate Uncertainty 1 Bewley Economies with Aggregate Uncertainty Sofarwehaveassumedawayaggregatefluctuations (i.e., business cycles) in our description of the incomplete-markets economies with uninsurable idiosyncratic risk

More information

Industrial Organization Lecture 7: Product Differentiation

Industrial Organization Lecture 7: Product Differentiation Industrial Organization Lecture 7: Product Differentiation Nicolas Schutz Nicolas Schutz Product Differentiation 1 / 57 Introduction We now finally drop the assumption that firms offer homogeneous products.

More information

Are Marijuana and Cocaine Complements or Substitutes? Evidence from the Silk Road

Are Marijuana and Cocaine Complements or Substitutes? Evidence from the Silk Road Are Marijuana and Cocaine Complements or Substitutes? Evidence from the Silk Road Scott Orr University of Toronto June 3, 2016 Introduction Policy makers increasingly willing to experiment with marijuana

More information

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H.

ACE 564 Spring Lecture 8. Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information. by Professor Scott H. ACE 564 Spring 2006 Lecture 8 Violations of Basic Assumptions I: Multicollinearity and Non-Sample Information by Professor Scott H. Irwin Readings: Griffiths, Hill and Judge. "Collinear Economic Variables,

More information

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II

ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II ECON 4160: Econometrics-Modelling and Systems Estimation Lecture 9: Multiple equation models II Ragnar Nymoen Department of Economics University of Oslo 9 October 2018 The reference to this lecture is:

More information

ECON4515 Finance theory 1 Diderik Lund, 5 May Perold: The CAPM

ECON4515 Finance theory 1 Diderik Lund, 5 May Perold: The CAPM Perold: The CAPM Perold starts with a historical background, the development of portfolio theory and the CAPM. Points out that until 1950 there was no theory to describe the equilibrium determination of

More information

Lecture Notes: Estimation of dynamic discrete choice models

Lecture Notes: Estimation of dynamic discrete choice models Lecture Notes: Estimation of dynamic discrete choice models Jean-François Houde Cornell University November 7, 2016 These lectures notes incorporate material from Victor Agguirregabiria s graduate IO slides

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information

A Summary of Economic Methodology

A Summary of Economic Methodology A Summary of Economic Methodology I. The Methodology of Theoretical Economics All economic analysis begins with theory, based in part on intuitive insights that naturally spring from certain stylized facts,

More information

Industrial Organization, Fall 2011: Midterm Exam Solutions and Comments Date: Wednesday October

Industrial Organization, Fall 2011: Midterm Exam Solutions and Comments Date: Wednesday October Industrial Organization, Fall 2011: Midterm Exam Solutions and Comments Date: Wednesday October 23 2011 1 Scores The exam was long. I know this. Final grades will definitely be curved. Here is a rough

More information

Rockefeller College University at Albany

Rockefeller College University at Albany Rockefeller College University at Albany PAD 705 Handout: Suggested Review Problems from Pindyck & Rubinfeld Original prepared by Professor Suzanne Cooper John F. Kennedy School of Government, Harvard

More information

Estimating Single-Agent Dynamic Models

Estimating Single-Agent Dynamic Models Estimating Single-Agent Dynamic Models Paul T. Scott Empirical IO Fall, 2013 1 / 49 Why are dynamics important? The motivation for using dynamics is usually external validity: we want to simulate counterfactuals

More information

Descriptive Statistics (And a little bit on rounding and significant digits)

Descriptive Statistics (And a little bit on rounding and significant digits) Descriptive Statistics (And a little bit on rounding and significant digits) Now that we know what our data look like, we d like to be able to describe it numerically. In other words, how can we represent

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, 2015-16 Academic Year Exam Version: A INSTRUCTIONS TO STUDENTS 1 The time allowed for this examination paper is 2 hours. 2 This

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague

Econometrics. Week 8. Fall Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Econometrics Week 8 Institute of Economic Studies Faculty of Social Sciences Charles University in Prague Fall 2012 1 / 25 Recommended Reading For the today Instrumental Variables Estimation and Two Stage

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Instrumental Variables and the Problem of Endogeneity

Instrumental Variables and the Problem of Endogeneity Instrumental Variables and the Problem of Endogeneity September 15, 2015 1 / 38 Exogeneity: Important Assumption of OLS In a standard OLS framework, y = xβ + ɛ (1) and for unbiasedness we need E[x ɛ] =

More information

New Notes on the Solow Growth Model

New Notes on the Solow Growth Model New Notes on the Solow Growth Model Roberto Chang September 2009 1 The Model The firstingredientofadynamicmodelisthedescriptionofthetimehorizon. In the original Solow model, time is continuous and the

More information

Understanding Exponents Eric Rasmusen September 18, 2018

Understanding Exponents Eric Rasmusen September 18, 2018 Understanding Exponents Eric Rasmusen September 18, 2018 These notes are rather long, but mathematics often has the perverse feature that if someone writes a long explanation, the reader can read it much

More information

Estimating Deep Parameters: GMM and SMM

Estimating Deep Parameters: GMM and SMM Estimating Deep Parameters: GMM and SMM 1 Parameterizing a Model Calibration Choose parameters from micro or other related macro studies (e.g. coeffi cient of relative risk aversion is 2). SMM with weighting

More information

Department of Agricultural & Resource Economics, UCB

Department of Agricultural & Resource Economics, UCB Department of Agricultural & Resource Economics, UCB CUDARE Working Papers (University of California, Berkeley) Year 2004 Paper 993 Identification of Supply Models of Retailer and Manufacturer Oligopoly

More information

The Lucas Imperfect Information Model

The Lucas Imperfect Information Model The Lucas Imperfect Information Model Based on the work of Lucas (972) and Phelps (970), the imperfect information model represents an important milestone in modern economics. The essential idea of the

More information

Lecture 3 Unobserved choice sets

Lecture 3 Unobserved choice sets Lecture 3 Unobserved choice sets Rachel Griffith CES Lectures, April 2017 Introduction Unobserved choice sets Goeree (2008) GPS (2016) Final comments 1 / 32 1. The general problem of unobserved choice

More information

Specification Test on Mixed Logit Models

Specification Test on Mixed Logit Models Specification est on Mixed Logit Models Jinyong Hahn UCLA Jerry Hausman MI December 1, 217 Josh Lustig CRA Abstract his paper proposes a specification test of the mixed logit models, by generalizing Hausman

More information

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals

Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression

More information

Wooldridge, Introductory Econometrics, 3d ed. Chapter 16: Simultaneous equations models. An obvious reason for the endogeneity of explanatory

Wooldridge, Introductory Econometrics, 3d ed. Chapter 16: Simultaneous equations models. An obvious reason for the endogeneity of explanatory Wooldridge, Introductory Econometrics, 3d ed. Chapter 16: Simultaneous equations models An obvious reason for the endogeneity of explanatory variables in a regression model is simultaneity: that is, one

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Eco517 Fall 2014 C. Sims FINAL EXAM

Eco517 Fall 2014 C. Sims FINAL EXAM Eco517 Fall 2014 C. Sims FINAL EXAM This is a three hour exam. You may refer to books, notes, or computer equipment during the exam. You may not communicate, either electronically or in any other way,

More information

In economics, the amount of a good x demanded is a function of the price of that good. In other words,

In economics, the amount of a good x demanded is a function of the price of that good. In other words, I. UNIVARIATE CALCULUS Given two sets X and Y, a function is a rule that associates each member of X with eactly one member of Y. That is, some goes in, and some y comes out. These notations are used to

More information

Revisiting the Nested Fixed-Point Algorithm in BLP Random Coeffi cients Demand Estimation

Revisiting the Nested Fixed-Point Algorithm in BLP Random Coeffi cients Demand Estimation Revisiting the Nested Fixed-Point Algorithm in BLP Random Coeffi cients Demand Estimation Jinhyuk Lee Kyoungwon Seo September 9, 016 Abstract This paper examines the numerical properties of the nested

More information

EconS Sequential Competition

EconS Sequential Competition EconS 425 - Sequential Competition Eric Dunaway Washington State University eric.dunaway@wsu.edu Industrial Organization Eric Dunaway (WSU) EconS 425 Industrial Organization 1 / 47 A Warmup 1 x i x j (x

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

EC4051 Project and Introductory Econometrics

EC4051 Project and Introductory Econometrics EC4051 Project and Introductory Econometrics Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Intro to Econometrics 1 / 23 Project Guidelines Each student is required to undertake

More information

Getting Started with Communications Engineering

Getting Started with Communications Engineering 1 Linear algebra is the algebra of linear equations: the term linear being used in the same sense as in linear functions, such as: which is the equation of a straight line. y ax c (0.1) Of course, if we

More information

Fixed Point Theorems

Fixed Point Theorems Fixed Point Theorems Definition: Let X be a set and let f : X X be a function that maps X into itself. (Such a function is often called an operator, a transformation, or a transform on X, and the notation

More information

Some forgotten equilibria of the Bertrand duopoly!?

Some forgotten equilibria of the Bertrand duopoly!? Some forgotten equilibria of the Bertrand duopoly!? Mathias Erlei Clausthal University of Technology Julius-Albert-Str. 2, 38678 Clausthal-Zellerfeld, Germany email: m.erlei@tu-clausthal.de Abstract This

More information

Econometrics Problem Set 11

Econometrics Problem Set 11 Econometrics Problem Set WISE, Xiamen University Spring 207 Conceptual Questions. (SW 2.) This question refers to the panel data regressions summarized in the following table: Dependent variable: ln(q

More information

STOCKHOLM UNIVERSITY Department of Economics Course name: Empirical Methods Course code: EC40 Examiner: Lena Nekby Number of credits: 7,5 credits Date of exam: Friday, June 5, 009 Examination time: 3 hours

More information

ECNS 561 Multiple Regression Analysis

ECNS 561 Multiple Regression Analysis ECNS 561 Multiple Regression Analysis Model with Two Independent Variables Consider the following model Crime i = β 0 + β 1 Educ i + β 2 [what else would we like to control for?] + ε i Here, we are taking

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1. Multinomial Dependent Variable. Random Utility Model

Goals. PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1. Multinomial Dependent Variable. Random Utility Model Goals PSCI6000 Maximum Likelihood Estimation Multiple Response Model 1 Tetsuya Matsubayashi University of North Texas November 2, 2010 Random utility model Multinomial logit model Conditional logit model

More information

Discrete Dependent Variable Models

Discrete Dependent Variable Models Discrete Dependent Variable Models James J. Heckman University of Chicago This draft, April 10, 2006 Here s the general approach of this lecture: Economic model Decision rule (e.g. utility maximization)

More information

Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples

Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples February 24, 2011 Summary: We introduce the Nash Equilibrium: an outcome (action profile) which is stable in the sense that no player

More information

Applied Health Economics (for B.Sc.)

Applied Health Economics (for B.Sc.) Applied Health Economics (for B.Sc.) Helmut Farbmacher Department of Economics University of Mannheim Autumn Semester 2017 Outlook 1 Linear models (OLS, Omitted variables, 2SLS) 2 Limited and qualitative

More information

Applied Quantitative Methods II

Applied Quantitative Methods II Applied Quantitative Methods II Lecture 4: OLS and Statistics revision Klára Kaĺıšková Klára Kaĺıšková AQM II - Lecture 4 VŠE, SS 2016/17 1 / 68 Outline 1 Econometric analysis Properties of an estimator

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

Panel Data Exercises Manuel Arellano. Using panel data, a researcher considers the estimation of the following system:

Panel Data Exercises Manuel Arellano. Using panel data, a researcher considers the estimation of the following system: Panel Data Exercises Manuel Arellano Exercise 1 Using panel data, a researcher considers the estimation of the following system: y 1t = α 1 + βx 1t + v 1t. (t =1,..., T ) y Nt = α N + βx Nt + v Nt where

More information

The Generalized Roy Model and Treatment Effects

The Generalized Roy Model and Treatment Effects The Generalized Roy Model and Treatment Effects Christopher Taber University of Wisconsin November 10, 2016 Introduction From Imbens and Angrist we showed that if one runs IV, we get estimates of the Local

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:

statistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors: Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

The regression model with one stochastic regressor (part II)

The regression model with one stochastic regressor (part II) The regression model with one stochastic regressor (part II) 3150/4150 Lecture 7 Ragnar Nymoen 6 Feb 2012 We will finish Lecture topic 4: The regression model with stochastic regressor We will first look

More information

Simple Regression Model. January 24, 2011

Simple Regression Model. January 24, 2011 Simple Regression Model January 24, 2011 Outline Descriptive Analysis Causal Estimation Forecasting Regression Model We are actually going to derive the linear regression model in 3 very different ways

More information

Bertrand Model of Price Competition. Advanced Microeconomic Theory 1

Bertrand Model of Price Competition. Advanced Microeconomic Theory 1 Bertrand Model of Price Competition Advanced Microeconomic Theory 1 ҧ Bertrand Model of Price Competition Consider: An industry with two firms, 1 and 2, selling a homogeneous product Firms face market

More information