Practical Regression: Noise, Heteroskedasticity, and Grouped Data

DAVID DRANOVE
7-112-006

This is one in a series of notes entitled "Practical Regression." These notes supplement the theoretical content of most statistics texts with practical advice on solving real-world empirical problems through regression analysis.

Introduction to Noisy Variables

A variable is noisy if it does not exactly equal the variable of interest (the one that best fits what the theory demands) or if it is mismeasured. Here are some examples:

- You want to measure the impact of product-level advertising on product sales. You have data on firms' total advertising budgets. To estimate product-level budgets, you divide the total budget by the number of products. Your measure of product-level advertising is noisy.

- You want to determine if inventory turnaround is faster in firms that use just-in-time (JIT) inventory techniques. You survey logistics managers to get information about inventory turnaround. The busy managers provide rough, and therefore noisy, estimates.

- Continuing this investigation of inventory turnaround, you next study whether turnaround times differ by nation. Using the survey responses, you compute the average turnaround in each nation. The number of survey respondents ranges from seventy-five in the United States to two in Chile. Based on the law of large numbers, you know that the seventy-five U.S. firms in the sample are fairly representative of the United States as a whole. But you feel that the two Chilean responses may not accurately reflect all Chilean firms. Your measure of nation-level turnaround times is noisy, especially for nations with few sample respondents.

The first part of this note describes the implications of noisy variables and suggests possible ways to deal with them. The second part discusses problems that arise when the error term does not satisfy the ordinary least squares (OLS) assumptions of homoskedasticity and independence.

© 2012 by the Kellogg School of Management, Northwestern University. This technical note was prepared by Professor David Dranove. Technical notes are developed solely as the basis for class discussion. Technical notes are not intended to serve as endorsements, sources of primary data, or illustrations of effective or ineffective management. To order copies or request permission to reproduce materials, call 847-491-5400 or e-mail cases@kellogg.northwestern.edu. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the permission of the Kellogg School of Management.

Implications of Noisy Variables

It is not always easy to determine which variables are noisy. After all, the best way to know if a variable is noisy is to compare it with an accurate measure. But if you had an accurate measure, you would not need the noisy one. It is sometimes possible to apply statistical common sense to determine whether variables are noisy. For now, we will suppose that we know when a variable is noisy and discuss what that means for the analysis.

Noisy Dependent Variables

Here are two key facts:

- Coefficients obtained from OLS regressions with noisy dependent variables are unbiased. This implies that your predictions are also unbiased.

- Coefficients obtained from OLS regressions with noisy dependent variables are estimated less precisely (i.e., the standard errors increase). Thus, your predictions are less accurate.

These statements are readily confirmed. Suppose that the true model relating X to Y is:

(1) Y = B_0 + B_1 X + e_y

where e_y is normally distributed. Suppose further that you do not have an accurate measure of Y. Instead, you have:

(2) Z = Y + e_z

where e_z is a normally distributed noise term that is independent of e_y.[1] Substituting from (2) into (1) yields:

(3) Z = B_0 + B_1 X + (e_y + e_z)

This is a regression equation. In fact, the only difference between equations (1) and (3) is that the error term is larger in equation (3) ((e_y + e_z) versus e_y).[2] This implies that the standard errors on B_0 and B_1 are larger when you use Z as the dependent variable. This causes the standard errors of any predictions to increase as well.

[1] In general, you do not know the precise nature of the noise. Assuming that it is normally distributed is usually a good approximation and makes the math much easier.

[2] Recall that the sum of two normally distributed variables is also normal. Thus, e_y + e_z is normal, so that equation (3) is a standard OLS regression model.
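To see both facts at work, here is a minimal simulation sketch in Stata (not from the note; the variable names and parameter values are invented for illustration):

* Simulate a true model Y = 2 + 0.5*X + e_y, then add extra noise to Y.
clear
set seed 12345
set obs 500
generate x = rnormal()
generate y = 2 + 0.5*x + rnormal()   // accurately measured dependent variable
generate z = y + rnormal(0, 2)       // noisy dependent variable: Z = Y + e_z

regress y x    // baseline fit
regress z x    // slope is unbiased in expectation, but its standard error is larger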

Noisy Predictor Variables

Things are a bit different when the predictor variables are noisy. Let's see what happens when X is noisy. Suppose that the true model is:

(4) Y = B_0 + B_1 X + e_y

Suppose that you cannot measure X with precision. Instead, you measure:

(5) Q = X + e_x

where e_x is normally distributed and independent of e_y. We won't derive it here, but the estimated B_1 will tend toward the following value:

(6) Estimated B_1 = (True B_1) / (1 + Var(e_x)/Var(X))

where Var(e_x) is the variance of the measurement error and Var(X) is the variance of the true predictor. Noting that the denominator is larger than 1, we conclude that the estimate of B_1 is biased toward zero.[3] This is known as attenuation bias. The degree of attenuation bias depends on the size of Var(e_x) relative to Var(X). If Var(e_x) is large relative to Var(X) (i.e., X is measured with a lot of noise relative to its true variation), then the bias can be quite large.

Most of the time, you should not be overly concerned about attenuation bias. It is inevitable that you will measure some predictor variables with error. If the measurement errors are relatively small, the bias is small as well. Moreover, if you are mainly interested in hypothesis testing as opposed to examining magnitudes, then the bias is of the right type. That is, if the estimated B_1 is statistically significant despite the measurement error, then the true B_1 would be larger and likely more significant if you could eliminate that error.

[3] If the true value of B_1 is positive, the computer will report an estimate of B_1 that is a smaller positive number. Similarly, if the true value is negative, the computer will report a smaller (in magnitude) negative number.
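A similar simulation sketch (again with invented names and values) shows the attenuation. With the noise variance equal to the variance of the true predictor, the estimated slope should be roughly half the true slope:

* True slope is 2; X is measured with error whose variance equals the variance of X,
* so the estimated slope should shrink toward 2/(1 + 1) = 1.
clear
set seed 2012
set obs 1000
generate x = rnormal(0, 1)           // true predictor
generate y = 1 + 2*x + rnormal()     // true model
generate q = x + rnormal(0, 1)       // noisy predictor: Q = X + e_x

regress y x    // slope near 2
regress y q    // slope biased toward zero, roughly 1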

Heteroskedasticity

A key assumption of OLS regression is that the errors for all observations are distributed identically. In other words, you expect the model to give equally precise predictions for all observations. Recalling that the OLS regression residuals are unbiased estimates of the errors in the underlying regression equation, we expect that any variation in the residuals from one observation to another must be completely random. This requirement is violated if the magnitude of the residuals is correlated with some factor Z.[4] It does not matter whether Z is in your model. For example, your residual may be large in magnitude whenever Z is large, and small in magnitude whenever Z is small. If the magnitude of the residuals is correlated with any factor Z, then your model suffers from heteroskedasticity. When you have heteroskedasticity, the OLS standard errors are incorrect.

[4] Remember, the error is the ε in the underlying model. The residual is the difference between the actual and predicted values. The two are not the same due to the randomness of the process that generates your data. Even so, the residual is your best estimate of the actual error.

Testing for Heteroskedasticity

Heteroskedasticity can arise in any number of ways, and there is no universal test for it. We can illustrate the problem by examining the relationship between the sales and price of yogurt. Our data contain eighty-eight weeks of sales and pricing information on yogurt. The key dependent variable, labeled sales1 in the data set, gives the number of yogurt containers sold in a week. The key predictor, price1, is the price of yogurt in dollars per ounce. The variable promo1 indicates whether the yogurt is promoted in a special display case. Here is the result when we run regress sales1 price1 promo1:

[Stata regression output omitted]

One way to test for heteroskedasticity is to perform an interocular (eyeball) test. Plot the residuals against the predicted values, or plot the absolute values of the residuals against the predicted values:
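The plots themselves are not reproduced here, but the commands behind the regression and the plots are roughly the following sketch (sales1, price1, and promo1 come from the note's data set; the fitted-value and residual names, yhat, resid, and absresid, are placeholders):

regress sales1 price1 promo1

predict yhat, xb             // fitted values
predict resid, residuals     // residuals
scatter resid yhat           // eyeball test: residuals against fitted values
generate absresid = abs(resid)
scatter absresid yhat        // same test using absolute residuals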

The residuals seem to show greater variance when the predicted values are larger (look at the wide range of residuals when the fitted values are around 9,000). This is evidence of heteroskedasticity.

Eyeballing the data makes us suspicious of heteroskedasticity, but we can also perform statistical tests. Specifically, we can test specific hypotheses about the residuals. Remember, heteroskedasticity can arise in countless ways, so there are countless tests we can perform. In practice, econometricians limit their toolkits to just a few tests.

The most common test for heteroskedasticity is the Breusch-Pagan (BP) test. To perform the BP test, you regress the squared residuals on the predictor variables in the original regression and then perform a joint (F) test of all the predictors in the second regression:

Step 1: Perform the regression: Y = B_0 + B_x X + e
Step 2: Regress the squared residuals on the predictors: e^2 = γ_0 + γ_x X
Step 3: Perform a joint (F) test on the γ_x coefficients.

Fortunately, Stata performs this test in one command following the regression.
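A sketch of that one command (the note calls it hettest; in current Stata the same test is run as estat hettest after the regression):

regress sales1 price1 promo1
estat hettest     // Breusch-Pagan test; null hypothesis: constant error variance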

The test statistic reveals that we can reject the null hypothesis of constant variance of the residuals, which is tantamount to rejecting homoskedasticity. Thus, the regression suffers from heteroskedasticity.

Fixing the Problem

So you have heteroskedasticity and you don't know what to do about it. You certainly can't ignore it. Stata (and all regression software) computes standard errors and performs t-tests under the assumption of homoskedastic residuals. If you have heteroskedasticity, your standard errors and significance tests are incorrect.

Most often, the cause of heteroskedasticity is a misspecified model. Misspecification can occur if we have omitted predictors or if we should have transformed the dependent and/or independent variables, for example by taking logs. We have discussed omitted variables in a previous technical note.[5]

[5] David Dranove, "Practical Regression: Introduction to Endogeneity: Omitted Variable Bias," Case #7-112-004 (Kellogg School of Management, 2012).

Sometimes regressions remain heteroskedastic despite the best specification, and your standard errors are still incorrect. (If your model is well specified, however, your coefficients are unbiased.) The standard solution to heteroskedasticity in well-specified regressions is the White correction (attributed to Halbert White). The White correction adjusts the variance-covariance matrix that is used to compute the standard errors. The technical details are not important, but there are a few things you should note:

- The White method corrects for heteroskedasticity without altering the regression coefficients.
- If the data are homoskedastic, the White-corrected standard errors are equivalent to the standard errors from OLS. (In practice, there are always at least small differences.)
- There are numerous modifications to the White correction, so different software packages may yield slightly different results.
- The White correction can be applied to models estimated via maximum likelihood techniques.

Getting White-corrected standard errors (sometimes known as "whitewashing") is very simple in Stata. Just repeat your regression and add ,robust:
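A sketch of the command, applied to the yogurt regression from above (newer Stata syntax accepts vce(robust) as an equivalent):

regress sales1 price1 promo1, robust    // same coefficients, White-corrected standard errors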

Note the following:

- The regression coefficients are unchanged; thus, the regression R^2 is unchanged.
- The corrected standard errors (now called "robust standard errors") differ from those in the original regression. The corrected standard errors in this case are larger than before. This is typical, but it does not always occur.

One final point: the simple regression I have been working with is badly misspecified. Before I did any test for heteroskedasticity, I should have worked harder to correct the specification by considering additional predictors, using fixed effects (for the grocery store), and using a logged dependent variable.

Using Weighted Least Squares to Correct Heteroskedasticity

The Breusch-Pagan test regresses the squared residuals on all of the predictors in the model. Sometimes the squared residuals are a function of a single predictor in the model; the BP test might capture this. Because the BP regression includes all the regressors in your model, however, it is a weaker test than if you had just regressed the squared residuals on that single predictor.[6] If you hypothesize that only a single predictor matters, your test should be limited to that predictor.

[6] By adding insignificant predictors to the BP test regression, you decrease your chances of getting a significant result.

A potentially more severe problem with the BP test is that the Z factor that causes heteroskedasticity might not be in your regression. If this is the case, the BP test will fail to detect the problem. This situation arises in a wide class of regressions and can be fixed by using weighted least squares (WLS).

One important class of applications for which the weighting factor is easy to identify and essential to use occurs whenever the left hand side (LHS) variable is drawn from individual survey data that is aggregated up to the market level. That is a rich sentence with lots of content, so let's break it down:

1. You need to have survey data.
2. The survey data are used to construct the LHS variable.
3. The LHS variable is computed by aggregating individual responses to create a market-level mean.

If all three conditions hold (and they often do), then WLS is indicated. The following example should make things clearer.

Suppose that you are studying determinants of television viewing in different cities. You survey lots of viewers in lots of cities to find out about their viewing habits. Your unit of analysis is the city, so you compute citywide average viewing levels. In some cities, you may have just one or two responses. In others, you have fifty or one hundred responses. Simple statistics tells you that in the cities with more responses, the citywide averages you compute are pretty close to the actual averages for those cities (assuming you have a representative sample). In the cities with only one or two responses, however, the averages you compute may be very different from the true citywide averages. Because the sample sizes are rather small in many cities, your LHS variable (estimated citywide viewership) is noisy.

But there is something predictable about the magnitude of the noise. A bit of statistics will show that if n_i is the number of respondents in city i, and e_i is the regression residual for city i, then the magnitude of e_i is proportional to 1/√n_i. This implies that you have heteroskedasticity, as e_i is systematically related to some factor (in this case, n_i).

The BP test will not pick this up because sample size (n_i) is not a regressor and is therefore excluded from the test. A variant of the test can be used instead: simply regress the squared residuals on n_i.

To illustrate weighting, let's look at data on managed care organization (MCO) penetration. The dependent variable is the percentage of physician revenues derived from managed care insurers in each of 294 metropolitan areas. Predictors include income, education (the percentage with college degrees), and hospital concentration in each market. We will regress MCO penetration on income, the percentage of the population with a college education, and a measure of hospital concentration in the market. Note that the dependent variable is derived from survey data and that the number of survey respondents differs across markets. I perform the standard tests for heteroskedasticity:
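A sketch of those standard checks. The variable names mco_pct, income, college, and hosp_hhi are hypothetical stand-ins for the note's MCO penetration, income, education, and hospital concentration measures; mco_hat and mco_resid are placeholder names for the fitted values and residuals:

regress mco_pct income college hosp_hhi

estat hettest                  // Breusch-Pagan test on the model's predictors
predict mco_hat, xb
predict mco_resid, residuals
scatter mco_resid mco_hat      // eyeball test: residuals against fitted values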

Neither the plot nor the BP test indicates any problem with heteroskedasticity. But because I have survey data, I have my doubts. First, I plot the residuals against the number of survey respondents in each market:

[scatterplot omitted]

Note the classic funnel shape: residuals get smaller as the number of respondents gets larger. Now, perform the modified BP test.
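A sketch of both steps, continuing with the hypothetical variable names introduced above (number_surveyed, the count of respondents in each market, appears in the note's weighting command below):

scatter mco_resid number_surveyed     // funnel shape: residuals shrink as n grows
generate mco_resid2 = mco_resid^2
regress mco_resid2 number_surveyed    // modified BP test: are squared residuals related to n?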

I have heteroskedasticity. To eliminate this problem, I need to get the computer to pay less attention to the cities with fewer respondents. Specifically, I will weight each observation by √n and run a simple OLS regression. Weighting by √n means that you multiply each and every value in your data set by √n before running the regression.

Here is why this works. If you multiply everything by √n, then the error term for each observation is also multiplied by √n. This in turn implies that the squared errors are multiplied by n. Recall that OLS works hardest to fit the observations that contribute the most to the sum of squared errors. By multiplying the squared errors by n, you force the computer to do a good job of fitting the observations with the largest n's, which is exactly what you want.

But be careful. If the residual magnitudes are not exactly inversely proportional to the square root of the sample size, then these weights are not exactly the correct solution. Still, it is probably better to use a simple solution like WLS when it is theoretically sound than to perform ex post data picking in search of the best-fitting solution (i.e., picking out a significant result after the fact, then coming up with a theory to explain that result). Be warned, however, that widespread use of WLS can cause more problems than it solves.

Needless to say, there is a simple way to perform WLS in Stata. Note the addition of [w=number_surveyed] at the end of the regression command:
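A sketch of the weighted regression, again with the hypothetical MCO variable names (for regress, Stata treats [w=...] as analytic weights by default):

regress mco_pct income college hosp_hhi [w=number_surveyed]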

Some observations:

- The results can be interpreted like any OLS results.
- The R^2 is lower than before. Ignore this: the R^2 from the weighted regression is computed on the weighted data and is not comparable to the R^2 from the original, heteroskedastic regression.
- The WLS model can still have general heteroskedasticity, which can be detected using the BP test and corrected using the White correction.

REVIEW: A GUIDE TO WEIGHTING

- You may want to put more weight on some observations than others. This is certainly the case if the errors are systematically smaller for some observations; these observations deserve more weight (e.g., when you aggregate survey data).
- You can check whether you need to do WLS by correlating the absolute value of the residuals with the potential weighting factor n or, better still, by running hettest n.
- WLS multiplies the LHS and right hand side (RHS) variables by √n, where n is the weighting factor. This weights the squared errors by n, which is what you want.
- It is easy to perform WLS in Stata; just add [w=n] at the end of your regression statement.
- Avoid using WLS unless it is absolutely justified by the nature of the data.

Summarizing Heteroskedasticity

- You have heteroskedasticity if the magnitude of the residuals is correlated with some factor, whether or not that factor is in your model.
- Heteroskedasticity biases the standard errors, but the coefficients are unbiased.
- You can test for heteroskedasticity using the hettest command in Stata.
- The ,robust option corrects the standard errors in heteroskedastic OLS regressions.
- A common source of heteroskedasticity is the use of aggregated survey data. This can be corrected by using WLS.
- WLS and ,robust are not mutually exclusive.

Grouped Data

Another critical assumption of OLS is that all the observations are independent. This assumption is frequently violated in practice. A prime example is regression with grouped data. For example, you may run a regression of profits for firms in a variety of industries. It seems plausible that profits will be correlated for firms within any given industry.

Here is a more extreme example. Suppose you want to know if redheads are more popular than brunets. You have two friends named John and Paul. John has brown hair and Paul has red hair. At 1:00 p.m., you poll the class to see how many classmates like John more than Paul. You find that forty-five prefer John and fifteen prefer Paul. You repeat this poll at 1:10, 1:20, and so on. Your data look as follows:

Name    Hair Color    Popularity
John    B             45
Paul    R             15
John    B             46*
Paul    R             15
John    B             46
Paul    R             15
John    B             46
Paul    R             15
John    B             46
Paul    R             14*

*One student arrived late and another student left class early to go to a job interview.

Given these ten observations, you regress popularity on an indicator variable for hair color, where Hair=1 if the hair is brown and Hair=0 if red. The resulting coefficient is B_hair ≈ 30.5, which is statistically significant, thanks to the ten observations and the apparent nine degrees of freedom. Do you conclude that people with brown hair are more popular? Of course not. The reason you get a significant coefficient on hair color is that you do not have ten independent observations; you have two observations that are each repeated five times. The computer has no reason to know this; it thinks you have lots of experiments and computes the standard errors accordingly.

This is an extreme example of groupiness in the data. If you do not account for the groupiness of your data, you will overstate the true degrees of freedom in your model, and the reported standard errors will be artificially small. You run a great risk of tricking yourself into thinking that you have significant findings when in reality you do not.

One way to deal with grouped data is to estimate fixed effects. In fixed effects models, the computer ignores across-group variation when estimating the coefficients. Thus, only within-group variation matters. (To determine the effect of hair color in the prior example, either John or Paul would need to change theirs from brown to red, or vice versa.)

There are times when you do not want to estimate fixed effects models. This is especially true if you do not have much within-group variation. For example, suppose you want to study the effect of market demographics on yogurt sales. The demographics of the communities surrounding the stores will change little over time. If you include store fixed effects, you will not have sufficient within-store variation. You will have to omit the store dummies and rely on across-store variation. (You now run a heightened risk of omitted variable bias, of course, but if you have a rich set of demographics, this risk is minimized.)

Suppose you go ahead and omit the store fixed effects. It is now likely that the errors across observations within each store are no longer independent. You have grouped data, and if you don't account for it, your standard errors will be biased. The technique for adjusting the standard errors to account for groupiness is preprogrammed into Stata. Continuing the example, if you have data on the income of each store's local community, you could estimate the following regression:

regress sales1 price1 promo1 income

To correct the standard errors for possible groupiness, just use the cluster option in Stata. In this case, the groupiness comes from the variable store, so you type:

regress sales1 price1 promo1 income, cluster(store)

Coping with Grouped Data

- Use common sense as a guide to determine whether your data fall naturally into groups. As an alternative, examine the error terms for observations within specific groups. Are they systematically positive or negative? If so, then you may not have independent observations, and your standard errors are too small.
- You can estimate a fixed effects model to avoid the resulting bias in the standard errors, but you will be unable to examine the effects of variables that vary only between groups.
- If you want to preserve between-group variation but avoid biased standard errors, use the ,cluster(groupname) option in Stata.
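To see the groupiness problem and the clustered fix in miniature, here is a sketch built on invented data patterned after the John and Paul example (with only two groups even the clustered standard error is unreliable; the point is simply that the default standard error badly overstates precision):

clear
input id hair popularity
1 1 45
2 0 15
1 1 46
2 0 15
1 1 46
2 0 15
1 1 46
2 0 15
1 1 46
2 0 14
end
regress popularity hair                 // ten "observations," misleadingly small standard error
regress popularity hair, cluster(id)    // clustered by person: really just two groups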

An Unexpected Problem (Math Optional)

Suppose your initial model is:

Y = B_0 + B_1 X + e_y

You decide that you want to divide both Y and X by some other variable V. An example might be when you express both variables in per capita amounts, where V is the size of the population. If you divide Y by V, then you must divide the RHS by V to keep the equation correct. This means you are effectively regressing:

Y/V = B_0/V + (B_1 X)/V + e_y/V

Note that the error term is now clearly larger when V is smaller, that is, when the dependent and independent variables are smaller. This is blatant heteroskedasticity.

One excuse for keeping the model this way is that the underlying model is in fact:

Y/V = B_0 + (B_1 X)/V + e_y

OLS appears to be safe here. If you think this is the correct model, you are almost safe. A word of caution is still necessary. Suppose the true model is:

Y/V = B_0 + (B_1 X)/V + e_y

but you do not have a precise measure of V. Instead, you have U = V + e_u. Thus, you are actually regressing:

Y/(V + e_u) = B_0 + (B_1 X)/(V + e_u) + e_y

Note that whenever e_u is positive, both the dependent variable Y/(V + e_u) and the predictor variable X/(V + e_u) in the regression are smaller in magnitude than the corresponding variables in the true model. Similarly, if e_u is negative, both variables are larger than they are supposed to be. This implies that the two variables move together in the data, not because they are causally related but because of the noisy measurement of V. This will bias the estimate of B_1 upward, making it more positive than the true B_1. This bias emerges whenever you divide the dependent and predictor variables by the same divisor and the divisor is a noisy variable.

Many empirical researchers feel that such bias is inevitable and suggest that you restate the regression in such a way as to avoid dividing both the LHS and RHS by the same variable. I generally side with this skeptical group, although I think it important to determine whether the divisor accurately measures the theoretical construct. For example, I am less worried about dividing the LHS and RHS by population (to obtain per capita values) than I am about dividing by other variables that might be measured with considerable noise (or be noisy measures of the underlying theoretical construct).
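A closing simulation sketch (invented values, not from the note) makes the bias concrete: X and Y below are unrelated, yet dividing both by the same noisy divisor produces a positive estimated slope:

clear
set seed 469
set obs 500
generate x = rnormal(10, 2)
generate y = rnormal(10, 2)        // no true relationship between Y and X
generate v = rnormal(20, 2)        // true divisor
generate u = v + rnormal(0, 4)     // noisy measure of the divisor
generate y_ratio = y/u
generate x_ratio = x/u
regress y_ratio x_ratio            // slope biased upward despite no true relation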