Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data


Jeff Dominitz
RAND

and

Charles F. Manski
Department of Economics and Institute for Policy Research, Northwestern University

May 2018

Abstract

Missing data problems are ubiquitous in data collection. In surveys, these problems may arise from unit nonresponse, item nonresponse, and panel attrition. Building on the Dominitz and Manski (2017) study of choice between two or more sampling processes that differ in cost and quality, we study minimax-regret sample design in anticipation of missing data, where the collected data will be used for prediction under square loss of the values of functions of two variables. The analysis imposes no assumptions that restrict unobserved outcomes. Findings are reported for prediction of the values of linear and indicator functions using panel data with attrition. We also consider choice between a panel and repeated cross sections.

We are grateful for the comments of Max Tabord-Meehan.

1. Introduction

Missing data problems are ubiquitous in data collection. In surveys, missing data may arise from unit nonresponse, item nonresponse, and panel attrition. Researchers who want to minimize the mean square error of estimates in surveys with missing data should be concerned with both bias and variance, as recommended in the literature on total survey error. However, statisticians have focused on variance, as explained by Groves and Lyberg (2010): "The total survey error format forces attention to both variance and bias terms ... Most statistical attention to surveys is on the variance terms largely, we suspect, because that is where statistical estimation tools are best found" (p. 868).

Dominitz and Manski (2017) provided tools for making sample design choices that explicitly account for both variance and bias while imposing no assumptions that restrict the unobserved outcomes. The analysis used the Wald framework of statistical decision theory to study choice between two or more sampling processes that differ in the cost of data collection and the quality of the data obtained, where data quality is determined by the response rate. The study focused on minimax-regret sample design for prediction of a real-valued outcome under square loss; that is, design that minimizes maximum mean square error when a reasonable and tractable predictor will be used. The ideal, but unknown, best predictor in this setting is the population mean outcome.

It is computationally challenging to determine the value and maximum regret of the predictor that minimizes maximum regret. Seeking an approach that is both tractable and reasonable, we studied prediction using the midpoint of a sample analog estimate of the identification region for the population mean. This midpoint predictor is easy to compute and its maximum regret has a simple and sensible analytical form.
If the identification interval for the population mean were known rather than estimated, then its midpoint would be the minimax-regret prediction, which we find to be another appealing aspect of this midpoint predictor. We now build on this framework to study minimax-regret sample design in anticipation of missing data, where the collected data will be used for prediction under square loss of functions of two variables.

Relative to our previous study, addressing this expanded prediction problem requires attention to additional dimensions of data cost and quality. We specifically study choice of sample size for a two-period panel with attrition. That is, we consider longitudinal data collection with a 100-percent response rate in period 1 and some nonresponse in period 2. Some findings apply as well to collection of data on two household members and to cross-sectional surveys with item nonresponse.

Section 2 summarizes key elements of and findings from our previous study and calls attention to some complications that must be addressed in the expanded prediction problem. Analysis may be affected not only by the higher dimension of the data but also by the form of the function whose value is to be predicted. As in our previous study, one must take a stand on how the data will be used before making sample design decisions.

Section 3 studies the maximum regret of sample designs for prediction of two types of functions. The cases we study are prediction of the value of linear and indicator functions. In both cases, we presume knowledge of the response rate and use of a midpoint predictor akin to that posed in the previous study. Again, the attractions of midpoint predictors are that they are easy to compute and have sensible analytical forms for regret. Again, it is computationally challenging to determine the value and maximum regret of the predictor that minimizes maximum regret when the identification region for the population mean must be estimated.

Section 4 compares the maximum regret of prediction with panel data to the maximum regret of prediction with repeated cross-sectional (RCS) data. We find that RCS data collection often yields smaller maximum regret than a panel with equivalent sample size and cost when interest centers on prediction of linear functions. A particularly striking result is that RCS yields smaller maximum regret than a panel with complete response.
The reason is that RCS draws an independent random sample each period, but panel observations may be correlated across periods. However, collection of panel data is typically more informative than RCS when the problem is to predict the value of an indicator function. Section 5 discusses extensions of the analysis.

2. Best Prediction under Square Loss of Functions of Two Variables

Consider best prediction under square loss of a bounded real function $f(y_1, y_2)$, where $(y_1, y_2)$ take values in a bounded interval in $\mathbb{R}^2$, normalized to be the unit square $[0, 1] \times [0, 1]$. Let $P(y_1, y_2)$ be the probability distribution of $(y_1, y_2)$ in a population that is a continuum. Then the best predictor is $E[f(y_1, y_2)]$. The regret of a predictor based on sample data is its mean square error. The subscripts 1 and 2 may refer to time periods in a panel study, a husband and a wife in a study of households, or two different variables associated with each individual in a cross-section.

Suppose a random sample is drawn from $P(y_1, y_2)$, but there may be missing data. Let $z_t = 1$ if $y_t$ is observed and $z_t = 0$ if $y_t$ is missing, for $t = 1, 2$. In a cross-sectional survey, unit nonresponse means that $z_1 = z_2 = 0$, whereas item nonresponse means that either $(z_1 = 0, z_2 = 1)$ or $(z_1 = 1, z_2 = 0)$. In a panel with full response in the first period but some attrition in the second, attrition means that $(z_1 = 1, z_2 = 0)$. We focus on this case. Thus, we assume $P(z_1 = 1) = 1$, but permit $P(z_2 = 1) < 1$. We assume knowledge of the response rate in the second period but no knowledge of the composition of nonresponse.

Although we focus on collection of panel data, our analysis applies as well to household surveys in which a researcher always interviews a specified spouse, but interviews only a subset of the other spouses. It also applies to surveys of individuals in which all sample members respond to one question but only a subset responds to another question. For example, the first question may ask about a non-sensitive matter such as age or education, while the second asks about a sensitive matter such as income or drug use.

In general, survey response may vary with the process used to collect data. For example, a person may agree to provide data in an internet survey but not to be interviewed face-to-face. To make the dependence of response on the survey process explicit, we could denote the process by $q$ and the missing-data indicators by $(z_{q1}, z_{q2})$. For simplicity, we keep the $q$ notation implicit in most of the paper.

2.1. Previous Findings

Dominitz and Manski (2017) studied minimax-regret sample design for best prediction under square loss of a function of one variable, when a high-cost/high-quality sampling process accurately measures the outcome of each sample member and a low-cost/low-quality sampling process has nonresponse. The analysis assumed knowledge of the response rate, but it imposed no restrictions on the values of the data missing due to nonresponse.

Using the present notation, one chooses between two processes for measuring $y_1$. The high-cost process has $P(z_1 = 1) = 1$, whereas the low-cost process has known $P(z_1 = 1) < 1$. Therefore, the high-cost process point-identifies $E(y_1)$, whereas the low-cost process partially identifies $E(y_1)$.

With a predetermined budget, the minimax-regret choice between these two sampling processes is easy to determine when it is assumed that specific reasonable predictors will be used. We assumed that a sample-average predictor will be calculated based on the high-cost data and a midpoint predictor will be calculated based on the low-cost data. The midpoint predictor is the middle of a sample analog estimate of the identification region for the population mean. Among other findings, we showed that the maximum regret of the midpoint predictor is smaller than that of sample-average predictors that have been commonly used by researchers who face missing data problems. The latter predictors use ignorability assumptions to impute missing values or to motivate discarding these sample members.

The analysis generalizes to designs that combine low-cost and high-cost sampling processes, under the assumption that the observed outcomes will be pooled.
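As a minimal sketch of the one-variable midpoint predictor described above, the following Python function computes a sample analog of the identification interval for the population mean of an outcome normalized to lie in [0, 1], given a known response rate, and returns its midpoint. The function name, the use of `None` to mark nonrespondents, and the example data are illustrative assumptions and are not taken from Dominitz and Manski (2017).

```python
from typing import Optional, Sequence, Tuple


def midpoint_predictor(outcomes: Sequence[Optional[float]],
                       response_rate: float) -> Tuple[float, float, float]:
    """Return (lower, upper, midpoint) of the estimated identification
    interval for the mean of an outcome in [0, 1].

    `outcomes` holds one entry per sample member; None marks nonresponse.
    `response_rate` is the known population response rate P(z = 1).
    """
    observed = [y for y in outcomes if y is not None]
    if not observed:
        if response_rate > 0:
            raise ValueError("no observed outcomes but response_rate > 0")
        mean_observed = 0.0  # irrelevant: it is multiplied by zero below
    else:
        mean_observed = sum(observed) / len(observed)

    # Missing outcomes may lie anywhere in [0, 1], so the interval has
    # width equal to the nonresponse rate.
    lower = mean_observed * response_rate
    upper = lower + (1.0 - response_rate)
    return lower, upper, 0.5 * (lower + upper)


# Hypothetical sample: two of five members did not respond.
lower, upper, midpoint = midpoint_predictor([0.2, 0.8, None, 0.5, None], 0.6)
print(lower, upper, midpoint)
```

The midpoint imputes the value 1/2 to every missing outcome, which is what keeps the worst-case bias as small as possible when nothing is assumed about nonrespondents.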
Further, when the budget is not predetermined, the analysis shows how to choose a budget sufficient to achieve an ε-optimal design; that is, a budget sufficient to make maximum regret less than a specified ε > 0.

2.2. Application to Functions of Two Variables

Predicting the value of a function of two variables requires attention to additional dimensions of cost and quality, as well as to the form of the function. The general approach to analyzing the problem, however, is unchanged. First, we determine the identification region for the best predictor under the assumed data generating process. Second, we define for any sample size a midpoint predictor based on a sample analog of this identification region. Finally, we solve for the maximum regret of this predictor in the case of indicator functions and an informative upper bound on maximum regret in the case of linear functions.

When $f(y_1, y_2)$ is a general function, the analysis of Dominitz and Manski (2017) yields an outer bound on $E[f(y_1, y_2)]$; that is, a bound that holds but need not be sharp. The midpoint predictor studied there is applicable, but it may not be the best possible. Suppose that one knows the rate of complete response; that is, $P(z_1 = z_2 = 1)$. The outer bound is obtained by (a) noting that the value of $f(y_1, y_2)$ is observed if both $y_1$ and $y_2$ are observed and (b) considering the value of $f(y_1, y_2)$ to be missing otherwise. The bound obtained in this manner may not be sharp because observation of either but not both of $y_1$ and $y_2$ may constrain the value of $f(y_1, y_2)$. Section 3 studies two classes of functions in which this occurs.

3. Linear and Indicator Functions of Two Variables

The main new contributions of this paper are to study prediction of the values of two classes of functions whose structure is such that observation of $y_1$ alone may be informative about the value of $f(y_1, y_2)$. We consider prediction of the value of linear functions in Section 3.1 and indicator functions in Section 3.2. We assume complete response in period 1 and that one knows the response rate in period 2. We suppose throughout that the midpoint of a sample analog of the identification region for $E[f(y_1, y_2)]$ is used to predict the function value.

To motivate the classes of functions we analyze, consider studies of employment dynamics, such as those that have utilized the Panel Study of Income Dynamics for the past 50 years.
Let $y_1$ and $y_2$ denote the fraction of the year that a person works in years 1 and 2. Interest may center on employment change $(y_2 - y_1)$ or average employment $(y_1 + y_2)/2$. Section 3.1 covers such linear functions. Alternatively, one may want to predict the occurrence of an event, such as employment growth. Then the objective is prediction of the value of the indicator function $1[y_2 > y_1]$. Section 3.2 covers prediction of indicator functions.

3.1. Linear Functions

Let $f(y_1, y_2) = a + b y_1 + c y_2$, for known values of $(a, b, c)$. For ease of exposition, we consider functions where $b \ge 0$ and $c \ge 0$. The analysis may be extended to other cases. The best predictor is $E[f(y_1, y_2)] = a + bE(y_1) + cE(y_2)$. The identification region for $E[f(y_1, y_2)]$ with panel data is the interval

$$\big[\,a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1),\;\; a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1) + cP(z_2 = 0)\,\big]. \tag{1}$$

This interval may be derived by applying the Law of Iterated Expectations. The lower bound obtains when $P(y_2 \mid z_2 = 0)$ is degenerate at the value 0, the lower limit of the support of $P(y_2)$. The upper bound obtains when $P(y_2 \mid z_2 = 0)$ is degenerate at the value 1, the upper limit of the support of $P(y_2)$. The width of this interval is $cP(z_2 = 0)$.

3.1.1. Panel Data Midpoint Predictor

Let $m_t$ be the sample average of the observed values in period $t$; that is, $m_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}$ and $m_2 = \frac{1}{N_2}\sum_{i:\, z_{2i} = 1} y_{2i}$, where $N_2$ is the number of period-2 respondents. A sample-analog midpoint predictor is

$$a + b\,m_1 + c\,m_2\,P(z_2 = 1) + \tfrac{1}{2}\,c\,P(z_2 = 0). \tag{2}$$
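The panel-data midpoint predictor (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions: it takes the predictor to be the midpoint of the sample analog of interval (1), namely a + b·m1 + c·m2·P(z2 = 1) + ½·c·P(z2 = 0) with c ≥ 0; the function name, the `None` convention for attriters, and the example data are hypothetical.

```python
from typing import Optional, Sequence


def linear_midpoint_predictor(y1: Sequence[float],
                              y2: Sequence[Optional[float]],
                              a: float, b: float, c: float,
                              response_rate: float) -> float:
    """Midpoint of the estimated identification interval for
    E[a + b*y1 + c*y2] with outcomes in [0, 1] and c >= 0.

    y1: period-1 outcomes for all N1 sample members (full response).
    y2: period-2 outcomes, with None marking attrition.
    response_rate: known period-2 response rate P(z2 = 1).
    """
    if c < 0:
        raise ValueError("this sketch assumes c >= 0")
    m1 = sum(y1) / len(y1)
    observed2 = [y for y in y2 if y is not None]
    m2 = sum(observed2) / len(observed2)

    lower = a + b * m1 + c * m2 * response_rate   # missing y2 all equal 0
    width = c * (1.0 - response_rate)             # missing y2 range over [0, 1]
    return lower + 0.5 * width


# Average employment (y1 + y2) / 2 for a hypothetical four-person panel
# in which one person attrits in period 2.
y1 = [0.5, 1.0, 0.0, 0.5]
y2 = [1.0, None, 0.5, 0.0]
print(linear_midpoint_predictor(y1, y2, a=0.0, b=0.5, c=0.5,
                                response_rate=0.75))  # 0.5
```

The period-1 data of attriters still enter through `m1`; only the period-2 term is bounded rather than estimated.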

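The regret of the predictor is its mean square error, the sum of squared bias and variance. To find the squared bias of (2) with known period-2 response rate $P(z_2 = 1)$, use the Law of Iterated Expectations to write the best predictor as

$$a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1) + cE(y_2 \mid z_2 = 0)P(z_2 = 0).$$

Under random sampling, $E(m_1) = E(y_1)$ and $E(m_2) = E(y_2 \mid z_2 = 1)$. Bias arises from deviation between $cE(y_2 \mid z_2 = 0)P(z_2 = 0)$ and the midpoint predictor's assigned value of $\tfrac{1}{2}cP(z_2 = 0)$. Squared bias is therefore $c^2 P(z_2 = 0)^2\,[E(y_2 \mid z_2 = 0) - \tfrac{1}{2}]^2$.

The variance of the predictor is

$$b^2 V(m_1) + c^2 P(z_2 = 1)^2 V(m_2) + 2bc\,P(z_2 = 1)\,\mathrm{Cov}(m_1, m_2). \tag{3}$$

Random sampling implies that $V(m_1) = V(y_1)/N_1$. We show in an Appendix that

(4a), (4b) [expressions missing in the source]

Hence,

(5) [expression missing in the source]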

To maximize regret, one can consider the variance and bias terms separately. This is so because bias depends only on $E(y_2 \mid z_2 = 0)$, which can vary independently of all the quantities that determine variance. Squared bias is maximized if $P(y_2 \mid z_2 = 0)$ is degenerate at 0 or 1, in which case maximum squared bias is $\tfrac{1}{4}c^2 P(z_2 = 0)^2$.

Maximum variance across all states of nature has a simple form in the polar cases of no response and complete response in period 2. When $P(z_2 = 1) = 0$, the variance of the predictor reduces to $b^2 V(y_1)/N_1$. $V(y_1)$ is maximized at ¼ when $P(y_1)$ is Bernoulli with mean ½. Hence, maximum variance is $b^2/(4N_1)$. When $P(z_2 = 1) = 1$, the variance of the predictor reduces to

$$\frac{1}{N_1}\big[\,b^2 V(y_1) + c^2 V(y_2 \mid z_2 = 1) + 2bc\,\mathrm{Cov}(y_1, y_2 \mid z_2 = 1)\,\big].$$

The variance and covariance terms can each take a maximum value of ¼ when $(y_1, y_2)$ take values in $[0, 1] \times [0, 1]$. That is,

(a) $\max V(y_1) = \tfrac{1}{4}$,
(b) $\max V(y_2 \mid z_2 = 1) = \tfrac{1}{4}$,
(c) $\max \mathrm{Cov}(y_1, y_2 \mid z_2 = 1) = \tfrac{1}{4}$.

The maxima in (b) and (c) are both achieved when $P(y_1, y_2 \mid z_2 = 1)$ is bivariate Bernoulli with mean (½, ½) and covariance ¼. Finally, with $P(y_1, y_2 \mid z_2 = 1)$ bivariate Bernoulli with mean (½, ½), the maximum in (a) is achieved when $P(y_1 \mid z_2 = 0)$ is also Bernoulli with mean ½.
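The complete-response polar case can be checked numerically. The sketch below searches over a grid of bivariate Bernoulli distributions and confirms that the variance [b²V(y1) + c²V(y2) + 2bc·Cov(y1, y2)]/N1 peaks at (b + c)²/(4N1), attained when y1 = y2 is Bernoulli with mean ½. Restricting the search to Bernoulli distributions and the chosen grid resolution are assumptions of this illustration; the function name is hypothetical.

```python
def max_variance_complete_response(b: float, c: float, n1: int,
                                   steps: int = 20) -> float:
    """Grid-search the variance of b*m1 + c*m2 under complete response
    over bivariate Bernoulli distributions of (y1, y2), with b, c >= 0.

    p11, p10, p01 are the probabilities of (1,1), (1,0), (0,1); the
    remaining mass sits on (0,0).
    """
    best = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            for k in range(steps + 1 - i - j):
                p11, p10, p01 = i / steps, j / steps, k / steps
                mean1, mean2 = p11 + p10, p11 + p01
                var1 = mean1 * (1.0 - mean1)
                var2 = mean2 * (1.0 - mean2)
                cov = p11 - mean1 * mean2
                value = (b * b * var1 + c * c * var2 + 2 * b * c * cov) / n1
                best = max(best, value)
    return best


b, c, n1 = 1.0, 2.0, 50
grid_max = max_variance_complete_response(b, c, n1)
closed_form = (b + c) ** 2 / (4 * n1)
print(grid_max, closed_form)
```

The positive covariance term is what separates this maximum from the sum of the two marginal maxima, b²/(4N1) + c²/(4N1).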

It appears difficult to determine maximum variance across all states of nature in non-polar cases. However, we can determine the maximum of an informative upper bound on the variance of the midpoint predictor. The Appendix shows that

(6) [upper bound on the variance; expression missing in the source]

Using the same argument as for the case with $P(z_2 = 1) = 1$, the maximum of the upper bound on the variance is [expression missing in the source]. Putting the maximum of the upper bound on variance and maximum squared bias together, the upper bound on the maximum regret of the panel data midpoint predictor is

(7) [expression missing in the source]

Inspection of (7) reveals that the upper bound on maximum regret is decreasing in the initial sample size $N_1$, holding the response rate fixed. This holds because the upper bound on maximum variance declines while maximum squared bias is unchanged.

Observe that the upper bound (7) reduces to $(b + c)^2/(4N_1)$ in the polar case where $P(z_2 = 1) = 1$. In this case, the upper bound is achieved when $P(y_1, y_2 \mid z_2 = 1)$ is bivariate Bernoulli with mean (½, ½) and covariance ¼. In the polar case of $P(z_2 = 1) = 0$, (7) reduces to $b^2/(4N_1) + c^2/4$. In this case, the upper bound is achieved when $P(y_1)$ is Bernoulli with mean ½ and $P(y_2 \mid z_2 = 0)$ is degenerate at 0 or 1.

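3.2. Indicator Functions

Suppose now that the objective is best prediction of the event that $(y_1, y_2)$ take values in some set $A \subset [0, 1] \times [0, 1]$. Then $f(y_1, y_2)$ is the indicator function $1[(y_1, y_2) \in A]$ and the best predictor is $P[(y_1, y_2) \in A]$. Examples include: (a) $f(y_1, y_2) = 1[y_1 = j, y_2 = k]$ for some values $(j, k) \in [0, 1] \times [0, 1]$, (b) [example missing in the source], (c) $f(y_1, y_2) = 1[y_2 > y_1]$, and (d) $f(y_1, y_2) = 1[(y_1 + y_2)/2 > \gamma]$ for some $\gamma \in (0, 1)$.

For ease of exposition, we focus on settings in which observation of $y_1$ may imply that $(y_1, y_2) \notin A$ but cannot imply that $(y_1, y_2) \in A$. This holds, for example, when $f(y_1, y_2) = 1[y_1 = j, y_2 = k]$. Then $y_1 \ne j$ implies that $(y_1, y_2) \notin A$, but $y_1 = j$ does not imply that $(y_1, y_2) \in A$. It also holds when $f(y_1, y_2) = 1[(y_1 + y_2)/2 > \gamma]$ and $\gamma > ½$. Then $y_1 < 2\gamma - 1$ implies that $(y_1 + y_2)/2 < \gamma$, but $y_1 > 2\gamma - 1$ does not imply that $(y_1 + y_2)/2 > \gamma$. The analysis may be extended to settings where observation of $y_1$ implies that $(y_1, y_2) \in A$.

To obtain the identification region for $P[(y_1, y_2) \in A]$, define the binary random variable $u$ as follows:

u = 1 if $z_2 = 0$ and the observed value of $y_1$ implies $(y_1, y_2) \notin A$,
u = 0 otherwise.

The identification region for $P[(y_1, y_2) \in A]$ is the interval

$$\big[\,P[(y_1, y_2) \in A, z_2 = 1],\;\; P[(y_1, y_2) \in A, z_2 = 1] + P(z_2 = 0, u = 0)\,\big]. \tag{8}$$

This interval may be derived by applying the Law of Total Probability and recalling that, when $z_2 = 0$, the event $(y_1, y_2) \in A$ can occur only when $u = 0$.

3.2.1. Panel Data Midpoint Predictor

A midpoint predictor based on (8) is the midpoint of its sample analog, namely

$$\frac{1}{N_1}\sum_{i=1}^{N_1}\Big\{1[(y_{1i}, y_{2i}) \in A, z_{2i} = 1] + \tfrac{1}{2}\,1[z_{2i} = 0, u_i = 0]\Big\}. \tag{9}$$

To solve for the maximum regret of this predictor, we first derive its squared bias and variance. We then maximize the sum. To shorten the notation, we define two Bernoulli random variables $d_{1i} = 1[(y_{1i}, y_{2i}) \in A, z_{2i} = 1]$ and $d_{2i} = 1[z_{2i} = 0, u_i = 0]$. Then we rewrite (9) as

$$\frac{1}{N_1}\sum_{i=1}^{N_1}\big(d_{1i} + \tfrac{1}{2}d_{2i}\big). \tag{9'}$$

Squared Bias

To find the squared bias, use the Law of Total Probability to write the best predictor

$$P[(y_1, y_2) \in A] = P[(y_1, y_2) \in A, z_2 = 1] + P[(y_1, y_2) \in A, z_2 = 0]. \tag{10}$$

Under random sampling, $E(d_1) = P[(y_1, y_2) \in A, z_2 = 1]$ and $E(d_2) = P(z_2 = 0, u = 0)$. Therefore, bias arises from deviation between $P[(y_1, y_2) \in A, z_2 = 0]$ and $P(z_2 = 0, u = 0)/2$. The event $[(y_1, y_2) \in A, z_2 = 0]$ implies that $u = 0$. Hence, $P[(y_1, y_2) \in A, z_2 = 0] = P[(y_1, y_2) \in A \mid z_2 = 0, u = 0]\,P(z_2 = 0, u = 0)$ and squared bias may be expressed as follows:

$$P(z_2 = 0, u = 0)^2\,\big\{P[(y_1, y_2) \in A \mid z_2 = 0, u = 0] - \tfrac{1}{2}\big\}^2. \tag{11}$$

Variance

To find the variance, note that, under random sampling, the midpoint predictor (9) is a linear function of the bivariate Bernoulli random variable $(d_1, d_2)$, whose realizations are independent and identically distributed across individuals $i$. Let $p_{jk} = P(d_1 = j, d_2 = k)$, $j \in \{0, 1\}$, $k \in \{0, 1\}$. Analysis of bivariate Bernoulli random variables in Dai et al. (2013), equation (2.12), shows that $\mathrm{Cov}(d_1, d_2) = p_{11}p_{00} - p_{10}p_{01}$. Because $d_1 = 1$ requires $z_2 = 1$ and $d_2 = 1$ requires $z_2 = 0$, we also know that $p_{11} = 0$, so that $P(d_1 = 1) = p_{10}$, $P(d_2 = 1) = p_{01}$, and $\mathrm{Cov}(d_1, d_2) = -p_{10}p_{01}$. Thus, the variance of the midpoint predictor (9) can be written as follows:

$$\frac{1}{N_1}\Big[\,p_{10}(1 - p_{10}) + \tfrac{1}{4}\,p_{01}(1 - p_{01}) - p_{10}p_{01}\Big]. \tag{12}$$

Maximum Regret

Summing (11) and (12), the regret of the midpoint predictor is

$$\frac{1}{N_1}\Big[\,p_{10}(1 - p_{10}) + \tfrac{1}{4}\,p_{01}(1 - p_{01}) - p_{10}p_{01}\Big] + p_{01}^2\,\big\{P[(y_1, y_2) \in A \mid z_2 = 0, u = 0] - \tfrac{1}{2}\big\}^2. \tag{13}$$

Note that $p_{01} = P(z_2 = 0, u = 0)$ is found in both the variance and the squared bias components of regret. Hence, in contrast to the case with linear functions, the maximum regret of the predictor of an indicator function cannot be determined by separately maximizing variance and squared bias.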
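A short Python sketch may help fix ideas about how variance and squared bias combine in the regret of the indicator-function midpoint predictor. It assumes the following reading of the derivation above: the predictor averages d1 + ½·d2 over N1 individuals, where d1 indicates "event observed to occur with period-2 response" and d2 indicates "period-2 nonresponse with period-1 data uninformative"; `p10` and `p01` denote P(d1 = 1) and P(d2 = 1), which cannot occur together; and `theta` denotes the probability of the event among uninformative nonrespondents. All names are hypothetical.

```python
def indicator_midpoint_regret(p10: float, p01: float, theta: float,
                              n1: int) -> float:
    """Mean square error of the midpoint predictor in one state of nature.

    p10:   P(event occurs and period-2 outcome is observed)
    p01:   P(period-2 outcome missing and period-1 data uninformative)
    theta: P(event occurs | missing and uninformative), unknown in practice
    n1:    initial sample size
    """
    if p10 < 0 or p01 < 0 or p10 + p01 > 1:
        raise ValueError("need p10, p01 >= 0 and p10 + p01 <= 1")
    # Var(d1 + d2/2) with Cov(d1, d2) = -p10 * p01 (the two are exclusive).
    variance = (p10 * (1 - p10) + 0.25 * p01 * (1 - p01) - p10 * p01) / n1
    # The predictor credits half of the uninformative nonrespondents.
    squared_bias = (p01 * (theta - 0.5)) ** 2
    return variance + squared_bias


# Bias vanishes when theta = 1/2 and is largest at theta = 0 or 1.
print(indicator_midpoint_regret(0.4, 0.2, 0.5, 100))
print(indicator_midpoint_regret(0.4, 0.2, 1.0, 100))
```

Because `p01` appears in both terms, raising it shrinks part of the variance while inflating the worst-case bias, which is why the two pieces cannot be maximized separately.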

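Setting $P[(y_1, y_2) \in A \mid z_2 = 0, u = 0]$ = 0 or 1 maximizes squared bias for any feasible value of $(p_{10}, p_{01})$ and does not affect variance.[1] Hence, maximum regret for a given value of $(p_{10}, p_{01})$ is

$$\frac{1}{N_1}\Big[\,p_{10}(1 - p_{10}) + \tfrac{1}{4}\,p_{01}(1 - p_{01}) - p_{10}p_{01}\Big] + \tfrac{1}{4}\,p_{01}^2. \tag{14}$$

The problem is to maximize (14) over the feasible range $0 \le p_{10} \le P(z_2 = 1)$ and $0 \le p_{01} \le P(z_2 = 0)$. Fix $p_{01}$ at any feasible value and differentiate (14) with respect to $p_{10}$. The derivative $(1 - 2p_{10} - p_{01})/N_1$ is decreasing in $p_{10}$. Hence, the maximum occurs at the interior solution $p_{10} = (1 - p_{01})/2$ if this is a feasible value of $p_{10}$ and at the boundary of the feasible range otherwise.

Considering first the interior solution for $p_{10}$, plug $p_{10} = (1 - p_{01})/2$ into (14) and solve the concentrated optimization problem

$$\max_{p_{01}}\; \frac{1 - p_{01}}{4N_1} + \frac{p_{01}^2}{4} \quad \text{s.t. } p_{01} \ge 0,\; p_{01} \le P(z_2 = 0). \tag{15}$$

The derivative is increasing in $p_{01}$. Therefore, with an interior solution for $p_{10}$, regret is maximized at the boundary where either $p_{01} = 0$ or $p_{01} = P(z_2 = 0)$, in which case $p_{10} = ½$ or $p_{10} = ½P(z_2 = 1)$, respectively.

Inspection of the possible boundary solutions shows that, when $P(z_2 = 1) < 1 - 1/N_1$, maximum regret occurs where $p_{01} = P(z_2 = 0)$ and $p_{10} = ½P(z_2 = 1)$. It follows that the maximum regret of the midpoint predictor (9) in this typical setting is

$$\frac{1}{4}\left[\frac{P(z_2 = 1)}{N_1} + P(z_2 = 0)^2\right]. \tag{16}$$

[1] Note that $P[(y_1, y_2) \in A \mid z_2 = 0, u = 0]$ is only defined if $P(z_2 = 0, u = 0) > 0$. However, if $P(z_2 = 0, u = 0) = 0$, then there is no missing data problem.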
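The maximization that leads to (16) can be checked by brute force. The sketch below assumes that the state-specific maximum regret is (1/N1)[p10(1 − p10) + ¼·p01(1 − p01) − p10·p01] + ¼·p01², with p10 ranging over [0, P(z2 = 1)] and p01 over [0, P(z2 = 0)], and compares a grid maximum with the closed form ¼[P(z2 = 1)/N1 + P(z2 = 0)²]. The function names and grid resolution are illustrative choices, and the comparison is only meaningful when P(z2 = 1) < 1 − 1/N1.

```python
def max_regret_closed_form(response_rate: float, n1: int) -> float:
    """Closed-form maximum regret, valid when response_rate < 1 - 1/n1."""
    return 0.25 * (response_rate / n1 + (1.0 - response_rate) ** 2)


def max_regret_grid(response_rate: float, n1: int, steps: int = 200) -> float:
    """Maximize worst-case regret over a grid of feasible (p10, p01)."""
    nonresponse_rate = 1.0 - response_rate
    best = 0.0
    for i in range(steps + 1):
        p10 = response_rate * i / steps
        for k in range(steps + 1):
            p01 = nonresponse_rate * k / steps
            variance = (p10 * (1 - p10) + 0.25 * p01 * (1 - p01)
                        - p10 * p01) / n1
            max_squared_bias = 0.25 * p01 ** 2
            best = max(best, variance + max_squared_bias)
    return best


for rate, n in [(0.8, 100), (0.3, 50)]:
    print(rate, n, max_regret_grid(rate, n), max_regret_closed_form(rate, n))
```

An even number of grid steps places p10 = ½·P(z2 = 1) and p01 = P(z2 = 0) exactly on the grid, so the two values agree up to floating-point error rather than only approximately.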

Inspection of (16) shows that maximum regret is decreasing in the initial sample size $N_1$, holding the response rate fixed. Differentiation of (16) with respect to the response rate shows that, holding the sample size fixed, maximum regret is decreasing in the response rate when $P(z_2 = 1) < 1 - 1/(2N_1)$.

3.2.2. Outer-Bound Midpoint Predictor

To apply the Dominitz and Manski (2017) midpoint predictor to indicator functions of two variables, once again consider the value of $f(y_1, y_2)$ to be missing when $y_2$ is missing. Then the identification region for $E[f(y_1, y_2)]$ is the interval

$$\big[\,P[(y_1, y_2) \in A, z_2 = 1],\;\; P[(y_1, y_2) \in A, z_2 = 1] + P(z_2 = 0)\,\big]. \tag{17}$$

A midpoint predictor based on this outer bound on $E[f(y_1, y_2)]$ is

$$\frac{1}{N_1}\sum_{i=1}^{N_1}\Big\{1[(y_{1i}, y_{2i}) \in A, z_{2i} = 1] + \tfrac{1}{2}\,1[z_{2i} = 0]\Big\}. \tag{18}$$

Note that (18) differs from (9) only in the arguments of the second indicator function; that is, $1[z_{2i} = 0]$ versus $1[z_{2i} = 0, u_i = 0]$. Maximum regret of midpoint predictor (18) is identical to maximum regret of midpoint predictor (9), because maximum regret of (9) arises where $P(z_2 = 0, u = 0) = P(z_2 = 0)$ and, therefore, $1[z_{2i} = 0, u_i = 0] = 1[z_{2i} = 0]$ for all $i$.

Equivalence of the two midpoint predictors with respect to maximum regret does not imply that the two are equivalent in all states. In fact, midpoint predictor (9) dominates (18). That is, its regret is less than that of (18) in states where first-period data may be informative and equals that of (18) in the "worst-case" states where the first-period data are always uninformative. The latter are states in which $z_2 = 0$ always implies that $u = 0$, so observation of the period-1 outcome is not informative about the value of $E[f(y_1, y_2)]$.

Maximum regret occurs in states when the first-period data are always uninformative. Hence, the maximum regret of (9) equals the maximum regret of (18).

3.3. Choice of Sample Design

Suppose that a set $Q$ of sampling processes are feasible. Each $q \in Q$ has a cost $\pi_q$ per initial sample member and a vector of response rates $P(z_{q1} = i, z_{q2} = j)$, where $i$ and $j$ equal 0 or 1. We assume for simplicity that the cost per sample member does not depend on whether a person responds in period 2. Also for simplicity, we consider choice between two designs, the lower-cost design having higher attrition.

Let $N_{1L}$ and $N_{1H}$ be the two initial sample sizes. The low-cost/low-quality design L has total cost $\pi_L N_{1L}$ and response rate $P(z_{L2} = 1)$ in period 2, whereas the high-cost/high-quality design H has total cost $\pi_H N_{1H}$ and response rate $P(z_{H2} = 1)$ in period 2, with $0 < \pi_L < \pi_H$ and $P(z_{L2} = 1) < P(z_{H2} = 1) < 1$.

In principle, one may have additional information about the sample design beyond cost and attrition rate that would affect the calculation of maximum regret and choice among designs. For example, the composition of those who choose to respond or not to respond in period 2 could be known to vary across designs, even if the designs have identical response rates. Thus, it may be that two designs have equal period-2 response rates yet differ in the distribution of outcomes among respondents. We assume that no such information is available prior to data collection.

3.3.1. Allocation of a Predetermined Budget

Suppose that the objective is to predict the value of a linear or indicator function, using the relevant midpoint predictor. Suppose that the planner has a predetermined budget $B$ and must choose one of the two designs. The feasible sample sizes are $N_{1L} = \mathrm{INT}(B/\pi_L)$ for low-cost sampling and $N_{1H} = \mathrm{INT}(B/\pi_H)$ for high-cost sampling. We henceforth ignore for simplicity the fact that sample sizes must be integers and take the feasible low-cost sample size to be $N_{1L} = B/\pi_L$ and the feasible high-cost sample size to be $N_{1H} = B/\pi_H$.

Consider first best prediction of the value of the linear function $a + b y_1 + c y_2$, for known values of $(a, b, c)$. Using the midpoint predictor (2), the feasible low-cost and high-cost designs yield upper bounds on maximum regret obtained by evaluating (7) at $(N_{1L}, P(z_{L2} = 1))$ and $(N_{1H}, P(z_{H2} = 1))$, respectively [expressions missing in the source]. Hence, the low-cost design has a smaller upper bound on maximum regret when the budget is less than a certain threshold and the high-cost design does otherwise. The threshold budget is

(19) [expression missing in the source]

Consider now best prediction of the value of the indicator function $1[(y_1, y_2) \in A]$. Using midpoint predictor (9) and maximum regret (16), the feasible low-cost and high-cost designs yield maximum regret

$$\frac{1}{4}\left[\frac{\pi_L P(z_{L2} = 1)}{B} + P(z_{L2} = 0)^2\right] \quad \text{and} \quad \frac{1}{4}\left[\frac{\pi_H P(z_{H2} = 1)}{B} + P(z_{H2} = 0)^2\right],$$

respectively. The low-cost design has smaller maximum regret when the budget is less than a certain threshold and the high-cost design is better otherwise. The threshold budget, at which the two expressions are equal, is

$$\frac{\pi_H P(z_{H2} = 1) - \pi_L P(z_{L2} = 1)}{P(z_{L2} = 0)^2 - P(z_{H2} = 0)^2}. \tag{20}$$

3.3.2. Choice of Budget to Achieve ε-optimal Prediction

Now suppose that budget is a choice variable. In principle, the planner should perform a benefit-cost analysis. Devoting a larger budget to data collection improves prediction of outcomes but diverts resources from other uses. The planner must resolve this tension.

Adapting arguments in Manski and Tetenov (2016) regarding sample size selection to enable ε-optimal treatment decisions, Dominitz and Manski (2017) consider choice of a design so that the maximum MSE of the midpoint predictor is no larger than a specified ε > 0. If the objective is prediction of the value of linear functions, the analysis above shows that a budget of size $B$ suffices to achieve this objective if the smaller of the two upper bounds on maximum regret is no larger than ε:

(21) [expression missing in the source]

Similarly, for indicator functions, a budget of size $B$ suffices to achieve this objective if

$$\min\left\{\frac{1}{4}\left[\frac{\pi_L P(z_{L2} = 1)}{B} + P(z_{L2} = 0)^2\right],\; \frac{1}{4}\left[\frac{\pi_H P(z_{H2} = 1)}{B} + P(z_{H2} = 0)^2\right]\right\} \le \varepsilon. \tag{22}$$

These budget sizes suffice for ε-optimality but may not be necessary. The smallest budgets that enable ε-optimal prediction occur when one uses minimax-regret (MMR) predictors rather than the tractable midpoint predictors studied in Sections 3.1 and 3.2. In the absence of knowledge of the MMR predictors, we can provide sufficient budget sizes but not necessary ones.

4. Panel Data versus Repeated Cross Sections

This section uses our analysis to guide sample design choice between panel data and repeated cross-sectional (RCS) data, with complete response in each cross-section. Continuing the two-period framework utilized above, RCS data are generated by a sampling process in which two independent random samples are drawn, with $(z_1 = 1, z_2 = 0)$ in one sample and $(z_1 = 0, z_2 = 1)$ in the other; thus, $P(z_1 = 1, z_2 = 1) = 0$. With repeated observations on individual sample members, panel data with full response point-identify the joint distribution $P(y_1, y_2)$, whereas RCS data point-identify only the period-specific marginal distributions $P(y_1)$ and $P(y_2)$.

Much attention has been paid to estimation of dynamic models that are point-identified by RCS data; see, for example, the review in Verbeek (2005). In early work on this topic, Deaton (1985) restricted attention to linear models that may include an additive fixed effect to be differenced out. Then the outcome of interest is the linear function $a + b y_1 + c y_2$ with $b = -c$. Moffitt (1993) extended the approach to some nonlinear models, focusing on binary choice models where the outcome of interest in each period is an indicator function. Moffitt emphasized that estimation of dynamic models with RCS data is made difficult by the general lack of information on lagged dependent and independent variables and the consequent unobservability of the intertemporal covariances needed to identify and estimate dynamic models (Moffitt, 1993, p. 99).

This line of research, which replaces individual observations with cohort means and uses additional assumptions to identify the models, is often motivated by cases in which panel data are not available. However, it has also been noted that panel data are often inferior to the available cross-sections in some respects (Moffitt, 1993, p. 100), such as smaller sample sizes in each time period, lower rates of response arising from attrition, and large and persistent errors of measurement (Deaton, 1985, p. 110).

Rather than try to identify conditions under which existing panel data should be preferred to existing RCS data or vice versa, here we consider how one should design longitudinal data collection before commencing it. The answer to this question depends crucially on how the data will be used.

4.1. Linear Functions

Under random sampling with no missing data, RCS data point-identify the expectations of linear functions of $y_1$ and $y_2$.
As above, let $f(y_1, y_2) = a + b y_1 + c y_2$ and let $m_t$ be the sample average of the $N_t$ observed values in period $t$. The RCS sample-average predictor is $a + b m_1 + c m_2$. Observe that $E(m_1) = E(y_1)$, $E(m_2) = E(y_2)$, $V(m_1) = V(y_1)/N_1$, $V(m_2) = V(y_2)/N_2$, and $\mathrm{Cov}(m_1, m_2) = 0$. Squared bias equals 0 and variance is maximized if $P(y_1, y_2)$ is bivariate Bernoulli with mean (½, ½). Maximum regret is

$$\frac{b^2}{4N_1} + \frac{c^2}{4N_2}. \tag{23}$$

We may compare (23) with the maximum regret of the panel-data midpoint predictor. To make the comparison precise, let $N_1$ be the period-1 sample size for both RCS and panel data collection. Let $N_2 = N_1$ be the period-2 sample size in each case as well. The $N_2$ period-2 observations form a new random sample in the RCS case and are the period-2 respondents in the panel case with no attrition. Let both designs have the same cost per observation. Thus, we compare designs that yield the same numbers of observations in each period and have the same cost, but differ in the composition of the period-2 observations.

As previously noted, the maximum regret of the panel data midpoint predictor is $(b + c)^2/(4N_1)$ when $P(z_2 = 1) = 1$. Comparison with (23), when $N_2 = N_1$, reveals that the maximum regret of the panel data predictor exceeds that of the RCS predictor by $bc/(2N_1)$. This difference is attributable to the potential covariation between the period-1 and period-2 sample averages in a panel.

This finding effectively turns a common argument in favor of a panel over RCS on its head. Unlike RCS, a panel yields information on the joint distribution of outcomes across periods. But this information has no value when the objective is to predict a linear function under square loss. Moreover, the possibility of covariation of outcomes across periods increases the maximum variance of the panel data predictor relative to the RCS predictor, which draws an independent random sample each period.

The comparison is not as straightforward when there is attrition, because we only have an analytical upper bound on the maximum regret of the panel data midpoint predictor. Nonresponse in period 2 may reduce the impact of covariation of outcomes across periods on the maximum variance of the panel data midpoint predictor relative to the RCS predictor, but nonresponse also increases the predictor's maximum squared bias. In contrast, the RCS predictor is unbiased under the maintained assumptions.

4.2. Indicator Functions

Suppose now that the objective is best prediction of the event that $(y_1, y_2)$ take specified values. The best predictor is $P(y_1 = j, y_2 = k)$, for values $(j, k) \in [0, 1] \times [0, 1]$. With panel data, we found in (8) the identification region to be the following interval of width $P(z_2 = 0, u = 0)$:

$$\big[\,P(y_1 = j, y_2 = k, z_2 = 1),\;\; P(y_1 = j, y_2 = k, z_2 = 1) + P(z_2 = 0, u = 0)\,\big].$$

With RCS data, the Fréchet bound on a joint probability using knowledge of the marginals gives the identification region as the interval

$$\big[\,\max\{0,\, P(y_1 = j) + P(y_2 = k) - 1\},\;\; \min\{P(y_1 = j),\, P(y_2 = k)\}\,\big]. \tag{24}$$

This interval has maximum width ½, which obtains when $P(y_1 = j) = ½$ and $P(y_2 = k) = ½$.

Suppose one were to know the marginal probabilities $P(y_1 = j)$ and $P(y_2 = k)$. Then the minimax-regret predictor would be the midpoint of (24). Without additional prior information, the maximum regret of this midpoint predictor equals its maximum squared bias of 1/16, which obtains when $P(y_1 = j) = ½$ and $P(y_2 = k) = ½$.

Suppose instead that one uses information from a finite sample to estimate the marginal probabilities. Maximum regret of a predictor using finite-sample information on the probabilities cannot be less than maximum regret of the minimax-regret predictor using knowledge of the probabilities. Therefore, maximum regret of a sample-analog RCS midpoint predictor must be no less than 1/16.
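The Fréchet interval (24) and the comparison with a panel can be sketched in Python. The function names are hypothetical, and the panel figure assumes that maximum regret takes the form ¼[P(z2 = 1)/N1 + P(z2 = 0)²], which the derivation in Section 3.2 supports only when P(z2 = 1) < 1 − 1/N1; the sketch raises an error outside that range rather than extrapolate.

```python
from typing import Tuple


def frechet_interval(p1: float, p2: float) -> Tuple[float, float]:
    """Bounds on P(y1 = j, y2 = k) given only the marginal probabilities
    p1 = P(y1 = j) and p2 = P(y2 = k)."""
    return max(0.0, p1 + p2 - 1.0), min(p1, p2)


def rcs_midpoint(p1: float, p2: float) -> float:
    """Midpoint of the Frechet interval, used as the RCS prediction."""
    lower, upper = frechet_interval(p1, p2)
    return 0.5 * (lower + upper)


def panel_max_regret(response_rate: float, n1: int) -> float:
    """Assumed maximum regret of the panel midpoint predictor."""
    if not response_rate < 1.0 - 1.0 / n1:
        raise ValueError("closed form assumed only for rate < 1 - 1/n1")
    return 0.25 * (response_rate / n1 + (1.0 - response_rate) ** 2)


# Worst case for RCS: both marginals equal 1/2, so the interval is [0, 1/2]
# and the midpoint can miss the truth by 1/4 in either direction.
lower, upper = frechet_interval(0.5, 0.5)
rcs_floor = (0.5 * (upper - lower)) ** 2      # maximum squared bias
print(lower, upper, rcs_midpoint(0.5, 0.5), rcs_floor)

# A panel of 100 with an 80 percent period-2 response rate.
print(panel_max_regret(0.8, 100) < rcs_floor)
```

The RCS floor does not shrink with sample size because it comes entirely from the width of the identification region, whereas the panel's variance term vanishes as N1 grows.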

Recall that maximum regret of the panel data midpoint predictor is 1 0. Thus, the lower bound on the maximum regret of a finite-sample RCS midpoint predictor exceeds the maximum regret of the panel data midpoint predictor when 1 0. When 1 > ½, there exists a threshold sample size such that this inequality holds for all samples larger than the threshold.

RCS data may be even less informative when predictors of other indicator functions discussed in Section 3 are of interest. Consider the event $[y_2 > y_1]$, whose best predictor is $P(y_2 > y_1)$. It is possible that the marginal distributions $P(y_1)$ and $P(y_2)$ identified by RCS data are compatible both with $P(y_2 > y_1)$ arbitrarily close to 0 and with $P(y_2 > y_1)$ arbitrarily close to 1, as would be the case when $(y_1, y_2)$ are continuously distributed on [0, 1] with $P(y_1) = P(y_2)$. For instance, if both marginals are uniform on [0, 1], then for small $\varepsilon > 0$ the coupling $y_2 = y_1 + \varepsilon \pmod 1$ yields $P(y_2 > y_1) = 1 - \varepsilon$, whereas $y_2 = y_1 - \varepsilon \pmod 1$ yields $P(y_2 > y_1) = \varepsilon$.

Suppose one were to know the distributions $P(y_1)$ and $P(y_2)$. Then the minimax-regret predictor would be the midpoint of the identification region. Without additional information, this identification region is the open unit interval. The minimax-regret prediction is therefore ½, and maximum regret of this midpoint predictor equals its maximum squared bias of ¼. Thus, the RCS data are potentially uninformative and a finite-sample RCS midpoint predictor must have maximum regret no less than ¼, whereas maximum regret of the panel data midpoint predictor is again the quantity recalled above.

Conclusion

This paper continues our effort to encourage increased use of statistical decision theory to inform the design of data collection when data quality is a decision variable. Building on our previous study, we demonstrate how the framework may be applied to more complex design problems.

A notable general finding is that, when collecting panel data with attrition, prediction of the value of a function of two variables is more subtle than prediction of a function of one variable. The reason is that observation of the outcome in the first period may constrain the value of the function when the outcome in the second period is not observed. The nature of the constraint, if any, depends on the form of the function being predicted. Juxtaposition of linear and indicator functions demonstrates this relationship well.

The form of the function being predicted is also important when considering choice between collection of panel data and RCS. In the absence of restrictions on the joint distribution of outcomes, RCS data are well suited for prediction of linear functions but not for prediction of nonlinear functions such as indicator functions. When predicting linear functions, choice between panel data and RCS may depend on the relative magnitudes of recruitment and retention costs, as well as the relationship between retention costs and the attrition rate.

The framework adopted in the study should be useful for addressing these matters and related questions. For instance, what is the optimal length of time between interview waves, when, all else equal, an increase in period length should increase retention costs and/or attrition? To what extent can a rotating panel be used to optimally combine the best elements of RCS and panel? Finally, how can retrospective questions in RCS be used in addition to or in lieu of a (rotating) panel, and how does this answer depend on the length of time between interviews, given the relationship between the length of this time span and retrospective reporting errors?

We should re-emphasize that our analysis assumes knowledge of response rates but uses no information on how the sample design affects the composition of respondents. Such information may affect minimax-regret choice among designs.
In particular, one may be able to lower maximum regret by combining complementary designs that tend to attract different segments of the population of interest.

Appendix

Derivation of $V(m_2)$

Recall that. Random sampling implies that. Observe that 1 1 and 1 1. Thus, = Hence,

Derivation of $C(m_1, m_2)$

Recall that,. Random sampling implies that, 1 1, 1,,,,. Observe that 1 1 and 1 1. Thus,, = It follows that,

Derivation of upper bound on the variance of the midpoint predictor

The variance of the predictor is 2 P

Outcomes are normalized to lie in the unit interval. Therefore, we can derive an upper bound on 1 1 1, as follows: = V V We can also derive an upper bound on 1 1. By the Law of Iterated Expectations, 1 1 = , , , 1 0 Inserting these upper bounds gives this upper bound on the variance of the midpoint predictor: 2 P

References

Dai, B., S. Ding, and G. Wahba (2013), "Multivariate Bernoulli Distribution," Bernoulli, 19.

Deaton, A. (1985), "Panel Data from Time Series of Cross Sections," Journal of Econometrics, 30.

Dominitz, J., and C. Manski (2017), "More Data or Better Data? A Statistical Decision Problem," Review of Economic Studies, 84.

Groves, R. (2006), "Nonresponse Rates and Nonresponse Bias in Household Surveys," Public Opinion Quarterly, 70.

Groves, R., and L. Lyberg (2010), "Total Survey Error: Past, Present, and Future," Public Opinion Quarterly, 74.

Manski, C., and A. Tetenov (2016), "Sufficient Trial Size to Inform Clinical Practice," Proceedings of the National Academy of Sciences, 113.

Moffitt, R. (1993), "Identification and Estimation of Dynamic Models with a Time Series of Repeated Cross-Sections," Journal of Econometrics, 59.

Verbeek, M. (2008), "Pseudo-Panels and Repeated Cross-Sections," in L. Matyas and P. Sevestre (eds.), The Econometrics of Panel Data, Berlin: Springer-Verlag.
