To Hold Out or Not. Frank Schorfheide and Ken Wolpin. April 4, 2011, University of Pennsylvania
1 To Hold Out or Not
Frank Schorfheide and Ken Wolpin
University of Pennsylvania
April 4, 2011
2 Introduction
Randomized controlled trials (RCTs) to evaluate policies, e.g., cash transfers for school attendance, have become a prominent methodology in applied economics.
Limitation: one cannot extrapolate outside of the treatment variation in the particular experiment. Given their cost, RCTs cannot be used to perform ex ante policy evaluation over a wide range of policy alternatives.
Extrapolation to new treatments requires developing models that embed behavioral and statistical assumptions. It is thus important to have methods for assessing the relative credibility of models.
3 Introduction
In practice researchers often hold out data from estimation to use for external validation, e.g., Wise (1985), Todd and Wolpin (2006), Duflo, Hanna and Ryan (2009).
Although it has intuitive appeal, the use of holdout samples is puzzling from a Bayesian perspective, which prescribes using the entire sample to form posteriors.
Our contributions:
1. Provide a formal, albeit stylized, framework in which data mining poses an impediment to the implementation of the ideal Bayesian analysis.
2. Provide a numerical illustration of the potential costs of data mining and the potential benefits of holdout samples designed to discourage data mining. We measure losses relative to the ideal Bayesian solution.
(Structural) data mining: the process by which a modeler tries to improve the fit of a structural model during estimation, e.g., by changing functional forms, allowing for unobserved heterogeneity, or adding latent state variables.
4 To Fix Ideas... A Working Example
Evaluate the impact of a monetary subsidy to low-income households based on school attendance of their children. There is no direct tuition cost of schooling.
Structural models M_i: the household solves (a = 1 means attend school)
    max_{a in {0,1}} U_i(c, a; x, ɛ, ϑ_i)   s.t.   c = y + w(1 - a).
Decision rule: a = ϕ_i(y, w; x, ɛ, ϑ_i).
An attendance subsidy s modifies the budget constraint:
    c = y + w(1 - a) + s·a = (y + s) + (w - s)(1 - a) = ỹ + w̃(1 - a),
where ỹ = y + s and w̃ = w - s.
The optimal attendance choice in the presence of the subsidy is a = ϕ_i(ỹ, w̃; x, ɛ, ϑ_i).
5 Example Continued
Social experiment: a randomly selected treatment sample has been offered a subsidy, s = s̄; a randomly selected control sample has not been offered the subsidy, s = 0.
The policy maker would like an estimate of how sensitive the outcome is to varying the subsidy level, but it is too costly to vary the subsidy in the experiment.
6 Example Continued
Change of notation: Y is the outcome; S the subsidy; X_i, i = 1, 2, are characteristics such as income and wage.
Assumptions: n observations, 50% control and 50% treatment sample. Let X = [X_1, X_2]. Then
    (1/n) X'X  ->p  Γ = [ σ1²  ρσ1σ2 ; ρσ1σ2  σ2² ].
The treatment is determined independently of the covariates: (1/n) X'S ->p 0.
7 Two Modelers
The policy maker engages two modelers in this endeavor: M_i, i = 1, 2.
Structural models embody restrictions that allow the extrapolation of policy effects even though no variation in the policy instrument has been observed ("extrapolation by theory").
Approximation/simplification of the attendance function ϕ_i(·): write the model as a linear regression,
    M_i:  Y = X_i β_i + Sθ + U,   U | (X, S) ~ N(0, I),   i = 1, 2.
(Structural) model restriction: θ = β_i. This cross-coefficient restriction rules out the need for a treatment sample for identification.
Prior: θ ~ N(0, 1/(nλ²)).
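As a concrete illustration of the linear-regression approximation, here is a minimal simulation sketch (variable names and parameter values are ours, not the paper's): under the restriction θ = β_i, the model collapses to a regression of Y on X_i + S, so the control sample alone identifies θ.

```python
import numpy as np

# Minimal sketch of M_i under the restriction theta = beta_i
# (illustrative parameter values; not taken from the paper).
rng = np.random.default_rng(0)
n, lam = 1000, 1.0

theta = rng.normal(0.0, 1.0 / np.sqrt(n * lam**2))  # prior draw: theta ~ N(0, 1/(n lam^2))
x = rng.normal(0.0, np.sqrt(2.0), size=n)           # one relevant characteristic, sigma^2 = 2
s = np.repeat([0.0, 2.0], n // 2)                   # 50% control (s = 0), 50% treatment (s = 2)
u = rng.normal(size=n)
y = x * theta + s * theta + u                       # Y = X theta + S theta + U

# Because theta = beta_i, regressing Y on Z = X + S identifies theta,
# even though Z = X on the control sample where S = 0.
z = x + s
theta_ols = (z @ y) / (z @ z)
```

This is the sense in which the cross-coefficient restriction removes the need for treatment variation: the coefficient on X in the control sample already pins down θ.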
8 Policy Maker
Goal: predicting the effect of a counterfactual subsidy level s* different from s̄.
Assumption: no counterfactual policy predictions are possible with a reduced-form model.
The policy maker can estimate a simple reduced-form model:
    M_pol:  Y = Sθ + V,   with estimator θ̂(Y, S).
M_pol provides a consistent estimate of the treatment effect, but it is unable to answer the question of interest.
For model selection/averaging the particular counterfactual policy s* is irrelevant: the policy maker either weights models based on fit or on their ability to predict the effect of the actual subsidy level s̄.
9 Ideal Case: Full Bayesian Analysis
The policy maker assigns prior probabilities π_{i,0} to M_i. From the policy maker's perspective, the overall posterior distribution of the treatment effect is given by the mixture
    p(θ | Y, X, S) = Σ_{i=1,2} π_{i,n} p(θ | Y, X, S, M_i).
Model weights are based on the marginal likelihood
    p(Y | X, S, M_i) = ∫_Θ p(Y | θ, X, S, M_i) p(θ | M_i) dθ.
Treatment-effect estimates conditional on the full sample:
    p(θ | Y, X, S, M_i) = p(Y | X, S, θ, M_i) p(θ | M_i) / p(Y | X, S, M_i).
Posterior odds:
    π_{1,n} / π_{2,n} = (π_{1,0} / π_{2,0}) · p(Y | X, S, M_1) / p(Y | X, S, M_2).
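In the Gaussian setup the marginal likelihood is available in closed form. A sketch of the computation (our own code; the function exploits the matrix determinant lemma and the Sherman-Morrison identity for the scalar-coefficient case, and the data-generating values are illustrative only):

```python
import numpy as np

def log_marginal(y, z, v):
    """Log marginal likelihood of Y = z*theta + U, U ~ N(0, I),
    with prior theta ~ N(0, v), i.e. Y ~ N(0, I + v z z')."""
    n = y.size
    zz, zy = z @ z, z @ y
    # |I + v z z'| = 1 + v z'z ;  (I + v z z')^{-1} = I - v z z' / (1 + v z'z)
    logdet = np.log(1.0 + v * zz)
    quad = y @ y - v * zy**2 / (1.0 + v * zz)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Posterior model probabilities from equal prior odds (pi_{1,0} = pi_{2,0}):
rng = np.random.default_rng(1)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
s = np.repeat([0.0, 2.0], n // 2)
y = (x1 + s) * 0.2 + rng.normal(size=n)      # data generated from M_1

lm1 = log_marginal(y, x1 + s, v=1.0 / n)     # M_1: regressor X_1 + S, prior var 1/(n lam^2)
lm2 = log_marginal(y, x2 + s, v=1.0 / n)     # M_2: regressor X_2 + S
pi1 = 1.0 / (1.0 + np.exp(lm2 - lm1))        # posterior probability of M_1
```

The posterior odds are then exp(lm1 - lm2) times the prior odds, exactly as in the last display above.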
10 Remark: Full Bayesian Analysis
The assumption that θ = β_i ~ N(0, 1/(nλ²)) implies that the models remain asymptotically difficult to distinguish: the log posterior odds of M_1 and M_2 do not diverge as n -> ∞.
In reality policy makers are confronted with multiple models that are potentially consistent with the observed data.
11 Impediments to Full Bayesian Analysis
The policy maker is concerned that the modelers engage in data mining and do not report the marginal data densities p(Y | X, S, M_i) associated with their models truthfully.
The policy maker has the option of providing the modelers with only a subset of the outcome data: partition Y = [Y_r, Y_p], where r stands for regression and p stands for prediction (holdout sample). We assume the researchers have access to the full data vectors X_1, X_2, S.
The policy maker can then request:
- a predictive density for the holdout sample, p(Y_p | Y_r, X, S, M_i);
- a predictive distribution for the policy maker's M_pol estimate of the treatment effect, p(θ̂([Y_r, Y_p], S) | Y_r, X, S, M_i).
Next step: characterize the behavior of the modelers if they have access to the full sample Y (Case 1) or only to the sub-sample Y_r (Case 2).
12 Case 1: Modeler Has Access to Full Sample Y
Our stylized representation of data mining = data-based modification of the prior distribution:
1. break the link between β_i and θ by introducing an additional parameter ψ such that θ = β_i + ψ;
2. center the prior at the maximum likelihood estimate.
Step 1: Write the model as
    Y = X_i(θ - ψ) + Sθ + U = X̃_i θ - X_i ψ + U,   where X̃_i = X_i + S.
Let M_X̃i = I - X̃_i(X̃_i'X̃_i)⁻¹X̃_i'. The least-squares estimate is
    ψ̂ = -(X_i' M_X̃i X_i)⁻¹ X_i' M_X̃i Y.
The data-miner subsequently imposes the relationship θ = β_i + ψ̂.
13 Case 1: Modeler Has Access to Full Sample Y
Step 2: Modified model
    M̃_i:  Ỹ_i = X̃_i θ + U,   with Ỹ_i = Y + X_i ψ̂.
Maximum likelihood estimator:
    θ̃_i = (X̃_i'X̃_i)⁻¹ X̃_i'Ỹ_i.
Data-mining prior:
    θ | M̃_i ~ N( θ̃_i, (κ X̃_i'X̃_i)⁻¹ ).
14 Case 1: Modeler Has Access to Full Sample Y
The modeler is able to raise the marginal likelihood from
    p(Y | X, S, M_i) = (2π)^{-n/2} λ |X̃_i'X̃_i/n + λ²|^{-1/2} exp{ -½ Y'(I - X̃_i(X̃_i'X̃_i + nλ²)⁻¹X̃_i')Y }
to
    p̃(Y | X, S, M̃_i) = (2π)^{-n/2} (κ/(κ+1))^{1/2} exp{ -½ Ỹ_i'(I - X̃_i(X̃_i'X̃_i)⁻¹X̃_i')Ỹ_i }.
The penalty term is eliminated. The in-sample-fit term Ỹ_i'(I - X̃_i(X̃_i'X̃_i)⁻¹X̃_i')Ỹ_i corresponds to the unrestricted regression Y = X_i β_i + Sθ + U.
The policy maker ends up computing distorted model posteriors based on p̃(Y | X, S, M̃_i).
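The claim that the data-mined in-sample-fit term reproduces the unrestricted regression can be checked numerically. In this sketch (our code, scalar regressors, illustrative values), ψ is estimated by least squares in the reparameterized model Y = X̃θ - Xψ + U, and the residual sum of squares of Ỹ on X̃ coincides with that of the unrestricted regression of Y on [X, S]:

```python
import numpy as np

# Numerical check (our sketch, not the paper's code) that the modified
# model Y~ = X~ theta + U attains the unrestricted in-sample fit.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=(n, 1))
s = np.vstack([np.zeros((n // 2, 1)), np.full((n // 2, 1), 2.0)])
y = 0.5 * (x + s) + rng.normal(size=(n, 1))   # generated under theta = beta = 0.5

xt = x + s                                    # X~ = X + S
# OLS in Y = X~ theta - X psi + U, i.e. regressors [X~, -X]
coef, *_ = np.linalg.lstsq(np.hstack([xt, -x]), y, rcond=None)
psi_hat = coef[1, 0]

yt = y + x * psi_hat                          # Y~ = Y + X psi-hat
theta_tilde = (xt.T @ yt).item() / (xt.T @ xt).item()
ssr_mined = float(((yt - xt * theta_tilde) ** 2).sum())

# Unrestricted regression Y = X beta + S theta + U
b, *_ = np.linalg.lstsq(np.hstack([x, s]), y, rcond=None)
ssr_unres = float(((y - np.hstack([x, s]) @ b) ** 2).sum())
```

Since span([X̃, X]) = span([X, S]), the two residual sums of squares agree exactly, which is why the mined marginal likelihood rewards unrestricted fit while the prior-centering step removes the complexity penalty.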
15 Case 2: Modeler Only Has Access to Subsample Y_r
The modeler is asked to report a predictive density for Y_p. The modeler contemplates reporting p̃(Y_p | Y_r, X, S, M̃_i) instead of p(Y_p | Y_r, X, S, M_i).
By Jensen's inequality, the expected log ratio of the predictive likelihoods is
    ∫ ln[ p̃(Y_p | Y_r, X, S, M̃_i) / p(Y_p | Y_r, X, S, M_i) ] p(Y_p | Y_r, X, S, M_i) dY_p ≤ 0.
Deduce: the use of predictive densities for a holdout sample makes it optimal for the modeler to reveal p(Y_p | Y_r, X, S, M_i).
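The inequality is the non-negativity of the Kullback-Leibler divergence: the expected log score of any misreported density falls short of that of the true predictive density. A quick Monte Carlo sketch with two hypothetical Gaussian predictive densities (our illustration, not the paper's):

```python
import numpy as np

# If Y_p ~ p = N(0, 1) but the modeler reports q = N(0.5, 1), then
# E_p[ log q(Y_p) - log p(Y_p) ] = -KL(p || q) = -0.5 * 0.5^2 = -0.125 < 0.
rng = np.random.default_rng(4)
yp = rng.normal(size=100_000)                 # draws from the true predictive density

def log_norm_pdf(y, mu):
    # log density of N(mu, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2

gap = np.mean(log_norm_pdf(yp, 0.5) - log_norm_pdf(yp, 0.0))
```

A modeler scored by the predictive density of the holdout sample therefore expects to lose by misreporting, which is the source of the truth-telling incentive on this slide.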
16 Case 2: Modeler Only Has Access to Subsample Y_r
However, we allow the modeler to consider a reference model M_i0 that takes the form (similar to above)
    M_i0:  Y = X_i β_i + Sθ + U,   β_i ~ N(0, 1/(nλ²)),   θ ~ N(0, 1/(nλ²)).
The modeler computes posterior probabilities for M_i and M_i0. Predictive distribution for the holdout sample:
    p_i(Y_p | Y_r, X, S) = π_{i0,r} p(Y_p | Y_r, X, S, M_i0) + π_{i,r} p(Y_p | Y_r, X, S, M_i).
Behavioral implication (approximately): if the modeler finds M_i rejected against M_i0 (π_{i0,r} ≈ 1), he reports p(Y_p | Y_r, X, S, M_i0): data mining on the predictive density. Otherwise, he reports p(Y_p | Y_r, X, S, M_i).
17 So far: From the Policy Maker's Perspective...
If the modelers are provided with the entire sample Y, they data-mine and report results from model M̃_i.
If the modelers are provided with a subsample Y_r, they can potentially assess their restriction θ = β_i and either report results from their actual model M_i or from the reference model M_i0, depending on the relative fit.
If Y_r contains no information from the treatment sample, then the modelers have no evidence against θ = β_i and always reveal M_i.
In the case of a holdout sample, the policy maker could use predictive distributions for either Y_p or θ̂(Y_p, ·) to weight the competing models.
18 "When you come to a fork in the road, take it." (Yogi Berra)
Two choices face the policy maker:
- model weights based on the predictive density of Y_p given Y_r, or on the predictive density of θ̂ given Y_r;
- post-model-averaging estimation based only on the Y_r sample (clearly dominated), or based on the full Y sample.
As r -> 0, weights based on the Y_p predictive density implement Bayesian model weights, since lim_{r->0} p(Y_{1-r} | Y_r, X, S, M_i) = p(Y | X, S, M_i); combined with full-sample estimation this implements the full-sample Bayesian solution (see illustration).
But: model building without data? Reporting high-dimensional predictive densities for Y_p?
Current practice in the treatment-effect literature comes closest to choosing model weights based on the θ̂-predictive density.
19 Numerical Illustration
First, we present results conditional on M_i and/or (θ, β_i). Second, we present results under the marginal distribution of the data
    p(Y, X, S) = ½ p(Y, X, S | M_1) + ½ p(Y, X, S | M_2),
where
    p(Y, X, S | M_i) = p(X) p(S) ∫ p(Y | θ, β, X, S, M_i) p(β, θ | M_i) d(β, θ).
τ is the fraction of observations from the treatment group in the regression sample Y_r. We consider:
- τ = τ_min, where τ_min = 0 for r ≤ 0.5 and then converges to 0.5 as r -> 1;
- τ = 0.5.
Rather than conducting model averaging, we consider degenerate model weights that are either 0 or 1 (model selection).
20 Parameterization
Observable characteristics X: σ1² = σ2² = σ² = 2, ρ = 0.2. Treatment: s̄ = 2. Sample size: n = 1,000 (we have a well-defined limit distribution).
Policy maker: prior probability 0.5 for M_1 and M_2. Modelers: prior probability 0.52 for M_i and 0.48 for M_i0; λ_1 = λ_2 = 1.
Implication of the experimental design: probability that the highest posterior probability model is the true model:
- Integrated: 0.68
- Conditional on θ equal to 5 prior standard deviations: 1.00
- Conditional on θ equal to 0.2 prior standard deviations: 0.51
21 Policy Experiment and Loss Function
Raise the subsidy from s̄ = 2 to s* = 4. Predict the outcome for an individual whose relevant characteristic equals σ and whose irrelevant characteristic equals ρσ.
The loss function is quadratic: L(y, ŷ) = (y - ŷ)².
The optimal predictor is the posterior mean; we consider the posterior mean conditional on the highest posterior probability model:
    ŷ_bayes = β̂ x_i + θ̂ S = θ̂_bayes (σ + s*).
We report the expected value of (ŷ - ŷ_bayes)² under the marginal density of Y (integrated risk differential).
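For concreteness, the counterfactual predictor and the loss it incurs relative to the ideal Bayes predictor can be written out directly. The posterior means below are hypothetical numbers chosen only to illustrate the computation; they are not taken from the paper:

```python
import numpy as np

# y_hat = theta_hat * (sigma + s*): prediction for an individual whose
# relevant characteristic equals sigma, under counterfactual subsidy s* = 4.
sigma, s_star = np.sqrt(2.0), 4.0     # sigma^2 = 2 from the parameterization
theta_bayes = 0.50                    # hypothetical posterior mean, ideal Bayesian analysis
theta_alt = 0.45                      # hypothetical posterior mean from a distorted posterior

y_bayes = theta_bayes * (sigma + s_star)
y_alt = theta_alt * (sigma + s_star)
risk_diff = (y_alt - y_bayes) ** 2    # quadratic loss relative to the Bayes predictor
```

Averaging `risk_diff` over the marginal density of Y gives the integrated risk differential that the slides report.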
22 Policy Experiment and Loss Function
Suppose M_1 is the highest posterior probability model. The following outcomes are possible.
Full-sample data mining (if the modelers have access to the full sample Y): the modelers introduce ψ; the estimates of β and θ deviate from θ̂_bayes.
Data mining on the predictive density:
1. Modeler 1 is honest and M_1 is selected: ŷ = ŷ_bayes.
2. Modeler 1 is not honest and the policy maker ends up selecting M_1,0: misses the restriction θ = β_i.
3. Modeler 2 is honest and M_2 is selected: uses the wrong x.
4. Modeler 2 is not honest and the policy maker ends up selecting M_2,0: uses the wrong x and misses the restriction.
[Table for r = 0.5, τ = 0.5: probability of each case and conditional E[(ŷ_j - ŷ_bayes)²] for Cases 1-4; numerical entries not recoverable.]
23 Composition of Estimation Sample Y_r, n = 1,000
[Table: numbers of control and treatment observations in Y_r under τ = τ_min and τ = 0.5 for several values of r; numerical entries not recoverable.]
24 Probability that Modeler 1 is Honest, Cond. on M_1 [figure: M1 solid; τ = τ_min blue, τ = 0.5 red]
25 Probability that Modeler 2 is Honest, Cond. on M_1 [figure: M2 dashed; τ = τ_min blue, τ = 0.5 red]
26 Probability that Modelers are Honest, Cond. on M_1 [figure: M1 solid, M2 dashed; τ = τ_min blue]
27 Probability that Modelers are Honest, Cond. on M_1 [figure: M1 solid, M2 dashed; τ = 0.5 red]
28 Probability that Modelers are Honest, Cond. on M_1 [figure: M1 solid, M2 dashed; τ = τ_min blue, τ = 0.5 red]
29 Probability that Modelers are Honest
For r ≤ 0.5, τ_min = 0: the modelers have no information that allows them to test the restriction of their model. In turn, they are honest with probability 1.
For τ = 0.5 and small θ: even for small values of r the modelers find their restrictions rejected with some probability.
For large values of θ, modeler M_1 does not find his restriction rejected, whereas modeler M_2 does with probability 0.6 for r = ….
For small values of θ, both modelers find their restrictions rejected with approximately equal probability.
30 Prob. PM Finds Best Model, Cond. on M_1 and θ [figure: θ̂-density-based selection; τ = τ_min blue, τ = 0.5 red]
31 Prob. PM Finds Best Model, Cond. on M_1 and θ
The figure confounds the probability that the modelers are honest and the probability that the predictive-density-based selection finds the highest posterior probability model.
Large value of θ: τ = τ_min dominates τ = 0.5. Inverted U-shape: around r = 0.5 the policy maker finds the highest probability model almost with certainty. Conjecture: small r suffers from an imprecise estimate of θ; large r from a short evaluation sample Y_p.
Small value of θ: the policy maker finds the highest posterior probability model with at most probability 1/2. For τ = 0.5 and r < 0.5 there is a visible effect of predictive data mining, i.e., the use of M_i0 instead of M_i.
32 Risk, Cond. on M_1 and θ [figure: θ̂-density-based selection; τ = τ_min blue, τ = 0.5 red; data mining on full sample green]
33 Risk, Cond. on M_1 and θ
The results mirror the probability of the PM finding the highest posterior probability model.
For large values of θ the policy maker can, with r = 0.5 and τ_min = 0, obtain a risk differential that is essentially zero.
The risk associated with full-sample data mining is large for both small and large values of θ.
34 Integrated Probability that Modelers are Honest [figure: M1 solid, M2 dashed; τ = τ_min blue, τ = 0.5 red]
35 Integrated Probability that Modelers are Honest
Blue vs. red lines: if r ≤ 0.5, then τ_min = 0, so the modelers have no information that allows them to test the restriction of their model; in turn, they are honest with probability 1. If τ = 0.5, then even for small values of r the modelers find their restrictions rejected with some probability. For large values of r the difference between τ = 0.5 and τ = τ_min vanishes as τ_min -> 0.5.
Solid vs. dashed lines: conditional on M_1, the probability that M_2 finds his model rejected is higher than that of M_1, and vice versa.
36 Integrated Probability that PM Finds Best Model, and Risk [figure: θ̂-density-based selection; τ = τ_min blue, τ = 0.5 red; data mining on full sample green]
37 Relationship to Existing Literature
Stone (1976): cross-validation; model validation on pseudo-holdout samples can generate a measure of fit that penalizes model complexity.
Leamer (1984): effect of specification searches on inference in a non-experimental setting.
Data snooping: Lo and MacKinlay (1990) correct tests of asset pricing theories based on data-snooped portfolios; White (2000) corrects standard errors in tests of no predictive superiority for specification searches.
Discussion: in our framework the researcher has no access to Y_p before the policy maker weights the models. Cross-validation does not rule out our kind of data mining: in the context of structural modeling it is not feasible to mimic the data mining / specification search on samples that could have been observed.
38 Extensions
Model misspecification: include a third model, such that the policy maker entertains the possibility that neither M_1 nor M_2 is correct.
Specification search versus data mining: the modelers could discover that their restrictions hold conditional on additional regressors.
Non-random holdout samples.
39 Conclusion
We develop a framework that allows us to characterize the potential costs of data mining and the potential benefits of holdout samples designed to discourage data mining.
In our numerical illustration we find that model weighting based on a predictive density for the subsidy-effect estimate, which the policy maker can generate on the full sample, is preferable to selection based on full-sample marginal likelihoods that are contaminated by data mining.
In our setup the best results are obtained if the holdout sample consists purely of observations from the control group.
40 Literature: Examples of Random Holdout Samples
Wise (1985): housing rent subsidy experiment.
Todd and Wolpin (2006): student attendance subsidy experiment.
Duflo, Hanna and Ryan (2009): teacher attendance subsidy experiment.
41 Literature: Examples of Non-random Holdout Samples
Lumsdaine, Stock, and Wise (1992): effect of introducing a pension window on retirement. Estimation sample: pre-window period; holdout sample: post-window period.
Kaboski and Townsend (2007): effect of the Thai Million Baht Program, a transfer to 80,000 villages to start village banks, on village investment. Estimation sample: pre-program period; holdout sample: post-program period.
Keane and Wolpin (2007): effect of welfare on female schooling, labor supply, marriage, fertility, and take-up. Estimation sample: individuals in five states (California, Michigan, New York, North Carolina, Ohio); holdout sample: Texas (a very low welfare state).
NBER Working Paper Series: "To Hold Out or Not to Hold Out," Frank Schorfheide and Kenneth I. Wolpin, Working Paper 19565, http://www.nber.org/papers/w19565, National Bureau of Economic Research, 1050 Massachusetts Avenue, Cambridge, MA.
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationEcon 2148, spring 2019 Statistical decision theory
Econ 2148, spring 2019 Statistical decision theory Maximilian Kasy Department of Economics, Harvard University 1 / 53 Takeaways for this part of class 1. A general framework to think about what makes a
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationVariational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data
for Parametric and Non-Parametric Regression with Missing Predictor Data August 23, 2010 Introduction Bayesian inference For parametric regression: long history (e.g. Box and Tiao, 1973; Gelman, Carlin,
More informationLecture 11/12. Roy Model, MTE, Structural Estimation
Lecture 11/12. Roy Model, MTE, Structural Estimation Economics 2123 George Washington University Instructor: Prof. Ben Williams Roy model The Roy model is a model of comparative advantage: Potential earnings
More informationExtended Bayesian Information Criteria for Model Selection with Large Model Spaces
Extended Bayesian Information Criteria for Model Selection with Large Model Spaces Jiahua Chen, University of British Columbia Zehua Chen, National University of Singapore (Biometrika, 2008) 1 / 18 Variable
More informationCEPA Working Paper No
CEPA Working Paper No. 15-06 Identification based on Difference-in-Differences Approaches with Multiple Treatments AUTHORS Hans Fricke Stanford University ABSTRACT This paper discusses identification based
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationWarwick Business School Forecasting System. Summary. Ana Galvao, Anthony Garratt and James Mitchell November, 2014
Warwick Business School Forecasting System Summary Ana Galvao, Anthony Garratt and James Mitchell November, 21 The main objective of the Warwick Business School Forecasting System is to provide competitive
More informationWeak Identification in Maximum Likelihood: A Question of Information
Weak Identification in Maximum Likelihood: A Question of Information By Isaiah Andrews and Anna Mikusheva Weak identification commonly refers to the failure of classical asymptotics to provide a good approximation
More informationExtending causal inferences from a randomized trial to a target population
Extending causal inferences from a randomized trial to a target population Issa Dahabreh Center for Evidence Synthesis in Health, Brown University issa dahabreh@brown.edu January 16, 2019 Issa Dahabreh
More informationStatistical Machine Learning Lectures 4: Variational Bayes
1 / 29 Statistical Machine Learning Lectures 4: Variational Bayes Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 29 Synonyms Variational Bayes Variational Inference Variational Bayesian Inference
More informationTime Series and Dynamic Models
Time Series and Dynamic Models Section 1 Intro to Bayesian Inference Carlos M. Carvalho The University of Texas at Austin 1 Outline 1 1. Foundations of Bayesian Statistics 2. Bayesian Estimation 3. The
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationBayesian methods in economics and finance
1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent
More informationLecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from
Topics in Data Analysis Steven N. Durlauf University of Wisconsin Lecture Notes : Decisions and Data In these notes, I describe some basic ideas in decision theory. theory is constructed from The Data:
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationPrinciples Underlying Evaluation Estimators
The Principles Underlying Evaluation Estimators James J. University of Chicago Econ 350, Winter 2019 The Basic Principles Underlying the Identification of the Main Econometric Evaluation Estimators Two
More informationEconometrics I. Professor William Greene Stern School of Business Department of Economics 1-1/40. Part 1: Introduction
Econometrics I Professor William Greene Stern School of Business Department of Economics 1-1/40 http://people.stern.nyu.edu/wgreene/econometrics/econometrics.htm 1-2/40 Overview: This is an intermediate
More informationLinear Models and Estimation by Least Squares
Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:
More informationChapter 8. Quantile Regression and Quantile Treatment Effects
Chapter 8. Quantile Regression and Quantile Treatment Effects By Joan Llull Quantitative & Statistical Methods II Barcelona GSE. Winter 2018 I. Introduction A. Motivation As in most of the economics literature,
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More informationBayes Correlated Equilibrium and Comparing Information Structures
Bayes Correlated Equilibrium and Comparing Information Structures Dirk Bergemann and Stephen Morris Spring 2013: 521 B Introduction game theoretic predictions are very sensitive to "information structure"
More informationHypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006
Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationHeterogeneity and Lotteries in Monetary Search Models
SÉBASTIEN LOTZ ANDREI SHEVCHENKO CHRISTOPHER WALLER Heterogeneity and Lotteries in Monetary Search Models We introduce ex ante heterogeneity into the Berentsen, Molico, and Wright monetary search model
More informationUsing data to inform policy
Using data to inform policy Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) data and policy 1 / 41 Introduction The roles of econometrics Forecasting: What will be?
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationChapter 7: Model Assessment and Selection
Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has
More informationBayesian Inference for DSGE Models. Lawrence J. Christiano
Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationModelling Czech and Slovak labour markets: A DSGE model with labour frictions
Modelling Czech and Slovak labour markets: A DSGE model with labour frictions Daniel Němec Faculty of Economics and Administrations Masaryk University Brno, Czech Republic nemecd@econ.muni.cz ESF MU (Brno)
More informationOnline Appendix for Investment Hangover and the Great Recession
ONLINE APPENDIX INVESTMENT HANGOVER A1 Online Appendix for Investment Hangover and the Great Recession By MATTHEW ROGNLIE, ANDREI SHLEIFER, AND ALP SIMSEK APPENDIX A: CALIBRATION This appendix describes
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2015 Julien Berestycki (University of Oxford) SB2a MT 2015 1 / 16 Lecture 16 : Bayesian analysis
More informationMeasurement error as missing data: the case of epidemiologic assays. Roderick J. Little
Measurement error as missing data: the case of epidemiologic assays Roderick J. Little Outline Discuss two related calibration topics where classical methods are deficient (A) Limit of quantification methods
More informationST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks
(9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate
More informationThe Normal Linear Regression Model with Natural Conjugate Prior. March 7, 2016
The Normal Linear Regression Model with Natural Conjugate Prior March 7, 2016 The Normal Linear Regression Model with Natural Conjugate Prior The plan Estimate simple regression model using Bayesian methods
More informationThe linear model is the most fundamental of all serious statistical models encompassing:
Linear Regression Models: A Bayesian perspective Ingredients of a linear model include an n 1 response vector y = (y 1,..., y n ) T and an n p design matrix (e.g. including regressors) X = [x 1,..., x
More informationBayesian Model Diagnostics and Checking
Earvin Balderama Quantitative Ecology Lab Department of Forestry and Environmental Resources North Carolina State University April 12, 2013 1 / 34 Introduction MCMCMC 2 / 34 Introduction MCMCMC Steps in
More information