Using data to inform policy


1 Using data to inform policy Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) data and policy 1 / 41

2 Introduction: The roles of econometrics
- Forecasting: What will be? (needs no notion of causality)
- Testing theories: Is statement A correct? (usually involves some notion of causality or structural invariance)
- Informing policy: What should we do? (requires a notion of causality / counterfactuals, and a normative evaluation of counterfactuals)

3 Introduction: Informing policy
How to use data when choosing tax rates, insurance rates, class size, class composition, dosage of a drug, vaccination rates, ...
Proposed framework:
- data → update expectations about social welfare
- optimal policy: maximize expected social welfare
Yields:
- a tractable mapping: observed data → optimal policy
- experimental designs which maximize ex-ante expected welfare

4 Introduction: The literatures this talk is based on
1. Statistical decision theory: evaluates estimators, tests, experimental designs based on expected loss
2. Optimal policy theory: evaluates policy choices based on social welfare
3. This paper: policy choice as a statistical decision, where statistical loss corresponds to social welfare
Objectives:
1. anchoring econometrics in economic policy problems
2. anchoring policy choices in a principled use of data

5 Introduction: Roadmap
1. Brief review
   - Statistical decision theory
   - Optimal policy theory
2. General setup
3. Theoretical results
   - Frequentist asymptotics
   - Optimal experimental design
4. Application to the RAND health insurance experiment data

6 Introduction: Decision theory as foundation of econometrics
- Observed data $X$
- A statistical decision $a$
- A state of the world $\theta$
- A loss function $L(a, \theta)$ (the negative of utility)
- A statistical model $f(X \mid \theta)$
- A decision function $a = \delta(X)$
- A risk function $R(\delta, \theta) = E_\theta[L(\delta(X), \theta)]$
- Bayes risk $R(\delta, \pi) = \int R(\delta, \theta)\, d\pi(\theta)$
Most common applications:
1. Estimation with squared loss: $L(a, \theta) = (a - \mu(\theta))^2$
2. Testing of $H_0: \theta \in \Theta_0$:
$$L(a, \theta) = \begin{cases} 1 & \text{if } a = 1,\ \theta \in \Theta_0 \\ c & \text{if } a = 0,\ \theta \notin \Theta_0 \\ 0 & \text{else.} \end{cases}$$
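As an illustration of the risk function and Bayes risk defined on this slide, here is a minimal sketch for a textbook case that is not part of the talk: a normal location model $X \sim N(\theta, 1)$, linear decision functions $\delta_c(X) = cX$, squared loss, and a prior $\theta \sim N(0, \tau^2)$. The model, the prior, and all names in the code are assumptions chosen only because the risk has a closed form (variance plus squared bias).

```python
import numpy as np

def risk(c, theta):
    """R(delta_c, theta) for delta_c(X) = c * X with X ~ N(theta, 1), squared loss.

    Variance of c * X is c^2; bias is (c - 1) * theta.
    """
    return c**2 + (1.0 - c) ** 2 * theta**2

def bayes_risk(c, tau2):
    """Risk averaged over the (assumed) prior theta ~ N(0, tau2), using E[theta^2] = tau2."""
    return c**2 + (1.0 - c) ** 2 * tau2

# Hypothetical prior variance and a grid of candidate shrinkage factors.
tau2 = 4.0
c_grid = np.linspace(0.0, 1.0, 1001)
c_bayes = c_grid[np.argmin(bayes_risk(c_grid, tau2))]

print("Bayes-optimal shrinkage factor on the grid:", c_bayes)
print("closed form tau2 / (1 + tau2):", tau2 / (1.0 + tau2))
```

The unshrunk estimator ($c = 1$) has constant risk 1, while the Bayes rule trades a little bias for lower variance; which rule is preferable depends on how $\theta$ is weighted, which is the role of $\pi$ above.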

7 Optimal policy problems: Optimal insurance and optimal taxation
Optimal (unemployment) insurance (Baily, 1978; Chetty, 2006):
- $Y \in \{0, 1\}$: state of an individual, $Y = 1$ the bad state
- $t = x$: size of transfer to individuals in the bad state
- $Y^x = g(x, \epsilon)$: counterfactual state of an individual
- $m(x) = E[g(x, \epsilon)]$: share of the population in the bad state, given $x$
- $\lambda$: marginal value of income to individuals in the bad state, relative to the marginal value of government funds (normative choice!)
Similar: optimal top tax rate (Saez, 2001):
- $Y$: taxable income
- $t$: top tax rate
- $m$: tax base

8 Optimal policy problems
Effect of a marginal change $dt$ of $t$ on social welfare:
1. $\lambda m\, dt$: mechanical effect on private welfare
2. $-m\, dt$: mechanical effect on government funds
3. $-t\, m'(t)\, dt$: behavioral effect on government revenues
4. $0$: behavioral effect on private welfare (envelope condition)
Summing up gives
$$u'(t) = (\lambda - 1) \cdot m(t) - t \cdot m'(t) = \lambda m(t) - \frac{\partial}{\partial t}\bigl(t \cdot m(t)\bigr). \tag{1}$$
Integrating:
$$u(t) = \lambda \int_0^t m(x)\, dx - t \cdot m(t). \tag{2}$$
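Equation (2) can be evaluated numerically once a function $m$ is given. The sketch below does this on a grid with a trapezoid rule. The linear form of $m$ and the value $\lambda = 1.5$ are purely illustrative assumptions (the slides do not specify $m$), and the function names are hypothetical.

```python
import numpy as np

def welfare(t_grid, m, lam):
    """u(t) = lam * int_0^t m(x) dx - t * m(t), evaluated on t_grid.

    t_grid must start at 0; the integral uses the cumulative trapezoid rule.
    """
    m_vals = m(t_grid)
    increments = 0.5 * (m_vals[1:] + m_vals[:-1]) * np.diff(t_grid)
    integral = np.concatenate([[0.0], np.cumsum(increments)])
    return lam * integral - t_grid * m_vals

def m_hypothetical(x):
    """Assumed share in the bad state, increasing in the transfer x (illustration only)."""
    return 0.1 + 0.2 * x

t_grid = np.linspace(0.0, 1.0, 1001)
u = welfare(t_grid, m_hypothetical, lam=1.5)
t_star = t_grid[np.argmax(u)]
print("welfare-maximizing transfer on the grid:", t_star)
```

For this particular $m$, equation (1) gives $u'(t) = 0.05 - 0.1\,t$, so the grid maximizer should sit at $t = 0.5$, which is a quick way to check the implementation against the first order condition.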

9 Optimal policy problems: Optimal choice of inputs to a production process
- $x \in \mathcal{X} \subset \mathbb{R}^{d_x}$: inputs
- $Y = g(x, \epsilon)$: outcome
- e.g.: $Y$ student long-run outcomes, $X$ class size, teacher training, ...
- $m(x) = E[g(x, \epsilon)]$: average production function
- $\lambda$: willingness to pay for $Y$ (normative choice!)
- objective: max $\lambda \cdot E[Y]$, net of costs $p \cdot t$
Thus:
$$u(t) = \lambda \cdot m(t) - p \cdot t. \tag{3}$$

10 General framework: The setup
1. policy maker: expected social welfare maximizer
2. $u(t)$: social welfare for policy choice $t \in \mathcal{T} \subset \mathbb{R}^{d_t}$; $u$ is unknown
3. $u = L \cdot m + u_0$, where $L$ and $u_0$ are known; $L$: linear operator $C^1(\mathcal{X}) \to C^1(\mathcal{T})$
4. $m(x) = E[g(x, \epsilon)]$ (average structural function)
   - $X$, $Y = g(X, \epsilon)$ observables, $\epsilon$ unobserved
   - expectation over the distribution of $\epsilon$ in the target population
5. experimental setting: $X \perp \epsilon$, so that $m(x) = E[Y \mid X = x]$
6. Gaussian process prior: $m \sim GP(\mu, C)$
7. posterior expectations: best linear predictors in $Y$

11 General framework: Applied to the two optimal policy problems
$u = L \cdot m + u_0$
1. Optimal insurance:
$$u_0(t) = 0, \qquad (Lm)(t) = \lambda \int_0^t m(x)\, dx - t \cdot m(t). \tag{4}$$
2. Optimal choice of inputs:
$$u_0(t) = -p \cdot t, \qquad (Lm)(t) = \lambda \cdot m(t). \tag{5}$$

12 General framework: Questions
1. How to choose $t$ optimally given observations $X_i, Y_i$, $i = 1, \ldots, n$?
2. How to choose design points $X_i$ given sample size $n$? How to choose sample size $n$?
Mathematical characterization:
1. How does the optimal choice of $t$ (in a Bayesian sense) behave asymptotically (in a frequentist sense)? (conditional vs. average behavior!)
2. How can we characterize the optimal design?

13 Theoretical results: What we need to derive
1. posterior expectations: $\hat m(x) := E[m(x) \mid X, Y]$, $\hat u(t) = E[u(t) \mid X, Y]$
2. optimal policy: $\hat t^* = \operatorname{argmax}_t \hat u(t)$
3. optimal design: $X^* = \operatorname{argmax}_X \upsilon(X)$, where $\upsilon(X) = E[\max_t \hat u(t)]$
What the paper shows:
1. Frequentist asymptotics
   - asymptotic normality of $\hat t^*$
   - estimator of its variance
2. Optimal design
   - effect of perturbations on $\upsilon$
   - FOC for optimal design
   - asymptotic simplification of FOC

14 Theoretical results: Posterior expectation of m
- $m \sim GP(\mu, C)$
- denote
$$\mu_i = \mu(X_i), \quad C_{i,j} = C(X_i, X_j), \quad C_i(x) = C(x, X_i). \tag{6}$$
- $\mu$, $C$, $C(x)$: corresponding vectors / matrix
- posterior expectations are best linear predictors in $Y$:
$$\hat m(x) := E[m(x) \mid X, Y] = E[m(x) \mid X] + \operatorname{Cov}(m(x), Y \mid X) \cdot \operatorname{Var}(Y \mid X)^{-1} \cdot (Y - E[Y \mid X]) = \mu(x) + C(x) \cdot [C + \sigma^2 I]^{-1} \cdot (Y - \mu). \tag{7}$$
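Equation (7) translates directly into a few lines of linear algebra. The following is a minimal sketch, assuming a squared exponential covariance kernel (the kernel family named later in the application) with hypothetical length scale and simulated data; the function names are illustrative and not from the paper.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length=0.3, scale=1.0):
    """Squared exponential covariance C(x1_i, x2_j); hyperparameters are assumptions."""
    d = x1[:, None] - x2[None, :]
    return scale**2 * np.exp(-0.5 * (d / length) ** 2)

def posterior_mean(x_new, X, Y, mu, kernel, sigma2):
    """m_hat(x) = mu(x) + C(x) [C + sigma^2 I]^{-1} (Y - mu), as in equation (7)."""
    C = kernel(X, X)                      # C_{i,j} = C(X_i, X_j)
    C_x = kernel(x_new, X)                # rows hold C(x, X_i)
    weights = np.linalg.solve(C + sigma2 * np.eye(len(X)), Y - mu(X))
    return mu(x_new) + C_x @ weights

def mu_zero(x):
    """Prior mean, taken to be zero here for simplicity (an assumption)."""
    return np.zeros_like(x)

# Hypothetical data: noisy observations of an unknown function.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=20)
Y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(20)
x_new = np.linspace(0.0, 1.0, 5)
m_hat = posterior_mean(x_new, X, Y, mu_zero, sq_exp_kernel, sigma2=0.01)
print(m_hat)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical choice only; the result is the same best linear predictor. With a single observation at $x = 0$, $Y = 2$, unit prior variance and $\sigma^2 = 1$, the formula shrinks the observation halfway toward the prior mean.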

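15 Theoretical results: Alternative representation of $\hat m$
1. Penalized regression:
$$\hat m = \operatorname{argmin}_{\check m(\cdot)} \left[ \frac{1}{\sigma^2} \sum_i (Y_i - \check m(X_i))^2 + \|\check m - \mu\|_C^2 \right] \tag{8}$$
   $\|m - \mu\|_C^2$: reproducing kernel Hilbert space norm defined by $\langle C(x_1, \cdot), C(x_2, \cdot) \rangle := C(x_1, x_2)$
2. Equivalent kernel: approximate the discrete $F_n(x)$ by a continuous $F(x)$:
$$\hat m(x) \approx w_0(x) + \frac{1}{n} \sum_i w(x, X_i) \cdot Y_i$$
$$w(\cdot, x') = \operatorname{argmin}_{\check w(\cdot)} \left[ \frac{1}{2} \int \check w(x)^2\, dF(x) + \frac{\sigma^2}{2n} \|\check w\|_C^2 - \check w(x') \right] \tag{9}$$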

16 Theoretical results: Posterior expectation of u
Let
$$D(t, x') := \operatorname{Cov}(u(t), m(x')) = \operatorname{Cov}((Lm + u_0)(t), m(x')) = L_x C(x, x'). \tag{10}$$
Then
$$\hat u(t) = E[u(t) \mid X, Y] = E[u(t) \mid X] + \operatorname{Cov}(u(t), Y \mid X) \cdot \operatorname{Var}(Y \mid X)^{-1} \cdot (Y - E[Y \mid X]) = \nu(t) + D(t) \cdot [C + \sigma^2 I]^{-1} \cdot (Y - \mu). \tag{11}$$
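For the optimal insurance problem, $D(t, x')$ in (10) is obtained by applying $L$ from (4) to the kernel in its first argument. The sketch below does this on a grid and evaluates (11). It assumes a squared exponential kernel, a constant prior mean, a trapezoid rule for the integral, and simulated data; $\nu$ is taken to be $L\mu + u_0$, which follows from linearity but is not spelled out on this slide. All names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.3):
    """Squared exponential covariance C(a_i, b_j); the length scale is an assumption."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def apply_L_insurance(vals, t_grid, lam):
    """Apply (Lm)(t) = lam * int_0^t m(x) dx - t * m(t) to each column of vals.

    vals[k, j] is the j-th function evaluated at t_grid[k]; t_grid must start at 0.
    """
    steps = np.diff(t_grid)[:, None]
    increments = 0.5 * (vals[1:] + vals[:-1]) * steps
    integral = np.vstack([np.zeros((1, vals.shape[1])), np.cumsum(increments, axis=0)])
    return lam * integral - t_grid[:, None] * vals

def posterior_welfare(t_grid, X, Y, mu, sigma2, lam):
    """u_hat(t) = nu(t) + D(t) [C + sigma^2 I]^{-1} (Y - mu), as in equation (11)."""
    C = sq_exp_kernel(X, X)
    D = apply_L_insurance(sq_exp_kernel(t_grid, X), t_grid, lam)      # D(t, X_i)
    nu = apply_L_insurance(mu(t_grid)[:, None], t_grid, lam)[:, 0]    # u_0 = 0 here
    weights = np.linalg.solve(C + sigma2 * np.eye(len(X)), Y - mu(X))
    return nu + D @ weights

def mu(x):
    """Constant prior mean for the share in the bad state (assumed value)."""
    return np.full_like(x, 0.25)

# Hypothetical data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=30)
Y = 0.3 - 0.1 * X + 0.05 * rng.standard_normal(30)
t_grid = np.linspace(0.0, 1.0, 201)
u_hat = posterior_welfare(t_grid, X, Y, mu, sigma2=0.05**2, lam=1.5)
print("u_hat at t = 0, 0.5, 1:", u_hat[[0, 100, 200]])
```

Because $L$ is linear, the same numbers result from first computing $\hat m$ on the grid and then applying $L$ to it, which gives a simple consistency check.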

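17 Theoretical results: Optimal policy
Expected utility maximization:
$$\hat t^* = \hat t^*(X, Y) \in \operatorname{argmax}_t \hat u(t), \quad \text{where } \hat u(t) = E[u(t) \mid X, Y]. \tag{12}$$
First order condition:
$$0 \overset{!}{=} \partial_t \hat u(t) = \partial_t E[u(t) \mid X, Y] = \nu'(t) + B(t) \cdot [C + \sigma^2 I]^{-1} \cdot (Y - \mu), \tag{13}$$
where
$$B(t, x') := \operatorname{Cov}\left(\frac{\partial}{\partial t} u(t), m(x')\right) = \operatorname{Cov}\left(\frac{\partial}{\partial t} (Lm + u_0)(t), m(x')\right) = \frac{\partial}{\partial t} L_x C(x, x'). \tag{14}$$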
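Given $\hat u$ on a grid, the maximizer in (12) and the first order condition in (13) can be checked numerically. The sketch below uses the production function objective $\hat u(t) = \lambda \hat m(t) - p\, t$ with a stand-in $\hat m(t) = \sqrt{t}$, $\lambda = 2$ and $p = 1$; these are hypothetical choices for illustration, not estimates from any data, and the derivative is a finite-difference approximation rather than the analytic expression (13).

```python
import numpy as np

def optimal_policy(t_grid, u_hat):
    """Grid maximizer of u_hat and a finite-difference estimate of u_hat' at that point."""
    i = int(np.argmax(u_hat))
    slope = np.gradient(u_hat, t_grid)   # central differences in the interior
    return t_grid[i], slope[i]

# Stand-in posterior mean of the production function and assumed normative parameters.
lam, price = 2.0, 1.0
t_grid = np.linspace(0.0, 4.0, 401)
m_hat = np.sqrt(t_grid)
u_hat = lam * m_hat - price * t_grid

t_hat, foc = optimal_policy(t_grid, u_hat)
print("t_hat:", t_hat, "u_hat'(t_hat):", foc)
```

Here $\hat u'(t) = 1/\sqrt{t} - 1$, so the maximizer should be $t = 1$ with a slope close to zero. A grid search is the simplest option for scalar $t$; for vector-valued $t$ a gradient-based optimizer using (13) would be the natural replacement.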

18 Theoretical results: Optimal experimental design
The optimal experimental design maximizes ex-ante expected welfare $\upsilon(X)$ through the choice of design points $X$, assuming that the policy $t$ is chosen (ex-post) as $\hat t^*(X, Y)$:
$$X^* \in \operatorname{argmax}_X \upsilon(X), \tag{15}$$
where
$$\upsilon(X) := E\left[\max_t \hat u(t) \mid X\right] = E[\hat u(\hat t^*) \mid X] = E\left[\max_t \left(\nu(t) + D(t) \cdot [C + \sigma^2 I]^{-1} \cdot (Y - \mu)\right)\right]. \tag{16}$$
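The expectation in (16) is over the prior predictive distribution of $Y$ given the design, so $\upsilon(X)$ can be approximated by simulation. The sketch below does this for the production function problem, where $D(t, X_i) = \lambda C(t, X_i)$ and $u_0(t) = -p\, t$. It is a Monte Carlo stand-in, not the analytical characterization developed in the talk, and it assumes a zero prior mean, a squared exponential kernel, and hypothetical values for $\lambda$, $p$ and $\sigma^2$.

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.3):
    """Squared exponential covariance C(a_i, b_j); the length scale is an assumption."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def expected_welfare(X, t_grid, sigma2, lam=1.0, price=0.5, n_draws=2000, seed=0):
    """Monte Carlo estimate of upsilon(X) = E[max_t u_hat(t) | X] under the prior."""
    rng = np.random.default_rng(seed)
    K = sq_exp_kernel(X, X) + sigma2 * np.eye(len(X))     # Var(Y | X) = C + sigma^2 I
    nu = -price * t_grid                                  # prior mean mu = 0, so nu = u_0
    D = lam * sq_exp_kernel(t_grid, X)                    # D(t, X_i) = lam * C(t, X_i)
    gain = np.linalg.solve(K, D.T).T                      # D [C + sigma^2 I]^{-1}
    # Draw Y - mu ~ N(0, K) via the Cholesky factor of K.
    Y_centered = rng.standard_normal((n_draws, len(X))) @ np.linalg.cholesky(K).T
    u_hat = nu[None, :] + Y_centered @ gain.T             # one row of u_hat per draw
    return float(u_hat.max(axis=1).mean())

t_grid = np.linspace(0.0, 1.0, 101)
X_design = np.linspace(0.0, 1.0, 5)
v_informative = expected_welfare(X_design, t_grid, sigma2=0.25)
v_uninformative = expected_welfare(X_design, t_grid, sigma2=1e12)
print(v_informative, v_uninformative)
```

With extremely noisy observations the posterior equals the prior, so $\upsilon$ collapses to $\max_t \nu(t) = 0$; with informative data it is larger, which reflects the value of the experiment. Comparing `expected_welfare` across candidate designs gives a brute-force version of the maximization in (15).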

19 Theoretical results: Main mathematical results
1. Explicit expression for the optimal policy $\hat t^*$
2. Frequentist asymptotics of $\hat t^*$
   1. asymptotically normal, at a rate slower than $\sqrt{n}$
   2. distribution driven by the distribution of $\hat u'(t^*)$
   3. confidence sets
3. Optimal experimental design
   - design density $f(x)$ increasing in (but less than proportionately)
     1. the density of $t^*$
     2. the expected inverse of the curvature $u''(t)$
   - algorithm to choose the design based on the prior and the objective function

20 Asymptotics: Frequentist asymptotics (under regularity conditions)
1. $\hat t^*$ is asymptotically normal
2. rate of convergence: slower than $\sqrt{n}$
3. asymptotic variance: proportional to
   1. the asymptotic variance of $\hat u'(t^*)$
   2. the squared inverse of $u''(t^*)$
4. confidence sets for $t^*$
5. regret $u(t^*) - u(\hat t^*)$: asymptotically generalized $\chi^2$ distribution
Remark:
- By construction: $\hat t^*$ is optimal w.r.t. average risk across $m$
- These results: performance of $\hat t^*$ conditional on $m$

21 Asymptotics: Outline of proof
1. $\hat t^*$ as an M-estimator, $\hat t^* = \operatorname{argmax}_t \hat u(t)$. Asymptotic normality follows from
   1. consistency of $\hat u$ in $C^2$
   2. asymptotic normality of $\hat u'(t^*)$
2. Asymptotic normality of $\hat u'(t^*)$ follows from asymptotic normality of $\hat m$, by linearity of the mapping $\hat m \mapsto \hat u$; similarly for $C^2$ consistency

22 Asymptotics
1. $\hat t^*$ as an M-estimator. FOC for $\hat t^*$:
$$0 = \hat u'(\hat t^*) = \hat u'(t^*) + \hat u''(\tilde t) \cdot (\hat t^* - t^*)$$
   Thus
$$\hat t^* - t^* = -u''(t^*)^{-1} \cdot \hat u'(t^*) + \ldots$$
2. $\hat u$ and $\hat m$:
   - public finance application:
$$\hat u'(t) = (\lambda - 1) \cdot \hat m(t) - t \cdot \hat m'(t) \tag{17}$$
   - production function application:
$$\hat u'(t) = \lambda \cdot \hat m'(t) - p \tag{18}$$
   In either case, the asymptotic distribution of $\hat u'$ is dominated by the distribution of $\hat m'$.

23 Asymptotics: Consistent variance estimator
Let
$$\hat V = \frac{1}{n^2} \left( \hat u''(\hat t^*)^{-1} \cdot \hat V_{n,m} \cdot \hat u''(\hat t^*)^{-1} \right),$$
where
$$\hat V_{n,m} = \sum_i w_t(\hat t^*, X_i)^2 \cdot (Y_i - \hat m(X_i))^2.$$
Then $\hat V / \operatorname{Var}(\hat t^* \mid X, \theta) \to_p 1$.

24 Asymptotics: Asymptotics of $\hat m$
1. General conditions for consistency of $\hat m$: Ghosal and van der Vaart (2013)
2. $\hat m$ is asymptotically equivalent to a nonparametric kernel regression, with a bandwidth parameter that depends on the design density $f(x)$: Silverman (1984); Sollich and Williams (2005)
3. Asymptotic normality and rates of convergence for $\hat m$ and for $\hat m'$ follow from standard results for kernel regression

25 Optimal design: Characterizing the optimal experimental design
- continuous approximation: easier & more elegant to characterize
- choose a design density $f(x)$, then observe the function $Y$:
$$Y(x) = m(x) + \frac{\sigma}{\sqrt{n f(x)}}\, dW(x)$$
- $dW$ is a standard white noise process, independent of $m$
We will characterize:
1. the posterior expectation $\hat u = E[u \mid Y]$
2. expected social welfare $\upsilon(f) := E[\max_t \hat u(t)]$
3. how $\hat u$, $\upsilon$ are affected by a perturbation to the design density $f$: $f + \delta \tilde f$
4. the FOC for the $f$ which maximizes $\upsilon(f)$
5. an asymptotic simplification of this FOC

26 Optimal design: Remarks
1. White noise, $Y(\cdot)$: generalized stochastic processes. $Y(x)$ is not a well defined random variable, but $\int Y(x) g(x)\, dx$ is (for $g$ continuous, bounded).
2. Posterior expectation of $m$ given $Y$: Wiener filtering
3. Continuous equivalent of the discrete setting: larger design density ⇔ increasing number of observations ⇔ decreasing the variance of noise
4. The continuous model rationalizes the equivalent kernel

27 Optimal design: Operator notation (cf. matrix notation)
Identify the kernel $C(x, x')$ with the integral operator $g \mapsto Cg$,
$$(Cg)(x) := \int C(x, x')\, g(x')\, dx'. \tag{19}$$
Similarly for the noise process,
$$(Eg)(x) := \frac{\sigma^2}{n f(x)}\, g(x), \quad \text{i.e. } E = \frac{\sigma^2}{n} F^{-1}, \tag{20}$$
where $F$ is the operator $(Fg)(x) = f(x) \cdot g(x)$.
In this notation:
- $\operatorname{Var}(Y) = C + E$
- $\operatorname{Cov}(m, Y) = C$
- $\operatorname{Cov}(u, Y) = LC$

28 Optimal design: Posterior means
The posterior expectation is the best linear predictor:
$$\hat m = \mu + W \cdot (Y - \mu) \tag{21}$$
$W$ is the equivalent kernel. Conditions for the BLP:
$$\operatorname{Cov}(\hat m - m, Y) = W \cdot \operatorname{Var}(Y) - \operatorname{Cov}(m, Y) = W \cdot (C + E) - C = 0.$$
Thus
$$W = C(C + E)^{-1}. \tag{22}$$
By linearity of expectations:
$$\hat u = E[u \mid Y] = E[Lm + u_0 \mid Y] = L \hat m + u_0 = \nu + LC(C + E)^{-1}(Y - \mu). \tag{23}$$

29 Optimal design: The prior distribution of the posterior mean $\hat u$
- $\operatorname{Var}(Y) = C + E$
- $\hat u = \nu + LC(C + E)^{-1}(Y - \mu)$
Therefore $\hat u \sim GP(\nu, V)$, where
$$V := \operatorname{Var}(\hat u) = LC(C + E)^{-1}(LC)^t \tag{24}$$
and $E = \frac{\sigma^2}{n} F^{-1}$.
Now consider the perturbed design $f + \delta \tilde f$:
$$V(\delta) = LC(C + E(\delta))^{-1}(LC)^t, \tag{25}$$
where $E(\delta) = \frac{\sigma^2}{n} (F + \delta \tilde F)^{-1}$.

30 Optimal design: Perturbed design, continued
Derivative at $\delta = 0$, denoting $\overline{W} := LW$:
$$V_\delta = -\overline{W} E_\delta \overline{W}^t, \tag{26}$$
where
$$E_\delta = -\frac{\sigma^2}{n} F^{-1} \tilde F F^{-1}. \tag{27}$$
Special case: $\tilde f$ a unit point mass at $x^*$. Then $V_\delta$ is the covariance kernel of
$$v(x) := \frac{\sigma}{\sqrt{n}\, f(x^*)}\, w(x, x^*) \cdot \epsilon, \tag{28}$$
where $\epsilon \sim N(0, 1)$.
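The derivative of $V(\delta)$ at $\delta = 0$ can be checked in the matrix analogue of the operator notation, where $C$ is a kernel matrix on a grid and $F$, $\tilde F$ are diagonal. The sketch below compares a finite-difference derivative of $V(\delta) = LC(C + E(\delta))^{-1}(LC)^t$ with the expression $-\overline{W} E_\delta \overline{W}^t$, taking $E_\delta = -\frac{\sigma^2}{n} F^{-1} \tilde F F^{-1}$ (the derivative of $E(\delta)$). The sign convention, the grid, the choice $L = 2I$, and the value of $\sigma^2 / n$ are assumptions made for this check; grid weights of the integral operators are ignored.

```python
import numpy as np

k = 25
x = np.linspace(0.0, 1.0, k)
C = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2)   # assumed prior covariance
f = np.ones(k)                                              # baseline design (uniform)
f_tilde = np.zeros(k)
f_tilde[10] = 1.0                                           # point-mass perturbation
L = 2.0 * np.eye(k)                                         # hypothetical linear operator
noise = 0.1                                                 # plays the role of sigma^2 / n

def V(delta):
    """V(delta) = L C (C + E(delta))^{-1} (L C)^t with E(delta) = noise * (F + delta F~)^{-1}."""
    E = noise * np.diag(1.0 / (f + delta * f_tilde))
    return L @ C @ np.linalg.solve(C + E, (L @ C).T)

# Analytic derivative at delta = 0.
E0 = noise * np.diag(1.0 / f)
W_bar = L @ C @ np.linalg.inv(C + E0)                       # W_bar = L W, W = C (C + E)^{-1}
E_delta = -noise * np.diag(f_tilde / f**2)                  # derivative of E(delta) at 0
V_delta_formula = -W_bar @ E_delta @ W_bar.T

# Central finite difference for comparison.
h = 1e-5
V_delta_numeric = (V(h) - V(-h)) / (2.0 * h)
print("max abs difference:", np.abs(V_delta_formula - V_delta_numeric).max())
```

In this discrete version $V_\delta$ is a rank-one positive semidefinite matrix proportional to the outer product of the column of $\overline{W}$ at the perturbed point, which matches its interpretation as the covariance kernel of $v$ in (28).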

31 Optimal design: Perturbations to the design and expected social welfare
Theorem. Consider expected social welfare $\upsilon(f + \delta \tilde f)$ at the perturbed design distribution $f + \delta \tilde f$, where $\tilde f$ is a point mass at the point $x^*$. Then
$$\frac{\partial}{\partial \delta} \upsilon(f + \delta \tilde f) = -\frac{1}{2} \frac{\sigma^2}{n f^2(x^*)}\, E\left[ \operatorname{tr}\left[ \hat u''(\hat t^*)^{-1} \cdot \left( w_t(\hat t^*, x^*)\, w_t(\hat t^*, x^*)^t \right) \right] \right].$$

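32 Optimal design: Sketch of proof
1. The perturbed design is equivalent to considering the random function $\hat u(\delta) := \hat u + \sqrt{\delta}\, v$, where
$$v(x) = \frac{\sigma}{\sqrt{n}\, f(x^*)}\, w(x, x^*) \cdot \epsilon, \tag{29}$$
   and $\epsilon \sim N(0, 1)$, $\epsilon \perp \hat u$.
2. Taylor expansions for $\hat t^*$ and $\hat u(\delta)$ yield
$$\hat u(\delta, \hat t^*(\delta)) = \hat u(\hat t^*) + \sqrt{\delta}\, v(\hat t^*) - \delta\, \frac{1}{2}\, v'(\hat t^*)^t \cdot \hat u''(\hat t^*)^{-1} \cdot v'(\hat t^*) + o_p(\delta \epsilon^2). \tag{30}$$
3. Now take expectations.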

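33 Optimal design: The optimal experimental design
$f^* = \operatorname{argmax}_f \upsilon(f)$
Corollary. $f^*$ is characterized by
$$\frac{1}{f^2(x)}\, E\left[ \operatorname{tr}\left[ \hat u''(\hat t^*)^{-1} \cdot \left( w_t(\hat t^*, x)\, w_t(\hat t^*, x)^t \right) \right] \right] = \kappa. \tag{31}$$
$\kappa$ is a Lagrange multiplier on the constraint $\int f = 1$. Equation (31) holds for all $x$.
Sketch of proof: $f^*$ solves
$$\max_f \upsilon(f) \quad \text{s.t.} \quad \int f(x)\, dx = 1. \tag{32}$$
FOC:
$$\frac{\partial}{\partial \delta} \upsilon(f + \delta \tilde f) = \kappa \int \tilde f(x)\, dx = \kappa.$$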

34 Application: health insurance
The RAND health insurance experiment (cf. Aron-Dine et al., 2013)
- Between 1974 and 1981
- representative sample of 2000 households in six locations across the US
- families randomly assigned to plans with one of six consumer coinsurance rates
  - 95, 50, 25, or 0 percent
  - 2 more complicated plans (I drop those)
- Additionally: randomized Maximum Dollar Expenditure limits
  - 5, 10, or 15 percent of family income, up to a maximum of $750 or $1,000 (I pool across those)

35 Application: health insurance
Table: Predicted spending for different coinsurance rates, OLS

                          (1)           (2)           (3)           (4)
                          Share with    Spending      Share with    Spending
                          any           in $          any           in $
Free Care                 (0.006)       (78.76)       (0.006)       (72.06)
25% Coinsurance           (0.013)       (130.5)       (0.012)       (115.2)
50% Coinsurance           (0.018)       (273.7)       (0.016)       (279.6)
95% Coinsurance           (0.011)       (95.40)       (0.009)       (88.48)
family x month x site
fixed effects             X             X             X             X
covariates                                            X             X
N

36 Application: health insurance
Assumptions:
1. Model: the optimal insurance model as presented before
2. Prior: Gaussian process prior for $m$, squared exponential in distance, uninformative about level and slope
3. Relative value of funds for sick people vs. contributors: $\lambda = 1.5$ (crucial parameter!)
4. Pooling data: across levels of maximum dollar expenditure
Under these assumptions we find:
- Optimal copay equals 18%
- (But free care is almost as good)

37 Application: health insurance
Posterior: Spending for different coinsurance rates
[Figure: posterior of $m$ plotted against $t$]

38 Application: health insurance
Posterior: Social welfare for different coinsurance rates
[Figure: $\hat u$ plotted against $t$]

39 Application: health insurance
First order condition for optimal policy; confidence band
[Figure: $\hat u'$ plotted against $t$, with confidence band]

40 Summary
This paper combines statistical decision theory and optimal policy theory.
Model:
1. objective $u = L \cdot m + u_0$
2. $m$ is the ASF, $m(x) = E[g(x, \epsilon)]$
3. prior $m \sim GP(\mu, C)$
Questions:
1. How to choose $t$ optimally given observations?
2. How to choose design points $X_i$?
Main results:
1. Explicit expression for the optimal policy $\hat t^*$
2. Frequentist asymptotics of $\hat t^*$
3. Characterization of the optimal experimental design

41 Thanks for your time!
