What is to be done? Two attempts using Gaussian process priors

1 What is to be done? Two attempts using Gaussian process priors
Maximilian Kasy, Department of Economics, Harvard University. Oct.

2 What questions should econometricians work on?
Incentives of the publication process: appeal to referees from the same subfield. Danger of self-referentiality, untethering from external relevance.
Versus broader usefulness: tools useful for empirical researchers and policy makers. Anchored in substantive applications, broader methodological considerations.
One way to get there: well-defined decision problems.

3 Decision problems
Objects to carefully choose:
- Objective function.
- Space of possible decisions / policy alternatives.
- Identifying assumptions.
- Prior information.
- Features the priors should be uninformative about.
Once these are specified, coherent and well-behaved solutions can be derived.
Useful tool for tractable solutions without functional form restrictions: Gaussian process priors.

4 Outline of this talk
Brief introduction to Gaussian process regression.
Application 1: Optimal treatment assignment in experiments.
- Setting: treatment assignment given baseline covariates.
- General decision theory result: non-random rules dominate random rules.
- Prior for the expectation of potential outcomes given covariates.
- Expression for the MSE of the estimator of the ATE, to be minimized by the treatment assignment.
Application 2: Optimal insurance and taxation.
- Economic setting: co-insurance rate for health insurance.
- Statistical setting: prior for the behavioral average response function.
- Expression for posterior expected social welfare, to be maximized by the choice of co-insurance rate.

5 References
Williams, C. and Rasmussen, C. (2006). Gaussian Processes for Machine Learning. MIT Press, chapter 2.
Kasy, M. (2016). Why experimenters might not always want to randomize, and what they could do instead. Political Analysis, 24(3).
Kasy, M. (2017). Optimal taxation and insurance using machine learning. Working Paper, Harvard University.

6 Gaussian process regression
Brief introduction to Gaussian process regression
Suppose we observe n i.i.d. draws of (Y_i, X_i), where Y_i is real-valued and X_i is a k-vector.
Y_i = f(X_i) + ε_i,   ε_i | X, f(·) ~ N(0, σ²)
Prior: f is distributed according to a Gaussian process,
f | X ~ GP(0, C),
where C is a covariance kernel, Cov(f(x), f(x') | X) = C(x, x').
We will leave conditioning on X implicit.

7 Gaussian process regression
Posterior mean
The joint distribution of (f(x), Y) is given by
(f(x), Y) ~ N(0, [[C(x, x), c(x)'], [c(x), C + σ² I_n]]),
where c(x) is the n-vector with entries C(x, X_i), and C is the n × n matrix with entries C_{i,j} = C(X_i, X_j).
Therefore
E[f(x) | Y] = c(x)' (C + σ² I_n)^{-1} Y.
Read: the posterior mean function f̂(·) = E[f(·) | Y] is a linear combination of the functions C(·, X_i) with weights (C + σ² I_n)^{-1} Y.
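
To make the formula concrete, here is a minimal numerical sketch of the posterior mean computation with scalar covariates and a squared-exponential kernel; the kernel choice, hyperparameters, and simulated data are illustrative assumptions, not part of the talk.

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=0.2, variance=1.0):
    """Illustrative squared-exponential covariance kernel C(x, x')."""
    return variance * np.exp(-0.5 * (np.subtract.outer(x1, x2) / length_scale) ** 2)

def gp_posterior_mean(X, Y, x_grid, sigma2=0.5):
    """E[f(x) | Y] = c(x)' (C + sigma^2 I_n)^{-1} Y, with prior mean zero."""
    C = sq_exp_kernel(X, X)                       # n x n matrix, C_ij = C(X_i, X_j)
    c = sq_exp_kernel(x_grid, X)                  # rows: c(x)' for each grid point x
    weights = np.linalg.solve(C + sigma2 * np.eye(len(X)), Y)
    return c @ weights                            # linear combination of C(., X_i)

# Simulated example
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 50)
Y = np.sin(4 * X) + rng.normal(0, np.sqrt(0.5), size=50)
f_hat = gp_posterior_mean(X, Y, np.linspace(0, 1, 101))
```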

8 Gaussian process regression
Both applications use Gaussian process priors
1. Optimal experimental design: how to assign treatment to minimize the mean squared error of treatment effect estimators? Gaussian process prior for the conditional expectation of potential outcomes given covariates.
2. Optimal insurance and taxation: how to choose a co-insurance rate or tax rate to maximize social welfare, given (quasi-)experimental data? Gaussian process prior for the behavioral response function mapping the co-insurance rate into the tax base.

9 Experimental design
Application 1: Why experimenters might not always want to randomize
Setup
1. Sampling: random sample of n units; baseline survey yields a vector of covariates X_i.
2. Treatment assignment: binary treatment assigned by D_i = d_i(X, U), where X is the matrix of covariates and U is a randomization device.
3. Realization of outcomes: Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0.
4. Estimation: estimator β̂ of the (conditional) average treatment effect, β = (1/n) Σ_i E[Y_i^1 - Y_i^0 | X_i, θ].

10 Experimental design
Questions
How should we assign treatment? In particular, if X_i has continuous or many discrete components?
How should we estimate β?
What is the role of prior information?

11 Experimental design
Some intuition
Compare apples with apples: balance the covariate distribution. Not just balance of means!
We don't add random noise to estimators, so why add random noise to experimental designs?
Identification requires controlled trials (CTs), but not randomized controlled trials (RCTs).

12 Experimental design
General decision problem allowing for randomization
State of the world θ, observed data X, randomization device U ⊥ X, decision procedure δ(X, U), loss L(δ(X, U), θ).
Conditional expected loss of the decision procedure δ(X, U):
R(δ, θ | U = u) = E[L(δ(X, u), θ) | θ]
Bayes risk:
R_B(δ, π) = ∫∫ R(δ, θ | U = u) dπ(θ) dP(u)
Minimax risk:
R_mm(δ) = ∫ max_θ R(δ, θ | U = u) dP(u)

13 Experimental design
Theorem (Optimality of deterministic decisions)
Consider a general decision problem. Let R* equal R_B or R_mm. Then:
1. The optimal risk R*(δ*), when considering only deterministic procedures δ(X), is no larger than the optimal risk when allowing for randomized procedures δ(X, U).
2. If the optimal deterministic procedure δ* is unique, then it has strictly lower risk than any non-trivial randomized procedure.

14 Experimental design
Proof
Any probability distribution P(u) satisfies Σ_u P(u) = 1 and P(u) ≥ 0 for all u. Thus Σ_u R_u P(u) ≥ min_u R_u for any set of values R_u.
Let δ_u(x) = δ(x, u). Then
R_B(δ, π) = Σ_u ∫ R(δ_u, θ) dπ(θ) P(u) ≥ min_u ∫ R(δ_u, θ) dπ(θ) = min_u R_B(δ_u, π).
Similarly
R_mm(δ) = Σ_u max_θ R(δ_u, θ) P(u) ≥ min_u max_θ R(δ_u, θ) = min_u R_mm(δ_u).
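
The key step is simply that a P(u)-weighted average of risks can never fall below the smallest of them. A small numerical check of that inequality, with made-up risk values:

```python
import numpy as np

rng = np.random.default_rng(1)
risks = rng.uniform(0.0, 1.0, size=10)     # R_u: risk of the deterministic rule delta_u, for each u
p = rng.dirichlet(np.ones(10))             # P(u): an arbitrary randomization distribution over u
assert p @ risks >= risks.min()            # sum_u R_u P(u) >= min_u R_u
```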

15 Experimental design
Bayesian setup
Back to the experimental design setting.
Conditional distribution of potential outcomes: for d = 0, 1,
Y_i^d | X_i = x ~ N(f(x, d), σ²).
Gaussian process prior:
f ~ GP(µ, C),   E[f(x, d)] = µ(x, d),   Cov(f(x_1, d_1), f(x_2, d_2)) = C((x_1, d_1), (x_2, d_2)).
Conditional average treatment effect (CATE):
β = (1/n) Σ_i E[Y_i^1 - Y_i^0 | X_i, θ] = (1/n) Σ_i [f(X_i, 1) - f(X_i, 0)].

16 Experimental design
Notation
Covariance matrix C, where C_{i,j} = C((X_i, D_i), (X_j, D_j)).
Mean vector µ, with components µ_i = µ(X_i, D_i).
Covariance of the observations with the CATE: the vector C̄ with entries
C̄_i = Cov(Y_i, β | X, D) = (1/n) Σ_j [C((X_i, D_i), (X_j, 1)) - C((X_i, D_i), (X_j, 0))].

17 Experimental design
Posterior expectation and risk
The posterior expectation β̂ of β equals
β̂ = µ_β + C̄' (C + σ² I)^{-1} (Y - µ).
The corresponding risk equals
R_B(d, β̂ | X) = Var(β | X, Y) = Var(β | X) - Var(E[β | X, Y] | X) = Var(β | X) - C̄' (C + σ² I)^{-1} C̄.
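
As a numerical sketch of these expressions, the function below computes the term C̄' (C + σ² I)^{-1} C̄ for a candidate assignment vector d; since Var(β | X) does not depend on d, minimizing the Bayes risk amounts to maximizing this term. The prior used here (independent GP(0, K) processes for f(·, 0) and f(·, 1) with a squared-exponential K and scalar covariates) is an illustrative assumption, not the talk's specification.

```python
import numpy as np

def kernel_x(x1, x2, length_scale=1.0):
    """Illustrative squared-exponential kernel over (scalar) covariates."""
    return np.exp(-0.5 * (np.subtract.outer(x1, x2) / length_scale) ** 2)

def design_objective(d, X, sigma2=1.0):
    """C_bar' (C + sigma^2 I)^{-1} C_bar for an assignment vector d in {0,1}^n.

    Assumed prior for this sketch: f(., 0) and f(., 1) are independent GP(0, K),
    so C((x1, d1), (x2, d2)) = 1{d1 = d2} * K(x1, x2).
    """
    d = np.asarray(d)
    n = len(X)
    K = kernel_x(X, X)
    C = np.equal.outer(d, d) * K                 # C_ij = C((X_i, D_i), (X_j, D_j))
    sign = np.where(d == 1, 1.0, -1.0)
    # C_bar_i = (1/n) sum_j [C((X_i,D_i),(X_j,1)) - C((X_i,D_i),(X_j,0))] = sign_i * mean_j K(X_i, X_j)
    C_bar = sign * K.mean(axis=1)
    return C_bar @ np.linalg.solve(C + sigma2 * np.eye(n), C_bar)
```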

18 Experimental design
Discrete optimization
The optimal design solves
max_d C̄' (C + σ² I)^{-1} C̄.
Possible optimization algorithms:
1. Search over random d.
2. Greedy algorithm.
3. Simulated annealing.
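
A sketch of option 2, a greedy local search over assignments, reusing design_objective from the previous block; the starting point and number of sweeps are arbitrary choices for illustration.

```python
def greedy_assignment(X, sigma2=1.0, n_sweeps=3, seed=0):
    """Greedy search: flip one unit's treatment at a time if it raises the objective."""
    rng = np.random.default_rng(seed)
    d = rng.integers(0, 2, size=len(X))
    best = design_objective(d, X, sigma2)
    for _ in range(n_sweeps):
        for i in range(len(X)):
            d[i] = 1 - d[i]                      # try flipping unit i
            value = design_objective(d, X, sigma2)
            if value > best:
                best = value                     # keep the flip
            else:
                d[i] = 1 - d[i]                  # revert
    return d

X = np.random.default_rng(2).normal(size=30)
d_star = greedy_assignment(X)
```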

19 Experimental design
Special case: linear separable model
Suppose f(x, d) = x'γ + dβ, γ ~ N(0, Σ), and we estimate β using a comparison of means.
The bias of β̂ equals (X̄_1 - X̄_0)'γ; the prior expected squared bias is (X̄_1 - X̄_0)' Σ (X̄_1 - X̄_0).
Mean squared error:
MSE(d_1, ..., d_n) = σ² [1/n_1 + 1/n_0] + (X̄_1 - X̄_0)' Σ (X̄_1 - X̄_0).
Risk is minimized by
1. choosing treatment and control arms of equal size, and
2. optimizing balance as measured by the difference in covariate means X̄_1 - X̄_0.
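
The MSE expression in this special case is easy to evaluate directly for any candidate assignment; a small sketch, where the covariate matrix, σ², and the prior covariance Σ of γ are inputs supplied by the user.

```python
import numpy as np

def mse_linear_model(d, X, sigma2, Sigma):
    """sigma^2 (1/n1 + 1/n0) + (Xbar_1 - Xbar_0)' Sigma (Xbar_1 - Xbar_0)."""
    d = np.asarray(d, dtype=bool)
    n1, n0 = d.sum(), (~d).sum()
    diff = X[d].mean(axis=0) - X[~d].mean(axis=0)    # difference in covariate means
    return sigma2 * (1.0 / n1 + 1.0 / n0) + diff @ Sigma @ diff

# Example with two covariates
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
d = rng.integers(0, 2, size=20)
value = mse_linear_model(d, X, sigma2=1.0, Sigma=np.eye(2))
```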

20 Optimal insurance
Application 2: Optimal insurance and taxation using machine learning
Economic setting
Population of insured individuals i.
Y_i: health care expenditures of individual i.
T_i: share of health care expenditures covered by the insurance; 1 - T_i: coinsurance rate; Y_i (1 - T_i): out-of-pocket expenditures.
Behavioral response to the share covered: structural function Y_i = g(T_i, ε_i).
Per capita expenditures under policy t: average structural function m(t) = E[g(t, ε_i)].

21 Optimal insurance
Policy objective
Insurance provider's expenditures per person: t m(t).
- Mechanical effect of an increase in t (accounting): m(t) dt.
- Behavioral effect of an increase in t (key empirical challenge): t m'(t) dt.
Utility of the insured:
- Mechanical effect of an increase in t (accounting): m(t) dt.
- Behavioral effect: none, by the envelope theorem.
- Effect on utility = equivalent variation = mechanical effect.
Assign relative value λ > 1 to a marginal dollar for the sick vs. the insurer.

22 Optimal insurance
Marginal effect of a change in t on social welfare:
u'(t) = (λ - 1) m(t) - t m'(t) = λ m(t) - ∂/∂t [t m(t)].   (1)
Integrating and imposing the normalization u(0) = 0:
u(t) = λ ∫_0^t m(x) dx - t m(t).   (2)
Special case λ = 1: Harberger triangle (not the relevant case).
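
Spelling out the step from (1) to (2): integrating u'(x) from 0 to t and using u(0) = 0,

```latex
u(t) = \int_0^t u'(x)\,dx
     = \int_0^t \left[ \lambda\, m(x) - \frac{\partial}{\partial x}\big(x\, m(x)\big) \right] dx
     = \lambda \int_0^t m(x)\,dx - t\, m(t),
```

where the second term uses the fundamental theorem of calculus applied to x m(x), which equals 0 at x = 0.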

23 Optimal insurance
Observed data and prior
n i.i.d. draws of (Y_i, T_i).
T_i was randomly assigned in an experiment, so that T_i ⊥ ε_i, and
E[Y_i | T_i = t] = E[g(t, ε_i) | T_i = t] = E[g(t, ε_i)] = m(t).
Y_i is normally distributed given T_i:
Y_i | T_i = t ~ N(m(t), σ²).
Gaussian process prior for m(·):
m(·) ~ GP(µ(·), C(·, ·)).

24 Optimal insurance
Prior moments
Linear functions of normal vectors are normal. Linear operators of Gaussian processes are Gaussian processes.
Prior moments:
ν(t) = E[u(t)] = λ ∫_0^t µ(x) dx - t µ(t),
D(t, t') = Cov(u(t), m(t')) = λ ∫_0^t C(x, t') dx - t C(t, t'),
Var(u(t)) = λ² ∫_0^t ∫_0^t C(x, x') dx dx' - 2λ t ∫_0^t C(x, t) dx + t² C(t, t).
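
These prior moments are one-dimensional integrals of the prior mean and kernel, so they can be approximated by quadrature. A sketch with an illustrative squared-exponential prior kernel and a flat (zero) prior mean; all hyperparameter choices are assumptions for the example.

```python
import numpy as np

def prior_kernel(t1, t2, length_scale=0.25, variance=1.0):
    """Illustrative squared-exponential prior covariance C(t, t') for m."""
    return variance * np.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)

def prior_mean_m(t):
    """Illustrative (flat, zero) prior mean mu(t) for m."""
    return np.zeros_like(np.asarray(t, dtype=float))

def _integral(f, a, b, n_grid=201):
    """Trapezoid rule for int_a^b f(x) dx."""
    x = np.linspace(a, b, n_grid)
    y = f(x)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def nu(t, lam):
    """nu(t) = E[u(t)] = lam * int_0^t mu(x) dx - t * mu(t)."""
    return lam * _integral(prior_mean_m, 0.0, t) - t * float(prior_mean_m(np.array([t]))[0])

def D(t, t_prime, lam):
    """D(t, t') = Cov(u(t), m(t')) = lam * int_0^t C(x, t') dx - t * C(t, t')."""
    return lam * _integral(lambda x: prior_kernel(x, t_prime), 0.0, t) - t * prior_kernel(t, t_prime)
```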

25 Optimal insurance
Posterior expectation of u(·)
Covariance with the data:
D(t) = Cov(u(t), Y | T) = Cov(u(t), (m(T_1), ..., m(T_n)) | T) = (D(t, T_1), ..., D(t, T_n)).
Posterior expectation of u(t):
û(t) = E[u(t) | Y, T]
     = E[u(t) | T] + Cov(u(t), Y | T) Var(Y | T)^{-1} (Y - E[Y | T])
     = ν(t) + D(t) [C + σ² I]^{-1} (Y - µ).
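
Continuing the sketch above: with those prior moment functions, the posterior expected welfare û(t) follows directly from the formula on this slide.

```python
def posterior_u(t_grid, Y, T, lam, sigma2):
    """u_hat(t) = nu(t) + D(t)' (C + sigma^2 I)^{-1} (Y - mu), evaluated on a grid of t."""
    C = prior_kernel(T[:, None], T[None, :])                 # C_ij = C(T_i, T_j)
    weights = np.linalg.solve(C + sigma2 * np.eye(len(T)), Y - prior_mean_m(T))
    u_hat = np.empty(len(t_grid))
    for k, t in enumerate(t_grid):
        D_vec = np.array([D(t, Tj, lam) for Tj in T])        # (D(t, T_1), ..., D(t, T_n))
        u_hat[k] = nu(t, lam) + D_vec @ weights
    return u_hat
```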

26 Optimal insurance
Optimal policy choice
A Bayesian policy maker aims to maximize posterior expected social welfare (note: this is different from the expectation of the maximizer of social welfare!). Thus
t* = t*(Y, T) ∈ argmax_t û(t).
First order condition:
∂/∂t û(t*) = E[u'(t*) | Y, T] = ν'(t*) + B(t*) [C + σ² I]^{-1} (Y - µ) = 0,
where B(t) = (B(t, T_1), ..., B(t, T_n)) and
B(t, t') = Cov(∂u(t)/∂t, m(t')) = ∂D(t, t')/∂t = (λ - 1) C(t, t') - t ∂C(t, t')/∂t.
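
Rather than solving the first order condition analytically, one can simply maximize û over a grid of candidate policies. A sketch on simulated data, continuing the code above; the coverage rates, the welfare weight λ, and the data-generating process are all made up for illustration.

```python
rng = np.random.default_rng(5)
T = rng.choice([0.0, 0.05, 0.5, 0.75], size=200)           # assigned coverage rates (illustrative)
Y = 1.0 + 2.0 * T + rng.normal(0.0, 0.5, size=200)         # simulated spending data
t_grid = np.linspace(0.0, 1.0, 101)
u_hat = posterior_u(t_grid, Y, T, lam=2.0, sigma2=0.25)    # lam = 2.0 is an illustrative welfare weight
t_star = t_grid[np.argmax(u_hat)]                          # policy maximizing posterior expected welfare
```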

27 Optimal insurance
The RAND health insurance experiment (cf. Aron-Dine et al., 2013)
Between 1974 and 1981: representative sample of 2000 households in six locations across the US.
Families randomly assigned to plans with one of six consumer coinsurance rates: 95, 50, 25, or 0 percent, plus 2 more complicated plans (we drop those).
Additionally: randomized Maximum Dollar Expenditure limits of 5, 10, or 15 percent of family income, up to a maximum of $750 or $1,000 (we pool across those).

28 Optimal insurance
Table: Expected spending for different coinsurance rates (standard errors in parentheses)
                                      (1)             (2)            (3)             (4)
                                      Share with any  Spending in $  Share with any  Spending in $
Free Care                             (0.006)         (78.76)        (0.006)         (72.06)
25% Coinsurance                       (0.013)         (130.5)        (0.012)         (115.2)
50% Coinsurance                       (0.018)         (273.7)        (0.016)         (279.6)
95% Coinsurance                       (0.011)         (95.40)        (0.009)         (88.48)
family x month x site fixed effects   X               X              X               X
covariates                                                           X               X

29 Optimal insurance
Assumptions
1. Model: the optimal insurance model as presented before.
2. Prior: Gaussian process prior for m, squared exponential in distance, uninformative about level and slope.
3. Relative value of funds for sick people vs. contributors: λ =
4. Pooling data: across levels of maximum dollar expenditure.
Under these assumptions we find: the optimal copay equals 18% (but free care is almost as good).

30 Optimal insurance [figure]

31 Optimal insurance [figure]

32 Optimal insurance
Conclusion
Explicit decision problems are useful to focus econometric research. Carefully choose:
- Objective function.
- Space of possible decisions / policy alternatives.
- Identifying assumptions.
- Prior information.
- Features the priors should be uninformative about.
Gaussian process priors allow for tractable solutions. Two examples:
1. Optimal experimental design.
2. Optimal insurance and taxation.

33 Optimal insurance
Thank you!
