Confidence Sets Based on Shrinkage Estimators

Mikkel Plagborg-Møller, Harvard University, June 2017

Shrinkage estimators in applied work
$$\hat\beta_{\text{shrink}} = \arg\min_\beta \{ \hat Q(\beta) + \lambda c(\beta) \}$$
Shrinkage/penalized estimators are popular in economics:
  Random effects.
  High-dimensional prediction.
  Smoothing jagged functions. Shiller (1973); Hodrick & Prescott (1981); Breitung & Roling (2015); Barnichon & Brownlees (2017)
  Estimating fixed effects. Chetty et al. (2014); Chamberlain (2016)
  Shrinking toward theory. Hansen (2016); Fessler & Kasy (2017)
Shrinkage parameter $\lambda$ is often data-dependent.

Challenges of shrinkage inference
How to calculate SEs for shrinkage estimators?
With a data-dependent shrinkage parameter $\lambda$, the asymptotic distribution is often discontinuous in the true parameters (see the James-Stein example in the appendix).
For finite-dimensional parameters, it is impossible to estimate the CDF of $\hat\beta_{\text{shrink}}$ uniformly consistently. Leeb & Pötscher (2005)
Standard bootstrap typically doesn't work. Beran (2010)
Applied researchers often just undersmooth (i.e., report the SE of the usual point estimator). Not always valid.

This project
Class of generalized ridge regression estimators: Vinod (1978)
$$\hat\beta_{M,W}(\lambda) = \arg\min_{\beta \in \mathbb{R}^n} \{ \|\beta - \hat\beta\|_W^2 + \lambda \|M\beta\|^2 \}.$$
Shrinkage parameter $\lambda$ selected by unbiased risk estimate.
Gaussian location model: $\hat\beta \sim N_n(\beta, \Sigma)$, known $\Sigma$.
Conditional QLR test for linear hypothesis on $\beta$. Exact size.
Conditional QLR confidence set by test inversion.
Simulations show favorable average length/area of CSs.
Uniform asymptotic validity even when data is non-Gaussian.

Relationship to literature
Large stats literature uses analytically convenient transformations and priors. Casella & Hwang (1982, 1984, 1987, 2012); Tseng & Brown (1997)
My starting point: How to calculate SEs for a given ridge estimator?
  Arbitrary correlation structure, arbitrary shrinkage hypothesis.
  CSs tied to (and always contain) a meaningful point estimator.
  Tests/CSs have Empirical Bayes (random effects) interpretation.
But I do not start from decision-theoretic first principles.
  Impossible to uniformly dominate expected volume of Wald ellipsoid for 1-D or 2-D problems. Stein (1962); Brown (1966); Joshi (1969)

Other related literature
Shrinkage: Stein (1956); James & Stein (1961); Bock (1975); Oman (1982); Casella & Hwang (1987)
Unbiased risk estimate: Mallows (1973); Stein (1973, 1981); Berger (1985); Claeskens & Hjort (2003); Hansen (2010)
Asymptotics for shrinkage: Leeb & Pötscher (2005); Hansen (2016)
Uniformity: Andrews, Cheng & Guggenberger (2011); McCloskey (2015)
Post-regularization inference: Chernozhukov, Hansen & Spindler (2015)
Conditional inference: Andrews & Mikusheva (2016)
Adaptive confidence sets: Pratt (1961); Brown, Casella & Hwang (1995); Wasserman (2006); Armstrong & Kolesár (2016)

Outline
1. Shrinkage estimators and unbiased risk estimate
2. Testing
3. Confidence sets
4. Simulation study
5. Uniform asymptotic validity
6. Applications
7. Summary

Gaussian location model
For now, consider the finite-sample Gaussian location model
$$\hat\beta \sim N_n(\beta, \Sigma).$$
$\beta \in \mathbb{R}^n$ unknown.
$\Sigma$ symmetric p.d. and known.
Will later consider an asymptotic framework for which the Gaussian model is the right limit experiment. Plug in consistent estimator $\hat\Sigma$.

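General shrinkage estimator class
$$\hat\beta_{M,W}(\lambda) = \arg\min_{\beta \in \mathbb{R}^n} \{ \|\beta - \hat\beta\|_W^2 + \lambda \|M\beta\|^2 \} = \Theta_{M,W}(\lambda)\hat\beta,$$
$$\Theta_{M,W}(\lambda) = (I_n + \lambda W^{-1} M'M)^{-1}.$$
$M \in \mathbb{R}^{m \times n}$, $W \in \mathbb{R}^{n \times n}$ symmetric p.d.
Example: the second-difference matrix
$$M = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix} \in \mathbb{R}^{(n-2) \times n}.$$
Penalizes jaggedness. Whittaker (1923); Shiller (1972); Hodrick & Prescott (1981); Wahba (1990)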
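
The closed form on this slide can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming W = I_n and a made-up jagged estimate; the function names and the example data are illustrative and do not come from the slides.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix whose rows are the second differences (1, -2, 1)."""
    M = np.zeros((n - 2, n))
    for i in range(n - 2):
        M[i, i:i + 3] = [1.0, -2.0, 1.0]
    return M

def ridge_operator(lam, M, W):
    """Theta_{M,W}(lam) = (I_n + lam * W^{-1} M'M)^{-1}."""
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) + lam * np.linalg.solve(W, M.T @ M))

# Hypothetical data: a noisy estimate of a smooth curve.
rng = np.random.default_rng(0)
n = 20
beta_hat = np.sin(np.linspace(0.0, np.pi, n)) + 0.3 * rng.standard_normal(n)
M = second_difference_matrix(n)
W = np.eye(n)  # assumption: identity weight matrix

lam = 5.0
smooth = ridge_operator(lam, M, W) @ beta_hat

# The first-order condition W (beta - beta_hat) + lam M'M beta = 0 should hold.
foc = W @ (smooth - beta_hat) + lam * M.T @ M @ smooth
print("max |first-order condition|:", np.max(np.abs(foc)))
print("roughness before/after:", np.linalg.norm(M @ beta_hat), np.linalg.norm(M @ smooth))
```

Any vector that is linear in the index lies in the null space of this M, so it passes through the shrinkage operator unchanged.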

[Figure: response in basis points (vertical axis) against horizon in months, 0 to 48 (horizontal axis).] $y_t$: GZ excess bond premium. $x_t$: high-freq. FFF shock. Controls: 2 lags of $y_t$, $x_t$, log(cpi), log(ip), 1yrTreas. Sample: 1991-2012.

Projection shrinkage
Shrinkage is particularly tractable when $W = I_n$ and $M = P \in \mathbb{R}^{n \times n}$ is an orthogonal projection matrix: $P = P' = P^2$.
Projection shrinkage towards the linear subspace $\mathrm{span}(I_n - P)$. Stein (1956); Oman (1982a,b); Bock (1985); Casella & Hwang (1987)
$$\hat\beta_P(\lambda) = \arg\min_{\beta \in \mathbb{R}^n} \{ \|\beta - \hat\beta\|^2 + \lambda \|P\beta\|^2 \} = \frac{1}{1+\lambda} P\hat\beta + (I_n - P)\hat\beta.$$
Example: $I_n - P$ = projection matrix from regression onto basis functions.

[Figure: response in basis points (vertical axis) against horizon in months, 0 to 48 (horizontal axis).] $y_t$: GZ excess bond premium. $x_t$: high-freq. FFF shock. Controls: 2 lags of $y_t$, $x_t$, log(cpi), log(ip), 1yrTreas. Sample: 1991-2012.

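Unbiased risk estimate
MSE risk criterion (see the appendix slide on the bias/variance tradeoff):
$$R_{M,W}(\lambda; \beta) = E_\beta\big( \|\hat\beta_{M,W}(\lambda) - \beta\|_W^2 \big).$$
Unbiased risk estimate (URE): Mallows (1973); Stein (1973, 1981); Berger (1985); Hansen (2010)
$$\hat R_{M,W}(\lambda) = \|\hat\beta_{M,W}(\lambda) - \hat\beta\|_W^2 + 2\,\mathrm{tr}\{W \Theta_{M,W}(\lambda) \Sigma\}.$$
Define $\hat\lambda_{M,W} = \arg\min_{\lambda \ge 0} \hat R_{M,W}(\lambda)$. May equal $\infty$.
$\lim_{\lambda \to \infty} \hat\beta_{M,W}(\lambda)$ is well defined if $M$ is full rank or a projection.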
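
As a rough illustration of how the URE and its minimizer could be computed, here is a sketch in Python with NumPy. It searches a finite grid, so it cannot return the boundary case where the minimizer is infinite; the grid, the example data, and the function names are assumptions made for this sketch only.

```python
import numpy as np

def ure(lam, beta_hat, M, W, Sigma):
    """URE: ||Theta beta_hat - beta_hat||_W^2 + 2 tr(W Theta Sigma)."""
    n = len(beta_hat)
    Theta = np.linalg.inv(np.eye(n) + lam * np.linalg.solve(W, M.T @ M))
    resid = Theta @ beta_hat - beta_hat
    return resid @ W @ resid + 2.0 * np.trace(W @ Theta @ Sigma)

def select_lambda(beta_hat, M, W, Sigma, grid):
    """Grid minimizer of the URE; returns (lambda, URE value)."""
    values = np.array([ure(lam, beta_hat, M, W, Sigma) for lam in grid])
    k = int(np.argmin(values))
    return grid[k], values[k]

# Hypothetical example with a second-difference penalty.
rng = np.random.default_rng(0)
n = 15
M = np.diff(np.eye(n), n=2, axis=0)   # rows (1, -2, 1)
W = np.eye(n)
Sigma = 0.25 * np.eye(n)              # assumed known covariance
beta_hat = np.sin(np.linspace(0.0, np.pi, n)) + 0.5 * rng.standard_normal(n)

grid = np.concatenate(([0.0], np.logspace(-3, 4, 200)))
lam_hat, ure_min = select_lambda(beta_hat, M, W, Sigma, grid)
print("selected lambda:", lam_hat, "URE:", ure_min)
```

At lambda = 0 the first term vanishes and the URE reduces to 2 tr(W Sigma), which gives a quick sanity check on the implementation.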

[Figure: normalized estimated MSE $\hat R_P\big(\tfrac{x}{1-x}\big)$, $x \in [0, 1)$ (vertical axis) against $\lambda/(1+\lambda)$ (horizontal axis).] $y_t$: GZ excess bond premium. $x_t$: high-freq. FFF shock. Controls: 2 lags of $y_t$, $x_t$, log(cpi), log(ip), 1yrTreas. Sample: 1991-2012.

Optimal projection shrinkage
For projection shrinkage, can minimize the URE in closed form:
$$\hat\beta_P(\hat\lambda_P) = \left( 1 - \frac{\mathrm{tr}(\Sigma_P)}{\|P\hat\beta\|^2} \right)_+ P\hat\beta + (I_n - P)\hat\beta, \qquad \Sigma_P = P \Sigma P.$$
James-Stein shrinkage towards a linear subspace. Stein (1956); James & Stein (1961); Oman (1982a,b); Bock (1985)
Proposition (Hansen, 2016): If $\mathrm{tr}(\Sigma_P) > 4\rho(\Sigma_P)$,
$$E_\beta\big( \|\hat\beta_P(\hat\lambda_P) - \beta\|^2 \big) < E_\beta\big( \|\hat\beta - \beta\|^2 \big) \quad \text{for all } \beta.$$
Necessary condition: $\mathrm{rk}(P) > 4$. E.g., if $I_n - P$ is the projection onto $p$ basis functions, then need $n > p + 4$.
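
A small numerical sketch of the closed-form weight may help. It is written in Python with NumPy, assumes shrinkage toward the grand average and a hypothetical covariance of 0.25 I_n, and compares the closed form with a brute-force minimization of the URE over the weight 1/(1 + lambda). All names and numbers are illustrative.

```python
import numpy as np

def projection_shrinkage(beta_hat, P, Sigma):
    """Closed-form URE-optimal projection shrinkage; returns (estimate, weight)."""
    Pb = P @ beta_hat
    weight = max(0.0, 1.0 - np.trace(P @ Sigma @ P) / (Pb @ Pb))
    return weight * Pb + (beta_hat - Pb), weight

def dominance_condition(P, Sigma):
    """Checks tr(Sigma_P) > 4 rho(Sigma_P), with rho the largest eigenvalue."""
    Sigma_P = P @ Sigma @ P
    return np.trace(Sigma_P) > 4.0 * np.max(np.linalg.eigvalsh(Sigma_P))

n = 10
P = np.eye(n) - np.ones((n, n)) / n     # shrink toward the average
Sigma = 0.25 * np.eye(n)                # assumed known covariance
beta_hat = np.linspace(-1.5, 1.5, n)    # hypothetical estimate

shrunk, weight = projection_shrinkage(beta_hat, P, Sigma)

# Brute force: URE as a function of s = 1/(1+lambda), up to a constant.
s_grid = np.linspace(0.0, 1.0, 100001)
Pb = P @ beta_hat
ure_s = (1.0 - s_grid) ** 2 * (Pb @ Pb) + 2.0 * s_grid * np.trace(P @ Sigma)
s_best = s_grid[np.argmin(ure_s)]

print("closed-form weight:", weight, "grid minimizer:", s_best)
print("dominance condition holds:", dominance_condition(P, Sigma))
```

The implied shrinkage parameter is 1/weight - 1, which is infinite when the weight is zero.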


Hypothesis testing in shrinkage applications
$$H_0: R\beta = b, \qquad H_1: R\beta \ne b,$$
where $R \in \mathbb{R}^{r \times n}$ has full row rank.
No UMP test exists.
Wald test is UMP unbiased ($r = 1$), UMP invariant, and admissible.
If we're already using a shrinkage point estimator, we might want a hypothesis test tied to this estimator as well. Obtain CS by inversion.
My proposed test is biased and noninvariant, so it may achieve higher power than the usual Wald test for some DGPs.

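Empirical Bayes quasi-likelihood ratio test
Base hypothesis test on the (negative) quasi-log-likelihood
$$\hat L_{M,W}(\beta) = \|\beta - \hat\beta\|_W^2 + \hat\lambda_{M,W} \|M\beta\|^2.$$
Empirical Bayes (random effects) interpretation:
$$\beta \mid \text{data} \sim N\big( \hat\beta_{M,W}(\hat\lambda_{M,W}),\ (W + \hat\lambda_{M,W} M'M)^{-1} \big).$$
QLR test statistic of $R\beta = b$:
$$\min_{\beta:\, R\beta = b} \hat L_{M,W}(\beta) - \min_\beta \hat L_{M,W}(\beta) = \|R\hat\beta_{M,W}(\hat\lambda_{M,W}) - b\|^2_{(R(W + \hat\lambda_{M,W} M'M)^{-1} R')^{-1}}.$$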
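
The equality between the difference of minimized quasi-log-likelihoods and the weighted quadratic form can be verified numerically. The sketch below, in Python with NumPy, treats the shrinkage parameter as a given number (here an arbitrary 1.5) rather than the URE minimizer; the penalty matrix, the hypothesis, and the function names are hypothetical.

```python
import numpy as np

def quasi_loglik(beta, beta_hat, lam, M, W):
    """Negative quasi-log-likelihood ||beta - beta_hat||_W^2 + lam ||M beta||^2."""
    d = beta - beta_hat
    return d @ W @ d + lam * (M @ beta) @ (M @ beta)

def qlr_statistic(b, beta_hat, lam, R, M, W):
    """Quadratic form in R beta_shrunk - b with weight (R A^{-1} R')^{-1}."""
    A = W + lam * M.T @ M
    beta_shrunk = np.linalg.solve(A, W @ beta_hat)
    V = R @ np.linalg.solve(A, R.T)
    d = R @ beta_shrunk - b
    return d @ np.linalg.solve(V, d)

def constrained_minimizer(b, beta_hat, lam, R, M, W):
    """Minimizer of the quasi-log-likelihood subject to R beta = b."""
    A = W + lam * M.T @ M
    beta_shrunk = np.linalg.solve(A, W @ beta_hat)
    V = R @ np.linalg.solve(A, R.T)
    return beta_shrunk - np.linalg.solve(A, R.T) @ np.linalg.solve(V, R @ beta_shrunk - b)

# Hypothetical inputs.
rng = np.random.default_rng(0)
n, lam = 5, 1.5
M = np.diff(np.eye(n), axis=0)          # first differences as a stand-in penalty
W = np.eye(n)
R = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0, 0.0]])
b = np.array([0.2, -0.1])
beta_hat = rng.standard_normal(n)

beta_u = np.linalg.solve(W + lam * M.T @ M, W @ beta_hat)   # unconstrained minimizer
beta_c = constrained_minimizer(b, beta_hat, lam, R, M, W)
gap = quasi_loglik(beta_c, beta_hat, lam, M, W) - quasi_loglik(beta_u, beta_hat, lam, M, W)
print("difference of minima:", gap)
print("quadratic form:      ", qlr_statistic(b, beta_hat, lam, R, M, W))
```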

Null distribution impractical
$$LR_{M,W}(b) = \|R\hat\beta_{M,W}(\hat\lambda_{M,W}) - b\|^2_{(R(W + \hat\lambda_{M,W} M'M)^{-1} R')^{-1}}$$
Assume $\mathrm{Var}(RZ \mid MZ)$ is nonsingular, $Z \sim N_n(0, I_n)$. Then LR is well defined even when $\hat\lambda_{M,W} = \infty$. Holds in many cases.
If $\mathrm{Var}(RZ \mid MZ)$ is singular, can use the ad hoc LR statistic
$$\widetilde{LR}_{M,W}(b) = \|R\hat\beta_{M,W}(\hat\lambda_{M,W}) - b\|^2_{(R W^{-1} R')^{-1}}.$$
Practical problem: Null distribution of the LR statistic depends on $M\beta$.
Solution: Condition on a sufficient statistic for the $n - r$ nuisance parameters.

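Sufficient statistic for nuisance parameters
Define $\zeta = \Sigma R'(R \Sigma R')^{-1} \in \mathbb{R}^{n \times r}$ and $P = \zeta R \in \mathbb{R}^{n \times n}$ (on this slide $P$ denotes $\zeta R$, not the shrinkage projection matrix).
Statistic $\hat\nu = (I_n - P)\hat\beta$ is S-ancillary wrt. $R\beta$:
$$\hat\beta \mid \hat\nu \sim F_{R\beta, \Sigma}, \qquad \hat\nu \sim F_{(I_n - P)\beta, \Sigma}.$$
It would be uncontroversial to condition on $\hat\nu$ in the absence of prior information linking $R\beta$ and $(I_n - P)\beta$.
In practice, the prior information about $M\beta$ may not substantially constrain the relationship between $R\beta$ and $(I_n - P)\beta$. Then conditioning wastes little information. Severini (1995)
I condition on $\hat\nu$. Later: connection to Empirical Bayes HPD set.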

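Critical value by simulation
Conditional QLR test rejects $H_0$ if $LR_{M,W}(b) > q_{1-\alpha,M,W}(b, \hat\nu)$.
Conditional critical value given $\hat\nu = \nu$:
$$q_{1-\alpha,M,W}(b, \nu) = \text{quantile}_{1-\alpha}\Big( \|R\tilde\beta(\tilde\lambda(U); U) - b\|^2_{(R(W + \tilde\lambda(U) M'M)^{-1} R')^{-1}} \Big),$$
where $U \sim N_r(b, R \Sigma R')$,
$$\tilde\beta(\lambda; U) = \Theta_{M,W}(\lambda)(\zeta U + \nu) \quad \text{for all } \lambda \ge 0,$$
$$\tilde\lambda(U) = \arg\min_{\lambda \ge 0} \big\{ \|\tilde\beta(\lambda; U) - (\zeta U + \nu)\|_W^2 + 2\,\mathrm{tr}(W \Theta_{M,W}(\lambda) \Sigma) \big\}.$$
By design, the conditional (and thus unconditional) probability of not rejecting a true $H_0$ is exactly $1 - \alpha$, i.e., the test has exact size $\alpha$.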
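
The simulation recipe on this slide might be implemented along the following lines. This Python/NumPy sketch is specialized to projection shrinkage (W = I_n, M equal to an orthogonal projection), where the URE-minimizing weight has a closed form; the function name, the number of draws, and the example inputs are assumptions of the sketch, not taken from the slides.

```python
import numpy as np

def conditional_critical_value(b, nu, R, Sigma, P, alpha=0.1, n_draws=40000, seed=0):
    """Simulated (1 - alpha) quantile of the QLR statistic given nu_hat = nu."""
    rng = np.random.default_rng(seed)
    n = Sigma.shape[0]
    RSR = R @ Sigma @ R.T
    zeta = Sigma @ R.T @ np.linalg.inv(RSR)               # n x r
    U = rng.multivariate_normal(b, RSR, size=n_draws)     # draws of R beta_hat under H0
    beta_sim = U @ zeta.T + nu                            # zeta U + nu, one row per draw
    Pb = beta_sim @ P                                     # P is symmetric
    # URE-minimizing weight 1/(1 + lambda) in closed form, draw by draw.
    weight = np.maximum(0.0, 1.0 - np.trace(P @ Sigma @ P) / np.sum(Pb ** 2, axis=1))
    beta_shrunk = weight[:, None] * Pb + (beta_sim - Pb)
    d = beta_shrunk @ R.T - b                             # (n_draws, r)
    # For W = I and M = P: (W + lambda M'M)^{-1} = (I - P) + weight * P.
    V = R @ (np.eye(n) - P) @ R.T + weight[:, None, None] * (R @ P @ R.T)
    stats = np.sum(d * np.linalg.solve(V, d[:, :, None])[:, :, 0], axis=1)
    return np.quantile(stats, 1.0 - alpha)

# Hypothetical example: three parameters, shrinkage toward the average,
# hypothesis on the first coefficient.
n = 3
P = np.eye(n) - np.ones((n, n)) / n
R = np.array([[1.0, 0.0, 0.0]])
Sigma = np.eye(n)
zeta = Sigma @ R.T @ np.linalg.inv(R @ Sigma @ R.T)

far = np.array([0.0, 1000.0, -1000.0])    # far from the shrinkage target
near = np.array([0.0, 0.1, -0.1])         # close to the shrinkage target
b0 = np.array([0.0])
q_far = conditional_critical_value(b0, far - zeta @ (R @ far), R, Sigma, P)
q_near = conditional_critical_value(b0, near - zeta @ (R @ near), R, Sigma, P)
print("critical value far from target: ", q_far)
print("critical value near the target:", q_near)
```

When the nuisance statistic is far from the shrinkage target, the weight is essentially one and, with Sigma = I_n, the simulated statistic is essentially a chi-square(1) variable, so the critical value should be near the usual 90% chi-square quantile of about 2.71.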


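Confidence set by test inversion
Invert the CQLR test to obtain a CS for $b = R\beta$:
$$\hat C_{M,W} = \big\{ b \in \mathbb{R}^r : LR_{M,W}(b) \le q_{1-\alpha,M,W}(b, \hat\nu) \big\}.$$
Do this by grid search. Simulate the quantile at each grid point.
Feasible in one or two dimensions (projection shrinkage is fast). A uniform confidence band is described in the appendix.
If $M$ is full rank or a projection, can compute a simple, finite upper bound on the critical value (see the appendix slide on the bound on the critical value).
$\hat C_{M,W}$ is contained in a bounded ellipsoid centered at $R\hat\beta_{M,W}(\hat\lambda_{M,W})$. Limits the grid search.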
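
To make the grid search concrete, here is a deliberately simple Python/NumPy sketch for a one-dimensional parameter of interest under projection shrinkage. It reuses the same simulated shocks at every grid point; the grid width, the number of draws, the example numbers, and all function names are assumptions of this sketch rather than the author's implementation.

```python
import numpy as np

def shrink(beta_hat, P, Sigma):
    """Projection shrinkage at the URE-minimizing weight; returns (estimate, weight)."""
    Pb = P @ beta_hat
    weight = max(0.0, 1.0 - np.trace(P @ Sigma @ P) / (Pb @ Pb))
    return weight * Pb + (beta_hat - Pb), weight

def lr_stat(beta_hat, b, R, P, Sigma):
    """QLR statistic for H0: R beta = b (W = I, M = P)."""
    beta_s, weight = shrink(beta_hat, P, Sigma)
    n = len(beta_hat)
    V = R @ ((np.eye(n) - P) + weight * P) @ R.T
    d = R @ beta_s - b
    return float(d @ np.linalg.solve(V, d))

def confidence_set(beta_hat, R, P, Sigma, grid, alpha=0.1, n_draws=500, seed=0):
    """Grid points b whose QLR statistic is below the simulated conditional quantile."""
    rng = np.random.default_rng(seed)
    RSR = R @ Sigma @ R.T
    zeta = Sigma @ R.T @ np.linalg.inv(RSR)
    nu_hat = beta_hat - zeta @ (R @ beta_hat)
    shocks = rng.multivariate_normal(np.zeros(R.shape[0]), RSR, size=n_draws)
    accepted = []
    for b in grid:
        sims = [lr_stat(zeta @ (b + u) + nu_hat, b, R, P, Sigma) for u in shocks]
        if lr_stat(beta_hat, b, R, P, Sigma) <= np.quantile(sims, 1.0 - alpha):
            accepted.append(b)
    return np.array(accepted)

# Hypothetical bivariate example: shrink toward the average, interest in beta_1.
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
P = np.eye(2) - np.ones((2, 2)) / 2
R = np.array([[1.0, 0.0]])
beta_hat = np.array([1.5, -0.5])

point = R @ shrink(beta_hat, P, Sigma)[0]
grid = point + np.linspace(-4.0, 4.0, 41)[:, None]   # grid centered at the point estimate
cs = confidence_set(beta_hat, R, P, Sigma, grid)
print("shrinkage point estimate:", point[0])
print("accepted grid range:", cs.min(), "to", cs.max())
```

Centering the grid at the shrinkage point estimate guarantees at least one accepted point, since the statistic is zero there. The accepted set need not be an interval, so reporting only its minimum and maximum can overstate it.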

Properties of shrinkage confidence set
1. $\hat C_{M,W}$ always contains the shrinkage point estimate.
2. Generally not symmetric around the point estimate.
3. Not always convex.
4. Converges a.s. to the usual Wald ellipsoid as $\|M\beta\| \to \infty$, $M$ fixed.
5. Expected volume depends on $\beta$ only through $M\beta$.
Appears difficult to characterize expected volume. Even for projection shrinkage, the conditional power of the CQLR test depends on 6 parameters.

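Empirical Bayes HPD set
$$\hat L_{M,W}(\beta) = \|\beta - \hat\beta\|_W^2 + \hat\lambda_{M,W} \|M\beta\|^2,$$
$$\beta \mid \text{data} \sim N\big( \hat\beta_{M,W}(\hat\lambda_{M,W}),\ (W + \hat\lambda_{M,W} M'M)^{-1} \big).$$
Empirical Bayes $1 - \alpha$ Highest Posterior Density set for $R\beta$:
$$\hat C_{EB} = \big\{ b \in \mathbb{R}^r : LR_{M,W}(b) \le \chi^2_{r,1-\alpha} \big\}.$$
Doesn't control frequentist coverage.
Like the shrinkage CS, but with a non-random critical value.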

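Minimum coverage discrepancy with EB HPD set
Symmetric set difference: $A \,\Delta\, B = (A \cup B) \setminus (A \cap B)$.
Proposition (following Andrews & Mikusheva, 2016)
Let $C$ be any similar confidence set for $R\beta$ (like $\hat C_{M,W}$):
$$P_\beta(R\beta \in C) = 1 - \alpha \quad \text{for all } \beta \in \mathbb{R}^n.$$
Then
$$P_\beta\big( R\beta \in \hat C_{M,W} \,\Delta\, \hat C_{EB} \big) \le P_\beta\big( R\beta \in C \,\Delta\, \hat C_{EB} \big) \quad \text{for all } \beta \in \mathbb{R}^n.$$
A proof sketch is given in the appendix.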


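Illustration: bivariate shrinkage toward average
Bivariate model, projection shrinkage toward average: Lindley (1962)
$$\hat\beta = \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} \sim N_2\left( \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right),$$
$$e_1' \hat\beta_P(\hat\lambda_P) = \frac{\hat\beta_1 + \hat\beta_2}{2} + \left( 1 - \frac{2(1-\rho)}{(\hat\beta_1 - \hat\beta_2)^2} \right)_+ \frac{\hat\beta_1 - \hat\beta_2}{2}.$$
Parameter of interest: $\beta_1$.
Both the MSE of the shrinkage estimator and the expected length of the shrinkage CI depend on the DGP only through $\beta_2 - \beta_1$ and $\rho$.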

Illustration: bivariate shrinkage toward average
[Figure, two panels: RMSE (top) and average length of 90% CI (bottom), horizontal axis from 0 to 8, with three curves labeled 0.0, 0.3, and 0.9 (presumably values of $\rho$).]

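Simulation study of confidence intervals
$$\hat\beta \sim N_n(\beta, \Sigma), \qquad \beta_i = \begin{cases} 1 - \dfrac{i-1}{n-1} & \text{if } K = 0, \\[1ex] \sin\dfrac{2\pi K (i-1)}{n-1} & \text{if } K > 0, \end{cases}$$
$$\Sigma_{ij} = \sigma_i \sigma_j \kappa^{|i-j|}, \qquad \sigma_i = \sigma_0 \left( 1 + (i-1)\frac{\phi - 1}{n-1} \right).$$
Consider projection shrinkage toward a quadratic polynomial.
Lower bound on expected length relative to Wald CI: Pratt (1961)
$$\frac{(1-\alpha)\Phi^{-1}(1-\alpha) + (2\pi)^{-1/2} e^{-\frac{1}{2}(\Phi^{-1}(1-\alpha))^2}}{\Phi^{-1}(1-\alpha/2)} \approx 0.808 \quad \text{for } \alpha = 0.1.$$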
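
The numerical value of the lower bound on relative expected length can be reproduced with the Python standard library alone. This is a small sketch; the function name is illustrative.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def pratt_length_bound(alpha):
    """Lower bound on expected CI length relative to the two-sided Wald interval."""
    z = NormalDist().inv_cdf(1.0 - alpha)            # one-sided normal quantile
    z_two_sided = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    density = exp(-0.5 * z * z) / sqrt(2.0 * pi)     # standard normal pdf at z
    return ((1.0 - alpha) * z + density) / z_two_sided

print(round(pratt_length_bound(0.1), 3))  # 0.808
```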

Simulation study of confidence intervals
                               MSE of shrinkage est.   Length of CI
  n     K     κ     σ_0    ϕ     Tot    1st    Mid       1st    Mid
 10    0.5   0.5   0.25    1    0.63   0.95   0.56      0.97   0.85
 25    0.5   0.5   0.25    1    0.34   0.69   0.29      0.88   0.86
 50    0.5   0.5   0.25    1    0.19   0.46   0.16      0.83   0.88
 25    0     0.5   0.25    1    0.34   0.68   0.29      0.87   0.86
 25    1     0.5   0.25    1    0.93   1.29   0.77      1.10   0.88
 25    2     0.5   0.25    1    0.96   0.93   0.86      0.98   0.90
 25    0.5   0     0.25    1    0.16   0.35   0.13      0.83   0.84
 25    0.5   0.9   0.25    1    0.81   1.11   0.76      1.05   0.91
 25    0.5   0.5   0.5     1    0.34   0.66   0.28      0.88   0.86
 25    0.5   0.5   0.25    3    0.35   1.19   0.30      0.96   0.85
MSE of $\hat\beta(\hat\lambda)$ relative to $\hat\beta$, average length of $\hat C$ relative to Wald. Conf. level = 90%. 1st = $\beta_1$, Mid = $\beta_{1+[n/2]}$.

Simulation study of 2-D confidence sets
Same design, but now construct a 2-D confidence set for $(\beta_1, \beta_{1+[n/2]})$.
Lower bound on expected area relative to Wald ellipse: Pratt (1961); Brown, Casella & Hwang (1995)
$$\frac{2 \int_0^\infty r\, \Phi\big( \Phi^{-1}(1-\alpha) - r \big)\, dr}{\chi^2_{1-\alpha,2}} \approx 0.565 \quad \text{for } \alpha = 0.1.$$
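
The area bound can also be checked numerically using only the Python standard library. The sketch below applies a midpoint rule truncated at an assumed upper limit of 12, where the integrand is negligible, and uses the fact that a chi-square variable with 2 degrees of freedom is exponential with mean 2, so its quantile is -2 log(alpha). The function name and the discretization are illustrative choices.

```python
from math import log
from statistics import NormalDist

def area_bound(alpha, upper=12.0, steps=20000):
    """Lower bound on expected area relative to the Wald ellipse (2-D case)."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha)
    h = upper / steps
    # Midpoint rule for the integral of r * Phi(z - r) over [0, upper].
    integral = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        integral += r * std_normal.cdf(z - r)
    integral *= h
    chi2_quantile = -2.0 * log(alpha)   # (1 - alpha) quantile of chi-square(2)
    return 2.0 * integral / chi2_quantile

print(round(area_bound(0.1), 3))  # 0.565
```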

Simulation study of 2-D confidence sets
                                    Area
  n     K     κ     σ_0    ϕ     Ĉ      Ĉ adhoc
 10    0.5   0.5   0.25    1    0.91    0.88
 25    0.5   0.5   0.25    1    0.86    0.76
 50    0.5   0.5   0.25    1    0.81    0.70
 25    0     0.5   0.25    1    0.84    0.76
 25    1     0.5   0.25    1    1.01    1.02
 25    2     0.5   0.25    1    0.90    0.94
 25    0.5   0     0.25    1    0.69    0.70
 25    0.5   0.9   0.25    1    1.32    1.05
 25    0.5   0.5   0.5     1    0.85    0.76
 25    0.5   0.5   0.25    3    1.20    0.86
Average area relative to Wald. Conf. level = 90%.

Takeaways from simulations
Shrinkage CS works well when the shrinkage point estimator works well.
Shrinkage may be harmful when...
1. $M\beta$ conveys little info about $R\beta$.
2. $\|M\beta\|$ is neither small nor large.
3. Correlations are high.
4. Variance of MLE of nuisance parameters is large relative to variance of MLE of parameter of interest (e.g., small $n$).


Uniform asymptotic size control
CQLR test achieves uniform asymptotic size control, provided $\hat\beta$ is uniformly asymptotically normal and $\hat\Sigma$ is uniformly consistent for $\Sigma$.
Uniform frequentist validity contrasts with other approaches:
  Undersmoothing: Pretend $\lambda$ is small, ignore the bias of the shrinkage estimator as well as the variability in $\hat\lambda$.
  Switching rule: Use Wald SE if $\|M\hat\beta\| > c$, otherwise use asymptotics under the assumption $M\beta = 0$.
  Random effects: Treat the random effects assumption as part of the DGP rather than just a prior. Size control wrt. random effects distribution.

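Assumption: Preliminary estimator well-behaved
Assumption
Define $\mathcal{S} = \{ A \in S^n_+ : \underline{c} \le 1/\rho(A^{-1}) \le \rho(A) \le \bar{c} \}$ for fixed $\underline{c}, \bar{c} > 0$.
The distribution of the data $F_T$ for sample size $T$ is indexed by three parameters $\beta \in B \subseteq \mathbb{R}^n$, $\Sigma \in \mathcal{S}$, and $\gamma \in \Gamma$.
The estimators $(\hat\beta, \hat\Sigma) \in \mathbb{R}^n \times S^n_+$ satisfy the following: For every sequence $\{\beta_T, \Sigma_T, \gamma_T\}_{T \ge 1} \in B \times \mathcal{S} \times \Gamma$ and every subsequence $\{k_T\}_{T \ge 1}$ of $\{T\}_{T \ge 1}$, there exists a further subsequence $\{\tilde k_T\}_{T \ge 1}$ such that, under $F_{\tilde k_T}(\beta_{\tilde k_T}, \Sigma_{\tilde k_T}, \gamma_{\tilde k_T})$ and as $T \to \infty$,
$$\sqrt{\tilde k_T}\, \hat\Sigma^{-1/2} (\hat\beta - \beta_{\tilde k_T}) \overset{d}{\to} N_n(0, I_n), \qquad (\hat\Sigma - \Sigma_{\tilde k_T}) \overset{p}{\to} 0.$$
$S^n_+$ = set of symmetric positive definite $n \times n$ matrices.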

Shrinkage test is uniformly valid
Let $\widehat{LR}$ and $\hat q_{1-\alpha}$ denote the CQLR test statistic and quantile obtained by plugging in $T^{-1}\hat\Sigma$ in place of $\Sigma$. (Suppress $M$, $W$.)
Proposition
Let the previous assumption hold. Assume either $\mathrm{rk}(M) = m$ or $M = P$. Assume also $\mathrm{Var}(RZ \mid MZ)$ is nonsingular, $Z \sim N_n(0, I_n)$. Then
$$\liminf_{T \to \infty}\ \inf_{(\beta, \Sigma, \gamma) \in B \times \mathcal{S} \times \Gamma} \mathrm{Prob}_{F_T(\beta, \Sigma, \gamma)}\big( \widehat{LR}(R\beta) \le \hat q_{1-\alpha}(R\beta, \hat\nu) \big) = 1 - \alpha.$$
Caveat: I have only written down the full proof for projection shrinkage. I believe I have the arguments worked out for the general case.
Proof idea: Consider drifting parameters $\beta_T$...
1. If $\sqrt{T}\,\|M\beta_T\| \to \infty$, we converge to the non-shrinkage case.
2. If $\sqrt{T}\,\|M\beta_T\|$ is bounded, we're in the Gaussian model in the limit.


Treatment effect heterogeneity
NSW job training experiment. Lalonde (1986); Dehejia & Wahba (1999)
Outcome: earnings (absolute dollars) 3 years after treatment assignment.
297 treated, 425 control.
Bin subjects by age decile. 52-98 subjects per bin.
$\hat\beta \in \mathbb{R}^{10}$: ATE estimate by bin.
Projection shrinkage toward the average of $\hat\beta$.

Treatment effect heterogeneity: confidence intervals
[Figure, two panels: confidence intervals for the ATE by age bin (17-18, 19, 20-21, 22, 23-24, 25-26, 27-28, 29-32, 33+).] Conf. level = 90%. Vertical axis = ATE ($), horizontal axis = age (years).

Treatment effect heterogeneity: 2-D confidence set
[Figure: 2-D confidence set, vertical axis "ages 33+", horizontal axis "ages 17-".] Conf. level = 90%. Axes = ATE ($). Ad hoc QLR statistic.

MIDAS forecasting
Predict monthly PCE inflation using daily commodity prices, 1991:2-2017:2.
MIDAS specification (lag lengths chosen by AIC):
$$p_{PCE,t} = \mu + \sum_{l=1}^{6} \gamma_l\, p_{PCE,t-l} + \sum_{j=1}^{25} \beta_j z_{t,j} + \varepsilon_t.$$
$z_{t,j}$: $j$-th daily observation of log Bloomberg commodity price index (BCOM) on or after the 1st day of month $t$.
$\hat\beta \in \mathbb{R}^{25}$: least-squares estimator.
Projection shrinkage toward a straight line. Breitung & Roling (2015)

MIDAS forecasting: confidence intervals
[Figure, two panels: confidence intervals by lag, 0 to 25 days.] Conf. level = 90%. Vertical axis = inflation (log points), horizontal axis = lags (days).

MIDAS forecasting: 2-D confidence set
[Figure: 2-D confidence set.] Conf. level = 90%. Axes = inflation (log points).


Summary
Considered a setting where a generalized ridge regression point estimator is of interest: smoothing, shrinking toward average, etc.
Proposed conditional QLR test based on the same quasi-log-likelihood as the shrinkage point estimator.
  Exact conditional size in Gaussian location model.
  Asymptotic uniform size control more generally.
Shrinkage confidence set by test inversion.
  Contains the shrinkage point estimate.
  Minimum coverage discrepancy with EB HPD set among similar CSs.
  Computationally feasible in 1-2 dimensions. Projection shrinkage is fast.
Promising simulation evidence.

Thank you

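Non-standard asymptotics: example
$$\hat\beta \sim N_n(\beta, T^{-1} I_n)$$
James-Stein estimator of $\beta \in \mathbb{R}^n$:
$$\hat\beta_{JS} = \left( 1 - \frac{n-2}{T \|\hat\beta\|^2} \right) \hat\beta.$$
If $\beta \ne 0$: $\sqrt{T}(\hat\beta_{JS} - \beta) \overset{d}{\to} N_n(0, I_n)$.
If $\beta = 0$: $\sqrt{T}(\hat\beta_{JS} - \beta) \overset{d}{\to} \left( 1 - \dfrac{n-2}{\|Z\|^2} \right) Z$, $Z \sim N_n(0, I_n)$.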
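
The discontinuity can be seen in a quick Monte Carlo experiment. The Python/NumPy sketch below compares the scaled squared error of the James-Stein estimator at the origin and away from it; the dimension, sample size, number of draws, and function names are arbitrary choices for illustration. At the origin the expected scaled squared error equals 2 for any n of at least 3, whereas for a fixed nonzero parameter and large T it is close to n.

```python
import numpy as np

def james_stein(beta_hat, T):
    """James-Stein estimator applied row-wise to draws of beta_hat."""
    n = beta_hat.shape[-1]
    norm2 = np.sum(beta_hat ** 2, axis=-1, keepdims=True)
    return (1.0 - (n - 2) / (T * norm2)) * beta_hat

def scaled_risk(beta, T, n_draws, rng):
    """Monte Carlo estimate of E || sqrt(T) (beta_JS - beta) ||^2."""
    n = len(beta)
    beta_hat = beta + rng.standard_normal((n_draws, n)) / np.sqrt(T)
    err = np.sqrt(T) * (james_stein(beta_hat, T) - beta)
    return float(np.mean(np.sum(err ** 2, axis=1)))

rng = np.random.default_rng(0)
n, T, n_draws = 10, 10_000, 200_000
risk_at_zero = scaled_risk(np.zeros(n), T, n_draws, rng)
risk_away = scaled_risk(np.ones(n), T, n_draws, rng)
print("scaled risk at beta = 0:    ", risk_at_zero)
print("scaled risk at beta = (1..1):", risk_away)
```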

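URE captures bias/variance tradeoff
$W = I_n$ for simplicity.
Risk decomposition: Claeskens & Hjort (2003)
$$R_{M,I_n}(\lambda) = \underbrace{\mathrm{tr}\big\{ [I_n - \Theta_{M,I_n}(\lambda)]^2 \beta\beta' \big\}}_{\text{bias squared}} + \underbrace{\mathrm{tr}\big\{ \Theta_{M,I_n}(\lambda)^2 \Sigma \big\}}_{\text{variance}}.$$
Unbiased estimate: $\beta\beta' = E(\hat\beta\hat\beta') - \Sigma$.
Plug in $\hat\beta\hat\beta' - \Sigma$ for $\beta\beta'$:
$$\mathrm{tr}\big\{ [I_n - \Theta_{M,I_n}(\lambda)]^2 (\hat\beta\hat\beta' - \Sigma) \big\} + \mathrm{tr}\big\{ \Theta_{M,I_n}(\lambda)^2 \Sigma \big\} = \hat R_{M,I_n}(\lambda) - \mathrm{tr}(\Sigma).$$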
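
The plug-in identity on this slide is purely algebraic, so it can be confirmed on arbitrary inputs. The following Python/NumPy sketch uses randomly generated, hypothetical values for the estimate and the covariance matrix, with W = I_n as on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 6, 2.0
M = np.diff(np.eye(n), n=2, axis=0)        # second-difference penalty, rows (1, -2, 1)
B = rng.standard_normal((n, n))
Sigma = B @ B.T / n + np.eye(n)            # arbitrary positive definite covariance
beta_hat = rng.standard_normal(n)

I = np.eye(n)
Theta = np.linalg.inv(I + lam * M.T @ M)   # symmetric because W = I_n

# URE as defined on the earlier slide (with W = I_n).
ure = np.sum((Theta @ beta_hat - beta_hat) ** 2) + 2.0 * np.trace(Theta @ Sigma)

# Plug-in version of the bias/variance decomposition.
bias_part = np.trace((I - Theta) @ (I - Theta) @ (np.outer(beta_hat, beta_hat) - Sigma))
variance_part = np.trace(Theta @ Theta @ Sigma)
plug_in = bias_part + variance_part

print("plug-in decomposition:", plug_in)
print("URE minus tr(Sigma):  ", ure - np.trace(Sigma))
```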

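Bound on critical value
Triangle inequality:
$$\sqrt{LR_{M,W}(R\beta)} \le \|R(\hat\beta_{M,W}(\hat\lambda_{M,W}) - \hat\beta)\|_{V(\hat\lambda)^{-1}} + \|R(\hat\beta - \beta)\|_{V(\hat\lambda)^{-1}},$$
where $V(\lambda) = R(W + \lambda M'M)^{-1}R'$ and $\hat\lambda = \hat\lambda_{M,W}$, so that $LR_{M,W}(b) = \|R\hat\beta_{M,W}(\hat\lambda) - b\|^2_{V(\hat\lambda)^{-1}}$.
Let $Z \sim N_n(0, W^{-1})$. For any $\beta \in \mathbb{R}^n$ and symmetric p.d. $A \in \mathbb{R}^{n \times n}$,
$$\|R(\beta - \hat\beta)\|^2_{V(\hat\lambda)^{-1}} \le \|\beta - \hat\beta\|^2_A\ \rho\big( R A^{-1} R'\, \mathrm{Var}(RZ \mid MZ)^{-1} \big).$$
Since $\hat R_{M,W}(\hat\lambda_{M,W}) \le \hat R_{M,W}(0)$,
$$\|\hat\beta_{M,W}(\hat\lambda_{M,W}) - \hat\beta\|_W^2 \le 2\,\mathrm{tr}\big\{ M \Sigma M' (M W^{-1} M')^{-1} \big\}.$$
Under the null hypothesis that $R\beta$ equals its true value, $\|R(\hat\beta - \beta)\|^2_{(R \Sigma R')^{-1}} \sim \chi^2(r)$.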

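Uniform confidence band
Supremum test statistic of $H_0: \beta_i = \beta_i$, $i = 1, \dots, n$ (each coefficient equal to its hypothesized value):
$$\widehat{SLR}_{M,W}(\beta) = \sup_{i = 1, \dots, n} \frac{\big| \hat\beta_{i,M,W}(\hat\lambda_{M,W}) - \beta_i \big|}{\sqrt{e_i' (W + \hat\lambda_{M,W} M'M)^{-1} e_i}}.$$
Simulate null critical value $q_{1-\alpha,M,W}(\beta)$ for any $\beta$.
Simultaneous confidence band: rectangular envelope of inverted test.
$$C_{M,W} = \prod_{i=1}^{n} \left[ \inf_{\beta:\ \widehat{SLR}(\beta) \le q_{1-\alpha}(\beta)} \beta_i,\ \sup_{\beta:\ \widehat{SLR}(\beta) \le q_{1-\alpha}(\beta)} \beta_i \right].$$
Computationally challenging. Can sample from band. Inoue & Kilian (2016)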

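Coverage discrepancy: proof sketch
Proof is a confidence set reinterpretation of the Andrews & Mikusheva (2016) result on conditional testing.
$$P_\beta\big( R\beta \in C \,\Delta\, \hat C_{EB} \big) = \underbrace{E_\beta\big[ 1(R\beta \in C) \big]}_{= 1 - \alpha} + E_\beta\big[ 1(R\beta \in \hat C_{EB}) \big] - 2 E_\beta\big[ 1(R\beta \in C)\, 1(R\beta \in \hat C_{EB}) \big]$$
$$P_\beta\big( R\beta \in C \,\Delta\, \hat C_{EB} \big) - P_\beta\big( R\beta \in \hat C_{M,W} \,\Delta\, \hat C_{EB} \big)$$
$$= 2 E_\beta\big[ \big\{ 1(R\beta \in \hat C_{M,W}) - 1(R\beta \in C) \big\}\, 1(R\beta \in \hat C_{EB}) \big]$$
$$= 2 E_\beta\big[ \big\{ 1(R\beta \in \hat C_{M,W}) - 1(R\beta \in C) \big\}\, 1\big( LR_{M,W}(R\beta) \le \chi^2_{r,1-\alpha} \big) \big]$$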

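Similarity of $C$ and completeness of the Gaussian family imply conditional similarity (like $\hat C_{M,W}$):
$$P_\beta\big( R\beta \in C \mid \hat\nu \big) = 1 - \alpha.$$
By the law of iterated expectations,
$$E_\beta\big[ \big\{ 1(R\beta \in \hat C_{M,W}) - 1(R\beta \in C) \big\}\, 1\big( q_{1-\alpha,M,W}(R\beta, \hat\nu) \le \chi^2_{r,1-\alpha} \big) \big] = 0.$$
Hence
$$P_\beta\big( R\beta \in C \,\Delta\, \hat C_{EB} \big) - P_\beta\big( R\beta \in \hat C_{M,W} \,\Delta\, \hat C_{EB} \big)$$
$$= 2 E_\beta\Big[ \big\{ 1(R\beta \in \hat C_{M,W}) - 1(R\beta \in C) \big\} \big\{ 1\big( LR_{M,W}(R\beta) \le \chi^2_{r,1-\alpha} \big) - 1\big( q_{1-\alpha,M,W}(R\beta, \hat\nu) \le \chi^2_{r,1-\alpha} \big) \big\} \Big].$$
The variable inside the expectation is a.s. nonnegative by definition of $\hat C_{M,W}$.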