arxiv: v2 [math.st] 31 Oct 2017

Size: px

Start display at page:

Download "arxiv: v2 [math.st] 31 Oct 2017"

Justina April Stephens
5 years ago
Views:

1 arxiv: v2 [math.st] 31 Oct 2017 Unbiased Shrinkage Estimation Jann Spiess Working Paper This version: October 31, 2017 Abstract Shrinkage estimation usually reduces variance at the cost of bias. But when we care only about some parameters of a model, I show that we can reduce variance without incurring bias if we have additional information about the distribution of covariates. In a linear regression model with homoscedastic Normal noise, I consider shrinkage estimation of the nuisance parameters associated with control variables. For at least three control variables and exogenous treatment, I establish that the standard least-squares estimator is dominated with respect to squared-error loss in the treatment effect even among unbiased estimators and even when the target parameter is low-dimensional. I construct the dominating estimator by a variant of James Stein shrinkage in a high-dimensional Normal-means problem. It can be interpreted as an invariant generalized Bayes estimator with an uninformative (improper Jeffreys prior in the target parameter. Introduction Many inference tasks have the following feature: the researcher wants to obtain a high-quality estimate of a set of target parameters (for example, a Jann Spiess, Department of Economics, Harvard University, jspiess@fas.harvard.edu. I thank Gary Chamberlain, Maximilian Kasy, Carl Morris, and Jim Stock for insightful conversations, and seminar participants at Harvard for helpful comments. 1

2 set of treatment effects in an RCT, but also estimates a number of nuisance parameters she does not care about separately (for example, coefficients on control variables. In these cases, can we reduce variance in the estimation of a target parameter without inducing bias by shrinking in the estimation of possibly high-dimensional nuisance parameters? In a linear regression model with homoscedastic, Normal noise, I show that a natural application of James Stein shrinkage to the parameters associated with at least three control variables reduces loss in the possibly low-dimensional treatment effect parameter without producing bias provided that treatment is random. The proposed estimator effectively averages between regression models with and without control variables, similar to the Hansen (2016 modelaveraging estimator and coinciding up to a degrees-of-freedom correction with the corresponding Mallows estimator from Hansen (2007. For the specific choice of shrinkage, I contribute three finite-sample properties: First, I note that by averaging over the distribution of controls we obtain dominance of the shrinkage estimator even for low-dimensional target parameters, unlike other available results that require a loss function that is at least three-dimensional. Second, I establish that the resulting estimator remains unbiased under exogeneity of treatment. Third, I conceptualize it as a two-step estimator with a first-stage prediction component. Fourth, I show that it can be seen as a natural, invariant generalized Bayes estimator with respect to a partially improper prior corresponding to uninformativeness in the target parameter. The linear regression model is set up in Section 1. Section 2 proposes the estimator and establishes loss improvement relative to a benchmark OLS estimator provided treatment is exogenous. Section 3 motivates the estimator as an invariant generalized Bayes estimator (with respect to an improper prior in a suitably transformed many-means problem. 2

3 1 Linear Regression Setup I consider estimation of the structural parameter β R k in the canonical linear regression model Y i = α+x i β +W i γ +U i (1 from n iid observations (Y i,x i,w i, where X i R m are the regressors of interest, W i R k control variables, and U i R is homoscedastic, Normal noise. α is an intercept, 1 and γ is a nuisance parameter. To obtain identification of β in Equation (1, I assume that U i is orthogonal to X i and W i (no omitted variables. Throughout this document, I write upper-case letters for random variables (such as Y i and lower-case letters for fixed values (such as when I condition on X i = x i. When I suppress indices, I refer to the associated vector or matrix of observations, e.g. Y R n is the vector of outcome variables Y i and X R n m is the matrix with rows X i. 2 Two-Step Partial Shrinkage Estimator By assumption there are control variables W available with Y X=x,W=w N(1α+xβ +wγ,σ 2 I n where σ 2 need not be known. We care about the (possibly high-dimensional nuisance parameter γ only in so far as it helps us to estimate the (typically low-dimensional target parameter β, which is our object of interest. 2.1 A canonical form that preserves structure Given x R n m and w R n k, where we assume that (1,x,w has full rank 1 + m + k n, let q = (q 1,q x,q w,q r R n n orthonormal where 1 We could alternatively include a constant regressor in X i and subsume α in β. I choose to treat α separately since I will focus on the loss in estimating β, ignoring the performance in recovering the intercept α. 3

4 q 1 R n,q x R n m,q w R n k such that 1 is in the linear subspace of R n spanned by q 1 R n (that is, q 1 {1/ n, 1/ n}, the columns of (1,x are in the space spanned by the columns of (q 1,q x, and the columns of (1,x,w are in the space spanned by the columns of (q 1,q x,q w. (Such a basis exists, for example, by an iterated singular value decomposition. Then, q 1 1α+q 1 xβ +q 1 wγ Y = q Y X=x,W=w N q x xβ +q x wγ q w wγ,σ2 I n. 0 n 1 m k Writing Y x, Y w,y r for the appropriate subvectors of Y, we find, in particular, that Yx µ x +aµ w Yw X=x,W=w N µ w,σ 2 I n 1 Yr 0 n 1 m k where µ x = q xxβ R m, µ w = q wwγ R k, and a = q xw(q ww 1 R m k. 2 In transforming linear regression to this Normal-means problem, as well as in partitioning the coefficient vector into two groups, for only one of which I will propose shrinkage, I follow Sclove ( Two-step estimator Conditional on X=x,W=w and given an estimator ˆµ w = ˆµ w (Yw,Y r of µ w, a natural estimator of µ x is ˆµ x = ˆµ x (Yx,Y w,y r = Yx aˆµ w. An estimator of β is obtained by setting ˆβ = (q xx 1ˆµ x. (The linear least-squares estimator for β is obtained from ˆµ w = Y w. A natural loss function for ˆβ that represents prediction loss units is the weighted loss(ˆβ β (x q x q x x(ˆβ β = ˆµ x µ x 2. We can therefore focus on the (conditional expected squared- 2 Alternatively, we could have denoted by µ x the mean of Y x. However, by separating out µ x from aµ w I feel that the role of µ w as a relevant nuisance parameter becomes more transparent. 4

5 error loss in estimating µ x, for which we find E[ ˆµ x µ x 2 X=x,W=w] = mσ 2 + E[ ˆµ w µ w 2 a a X=x,W=w] with the seminorm v a a = v a av on R k. For high-dimensional µ w (k 3, a natural estimator ˆµ w with low expected squared-error loss is a shrinkage estimator of the form ˆµ w = CYw with scalar C, such as the James and Stein (1961 estimator for which C = 1 (k 2 Y r 2 (or its positive part. While improving with respect to (n m k+1 Yw 2 expected squared-error loss (a a = const. I k, this specific estimator may yield higher (conditional expected loss in µ x when the implied loss function for µ w deviates from squared-error loss (a a const. I k, so the loss function is not invariant under rotations. We will show below that it is still appropriate in the case of independence of treatment and control. 2.3 From conditional to unconditional loss For conditional inference it is known that the least-squares estimator is admissible for estimating β provided m 2 and inadmissible provided m 3 no matter what the dimensionality k of the nuisance parameter γ is (James and Stein, 1961, as the rank of the loss function is decisive. The above construction does not provide a counter-example to this result: the rank of a a = (w q w 1 w q x q xw(q ww 1 is at most m, so for m 2, ˆµ w = Y w remains admissible for the loss function on the right. While we could achieve improvements for m 3 through shrinkage in ˆµ w and/or directly in ˆµ x our interest is in the case where m is low and k is high. Conditional on X=x,W=w we can thus not hope to achieve improvements that hold for any (β,γ, but we can still hope that shrinkage estimation of µ w yields better estimates of β on average over draws of the data. To this end, assume that vec(w X=x N(vec(1α W +xβ W,Σ W I n (that is, W i X=x iid N(1α W +x i β W,Σ W. Here, Σ W R k k is symmetric 5

6 positive-definite (but not necessarily known. α W R 1 k,β W R m k describe the conditional expectation of control variables given the regressors X=x. The case where x and W are orthogonal (β W = O m k and controls W thus not required for identification will play a special role below. Given X=x, assume (q 1,q x is deterministic, and fix q such that q = (q 1,q x,q R n n is orthonormal. Note that vec((q x,q W X=x N = ( vec (( q x xβ W O n 1 m k,σ W I n 1 In particular, q x W q W. It follows with (( (q x,q q Y X=x,W=w N xxβ +q xwγ q wγ,σ 2 I n 1 that indeed q x(y,w = q (Y,W. Conditional on W=w in the above derivation, aˆµ w aµ w = q xwˆγ q xwγ for ˆγ = (q ww 1ˆµ w a function of q w and (Y w,y r = (q wq,q rq (q Y, so ˆγ = ˆγ(q Y,q w. Assuming measurability, ˆγ(q Y,q W (q xy,q xw. Now writing ˆγ = ˆγ(q y,q w this implies that =. E[ ˆµ w µ w 2 a a X=x,q (Y,W = q (y,w] = ˆγ γ 2 E[W q xq x W X=x] with E[W q x q x W X=x] = β W x q x q x xβ W + mσ W of full rank k. For the expectation of the implied ˆβ, we find E[ˆβ X=x,q (Y,W = q (y,w] = β β W(ˆγ γ. We obtain the following characterization of conditional bias and squarederror loss of the implied estimator ˆβ: Lemma 1 (Properties of the two-step estimator. Let (Ỹ, W be jointly 6

7 distributed according as vec( W N(0 k(n 1 m,σ W I n 1 m, Ỹ W = w N( wγ,σ 2 I n 1 m, and write Ẽ for the corresponding expectation operator. For any measurable estimator ˆγ : R n m 1 R n m 1 k R k with Ẽ[ ˆγ(Ỹ, W 2 ] <, the estimator ˆβ(y,w = (q x x 1 q x y (q x x 1 q x wˆγ(q y,q w defined for convenience for fixed x, has conditional bias E[ˆβ(Y,W X=x] β = β W (Ẽ[ˆγ(Ỹ, W] γ and expected (prediction-norm loss E[ ˆβ(Y,W β 2 x q xq xx X=x] = mσ2 + Ẽ[ ˆγ(Ỹ, W γ 2 φ ] for φ = β W x q x q xxβ W +mσ W. Note that this lemma does not rely on n 1 + m + k, and indeed generalizes to the case n > 1+m for any k 1, including k > n. 2.4 Exogenous treatment We consider the special case where treatment is exogenous, and thus β W = O m k. This assumption could be justified, for example, in a randomized trial. Note that in this case in addition to the linear least-squares estimator in the long regression that includes controls W another natural unbiased (conditional on X=x estimator is available, namely the coefficient (q x x 1 q x Y in the short regression without controls. The long and short regression represent special (edge cases in the class of two-step estimators introduced above, which are all unbiased in that sense under the exogeneity assumption: Corollary 1 (A class of unbiased two-step estimators. If β W = O m k then 7

8 for any ˆγ and ˆβ as in Lemma 1 E[ˆβ(Y,W X=x] = β. Furthermore, E[ ˆβ(Y,W β 2 x q xq x x X=x] = mẽ[(ỹ0 W 0ˆγ((Ỹi, W i n 1 m i=1 2 ] for (Ỹi, W i n 1 m i=0 iid with W i N(0 k,σ W, Ỹ i W i = w i N( w i γ,σ2 (here, (Ỹi, W i n 1 m i=1 is the training sample and (Ỹ0, W 0 an additional test point drawn from the same distribution. This corollary clarifies that the class of natural estimators derived above are unbiased conditional on X=x (but not necessarily on X=x, W =w jointly, with expected loss equal to the expected out-of-sample prediction loss in a prediction problem where the prediction function w 0 w 0ˆγ is trained on n 1 m iid draws, and evaluated on an additional, independent draw (Ỹ0, W 0 from the same distribution. The long and short regressions are included as the special cases ˆγ( w,ỹ = ( w w 1 w ỹ and ˆγ 0 k, respectively. The covariates in training and test sample follow the same distribution, which suggests an estimator that is invariant to rotations in the corresponding k-means problem. Indeed, the dominating estimator I construct in the following results is of the form ˆµ w = (1 p Y r 2 Y w 2 Y w, where the standard James and Stein (1961 estimator (for unnknown σ 2 is recovered at p = k 2 n m k+1. Theorem 1 (Inadmissibility of OLS among unbiased estimators. Maintain β W = O m k. Denote by (ˆα OLS, ˆβ OLS,ˆγ OLS the coefficients and by SSR = Y 1ˆα OLS Xˆβ OLS Wˆγ OLS 2 the sum of squared residuals in a linear least-squares regression of Y on an 1, X, and W. Write h = I n 1 n 1 n /n (the annihilator matrix with respect to the intercept. Assume that k 3 and n m+k +2. Then, the two-step estimator ˆβ = (X hx 1 X h(y Wˆγ with ( p SSR ˆγ = 1 ˆγ OLS 2 W h(i X(X hx 1 X hw ˆγ OLS 8

9 ( 2(k 2 where p 0, n m k+2 is unbiased for β given X=x and dominates in the sense that E[ ˆβ β 2 X hx X=x] < E[ ˆβ OLS β 2 X hx X=x]. ˆβ OLS Proof. The OLS estimator in the theorem corresponds to ˆγ OLS (ỹ, w = ( w w 1 ỹ w in Lemma 1, which yields the maximum-likelihood estimator ˆγ OLS (Ỹ, W for γ given data vec( W N(0 k(n 1 m,σ W I n 1 m, Ỹ W = w N( wγ,σ 2 I n 1 m. By Baranchik (1973, this maximum-likelihood estimator is inadmissible with respect to the risk Ẽ[ ˆγ γ 2 Σ W ] and thus for Ẽ[ ˆγ γ 2 φ in Lemma 1, as φ = mσ W for β W = O m k. However, Baranchik (1973 also includes an intercept that is estimated, but does not enter the loss function. To formally use the result for our case without intercept in the first-step prediction exercise, I construct an augmented problem such that the dominance result in the augmented problem implies the theorem. To this end, let vec(w a N(0 k(n m,σ W I n m, Y a W a = w a N(w a γ,σ 2 I n m. (which has one additional sample point, and could without loss include intercepts in W a, Y a. By Baranchik (1973, Theorem 1, the estimator ˆγ a = ( 1 p (Y a h a Y a ˆγ a,ols 2 (W a h a W a ˆγ a,ols 2 (W a h a W a ˆγ a,ols strictly dominates ˆγ a,ols = ((W a h a W a 1 (W a h a Z a, where h a = I n m 1 n m 1 n m /(n m, in the sense that E a [(ˆγ a γ Σ W (ˆγ a γ] < E a [(ˆγ a,ols γ Σ W (ˆγ a,ols γ] 9

10 ( for any γ R k 2(k 2, provided that p 0, n m k+2 with k 3 and n m k +2. We now show that this implies dominance of ˆγ(Ỹ, W for ( ˆγ(ỹ, w = 1 pỹ ỹ ˆγ OLS (ỹ, w 2 w w ˆγ OLS (ỹ, w 2 w w ˆγ OLS (ỹ, w in the original problem. Letq a R (n m (n m 1 be such that(q a,1 n m /(n m is orthonormal (that is, the columns of q a complete 1 n m /(n m to an orthonormal basis of R m n. This implies that q a (q a = h a and (q a q a = I n m 1. Then, (q a (Y a,w a d = (Ỹ, W. In particular, ((Y a h a (Y a,(y a h a (W a,(w a h a (W a d = (Ỹ Ỹ,Ỹ W, W W and thus (ˆγ a,ˆγ a,ols d = (ˆγ(Ỹ, W,ˆγ OLS (Ỹ, W. We have thus established Ẽ[(ˆγ(Ỹ, W γ Σ W (ˆγ(Ỹ, W γ] < Ẽ[(ˆγOLS (Ỹ, W γ OLS Σ W (ˆγ (Ỹ, W γ]. Note that ˆγ OLS (ỹ, w = ( w w 1 ỹ w in Lemma 1 does indeed yield ˆγ OLS and ˆβ OLS in the theorem, and that this extends to ˆγ and ˆβ by (1 p y 2 ˆγ(q y,q w = q q ˆγ OLS (... 2 w q q w ˆγ OLS (... 2 ˆγ OLS (... w q q w with q q = h(i x(x hx 1 x h and SSR = Y 1ˆα OLS Xˆβ OLS Wˆγ OLS 2 = Y Wˆγ OLS 2 h(i X(X hx 1 X h = Y 2 h(i X(X hx 1 X h WˆγOLS 2 h(i X(X hx 1 X h = Y 2 h(i X(X hx 1 X h ˆγOLS 2 W h(i X(X hx 1 X hw. Unbiasedness and dominance follow with β W = O m k in Lemma 1. 10

11 Note that the result extends to the positive-part analog for which the shrinkage factor is set to zero whenever the expression is negative. For m = 1, the following dominance is immediate: Corollary 2 (A non-contradiction of Gauss Markov. For exogenous treatment, m = 1, k 3, and n k + 3, there exists an estimator ˆβ with E[ˆβ X=x] = β and Var(ˆβ X=x < Var(ˆβ OLS X=x. The assumption of exogenous treatment is essential for this result, as dropping conditioning on W and restricting interest to β would not suffice to break optimality of linear least-squares. 3 Invariance Properties and Bayesian Interpretation Starting with the transformations in Section 2.3, we consider the decision problem of estimating β (equivalently, µ x. Guided by the treatment of a linear panel-data model in Chamberlain and Moreira (2009, I develop the specific estimator proposed in Theorem 1 as (the empirical Bayes version of an invariant Bayes estimator with respect to a partially uninformative (improper Jeffreys prior. 3.1 Decision problem set-up In this section, we condition on X throughout and assume that covariates W are Normally distributed given X. Writing W x = q x W,W = q W,Y x = q x Y,Y = q Y, the transformation developed in Section 2.3 yields the joint distribution ( W x W = ( µ W O s k ( ( Yx µ x +Wx Y = γ W γ +V W Σ 1/2 W (V W ij iid N(0,1 +V Y σ 2 (V Y i iid N(0,1 (2 where Σ 1/2 W is the unique symmetric positive-definite square-root of the symmetric positive-definite matrix Σ W, and V W and V Y are independent. Here, 11

12 in addition to µ x = q xxβ, also µ W = q xxβ W, and s = n m 1. I write Z = R m+s R (m+s k for the sample space from which (Y,W is drawn according to this P θ, where I parametrize θ = (µ x,γ Θ = R m R k. (I take σ 2,Σ W,µ W to be constants. The action space is A = R m, from which an estimate of µ x is chosen. As the loss function L : Θ A R I take squared-error loss L(θ,a = µ x a 2. An estimator ˆβ : Z A from the previous section is a feasible decision rule in this decision problem. 3.2 A set of transformations For an element g = (g µ,g x,g W,g in the (product group G = R m O(m O(k O(s, where R m denotes the group of real numbers with addition (neutral element 0 and O(k the group of ortho-normal matrices in R k k with matrix multiplication (neutral element I k, consider the following set of transformations (which are actions of G on Z,Θ,A: Sample space: m Z : G Z Z, (g,(y x,y,w x,w (g x y x +g µ,g y,g x w x Σ 1/2 W g WΣ 1/2 W,g w Σ 1/2 W g WΣ 1/2 W Parameter space: m Θ : G Θ Θ, (g,(µ x,γ (g x µ x +g µ,σ 1/2 W g WΣ 1/2 W γ Action space: m A : G A A,(g,a g x a+g µ For exogenous treatment, these transformations are tied together by leaving model and loss invariant. Indeed, the following is immediate from Equation 2: 3 Proposition 1 (Invariance of model and loss. For µ W = O m k : 3 Alternatively, we could have treated µ W as an element of the parameter space and extend the analysis to the case of endogenous treatment. Adding (g,µ W g xµ WΣ 1/2 W g WΣ 1/2 W to the action on the parameter space would have retained invariance. 12

13 1. The model is invariant: m Z (g,(y,w P mθ (g,θ for all g G. 2. The loss is invariant: L(m Θ (g,θ,m A (g,a = L(θ,a for all g G. 3.3 An invariant Bayes estimator... By Proposition 1, a natural (generalized Bayes estimator of µ x is derived from an improper prior on θ that is invariant under the action of G on Θ, as this will yield a decision rule d : Z A that is invariant in the sense that d(m Z (g,(y,w = m A (g,d((y,w for all (g,(y,w G Z. This implies for µ x as an improper prior the Haar measure with respect to the translation action (i.e. up to a multiplicative constant the σ-finite Lebesgue measure on R m, and for γ a prior that is uniform on ellipsoids γ Σ W γ = ω. Taking ω χ 2 τ 2 m with some τ > 0 yields the prior γ N(0,τ2 Σ 1 W. With a product prior for θ, the resulting generalized Bayes estimator for µ x which minimizes posterior loss conditional on the data is E[µ x Y = y,w = w] = y x w x E[γ Y = y,w = w] = y x w x (w w +σ 2 Σ W /τ 2 1 w y and a specific empirical Bayes implementation Replacing Σ W by the specific sample analog W W /s, we obtain the estimator Y x sτ2 Similarly assuming that W sτ 2 +σ 2 x ((W W 1 (W Y. γ N(0,sτ 2 W W sτ, an unbiased estimator of 2 (given W is sτ 2 +σ 2 C = 1 (Y (I s W ((W W 1 (W Y /(s k (Y W ((W W 1 (W Y /(k 2. This estimator corresponds to the estimator from Theorem 1 at p = k 2 k 2 n m k 1 s k =. By construction, it retains the invariance of the associated generalized Bayes estimator. This is not specific to this value of p: Proposition 2 (Invariance of estimator. For any p, the estimator ˆβ from Theorem 1 is invariant with respect to the above actions of G. 13

14 Conclusion A natural application of James Stein shrinkage to control variables in a Normal linear model consistently reduces expected prediction error without introducing bias in the treatment parameter of interest provided treatment is random. In this case, the linear least-squares estimator is thus inadmissible even among unbiased estimators. In a companion paper (Spiess, 2017, I show how shrinkage in at least four instrumental variables in a canonical structural form provides consistent bias improvement over the two-stage least-squares estimator. Together, these results suggests different roles of overfitting in control and instrumental variable coefficients, respectively: while overfitting to control variables induces variance, overfitting to instrumental variables in the first stage of a two-stage least-squares procedure induces bias. References Baranchik, A. J. (1973. Inadmissibility of Maximum Likelihood Estimators in Some Multiple Regression Problems with Three or More Independent Variables. The Annals of Statistics, 1(2: Chamberlain, G. and Moreira, M. J. (2009. Decision Theory Applied to a Linear Panel Data Model. Econometrica, 77(1: Hansen, B. E. (2007. Least Squares Model Averaging. Econometrica, 75(4: Hansen, B. E. (2016. Efficient shrinkage in parametric models. Journal of Econometrics, 190(1: James, W. and Stein, C. (1961. Estimation with quadratic loss. Fourth Berkeley Symposium. Sclove, S. L. (1968. Improved estimators for coefficients in linear regression. Journal of the American Statistical Association, 63(322:

15 Spiess, J. (2017. Bias Reduction in Instrumental Variable Estimation through First-Stage Shrinkage. arxiv preprint arxiv:

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*