Stats 300C: Theory of Statistics (Spring 2018)
Lecture 20: May 18, 2018
Prof. Emmanuel Candes; Scribes: Will Fithian and E. Candes

1 Outline

1. Stein's phenomenon
2. Empirical Bayes interpretation of James-Stein (JS)
3. Extensions
4. A famous baseball example

The primary goal of this lecture is to derive the James-Stein estimator via an empirical Bayes argument, which motivates the form of the estimator. [Reference: Efron, Chapter 1]

2 Empirical Bayes Interpretation [Efron & Morris, 1973]

Consider the Bayes model
\[
\mu_i \overset{iid}{\sim} N(0, \tau^2), \qquad X \mid \mu \sim N(\mu, \sigma^2 I).
\]
Then we have the posterior distribution
\[
\mathcal{L}(\mu \mid X) = N\!\left( \frac{v}{\sigma^2}\, X, \; v I \right)
\]
with
\[
\frac{1}{v} = \frac{1}{\tau^2} + \frac{1}{\sigma^2} = \frac{\tau^2 + \sigma^2}{\tau^2 \sigma^2},
\]
so that
\[
\mathcal{L}(\mu \mid X) = N\!\left( \frac{\tau^2}{\tau^2 + \sigma^2}\, X, \; \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2}\, I \right).
\]
For instance, if $\sigma^2 = 1$, then
\[
\mathcal{L}(\mu \mid X) = N\!\left( \frac{\tau^2}{\tau^2 + 1}\, X, \; \frac{\tau^2}{\tau^2 + 1}\, I \right).
\]
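Since $(\mu_i, X_i)$ are jointly Gaussian, the posterior mean is the best linear predictor of $\mu_i$ from $X_i$, and the shrinkage slope $\tau^2/(\tau^2 + \sigma^2)$ can be sanity-checked by a least-squares fit on simulated pairs (a sketch; the parameter values below are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma, n = 1.5, 1.0, 200_000          # arbitrary values for this check
mu = rng.normal(0.0, tau, size=n)          # mu_i ~ N(0, tau^2)
X = mu + rng.normal(0.0, sigma, size=n)    # X_i | mu_i ~ N(mu_i, sigma^2)

# For jointly Gaussian (mu, X), E[mu | X] = (Cov(mu, X)/Var(X)) X, and
# Cov(mu, X)/Var(X) = tau^2/(tau^2 + sigma^2). Fit the slope by least squares:
slope = np.sum(mu * X) / np.sum(X ** 2)
print(slope, tau**2 / (tau**2 + sigma**2))  # both approximately 0.692
```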
Bayes Estimate: We suggestively write the Bayes estimate as
\[
\hat{\mu}_B = \left( 1 - \frac{\sigma^2}{\sigma^2 + \tau^2} \right) X;
\]
this is a shrinkage estimate. E.g., if $\tau = \sigma$, we would shrink halfway toward zero.

Bayes Risk: Writing $\hat{\mu}_B - \mu = (1-\rho)(X - \mu) - \rho\mu$ with $\rho = \frac{\sigma^2}{\sigma^2 + \tau^2}$ the shrinkage factor, we have the conditional MSE
\[
E\left[ \|\hat{\mu}_B - \mu\|^2 \mid \mu \right] = (1-\rho)^2\, p\sigma^2 + \rho^2 \|\mu\|^2,
\]
and integrating over the prior distribution of $\mu$ we obtain the Bayes risk
\[
E\|\hat{\mu}_B - \mu\|^2 = (1-\rho)^2\, p\sigma^2 + \rho^2\, p\tau^2 = p\sigma^2 \cdot \frac{\tau^2}{\tau^2 + \sigma^2} = R(\hat{\mu}_{MLE}) \cdot \frac{\tau^2}{\tau^2 + \sigma^2}.
\]
Note that this is strictly smaller than the risk of the MLE. If $\tau = \sigma$, this removes half of the risk! Of course we can't achieve this in real life, since no one gives us the Bayes prior. Still, this gives us a hint that we might be able to improve upon the MLE even without prior knowledge of $\tau$.

3 Empirical Bayes Estimation

Recall that $\sigma$ is known, and assume that the Bayes model is correct but $\tau$ is unknown. Now we cannot compute the Bayes estimator from the data because we don't know the right shrinkage factor $\rho$. Still, we might hope to estimate $\tau^2$ from our data vector. Note that $X_i = \mu_i + z_i \sim N(0, \tau^2 + \sigma^2)$, i.e.,
\[
\|X\|^2 \sim (\tau^2 + \sigma^2)\, \chi^2_p.
\]
A useful fact from calculus is that
\[
E\left[ \frac{p-2}{\chi^2_p} \right] = 1.
\]
This means that $\frac{(p-2)\,\sigma^2}{\|X\|^2}$ is an unbiased estimate of the right shrinkage factor $\rho$. Plugging in this estimate, we obtain
\[
\hat{\mu} = \left( 1 - \frac{(p-2)\,\sigma^2}{\|X\|^2} \right) X,
\]
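A short simulation can sanity-check these risk formulas by comparing the MLE, the oracle Bayes estimate, and the plug-in estimate just derived (a sketch; the dimension, noise level, and replication count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma, tau, reps = 20, 1.0, 1.0, 20_000

mse_mle = mse_bayes = mse_js = 0.0
for _ in range(reps):
    mu = rng.normal(0.0, tau, size=p)            # mu_i ~ N(0, tau^2)
    X = mu + rng.normal(0.0, sigma, size=p)      # X | mu ~ N(mu, sigma^2 I)

    rho = sigma**2 / (sigma**2 + tau**2)         # oracle shrinkage factor
    mu_bayes = (1 - rho) * X                     # Bayes estimate (needs tau)
    rho_hat = (p - 2) * sigma**2 / np.sum(X**2)  # unbiased estimate of rho
    mu_js = (1 - rho_hat) * X                    # plug-in (James-Stein) estimate

    mse_mle += np.sum((X - mu) ** 2)
    mse_bayes += np.sum((mu_bayes - mu) ** 2)
    mse_js += np.sum((mu_js - mu) ** 2)

print(mse_mle / reps)    # approximately p sigma^2 = 20
print(mse_bayes / reps)  # approximately p sigma^2 tau^2/(tau^2 + sigma^2) = 10
print(mse_js / reps)     # approximately 10 (1 + 2 sigma^2/(p tau^2)) = 11
```

With $p = 20$ and $\tau = \sigma$, the plug-in estimate sits between the Bayes estimate and the MLE, as the formulas in the text predict.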
which exactly recovers the James-Stein estimator! Now, if the Bayes model is correct, then the overall Bayes risk of $\hat{\mu}_{JS}$ is
\[
E\|\hat{\mu}_{JS} - \mu\|^2 = p\, \frac{\sigma^2 \tau^2}{\tau^2 + \sigma^2} + \frac{2\sigma^4}{\tau^2 + \sigma^2},
\]
which is larger than the Bayes risk, but only by a factor of $1 + \frac{2\sigma^2}{p\tau^2}$. This factor can be quite small when $p$ is large. E.g., for $p = 20$ and $\tau = \sigma$ (SNR = 1), we only miss the Bayes risk by 10%, whereas the MLE misses by 100%.

It is not so surprising that we do better than the MLE in this setting, since the $\mu_i$ are not too far from 0, the point toward which we are shrinking our estimates. The surprising fact comes from the earlier frequentist result that we outperform the MLE everywhere.

Extension 1: The James-Stein phenomenon is more general than the independent normal case; in fact, it works with correlated data, so long as the effective dimension is sufficiently large. Imagine we have $X \sim N(\mu, \Sigma)$ with $\Sigma$ known. The MLE is of course $\hat{\mu}_{MLE} = X$. A James-Stein estimate (JSE) [Bock, 1975] is
\[
\hat{\mu}_{JS} = \left( 1 - \frac{\tilde{p} - 2}{X^T \Sigma^{-1} X} \right) X,
\]
where $\tilde{p}$ is the effective dimension, defined as
\[
\tilde{p} = \frac{\operatorname{Tr}(\Sigma)}{\lambda_{\max}(\Sigma)}.
\]
Bock showed that if $\tilde{p} > 2$, then
\[
R(\hat{\mu}_{JS}, \mu) < R(\hat{\mu}_{MLE}, \mu) \quad \text{for all } \mu \in \mathbb{R}^p.
\]
This is also true for
\[
\left( 1 - \frac{c\,(\tilde{p} - 2)}{X^T \Sigma^{-1} X} \right) X, \qquad 0 < c < 2.
\]
The condition on the effective dimension makes sense: if $\operatorname{rank}(\Sigma) = 2$, for instance, we would not expect Stein's phenomenon to hold, since we are dealing with a two-dimensional problem.

There is an interesting consequence of this in the context of linear regression. Consider the model $y = X\beta + z$, where $y \in \mathbb{R}^n$ is observed, $X \in \mathbb{R}^{n \times p}$ is known, and $\beta \in \mathbb{R}^p$ is unobserved and to be estimated. The $z_i$'s are stochastic errors, $z_i \overset{iid}{\sim} N(0, \sigma^2)$. If $X$ has full column rank, then the JSE
\[
\hat{\beta}_{JS} = \left( 1 - \frac{c}{\hat{\beta}_{MLE}^T\, X^T X\, \hat{\beta}_{MLE}} \right) \hat{\beta}_{MLE},
\]
for a suitable constant $c > 0$, will dominate the MLE (the least-squares estimate) for the MSE.
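Bock's estimator is straightforward to implement once $\Sigma$ is known. The sketch below checks the domination claim by simulation under an AR(1)-type covariance; the covariance, dimension, and true mean are arbitrary illustrative choices:

```python
import numpy as np

def js_bock(X, Sigma):
    """Bock's James-Stein estimate for X ~ N(mu, Sigma) with Sigma known."""
    p_eff = np.trace(Sigma) / np.linalg.eigvalsh(Sigma)[-1]  # effective dimension
    shrink = (p_eff - 2) / (X @ np.linalg.solve(Sigma, X))
    return (1 - shrink) * X

rng = np.random.default_rng(1)
p = 30
idx = np.arange(p)
Sigma = 0.5 ** np.abs(np.subtract.outer(idx, idx))  # AR(1) covariance, rho = 0.5
L = np.linalg.cholesky(Sigma)
mu = np.full(p, 0.1)                                # an arbitrary true mean

reps = 5_000
risk_mle = risk_js = 0.0
for _ in range(reps):
    X = mu + L @ rng.normal(size=p)                 # correlated observation
    risk_mle += np.sum((X - mu) ** 2)
    risk_js += np.sum((js_bock(X, Sigma) - mu) ** 2)

print(risk_mle / reps, risk_js / reps)  # the JS risk is noticeably smaller
```

For this covariance the effective dimension is roughly 10, so Bock's condition $\tilde{p} > 2$ holds comfortably.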
Extension 2: There is nothing special about shrinking toward 0. We could shrink toward an arbitrary point $\mu_0$, i.e.,
\[
\hat{\mu}_{JS} = \mu_0 + \left( 1 - \frac{(p-2)\,\sigma^2}{\|X - \mu_0\|^2} \right) (X - \mu_0).
\]
This also dominates the MLE. To see this, consider $X' = X - \mu_0 \sim N(\mu - \mu_0, \sigma^2 I)$. We know that the standard JSE dominates the MLE for estimating $\mu - \mu_0$, and this is equivalent to saying that our modified JSE dominates the MLE for $\mu$. In practice, rather than choosing an arbitrary $\mu_0$, it would make sense to use $\bar{X}$, so that we adapt to the true center of the distribution of $\mu$.

Empirical Bayes Viewpoint: Suppose we modify the earlier model so that we do not know the center $\mu_0$ of the distribution of $\mu$:
\[
\mu_i \overset{iid}{\sim} N(\mu_0, \tau^2), \qquad X \mid \mu \sim N(\mu, \sigma^2 I).
\]
Then, marginally, $X_i \sim N(\mu_0, \tau^2 + \sigma^2)$, and the posterior is
\[
\mu_i \mid X_i \sim N\big( \mu_0 + (1-\rho)(X_i - \mu_0), \; (1-\rho)\,\sigma^2 \big)
\]
with $\rho = \frac{\sigma^2}{\tau^2 + \sigma^2}$. As before, we can estimate $\mu_0$ and $\tau^2 + \sigma^2$ via the MLE, namely the sample mean $\bar{X}$ and the sum of squares
\[
S = \sum_i (X_i - \bar{X})^2 \sim (\tau^2 + \sigma^2)\, \chi^2_{p-1}.
\]
Then, as before,
\[
E\left[ \frac{(p-3)\,\sigma^2}{S} \right] = \rho,
\]
so we obtain another JSE:
\[
\hat{\mu}_i = \bar{X} + \left( 1 - \frac{(p-3)\,\sigma^2}{S} \right) (X_i - \bar{X}).
\]
Theorem: If $p > 3$, then this new JSE dominates the MLE.

4 Example: Baseball

Suppose we want to estimate year-long player batting averages from observations made during the first week of play. Efron and Morris chose 18 players with exactly 45 at-bats as of April 26, 1970. Batting averages are approximately binomial, so that, approximately,
\[
x_i \sim N\!\left( \theta_i, \; \frac{1}{45}\, \theta_i (1 - \theta_i) \right).
\]
We have a problem here since the variance is a function of the mean. We can apply the variance-stabilizing transformation
\[
y_i = \sqrt{45}\, \arcsin(2 x_i - 1),
\]
so that, approximately, $y_i \sim N(\mu_i, 1)$ with $\mu_i = \sqrt{45}\, \arcsin(2\theta_i - 1)$. Then the JSE (shrinking toward the grand mean, with $p - 3 = 15$) is
\[
\hat{\mu}_{JS} = \bar{y} + \left( 1 - \frac{15}{\|y - \bar{y}\|^2} \right) (y - \bar{y}).
\]
Using the full-season batting average as the true $\theta_i$, we can compare the performance of the JSE to that of the MLE. On this data set the difference is quite dramatic:
\[
\|\hat{\mu}_{JS} - \mu\|^2 = 5.01, \qquad \|\hat{\mu}_{MLE} - \mu\|^2 = 17.56,
\]
and in the original coordinates $\theta_i$,
\[
\|\hat{\theta}_{JS} - \theta\|^2 = .022, \qquad \|\hat{\theta}_{MLE} - \theta\|^2 = .077.
\]
In both cases, the improvement is by a factor of approximately 3.5.

Interestingly, for three players the JSE underperforms the MLE. For these players, the $\theta_i$ are actually extreme, so we get them wrong by shrinking them toward the mean. In general, the JSE will not improve the MSE for every single coordinate. If $\mu_1 = 1000$ and the other $\mu_i = 0$, then shrinkage toward 0 is quite inappropriate for the first coordinate. What the James-Stein phenomenon guarantees is only that, in terms of the overall MSE, we gain more by using it to estimate the other coordinates than we lose in using it to estimate the first one.
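We do not reproduce the data table here, but the whole pipeline (binomial averages, variance-stabilizing transform, shrinkage toward the grand mean, back-transform) can be sketched on simulated data. The true $\theta_i$ below are arbitrary stand-ins, not the actual 1970 values:

```python
import numpy as np

rng = np.random.default_rng(3)
n_ab, p = 45, 18                              # at-bats, number of players
theta = rng.uniform(0.15, 0.40, size=p)       # arbitrary "true" season averages
x = rng.binomial(n_ab, theta) / n_ab          # first-week batting averages

# Variance-stabilizing transform: y_i ~ N(mu_i, 1), approximately
y = np.sqrt(n_ab) * np.arcsin(2 * x - 1)
mu = np.sqrt(n_ab) * np.arcsin(2 * theta - 1)

# James-Stein, shrinking toward the grand mean (here p - 3 = 15)
ybar = y.mean()
mu_js = ybar + (1 - (p - 3) / np.sum((y - ybar) ** 2)) * (y - ybar)

# Map the estimates back to batting averages
theta_js = (np.sin(mu_js / np.sqrt(n_ab)) + 1) / 2

print(np.sum((mu_js - mu) ** 2), np.sum((y - mu) ** 2))  # JS vs MLE loss
```

Note that shrinking toward $\bar{y}$ leaves the grand mean of the estimates unchanged; only the spread around it is reduced.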
Figure 1: Baseball data table reproduced from "An Introduction to James-Stein Estimation" by John A. Richards. In the table, $\psi_i$ is the same as $\mu_i$ (see text).