Lecture 20 (May 18): Empirical Bayes Interpretation [Efron & Morris 1973]


Stats 300C: Theory of Statistics, Spring 2018
Lecture 20, May 18, 2018
Prof. Emmanuel Candes
Scribe: Will Fithian and E. Candes

1 Outline

1. Stein's phenomenon
2. Empirical Bayes interpretation of James-Stein (JS)
3. Extensions
4. A famous baseball example

The primary topic of this lecture is to derive the James-Stein estimator via an empirical Bayes argument, to give some motivation for the form of the estimator. [Reference: Efron, Chapter 1]

2 Empirical Bayes Interpretation [Efron & Morris 1973]

Consider the Bayes model
$$\mu_i \stackrel{iid}{\sim} N(0, \tau^2), \qquad X \mid \mu \sim N(\mu, \sigma^2 I).$$
Then we have the posterior distribution
$$\mu \mid X \sim N\!\left(\frac{v}{\sigma^2}\, X,\; v I\right) \quad \text{with} \quad \frac{1}{v} = \frac{1}{\tau^2} + \frac{1}{\sigma^2} = \frac{\tau^2 + \sigma^2}{\tau^2 \sigma^2},$$
so that
$$\mu \mid X \sim N\!\left(\frac{\tau^2}{\tau^2 + \sigma^2}\, X,\; \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2}\, I\right).$$
For instance, if $\sigma^2 = 1$, then
$$\mu \mid X \sim N\!\left(\frac{\tau^2}{\tau^2 + 1}\, X,\; \frac{\tau^2}{\tau^2 + 1}\, I\right).$$
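For concreteness, the posterior above follows from a standard normal-normal conjugacy computation, completing the square coordinate by coordinate:
$$p(\mu_i \mid X_i) \;\propto\; \exp\!\left(-\frac{(X_i - \mu_i)^2}{2\sigma^2}\right)\exp\!\left(-\frac{\mu_i^2}{2\tau^2}\right) \;\propto\; \exp\!\left(-\frac{1}{2v}\left(\mu_i - \frac{v}{\sigma^2} X_i\right)^{\!2}\right), \qquad v = \left(\frac{1}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1},$$
so each coordinate is independently $N(vX_i/\sigma^2,\, v)$, which is exactly the posterior displayed above.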

Bayes estimate: We suggestively write the Bayes estimate as
$$\hat{\mu}_B = \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) X;$$
this is a shrinkage estimate. E.g., if $\tau = \sigma$ we would shrink halfway toward zero.

Bayes risk: Writing $\hat{\mu}_B - \mu = (1-\rho)(X - \mu) - \rho\mu$ with $\rho = \frac{\sigma^2}{\sigma^2 + \tau^2}$ the shrinkage factor, we have the conditional MSE
$$\mathbb{E}\left[\|\hat{\mu}_B - \mu\|^2 \mid \mu\right] = (1-\rho)^2\, p\sigma^2 + \rho^2 \|\mu\|^2,$$
and integrating over the prior distribution of $\mu$ we obtain the Bayes risk
$$\mathbb{E}\,\|\hat{\mu}_B - \mu\|^2 = (1-\rho)^2\, p\sigma^2 + \rho^2\, p\tau^2 = p\sigma^2\, \frac{\tau^2}{\tau^2 + \sigma^2} = R_{MLE}\, \frac{\tau^2}{\tau^2 + \sigma^2}.$$
Note that this is strictly smaller than the risk of the MLE. If $\tau = \sigma$, this removes half of the risk! Of course we cannot achieve this in real life, since no one gives us the Bayes prior. Still, this gives us a hint that we might be able to improve upon the MLE even without prior knowledge of $\tau$.

3 Empirical Bayes Estimation

Recall that $\sigma$ is known, and assume that the Bayes model is correct but $\tau$ is unknown. Now we cannot compute the Bayes estimator from the data because we do not know the right shrinkage factor $\rho$. Still, we might hope to estimate $\tau^2$ from our data vector. Note that marginally
$$X_i = \mu_i + z_i \sim N(0, \tau^2 + \sigma^2), \qquad \text{i.e.} \qquad \|X\|^2 \sim (\tau^2 + \sigma^2)\, \chi^2_p.$$
A useful fact from calculus is that
$$\mathbb{E}\left[\frac{p-2}{\chi^2_p}\right] = 1.$$
This means that $(p-2)\sigma^2 / \|X\|^2$ is an unbiased estimate of the right shrinkage factor $\rho$. Plugging in this estimate, we obtain
$$\hat{\mu} = \left(1 - \frac{(p-2)\,\sigma^2}{\|X\|^2}\right) X,$$
which exactly recovers the James-Stein estimator!
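The empirical Bayes argument is easy to check numerically. The following sketch (Python with NumPy; the values of p, tau, sigma and the number of repetitions are illustrative choices, not from the notes) simulates the two-level model, forms the James-Stein estimate with the plug-in shrinkage factor, and compares its average loss to the MLE and to the oracle Bayes rule.

    import numpy as np

    rng = np.random.default_rng(0)
    p, tau, sigma = 20, 1.0, 1.0            # illustrative dimension and scales
    rho = sigma**2 / (sigma**2 + tau**2)    # oracle shrinkage factor
    n_rep = 20000

    loss_mle = loss_js = loss_bayes = 0.0
    for _ in range(n_rep):
        mu = rng.normal(0.0, tau, size=p)             # mu_i ~ N(0, tau^2)
        x = mu + rng.normal(0.0, sigma, size=p)       # X | mu ~ N(mu, sigma^2 I)
        rho_hat = (p - 2) * sigma**2 / np.sum(x**2)   # unbiased estimate of rho
        mu_js = (1 - rho_hat) * x                     # James-Stein estimate
        mu_bayes = (1 - rho) * x                      # oracle Bayes rule
        loss_mle += np.sum((x - mu)**2)
        loss_js += np.sum((mu_js - mu)**2)
        loss_bayes += np.sum((mu_bayes - mu)**2)

    print("MLE risk   ~", loss_mle / n_rep)    # about p*sigma^2 = 20
    print("JS risk    ~", loss_js / n_rep)     # about 10*(1 + 2/p) = 11 when tau = sigma
    print("Bayes risk ~", loss_bayes / n_rep)  # about p*sigma^2*tau^2/(tau^2+sigma^2) = 10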

Now, if the Bayes model is correct, then the overall Bayes risk of $\hat{\mu}_{JS}$ is
$$\mathbb{E}\,\|\hat{\mu}_{JS} - \mu\|^2 = p\, \frac{\sigma^2 \tau^2}{\tau^2 + \sigma^2} + \frac{2\sigma^4}{\tau^2 + \sigma^2},$$
which is larger than the Bayes risk, but only by a factor of $1 + \frac{2\sigma^2}{p\tau^2}$. This factor can be quite small when $p$ is large. E.g., for $p = 20$ and $\tau = \sigma$ (SNR $= 1$), we only miss the Bayes risk by 10%, whereas the MLE misses it by 100%.

It is not so surprising that we do better than the MLE in this setting, since the $\mu_i$ are not too far from 0, the point toward which we are shrinking our estimates. The surprising fact comes from the earlier frequentist result that we outperform the MLE everywhere.

Extension 1: The James-Stein phenomenon is more general than the independent normal case; in fact it works with correlated data, so long as the effective dimension is sufficiently large. Imagine we have $X \sim N(\mu, \Sigma)$, with $\Sigma$ known. The MLE is of course $\hat{\mu}_{MLE} = X$. A James-Stein estimate (JSE) [Bock, 1975] is
$$\hat{\mu}_{JS} = \left(1 - \frac{p^\star - 2}{X^T \Sigma^{-1} X}\right) X,$$
where $p^\star$ is the effective dimension, defined as
$$p^\star = \frac{\operatorname{Tr}(\Sigma)}{\lambda_{\max}(\Sigma)}.$$
Bock showed that if $p^\star > 2$, then
$$R(\hat{\mu}_{JS}, \mu) < R(\hat{\mu}_{MLE}, \mu) \quad \text{for all } \mu \in \mathbb{R}^p.$$
This is also true for
$$\left(1 - \frac{c\,(p^\star - 2)}{X^T \Sigma^{-1} X}\right) X, \qquad 0 < c < 2.$$
The condition on the effective dimension makes sense: if $\operatorname{rank}(\Sigma) = 2$, for instance, then we would not expect Stein's phenomenon to hold, since we are effectively dealing with a two-dimensional problem.

There is an interesting consequence of this in the context of linear regression. Consider the model
$$y = X\beta + z,$$
where $y \in \mathbb{R}^{n}$ is observed, $X \in \mathbb{R}^{n \times p}$ is known, and $\beta \in \mathbb{R}^{p}$ is unobserved and to be estimated. The $z_i$'s are stochastic errors, $z_i \sim N(0, \sigma^2)$. If $X$ has full column rank, then the JSE
$$\hat{\beta}_{JS} = \left(1 - \frac{c}{\hat{\beta}_{MLE}^T\, (X^T X)\, \hat{\beta}_{MLE}}\right) \hat{\beta}_{MLE}$$
(for a suitable constant $c$) will dominate the MLE, i.e. the least-squares estimate, in terms of MSE.
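Below is a minimal sketch of Bock's correlated-data estimator, assuming $\Sigma$ is known; the AR(1)-type covariance and the dimension used in the example are made up for illustration.

    import numpy as np

    def js_correlated(x, Sigma):
        """Bock-style James-Stein estimate for X ~ N(mu, Sigma), Sigma known.

        Shrinks X toward 0 using the effective dimension Tr(Sigma)/lambda_max(Sigma).
        """
        eigvals = np.linalg.eigvalsh(Sigma)
        p_eff = eigvals.sum() / eigvals.max()      # effective dimension p*
        quad = x @ np.linalg.solve(Sigma, x)       # X^T Sigma^{-1} X
        shrink = 1.0 - (p_eff - 2.0) / quad
        return shrink * x

    # Illustrative use (made-up covariance; domination requires p_eff > 2)
    rng = np.random.default_rng(1)
    p = 20
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    x = rng.multivariate_normal(np.zeros(p), Sigma)
    print(js_correlated(x, Sigma))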

Extension 2: There is nothing special about shrinking toward 0. We could shrink toward an arbitrary point $\mu_0$, i.e.
$$\hat{\mu}_{JS} = \mu_0 + \left(1 - \frac{(p-2)\,\sigma^2}{\|X - \mu_0\|^2}\right)(X - \mu_0).$$
This also dominates the MLE. To see this, consider $\tilde{X} = X - \mu_0 \sim N(\mu - \mu_0, \sigma^2 I)$. We know that the standard JSE dominates the MLE for estimating $\mu - \mu_0$, and this is equivalent to saying that our modified JSE dominates the MLE for $\mu$. In practice, rather than choosing an arbitrary $\mu_0$, it would make sense to use $\bar{X}$, so that we adapt to the true center of the distribution of $\mu$.

Empirical Bayes viewpoint: Suppose we modify the earlier model so that we do not know the center $\mu_0$ of the distribution of $\mu$:
$$\mu_i \sim N(\mu_0, \tau^2), \qquad X \mid \mu \sim N(\mu, \sigma^2 I).$$
Then marginally $X_i \sim N(\mu_0, \tau^2 + \sigma^2)$, and our posterior is
$$\mu_i \mid X_i \sim N\!\left(\mu_0 + (1-\rho)(X_i - \mu_0),\; (1-\rho)\,\sigma^2\right)$$
with $\rho = \frac{\sigma^2}{\tau^2 + \sigma^2}$. As before, we can estimate $\mu_0$ and $\tau^2$ via the MLE, using the sample mean $\bar{X}$ and the sum of squares
$$S = \sum_i (X_i - \bar{X})^2 \sim (\tau^2 + \sigma^2)\, \chi^2_{p-1}.$$
Then, as before,
$$\mathbb{E}\left[\frac{(p-3)\,\sigma^2}{S}\right] = \rho,$$
so we obtain another JSE:
$$\hat{\mu}_i = \bar{X} + \left(1 - \frac{(p-3)\,\sigma^2}{S}\right)(X_i - \bar{X}).$$

Theorem: If $p > 3$, then this new JSE dominates the MLE.
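Here is a small sketch of this "shrink toward the grand mean" estimator, assuming $\sigma$ is known; the simulated values of mu0, tau, sigma and the dimension are illustrative only.

    import numpy as np

    def js_toward_mean(x, sigma=1.0):
        """James-Stein estimate shrinking each coordinate toward the grand mean.

        Uses the (p - 3) rule, which keeps the plug-in shrinkage factor unbiased
        when the center of the prior is also estimated from the data.
        """
        x = np.asarray(x, dtype=float)
        p = x.size
        xbar = x.mean()
        S = np.sum((x - xbar) ** 2)
        shrink = 1.0 - (p - 3) * sigma**2 / S
        return xbar + shrink * (x - xbar)

    # Illustrative check with made-up mu0 and tau
    rng = np.random.default_rng(2)
    p, mu0, tau, sigma = 50, 3.0, 1.0, 1.0
    mu = rng.normal(mu0, tau, size=p)
    x = mu + rng.normal(0.0, sigma, size=p)
    print("MLE loss:", np.sum((x - mu) ** 2))
    print("JS  loss:", np.sum((js_toward_mean(x, sigma) - mu) ** 2))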

4 Example: Baseball

Suppose we want to estimate year-long player batting averages from observations during the first week of play. Efron and Morris chose 18 players with exactly 45 at-bats as of April 26, 1970. Batting averages are approximately binomial, so we have approximately
$$x_i \sim N\!\left(\theta_i,\; \tfrac{1}{45}\,\theta_i (1 - \theta_i)\right).$$
We have a problem here, since the variance is a function of the mean. We can make a variance-stabilizing transformation
$$y_i = \sqrt{45}\, \arcsin(2x_i - 1),$$
so that approximately $y_i \sim N(\mu_i, 1)$ with $\mu_i = \sqrt{45}\, \arcsin(2\theta_i - 1)$. Then the JSE (with $p - 3 = 15$) is
$$\hat{\mu}_{JS} = \bar{y} + \left(1 - \frac{15}{\|y - \bar{y}\|^2}\right)(y - \bar{y}).$$
Using the full-season batting average as the true $\theta_i$, we can compare the performance of the JSE to the MLE. On this data set the difference is quite dramatic:
$$\|\hat{\mu}_{JS} - \mu\|^2 = 5.01, \qquad \|\hat{\mu}_{MLE} - \mu\|^2 = 17.56,$$
and in the original coordinates $\theta_i$,
$$\|\hat{\theta}_{JS} - \theta\|^2 = .022, \qquad \|\hat{\theta}_{MLE} - \theta\|^2 = .077.$$
In both cases, the improvement is by a factor of approximately 3.5.

Interestingly, for three players the JSE underperforms the MLE. For these players, the $\theta_i$ are actually extreme, so we get them wrong by shrinking them toward the mean. In general, the JSE will not improve the MSE for every single coordinate. If $\mu_1 = 1000$ and the other $\mu_i = 0$, then the shrinkage toward 0 will be quite inappropriate for the first coordinate. What the James-Stein phenomenon guarantees is only that, in terms of the overall MSE, we gain more by using it to estimate the other coordinates than we lose by using it to estimate the first one.
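The baseball pipeline above (variance-stabilize, shrink toward the mean, map back) can be sketched as follows. The hit counts in the example array are placeholders, not the actual Efron-Morris table, which is reproduced in Figure 1.

    import numpy as np

    def js_batting(hits, at_bats=45):
        """Variance-stabilize binomial averages, apply JS toward the mean, map back.

        `hits` holds the first-period hit counts, one entry per player.
        """
        x = hits / at_bats                                   # early-season averages
        y = np.sqrt(at_bats) * np.arcsin(2 * x - 1)          # approx N(mu_i, 1)
        p = y.size
        ybar = y.mean()
        shrink = 1.0 - (p - 3) / np.sum((y - ybar) ** 2)
        mu_hat = ybar + shrink * (y - ybar)
        return (np.sin(mu_hat / np.sqrt(at_bats)) + 1) / 2   # back to batting averages

    # Placeholder counts (NOT the real table), just to run the pipeline
    hits = np.array([18, 17, 16, 15, 14, 14, 13, 12, 11, 11,
                     10, 10, 10, 10, 10, 9, 8, 7])
    print(js_batting(hits))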

Figure 1: Baseball data table, reproduced from "An Introduction to James-Stein Estimation" by John A. Richards. In the table, $\psi_i$ is the same as $\mu_i$ (see text).