Combining Biased and Unbiased Estimators in High Dimensions
Bill Strawderman, Rutgers University
(joint work with Ed Green, Rutgers University)

OUTLINE:
I. Introduction
II. Some remarks on Shrinkage Estimators
III. Combining Biased and Unbiased Estimators
IV. Some Questions/Comments
V. Example

I. Introduction:
Problem: Estimate a vector-valued parameter θ (dim θ = p). We have multiple (at least 2) estimators of θ, at least one of which is unbiased. How do we combine the estimators?
Suppose, e.g., X ~ N_p(θ, σ²I) and Y ~ N_p(θ + η, τ²I), i.e., X is unbiased and Y is biased (and X and Y are independent). Can we effectively combine the information in X and Y to estimate θ?

Example from Forestry: Estimating basal area per acre of tree stands (total cross-sectional area at a height of 4.5 feet).
Why is it important? It helps quantify the degree of above-ground competition in a particular stand of trees.
Two sources of data to estimate basal area:
a. X, sample-based estimates (unbiased)
b. Y, estimates based on regression model predictions (possibly biased)
Regression-model-based estimators are often biased for the parameters of interest because they are typically based on non-linearly transformed responses, which become biased when transformed back to the original scale. In our example it is log(basal area) that is modeled (linearly).
The dimension of X and Y in our application ranges from 9 to 47 (the number of stands).

The Usual Combined Estimator: The usual combined estimators assume both estimators are unbiased and average with weights inversely proportional to the variances, i.e.,
δ*(X, Y) = w_1 X + (1 − w_1)Y = (τ²X + σ²Y)/(σ² + τ²).
Can we do something sensible when Y is suspected of being biased?
Homework Problem: a. Find a biopharmaceutical application.
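A minimal numerical sketch of this precision-weighted combination (a NumPy sketch; the function name combine_precision_weighted is my own, not from the talk):

import numpy as np

def combine_precision_weighted(x, y, sigma2, tau2):
    # delta*(X, Y) = (tau^2 * X + sigma^2 * Y) / (sigma^2 + tau^2):
    # each estimator is weighted inversely to its variance.
    w1 = tau2 / (sigma2 + tau2)
    return w1 * x + (1.0 - w1) * y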

II. Some remarks on Shrinkage Estimation: Why Shrink? Some Intuition
Let X = (X_1, ..., X_p)' be a random vector in R^p, with E[X_i] = θ_i, Var[X_i] = σ², and Cov(X_i, X_j) = 0 (i ≠ j).
PROBLEM: Estimate θ = (θ_1, ..., θ_p) with loss ||δ − θ||² = Σ_{i=1}^p (δ_i − θ_i)² and risk E[||δ(X) − θ||²].
Consider linear estimators of the form δ_i(X_1, ..., X_p) = (1 − a)X_i, i = 1, ..., p, or δ(X) = (1 − a)X.
Note: a = 0 corresponds to the usual unbiased estimator X, but is it the best choice?

WHAT IS THE BEST a?
Risk of (1 − a)X:
E[Σ_{i=1}^p (δ_i(X) − θ_i)²] = E[Σ_{i=1}^p ((1 − a)X_i − θ_i)²] = Σ_{i=1}^p {Var[(1 − a)X_i] + [E((1 − a)X_i) − θ_i]²}
= p(1 − a)²σ² + a²||θ||² = Q(a).
The best a corresponds to Q'(a) = 0 = −2(1 − a)pσ² + 2a||θ||², i.e., a = pσ²/(pσ² + ||θ||²).
Hence the best linear estimator is (1 − pσ²/(pσ² + ||θ||²))X, which depends on θ. BUT E[||X||²] = pσ² + ||θ||², so we can estimate the best a by pσ²/||X||², and the resulting approximate best linear estimator is (1 − pσ²/||X||²)X.
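A quick Monte Carlo sketch of this derivation (NumPy, with illustrative parameter values of my choosing): it reports Q(a) at the optimal a and compares the risk of X with that of the plug-in estimator (1 − pσ²/||X||²)X.

import numpy as np

rng = np.random.default_rng(0)
p, sigma2, reps = 25, 1.0, 20000
theta = rng.normal(0.0, 2.0, size=p)                     # illustrative true mean vector
a_best = p * sigma2 / (p * sigma2 + theta @ theta)
print("Q(a_best) =", p * (1 - a_best) ** 2 * sigma2 + a_best ** 2 * (theta @ theta))

X = theta + rng.normal(0.0, np.sqrt(sigma2), size=(reps, p))
a_hat = p * sigma2 / np.einsum("ij,ij->i", X, X)         # estimated best a = p*sigma^2/||X||^2
risk_X = np.mean(np.sum((X - theta) ** 2, axis=1))
risk_lin = np.mean(np.sum(((1 - a_hat)[:, None] * X - theta) ** 2, axis=1))
print("risk of X:", risk_X, " risk of plug-in linear estimator:", risk_lin)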

The James-Stein Estimator is (1 − (p − 2)σ²/||X||²)X, which is close to the above. Note that the argument doesn't depend on normality of X.
In the normal case, theory shows that the James-Stein estimator has lower risk than the usual unbiased (UMVUE, MLE, MRE) estimator X, provided that p ≥ 3. In fact, if the true θ is 0, then the risk of the James-Stein estimator is 2σ², which is much less than the risk of X (which is identically pσ²).
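A minimal sketch of the James-Stein estimator with a Monte Carlo check of the θ = 0 case (NumPy; function name and settings are illustrative):

import numpy as np

def james_stein(x, sigma2):
    # Shrink X toward the origin: (1 - (p - 2) * sigma^2 / ||X||^2) * X
    p = x.shape[-1]
    return (1.0 - (p - 2) * sigma2 / np.sum(x * x, axis=-1, keepdims=True)) * x

rng = np.random.default_rng(1)
p, sigma2, reps = 25, 1.0, 20000
X = rng.normal(0.0, 1.0, size=(reps, p))                      # true theta = 0
print(np.mean(np.sum(james_stein(X, sigma2) ** 2, axis=1)))   # close to 2*sigma^2 = 2, far below p*sigma^2 = 25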

A slight extension: Shrinkage toward a fixed point θ_0:
θ_0 + (1 − (p − 2)σ²/||X − θ_0||²)(X − θ_0) = X − (p − 2)σ²(X − θ_0)/||X − θ_0||²
also dominates X and has risk 2σ² if θ = θ_0.
Risk = pσ² − (p − 2)²σ⁴ E[1/||X − θ_0||²] < pσ².

III. Combining Biased and Unbiased Estimates:
Suppose we wish to estimate a vector θ, and we have 2 independent estimators, X and Y. Suppose, e.g., X ~ N_p(θ, σ²I) and Y ~ N_p(θ + η, τ²I), i.e., X is unbiased and Y is biased. Can we effectively combine the information in X and Y to estimate θ?
ONE ANSWER: YES. Shrink the unbiased estimator, X, toward the biased estimator Y. A James-Stein type combined estimator:
δ(X, Y) = Y + (1 − (p − 2)σ²/||X − Y||²)(X − Y) = X − (p − 2)σ²(X − Y)/||X − Y||².
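A minimal sketch of this combined estimator (NumPy; the function name js_combine is my own):

import numpy as np

def js_combine(x, y, sigma2):
    # delta(X, Y) = X - (p - 2) * sigma^2 * (X - Y) / ||X - Y||^2:
    # shrink the unbiased X toward the possibly biased Y.
    p = x.shape[-1]
    d = x - y
    return x - (p - 2) * sigma2 * d / np.sum(d * d, axis=-1, keepdims=True)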

IV. Some Questions/Comments:
1. Risk of δ(X, Y):
R(θ, δ(X, Y)) = E{E[||δ(X, Y) − θ||² | Y]} = pσ² − (p − 2)²σ⁴ E{E[1/||X − Y||² | Y]} = pσ² − (p − 2)²σ⁴ E[1/||X − Y||²]
≤ pσ² − (p − 2)²σ⁴ / E[||X − Y||²]   (by Jensen's inequality)
< pσ² = R(θ, X).
Hence the combined estimator beats X no matter how badly biased Y is.
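A Monte Carlo sketch of the domination claim (illustrative settings; js_combine is the sketch from Section III): even with a severe bias η, the estimated risk of δ(X, Y) stays below pσ².

import numpy as np

def js_combine(x, y, sigma2):
    p = x.shape[-1]
    d = x - y
    return x - (p - 2) * sigma2 * d / np.sum(d * d, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
p, sigma2, tau2, reps = 25, 1.0, 1.0, 20000
theta = np.zeros(p)
eta = np.full(p, 10.0)                                    # deliberately severe bias in Y
X = theta + rng.normal(0.0, np.sqrt(sigma2), size=(reps, p))
Y = theta + eta + rng.normal(0.0, np.sqrt(tau2), size=(reps, p))
risk = np.mean(np.sum((js_combine(X, Y, sigma2) - theta) ** 2, axis=1))
print(risk, "<", p * sigma2)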

2. Why not shrink Y toward X instead of shrinking X toward Y?
Answer: Note that if X and Y are not close together, δ(X, Y) is close to X and not to Y. This is desirable since Y is biased. If we shrank Y toward X, the combined estimator would be close to Y when X and Y are far apart. This is not desirable, again, since Y is biased.

3. How does δ(X, Y) compare to the usual method of combining unbiased estimators, i.e., weighting inversely proportional to the variances, δ*(X, Y) = (τ²X + σ²Y)/(σ² + τ²)?
ANSWER: The risk of the JS combined estimator is slightly greater than the risk of the optimal linear combination when Y is also unbiased. (JS is, in fact, an approximation to the best linear combination.) Hence if Y is unbiased (η = 0), the JS estimator does a bit worse than the usual linear combined estimator. The loss in efficiency (if η = 0) is particularly small when the ratio var(X)/var(Y) is not large and p is large. But it does much better if the bias of Y is significant.
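A sketch of this comparison (illustrative settings, reusing the two functions sketched earlier): as the bias of Y grows, the precision-weighted combination deteriorates while the JS combination remains protected.

import numpy as np

def combine_precision_weighted(x, y, sigma2, tau2):
    return (tau2 * x + sigma2 * y) / (sigma2 + tau2)

def js_combine(x, y, sigma2):
    p = x.shape[-1]
    d = x - y
    return x - (p - 2) * sigma2 * d / np.sum(d * d, axis=-1, keepdims=True)

rng = np.random.default_rng(3)
p, sigma2, tau2, reps = 25, 1.0, 1.0, 10000
theta = np.zeros(p)
for bias in (0.0, 0.5, 2.0):                              # per-coordinate bias of Y
    eta = np.full(p, bias)
    X = theta + rng.normal(0.0, np.sqrt(sigma2), size=(reps, p))
    Y = theta + eta + rng.normal(0.0, np.sqrt(tau2), size=(reps, p))
    mse_lin = np.mean(np.sum((combine_precision_weighted(X, Y, sigma2, tau2) - theta) ** 2, axis=1))
    mse_js = np.mean(np.sum((js_combine(X, Y, sigma2) - theta) ** 2, axis=1))
    print(f"bias {bias}: linear combination {mse_lin:.2f}, JS combination {mse_js:.2f}")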

Risk (MSE) comparison of estimators: X, usual linear combination (δ_comb), and JS combination (δ_{p-2}); dimension p = 25; σ = 1, τ = 1; equal bias for all coordinates.

4. Is there a Bayes/Empirical Bayes connection?
Answer: Yes. The combined estimator can be interpreted as an empirical Bayes estimator under the prior structure η ~ N(0, γ²I), θ ~ N(0, k²I), with k → ∞ (i.e., locally uniform).

5. How does the combined JS estimator compare with the usual JS estimator (i.e., shrinking toward a fixed point)?
Answer: The risk functions cross; neither is uniformly better. Roughly, the combined JS estimator is better than the usual JS estimator if ||η||² + pτ² is small compared to ||θ||².

6. Is there a similar method if we have several different estimators X and Y_i, i = 1, ..., k?
Answer: Yes, multiple shrinkage. Suppose Y_i ~ N(θ + η_i, τ_i²I). A multiple shrinkage estimator of the form
X − (p − 2)σ² [Σ_{i=1}^k (X − Y_i)/||X − Y_i||²] / [Σ_{i=1}^k 1/||X − Y_i||²]
will work and have somewhat similar properties. In particular, it will improve on the unbiased estimator X.

7. How do you handle the unknown scale (σ²) case?
Answer: Replace σ² in the JS combined (and uncombined) estimator by SSE/(df + 2). Note that, interestingly, the scale of Y (τ²) is not needed to calculate the JS combined estimator, but it is needed for the usual linear combined estimator.
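A sketch of the unknown-scale version (the interface is my own; sse and df stand for whatever residual sum of squares and degrees of freedom are available for estimating σ²):

import numpy as np

def js_combine_unknown_scale(x, y, sse, df):
    # Replace sigma^2 by SSE / (df + 2) in the combined James-Stein estimator.
    sigma2_hat = sse / (df + 2)
    p = x.shape[-1]
    d = x - y
    return x - (p - 2) * sigma2_hat * d / np.sum(d * d, axis=-1, keepdims=True)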

8. Is normality of X (and/or Y) essential?
Answer: Whether σ² is known or not, normality of Y is not needed for the combined JS estimator to dominate X. (Independence of X and Y is needed.) Additionally, if σ² is not known, then normality of X is not needed either! That is, in the unknown-scale case, the combined JS estimator dominates the usual unbiased estimator X simultaneously for all spherically symmetric sampling distributions. Also, the shrinkage constant (p − 2)·SSE/(df + 2) is (simultaneously) uniformly best.

V. EXAMPLE: Basal Area per Acre by Stand (Loblolly Pine)
Data:
Company   Number of Stands (p = dim)   Number of Plots
1         47                           653
2         9                            143
3         10                           330
Number of plots/stand: average = 17, range = 5 to 50.
True θ_ij and σ_i² calculated on the basis of all data in all plots (i = company, j = stand).

Simulation: Compare three estimators, X, δ_comb, and δ_JS+, for several different sample sizes, m = 5, 10, 30, 100 (plots/stand).
X = mean of m measured average basal areas
Y = estimated mean based on a linear regression of log(basal area) on log(height), log(number of trees), and (age)^-1
Generally X came in last each time.

MSE(δ_JS+)/MSE(δ_comb) for Company 1 (solid), Company 2 (dashed), and Company 3 (dotted) for various sample sizes (m).

References:
James and Stein (1961), Proceedings of the Fourth Berkeley Symposium
Green and Strawderman (1991), JASA
Green and Strawderman (1990), Forest Science
George (1986), Annals of Statistics
Fourdrinier, Strawderman, and Wells (1998), Annals of Statistics
Fourdrinier, Strawderman, and Wells (2003), Journal of Multivariate Analysis
Maruyama and Strawderman (2005), Annals of Statistics
Fourdrinier and Cellier (1995), Journal of Multivariate Analysis

Some Risk Approximations (Upper Bounds):
The unbiased estimator X: R(θ, X) = pσ²
Usual James-Stein: R(θ, δ_JS(X)) ≈ pσ² − (p − 2)²σ⁴/[||θ||² + pσ²]
James-Stein combination: R(θ, δ_JS(X, Y)) ≈ pσ² − (p − 2)²σ⁴/[||η||² + pσ² + pτ²]
Usual linear combination: R(θ, δ_comb(X, Y)) = pσ² − pσ⁴/[σ² + τ²] + {σ²/[σ² + τ²]}²||η||²
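A small numeric sketch of these approximations (illustrative values p = 25, σ² = τ² = 1, matching the earlier risk comparison; the choices of ||θ||² and ||η||² are mine):

p, sigma2, tau2 = 25, 1.0, 1.0
theta_sq, eta_sq = 0.0, 25.0                 # illustrative ||theta||^2 and ||eta||^2

risk_X = p * sigma2
risk_js = p * sigma2 - (p - 2) ** 2 * sigma2 ** 2 / (theta_sq + p * sigma2)
risk_js_comb = p * sigma2 - (p - 2) ** 2 * sigma2 ** 2 / (eta_sq + p * sigma2 + p * tau2)
risk_lin_comb = (p * sigma2 - p * sigma2 ** 2 / (sigma2 + tau2)
                 + (sigma2 / (sigma2 + tau2)) ** 2 * eta_sq)
print(risk_X, risk_js, risk_js_comb, risk_lin_comb)   # 25.0, 3.84, about 17.9, 18.75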