Explaining the Stein Paradox


Kwong Hiu Yung

1999/06/10

Abstract

This report offers several rationales for the Stein paradox. Sections 1 and 2 define the multivariate normal mean estimation problem and introduce Stein's paradox. Sections 3-8 advance the Galtonian perspective and explain Stein's paradox using regression analysis. Sections 9 and 10 approach Stein's paradox through the conventional empirical Bayes approach. The closing sections 11 and 12 compare admissibility to equivariance and to minimaxity as criteria for simultaneous estimation.

1 Estimating the Mean of a Multivariate Normal

Stein's paradox arises in estimating the mean of a multivariate normal random variable. Let $X_1, X_2, \ldots, X_k$ be independent normal random variables, such that $X_i \sim N(\theta_i, 1)$. All $k$ random variables have a common known variance, but their unknown means differ and vary separately. In other words, $(X_1, X_2, \ldots, X_k) \sim N(\theta, I)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ and $I$ is the $k \times k$ identity matrix. Consider estimating the mean $\theta$ under squared-error loss $L(\theta, d) = \sum_i (\theta_i - d_i)^2 = \|\theta - d\|^2$. A natural and intuitive estimate of $\theta$ would be $X$ itself. Charles Stein showed that this naive estimator $\hat\theta = X$ is not even admissible. For $k \ge 3$, the obvious estimate $X$ is dominated by

$$\hat\theta_{JS} = \left(1 - \frac{k-2}{S}\right) X, \qquad \text{where } S = \sum_i X_i^2. \tag{1}$$

The James-Stein estimator $\hat\theta_{JS}$ shrinks the naive estimate towards zero by a factor $1 - (k-2)/S$, where $S = \sum_i X_i^2$ depends on the other random variables. Although the risk is optimal for $c = k-2$, more generally, for $k \ge 3$ and $0 < c < 2(k-2)$, any estimator of the form

$$\hat\theta_{JS} = \left(1 - \frac{c}{S}\right) X \tag{2}$$

has uniformly smaller risk for all $\theta$. Similarly, for $k \ge 4$ and $0 < c < 2(k-3)$, the family of Efron-Morris estimators

$$\hat\theta_{EM} = \bar X + \left(1 - \frac{c}{S}\right)(X - \bar X), \tag{3}$$

where

$$S = \sum_i (X_i - \bar X)^2 \tag{4}$$

dominates the naive estimator $X$. In this case, the optimal $c = k-3$. The Efron-Morris estimators shrink towards the mean $\bar X$ rather than 0. Because each $X_i$ is independently distributed, estimating each $\theta_i$ as a separate experiment would seem intuitive. Paradoxically, combining observations actually provides a much better estimate of the means. So Stein's paradox states not only that the natural estimator $X$ is not admissible, but also that $X$ is dominated by $\hat\theta_{JS}$, an estimator which combines independent experiments.
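
The dominance claim is easy to check numerically. The following Python sketch (assuming NumPy is available; the mean vector $\theta$, the dimension $k$, and the replication count are arbitrary illustrative choices) estimates the risks of the naive, James-Stein, and Efron-Morris estimators by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)
k, n_rep = 10, 20000
theta = np.linspace(-2.0, 2.0, k)                        # illustrative true means

X = rng.normal(loc=theta, scale=1.0, size=(n_rep, k))    # each row: X ~ N(theta, I)

# Naive estimator: theta_hat = X
risk_naive = np.mean(np.sum((X - theta) ** 2, axis=1))

# James-Stein: shrink towards 0 with c = k - 2 and S = sum_i X_i^2
S_js = np.sum(X ** 2, axis=1, keepdims=True)
theta_js = (1.0 - (k - 2) / S_js) * X
risk_js = np.mean(np.sum((theta_js - theta) ** 2, axis=1))

# Efron-Morris: shrink towards the sample mean with c = k - 3 and S = sum_i (X_i - Xbar)^2
Xbar = X.mean(axis=1, keepdims=True)
S_em = np.sum((X - Xbar) ** 2, axis=1, keepdims=True)
theta_em = Xbar + (1.0 - (k - 3) / S_em) * (X - Xbar)
risk_em = np.mean(np.sum((theta_em - theta) ** 2, axis=1))

print(risk_naive, risk_js, risk_em)   # naive risk is about k; shrinkage risks are smaller

Under squared-error loss the estimated risk of the naive estimator sits near $k$, while both shrinkage estimators should show a visibly smaller estimated risk for this setup.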

2 Stein's Paradox Violates Intuition

On the surface, Stein's paradox seems unacceptable. Intuitively, each independent experiment should not affect the other experiments. For example, consider $X_i$ as the test score of student $i$ in a class. Suppose that $X_i \sim N(\theta_i, 1)$, where $\theta_i$ is the theoretical IQ of student $i$. Given that each student takes his test independently, the natural estimate of each student's IQ would be his individual test score. Since each student took the test individually and separately, justice dictates that each student be judged on his personal performance. In particular, the natural estimate of $\theta_i$ would intuitively be $X_i$. Yet according to Stein's paradox, the performance of the entire class should be factored into the estimate of each individual's IQ. Rather than estimate Tom's IQ by his personal test score, Stein's estimator shrinks Tom's IQ by a factor based on the standard deviation of all test scores. Stein's paradox asserts that making use of the test scores of the whole class provides a better estimate of each individual's IQ.

3 Galtonian Perspective on Stein's Paradox

Consider the pairs $\{(X_i, \theta_i) : i = 1, 2, \ldots, k\}$, where the $X_i$ are known but the $\theta_i$ are unknown. Since $X_i \sim N(\theta_i, 1)$, write $X_i = \theta_i + Z_i$, where $Z_i \sim N(0, 1)$. Because the $X_i$ are mutually independent, the $Z_i$ are also mutually independent. Francis Galton over a century ago offered a perspective that can shed light on Stein's paradox. On a graph with $X$ as the horizontal axis and $\theta$ as the vertical axis, the point $(X_i, \theta_i)$ would lie near the diagonal line $\theta = X$. Because of the error term $Z_i$, the point $(X_i, \theta_i)$ is horizontally shifted from the diagonal line $\theta = X$ by the random distance $Z_i$. Since $E\bar X = \bar\theta$ and $\mathrm{Var}\,\bar X = 1/k$, the mean $\bar X$ should be centered about $\bar\theta$. In other words, $(\bar X, \bar\theta)$ should lie near the center diagonal line $\theta = X$ in the $X/\theta$ coordinate system.

4 Regression Analysis Picture

Given an $X_i$, the corresponding $\theta_i$ can be estimated using the standard regression $\hat\theta = E[\theta \mid X]$. The naive estimate $\hat\theta_i = X_i$ apparently agrees with $\hat X = E[X \mid \theta] = \theta$, the regression line of $X$ on $\theta$. However, the regression line of $X$ on $\theta$ is not appropriate for estimating $\theta$ on the basis of $X$. Instead, the opposite regression line should be used. Using the regression of $\theta$ on $X$ would provide a more meaningful estimate of $\theta$. Since the distribution of $\theta$ given $X$ is unknown, the regression line $\hat\theta = E[\theta \mid X]$ must be approximated.

5 Linear Regression Motivates the Efron-Morris Estimator

The optimal estimator $\hat\theta = E[\theta \mid X]$ is unavailable because the distribution of $\theta$ given $X$ is unknown. Instead, restrict attention to only linear estimators of the form

$$\hat\theta_i = a + b X_i, \qquad i = 1, 2, \ldots, k. \tag{5}$$

Minimizing the loss function $L(\theta, \hat\theta) = \sum_i (\theta_i - \hat\theta_i)^2$ is equivalent to finding the least squares fit to $\{(X_i, \theta_i) : i = 1, 2, \ldots, k\}$ in the $X/\theta$ coordinate system. The usual least squares fit would be $\hat\theta_i = \bar\theta + \hat\beta (X_i - \bar X)$, where

$$\hat\beta = \frac{\sum_i (X_i - \bar X)(\theta_i - \bar\theta)}{\sum_j (X_j - \bar X)^2} \tag{6}$$

if the points $\{(X_i, \theta_i) : i = 1, 2, \ldots, k\}$ were actually known. Since the $\theta_i$ are unknown, the points $\{(X_i, \theta_i) : i = 1, 2, \ldots, k\}$ are not available, and so the least squares coefficients cannot be computed exactly. Instead, consider estimating the coefficients using only the available data $X_1, X_2, \ldots, X_k$. Since $X \sim N(\theta, I)$ is a full-rank exponential family, $X$ is complete sufficient for $\theta$. So $X$ is not only an unbiased estimator of $\theta$ but also the UMVU estimator of $\theta$. The numerator of $\hat\beta$ can be estimated by $\sum_i (X_i - \bar X)^2 - (k-1)$, since $E[\sum_i (X_i - \bar X)^2 - (k-1)] = E[\sum_i (X_i - \bar X)(\theta_i - \bar\theta)] = \sum_i (\theta_i - \bar\theta)^2$. An estimate of $\hat\beta$ is therefore

$$\frac{\sum_i (X_i - \bar X)^2 - (k-1)}{\sum_j (X_j - \bar X)^2} = 1 - \frac{k-1}{\sum_j (X_j - \bar X)^2}. \tag{7}$$

The estimate of the least squares linear fit then becomes

$$\hat\theta_{EM} = \bar X + \left(1 - \frac{k-1}{\sum_j (X_j - \bar X)^2}\right)(X - \bar X), \tag{8}$$

which is just an Efron-Morris estimator. Although the constant $c = k-1$ is not the optimal choice within the Efron-Morris family, the estimator suggested by the estimated least squares linear fit nonetheless dominates the naive estimator $X$.
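
The plug-in slope in (7) can be compared with the oracle least squares slope that would be available if the $\theta_i$ were known. A small Python sketch (assuming NumPy; the distribution used to generate the illustrative $\theta_i$ is an arbitrary choice and not part of the argument above):

import numpy as np

rng = np.random.default_rng(1)
k = 50
theta = rng.normal(loc=3.0, scale=2.0, size=k)   # illustrative unknown means
X = rng.normal(loc=theta, scale=1.0)             # X_i ~ N(theta_i, 1)

# Oracle least squares slope of theta on X (would require knowing theta), as in (6)
Xbar, theta_bar = X.mean(), theta.mean()
beta_oracle = np.sum((X - Xbar) * (theta - theta_bar)) / np.sum((X - Xbar) ** 2)

# Plug-in slope computed from the data alone, as in (7)
beta_hat = 1.0 - (k - 1) / np.sum((X - Xbar) ** 2)

# Estimated least squares fit, i.e. the Efron-Morris form (8)
theta_em = Xbar + beta_hat * (X - Xbar)

print(beta_oracle, beta_hat)

For moderately large $k$, the data-only slope $1 - (k-1)/\sum_j (X_j - \bar X)^2$ tends to sit close to the oracle slope, which is the sense in which (8) approximates the least squares fit of $\theta$ on $X$.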

6 Linear Regression Motivates the James-Stein Estimator

The James-Stein estimator can also be found through a similar analysis. In the $X/\theta$ coordinate system, consider using the regression of $\theta$ on $X$ to get an estimator $\hat\theta = E[\theta \mid X]$. Once again the distribution of $\theta$ given $X$ is unavailable. This time, restrict attention to only estimators proportional to $X$, of the form

$$\hat\theta_i = b X_i. \tag{9}$$

Minimizing the loss function $L(\theta, \hat\theta) = \sum_i (\theta_i - \hat\theta_i)^2$ is equivalent to finding the least squares fit through the origin. The least squares estimator is

$$\hat\beta = \frac{\sum_i \theta_i X_i}{\sum_i X_i^2}. \tag{10}$$

Since $E[\sum_i X_i^2 - k] = E[\sum_i \theta_i X_i] = \sum_i \theta_i^2$, the quantity $\sum_i X_i^2 - k$ is an estimate of the numerator of $\hat\beta$, and the least squares fit through the origin can be estimated by

$$\hat\theta_{JS} = \left(1 - \frac{k}{\sum_i X_i^2}\right) X, \tag{11}$$

which has the form of the James-Stein estimator with $c = k$ instead of the optimal $c = k-2$.

7 Shrinkage Estimators Dominate

The regression analysis picture provides a simple route toward the shrinkage estimators of James-Stein and Efron-Morris. The Galtonian perspective also explains why the shrinkage estimators are preferred over the naive estimator. The naive estimator $X$ corresponds to an estimate of the regression line $\hat X = E[X \mid \theta] = \theta$. To predict $\theta$ on the basis of $X$, however, the proper regression line to use is $\hat\theta = E[\theta \mid X]$. Shrinkage estimators are just estimates of the proper regression line. When $k = 1, 2$, the two regression lines meet at each point $(X_i, \theta_i)$. In this case, the James-Stein shrinkage estimator does not dominate the usual estimator. The superiority of the James-Stein estimator becomes apparent only when the two regression lines differ, for $k \ge 3$.

8 Minimizing the Risk Directly

From the Galtonian viewpoint, Stein's estimator is just an estimate of the least squares fit to the regression of $\theta$ on $X$ in the $X/\theta$ coordinate system. The least squares fit minimizes the loss function, which in turn minimizes the risk. A similar strategy follows from minimizing the risk directly. Among the class of estimators $\hat\theta = bX$, the risk function

$$R(\theta, bX) = E_\theta\left[\sum_i (\theta_i - b X_i)^2\right] \tag{12}$$

$$= k b^2 + (1-b)^2 \|\theta\|^2 \tag{13}$$

achieves its minimum at

$$b = \frac{\|\theta\|^2}{k + \|\theta\|^2} \tag{14}$$

to give the minimum risk

$$R(\theta, bX) = \frac{k \|\theta\|^2}{k + \|\theta\|^2} < k. \tag{15}$$

In actuality, $\theta$ is the unknown estimand. Since

$$E\left[\sum_i X_i^2\right] = \sum_i (\theta_i^2 + 1) = \|\theta\|^2 + k, \tag{16}$$

estimating the numerator and denominator of $b$ separately gives

$$\hat b = \frac{\sum_i X_i^2 - k}{\sum_i X_i^2} = 1 - \frac{k}{\sum_i X_i^2}. \tag{17}$$

As before, estimating the optimal $b$ leads to the shrinking factor.
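
The risk identity (13) and the form of the optimal $b$ in (14) can be verified numerically. A short Python sketch (assuming NumPy; $\theta$ and $b$ are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(2)
k = 8
theta = np.array([1.5, -0.5, 2.0, 0.0, -1.0, 0.5, 1.0, -2.0])   # illustrative theta
b = 0.7                                                          # arbitrary shrinkage factor

# Monte Carlo risk of the estimator bX
X = rng.normal(loc=theta, scale=1.0, size=(200000, k))
risk_mc = np.mean(np.sum((b * X - theta) ** 2, axis=1))

# Closed form (13): R(theta, bX) = k b^2 + (1 - b)^2 ||theta||^2
risk_formula = k * b ** 2 + (1 - b) ** 2 * np.sum(theta ** 2)

# Optimal b from (14) and its risk from (15)
b_opt = np.sum(theta ** 2) / (k + np.sum(theta ** 2))
risk_opt = k * np.sum(theta ** 2) / (k + np.sum(theta ** 2))

print(risk_mc, risk_formula, b_opt, risk_opt)

The Monte Carlo risk should agree with $k b^2 + (1-b)^2 \|\theta\|^2$ up to simulation noise.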

9 Empirical Bayes Approach to the James-Stein Estimator

The empirical Bayes picture offers an alternative explanation for Stein's paradox. From the Bayesian standpoint, consider

$$X_i \mid \theta_i \sim N(\theta_i, 1) \tag{18}$$

$$\theta_i \sim N(0, \tau^2), \tag{19}$$

where each $i$ is an independent experiment. Then the posterior density of $\theta_i$ given $X_i$ is

$$\theta_i \mid X_i \sim N((1-B)X_i,\, 1-B), \qquad \text{where } B = \frac{1}{\tau^2 + 1}. \tag{20}$$

Under squared-error loss, the Bayes estimator is $\hat\theta_i = E[\theta_i \mid X_i] = (1-B)X_i$. In the context of the original problem, however, the Bayes estimator is unsatisfactory because $\tau^2$ is unknown. Since $X_i \mid \theta_i \sim N(\theta_i, 1)$, write $X_i = \theta_i + Z_i$, where $Z_i$ is independent of $\theta_i$ and $Z_i \sim N(0, 1)$. Then $X_i \sim N(0, \tau^2 + 1)$. The natural estimator of $B = 1/(\tau^2 + 1)$ is therefore

$$\hat B = \frac{c}{S}, \qquad \text{where } S = \sum_i X_i^2. \tag{21}$$

Thus, the empirical Bayes argument gives

$$\hat\theta_i = (1 - \hat B) X_i = \left(1 - \frac{c}{S}\right) X_i, \tag{22}$$

which has the form of the James-Stein shrinkage estimator.

10 Empirical Bayes Approach to the Efron-Morris Estimator

The empirical Bayes picture also provides a route towards the Efron-Morris estimator. From the Bayesian standpoint, consider independent experiments

$$X_i \mid \theta_i \sim N(\theta_i, 1) \tag{23}$$

$$\theta_i \sim N(\mu, \tau^2), \tag{24}$$

which in turn give

$$\theta_i \mid X_i \sim N(\mu + (1-B)(X_i - \mu),\, 1-B) \tag{25}$$

$$X_i \sim N\left(\mu, \frac{1}{B}\right), \qquad \text{where } B = \frac{1}{\tau^2 + 1}. \tag{26}$$

Using $\hat\mu = \bar X$ and $\hat B = c / \sum_j (X_j - \bar X)^2$ gives the Efron-Morris estimator

$$\hat\theta_i = \bar X + \left(1 - \frac{c}{\sum_j (X_j - \bar X)^2}\right)(X_i - \bar X). \tag{27}$$
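
A quick simulation makes the empirical Bayes argument concrete. In the following Python sketch (assuming NumPy; $\tau$ and $k$ are illustrative choices), the $\theta_i$ are drawn from the $N(0, \tau^2)$ prior and the data-based shrinkage factor with $c = k-2$ is compared with the oracle $B = 1/(\tau^2 + 1)$:

import numpy as np

rng = np.random.default_rng(3)
k, tau = 100, 1.5
theta = rng.normal(loc=0.0, scale=tau, size=k)   # theta_i ~ N(0, tau^2)
X = rng.normal(loc=theta, scale=1.0)             # X_i | theta_i ~ N(theta_i, 1)

B_true = 1.0 / (tau ** 2 + 1.0)                  # oracle shrinkage factor
B_hat = (k - 2) / np.sum(X ** 2)                 # empirical Bayes estimate with c = k - 2

bayes_rule = (1.0 - B_true) * X                  # posterior mean, tau known
eb_rule = (1.0 - B_hat) * X                      # James-Stein form (22)

print(B_true, B_hat)
print(np.sum((bayes_rule - theta) ** 2), np.sum((eb_rule - theta) ** 2))

Since $\sum_i X_i^2$ is distributed as $(\tau^2 + 1)$ times a $\chi^2_k$ variable, $(k-2)/\sum_i X_i^2$ is an unbiased estimate of $B$, which is one way to see why the James-Stein choice $c = k-2$ appears naturally here.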

11 Equivariance and Minimaxity in Simultaneous Estimation

Consider $X_1 \sim N(\theta_1, 1)$, with unknown estimand $\theta_1 \in \mathbb{R}$. The estimator $\hat\theta_1 = X_1$ is invariant under the translation group $G_1 = \{g_a : g_a X_1 = X_1 + a\}$. The class of all equivariant estimators has the form $X_1 + b$. Under the invariant loss function $L_1(\theta_1, d_1) = (\theta_1 - d_1)^2$, $\hat\theta_1 = X_1$ is the minimum risk equivariant estimator because the normal distribution is symmetric about its mean. Now generalize to $X \sim N(\theta, I)$, where the unknown estimand $\theta \in \mathbb{R}_1 \times \mathbb{R}_2 \times \cdots \times \mathbb{R}_k$ has components varying separately. Consider the translation group $G = G_1 \times G_2 \times \cdots \times G_k$. The class of all equivariant estimators has the form $X + b$, where the constant $b \in \mathbb{R}^k$. Under the invariant loss function $L(\theta, d) = \sum_i L_i(\theta_i, d_i) = \|\theta - d\|^2$, the minimum risk equivariant estimator is $X$, again because the normal distribution is symmetric about its mean.

Now reconsider the one-dimensional estimation problem under the minimax criterion. Under the one-dimensional squared-error loss function $L_1(\theta_1, d_1) = (\theta_1 - d_1)^2$, the Bayes estimator corresponding to the prior $\theta_1 \sim N(\mu, b^2)$ is

$$\hat\theta_1 = \frac{X_1 + \mu/b^2}{1 + 1/b^2}, \tag{28}$$

with posterior risk $r_1 = 1/(1 + 1/b^2)$. As $b^2 \to \infty$, $r_1 \to 1$. Since $R(\theta_1, X_1) = E_{\theta_1}(\theta_1 - X_1)^2 = 1$ for all $\theta_1$, the usual least favorable sequence argument shows that $X_1$ is minimax for estimating $\theta_1$.

For the multivariate estimation problem, take the multivariate prior $\theta \sim N(\mu, b^2 I)$. Then under the multivariate squared-error loss function $L(\theta, d) = \sum_i L_i(\theta_i, d_i) = \|\theta - d\|^2$, the Bayes estimator is $\hat\theta = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$, where

$$\hat\theta_i = \frac{X_i + \mu_i/b^2}{1 + 1/b^2}. \tag{29}$$

The posterior risk becomes $r = \sum_i r_i = k/(1 + 1/b^2)$. As $b^2 \to \infty$, $r \to k$. Since $R(\theta, X) = E_\theta \sum_i (\theta_i - X_i)^2 = k$ for all $\theta$, the vector $X$ is minimax for $\theta$. The above arguments for equivariance and for minimaxity apply to many other simultaneous estimation situations. Under quite general assumptions, minimum risk equivariance and minimaxity both extend by components.

12 Conclusion: Paradox of Admissibility in Simultaneous Estimation

Unlike minimum risk equivariance and minimaxity, the admissibility of component estimators does not extend to the admissibility of the simultaneous estimator. Stein's paradox points out clearly that the admissibility of $X_1$ does not imply the admissibility of the vector $X$. The Galtonian picture and the Bayesian picture offer insight into why admissibility does not extend by components. In the Galtonian perspective, the independent samples $X_i$ are coupled through the regression of $\{(X_i, \theta_i)\}$ in the $X/\theta$ coordinate system. As is typical in regression analysis, the prediction of any single point is influenced by its neighboring points. In the Bayes perspective, the independent experiments $X_i \mid \theta_i$ are coupled through the common prior for all the $\theta_i$. In estimating the common variance of the $\theta_i$, all the independent samples $X_i$ are pooled together to provide a more accurate empirical Bayes estimate.

Minimum risk equivariance and minimaxity are both strong optimality criteria. Minimum risk equivariance with respect to a transitive group is a concrete property which defines a total ordering on the space of equivariant estimators. Minimaxity is also a concrete property which defines a total ordering on the space of estimators. So imposing minimum risk equivariance or minimaxity on component estimators is usually sufficient to guarantee that the same concrete property will hold over the product space. Admissibility is not a concrete optimality criterion. Since the risk functions of two estimators may cross, comparison of risk functions in their entirety does not define a total ordering on the space of estimators. Because admissibility is a weak criterion representing the absence of optimality, the product of admissible estimators does not guarantee admissibility. On the other hand, inadmissibility is a concrete optimality criterion. So in general, the product of inadmissible estimators will remain inadmissible.

13 References

1. Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. Scientific American, 236(5), 119-127.
2. Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York.
3. Stigler, S. M. (1990). The 1988 Neyman Memorial Lecture: A Galtonian perspective on shrinkage estimators. Statistical Science, 5(1), 147-156.