Explaining the Stein Paradox

Kwong Hu Yung

1999/06/10

Abstract

This report offers several rationales for the Stein paradox. Sections 1 and 2 define the multivariate normal mean estimation problem and introduce Stein's paradox. Sections 3–8 advance the Galtonian perspective and explain Stein's paradox using regression analysis. Sections 9 and 10 approach Stein's paradox through the conventional empirical Bayes approach. The closing sections 11 and 12 compare admissibility to equivariance and to minimaxity as criteria for simultaneous estimation.

1 Estimating the Mean of a Multivariate Normal

Stein's paradox arises in estimating the mean of a multivariate normal random variable. Let X_1, X_2, ..., X_k be independent normal random variables, such that X_i ~ N(θ_i, 1). All k random variables have a common known variance, but their unknown means differ and vary separately. In other words, (X_1, X_2, ..., X_k) ~ N(θ, I), where θ = (θ_1, θ_2, ..., θ_k) and I is the k × k identity matrix. Consider estimating the mean θ under squared-error loss L(θ, d) = Σ_i (θ_i − d_i)² = ‖θ − d‖². A natural and intuitive estimate of θ would be X itself. Charles Stein showed that this naïve estimator θ̂ = X is not even admissible. For k ≥ 3, the obvious estimate X is dominated by

    θ̂_JS = (1 − (k − 2)/S) X, where S = Σ_j X_j².    (1)

The James-Stein estimator θ̂_JS shrinks the naïve estimate towards zero by a factor 1 − (k − 2)/S, where S = Σ_j X_j² depends on the other random variables. Although the choice c = k − 2 is optimal for the risk, more generally, for k ≥ 3 and 0 < c < 2(k − 2), any estimator of the form

    θ̂_JS = (1 − c/S) X    (2)

has uniformly smaller risk than X for all θ. Similarly, for k ≥ 4 and 0 < c < 2(k − 3), the family of Efron-Morris estimators

    θ̂_EM = X̄ + (1 − c/S)(X − X̄),    (3)

where

    S = Σ_j (X_j − X̄)²,    (4)

dominates the naïve estimator X. In this case, the optimal c = k − 3. The Efron-Morris estimators shrink towards the mean X̄ rather than 0. Because each X_i is independently distributed, estimating each θ_i in a separate experiment would seem intuitive. Paradoxically, combining observations actually provides a much better estimate of the means.
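The domination claims in (1)–(4) are easy to check by simulation. The sketch below compares the risks of the naïve, James-Stein, and Efron-Morris estimators under squared-error loss; the choices of k, the true means θ, and the trial count are arbitrary illustrations, not part of the theory.

```python
import numpy as np

# Monte Carlo check of (1)-(4): for this arbitrary theta, both shrinkage
# estimators have smaller risk than the naive estimate X.
rng = np.random.default_rng(0)
k = 10
theta = np.linspace(-2.0, 2.0, k)       # arbitrary true means
n_trials = 200_000

X = rng.normal(theta, 1.0, size=(n_trials, k))

# James-Stein: shrink toward 0 with the optimal c = k - 2.
S_js = np.sum(X**2, axis=1, keepdims=True)
theta_js = (1.0 - (k - 2) / S_js) * X

# Efron-Morris: shrink toward the grand mean with the optimal c = k - 3.
Xbar = X.mean(axis=1, keepdims=True)
S_em = np.sum((X - Xbar) ** 2, axis=1, keepdims=True)
theta_em = Xbar + (1.0 - (k - 3) / S_em) * (X - Xbar)

def risk(est):
    # Average total squared-error loss over the simulated trials.
    return np.mean(np.sum((est - theta) ** 2, axis=1))

print(f"naive:        {risk(X):.3f}")          # ~ k = 10
print(f"James-Stein:  {risk(theta_js):.3f}")   # strictly smaller
print(f"Efron-Morris: {risk(theta_em):.3f}")   # strictly smaller
```

The gap between the naïve risk (which equals k exactly) and the shrinkage risks is largest when ‖θ‖ is small and shrinks as ‖θ‖ grows, but it never reverses.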
So Stein's paradox states not only that the natural estimator X is not admissible, but also that X is dominated by θ̂_JS, an estimator which combines independent experiments.

2 Stein's Paradox Violates Intuition

On the surface, Stein's paradox seems unacceptable. Intuitively, each independent experiment should not affect the other experiments. For example, consider
X_i as the test score of student i in a class. Suppose that X_i ~ N(θ_i, 1), where θ_i is the theoretical IQ of student i. Given that each student takes his test independently, the natural estimate of each student's IQ would be his individual test score. Since each student took the test individually and separately, justice dictates that each student be judged on his personal performance. In particular, the natural estimate of θ_i would intuitively be X_i. Yet according to Stein's paradox, the performance of the entire class should be factored into the estimate of each individual's IQ. Rather than estimate Tom's IQ by his personal test score, Stein's estimator shrinks Tom's IQ estimate by a factor based on the standard deviation of all test scores. Stein's paradox asserts that making use of the test scores of the whole class provides a better estimate of each individual's IQ.

3 Galtonian Perspective on Stein's Paradox

Consider the pairs {(X_i, θ_i) : i = 1, 2, ..., k}, where the X_i are known but the θ_i are unknown. Since X_i ~ N(θ_i, 1), write X_i = θ_i + Z_i, where Z_i ~ N(0, 1). Because the X_i are mutually independent, the Z_i are also mutually independent. Francis Galton over a century ago offered a perspective that can shed light on Stein's paradox. On a graph with X as the horizontal axis and θ as the vertical axis, the point (X_i, θ_i) would lie near the diagonal line θ = X. Because of the error term Z_i, the point (X_i, θ_i) is horizontally shifted from the diagonal line θ = X by the random distance Z_i. Since E X̄ = θ̄ and Var X̄ = 1/k, the mean X̄ should be centered about θ̄. In other words, (X̄, θ̄) should lie near the center of the diagonal line θ = X in the X/θ coordinate system.

4 Regression Analysis Picture

Given an X_i, the corresponding θ_i can be estimated using the standard regression θ̂_i = E[θ_i | X_i]. The naïve estimate θ̂_i = X_i apparently agrees with X̂_i = E[X_i | θ_i] = θ_i, the regression line of X on θ. However, the regression line of X on θ is not appropriate for estimating θ on the basis of X. Instead the opposite regression line should be used. Using the regression of θ on X would provide a more meaningful estimate of θ.
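The asymmetry between the two regression lines can be seen in a small simulation. Treating the θ_i as known purely to draw the Galtonian picture, the fitted regression of θ on X has slope noticeably below 1 (a shrinkage line), while the regression of X on θ has slope near 1. The sample size and the spread of the means below are arbitrary choices for illustration.

```python
import numpy as np

# Galtonian picture: pairs (X_i, theta_i) with X_i = theta_i + Z_i.
rng = np.random.default_rng(1)
k = 5_000
theta = rng.normal(0.0, 1.0, size=k)        # spread of true means (arbitrary)
X = theta + rng.normal(0.0, 1.0, size=k)    # observed scores

# Least squares slope of the regression of theta on X (the proper line).
slope_theta_on_X = (np.sum((X - X.mean()) * (theta - theta.mean()))
                    / np.sum((X - X.mean()) ** 2))

# Least squares slope of the regression of X on theta.
slope_X_on_theta = (np.sum((theta - theta.mean()) * (X - X.mean()))
                    / np.sum((theta - theta.mean()) ** 2))

print(f"theta on X: {slope_theta_on_X:.2f}")   # well below 1: shrinkage
print(f"X on theta: {slope_X_on_theta:.2f}")   # close to 1
```

With the unit spread chosen here the theta-on-X slope is near 1/2, since the error Z inflates the variance of X relative to its covariance with θ.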
Since the distribution of θ_i given X_i is unknown, the regression line θ̂_i = E[θ_i | X_i] must be approximated.

5 Linear Regression Motivates the Efron-Morris Estimator

The optimal estimator θ̂_i = E[θ_i | X_i] is unavailable because the distribution of θ_i given X_i is unknown. Instead, restrict attention to only linear estimators of the form

    θ̂_i = a + bX_i,  i = 1, 2, ..., k.    (5)

Minimizing the loss function L(θ, θ̂) = Σ_i (θ_i − θ̂_i)² is equivalent to finding the least squares fit to {(X_i, θ_i) : i = 1, 2, ..., k} in the X/θ coordinate system. The usual least squares fit would be θ̂_i = θ̄ + β̂(X_i − X̄), where

    β̂ = Σ_i (X_i − X̄)(θ_i − θ̄) / Σ_j (X_j − X̄)²    (6)

if the points {(X_i, θ_i) : i = 1, 2, ..., k} were actually known. Since the θ_i are unknown, the points {(X_i, θ_i) : i = 1, 2, ..., k} are not available, and so the least squares coefficients cannot be computed exactly. Instead, consider estimating the coefficients using only the available data X_1, X_2, ..., X_k. Since X ~ N(θ, I) is a full-rank exponential family, X is complete sufficient for θ. So X is not only an unbiased estimator of θ but also the UMVU estimator of θ. The numerator of β̂ can be estimated by Σ_i (X_i − X̄)² − (k − 1), since E[Σ_i (X_i − X̄)² − (k − 1)] = E[Σ_i (X_i − X̄)(θ_i − θ̄)] = Σ_i (θ_i − θ̄)². An estimate of β̂ would be

    [Σ_j (X_j − X̄)² − (k − 1)] / Σ_j (X_j − X̄)² = 1 − (k − 1)/Σ_j (X_j − X̄)².    (7)
The estimate of the least squares linear fit becomes

    θ̂_EM = X̄ + (1 − (k − 1)/Σ_j (X_j − X̄)²)(X_i − X̄),    (8)

which is just an Efron-Morris estimator. Although the constant c = k − 1 is not the optimal choice among the Efron-Morris family, the estimator suggested by the estimated least squares linear fit nonetheless dominates the naïve estimator X.

6 Linear Regression Motivates the James-Stein Estimator

The James-Stein estimator can also be found through a similar analysis. In the X/θ coordinate system, consider using the regression of θ on X to get an estimator θ̂_i = E[θ_i | X_i]. Once again the distribution of θ_i given X_i is unavailable. This time, restrict attention to only estimators proportional to X_i, of the form

    θ̂_i = bX_i.    (9)

Minimizing the loss function L(θ, θ̂) = Σ_i (θ_i − θ̂_i)² is equivalent to finding the least squares fit through the origin. The least squares estimator is

    β̂ = Σ_i θ_i X_i / Σ_i X_i².    (10)

Since E[Σ_i X_i² − k] = E[Σ_i θ_i X_i] = Σ_i θ_i², the quantity Σ_i X_i² − k is an estimate of the numerator of β̂, and the least squares fit through the origin can be estimated by

    θ̂_JS = (1 − k/Σ_i X_i²) X_i,    (11)

which has the form of the James-Stein estimator with c = k instead of the optimal c = k − 2.

7 Shrinkage Estimators Dominate

The regression analysis picture provides a simple route toward the shrinkage estimators of James-Stein and Efron-Morris. The Galtonian perspective also explains why the shrinkage estimators are preferred over the naïve estimator. The naïve estimator X corresponds to an estimate of the regression line X̂ = E[X | θ] = θ. To predict θ on the basis of X, however, the proper regression line to use is θ̂ = E[θ | X]. Shrinkage estimators are just estimates of the proper regression line. When k = 1, 2, the two regression lines meet at each point (X_i, θ_i). In this case, the James-Stein shrinkage estimator does not dominate the usual estimator. The superiority of the James-Stein estimator becomes apparent only when the two regression lines differ, for k ≥ 3.

8 Minimizing the Risk Directly

From the Galtonian viewpoint, Stein's estimator is just an estimate of the least squares fit to the regression of θ on X in the X/θ coordinate system.
The least squares fit minimizes the loss function, which in turn minimizes the risk. A similar strategy follows from minimizing the risk directly. Among the class of estimators θ̂ = bX, the risk function

    R(θ, bX) = E_θ[Σ_i (θ_i − bX_i)²]    (12)
             = kb² + (1 − b)²‖θ‖²    (13)

achieves its minimum at

    b_θ = ‖θ‖²/(k + ‖θ‖²)    (14)

to give the minimum risk

    R(θ, b_θ X) = k‖θ‖²/(k + ‖θ‖²) < k.    (15)

In actuality, θ is the unknown estimand. Since

    E[Σ_i X_i²] = Σ_i (θ_i² + 1) = ‖θ‖² + k,    (16)
estimating the numerator and denominator of b_θ separately gives

    b̂ = (Σ_i X_i² − k)/Σ_i X_i² = 1 − k/Σ_i X_i².    (17)

As before, estimating the optimal b leads to the shrinking factor.

9 Empirical Bayes Approach to the James-Stein Estimator

The empirical Bayes picture offers an alternative explanation for Stein's paradox. From the Bayesian standpoint, consider

    X_i | θ_i ~ N(θ_i, 1)    (18)
    θ_i ~ N(0, τ²),    (19)

where each i is an independent experiment. Then the posterior density of θ_i given X_i is

    θ_i | X_i ~ N((1 − B)X_i, 1 − B), where B = 1/(τ² + 1).    (20)

Under squared-error loss, the Bayes estimator is θ̂_i = E[θ_i | X_i] = (1 − B)X_i. In the context of the original problem, however, the Bayes estimator is unsatisfactory because τ² is unknown. Since X_i | θ_i ~ N(θ_i, 1), write X_i = θ_i + Z_i, where Z_i is independent of θ_i and Z_i ~ N(0, 1). Then X_i ~ N(0, τ² + 1). The natural estimator of B = 1/(τ² + 1) is therefore

    B̂ = c/S, where S = Σ_j X_j².    (21)

Thus, the empirical Bayes argument gives

    θ̂_i = (1 − B̂)X_i = (1 − c/S)X_i,    (22)

which has the form of the James-Stein shrinkage estimator.

10 Empirical Bayes Approach to the Efron-Morris Estimator

The empirical Bayes picture also provides a route towards the Efron-Morris estimator. From the Bayesian standpoint, consider independent experiments

    X_i | θ_i ~ N(θ_i, 1)    (23)
    θ_i ~ N(µ, τ²),    (24)

which in turn give

    θ_i | X_i ~ N(µ + (1 − B)(X_i − µ), 1 − B)    (25)
    X_i ~ N(µ, 1/B), where B = 1/(τ² + 1).    (26)

Using µ̂ = X̄ and B̂ = c/Σ_j (X_j − X̄)² gives the Efron-Morris estimator

    θ̂_i = X̄ + (1 − c/Σ_j (X_j − X̄)²)(X_i − X̄).    (27)

11 Equivariance and Minimaxity in Simultaneous Estimation

Consider X_1 ~ N(θ_1, 1), with unknown estimand θ_1 ∈ R. The estimator θ̂_1 = X_1 is equivariant under the translation group G_1 = {g_a : g_a X_1 = X_1 + a}. The class of all equivariant estimators has the form X_1 + b. Under the invariant loss function L_1(θ_1, d_1) = (θ_1 − d_1)², θ̂_1 = X_1 is the minimum risk equivariant estimator because the normal distribution is symmetric about its mean. Now generalize to X ~ N(θ, I), where the unknown estimand θ ∈ R_1 × R_2 × ... × R_k has components varying separately. Consider the translation group G = G_1 × G_2 × ... × G_k. The class of all equivariant estimators has the form X + b, where the constant b ∈ R^k.
Under the invariant loss function L(θ, d) = Σ_i L_i(θ_i, d_i) = ‖θ − d‖², the minimum risk equivariant estimator is X, again because the normal distribution is symmetric about its mean.
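The one-dimensional minimum risk equivariance claim is easy to verify numerically: the risk of X_1 + b does not depend on θ_1 and equals 1 + b², so b = 0 is the minimum risk choice. A minimal sketch, with arbitrary grids of θ_1 and b values:

```python
import numpy as np

# Risk of the location-equivariant estimator X_1 + b for X_1 ~ N(theta_1, 1):
# R(theta_1, X_1 + b) = E[(theta_1 - X_1 - b)^2] = 1 + b^2, independent of theta_1,
# so b = 0 gives the minimum risk equivariant estimator.
rng = np.random.default_rng(3)
n_trials = 500_000

risks = {}
for theta_1 in [-5.0, 0.0, 3.0]:          # risk should not depend on theta_1
    X1 = rng.normal(theta_1, 1.0, size=n_trials)
    for b in [-1.0, 0.0, 0.5]:
        risks[(theta_1, b)] = np.mean((theta_1 - (X1 + b)) ** 2)   # ~ 1 + b^2

for (theta_1, b), r in risks.items():
    print(f"theta_1={theta_1:+.1f}, b={b:+.1f}: risk = {r:.3f}")
```

The same constancy-in-θ argument is what makes the least favorable sequence work in the minimax discussion that follows.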
Now reconsider the one-dimensional estimation problem under the minimax criterion. Under the one-dimensional squared-error loss function L_1(θ_1, d_1) = (θ_1 − d_1)², the Bayes estimator corresponding to the prior θ_1 ~ N(µ, b²) is

    θ̂_1 = (X_1 + µ/b²)/(1 + 1/b²),    (28)

with posterior risk r_1 = 1/(1 + 1/b²). As b → ∞, r_1 → 1. Since R(θ_1, X_1) = E_{θ_1}(θ_1 − X_1)² = 1 for all θ_1, the usual least favorable sequence argument shows that X_1 is minimax for estimating θ_1. For the multivariate estimation problem, take the multivariate prior θ ~ N(µ, b²I). Then under the multivariate squared-error loss function L(θ, d) = Σ_i L_i(θ_i, d_i) = ‖θ − d‖², the Bayes estimator is θ̂ = (θ̂_1, θ̂_2, ..., θ̂_k), where

    θ̂_i = (X_i + µ_i/b²)/(1 + 1/b²).    (29)

The posterior risk becomes r = Σ_i r_i = k/(1 + 1/b²). As b → ∞, r → k. Since R(θ, X) = E_θ Σ_i (θ_i − X_i)² = k for all θ, the vector X is minimax for θ. The above arguments for equivariance and for minimaxity apply to many other simultaneous estimation situations. Under quite general assumptions, minimum risk equivariance and minimaxity both extend by components.

12 Conclusion: Paradox of Admissibility in Simultaneous Estimation

Unlike minimum risk equivariance and minimaxity, the admissibility of component estimators does not extend to the admissibility of the simultaneous estimator. Stein's paradox points out clearly that the admissibility of X_1 does not imply the admissibility of the vector X. The Galtonian picture and the Bayesian picture offer insight on why admissibility does not extend by components. In the Galtonian perspective, the independent samples X_i are coupled through the regression of {(X_i, θ_i)} in the X/θ coordinate system. As is typical in regression analysis, the prediction of any single point is influenced by its neighboring points. In the Bayes perspective, the independent experiments X_i | θ_i are coupled through the common prior for all the θ_i. In estimating the common variance of the θ_i, all the independent samples X_i are pooled together to provide a more accurate empirical Bayes estimate. Minimum risk equivariance and minimaxity are both strong optimality criteria.
Minimum risk equivariance with respect to a transitive group is a concrete property which defines a total ordering on the space of equivariant estimators. Minimaxity is also a concrete property which defines a total ordering on the space of estimators. So imposing minimum risk equivariance or minimaxity on component estimators is usually sufficient to guarantee that the same concrete property will hold over the product space. Admissibility is not a concrete optimality criterion. Since the risk functions of two estimators may cross, comparison of risk functions in their entirety does not define a total ordering on the space of estimators. Because admissibility is a weak criterion representing the absence of optimality, the product of admissible estimators does not guarantee admissibility. On the other hand, inadmissibility is a concrete optimality criterion. So in general, the product of inadmissible estimators will remain inadmissible.

13 References

1. Efron, B. and Morris, C. N. (1977). Stein's paradox in statistics. Scientific American. 236 (5) 119–127.
2. Lehmann, E. L. (1983). Theory of Point Estimation. Wiley, New York.
3. Stigler, S. M. (1990). The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators. Statistical Science. 5 (1) 147–156.