The Evil of Superefficiency

P. Stoica and B. Ottersten

To appear as a Fast Communication in Signal Processing

Report IR-S3-SB-9633
Royal Institute of Technology (KTH), Department of Signals, Sensors & Systems, Signal Processing, S-100 44 Stockholm, Sweden
The Evil of Superefficiency

P. Stoica
Systems and Control Group, Uppsala University, P.O. Box 27, S-751 03 Uppsala, Sweden

B. Ottersten [1]
Department of Signals, Sensors and Systems, Royal Institute of Technology, S-100 44 Stockholm, Sweden

We discuss the intriguing notion of statistical superefficiency in a straightforward manner, with a minimum of formality. We point out that for any given parameter estimator there exist other estimators which have a strictly lower asymptotic variance and hence are statistically more efficient than the former. In particular, if the former estimator is statistically efficient (in the sense that its asymptotic variance equals the Cramer-Rao bound), then the latter estimators could be called "superefficient". Among other things, the phenomenon of superefficiency implies that asymptotically there exists no uniformly minimum-variance parameter estimator.

Key words: Efficiency, Minimum-Variance, Cramer-Rao Bound

1 Introductory Remarks and Finite-Sample Superefficiency

The notion of statistical efficiency plays a central role in the theory of parameter estimation. Usually an estimator $\hat\theta_N$ of a (true and unknown) parameter vector $\theta$ is called statistically efficient if its mean square error (MSE) matrix attains the unbiased Cramer-Rao bound (CRB):

$$C_{\hat\theta_N} \stackrel{\text{def}}{=} E\{(\hat\theta_N - \theta)(\hat\theta_N - \theta)^T\} = C_{\text{CRB}}. \qquad (1)$$

[1] Corresponding author. Currently visiting at ArrayComm Inc., 3141 Zanker Rd, San Jose, CA 95134. Phone: (408) 952-1854. Email: otterste@s3.kth.se.
Hereafter $E\{\cdot\}$ stands for the expectation operator and $N$ denotes the number of data samples.

Remark: As a historical aside, we remark that what is now known as the CRB inequality was apparently discovered, for the single-parameter case, by Doob [1] and rediscovered in a neater manner by Frechet [2]. Darmois [3], Cramer [4] and Rao [5] presented generalizations of the CRB inequality to the multi-parameter case. In particular, the CRB derivation by Cramer in [4] is a masterpiece that is worth reading even nowadays.

It is well known that in finite samples (i.e. for $N < \infty$) parameter estimators which satisfy (1) exist only under very restrictive conditions (see, e.g., [6] and the references therein). At any rate, whenever those conditions are satisfied, the maximum likelihood estimator (MLE) satisfies (1). Nevertheless, even in such cases it is misleading to call the MLE "statistically efficient", since parameter estimators with lower MSE may well exist. The latter estimators, which must necessarily be biased, might be said to be "superefficient" with respect to the MLE.

A first example of an estimator statistically more efficient than the MLE was given by Stein (see [6] and the references therein). Let $\{y_t\}_{t=1}^N$ denote a sequence of independent and identically distributed Gaussian random vectors with mean $\mu$ (and identity covariance matrix). Also let

$$\hat\mu = \frac{1}{N}\sum_{t=1}^N y_t. \qquad (2)$$

It is well known that $\hat\mu$ above is the MLE of $\mu$. The covariance matrix of the estimation errors in $\hat\mu$ is readily shown to equal the CRB, and hence $\hat\mu$ can be said to be "statistically efficient" according to the usual terminology. Stein's estimator of $\mu$ is given by

$$\tilde\mu = \left(1 - \frac{n-2}{N\|\hat\mu\|^2}\right)\hat\mu \qquad \text{for } n = \dim(\mu) \ge 3, \qquad (3)$$

where $\|\cdot\|$ denotes the Euclidean vector norm. By a somewhat involved calculation (see, e.g., [6]) it is possible to show that $\tilde\mu$ has a lower MSE than $\hat\mu$, and hence than the CRB. More exactly,

$$E\|\hat\mu - \mu\|^2 - E\|\tilde\mu - \mu\|^2 = \frac{(n-2)^2}{N^2}\, E\|\hat\mu\|^{-2}. \qquad (4)$$

A simpler example of an estimator more efficient than the MLE was presented in [7].
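Stein's improvement can be checked by a short Monte Carlo simulation. The sketch below is illustrative only: it assumes, as Stein's construction does, unit noise covariance, and the dimension, sample size, true mean and trial count are arbitrary choices made here for the experiment.

```python
# Monte Carlo comparison of the MLE (2) and Stein's estimator (3).
# Assumes y_t ~ N(mu, I_n); n, N, mu and the trial count are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, N, trials = 5, 20, 20000          # n = dim(mu) >= 3 is required by (3)
mu = np.full(n, 0.5)                 # arbitrary true mean

se_mle = np.empty(trials)
se_stein = np.empty(trials)
for i in range(trials):
    y = rng.standard_normal((N, n)) + mu
    mu_hat = y.mean(axis=0)                           # MLE, eq. (2)
    shrink = 1.0 - (n - 2) / (N * (mu_hat @ mu_hat))  # shrinkage factor, eq. (3)
    mu_stein = shrink * mu_hat
    se_mle[i] = np.sum((mu_hat - mu) ** 2)
    se_stein[i] = np.sum((mu_stein - mu) ** 2)

mse_mle, mse_stein = se_mle.mean(), se_stein.mean()
print(f"MSE of MLE   : {mse_mle:.4f}  (trace of CRB = n/N = {n / N:.4f})")
print(f"MSE of Stein : {mse_stein:.4f}  (strictly smaller)")
```

On repeated runs the MLE's MSE hovers at the CRB level n/N, while Stein's estimator stays strictly below it, as the paired simulation makes easy to see.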
Consider the scenario above with $\mu = 0$ and $n = 1$, and let $\sigma^2$ denote the variance of $\{y_t\}$. It is well known that the MLE of $\sigma^2$ is given by

$$\hat\sigma^2 = \frac{1}{N}\sum_{t=1}^N y_t^2 \qquad (5)$$

and also that its MSE (or variance) equals $C_{\text{CRB}} = 2\sigma^4/N$. Let

$$\bar\sigma^2 = \frac{1}{N+2}\sum_{t=1}^N y_t^2. \qquad (6)$$

A straightforward calculation (see, e.g., [7]) shows that the MSE of (6) is $2\sigma^4/(N+2)$, which is always less than the MSE of the MLE and hence than $C_{\text{CRB}}$.

The conclusion of the previous discussion about the finite-sample case is that calling an estimator, such as the MLE, "statistically efficient" whenever it achieves $C_{\text{CRB}}$ is not fully justified, as more efficient estimators (that is, estimators with MSE less than $C_{\text{CRB}}$) may exist. In the finite-sample case, discussed so far, the existence of such "superefficient" estimators is easily understood and hence accepted: these estimators trade bias for variance in such a way that their MSE becomes lower than $C_{\text{CRB}}$.

In the asymptotic case, which is discussed in the next section, the existence of superefficient estimators is more intriguing and hence more difficult to accept. At an intuitive level, the existence of such estimators should be related to the discussion on biased estimation in the previous paragraphs. More exactly, even though asymptotically superefficient estimators that are asymptotically unbiased do exist (as we will see shortly), such estimators must be biased for $N < \infty$. Indeed, under regularity conditions the covariance matrix of an unbiased estimator must satisfy the CRB inequality for all values of $N$ (including $N \to \infty$, after proper normalization), and hence such an estimator cannot be "superefficient" (i.e. it cannot have an asymptotic covariance matrix less than the asymptotic CRB).

The phenomenon of statistical superefficiency has apparently been ignored in the engineering literature (with the notable exception of [8]). The discussion of asymptotic statistical superefficiency in this note is based on [6] and is meant to introduce the concept and its main consequences to the signal processing community.
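The finite-sample gain in (5)-(6) is easy to verify numerically. The following sketch simulates the zero-mean Gaussian setting of the example; the particular $N$, $\sigma^2$ and trial count are arbitrary choices for the experiment.

```python
# Monte Carlo check that dividing by N+2 instead of N in the sample
# variance lowers the MSE from 2*sigma^4/N to 2*sigma^4/(N+2).
# Assumes y_t ~ N(0, sigma^2) i.i.d.; N and sigma^2 are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, sigma2, trials = 10, 2.0, 200000

y = rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
s2_mle = (y ** 2).sum(axis=1) / N         # eq. (5), the MLE
s2_low = (y ** 2).sum(axis=1) / (N + 2)   # eq. (6), biased but lower MSE

mse_mle = np.mean((s2_mle - sigma2) ** 2)
mse_low = np.mean((s2_low - sigma2) ** 2)
print(f"MSE of (5): {mse_mle:.4f}  theory 2*sigma^4/N     = {2 * sigma2**2 / N:.4f}")
print(f"MSE of (6): {mse_low:.4f}  theory 2*sigma^4/(N+2) = {2 * sigma2**2 / (N + 2):.4f}")
```

Both empirical MSEs land on their theoretical values, confirming that the small bias introduced by the $N+2$ divisor is more than paid for by the variance reduction.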
2 Asymptotic Superefficiency

Let $\Theta$ denote the compact set of possible parameter values, and let $\theta \in \Theta$ be the true (and unknown) parameter vector. Furthermore, let $\hat\theta_N$ denote the MLE of $\theta$. Under weak regularity conditions the normalized estimation error $\sqrt{N}(\hat\theta_N - \theta)$ converges in distribution to a Gaussian random vector with zero mean and covariance matrix equal to the asymptotic normalized CRB, that is,

$$\sqrt{N}(\hat\theta_N - \theta) \xrightarrow{d} \mathcal{N}(0, \bar C_{\text{CRB}}) \qquad (7)$$

where $\bar C_{\text{CRB}} = \lim_{N\to\infty} N C_{\text{CRB}}$ (with $C_{\text{CRB}}$ as defined before). The statistical and signal processing literature contains many examples of cases where (7) holds true. [2]

Let $\tilde\theta_N$ denote any other estimator of $\theta$ which is such that, similarly to (7),

$$\sqrt{N}(\tilde\theta_N - \theta) \xrightarrow{d} \mathcal{N}(0, C_{\tilde\theta_N}). \qquad (8)$$

Since both $\hat\theta_N$ and $\tilde\theta_N$ above are asymptotically unbiased and consistent (we assume that the matrices $\bar C_{\text{CRB}}$ and $C_{\tilde\theta_N}$ are finite), one might think that the following inequality (which is reminiscent of the CRB inequality for unbiased estimators) should hold:

$$C_{\tilde\theta_N} \ge \bar C_{\text{CRB}} \qquad \text{for any } \theta \in \Theta. \qquad (9)$$

(Note that both sides in (9) usually depend on $\theta$, but to simplify the notation we have written $C_{\tilde\theta_N}$ in lieu of $C_{\tilde\theta_N}(\theta)$, etc.) Fisher himself conjectured that (9) should hold true, and hence that the MLE should asymptotically be the minimum-variance estimator in $\Theta$. However, (9) does not hold. In fact, for any given parameter estimator (the MLE included) there exist other estimators which have a strictly lower asymptotic variance at some points in $\Theta$, and which are thus statistically more efficient. In particular, if the former estimator is asymptotically statistically efficient (such as the MLE), in the sense that the covariance matrix of its asymptotic distribution attains $\bar C_{\text{CRB}}$, then the latter estimators can be called asymptotically statistically superefficient.

[2] The normalization may be different from that used in (7); for instance, in sinusoidal parameter estimation problems the normalizing factor may be $N^{3/2}$, but this variation is of minor importance for the discussion that follows.
We also stress that the following discussion does not depend on the assumption that the asymptotic distribution is Gaussian, as in (7).
The phenomenon of superefficiency was discovered by Hodges and presented as a counterexample to Fisher's conjecture (9) (see, e.g., [6] and the references therein). It implies that asymptotically there exists no uniformly minimum-variance parameter estimator. This strong negative result is sometimes said to be due to the "evil of superefficiency" [6] (which has inspired the title of this note).

To motivate the claims above we will make use of a generic example patterned after [6]. (We stress that what follows is just an example of how to obtain a superefficient estimator; no "general rules" for deriving such estimators appear to be available.) Let $\theta_g$ denote a fixed point in $\Theta$, and let $\hat\theta_N$ be any given estimator of $\theta \in \Theta$ satisfying (7). Consider the following estimator associated with $\hat\theta_N$:

$$\bar\theta_N = \begin{cases} \hat\theta_N & \text{if } \|\hat\theta_N - \theta_g\| > N^{-1/4} \\ \theta_g & \text{if } \|\hat\theta_N - \theta_g\| \le N^{-1/4} \end{cases} \qquad (10)$$

The asymptotic distribution of $\bar\theta_N$ is readily derived. Let $\theta \ne \theta_g$ (we assume that the difference $(\theta - \theta_g)$ does not depend on $N$, a rather reasonable condition). Then,

$$\text{prob}(\|\bar\theta_N - \hat\theta_N\| > 0) = \text{prob}(\|\hat\theta_N - \theta_g\| \le N^{-1/4}) = \text{prob}(\|N^{1/2}(\hat\theta_N - \theta) + N^{1/2}(\theta - \theta_g)\| \le N^{1/4}) \to 0 \text{ as } N \to \infty \qquad (11)$$

which implies that $(\bar\theta_N - \hat\theta_N)$ converges in probability to zero:

$$(\bar\theta_N - \hat\theta_N) \xrightarrow{p} 0 \quad \text{as } N \to \infty. \qquad (12)$$

It follows from (12) and some standard stochastic convergence results (see, e.g., Prop. 6.3.3 in [9]) that $\bar\theta_N$ and $\hat\theta_N$ have the same asymptotic distribution for $\theta \ne \theta_g$. In particular,

$$C_{\bar\theta_N} = C_{\hat\theta_N} \quad \text{at } \theta \ne \theta_g. \qquad (13)$$

Next, let $\theta = \theta_g$. Then,

$$\text{prob}(\|\bar\theta_N - \theta\| > 0) = \text{prob}(\|N^{1/2}(\hat\theta_N - \theta)\| > N^{1/4}) \to 0 \quad \text{as } N \to \infty \qquad (14)$$
and hence, in this case, $\bar\theta_N$ converges both in distribution and in probability to $\theta$ (see Prop. 6.3.5 in [9]). Consequently, the asymptotic variance of $\bar\theta_N$ is zero at $\theta = \theta_g$.

In conclusion, $\bar\theta_N$ has the same asymptotic variance as $\hat\theta_N$ at all points $\theta \ne \theta_g$ in $\Theta$, and a strictly smaller asymptotic variance at $\theta = \theta_g$. This concludes the proof that for any estimator $\hat\theta_N$ one can obtain an estimator $\bar\theta_N$ with lower asymptotic variance and the same type of asymptotic law.

It is interesting to note that the subsets of $\Theta$ on which one can asymptotically beat the CRB can be shown to have measure zero (a result attributed to LeCam; see [8] and the references therein). In view of this fact we may wonder whether superefficiency has any practical relevance. The theoretical importance of the concept definitely dominates its practical importance. However, this does not mean that statistical superefficiency has no potential practical relevance.

Consider, for instance, the problem of detecting a signal of unknown amplitude $\theta$ in noisy measurements. A usual way to solve this problem is to obtain an estimate of $\theta$ (say $\hat\theta_N$) as well as of the standard deviation of $\hat\theta_N$ (say $\hat\sigma_N$). One then decides that the signal is not present (i.e. $\theta = 0$) if $|\hat\theta_N| \le \kappa\hat\sigma_N$ (for some constant $\kappa$), and that it is present (i.e. $\theta \ne 0$) otherwise. In general $\hat\sigma_N$ is on the order of $N^{-1/2}$, and the aforementioned rule can be shown to be inconsistent. A simple idea to obtain a consistent detection rule is to make use of the "superefficient" estimate in (10) with $\theta_g = 0$. Based on $\bar\theta_N$ we simply decide that the signal is not present (is present) whenever $\bar\theta_N = 0$ ($|\bar\theta_N| > 0$). Under the hypothesis that $\theta = 0$ we have (cf. (14))

$$\text{prob}(|\bar\theta_N| > 0) \to 0 \quad \text{as } N \to \infty \qquad (15)$$

and under the assumption that $\theta \ne 0$ (and constant) we get (cf. (11))

$$\text{prob}(\bar\theta_N = 0) = \text{prob}(|\hat\theta_N| \le N^{-1/4}) \to 0 \quad \text{as } N \to \infty \qquad (16)$$

which proves the consistency of the detection rule based on $\bar\theta_N$.
Observe that, in terms of the original estimate $\hat\theta_N$, the detection rule based on $\bar\theta_N$ amounts to comparing $|\hat\theta_N|$ with a threshold on the order of $N^{-1/4}$ (in lieu of the order $N^{-1/2}$ mentioned above).
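To make the construction concrete, the sketch below simulates the estimator (10) with $\theta_g = 0$ and the associated detection rule for a scalar mean-estimation problem. The choice of the sample mean as base estimator, the unit-variance Gaussian noise model, and the tested amplitudes are illustrative assumptions, not part of the derivation above.

```python
# Hodges-type estimator (10) with theta_g = 0, used as a detection rule:
# decide "signal absent" iff the estimate is snapped to 0.
# Base estimator: sample mean of y_t = theta + unit-variance Gaussian noise.
import numpy as np

rng = np.random.default_rng(2)
trials = 2000

def hodges(y):
    """Eq. (10): keep the sample mean unless it falls within N^(-1/4) of 0."""
    N = len(y)
    theta_hat = y.mean()
    return theta_hat if abs(theta_hat) > N ** (-0.25) else 0.0

for N in (100, 10000):
    # Under theta = 0, prob(decide "absent") should tend to 1, cf. (15).
    p_absent = np.mean([hodges(rng.standard_normal(N)) == 0.0
                        for _ in range(trials)])
    # Under theta = 0.5, prob(decide "present") should tend to 1, cf. (16).
    p_present = np.mean([hodges(rng.standard_normal(N) + 0.5) != 0.0
                         for _ in range(trials)])
    print(f"N = {N:5d}: P(absent | theta=0) = {p_absent:.3f}, "
          f"P(present | theta=0.5) = {p_present:.3f}")
```

Both probabilities approach one as $N$ grows, whereas a fixed multiple-of-$\hat\sigma_N$ threshold would keep a constant false-alarm rate, in line with the inconsistency noted above.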
3 Concluding Remarks

Owing to the existence of superefficiency we should, in principle, avoid saying that an estimator which asymptotically achieves the (asymptotic) CRB is (uniformly) asymptotically efficient. However, a considerable statistical and signal processing literature makes use of such phrasing, and it may therefore be confusing and difficult to attempt changing the terminology. A solution to this dilemma is provided by [6], where it is shown that there exist parameter estimators which, under weak conditions, are asymptotically minimax optimal in the sense that they yield the lowest possible value of the following loss function:

$$\lim_{N\to\infty} \sup_{\theta\in\Theta} E\|\hat\theta_N - \theta\|^2. \qquad (17)$$

By an abuse of terminology such estimators can be called "asymptotically statistically efficient" [6]. Under regularity conditions the MLE is such an estimator. Hence we can continue saying that the MLE is "asymptotically statistically efficient", but at the same time we should be aware that such a statement is not true in Fisher's sense of possessing minimum asymptotic variance at each point in the parameter set.

References

[1] J.L. Doob, Statistical estimation, Trans. American Math. Soc. 39 (1936) 410-421.

[2] M. Frechet, Sur l'extension de certaines evaluations statistiques au cas de petits echantillons, Revue Inst. Int. de Stat. 11 (1943) 182-205.

[3] G. Darmois, Sur les limites de la dispersion de certaines estimations, Revue Inst. Int. de Stat. 13 (1945) 9-15.

[4] H. Cramer, A contribution to the theory of statistical estimation, Skand. Aktuarietidskrift 29 (1946) 85-94.

[5] C.R. Rao, Minimum variance and the estimation of several parameters, Proc. Cambridge Phil. Soc. 43 (1946) 280-283.

[6] I.A. Ibragimov and R.Z. Has'minskii, Statistical Estimation: Asymptotic Theory (Springer-Verlag, New York, 1981).

[7] P. Stoica and R. Moses, On biased estimators and the unbiased Cramer-Rao bound, Signal Processing 29 (1991) 344-350.
[8] G.R. Benitz, Asymptotic results for maximum likelihood estimation with an array of sensors, IEEE Trans. Information Theory 39 (1993) 1374-1385.

[9] P.J. Brockwell and R.A. Davis, Time Series: Theory and Methods, 2nd edition (Springer-Verlag, New York, 1991).