Testing Estimator's Credibility Part I: Tests for MSE


X. Rong Li and Zhanlue Zhao
Department of Electrical Engineering, University of New Orleans, New Orleans, LA 70148, USA
xli@uno.edu, zhanluezhao@gmail.com

Abstract - Most estimators and filters provide assessments of their own estimation error, often in the form of mean-square error. Are these self-assessments trustable? What is the degree to which they are trustable? This is Part I of a two-part series that provides answers to some of these questions, referred to as the credibility of the estimators. It formulates the concept of credibility, proposes tests for MSE-credibility, and discusses their superiority to the existing test. Numerical examples are provided to illustrate the utility and effectiveness of the proposed tests and the drawbacks of the existing test.

Keywords: Performance evaluation, estimation, credibility, statistical test.

(Research supported in part by ARO grant W9NF-…, NASA/LEQSF grant (2-4)-…, and Navy through Planning Systems Contract # N…C-382.)

1 Introduction

Algorithms for parameter, signal, and state estimation are widely used in science and engineering. No matter how solid such an estimation algorithm, or estimator for short, is in theory, its performance and characteristics must be evaluated in practice to serve a number of purposes, such as verification of its validity, demonstration of its performance, and comparison with other estimators. More specifically, estimators are almost always derived on the basis of more or less restrictive assumptions. These assumptions are often not transparent to practitioners, and it is not easy to verify their validity in many practical situations. For a practitioner, the validity of these assumptions per se is of little concern; what is important is whether the estimator works well for the application under consideration. This can be evaluated by stochastic simulation using a number of measures, such as those proposed or discussed in [4].

This paper deals with a closely related issue: the credibility of an estimator. Many estimators provide self-assessments of their estimation errors based on some simplifying assumptions. These self-assessments carry useful information about the estimation errors and the capability of the estimators, and it would be a shame to waste such information. Compared with the assumptions used to derive the estimator itself, however, these assumptions are usually even less transparent to practitioners and harder to verify. Even worse, these self-assessments are not always reliable, and they may even be misleading when the underlying assumptions are not accurate enough. Important questions for practitioners then include: Can we trust these self-assessments? If not, are the estimators too optimistic or pessimistic? By how much? In other words, an important issue in practice is how credible an estimator's self-assessment is and how to determine this credibility both qualitatively and quantitatively. We refer to this issue as the credibility issue of an estimator. Evidently, it amounts to evaluating the self-assessments. Since an estimator is data-driven and its self-assessment is data-dependent in general, the evaluation of its self-assessment should be done only in a statistical sense. In practice, such a task is almost always done by the Monte Carlo method via computer simulations. Albeit very important in practice, work on this issue has been scarce. Only limited treatments of this topic can be found in publications, e.g., [3, 4, 6, 5].
In our opinion, it has received far less attention than it deserves. As a result, it is rarely the case that a practitioner can answer the above important questions satisfactorily.

The purpose of this series is three-fold. First, it provides a formal definition of the credibility of an estimator to facilitate further studies of this topic. Second, it discusses how the credibility of an estimator can be tested properly, along with relevant theoretical results; that is, how to properly answer such questions as "Is the self-assessment of the estimator credible?" Finally, it intends to stimulate further studies of this important topic.

2 The Notion of Credibility

Clearly, the credibility issue has two related sides. On the qualitative side, it addresses whether an estimator's self-assessment is credible; the answer should be yes or no. Unfortunately, like many other decision problems, there is not a clear line that separates the two answers in many cases. An estimator accepted as credible by one user may be rejected by another user for the same case because, for example, the required levels of credibility may differ. Similarly, two noncredible estimators may have vastly different levels of noncredibility, which may lead to completely different actions. Therefore, it would be desirable if we could also quantify the amount by which an estimator is credible or noncredible. This would enable us to compare the credibility levels of estimators quantitatively. In this two-part series, however, we deal only with the qualitative side of the credibility issue, by statistical hypothesis testing. The quantitative side, what we call credibility measures, is the topic of a companion paper [15] as well as previous publications [12, 13].

Estimation performance is usually described in terms of the estimation error. For simplicity, throughout the paper we consider only the first two moments (mean and covariance) of the estimation error, since most estimators are not able to provide any other information about their performance. With the above considerations, we define the credibility of an estimator as follows.

Definitions. An estimator is said to be credible at a level α (0 ≤ α ≤ 1) if the difference between its actual bias and/or mean-square error (MSE) matrix and its self-assessment (i.e., calculated bias and/or MSE matrix) is statistically insignificant at level α, in the sense that the actual and the calculated values can be treated as statistically equal. The maximum level α at which an estimator is rejected as being noncredible is referred to as the noncredibility level of the estimator. An estimator is said to be optimistic (or pessimistic) in MSE at a level α if its computed MSE matrix is statistically smaller (or larger) than the actual MSE matrix at the α level.

We emphasize that the word "difference" above should not be interpreted literally. In fact, we mean both (or either) additive difference and/or multiplicative difference. For example, a small difference between A and B could mean that A/B is close to 1, as well as that A − B is small. In fact, a matrix version of the ratio is used in both credibility tests and measures.

The self-assessment of a noncredible estimator is not reliable/credible at the level considered. This is not to be confused with the reliability of the estimates per se; one type of reliability does not imply the other.

The above definition makes it explicit that, rigorously speaking, when we speak of the noncredibility of an estimator, the corresponding level should also be specified. However, it is not appropriate to define the (minimum) level at which the credibility of an estimator is not rejected as the credibility level of the estimator. In fact, we do not propose any definition of the credibility level, due to a number of difficulties explained later.

To our knowledge, the most notable prior publications that provide a considerable amount of treatment of the credibility issue are [3, 4], where it is referred to as (finite-sample) consistency. They address the issue and present a chi-square significance test for determining whether a filter should be accepted as credible, although without a formal definition of the (finite-sample) consistency. The term credibility is recommended here because consistency is an extremely well-established concept widely used in statistics and its application areas, which differs very much from the concept of credibility; furthermore, credible/credibility has been used in statistics as a technical term, such as the credible region and the degree of credibility, in reference to the commensurability of a hypothesis relative to data or evidence [24].

Terminology and notation. The quantity to be estimated, called the estimatee, can be a time-invariant (or slowly varying) parameter, a (deterministic or random) process or signal, or in particular the state of a (deterministic or random) system. We use the term estimator to mean both a parameter estimator and a filter (in particular, a state estimator). Let the n-dimensional estimatee, estimate, and estimation error be denoted by x, x̂, and x̃, respectively. Note that n is reserved throughout the paper for the dimension of the estimate.
We denote the actual MSE matrix of x̂ by P̄ and the estimator-provided MSE matrix by P. Note that P̄ and P are MSE matrices, not error covariance matrices. Each test always uses an independent and identically distributed sample of size N. All default vectors are column vectors. The determinant of A is denoted by |A|. The B-norm of a vector a is defined as ‖a‖_B = (a′Ba)^(1/2), where a′ stands for the transpose of the column vector a. By y ∼ N(µ, Σ) we mean that y is Gaussian distributed with mean µ and covariance matrix Σ; the corresponding density is denoted as N(y; µ, Σ). The chi-square distribution with n degrees of freedom is denoted as χ²_n. We use y ∼ (µ, Σ) to denote that y has mean µ and covariance Σ. Note that since µ* = E[x̃] and P̄ = MSE(x̂), the actual covariance matrix of the estimation error x̃ is Σ* = P̄ − µ*(µ*)′, and so we always have x̃ ∼ (µ*, P̄ − µ*(µ*)′).

3 NEES-Based Credibility Test

3.1 ANEES and Its Mean and Variance

We first describe the existing method for credibility evaluation. It is based on the so-called normalized estimation error squared (NEES), defined by

    ε_i = (x_i − x̂_i)′ P_i⁻¹ (x_i − x̂_i) = ‖x̃_i‖²_{P_i⁻¹}    (1)

where x_i, x̂_i, and P_i are the estimatee, the estimate, and the MSE matrix provided by the estimator on the ith run. The NEES can be interpreted as the squared distance between x and x̂ in terms of the P⁻¹-norm. It depends on the dimension n of x. To remove this dependence, we define the average normalized estimation error squared (ANEES) by

    ε̄ = (1/(nN)) ∑_{i=1}^N ε_i = (1/N) ∑_{i=1}^N ‖x̃_i‖²_{P_i⁻¹} / n

which we believe appeared first in [6]. In this sense, the ANEES is a one-dimensional-equivalent average distance between x and x̂ in terms of the P⁻¹-norm.

Assume x̃ ∼ N(0, P̄). It can be shown that if the estimator-computed MSE P is not random, then

    E[ε̄] = tr(P⁻¹P̄)/n,    var(ε̄) = 2 tr(P⁻¹P̄P⁻¹P̄)/(n²N)

These results hold true even if P is very different from the actual MSE P̄.

3.2 The Test

The above results show that if P = P̄ then E[ε̄] = 1 regardless of the dimension n, and var(ε̄) = 2/(nN), which decreases towards zero as the sample size N increases. It follows that, loosely speaking, the ANEES has a mean larger (or smaller) than 1 if the actual MSE matrix P̄ is larger (or smaller) than the estimator-provided MSE P. The closer to 1 the ANEES is, the closer to P̄ the estimator-provided P is. In view of this, a simple conclusion is that an estimator is optimistic (or pessimistic) if its ANEES is significantly larger (or smaller) than 1.
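To make these quantities concrete, the following Python sketch (an illustration, not the authors' code; the array shapes and the helper name anees are our own) computes the NEES and ANEES from a Monte Carlo sample of estimation errors and estimator-provided MSE matrices, and checks the mean formula above for a non-random P.

```python
import numpy as np

def anees(err, P):
    """ANEES from Monte Carlo data.

    err : (N, n) array of estimation errors x_i - xhat_i, one row per run.
    P   : (N, n, n) array of estimator-provided MSE matrices, one per run.
    Returns (anees_value, nees_per_run).
    """
    N, n = err.shape
    # NEES on run i: eps_i = err_i' P_i^{-1} err_i
    nees = np.array([e @ np.linalg.solve(Pi, e) for e, Pi in zip(err, P)])
    return nees.mean() / n, nees          # ANEES = (1/(nN)) * sum of eps_i

# Quick check of the mean formula when P is not random:
# with err ~ N(0, Pbar), E[ANEES] = tr(P^{-1} Pbar)/n.
rng = np.random.default_rng(0)
n, N = 2, 10_000
Pbar = np.array([[2.0, 0.3], [0.3, 1.0]])    # actual MSE (toy values)
P = np.array([[1.5, 0.0], [0.0, 1.2]])       # estimator-provided MSE (toy values)
err = rng.multivariate_normal(np.zeros(n), Pbar, size=N)
abar, _ = anees(err, np.broadcast_to(P, (N, n, n)))
print(abar, np.trace(np.linalg.solve(P, Pbar)) / n)   # the two values should be close
```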

However, how much larger (or smaller) is significant? This is usually answered by the chi-square significance test, described next.

Credibility testing based on the ANEES relies on the following fact: when x̃ ∼ N(0, P̄) and P ≠ P̄, the NEES and thus the ANEES are not chi-square distributed, because by the Ogasawara-Takahashi theorem (see, e.g., [22]) x̃′P⁻¹x̃ is chi-square if and only if P̄P⁻¹P̄P⁻¹P̄ = P̄P⁻¹P̄ (which is equivalent to P̄ = P if P̄ is nonsingular), in which case the number of degrees of freedom is rank(P⁻¹P̄). If the assumptions, models, and approximations based on which the MSE P is computed are valid, P should be close to the actual MSE matrix P̄. As such, assuming x̃ ∼ N(0, P̄), the NEES ε_i, as a quadratic form in x̃, should have approximately a standard χ²_n distribution. Then nNε̄ = ∑_{i=1}^N ε_i should be (approximately) the sum of N independent χ²_n random variables, that is, chi-square distributed with nN degrees of freedom (χ²_{nN}), which has mean nN and variance 2nN. As such, the MSE-credibility can be evaluated based on the chi-square significance test: reject P = P̄ if ε̄ is outside of the interval [a, b] such that

    P{ε̄ ∈ [a, b] | nNε̄ ∼ χ²_{nN}} = 1 − α

for a < 1 < b and some very small α. In addition, the estimator can be deemed optimistic (or pessimistic) if ε̄ > b (or ε̄ < a).

The rejection of P = P̄ is based on the following principle of small-probability events, sometimes referred to as Cournot's principle: the occurrence of an event of an extremely small probability on a single trial has a profound implication that the model (or assumptions) based on which the probability is calculated is incorrect and should be abandoned. It is this principle that underlies the acceptance of a theory as scientific. Specifically, a theory, no matter how exotic or bizarre, is confirmed to be correct and generally accepted if its unexpected predictions are verified by experiments or observations. Good examples include Einstein's general theory of relativity and the Big-Bang theory in cosmology, which are much more bizarre than most science fiction. What is particularly amazing is that these theories were accepted by most physicists in the same field right after only a few bold predictions were verified. Why was this the case? The answer lies in Cournot's principle: assuming these theories were not correct, a verification of any of their bold predictions would be highly improbable; given such verifications, the underlying assumption that they are incorrect must be abandoned.

According to this principle, P = P̄ should be rejected only if a very small-probability event occurred on a single trial, that is, when the ANEES is outside an interval of a very high probability (say, 99%). The higher the probability, the more confidently we can reject P = P̄. When the ANEES is outside the 99% probability interval, we can say P ≠ P̄ with at least 99% confidence, or the MSE-noncredibility level is at least 99%. The ANEES is used to reject P = P̄ in the sense that P is not statistically acceptable as the actual MSE P̄. This is usually a good indication that the assumptions and approximations used by the estimator to compute P are not sufficiently accurate. If the ANEES is much larger (or smaller) than 1, the actual estimation error is much larger (or smaller) than what the estimator believes; that is, the estimator is too optimistic (or pessimistic). To avoid unnecessary confusion, care should be exercised when using noncredibility levels.
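A minimal sketch of the significance test just described, assuming nNε̄ ∼ χ²_{nN} under the null hypothesis; the equal-tail split of α and the function name are our own choices.

```python
import numpy as np
from scipy.stats import chi2

def anees_test(anees_value, n, N, alpha=0.01):
    """ANEES-based chi-square significance test for MSE-credibility.

    Under H0 (P = Pbar), nN * ANEES ~ chi-square with nN degrees of freedom.
    Returns 'optimistic', 'pessimistic', or 'not rejected'.
    """
    dof = n * N
    a = chi2.ppf(alpha / 2, dof) / dof        # lower endpoint of [a, b]
    b = chi2.ppf(1 - alpha / 2, dof) / dof    # upper endpoint of [a, b]
    if anees_value > b:
        return "optimistic"       # actual errors larger than the estimator believes
    if anees_value < a:
        return "pessimistic"      # actual errors smaller than the estimator believes
    return "not rejected"         # no conclusion can be drawn (see Sec. 3.3)

# Example: for n = 2, N = 100, an ANEES of 1.35 lies outside the 99% interval.
print(anees_test(1.35, n=2, N=100, alpha=0.01))
```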
For instance, if the ANEES of an estimator is outside the 95% interval but inside the 99% interval, then the estimator may be deemed not credible with a ("confidence") level between 95% and 99%; that is, the noncredibility level is between 95% and 99%. However, such noncredibility levels should be used only when we have rejected the estimator as being noncredible and the level is very high (say, at or above 95%). In other words, we should use such noncredibility levels only for noncredible estimators, because a credible estimator may have a seemingly high noncredibility level, which can be confusing to most practitioners. When the ANEES is outside an interval of a not-so-high probability (say, 85%), rejecting the credibility of an estimator and treating it as having a noncredibility level of 85% lacks solid ground and is in fact quite questionable, because the underlying small-probability-event principle is not applicable here. In short, a noncredibility level is meaningful only if it is very high for the significance test.

On the other hand, in the case when the ANEES falls inside its 99% probability interval, it is not appropriate to think that the estimator has a 99% credibility level; otherwise all estimators would be credible at the 100% level, because every ANEES falls inside the 100% probability interval, which is [0, ∞)! Given the above definition of noncredibility levels, if we require that the credibility and noncredibility levels sum up to unity, then the credibility level of many credible estimators is very low (e.g., 5% or lower). This is highly undesirable. To define the credibility level as the minimum level at which an estimator's self-assessment is accepted (i.e., not rejected) as credible is seriously flawed. For instance, suppose that the ANEES is at an end point of its 95% probability interval. This definition states that the estimator has a 95% credibility level. However, from our definition of noncredibility level, it also has a 95% noncredibility level. Clearly, the estimator should be defined as having a 95% noncredibility level, because the ANEES is supposed to be in the interval with 95% probability. Otherwise the credibility level would be even higher (say, 99%) if the ANEES were further away from 1! These examples illustrate that it is not easy to define a simple, quantitative credibility level for an estimator properly.

The above difficulty stems from the inherent difficulty in interpreting the confidence level of a decision for hypothesis testing based on probability (or confidence) intervals, i.e., significance tests. No simple, general, and good solution is available within this framework. A better solution is to use credibility measures, the topic of [15, 13, 12], rather than test-based confidence levels.

The credibility test discussed here is essentially the same as that presented in [3] and discussed in [4, 5]. The difference is the introduction of the 1/n factor in the ANEES, which makes the mean of the ANEES invariant w.r.t. the dimension of x and thus more convenient to use. Also, we believe that the underlying principle is elucidated better here, which helps avoid pitfalls. Note that the tests based on the measurement residual proposed in [3] actually deal with a filter's optimality, not credibility.
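For concreteness, the noncredibility level discussed above can be computed from an observed ANEES as the largest confidence for which the ANEES still falls outside the corresponding probability interval. The helper below is our own sketch and uses the equal-tail convention.

```python
from scipy.stats import chi2

def noncredibility_level(anees_value, n, N):
    """Largest confidence level at which the ANEES falls outside the equal-tail
    chi-square interval, i.e., 1 - (two-sided p-value). Only meaningful when it
    is very high (say, 95% or more), as discussed above."""
    dof = n * N
    cdf = chi2.cdf(dof * anees_value, dof)
    return 1.0 - 2.0 * min(cdf, 1.0 - cdf)

print(noncredibility_level(1.35, n=2, N=100))   # about 0.999: rejection is well grounded
print(noncredibility_level(1.05, n=2, N=100))   # about 0.4: too low to justify rejection
```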

3.3 Drawbacks of NEES-Based Test

The above NEES-based chi-square test has some obvious and subtle drawbacks, as explained next.

First, and perhaps most importantly, as explained before, the above chi-square significance test is based on the small-probability-event principle and is justifiable only when the ANEES ε̄ is outside an interval of a very large probability, say, 0.95. If ε̄ is inside the interval, no conclusion can be drawn in principle, although a wide-spread misconception and malpractice is to conclude that P = P̄. Were this practice correct, we could always conclude that P = P̄ by using a large enough probability (arbitrarily close to 1), so that the interval [a, b] becomes large enough to include ε̄. Put differently, the chi-square significance test should only be used to reject the hypothesis P = P̄. If ε̄ is inside the interval, we do not even know whether P = P̄ is more likely than P ≠ P̄, let alone conclude that P = P̄. For instance, the data may even suggest that P = P̄ is false but not be strong enough to make a convincing case (beyond a reasonable doubt) that it is false. This subtle drawback is inherent in all significance tests for a single hypothesis in general, not just the above ANEES chi-square test in particular. Strangely enough, this drawback is not well known in the engineering community. This lack of knowledge and warning, along with the deceptively simple structure of the significance test, gives rise to the wide-spread misconception and malpractice mentioned above.

Second, as explained before, the use of this test can be justified only when the noncredibility level is very high (say, not lower than 95%).

Third, under the assumption x̃ ∼ (0, P̄), we have E[ε̄] = tr(P⁻¹P̄)/n = λ̄(P^(−1/2)P̄P^(−1/2)), where λ̄(A) is the arithmetic average of the eigenvalues of A. Consequently, given any estimator-computed MSE matrix P, the ANEES will be around 1 provided the average eigenvalue of P_y = P^(−1/2)P̄P^(−1/2) is approximately 1. Obviously P̄ can be far from P even if λ̄(P_y) = 1. On the other hand, the acceptance interval [a, b] of the above ANEES-based chi-square test must include 1. So there is no reason to expect that the above chi-square test can tell P̄ from P, even if they are very different, provided λ̄(P_y) ≈ 1, although the ANEES is not chi-square distributed in this case. This drawback has been verified repeatedly by simulation results, including those of Sec. 5. Also, from the Rayleigh-Ritz theorem, λ_min(A) ≤ a′Aa/(a′a) ≤ λ_max(A), it follows immediately that x̃′x̃/λ_max(P) ≤ x̃′P⁻¹x̃ ≤ x̃′x̃/λ_min(P). So ε̄ can be close to 1 whenever P is such that λ_max(P) ≥ E[x̃′x̃]/n ≥ λ_min(P), although such a P can be far from P̄. These observations all support the above statement that the ANEES-based chi-square test can only be used to reject, rather than verify, the credibility.

Fourth, the NEES, and thus the ANEES, do not account for the bias at all. As such, they evaluate only the MSE credibility; they cannot be used for bias-credibility or bias-MSE joint credibility evaluation of any sort.

In a nutshell, except for a rejection of the MSE credibility at a very high confidence, any other conclusion of the NEES-based significance test in fact lacks solid scientific ground and can be very misleading (except that we cannot conclude with 95% or higher confidence that the estimator is MSE-noncredible); the test should only be used to reject the claim of MSE-credibility very roughly; it is incapable of judging whether the estimator is credible in terms of MSE and/or bias.
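The third drawback can be illustrated numerically; the toy matrices below are our own and were chosen so that the average eigenvalue of P^(−1/2)P̄P^(−1/2) is exactly 1 even though P is clearly wrong.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, N = 2, 100
Pbar = np.diag([1.6, 0.4])      # actual MSE (toy values)
P = np.eye(n)                   # estimator-provided MSE, clearly different from Pbar
# Average eigenvalue of P^{-1/2} Pbar P^{-1/2} is (1.6 + 0.4)/2 = 1,
# so E[ANEES] = 1 even though P is far from Pbar.
err = rng.multivariate_normal(np.zeros(n), Pbar, size=N)
nees = np.einsum('ij,jk,ik->i', err, np.linalg.inv(P), err)
anees = nees.mean() / n
a, b = chi2.ppf([0.025, 0.975], n * N) / (n * N)
print(anees, (a, b))            # the ANEES typically lands inside [a, b]: no rejection
```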
While the NEES-based chi-square significance test is seriously flawed, its introduction in [3] was an important, pioneering contribution. Fortunately, much better tests are available that can be used to judge the credibility of an estimator in terms of MSE and/or bias. Tests for the MSE are presented in the next section, and those for the bias alone and for the bias and MSE jointly are presented in Part II [16].

4 New Tests for MSE-Credibility

4.1 MSE-Credibility as Covariance Testing

In this case, although the actual bias µ* is unknown, we are only concerned with the MSE matrix. This is a common situation in practice, where only the MSE matrix P is provided by the estimator since it assumes zero bias. So P is actually also the estimator-computed error covariance matrix C. In this case, the problem of testing the MSE matrix

    H₀: P̄ = P  vs.  H₁: P̄ ≠ P

is equivalent to testing the error covariance

    H₀: Σ* = P  vs.  H₁: Σ* ≠ P

Assume that x̃ ∼ (µ*, Σ*) but µ* and Σ* are unknown. Let y = P^(−1/2) x̃. Then

    µ = E[y] = P^(−1/2) µ*  and  Σ = cov(y) = P^(−1/2) Σ* P^(−1/2)

Clearly, Σ* = P if and only if Σ = I. As such, the above problem is equivalent to testing

    H₀: Σ = I  vs.  H₁: Σ ≠ I

We refer to this problem as Problem P_Σ. It is a standard hypothesis testing problem in multivariate statistical analysis, where generally accepted, highly effective solutions are available, and not surprisingly the chi-square significance test is not one of them. The credibility tests we propose below amount to adapting these solutions to the credibility problem.

As with the ANEES-based chi-square test, we assume that the data set {x̃₁, x̃₂, ..., x̃_N} is available for credibility evaluation. This is a fundamental assumption that requires precise knowledge of the estimatee x, which is hardly available except in computer simulations. Further, for notational brevity, let

    ȳ = (1/N) ∑_{i=1}^N y_i,   V = ∑_{i=1}^N (y_i − ȳ)(y_i − ȳ)′,   S = V/(N − 1)    (2)

Note that ȳ is the sample mean and V is a scaled version of the sample covariance S.
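A minimal sketch of this reduction, assuming the error sample {x̃_i} and a single estimator-provided P. It uses a Cholesky factor P = LL′ in place of the symmetric square root; this is an implementation choice, and the test statistics below depend on the data only through the eigenvalues of V, which are the same for either choice. The function name is ours.

```python
import numpy as np

def transformed_stats(err, P):
    """Form y_i = L^{-1} err_i with P = L L', then ybar, V, S of Eq. (2).

    err : (N, n) array of estimation errors; P : (n, n) estimator-provided MSE.
    Under H0 (actual error covariance equal to P), cov(y_i) = I.
    """
    N = err.shape[0]
    L = np.linalg.cholesky(P)                 # lower-triangular factor, P = L L'
    y = np.linalg.solve(L, err.T).T           # y_i = L^{-1} x_i
    ybar = y.mean(axis=0)                     # sample mean
    centered = y - ybar
    V = centered.T @ centered                 # V = sum_i (y_i - ybar)(y_i - ybar)'
    S = V / (N - 1)                           # sample covariance
    return ybar, V, S
```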

4.2 Generalized Likelihood Ratio Test

Assume y ∼ N(µ, Σ) with unknown µ and Σ. As shown in [11], the generalized likelihood ratio test (GLRT) for Problem P_Σ is: reject H₁ if Λ > t, or reject H₀ if Λ < t (note that P{Λ = t} = 0). Here the generalized likelihood ratio Λ is given by

    Λ = (e/N)^(nN/2) |V|^(N/2) exp[−tr(V)/2]

where V was defined in (2), e is the base of the natural logarithm ln, and tr(V) and |V| are the trace and determinant of V. This test is biased. So, we consider the following modified generalized likelihood ratio test [11]: reject H₀ if φ ≥ c, or reject H₁ if φ < c, where

    φ = −2 ln Λ₁ = tr(V) + nL(ln L − 1) − L ln|V|

L = N − 1 is the number of degrees of freedom of V, and Λ₁ = Λ|_{N=L}. This modified GLRT is unbiased, meaning that the probability of correctly rejecting H₀ (i.e., the detection probability) is never smaller than that of incorrectly rejecting H₀ (i.e., the false alarm probability). While this looks like an easy requirement, some good tests for the multivariate case do not satisfy it.

The distribution of φ under H₀ is needed to determine the threshold c. Under H₀, the statistic φ has a complex exact distribution when the sample size N is small [19], but asymptotically (as N becomes large) it tends to a χ²_{n(n+1)/2} distribution. More precisely, the statistic ρφ has approximately a χ²_{n(n+1)/2} distribution for moderately large N, where ρ = 1 − (2n² + 3n − 1)/(6(N − 1)(n + 1)) [10] (see also, e.g., [1, 18, 2]). Tables of exact 5% and 1% significance points of φ were presented in [19] and included in [11]. The corresponding tables of approximate points can be found in [1, 18]. It is sensible that the power of these GLRTs increases monotonically as each eigenvalue λ_i(Σ) moves away from 1 [7], [23].

4.3 Union-Intersection Test

As shown in [2], [23], [7], by the union-intersection principle for Problem P_Σ we have

    H₀: Σ = I  ⟺  ∩_a H₀a, where H₀a: a′Σa = a′a
    H₁: Σ ≠ I  ⟺  ∪_a H₁a, where H₁a: a′Σa ≠ a′a

where a is any nonzero vector; that is, H₀ is true if and only if all H₀a are true, and H₁ is true if and only if any of the H₁a is true. Recall that y_i = P^(−1/2)x̃_i and V was defined in (2). Given a, the test for H₀a vs. H₁a is: reject H₀a if a′Va/a′a ∉ [c₁, c₂], where the thresholds c₁ and c₂ can be determined from the fact that a′Va/a′a ∼ χ²_{N−1} under H₀a [23]. Hence, the test for H₀ vs. H₁ is: reject H₀ if max_a a′Va/a′a ≥ c₂ or min_a a′Va/a′a ≤ c₁. It is well known that

    λ_min(V) ≤ a′Va/(a′a) ≤ λ_max(V)

where the left and right inequalities become equalities if and only if a is the eigenvector associated with the smallest eigenvalue λ_min and the largest eigenvalue λ_max of V, respectively. Finally, the test for Problem P_Σ is: reject H₀ if λ_max(V) ≥ c₂ or λ_min(V) ≤ c₁. Tables of the distribution of c₁ and c₂ under H₀ are given by Table 5 of [21], where it appears that the interpolation formulas for two of the significance points have the signs of their βd and γd correction terms reversed.

The UIT does not rely on the small-probability-event principle although its component tests are chi-square tests: since Σ = I if and only if λ_max(Σ) = λ_min(Σ) = 1, the fact that [λ_min(V), λ_max(V)] belongs to a small interval [c₁, c₂] is compelling evidence for Σ = I. This is not the case for ε̄ ∈ [a, b] in the ANEES-based test. In general, the UIT is consistent² if the univariate component tests are so, unbiased under certain conditions, and admissible if the univariate component tests are admissible³ [20] (see also [7]).
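The two tests can be sketched as follows, under the assumptions above (Gaussian y and the large-N chi-square approximation for the GLRT threshold). The UIT thresholds c₁ and c₂ are left as inputs because they come from tables of the extreme-eigenvalue distribution; the function names are ours.

```python
import numpy as np
from scipy.stats import chi2

def modified_glrt(V, N, alpha=0.05):
    """Modified GLRT for H0: Sigma = I (Problem P_Sigma).

    Returns (phi, reject). phi = tr(V) + nL(ln L - 1) - L ln|V| with L = N - 1;
    rho*phi is compared with a chi-square(n(n+1)/2) quantile (large-N approximation).
    """
    n = V.shape[0]
    L = N - 1
    _, logdetV = np.linalg.slogdet(V)
    phi = np.trace(V) + n * L * (np.log(L) - 1) - L * logdetV
    rho = 1 - (2 * n**2 + 3 * n - 1) / (6 * (N - 1) * (n + 1))
    threshold = chi2.ppf(1 - alpha, n * (n + 1) / 2)
    return phi, rho * phi >= threshold

def uit(V, c1, c2):
    """Union-intersection test: reject H0 if lambda_max(V) >= c2 or lambda_min(V) <= c1.
    The thresholds c1, c2 come from tables of the extreme-eigenvalue distribution."""
    lam = np.linalg.eigvalsh(V)               # ascending eigenvalues of V
    return lam[-1] >= c2 or lam[0] <= c1
```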
4.4 Discussions

Invariance. Problem P_Σ is clearly invariant under orthogonal and additive transformations u = Ay + b, where A is an orthogonal matrix. As a result, our tests should also be invariant under these transformations. It can be shown (see, e.g., [2]) that a test for Problem P_Σ is invariant under these transformations if and only if the test statistic depends on the data only through the eigenvalues of V. Unfortunately, there is no uniformly most powerful invariant test for this problem [2].

Eigenanalysis. For the null hypothesis Σ = I, it is sensible that the test can be done by checking whether the eigenvalues of the sample covariance V/N, as an estimate of Σ, are all within a small interval centered at 1. It turns out that the GLRT, the UIT, and the ANEES-based test correspond to three different ways of checking this. From the identities tr(V) = n λ̄(V) and |V| = [λ̃(V)]ⁿ, where λ̄(V) and λ̃(V) are the arithmetic and geometric averages of the eigenvalues of V, respectively, it follows that the test statistic for the modified GLRT, φ = n[λ̄(V) + L(ln L − 1) − L ln λ̃(V)], depends on the data only through λ̄(V) and λ̃(V). The UIT also depends on the data only through the maximum and minimum eigenvalues of V. Therefore, both tests are invariant under orthogonal and additive transformations. From this angle, it appears that the UIT is the most natural, simple, and intuitively appealing.

Note that if the estimator-computed MSE is not data-dependent, then it can be easily shown that ε̄ = λ̄(P^(−1/2)P̂_x̃P^(−1/2)) and V = N C^(−1/2)Ĉ_x̃C^(−1/2), where C = P is the estimator-computed error covariance (since the estimator assumes unbiased estimates, i.e., µ = 0), Ĉ_x̃ = (1/N) ∑_{i=1}^N (x̃_i − x̃̄)(x̃_i − x̃̄)′ is its sample version with x̃̄ = (1/N) ∑_{i=1}^N x̃_i, and P̂_x̃ = (1/N) ∑_{i=1}^N x̃_i x̃_i′.

Consider now the test of [9]: reject H₀ if tr(V) ≥ t₂ or tr(V) ≤ t₁. In other words, replace the criterion [λ_min(V), λ_max(V)] ⊆ [c₁, c₂] of the UIT by λ̄(V) ∈ [c₁, c₂]. Clearly, this test is essentially equivalent to the ANEES-based chi-square test if the estimates are indeed unbiased, in which case C = P, Ĉ_x̃ ≈ P̂_x̃ and thus ε̄ = λ̄(P^(−1/2)P̂_x̃P^(−1/2)) ≈ λ̄(V/N); otherwise it is superior. It is, however, inferior to the UIT in terms of thoroughness: if λ̄(V/N) ≈ 1 and λ_max(V) ≫ λ_min(V), its decisions are most likely to be incorrect, while the UIT is immune from such mistakes. In other words, the UIT does not suffer from the third drawback of the ANEES-based test explained in Sec. 3.3.

² A test for H₀ against any fixed alternative H₁ is consistent if its power converges to 1 as the sample size increases.
³ A test is admissible if there is no other test that is never worse and sometimes better than it.
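A small sketch of this eigenvalue view: the three tests reduce to different summaries of the eigenvalues of V/N (the function name is ours).

```python
import numpy as np

def eigen_summaries(V, N):
    """Eigenvalue summaries of V/N used by the three tests (cf. Sec. 4.4):
    ANEES/trace test: arithmetic average; GLRT: arithmetic and geometric averages;
    UIT: extreme eigenvalues."""
    lam = np.linalg.eigvalsh(V / N)
    return {
        "arith_avg": lam.mean(),                 # ~ ANEES when the estimates are unbiased
        "geom_avg": np.exp(np.log(lam).mean()),  # enters the GLRT via ln|V|
        "min": lam[0],
        "max": lam[-1],                          # both extremes enter the UIT
    }
```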

6 thoroughness if λ(v/n) and λ max (V ) λ min (V ), its decisions are most likely to be incorrect, while the UIT is immune from such mistakes. In other words, the UIT does not suffer from the third drawback of the ANEES-based test explained in Sec GLRT vs. UIT. An important advantage of the GLRT is that 2 lnλ (Λ is the likelihood ratio) is chi-square distributed asymptotically under mild regularity conditions and thus it is relatively easy to determined the test threshold. No such general results are available for the UIT. As a consequence, the UIT is generally harder to develop and apply than the GLRT, but it carries more information than the GLRT since it can pinpoint the univariate components that reject H. The modified GLRT is unbiased. In fact, it is not only unbiased but also uniformly most powerful of all unbiased tests [23] for Problem P Σ. This is a very strong justification for the GLRT. Whether the above UIT is unbiased or not depends on the choice of c and c 2 [23]. Clearly the GLRT statistics rely more critically on the assumption of the data than the UIT statistics, which are valid provided the component statistics are valid. As a result, the UIT statistics are in general less sensitive to the assumption of the data than the GLRT statistics [7]. In addition, the UIT is less sensitive than the GLRT to the actual distribution of the alternative hypothesis. In view of this, it appears that the UIT is preferable to the GLRT when y is far from Gaussian distributed or far from the null hypothesis; the GLRT is probably the better choice when y is (approximately) Gaussian distributed. 5 Illustrative Examples In this section, we provide some simple examples to demonstrate the tests proposed as well as a comparison with the one based on the ANEES. All results are based on Monte Carlo runs. Unless otherwise stated, the vertical axis is always the probability of rejecting the null hypothesis (P = P ) P { H } = P { H µ, Σ} = P { H µ, Σ}, where µ, Σ are the true mean and covariance of the (transformed) estimation error. Example. It follows from the relationship y = ỹ µ that all cases with some µ and Σ are equivalent to the case with µ = and Σ = Σ. 4 So, for performance evaluation we only consider µ =. Specifically, we consider two cases of the truth y N(, Σ) with (a) Σ = σ 2 I and (b) Σ = σ [ 2 3], where σ 2 2. Note that the null hypothesis P = P is true when and only when σ 2 = in case (a); so a large P { H } is desirable in case (a) when σ 2 is quite different from or in case (b). For this problem x P x = y y since y = P /2 x. Thus the ANEESbased significance test checks whether y y is outside the chi-square probability interval. Fig. (a) shows the probabilities of rejecting H vs. σ 2 for case (a) using our proposed GLRT and UIT as well as the existing ANEES-based test, each using data of size in each run with two different test thresholds corresponding to type I error probabilities of α =. and.5, respectively. Fig. (b) shows the 4 Of course, they are different if P is actually the MSE matrix rather than error covariance α GLRT =5% α GLRT =% α UIT =5% α UIT =% α χ 2=5% α χ 2=% σ 2 (a) Σ = σ 2 I α GLRT =5% α GLRT =% α UIT =5% α UIT =% α χ 2=5% α χ 2=% σ 2 (b) Σ = σ 2[ 3 Figure : Probability of rejecting Σ = I vs. parameter σ 2. rejection probability for case (b), where our GLRT and UIT always have (correctly) a unity rejection probability while the ANEES-based test does not. 
As can be seen, while the differences between our tests and the existing one are minor in case (a), in case (b), no matter what value σ² takes, the hypothesis P = P̄ is never true and our GLRT and UIT always reject it correctly, but the ANEES-based test could not reject it for σ² close to 0.5. This is because the arithmetic average of the eigenvalues of Σ = 0.5 Σ₀ is 1 (since tr(Σ₀) = 4). This demonstrates the third drawback of the ANEES-based test explained in Sec. 3.3. Such failures of the ANEES-based test occur often in our simulations.

Example 2. The above example uses Gaussian distributed data, as assumed in the tests. Here we consider testing Problem P_Σ (H₀: Σ = I against H₁: Σ ≠ I) with non-Gaussian data y having a Gaussian mixture distribution, given by

    f(y) = p N(y; µ₁, Σ₁) + (1 − p) N(y; µ₂, Σ₂),   0 ≤ p ≤ 1

It can be easily shown that

    µ = E[y] = pµ₁ + (1 − p)µ₂
    Σ = cov(y) = p[Σ₁ + (µ₁ − µ)(µ₁ − µ)′] + (1 − p)[Σ₂ + (µ₂ − µ)(µ₂ − µ)′]

For simplicity, we consider two cases: (a) µ₁ = [1, 1]′, µ₂ = 0, and Σ₁ = Σ₂ = σ²I, and (b) µ₁ = µ₂ = [1, 1]′ and Σ₁ = Σ₂ = σ²Σ₀. Figs. 2(a) and 2(b) show the probabilities of rejecting H₀ vs. p and σ² using our proposed GLRT and the existing ANEES-based test, respectively, each using a data set of size N in each run, with the thresholds corresponding to α = 0.05.
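The mixture moment formulas above can be checked numerically. The particular values of p, the means, and the component covariance below are illustrative only, chosen so that the arithmetic average of the eigenvalues of Σ is exactly 1 even though Σ ≠ I, which is the situation where the ANEES-based test loses power.

```python
import numpy as np

def mixture_moments(p, mu1, mu2, S1, S2):
    """Mean and covariance of f(y) = p N(y; mu1, S1) + (1 - p) N(y; mu2, S2)."""
    mu = p * mu1 + (1 - p) * mu2
    Sigma = (p * (S1 + np.outer(mu1 - mu, mu1 - mu))
             + (1 - p) * (S2 + np.outer(mu2 - mu, mu2 - mu)))
    return mu, Sigma

# Illustrative non-Gaussian case: Sigma != I, yet its average eigenvalue is 1,
# so the ANEES-based test sits in its "valley" and cannot reject.
mu, Sigma = mixture_moments(0.3, np.array([1.0, 1.0]), np.zeros(2),
                            0.79 * np.eye(2), 0.79 * np.eye(2))
print(Sigma, np.linalg.eigvalsh(Sigma).mean())   # average eigenvalue equals 1.0
```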

[Figure 2: Probability of rejecting Σ = I vs. parameters p and σ² in case (a): (a) GLRT, (b) χ² test.]

Note that the distribution becomes more and more non-Gaussian as p increases from 0 to 0.5. The null hypothesis Σ = I (i.e., P = P̄) is true in case (a) when and only when (p, σ²) = (0, 1); so a large rejection probability is desirable in all other situations. It is never true in case (b). It turned out that all three tests always correctly reject the null hypothesis in case (b) (results not shown); for case (a), the rejection-probability surface of our GLRT is flat with a height of 1 except for the cone around (p, σ²) = (0, 1), meaning that our proposed test always rejects the null hypothesis correctly except when it should not be rejected [i.e., when (p, σ²) is around (0, 1) in case (a)]. This indicates that our GLRT is not sensitive to the Gaussianity assumption. The results of the UIT are similar to those of the GLRT and are not shown. However, the rejection-probability surface of the ANEES-based test has an undesirable valley, indicating that it cannot reject some clearly noncredible cases when the Gaussianity assumption is invalid. It has been verified that the arithmetic averages of the eigenvalues of Σ corresponding to the bottom of this valley are indeed around 1. This once again demonstrates the third drawback of the ANEES-based test explained in Sec. 3.3.

Example 3. Consider now a simple filtering example: tracking a target in nearly constant-velocity one-dimensional motion using position-only measurements under the linear-Gaussian assumptions of the Kalman filter. The system is given by

    x_{k+1} = [1 T; 0 1] x_k + [T²/2; T] w_k,   x_k = [x_k, ẋ_k]′
    z_k = [1, 0] x_k + v_k

where w_k and v_k are zero-mean white noises with cov(w_k) = σ² and a fixed cov(v_k), T = 1, and the true value of σ is always 3. Consider three cases where the Kalman filter assumes σ = 2, 3, 4, respectively. The filter with σ = 3 is matched exactly, but with σ = 2 or 4 it is mismatched: it is optimistic for σ = 2 and pessimistic for σ = 4. The filter was initialized by a weighted least-squares fit to the first two measurements, known as two-point differencing in the target tracking community.

Fig. 3 shows the average GLRT statistic over the Monte Carlo runs, along with the threshold (the horizontal line) corresponding to a 5% type I error (false alarm) probability.

[Figure 3: Generalized likelihood ratio test statistic for MSE-credibility over time, for σ = 2, 3, 4.]

We should not pay much attention to the first several time steps because of the transient due to the non-ideal initialization. It can be seen that the matched filter's MSE matrix is deemed credible most of the time (only a few time points are slightly above the threshold, consistent with the 5% type I error probability), while the mismatched filter with σ = 2 is almost always rejected and the one with σ = 4 is rejected except over a small portion of time. The difference between the cases with σ = 2 and with σ = 4 makes good sense: it is better to be more conservative in choosing the process noise covariance (see, e.g., [8]). Also, the ratio of σ = 3 to σ = 2 is larger than that of σ = 4 to σ = 3 (the ratios that matter most are those between the MSE matrices, not the process noise covariances). Similar results hold for the chi-square significance test based on the ANEES, except that the matched filter is rejected at only a handful of time steps, and for the rest the test can neither reject nor accept the credibility of the filter-computed MSE matrix.
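The following sketch sets up the nearly-constant-velocity model of Example 3. The measurement-noise standard deviation r, the diffuse initialization (used here in place of two-point differencing), and any run counts are our own placeholders, since those details are not fully specified above.

```python
import numpy as np

# Nearly-constant-velocity model of Example 3 (r and the initialization are placeholders).
T, sigma_true, r = 1.0, 3.0, 1.0
F = np.array([[1.0, T], [0.0, 1.0]])
G = np.array([[T**2 / 2], [T]])
H = np.array([[1.0, 0.0]])

def kf_run(steps, sigma_filter, rng):
    """One Monte Carlo run: returns the per-step errors x_k - xhat_k and the
    filter-computed covariances P_k, which feed the credibility tests."""
    x = np.zeros(2)                     # true state
    xhat = np.zeros(2)                  # filter estimate
    P = 100.0 * np.eye(2)               # diffuse start instead of two-point differencing
    errs, covs = [], []
    for _ in range(steps):
        x = F @ x + G[:, 0] * rng.normal(0.0, sigma_true)      # truth uses sigma = 3
        z = (H @ x)[0] + rng.normal(0.0, r)
        xhat = F @ xhat                                        # predict
        P = F @ P @ F.T + (G @ G.T) * sigma_filter**2
        S = (H @ P @ H.T)[0, 0] + r**2                         # innovation variance
        K = (P @ H.T)[:, 0] / S                                # Kalman gain, shape (2,)
        xhat = xhat + K * (z - (H @ xhat)[0])                  # update
        P = P - np.outer(K, H @ P)
        errs.append(x - xhat)
        covs.append(P.copy())
    return np.array(errs), np.array(covs)

# Errors collected over many runs at a fixed time step k, together with covs[k],
# can be fed to the GLRT/UIT sketches above to reproduce a Fig. 3-style check.
```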

6 Summary

The problem of the MSE-credibility of an estimator, that is, whether and how much an estimator-computed MSE matrix can be trusted, has been formulated. A number of tests for MSE-credibility have been proposed. An in-depth discussion of the existing NEES-based significance test for MSE-credibility has been presented, which elucidates the underlying principle and reveals some fundamental drawbacks of the existing test. Not only are the newly developed tests immune from these drawbacks, but they also enjoy many desirable properties, including some optimality properties. Simple, illustrative examples have been given, which demonstrate the performance of the proposed tests as well as their superiority to the existing one. Some of the drawbacks of the NEES-based test have been demonstrated clearly via simulation results. These results question the wisdom of continuing to use the existing solution.

References

[1] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, 2nd edition, 1984.
[2] S. F. Arnold. The Theory of Linear Models and Multivariate Analysis. Wiley, New York, 1981.
[3] Y. Bar-Shalom and K. Birmiwal. Consistency and Robustness of PDAF for Target Tracking in Cluttered Environments. Automatica, 19:431-437, July 1983.
[4] Y. Bar-Shalom and X. R. Li. Estimation and Tracking: Principles, Techniques, and Software. Artech House, Boston, MA, 1993. (Reprinted by YBS Publishing, 1998.)
[5] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan. Estimation with Applications to Tracking and Navigation: Theory, Algorithms, and Software. Wiley, New York, 2001.
[6] O. E. Drummond, X. R. Li, and C. He. Comparison of Various Static Multiple-Model Estimation Algorithms. In Proc. 1998 SPIE Conf. on Signal and Data Processing of Small Targets, vol. 3373, Apr. 1998.
[7] N. C. Giri. Multivariate Statistical Inference. Academic Press, New York, 1977.
[8] A. H. Jazwinski. Stochastic Processes and Filtering Theory. Academic Press, New York, 1970.
[9] J. Kiefer and R. Schwartz. Admissible Bayes Character of T²- and R²- and Other Fully Invariant Tests for Classical Normal Problems. Ann. Math. Statist., 36, 1965.
[10] B. P. Korin. On the Distribution of a Statistic Used for Testing a Covariance Matrix. Biometrika, 55:171-178, 1968.
[11] P. K. Krishnaiah and J. K. Lee. Likelihood Ratio Tests for Mean Vectors and Covariance Matrices. In P. K. Krishnaiah, editor, Handbook of Statistics, volume 1. North-Holland, 1980.
[12] X. R. Li, Z. Zhao, and V. P. Jilkov. Practical Measures and Test for Credibility of an Estimator. In Proc. Workshop on Estimation, Tracking, and Fusion: A Tribute to Yaakov Bar-Shalom, Monterey, CA, May 2001.
[13] X. R. Li, Z. Zhao, and V. P. Jilkov. Estimator's Credibility and Its Measures. In Proc. IFAC 15th World Congress, Barcelona, Spain, July 2002.
[14] X. R. Li and Z.-L. Zhao. Evaluation of Estimation Algorithms Part I: Incomprehensive Performance Measures. IEEE Trans. Aerospace and Electronic Systems, AES-42(3), July 2006.
[15] X. R. Li and Z.-L. Zhao. Measuring Estimator's Credibility: Noncredibility Index. In Proc. 2006 International Conf. on Information Fusion, Florence, Italy, July 2006.
[16] X. R. Li and Z.-L. Zhao. Testing Estimator's Credibility Part II: Other Tests. In Proc. 2006 International Conf. on Information Fusion, Florence, Italy, July 2006.
[17] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press, London, 1979.
[18] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley, New York, 1982.
[19] B. N. Nagarsenker and K. C. S. Pillai. Distribution of the Likelihood Ratio Criterion for Testing a Hypothesis Specifying a Covariance Matrix. Biometrika, 60, 1973.
[20] H. K. Nandi. On Some Properties of Roy's Union-Intersection Tests. Calcutta Statist. Assoc. Bull., 14:9-13, 1965.
[21] E. S. Pearson and H. O. Hartley. Biometrika Tables for Statisticians, volume 2. Cambridge University Press for the Biometrika Trustees, Cambridge, England, 1972.
[22] C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 2nd edition, 1973.
[23] M. S. Srivastava and C. G. Khatri. An Introduction to Multivariate Statistics. North-Holland, New York, 1979.
[24] A. Stuart, K. Ord, and S. Arnold. Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference and the Linear Model. Arnold, London, 1999.


More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Fisher Information Matrix-based Nonlinear System Conversion for State Estimation

Fisher Information Matrix-based Nonlinear System Conversion for State Estimation Fisher Information Matrix-based Nonlinear System Conversion for State Estimation Ming Lei Christophe Baehr and Pierre Del Moral Abstract In practical target tracing a number of improved measurement conversion

More information

Time Series Prediction by Kalman Smoother with Cross-Validated Noise Density

Time Series Prediction by Kalman Smoother with Cross-Validated Noise Density Time Series Prediction by Kalman Smoother with Cross-Validated Noise Density Simo Särkkä E-mail: simo.sarkka@hut.fi Aki Vehtari E-mail: aki.vehtari@hut.fi Jouko Lampinen E-mail: jouko.lampinen@hut.fi Abstract

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

Inferences about a Mean Vector

Inferences about a Mean Vector Inferences about a Mean Vector Edps/Soc 584, Psych 594 Carolyn J. Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign c Board of Trustees, University

More information

Sliding Window Test vs. Single Time Test for Track-to-Track Association

Sliding Window Test vs. Single Time Test for Track-to-Track Association Sliding Window Test vs. Single Time Test for Track-to-Track Association Xin Tian Dept. of Electrical and Computer Engineering University of Connecticut Storrs, CT 06269-257, U.S.A. Email: xin.tian@engr.uconn.edu

More information

Some Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression

Some Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression Working Paper 2013:9 Department of Statistics Some Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression Ronnie Pingel Working Paper 2013:9 June

More information

A NEW FORMULATION OF IPDAF FOR TRACKING IN CLUTTER

A NEW FORMULATION OF IPDAF FOR TRACKING IN CLUTTER A NEW FRMULATIN F IPDAF FR TRACKING IN CLUTTER Jean Dezert NERA, 29 Av. Division Leclerc 92320 Châtillon, France fax:+33146734167 dezert@onera.fr Ning Li, X. Rong Li University of New rleans New rleans,

More information

MULTIVARIATE ANALYSIS OF VARIANCE UNDER MULTIPLICITY José A. Díaz-García. Comunicación Técnica No I-07-13/ (PE/CIMAT)

MULTIVARIATE ANALYSIS OF VARIANCE UNDER MULTIPLICITY José A. Díaz-García. Comunicación Técnica No I-07-13/ (PE/CIMAT) MULTIVARIATE ANALYSIS OF VARIANCE UNDER MULTIPLICITY José A. Díaz-García Comunicación Técnica No I-07-13/11-09-2007 (PE/CIMAT) Multivariate analysis of variance under multiplicity José A. Díaz-García Universidad

More information

Asymptotic Analysis of the Generalized Coherence Estimate

Asymptotic Analysis of the Generalized Coherence Estimate IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 49, NO. 1, JANUARY 2001 45 Asymptotic Analysis of the Generalized Coherence Estimate Axel Clausen, Member, IEEE, and Douglas Cochran, Senior Member, IEEE Abstract

More information

V. Properties of estimators {Parts C, D & E in this file}

V. Properties of estimators {Parts C, D & E in this file} A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

Online tests of Kalman filter consistency

Online tests of Kalman filter consistency Tampere University of Technology Online tests of Kalman filter consistency Citation Piché, R. (216). Online tests of Kalman filter consistency. International Journal of Adaptive Control and Signal Processing,

More information

F & B Approaches to a simple model

F & B Approaches to a simple model A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 215 http://www.astro.cornell.edu/~cordes/a6523 Lecture 11 Applications: Model comparison Challenges in large-scale surveys

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 11 January 7, 2013 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline How to communicate the statistical uncertainty

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Advanced Signal Processing Introduction to Estimation Theory

Advanced Signal Processing Introduction to Estimation Theory Advanced Signal Processing Introduction to Estimation Theory Danilo Mandic, room 813, ext: 46271 Department of Electrical and Electronic Engineering Imperial College London, UK d.mandic@imperial.ac.uk,

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Non-parametric Inference, Polynomial Regression Week 9, Lecture 2 1 Bootstrapped Bias and CIs Given a multiple regression model with mean and

More information

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses Space Telescope Science Institute statistics mini-course October 2011 Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses James L Rosenberger Acknowledgements: Donald Richards, William

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Statistical Distribution Assumptions of General Linear Models

Statistical Distribution Assumptions of General Linear Models Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

A COMPARISON OF TWO METHODS FOR STOCHASTIC FAULT DETECTION: THE PARITY SPACE APPROACH AND PRINCIPAL COMPONENTS ANALYSIS

A COMPARISON OF TWO METHODS FOR STOCHASTIC FAULT DETECTION: THE PARITY SPACE APPROACH AND PRINCIPAL COMPONENTS ANALYSIS A COMPARISON OF TWO METHODS FOR STOCHASTIC FAULT DETECTION: THE PARITY SPACE APPROACH AND PRINCIPAL COMPONENTS ANALYSIS Anna Hagenblad, Fredrik Gustafsson, Inger Klein Department of Electrical Engineering,

More information

AR-order estimation by testing sets using the Modified Information Criterion

AR-order estimation by testing sets using the Modified Information Criterion AR-order estimation by testing sets using the Modified Information Criterion Rudy Moddemeijer 14th March 2006 Abstract The Modified Information Criterion (MIC) is an Akaike-like criterion which allows

More information

STA121: Applied Regression Analysis

STA121: Applied Regression Analysis STA121: Applied Regression Analysis Linear Regression Analysis - Chapters 3 and 4 in Dielman Artin Department of Statistical Science September 15, 2009 Outline 1 Simple Linear Regression Analysis 2 Using

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Unified approach to the classical statistical analysis of small signals

Unified approach to the classical statistical analysis of small signals PHYSICAL REVIEW D VOLUME 57, NUMBER 7 1 APRIL 1998 Unified approach to the classical statistical analysis of small signals Gary J. Feldman * Department of Physics, Harvard University, Cambridge, Massachusetts

More information

z = β βσβ Statistical Analysis of MV Data Example : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) test statistic for H 0β is

z = β βσβ Statistical Analysis of MV Data Example : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) test statistic for H 0β is Example X~N p (µ,σ); H 0 : µ=0 (Σ known) consider Y = β X~ N 1 (β µ, β Σβ) H 0β : β µ = 0 test statistic for H 0β is y z = β βσβ /n And reject H 0β if z β > c [suitable critical value] 301 Reject H 0 if

More information

Statistical techniques for data analysis in Cosmology

Statistical techniques for data analysis in Cosmology Statistical techniques for data analysis in Cosmology arxiv:0712.3028; arxiv:0911.3105 Numerical recipes (the bible ) Licia Verde ICREA & ICC UB-IEEC http://icc.ub.edu/~liciaverde outline Lecture 1: Introduction

More information

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS F. C. Nicolls and G. de Jager Department of Electrical Engineering, University of Cape Town Rondebosch 77, South

More information

An improved procedure for combining Type A and Type B components of measurement uncertainty

An improved procedure for combining Type A and Type B components of measurement uncertainty Int. J. Metrol. Qual. Eng. 4, 55 62 (2013) c EDP Sciences 2013 DOI: 10.1051/ijmqe/2012038 An improved procedure for combining Type A and Type B components of measurement uncertainty R. Willink Received:

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

1/24/2008. Review of Statistical Inference. C.1 A Sample of Data. C.2 An Econometric Model. C.4 Estimating the Population Variance and Other Moments

1/24/2008. Review of Statistical Inference. C.1 A Sample of Data. C.2 An Econometric Model. C.4 Estimating the Population Variance and Other Moments /4/008 Review of Statistical Inference Prepared by Vera Tabakova, East Carolina University C. A Sample of Data C. An Econometric Model C.3 Estimating the Mean of a Population C.4 Estimating the Population

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

Test Volume 11, Number 1. June 2002

Test Volume 11, Number 1. June 2002 Sociedad Española de Estadística e Investigación Operativa Test Volume 11, Number 1. June 2002 Optimal confidence sets for testing average bioequivalence Yu-Ling Tseng Department of Applied Math Dong Hwa

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

Chapter 7, continued: MANOVA

Chapter 7, continued: MANOVA Chapter 7, continued: MANOVA The Multivariate Analysis of Variance (MANOVA) technique extends Hotelling T 2 test that compares two mean vectors to the setting in which there are m 2 groups. We wish to

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares

Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit

More information

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information