The Empirical Likelihood: an alternative for Signal Processing Estimation


Frédéric Pascal, Hugo Harari-Kermadec and Pascal Larzabal

Abstract

This paper presents a new robust estimation scheme for signal processing problems. The empirical likelihood is a recent semi-parametric estimation method [1] which makes it possible to estimate unknown parameters and to build confidence regions without using a prior model for the PDF: the method uses only the information contained in the observed data when no prior distribution is available for the problem. However, in the presence of priors on the parameter of interest, this information can be taken into account by means of constraints in an optimization problem. The aim of this paper is twofold: first, the empirical likelihood procedure is introduced in a very simple case; then, priors on the unknown parameters are added in the study of a more elaborate problem. To illustrate this analysis, one example is studied throughout the paper: covariance matrix estimation from random data. In this particular case, a closed-form expression is derived for the solution of the corresponding optimization problem. Finally, the theoretical results are supported by several simulations corresponding to realistic situations, which compare classical methods with the empirical likelihood method.

Index Terms

Empirical Likelihood, Maximum Likelihood, covariance matrix estimation, structured parameter estimation, statistical performance analysis, non-Gaussian noise.

F. Pascal is with SATIE, ENS Cachan, CNRS, UniverSud, 61 Av. du Pdt Wilson, F Cachan Cedex, France (e-mail: pascal@satie.ens-cachan.fr). H. Harari-Kermadec is with CREST-LS and University Paris-Dauphine, France (e-mail: harari@ensae.fr). P. Larzabal is with SATIE, ENS Cachan, CNRS, UniverSud, 61 Av. du Pdt Wilson, F Cachan Cedex, France (e-mail: larzabal@satie.ens-cachan.fr).

I. INTRODUCTION

It is often assumed that signals, interferences or noises are Gaussian stochastic processes. Indeed, this assumption makes sense in many applications. Among them, we can cite: source localization in passive sonar, radar detection, where thermal noise and clutter are often modeled as Gaussian processes, and digital communications, where the Gaussian hypothesis is widely used for interferences and noises. In these contexts, Gaussian models have been thoroughly investigated in the framework of Statistical Estimation and Detection Theory [2], [3], [4]. They have led to attractive algorithms. For instance, we can cite the stochastic Maximum Likelihood method for source localization in array processing [5], [6], and the matched filter and its adaptive variants in radar detection [7], [8] and in digital communications [9].

However, such widespread techniques suffer from several drawbacks when the noise process is a non-Gaussian stochastic process [10]. Therefore, non-Gaussian noise modeling has gained much interest in the last decades and presently leads to active research in the literature. Higher order moment methods [11] initiated this research activity, while particle filtering [12] is now intensively investigated. In radar applications, experimental clutter measurements performed by MIT [13] showed that these data are not correctly described by Gaussian statistical models. More generally, numerous non-Gaussian models have been developed in several engineering fields. For example, we can cite the K-distribution, already used in the domain of radar detection [14], [15]. Moreover, let us note that the Weibull distribution is a widespread model in radar detection [16].

Nevertheless, the question of model choice for the previous applications remains open since, most of the time, the chosen model does not perfectly describe the data behavior. In these cases, classical estimation methods such as Maximum Likelihood (ML), based on the data Probability Density Function (PDF), are used, leading as expected to only sub-optimal results. Several non-parametric techniques have been proposed in the literature to estimate this unknown PDF; we can cite for example wavelet methods, which have been widely investigated. But most of them are difficult to implement. An alternative is the Empirical Likelihood (EL) [1]. This method makes it possible to estimate the unknown parameters without assuming a noise model. Moreover, prior information on the data (known moments, parameter structure, ...) can be integrated in the processing by means of constraints in the optimization procedure. However, surprisingly, this estimation scheme is still unused in the area of Signal Processing

estimation. To the best of our knowledge, we can only cite [17] in the corresponding literature.

For a better investigation of the potential of this method, we analyze the covariance matrix estimation problem in the case of an additive noise. In several applications, it is reasonable to assume that the noise is a zero-mean process, i.e. the first order moment is null, and that the true covariance matrix has a Toeplitz structure [18], [19], [20], [21], [22]. As we will show in the sequel, the EL method is a way to estimate such structured parameters by means of a constrained problem.

The paper is organized as follows. Section II presents the estimation problem of interest while Section III gives some background on the EL procedure. Sections IV and V present the main results of this paper: the EL method used without constraints and then the EL method using prior information. In these sections, the comparison with the classical ML method is analyzed through the problem of covariance matrix estimation under Gaussian assumptions. Then, Section VI contains simulations which illustrate the theoretical results of Section V.

II. PROBLEM FORMULATION

In this section, we introduce the notations used in this paper and the statistical framework.

A. Notations

In the following, $^H$ denotes the conjugate transpose operator, $^\top$ denotes the transpose operator, $E[\cdot]$ stands for the statistical mean of a random variable, $E_P[\cdot]$ is the statistical mean under the data probability $P$, $\mathrm{Tr}(M)$ is the trace of the matrix $M$ and $\det(M)$ is the determinant of the matrix $M$. $\mathbb{C}$ (respectively $\mathbb{R}$) denotes the set of complex (resp. real) numbers, while for any integer $p$, $\mathbb{C}^p$ (resp. $\mathbb{R}^p$) represents the set of $p$-vectors with complex (resp. real) elements. For $z \in \mathbb{C}$, we write $\mathrm{Re}(z)$ and $\mathrm{Im}(z)$ for its real and imaginary parts.

B. Statistical Framework

In many signal processing problems, we have to extract an estimate $\hat{\theta}$ of a parameter $\theta$ based on some noisy data $x$. This leads to the functional relation:

$$\hat{\theta} = T(x). \qquad (1)$$

To obtain a useful estimator $T(\cdot)$ of the parameter $\theta$, a mathematical model for the data has to be introduced. One of the most widespread models in the area of signal processing is the following:

$$x = h(\theta, t) + n(t), \qquad (2)$$

where $x$ is the observed data, $n(t)$ is an additive noise and $h(\theta, t)$ is the noiseless part of the observation, which depends on $\theta$ and $t$. Sometimes, one has the separability property $h(\theta, t) = A(\theta)\, s(t)$, where $A(\theta)$ is a transfer matrix and $s(t)$ is a signal.

The methods of the pioneering works on estimation in signal processing did not perform very well. Recently, the huge increase of computing capacity has allowed the implementation of very sophisticated and well-performing methods. One of the most famous of them is the ML method, which requires, by construction, the knowledge of the data PDF, up to the parameters of interest and, possibly, additional nuisance parameters. If such a PDF is given, ML methods lead to very good results. The performance of such estimators critically depends on the PDF assumption. This assumption is chosen to be consistent with the problem, but also to be mathematically convenient. Therefore, the choice is restricted to a small number of classical PDF families and the true data PDF can depart, at least slightly, from these families. In this paper, we look for robust estimators, meaning that a slight mismatch on the PDF does not dramatically decrease the performance. Unfortunately, this is not the case for ML and other parametric methods. Consequently, it is interesting to develop well-performing methods which do not require an assumption on the PDF family.

We now focus on the following generic problem. Let $(x_1, \ldots, x_K)$ be an Independent and Identically Distributed (IID) data set in $\mathbb{R}^p$ with unknown PDF $f_{(\theta_0, \tau_0)}$, where $\theta_0 \in \Theta = \mathbb{R}^d$ is the parameter of interest and $\tau_0$ represents a nuisance parameter. Thus, the classical problem of an unknown signal $\theta$ corrupted by an additive noise $n_i$ of unknown power $\tau$, for $i = 1, \ldots, K$, can be formulated as follows:

$$x_i = \theta + \tau\, n_i. \qquad (3)$$

For identifiability considerations, the covariance matrix $M = E[n_i n_i^H]$ has to be normalized. Following [23], we set $\mathrm{Tr}(M) = p$. Moreover, $\theta$ obviously has the same dimension as the data, i.e. $d = p$. However, to keep the generality, we will keep the notation $d$ for the parameter dimension in the sequel.
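As a concrete illustration, the following minimal sketch (ours, not part of the original study) draws $K$ IID samples from the model of Eq. (3), with a Toeplitz $M$ normalized so that $\mathrm{Tr}(M) = p$; all numerical values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the data model of Eq. (3): x_i = theta + tau * n_i, with
# E[n_i n_i^H] = M and the normalization Tr(M) = p.
p, K = 3, 100
theta = np.array([1.0, -0.5, 2.0])               # unknown signal (fixed here for simulation)
tau = 1.0                                        # unknown noise power
rho = 0.5
idx = np.arange(p)
M = rho ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz correlation matrix, Tr(M) = p
L = np.linalg.cholesky(M)                        # M = L L^T, used to color the noise
n = rng.standard_normal((K, p)) @ L.T            # rows n_i ~ N(0, M)
x = theta + tau * n                              # the observed data set (x_1, ..., x_K)
```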

III. EMPIRICAL LIKELIHOOD

A. Statistical preliminaries

This section is devoted to the introduction of the Empirical Likelihood (EL) procedure. Let $P_0$ be the PDF from which the data have been generated. In practice, this probability is unknown and a classical approach is to choose a parametric family of PDFs and to assume that $P_0$ belongs to this family. The expression "parametric family" means that the PDF is known up to an element of a finite dimensional parameter space. For instance, the most commonly used family is the set of Gaussian PDFs, identified by two distinct parameters: the statistical mean (or expectation) and the covariance matrix (or variance in dimension 1). Notice that choosing a distribution family is a very restrictive assumption since it reduces an infinite dimensional estimation problem to the estimation of finite dimensional parameters. Indeed, the only knowledge on $P_0$ is that the corresponding CDF belongs to the set of càdlàg (right-continuous with left limits) increasing functions from 0 to 1.

The EL method [1] has been designed as a means of relaxing the restrictions on the PDF $P_0$ in comparison to parametric approaches. Indeed, the only assumption is that a finite variance exists. The aim of this method is to prevent degradations due to model misspecification. Therefore, instead of restricting the approach to the choice of a parametric family to mimic the data and then estimating its parameters, we look for the greatest flexibility on the PDF.

First, in order to rank the candidate PDFs with respect to the data, one needs a measure on the PDFs. A natural choice is the Kullback-Leibler divergence, which allows us to compare a candidate $Q$ with the data generating PDF $P_0$. This choice is motivated by the fact that Maximum Likelihood theory is closely connected to the Kullback-Leibler divergence, see [24]. For $Q$ and $P$ two distinct PDFs, the Kullback-Leibler divergence is defined as follows:

$$K(Q, P) = \begin{cases} -\displaystyle\int \log\left(\frac{dQ}{dP}\right) dP & \text{if } Q \ll P, \\ +\infty & \text{otherwise,} \end{cases} \qquad (4)$$

where $Q \ll P$ means that $Q$ is dominated by $P$ (i.e. for any event $A$, if $P(A) = 0$ then $Q(A) = 0$).

However, the true PDF is unknown and thus, it is impossible to compare any PDF $Q$ with $P_0$. To avoid this problem, the natural way to proceed is to replace $P_0$ by the empirical distribution of the data, defined by

$$P_K(x) = \frac{1}{K} \sum_{k=1}^{K} \delta_{x_k}(x), \qquad (5)$$

where $\delta_x$ is the Dirac measure at the element $x$. This very classical idea is motivated by the Strong Law of Large Numbers (SLLN). Indeed, one has from the SLLN that

$$P_K(x) = \frac{1}{K} \sum_{k=1}^{K} \delta_{x_k}(x) \xrightarrow[K \to \infty]{\text{a.s.}} E\left[\delta_{x_1}(x)\right] = P_0(x), \qquad (6)$$

where a.s. stands for almost sure convergence. See [24] for more details. This gives a feasible measure $K(Q, P_K)$.

At this point, it is interesting to notice that for any $Q$ which is not dominated by $P_K$, the divergence diverges. Therefore, the only convenient PDFs are those which are dominated by $P_K$. For this purpose, we introduce the family of multinomial PDFs $G$ supported by the data set. Notice that this is the identifiable choice of maximal dimension for a density family because it has as many degrees of freedom as there are observations:

$$G(x) = \begin{cases} q_k & \text{if } \exists k,\ x = x_k, \\ 0 & \text{otherwise,} \end{cases} \quad \text{or equivalently} \quad G(x) = \sum_{k=1}^{K} q_k\, \delta_{x_k}(x), \qquad (7)$$

where $0 < q_k < 1$ and $\sum_{k=1}^{K} q_k = 1$. Notice that although this choice of multinomial PDFs is the parametric model of maximal dimension and part of the EL method is inspired by the parametric likelihood under this model, it is never assumed that the true PDF $P_0$ belongs to this multinomial family.
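On the data set, the divergence of Eq. (4) between a candidate multinomial $G$ of Eq. (7) and the empirical measure $P_K$ of Eq. (5) takes a simple closed form; here is a small sketch (ours) of that computation.

```python
import numpy as np

# K(G, P_K) of Eq. (4) for G supported on the K data points with weights q_k:
# the integral reduces to -(1/K) * sum_k log(K * q_k), which is 0 iff G = P_K.
def divergence_to_empirical(q):
    q = np.asarray(q, dtype=float)
    K = len(q)
    return -np.mean(np.log(K * q))

print(divergence_to_empirical([0.25, 0.25, 0.25, 0.25]))  # 0.0: G equals P_K
print(divergence_to_empirical([0.40, 0.30, 0.20, 0.10]))  # > 0 otherwise
```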

B. Moment condition

The discrete family of multinomials may seem a poor model for the functional estimation of possibly continuous PDFs. However, in a semi-parametric approach, the choice of multinomials is well suited since it provides good moment estimates, whatever the true PDF of the data, see [24]. This is the reason why EL is adapted to estimation problems in which the parameter of interest $\theta$ is defined as the solution of a moment condition. We now suppose that, for some regular function $m$ defined by

$$m : \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}^n, \quad (x, \theta) \mapsto m(x, \theta), \qquad (8)$$

$\theta_0$ is defined as the solution of

$$E_{P_0}[m(x, \theta_0)] = 0, \qquad (9)$$

where $0$ denotes the null vector of appropriate dimension ($n$ here) and $n \geq d$. Let us recall that $p$ is the dimension of the observations $x$ and $d$ the dimension of the parameter of interest $\theta$. Notice that the expectation is taken under the true PDF $P_0$. This kind of equation is called a moment condition because it generalizes to non-polynomial functions $m$ the monomial equations giving the moments [25]. The following equations illustrate the case of the first and second moments respectively, with unknown expectation $\mu$ and/or unknown covariance matrix $\Sigma$:

$$E_{P_0}[x - \mu] = 0 \quad \text{and} \quad E_{P_0}\left[(x - \mu)(x - \mu)^H - \Sigma\right] = 0. \qquad (10)$$

In order to estimate the parameter $\theta$, the weights $(q_1, \ldots, q_K)$ introduced in Eq. (7) (which are the parameters of the multinomial PDF) are treated as nuisance parameters and $G$ is constrained through the moment equation:

$$E_G[m(x, \theta)] = 0, \qquad (11)$$

which rewrites

$$\sum_{k=1}^{K} q_k\, m(x_k, \theta) = 0, \qquad (12)$$

where the notation $E_G$ highlights the discrete probability $G$ under which the expectation is now taken. In order to evaluate the goodness of a value of $\theta$, one considers all the multinomials under which the moment equation is verified at $\theta$. Among those multinomials, we choose the one having the smallest

divergence with $P_K$. For that purpose, we introduce the following quantity:

$$\mathrm{ELR}(\theta) = 2K \inf_G \left\{ K(G, P_K) \;\middle|\; E_G[m(x, \theta)] = 0 \right\}. \qquad (13)$$

In fact, $\mathrm{ELR}(\theta)$ is the Empirical Likelihood Ratio (ELR), as we will see in the sequel by emphasizing the link between the EL method and the classical Maximum Likelihood method.

C. Likelihood

In this section, we briefly recall how the EL procedure can be placed in the classical likelihood context and how the ELR can be interpreted as a likelihood ratio. For that pedagogical purpose, we consider the family of multinomials $G$ as if it were a parametric model for the data. Notice that this parametric model assumption is only considered to interpret EL as a classical likelihood and is not necessary for the EL method. We define, for $G$ and $\theta$ verifying the moment condition $E_G[m(x, \theta)] = 0$, the density

$$g_{(\theta, q_1, \ldots, q_K)}(x) = G(x) = \begin{cases} q_k & \text{if } \exists k,\ x = x_k, \\ 0 & \text{otherwise.} \end{cases} \qquad (14)$$

The corresponding likelihood is called the Empirical Likelihood [26]:

$$\mathrm{EL}(\theta) = \mathrm{EL}(x_1, \ldots, x_K, \theta) = \sup_{(q_1, \ldots, q_K)} \left\{ \prod_{k=1}^{K} g_{(\theta, q_1, \ldots, q_K)}(x_k) \;\middle|\; E_G[m(x, \theta)] = 0 \right\} \qquad (15)$$

$$= \sup_{(q_1, \ldots, q_K)} \left\{ \prod_{k=1}^{K} q_k \;\middle|\; \sum_{k=1}^{K} q_k\, m(x_k, \theta) = 0,\ \sum_{k=1}^{K} q_k = 1 \right\}. \qquad (16)$$

Now, the following theorem makes the link between $\mathrm{EL}(\theta)$ and $\mathrm{ELR}(\theta)$ as well as the strong relation between the EL method and the classical ML one. This justifies the appellation "Likelihood".

Theorem III.1 (Connection between ELR and EL): $\mathrm{ELR}(\theta)$, defined by Eq. (13), can be expressed thanks to $\mathrm{EL}(\theta)$ in a similar way as in classical ML theory:

$$-2 \log\left(\frac{\mathrm{EL}(\theta)}{\sup_\theta \{\mathrm{EL}(\theta)\}}\right) = \mathrm{ELR}(\theta). \qquad (17)$$

Proof: For the clarity of the presentation, the proof of Theorem III.1 is postponed to Appendix A.

Notice that the previous log-likelihood ratio is the same as the one defined by Eq. (13), i.e. when no assumption has been made on the parametric model.

Then, the resulting Maximum Empirical Likelihood (MEL) estimate is equivalently given by

$$\hat{\theta}_{MEL} = \arg\max_\theta \{\mathrm{EL}(\theta)\} = \arg\inf_\theta \{\mathrm{ELR}(\theta)\}. \qquad (18)$$

The main technical difficulty, evaluating the empirical likelihood $\mathrm{EL}(\theta)$ at any given $\theta$, is resolved by a Lagrangian method as follows:

$$\mathrm{ELR}(\theta) = 2 \inf_{(q_1, \ldots, q_K)} \left\{ -\sum_{k=1}^{K} \log(K q_k) \;\middle|\; \sum_{k=1}^{K} q_k\, m(x_k, \theta) = 0,\ \sum_{k=1}^{K} (q_k - 1/K) = 0 \right\} \qquad (19)$$

$$= 2 \inf_{(q_1, \ldots, q_K)} \sup_{(\lambda, \gamma)} \left\{ -\sum_{k=1}^{K} \log(K q_k) - K \lambda^\top \sum_{k=1}^{K} q_k\, m(x_k, \theta) - \gamma \sum_{k=1}^{K} (q_k - 1/K) \right\} \qquad (20)$$

$$= \sup_\lambda \left\{ 2 \sum_{k=1}^{K} \log\left(1 + \lambda^\top m(x_k, \theta)\right) \right\}, \qquad (21)$$

because the first order condition gives $q_k^\star = \dfrac{1}{K\left(1 + \lambda^\top m(x_k, \theta)\right)}$, where $\lambda$ is the optimal Lagrange multiplier and depends on $\theta$. Now, to derive $\hat{\theta}_{MEL}$, one writes

$$\arg\inf_\theta \{\mathrm{ELR}(\theta)\} = \arg\inf_\theta \sup_\lambda \left\{ 2 \sum_{k=1}^{K} \log\left(1 + \lambda^\top m(x_k, \theta)\right) \right\} \qquad (22)$$

(a numerical sketch of this dual evaluation is given at the end of this subsection).

Remark III.1: In the specific case where $n$, the dimension of the arrival space of the function $m$, is equal to the dimension $d$ of the parameter of interest, and under mild conditions, the optimum is reached at the $\theta$ such that $\lambda = 0$. Therefore, the optimal weights are

$$q_k = \frac{1}{K}\left(1 + \lambda^\top m(x_k, \hat{\theta}_{MEL})\right)^{-1} = \frac{1}{K}, \qquad (23)$$

and thus

$$\mathrm{EL}(\hat{\theta}_{MEL}) = K^{-K}. \qquad (24)$$

The constraint $\sum_{k=1}^{K} q_k\, m(x_k, \hat{\theta}_{MEL}) = 0$ is then simplified and $\hat{\theta}_{MEL}$ is given as the solution of the following equation:

$$\frac{1}{K} \sum_{k=1}^{K} m(x_k, \hat{\theta}_{MEL}) = 0. \qquad (25)$$

This remark gives an explicit expression for $\hat{\theta}_{MEL}$ in the case where the parameter of interest $\theta$ is the expectation. The corresponding moment function $m : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^p$ is $m(x, \theta) = x - \theta$. This leads to the following theorem.

Theorem III.2: Let $(x_1, \ldots, x_K)$ be an IID data set in $\mathbb{R}^p$, with common PDF $P_0$ with expectation $\theta_0 \in \mathbb{R}^p$ and finite variance. Then the Maximum Empirical Likelihood estimate of the expectation is given by

$$\hat{\theta}_{MEL} = \bar{x}, \qquad (26)$$

where $\bar{x}$ denotes the empirical mean, $\bar{x} = \frac{1}{K} \sum_{k=1}^{K} x_k$.

Proof: This result is provided by Eq. (25) with the particular moment function $m : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^p$ defined by $m(x, \theta) = x - \theta$.

This expression for $\hat{\theta}_{MEL}$ can be extended to more general forms of $m$. In particular, the estimation scheme can take into account some prior information by introducing an augmented observation vector, as proposed in the following. For notational convenience, this augmented vector will also be denoted $x$. It is divided into two parts: $y$, a first data transformation encoding the parameter estimation problem, and $z$, a second data transformation reflecting the prior information. This procedure leads to

$$x = (y, z), \quad \text{where} \quad E_{P_0}[z] = 0. \qquad (27)$$

Therefore, the function $m$, defined by

$$m : \big((y, z), \theta\big) \mapsto (y - \theta,\ z), \qquad (28)$$

is such that $E_{P_0}[m(x, \theta_0)] = 0$ and this leads to the following theorem.

Theorem III.3 (Estimation with priors): Let $(x_1, \ldots, x_K)$ be an IID data set in $\mathbb{R}^p$, with common PDF $P_0$. For $k = 1, \ldots, K$, let us set $x_k = (y_k, z_k) \in \mathbb{R}^d \times \mathbb{R}^{p-d}$ and assume that the expectation and variance are

$$E_{P_0}\begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} \theta_0 \\ 0 \end{pmatrix} \quad \text{and} \quad \mathrm{Var}\begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} V_{yy} & V_{yz} \\ V_{zy} & V_{zz} \end{pmatrix}. \qquad (29)$$

Then the Maximum Empirical Likelihood estimate is given by

$$\hat{\theta}_{MEL} = \bar{y} - V_{yz} V_{zz}^{-1} \bar{z}. \qquad (30)$$

Proof: This result has been proved in [26], page 52.

Notice that when there is no prior information, i.e. $z = 0$, one recovers the result of Theorem III.2. This is also the case when $z$ is uncorrelated with $y$, because $V_{yz}$ is then the null matrix.
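Eq. (30) translates directly into a few lines of code. The sketch below (ours) plugs sample covariances in place of $V_{yz}$ and $V_{zz}$, an asymptotically valid choice that the theorem statement leaves implicit.

```python
import numpy as np

def mel_with_prior(y, z):
    """Sketch of Eq. (30): theta_hat = y_bar - V_yz V_zz^{-1} z_bar.
    y is K x d; z is K x (p - d) and satisfies the prior E[z] = 0.
    Sample covariances replace V_yz and V_zz (our plug-in choice)."""
    y_bar, z_bar = y.mean(axis=0), z.mean(axis=0)
    yc, zc = y - y_bar, z - z_bar
    V_yz = yc.T @ zc / len(y)
    V_zz = zc.T @ zc / len(z)
    return y_bar - V_yz @ np.linalg.solve(V_zz, z_bar)
```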
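When no such closed form is available, $\mathrm{ELR}(\theta)$ has to be evaluated numerically through the dual program of Eqs. (19)-(21). The following Newton iteration on $\lambda$ is a minimal sketch (ours), with backtracking to stay inside the domain where all $1 + \lambda^\top m(x_k, \theta)$ are positive.

```python
import numpy as np

def elr(theta, x, m, iters=100):
    """Sketch of Eqs. (21)-(22): ELR(theta) = 2 sup_lambda sum_k
    log(1 + lambda' m(x_k, theta)), maximized by Newton ascent on the
    concave dual.  m maps (x_k, theta) to a vector of R^n."""
    mk = np.array([m(xk, theta) for xk in x])         # K x n matrix of moments
    lam = np.zeros(mk.shape[1])
    for _ in range(iters):
        t = 1.0 + mk @ lam
        grad = (mk / t[:, None]).sum(axis=0)          # gradient of the dual
        hess = -(mk / t[:, None] ** 2).T @ mk         # negative definite Hessian
        step = np.linalg.solve(hess, grad)
        while np.any(1.0 + mk @ (lam - step) <= 1e-10):
            step *= 0.5                               # backtrack into the domain
        lam = lam - step
    return 2.0 * np.sum(np.log(1.0 + mk @ lam))

# Example with the mean condition m(x, theta) = x - theta: ELR is 0 at the
# empirical mean and grows as theta moves away from it.
x = np.random.default_rng(1).normal(size=(100, 2))
print(elr(x.mean(axis=0), x, lambda xk, th: xk - th))   # ~ 0
print(elr(np.array([0.5, 0.5]), x, lambda xk, th: xk - th))
```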

D. Confidence regions

We now introduce another feature of the EL method: the construction of confidence regions for the parameter of interest $\theta$. Such a confidence region can be interpreted as a test on the parameter of interest $\theta$. For this purpose, one would build a parametric log-likelihood ratio, defined as follows:

$$R(\theta) = -2 \log\left(\frac{L(\theta)}{\sup_\theta \{L(\theta)\}}\right), \qquad (31)$$

where $L(\theta)$ represents the classical likelihood function. A well-known result is that $R(\theta)$ is asymptotically distributed as a $\chi^2_d$, where $d$ is the dimension of $\theta$. In a similar way, the EL ratio yields a confidence region for $\theta$ under the moment condition: $\theta$ is in the confidence region with level $1 - \alpha$ if $\mathrm{ELR}(\theta)$ is smaller than the $(1-\alpha)$-quantile of a $\chi^2_d$ distribution. The huge difference is that no parametric assumption has to be made on the data distribution $P_0$. The following theorem states this result.

Theorem III.4 (Coincidence of R and ELR asymptotic distributions): Let $(x_1, \ldots, x_K)$ be an IID data set in $\mathbb{R}^p$, with common distribution $P_0$ such that, for some $m : \mathbb{R}^p \times \mathbb{R}^d \to \mathbb{R}^n$, $E_{P_0}[m(x, \theta_0)] = 0$ and $E_{P_0}[m(x, \theta_0)\, m(x, \theta_0)^\top]$ is finite with full rank. Then

$$-2 \log\left(K^K\, \mathrm{EL}(\theta_0)\right) \xrightarrow[K \to \infty]{\text{dist.}} \chi^2_n, \qquad (32)$$

$$-2 \log\left(K^K\, \mathrm{EL}(\hat{\theta}_{MEL})\right) \xrightarrow[K \to \infty]{\text{dist.}} \chi^2_{n-d}, \qquad (33)$$

$$\mathrm{ELR}(\theta_0) \xrightarrow[K \to \infty]{\text{dist.}} \chi^2_d, \qquad (34)$$

where $\xrightarrow{\text{dist.}}$ denotes convergence in distribution. Let $C_K = \left\{ \theta \mid \mathrm{ELR}(\theta) \leq \chi^2_d(1-\alpha) \right\}$; then $C_K$ is an asymptotic confidence region:

$$\Pr(\theta_0 \in C_K) \xrightarrow[K \to \infty]{} 1 - \alpha. \qquad (35)$$

Proof: This result has been proved in [27].
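Theorem III.4 yields an immediate numerical test; the sketch below (ours) reuses the elr() helper from the previous subsection and the $\chi^2_d$ quantile.

```python
from scipy.stats import chi2

# Sketch of the region C_K of Theorem III.4: theta belongs to the level
# (1 - alpha) confidence region iff ELR(theta) is below the chi-squared
# quantile with d degrees of freedom.
def in_confidence_region(theta, x, m, alpha=0.05):
    d = len(theta)
    return elr(theta, x, m) <= chi2.ppf(1.0 - alpha, df=d)
```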

To emphasize the adaptability of the EL method to the data distribution, we give two examples of confidence regions. Figure 1 clearly illustrates the behavior of EL: the confidence regions follow the level sets of the true PDF. The shapes of the regions are close to circles for the standard Gaussian distribution and close to squares for the uniform distribution. The EL method, without any parametric assumption on the data PDF, behaves as the parametric likelihood when the PDF model exactly corresponds to the true data PDF. In practice, this ideal choice is a challenging, if not intractable, issue and the parametric likelihood then suffers from important drawbacks. Another appealing feature of the EL is provided by a result from [27], which states that the larger $n$ is, the more precise the confidence region. In the next section, we consider an application that illustrates this behavior.

IV. A SIMPLE EXAMPLE: MEAN ESTIMATION IN PRESENCE OF GAUSSIAN NOISE

This section is devoted to the presentation of the EL in a simple case. For that purpose, we first recall the classical results obtained with the Maximum Likelihood (ML) method under Gaussian assumptions. In the general model defined by Eq. (3), we assume that the covariance matrix $M$ is known. With regard to the ML procedure, the natural moment condition corresponding to our example should simply be

$$E\left[\frac{M^{-1/2}(x - \theta)}{\tau}\right] = 0, \qquad (36)$$

which is equivalent to

$$E[x - \theta] = 0. \qquad (37)$$

The parameter of interest is therefore defined as the solution of a moment equation that depends neither on $M$ nor on $\tau$. The empirical likelihood procedure will also share this appealing property. To estimate the parameter of interest $\theta$, the natural procedure consists in building the profile likelihood defined as follows:

$$L(\theta) = L(x_1, \ldots, x_K, \theta) = \sup_\tau \left\{ \prod_{k=1}^{K} f_{(\theta, \tau)}(x_k) \right\}. \qquad (38)$$

In our model, the previous equation becomes

$$L(\theta) = \sup_\tau \left\{ \prod_{k=1}^{K} \varphi\left(\frac{M^{-1/2}(x_k - \theta)}{\tau}\right) \right\}, \qquad (39)$$

where $\varphi$ stands for the standard Gaussian PDF in $\mathbb{R}^p$. This method has been extensively studied and provides, under Gaussian assumptions, the ML estimate $\hat{\theta}_{ML}$, given by

$$\hat{\theta}_{ML} = \frac{1}{K} \sum_{k=1}^{K} x_k = \bar{x}. \qquad (40)$$

Notice that the ML estimate of $\theta_0$ defined by Eq. (40) does not depend on the true value of the scale parameter $\tau_0$.

Let us now turn to the EL method. The moment function corresponding to the problem under study is

$$m(x, \theta) = x - \theta, \qquad (41)$$

and therefore $n = d$. As seen previously in Theorem III.2, this implies that the MEL estimate is defined by

$$\hat{\theta}_{MEL} = \sum_{k=1}^{K} q_k^\star\, x_k = \frac{1}{K} \sum_{k=1}^{K} x_k = \bar{x} = \hat{\theta}_{ML}. \qquad (42)$$

This result is summarized by the following theorem.

Theorem IV.1 (Coincidence of ML and MEL estimates under Gaussian assumptions): Let $(x_1, \ldots, x_K)$ be an IID data set in $\mathbb{R}^p$, with common Gaussian distribution $\mathcal{N}_p(\theta_0, \tau_0^2 M)$. Then the MEL and ML estimates are equal:

$$\hat{\theta}_{MEL} = \hat{\theta}_{ML} = \bar{x}. \qquad (43)$$

Remark IV.1: The conclusion of this section is that the estimate of the first order moment provided by the MEL theory, without parametric modeling, is the same as the ML estimate built under Gaussian assumptions (and under many other classical ones). This is an important result for two reasons:
- Problems with Gaussian models are extensively studied in Signal Processing and, in this reference case, Theorem IV.1 ensures that the MEL estimate performs exactly like the ML estimate.
- In harder contexts, i.e. when no natural PDF model is available, the EL method will still provide robust estimates.

V. APPLICATION TO COVARIANCE MATRIX ESTIMATION

In this section, we focus on the noise covariance matrix estimation problem. Let $x$ be a complex Gaussian $p$-vector with zero mean and covariance matrix $E[x x^H] = \tau^2 M$, denoted $x \sim \mathcal{CN}(0, \tau^2 M)$. Again, we set $\mathrm{Tr}(M) = p$ for identifiability considerations. For simplicity, we set $N = \tau^2 M$. Since $N$ is Hermitian, we only need to estimate the upper triangular part of $N$, which is our parameter, previously denoted $\theta$.

In the following, the problem statement will be modified according to the priors assumed on the covariance matrix structure. The possible moment equations will be denoted

$$E[m_i(x, N)] = 0. \qquad (44)$$

In the sequel, $d_i$ will stand for the number of unknown real parameters, i.e. the dimension of $\theta$, and $n_i$ for the dimension of $m_i$. Notice that each moment equation $i$ will provide a different estimate. When there is no prior on the covariance matrix structure, the moment equation is given by

$$E[m_1(x, N)] = E\left[\left(x_i x_j^H - N_{ij}\right)_{1 \leq i \leq j \leq p}\right] = 0, \qquad (45)$$

where $x_i$ is the $i$-th element of the vector $x$ and $x_j^H$ stands for the conjugate of the element $x_j$. We consider here $N$ as an element of $\mathbb{C}^{p \times p}$, and then the number of unknown real parameters is $d_1 = p^2$: $p$ for the real valued elements $N_{jj}$ of the diagonal and $2 \times p(p-1)/2$ for the complex valued elements $N_{ij}$, $i < j$, of the strictly upper triangular part. The dimension $n_1$ of $m_1$ is also equal to $p^2$ and then, as mentioned previously in Remark III.1, the MEL weights are $1/K$ and the corresponding estimate is

$$\hat{N}_{EL1} = \frac{1}{K} \sum_{k=1}^{K} x_k x_k^H = \overline{x x^H}. \qquad (46)$$

The point of this section is to show how the estimate can be improved in the direction proposed by [27]: increase the information used by constraining the likelihood to fit prior knowledge, i.e. increase the dimension $n$ of $m$, or conversely decrease the dimension $d$ of the parameter of interest $N$.

Remark V.1: An important property of EL is its ability to easily take into account available prior information in the estimation process. Notice that in the problem under study in the current section, one has $E[x] = 0$, so that it could be possible to use the more complete moment equation defined as follows:

$$m_2(x, N) = \begin{pmatrix} \left(x_i x_j^H - N_{ij}\right)_{1 \leq i \leq j \leq p} \\ x \end{pmatrix}, \quad \text{with } d_2 = p^2 \text{ and } n_2 = p(p+2). \qquad (47)$$

This would increase the dimension $n_2$ of $m_2$ to $n_1 + 2p = p(p+2)$ whereas the number $d_2$ of unknown parameters remains unchanged. This estimate writes

$$\hat{N}_{EL2} = \arg\sup_N \{\mathrm{EL}_2(N)\} = \arg\inf_N \sup_\lambda \left\{ \sum_{k=1}^{K} \log\left(1 + \lambda^\top m_2(x_k, N)\right) \right\}. \qquad (48)$$

Unfortunately, it is well known that the first and second moment estimates are independent in the Gaussian case, by Student's theorem. Therefore, the new constraint does not bring any supplementary information

on the estimation of $N$. The effect of this independence in the Gaussian case can be illustrated through Theorem III.3: if $y$ and $z$ are asymptotically uncorrelated, $V_{yz}$ is null and then the estimate is

$$\hat{\theta}_{MEL} = \bar{y} - V_{yz} V_{zz}^{-1} \bar{z} = \bar{y}. \qquad (49)$$

In the following, this constraint will not be used. However, under non-Gaussian assumptions, the zero-mean prior could be useful for the covariance matrix estimation.

A. First prior: N is a Toeplitz matrix

A first step is to assume that $N$ has real valued elements. The dimension of the parameter is therefore reduced to $p(p+1)/2$. The number of constraints remains unchanged because the constraints remain complex. Several problems in signal processing assume that the covariance matrix of the additive noise has a Toeplitz structure [18], [19], [20], [21]. The covariance matrix (or correlation matrix, since $E_{P_0}[x] = 0$) is often a Toeplitz matrix, since the data vectors consist of subsequent samples from a single signal or time series. Toeplitz matrices are also met in the case of stationary random processes. For instance, because of the stationarity of the input process, the covariance matrix of an autoregressive (AR) process is a Toeplitz matrix. [22] is a good tutorial on Toeplitz matrices and contains most of their properties. Notice that there already exist methods for structured covariance matrix estimation in which the Toeplitz case is treated in the ML framework, see e.g. [28], [29]. We propose here to extend this to the MEL.

Now, let us assume that $M$ is a Toeplitz matrix with trace $p$. Then $M$ can be written as follows: there exists $(a_1, \ldots, a_{p-1}) \in \mathbb{R}^{p-1}$ such that

$$M_{ij} = a_{|i-j|} \ \text{for } i \neq j, \qquad M_{ii} = 1, \qquad (50)$$

i.e.

$$M = I + \sum_{i=2}^{p} \left( a_{i-1}\, J_i + a_{i-1}^H\, J_i^\top \right), \qquad (51)$$

where $I$ stands for the identity matrix of appropriate dimension ($p \times p$ here) and $J_i$ is the $p \times p$ matrix with 1's on the $i$-th upper diagonal and 0 everywhere else.
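A short sketch (ours) of the construction in Eq. (51), where a holds the real coefficients $(a_1, \ldots, a_{p-1})$:

```python
import numpy as np

# Build M = I + sum_i (a_{i-1} J_i + a_{i-1} J_i^T) as in Eq. (51); J_i has
# ones on the i-th upper diagonal, so Tr(M) = p by construction.
def toeplitz_from_coeffs(a):
    p = len(a) + 1
    M = np.eye(p)
    for i in range(1, p):
        J = np.eye(p, k=i)            # ones on the i-th upper diagonal
        M += a[i - 1] * (J + J.T)     # a is real here, so a^H = a
    return M

print(toeplitz_from_coeffs([0.5, 0.25]))  # 3 x 3 Toeplitz matrix with trace 3
```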

In this subsection, we take advantage of the Toeplitz structure of $M$ and, as a consequence, of $N = \tau^2 M$. Then, the constraint used in our process is the following: the elements of the main diagonal $N_{ii}$ are all equal to $\tau^2$ and the elements on the upper diagonals $N_{ij}$, for $i < j$, are equal to $\tau^2 a_{j-i}$. This changes both the dimension of the problem and the moment condition. Indeed, the number of unknown real parameters is 1 for each of the $p$ upper diagonals. Therefore, $d_3 = p$. The moment conditions are modified to take the structure into account:

$$m_3(x, N) = \begin{pmatrix} \left(x_i x_{i+j} - \tau^2 a_j\right)_{1 \leq i \leq p,\ 1 \leq j \leq p-i} \\ \left(x_i x_i^H - \tau^2\right)_{1 \leq i \leq p} \end{pmatrix}, \quad \text{with } d_3 = p \text{ and } n_3 = \frac{p(p+1)}{2}. \qquad (52)$$

This leads to a new estimate $\hat{N}_{EL3}$ of $N$, which integrates the constraint on the Toeplitz structure of the true covariance matrix $M$. This writes

$$\hat{N}_{EL3} = \arg\sup_N \{\mathrm{EL}_3(N)\} = \arg\inf_N \sup_\lambda \left\{ \sum_{k=1}^{K} \log\left(1 + \lambda^\top m_3(x_k, N)\right) \right\}. \qquad (53)$$

We can rewrite the constraints in terms of expectations in order to obtain an explicit form of the estimate by means of Theorem III.3. For simplicity, we give the constraints for $p = 3$. Let

$$y = \mathrm{Re}\left(x_1 x_1^H,\ x_1 x_2,\ x_1 x_3\right), \qquad (54)$$

$$z = \left(\mathrm{Im}(x_1 x_2,\ x_1 x_3),\ x_1 x_1^H - x_2 x_2^H,\ x_1 x_1^H - x_3 x_3^H,\ x_1 x_2 - x_2 x_3\right). \qquad (55)$$

Then, Theorem III.3 gives estimates for the first line of $N$:

$$\left(\widehat{\tau^2},\ \widehat{\tau^2 a_1},\ \widehat{\tau^2 a_2}\right) = \bar{y} - V_{yz} V_{zz}^{-1} \bar{z}. \qquad (56)$$

The estimate $\hat{N}_{EL3}$ then writes

$$\hat{N}_{EL3} = \begin{pmatrix} \widehat{\tau^2} & \widehat{\tau^2 a_1} & \widehat{\tau^2 a_2} \\ \widehat{\tau^2 a_1} & \widehat{\tau^2} & \widehat{\tau^2 a_1} \\ \widehat{\tau^2 a_2} & \widehat{\tau^2 a_1} & \widehat{\tau^2} \end{pmatrix}. \qquad (57)$$
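For $p = 3$, this closed form can be assembled from the mel_with_prior() sketch given after Theorem III.3. The transformations below are our reading of Eqs. (54)-(55); in particular, the conjugations and the reduction of the last complex constraint to its real part are our own choices, not spelled out in the paper.

```python
import numpy as np

def n_el3(x):
    """Sketch of Eqs. (54)-(57) for p = 3; x is a K x 3 complex data matrix."""
    x1, x2, x3 = x[:, 0], x[:, 1], x[:, 2]
    y = np.stack([np.abs(x1) ** 2,
                  (x1 * np.conj(x2)).real,
                  (x1 * np.conj(x3)).real], axis=1)
    z = np.stack([(x1 * np.conj(x2)).imag,
                  (x1 * np.conj(x3)).imag,
                  np.abs(x1) ** 2 - np.abs(x2) ** 2,
                  np.abs(x1) ** 2 - np.abs(x3) ** 2,
                  (x1 * np.conj(x2) - x2 * np.conj(x3)).real], axis=1)
    t2, t2a1, t2a2 = mel_with_prior(y, z)       # first line of N, Eq. (56)
    return np.array([[t2,   t2a1, t2a2],
                     [t2a1, t2,   t2a1],
                     [t2a2, t2a1, t2]])
```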

B. Second prior: N shares a special Toeplitz correlation structure

For some applications, an additional assumption can be taken into account. The correlation information contained in the covariance matrix is assumed to be reduced to only one parameter, the correlation coefficient $\rho$:

$$M_{ij} = \rho^{|i-j|} \quad \text{and} \quad N_{ij} = \tau^2 \rho^{|i-j|}, \qquad (58)$$

for $1 \leq i, j \leq p$ and for $0 < \rho < 1$. Notice that the covariance matrix $M$ is fully defined by the parameter $\rho$, which characterizes the correlation of the data. Then, the unknown parameters of $N$ are two real scalars, $\rho$ and $\tau$. Thus, the dimension of the problem is $d_4 = 2$, while the moment condition still remains unchanged, $n_4 = p(p+1)/2$:

$$m_4(x, N) = \left(\left(x_i x_{i+j} - \tau^2 \rho^j\right)_{1 \leq i \leq p,\ 0 \leq j \leq p-i}\right), \quad \text{with } d_4 = 2 \text{ and } n_4 = \frac{p(p+1)}{2}. \qquad (59)$$

This leads to the last estimate $\hat{N}_{EL4}$ of $N$, defined by

$$\hat{N}_{EL4} = \arg\sup_N \{\mathrm{EL}_4(N)\} = \arg\inf_N \sup_\lambda \left\{ \sum_{k=1}^{K} \log\left(1 + \lambda^\top m_4(x_k, N)\right) \right\}. \qquad (60)$$

Remark V.2: Notice that no analytical expression of this last estimate is available, because we have not been able to write the constraints $x_i x_{i+j} - \tau^2 \rho^j$ as expectations suitable for Theorem III.3. One can give a general expression from Eq. (42), which is also valid for the previous estimates:

$$\hat{N}_{ELj} = \sum_{k=1}^{K} q_k^\star(j)\, x_k x_k^H, \qquad (61)$$

where the $q_k^\star(j)$ depend on the constraints. For example, in the case of no constraint, i.e. $\hat{N}_{EL1}$, the $q_k^\star(1)$ are all equal to $1/K$. This corresponds to the ML estimate of the covariance matrix for a Gaussian vector. For the other estimates, the $q_k^\star(j)$ put weights on the data $x_k$ in order to fulfill the a priori conditions. As a consequence, $\hat{N}_{EL3}$ has a Toeplitz structure while $\hat{N}_{EL4}$ satisfies the special Toeplitz structure given by $\rho$. These theoretical estimates of $N$ will be compared in Section VI, thanks to simulations of their Mean Square Error (MSE), and an expression for each one will be given.

VI. SIMULATIONS

In order to illustrate the results provided in Sections IV and V, some simulation results are presented. We focus on the problem of structured covariance matrix estimation under Gaussian assumptions. Simulations are first performed with complex Gaussian noise and then with complex non-Gaussian noise. In this section, we only consider the true value of the parameters; for simplicity of notation, the index 0 is omitted.

In order to compare all the previous estimates, we plot the Mean Square Error (MSE) against the number $K$ of data in the Gaussian case. The MSE used in this section is the following criterion:

$$\mathrm{MSE}(\hat{M}, M) = \frac{E\left[\|\hat{M} - M\|\right]}{\|M\|}, \qquad (62)$$

where $\|\cdot\|$ stands for the Frobenius norm. In practice, this expectation is replaced by its empirical counterpart,

$$\widehat{\mathrm{MSE}} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{MSE}(l),$$

which converges to the MSE, thanks to the Strong Law of Large Numbers, as the number $L$ of Monte Carlo trials grows. The covariance matrix $M$ which has to be estimated has a Toeplitz structure and is defined as follows:

$$M_{ij} = \rho^{|i-j|} \quad \text{and} \quad N_{ij} = \tau^2 \rho^{|i-j|}, \qquad (63)$$

for $1 \leq i, j \leq p$ and for $0 < \rho < 1$. The size of the data is $p = 3$ and the shape parameter $\tau$ is equal to 1. The correlation coefficient $\rho$ is equal to 0.1 or 0.9. This allows us to build a covariance matrix close to the identity matrix (i.e. $\rho = 0.1$) or the covariance matrix of very correlated data (i.e. $\rho = 0.9$). This choice of different $\rho$ allows us to test the robustness of covariance matrix estimation to data correlation. In particular, Burg's method [28], based on the inversion of the estimated matrix, is expected to suffer from the correlation. Therefore, for $\rho = 0.9$, a decrease in estimation performance is expected.

For that purpose, several well-known covariance matrix estimates are compared to those provided by the EL method. This allows us to evaluate the performance of our method in comparison with classical ones. The chosen estimates of $N$ are the following. The well-known Sample Covariance Matrix (SCM), which corresponds to the ML estimate of the covariance matrix under Gaussian assumptions, is defined as follows:

$$\hat{N}_{SCM} = \frac{1}{K} \sum_{k=1}^{K} x_k x_k^H. \qquad (64)$$

$\hat{N}_{SCM}$ is used as a benchmark but it is not appropriate to our problem since it does not take into account the structure of the true covariance matrix.
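The empirical criterion below Eq. (62) amounts to the following loop (a sketch of ours; estimator and sample are hypothetical callables):

```python
import numpy as np

# Empirical MSE over L Monte Carlo trials, as below Eq. (62):
# estimator maps a K x p data set to an estimate of M; sample() draws a
# fresh data set at each trial.
def monte_carlo_mse(estimator, sample, M, L=1000):
    errs = [np.linalg.norm(estimator(sample()) - M, 'fro') for _ in range(L)]
    return np.mean(errs) / np.linalg.norm(M, 'fro')
```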

To take the structure into account, we use an estimate appropriate for Toeplitz matrices, first introduced in [28] and defined by

$$\hat{N}_{B3} = \arg\max_{\tilde{N} \in \mathcal{M}_3} \left( -\ln\left[\det(\tilde{N})\right] - \mathrm{Tr}\left(\tilde{N}^{-1} \hat{N}_{SCM}\right) \right), \qquad (65)$$

where $\mathcal{M}_3$ denotes the set of Toeplitz matrices: $\mathcal{M}_3 = \left\{ A \in \mathbb{C}^{p \times p} \mid \text{for } i \neq j,\ A_{ij} = \tau^2 a_{|i-j|},\ \text{and } A_{ii} = \tau^2 \right\}$.

Moreover, since the EL estimate $\hat{N}_{EL4}$ uses the constraint in which the correlation is fully described by the coefficient $\rho$, we also build the particular Burg estimate defined by

$$\hat{N}_{B4} = \arg\max_{\tilde{N} \in \mathcal{M}_4} \left( -\ln\left[\det(\tilde{N})\right] - \mathrm{Tr}\left(\tilde{N}^{-1} \hat{N}_{SCM}\right) \right), \qquad (66)$$

where $\mathcal{M}_4 = \left\{ A \in \mathbb{C}^{p \times p} \mid \text{for } 0 < \rho < 1,\ A_{ij} = \tau^2 \rho^{|i-j|} \right\}$.

Finally, the EL estimates will also be compared to a recently introduced estimate adapted to non-Gaussian noise, the Fixed Point estimate [30], [23], [31], defined as the solution of

$$\hat{M}_{FP} = \frac{p}{K} \sum_{k=1}^{K} \frac{x_k x_k^H}{x_k^H \hat{M}_{FP}^{-1} x_k}. \qquad (67)$$

Notice that $\hat{M}_{FP}$ is self-normalized and does not depend on $\tau$. Thus, it directly provides an estimate of $M$. For the other estimates, a normalization by an estimate of $\tau^2$ has to be made. As $\mathrm{Tr}(M) = p$, one has $\mathrm{Tr}(N) = \tau^2 p$ and thus, for all the estimates except $\hat{M}_{FP}$, one takes

$$\hat{M} = p\, \frac{\hat{N}}{\mathrm{Tr}(\hat{N})}. \qquad (68)$$

Concerning the EL method, the notations of Section V are still valid: $\hat{N}_{EL1}$ (which is equal to $\hat{N}_{EL2}$ and $\hat{N}_{SCM}$), $\hat{N}_{EL3}$ and $\hat{N}_{EL4}$.
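Eq. (67) defines $\hat{M}_{FP}$ implicitly; the usual way to compute it is to iterate the fixed point map, as in the sketch below (ours; the trace renormalization at each step and the stopping rule are implementation choices, see [30], [31] for the convergence analysis).

```python
import numpy as np

def fixed_point_estimate(x, iters=100, tol=1e-8):
    """Iterate the map of Eq. (67) on a K x p complex data set."""
    K, p = x.shape
    M = np.eye(p, dtype=complex)
    for _ in range(iters):
        Minv = np.linalg.inv(M)
        quad = np.einsum('kp,pq,kq->k', x.conj(), Minv, x).real   # x_k^H M^-1 x_k
        M_new = (p / K) * (x.T * (1.0 / quad)) @ x.conj()         # (p/K) sum x_k x_k^H / quad_k
        M_new *= p / np.trace(M_new).real                         # keep Tr(M) = p
        if np.linalg.norm(M_new - M) < tol * np.linalg.norm(M):
            return M_new
        M = M_new
    return M
```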

A. Gaussian case

We now give the mean square errors (MSE) of the corresponding estimation procedures, for different values of the data set length $K$ and for the seven estimates of interest, when the data are Gaussian $p$-vectors with zero mean and covariance matrix $N = \tau^2 M$, with $p = 3$ and $\tau = 1$, i.e.

$$x_k \sim \mathcal{CN}\left(0,\ \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}\right), \quad \text{for } k = 1, \ldots, K. \qquad (69)$$

1) $\rho = 0.1$: Figure 2 is plotted with a logarithmic scale on both the horizontal and vertical axes. The curves which correspond to $\hat{M}_{EL1}$, $\hat{M}_{EL3}$ and $\hat{M}_{EL4}$ are respectively denoted EL1, EL3 and EL4. As expected, the MSE calculated for each estimate decreases when $K$ increases: the larger the number of data $K$, the better the precision of the estimates. For the EL, the MSE decreases with the difference $n - d$, i.e. the estimation performance increases with the number of priors. This motivates the use of EL as soon as priors are available. That is why, on Figure 2.a, the MSE of $\hat{M}_{EL4}$ is lower than the MSEs of $\hat{M}_{EL3}$ and $\hat{M}_{EL1}$. Moreover, as seen previously, the simulations validate that $\hat{M}_{EL1}$ is equal to the SCM. On the other hand, since the FP estimate is not adapted to the Gaussian context, its performance is quite poor; this estimate has been introduced in the literature for non-Gaussian noise models.

Moreover, notice that the MSE of $\hat{M}_{EL4}$ (resp. $\hat{M}_{EL3}$) reaches the MSE of $\hat{M}_{B4}$ (resp. $\hat{M}_{B3}$) as soon as the number $K$ of data is large enough: on Figure 2.a, this approximately corresponds to $K = 200$. This can be explained by the fact that both methods exploit the same hypothesis on the structure of the covariance matrix $M$. For smaller values of $K$, the information contained in the Gaussian hypothesis plays an effective role which cannot be reduced to the observed data. That is why Burg's method is noticeably better than the EL method for small values of $K$. Finally, Eq. (57) provides a closed-form expression for $\hat{M}_{EL3}$, as opposed to $\hat{M}_{B3}$, which requires an optimization procedure for the same estimation performance.

This first simulation advocates for the EL method. Despite the fact that it does not use the Gaussian assumption on the data PDF, the EL method shares the same estimation performance as Burg's method.

2) $\rho = 0.5$ and $\rho = 0.8$: The first comments on Figure 2.a are still valid for Figures 2.b and 2.c. The main difference is that, for $\rho = 0.5$ and even more for $\rho = 0.8$, Burg's method suffers from a relative lack of performance, due to the difficulty of the matrix inversion in its algorithm, in these more correlated situations: indeed, when $\rho$ increases, $\hat{M}_{B4}$ deteriorates in comparison with $\hat{M}_{EL4}$. Figure 2.c even shows that, for $\rho = 0.8$, the MSE of $\hat{M}_{B4}$ is above the MSE of $\hat{M}_{EL3}$.

Another important remark is that the distance, in terms of MSE, between $\hat{M}_{EL4}$ and $\hat{M}_{EL3}$ (resp. $\hat{M}_{B4}$ and $\hat{M}_{B3}$) seems to decrease, i.e. the supplementary assumption on the particular structure of the Toeplitz matrix brings less additional information to the estimation procedure. Finally, notice that the MSE values seem to become smaller as $\rho$ increases: for instance, for $K = 100$, the MSE of $\hat{M}_{EL4}$ is around 0.06 for $\rho = 0.1$, 0.04 for $\rho = 0.5$ and 0.02 for $\rho = 0.8$.

3) MSE as a function of $\rho$: To confirm the last comment, Figure 3 presents the MSE against $\rho$ for $K = 100$ and $p = 3$. The first comment is that all the MSEs decrease as $\rho$ increases. This can be partially explained by the fact that the norm of the matrix $M$, which is the denominator of the MSE, increases with $\rho$. As observed in the previous paragraph, Burg's estimates show a smaller gain in performance than the EL ones when $\rho$ increases. Moreover, $\hat{M}_{EL4}$ and $\hat{M}_{EL3}$ (resp. $\hat{M}_{B4}$ and $\hat{M}_{B3}$) approach each other: this means that the additional information concerning the particular Toeplitz structure of $M$ (i.e. defined only by $\rho$), assumed to build $\hat{M}_{EL4}$ and $\hat{M}_{B4}$, does not significantly improve the estimation performance when $\rho$ tends to 1.

B. Non-Gaussian case: mixture of Gaussian and K-distribution

To compare the different estimates in a non-Gaussian context, we retain the K-distribution for the data PDF, since its shape parameter $\nu$ allows, for values close to 1, an approximation of a Gaussian PDF while, for small values, it provides impulsive noise. Moreover, this PDF is widely used in signal processing, see for example [14], [20], [21], [31]. A K-distributed vector is the product of the square root of a Gamma distributed random variable and an independent complex zero-mean Gaussian vector with covariance matrix $S$; it is denoted $\mathcal{CK}(\nu, 0, S)$.
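A minimal sampler sketch (ours) for $\mathcal{CK}(\nu, 0, S)$ noise follows; the unit-mean scaling of the Gamma texture is our own normalization choice.

```python
import numpy as np

rng = np.random.default_rng(2)

# CK(nu, 0, S): square root of a Gamma(nu) texture times an independent
# complex zero-mean Gaussian speckle vector with covariance S.
def sample_ck(K, nu, S):
    p = S.shape[0]
    A = np.linalg.cholesky(S)                     # S = A A^H
    g = (rng.standard_normal((K, p)) + 1j * rng.standard_normal((K, p))) / np.sqrt(2)
    speckle = g @ A.T                             # rows ~ CN(0, S)
    tex = rng.gamma(shape=nu, scale=1.0 / nu, size=(K, 1))
    return np.sqrt(tex) * speckle
```

The mixture used in the simulations below is then simply the sum of such a vector and an independent $\mathcal{CN}(0, S)$ vector.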

In order to simulate a realistic situation, the noise is taken as the sum of two independent components: a Gaussian noise, which would model thermal noise or interferences, and a K-distributed noise, which would represent an additive non-Gaussian contribution (for instance, the clutter in a radar context). For $k = 1, \ldots, K$,

$$x_k \sim \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}^{1/2} \left[\, \mathcal{CN}(0, I) + \mathcal{CK}(\nu, 0, I) \,\right]. \qquad (70)$$

The set of parameters in this subsection is $p = 3$, $K = 100$ and $\rho = 0.1$. This value of the correlation coefficient allows us to analyze Burg's method in a suitable context. For all the estimates except the Fixed Point, the MSE decreases as $\nu$ increases. In other terms, the performance deteriorates as the data PDF diverges from the Gaussian one. Conversely, the Fixed Point performance remains constant as $\nu$ varies, as it is designed to, see e.g. [31]. Since the FP estimate, like the SCM and EL1, does not assume any particular structure on the covariance matrix $M$, it makes no sense to compare these estimates with the others. Therefore, EL3 and Burg3 have to be considered together, while EL4 and Burg4 are considered as a third set of estimates.

As expected, the Fixed Point has the smallest MSE (among the three non-structured estimates) when the noise is very impulsive, and deteriorates as $\nu$ increases, approaching the SCM on Gaussian data. Actually, the performance of the Fixed Point is almost constant, while the SCM performs better when $\nu$ increases. When $\nu$ is small enough, the data are non-Gaussian and the EL method provides significantly better performance than Burg's. Then, when $\nu$ becomes larger than 0.2, the noise is approximately Gaussian and the results coincide with those of Figure 2.a.

Notice that the ratio between the curves EL3 and EL4 (resp. Burg3 and Burg4) is constant for each value of $\nu$. This ratio theoretically corresponds to the ratio between the expectation of a $\chi^2(3)$ (where 3 is the number of unknown parameters: $\tau$, $a_1$ and $a_2$) and the expectation of a $\chi^2(2)$ (where 2 is the number of unknown parameters: $\tau$ and $\rho$). This ratio is equal to 1.5.

C. Simulation synthesis

This set of simulations illustrates the use of the Empirical Likelihood on an example: the estimation of the structured covariance matrix of an additive noise. It appears that, in a Gaussian context and when the data correlation is weak enough, EL competes with the standard method introduced by Burg in [28], despite the fact that EL does not exploit the a priori PDF of the data. Moreover, the simulations show that the EL

method provides even better performance than the other estimates when the correlation becomes stronger. Secondly, under a non-Gaussian PDF, the performance of Burg's estimates seems to deteriorate whereas the EL ones remain robust. This is coherent with the fact that EL is a semi-parametric method designed to handle any data distribution.

VII. CONCLUSIONS

In this paper, an alternative estimation method, the Empirical Likelihood, has been introduced. This method does not require a priori knowledge of the data distribution, but it uses different pieces of information such as, for instance, moments or a specific structure of the parameter of interest. In the field of signal processing, such prior information is encountered in many estimation problems in which the data PDF is not available. Moreover, the theoretical results provided by the EL method show that, under Gaussian assumptions and without any supplementary information, the Maximum Empirical Likelihood estimate is the same as the Maximum Likelihood estimate.

The second part of this paper has been devoted to a classical and generic example: the estimation of a structured covariance matrix under Gaussian and non-Gaussian assumptions. Under Gaussian assumptions, the EL method has been compared to the classical structured methods introduced by Burg. It appears that EL performs almost as well as Burg's method. On the other hand, in a non-Gaussian context, EL presents good performance even when the other considered methods fail. Moreover, according to the prior information used in the EL method, closed-form expressions have been derived. This improves on classical methods, which generally have to solve an optimization problem, in terms of computational complexity, leading to a substantial gain in time and robustness.

APPENDIX A

PROOF OF THEOREM III.1

First, thanks to Remark III.1, one has $\mathrm{EL}(\hat{\theta}_{MEL}) = K^{-K}$. Therefore, since $\sup_\theta \{\mathrm{EL}(\theta)\} = \mathrm{EL}(\hat{\theta}_{MEL})$, the log-likelihood ratio writes:

$$-2 \log\left(\frac{\mathrm{EL}(\theta)}{\sup_\theta \{\mathrm{EL}(\theta)\}}\right) = -2 \log\left(K^K \sup_{(q_1, \ldots, q_K)} \left\{ \prod_{k=1}^{K} g_{(\theta, q_1, \ldots, q_K)}(x_k) \;\middle|\; E_G[m(x, \theta)] = 0 \right\}\right) \qquad (71)$$

$$= 2 \inf_{(q_1, \ldots, q_K)} \left\{ -\sum_{k=1}^{K} \log(K q_k) \;\middle|\; E_G[m(x, \theta)] = 0 \right\} \qquad (72)$$

$$= 2K \inf_{(q_1, \ldots, q_K)} \left\{ -\frac{1}{K} \sum_{k=1}^{K} \log\left(\frac{q_k}{1/K}\right) \;\middle|\; E_G[m(x, \theta)] = 0 \right\} \qquad (73)$$

$$= 2K \inf_{G} \left\{ -\int \log\left(\frac{dG}{dP_K}\right) dP_K \;\middle|\; E_G[m(x, \theta)] = 0 \right\} \qquad (74)$$

$$= \mathrm{ELR}(\theta), \qquad (75)$$

which concludes the proof.

REFERENCES

[1] A. B. Owen, "Empirical likelihood ratio confidence regions," Annals of Statistics, vol. 18.
[2] S. M. Kay, Fundamentals of Statistical Signal Processing - Detection Theory. Prentice-Hall PTR, 1998, vol. 2.
[3] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part I, II and III. John Wiley & Sons, New York.
[4] L. L. Scharf and D. W. Lytle, "Signal detection in Gaussian noise of unknown level: an invariance application," IEEE Trans.-IT, vol. 17, July.
[5] S. Haykin, Array Signal Processing. Prentice-Hall Signal Processing Series, Englewood Cliffs, New Jersey.
[6] H. L. Van Trees, Detection, Estimation and Modulation Theory, Part IV: Optimum Array Processing. John Wiley & Sons, New York.
[7] E. J. Kelly, "An adaptive detection algorithm," IEEE Trans.-AES, vol. 23, no. 1, November.
[8] S. Kraut, L. L. Scharf, and L. T. McWhorter, "Adaptive subspace detectors," IEEE Trans.-SP, vol. 49, no. 1, pp. 1-16, January.
[9] J. G. Proakis, Digital Communications. McGraw-Hill, Third Ed., New York.
[10] M. Rangaswamy, J. H. Michels, and D. D. Weiner, "Multichannel detection for correlated non-Gaussian random processes based on innovations," IEEE Trans.-SP, vol. 43, no. 8, August.
[11] J.-F. Cardoso, "Source separation using higher order moments," in Proc. IEEE-ICASSP, Glasgow, May 1989.
[12] P. M. Djuric et al., "Particle filtering," IEEE SP Magazine, vol. 20, no. 5, September.
[13] J. B. Billingsley, "Ground clutter measurements for surface-sited radar," MIT, Tech. Rep. 780, February.
[14] S. Watts, "Radar detection prediction in sea clutter using the compound K-distribution model," IEE Proceedings, Part F, vol. 132, no. 7, December.
[15] T. Nohara and S. Haykin, "Canada East Coast trials and the K-distribution," IEE Proceedings, Part F, vol. 138, no. 2.
[16] A. Farina, A. Russo, and F. Scannapieco, "Radar detection in coherent Weibull clutter," IEEE Trans.-ASSP, vol. 35, no. 6, June.
[17] A. Dogandzic and B. Zhang, "Distributed estimation and detection for sensor networks using hidden Markov random field models," IEEE Trans.-SP, vol. 54, no. 8, August.
[18] U. Grenander and G. Szego, Toeplitz Forms and Their Applications. University of California Press, Berkeley and Los Angeles, 1958.

[19] D. S. G. Pollock, A Handbook of Time-Series Analysis, Signal Processing and Dynamics. Academic Press.
[20] E. Conte, M. Lops, and G. Ricci, "Adaptive detection schemes in compound-Gaussian clutter," IEEE Trans.-AES, vol. 34, no. 4, October.
[21] F. Gini and M. V. Greco, "Sub-optimum approach to adaptive coherent radar detection in compound-Gaussian clutter," IEEE Trans.-AES, vol. 35, no. 3, July.
[22] R. M. Gray, "Toeplitz and circulant matrices," Stanford University Information Systems Laboratory, Tech. Rep., April. [Online]. Available: gray/compression.html
[23] F. Gini and M. V. Greco, "Covariance matrix estimation for CFAR detection in correlated heavy tailed clutter," Signal Processing, special section on SP with Heavy Tailed Distributions, vol. 82, no. 12, December.
[24] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press.
[25] S. M. Kay, Fundamentals of Statistical Signal Processing - Estimation Theory. Prentice-Hall PTR, Englewood Cliffs, NJ, 1993, vol. 1.
[26] A. B. Owen, Empirical Likelihood. Chapman and Hall/CRC, Boca Raton.
[27] Y. S. Qin and J. Lawless, "Empirical likelihood and general estimating equations," Annals of Statistics, vol. 22, no. 1.
[28] J. P. Burg, D. G. Luenberger, and D. L. Wenger, "Estimation of structured covariance matrices," Proc. IEEE, vol. 70, no. 9, September.
[29] D. R. Fuhrmann, "Application of Toeplitz covariance estimation to adaptive beamforming and detection," IEEE Trans.-SP, vol. 39, no. 10, October.
[30] E. Conte, A. De Maio, and G. Ricci, "Recursive estimation of the covariance matrix of a compound-Gaussian process and its application to adaptive CFAR detection," IEEE Trans.-SP, vol. 50, no. 8, August.
[31] F. Pascal, Y. Chitour, J.-P. Ovarlez, P. Forster, and P. Larzabal, "Covariance structure maximum likelihood estimates in compound Gaussian noise: existence and algorithm analysis," IEEE Trans.-SP, (to appear).

Fig. 1. Empirical Likelihood confidence regions for the statistical mean. (a) 40 data, standard Gaussian distributed. (b) 20 data, uniformly distributed on the set [0, 1]².


Lessons in Estimation Theory for Signal Processing, Communications, and Control Lessons in Estimation Theory for Signal Processing, Communications, and Control Jerry M. Mendel Department of Electrical Engineering University of Southern California Los Angeles, California PRENTICE HALL

More information

Robust extraction of specific signals with temporal structure

Robust extraction of specific signals with temporal structure Robust extraction of specific signals with temporal structure Zhi-Lin Zhang, Zhang Yi Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science

More information

SELECTIVE ANGLE MEASUREMENTS FOR A 3D-AOA INSTRUMENTAL VARIABLE TMA ALGORITHM

SELECTIVE ANGLE MEASUREMENTS FOR A 3D-AOA INSTRUMENTAL VARIABLE TMA ALGORITHM SELECTIVE ANGLE MEASUREMENTS FOR A 3D-AOA INSTRUMENTAL VARIABLE TMA ALGORITHM Kutluyıl Doğançay Reza Arablouei School of Engineering, University of South Australia, Mawson Lakes, SA 595, Australia ABSTRACT

More information

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM

Spring 2017 Econ 574 Roger Koenker. Lecture 14 GEE-GMM University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 14 GEE-GMM Throughout the course we have emphasized methods of estimation and inference based on the principle

More information

Patch similarity under non Gaussian noise

Patch similarity under non Gaussian noise The 18th IEEE International Conference on Image Processing Brussels, Belgium, September 11 14, 011 Patch similarity under non Gaussian noise Charles Deledalle 1, Florence Tupin 1, Loïc Denis 1 Institut

More information

A Test of Cointegration Rank Based Title Component Analysis.

A Test of Cointegration Rank Based Title Component Analysis. A Test of Cointegration Rank Based Title Component Analysis Author(s) Chigira, Hiroaki Citation Issue 2006-01 Date Type Technical Report Text Version publisher URL http://hdl.handle.net/10086/13683 Right

More information

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate www.scichina.com info.scichina.com www.springerlin.com Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate WEI Chen & CHEN ZongJi School of Automation

More information

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1 GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS Mitsuru Kawamoto,2 and Yuiro Inouye. Dept. of Electronic and Control Systems Engineering, Shimane University,

More information

ADAPTIVE ANTENNAS. SPATIAL BF

ADAPTIVE ANTENNAS. SPATIAL BF ADAPTIVE ANTENNAS SPATIAL BF 1 1-Spatial reference BF -Spatial reference beamforming may not use of embedded training sequences. Instead, the directions of arrival (DoA) of the impinging waves are used

More information

A Machine Learning Approach to Distribution Identification in Non-Gaussian Clutter

A Machine Learning Approach to Distribution Identification in Non-Gaussian Clutter A Machine Learning Approach to Distribution Identification in Non-Gaussian Clutter Justin Metcalf and Shannon Blunt Electrical Eng. and Computer Sci. Dept. University of Kansas Braham Himed Sensors Directorate

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Least squares regularized or constrained by L0: relationship between their global minimizers. Mila Nikolova

Least squares regularized or constrained by L0: relationship between their global minimizers. Mila Nikolova Least squares regularized or constrained by L0: relationship between their global minimizers Mila Nikolova CMLA, CNRS, ENS Cachan, Université Paris-Saclay, France nikolova@cmla.ens-cachan.fr SIAM Minisymposium

More information

Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling

Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling François Bachoc former PhD advisor: Josselin Garnier former CEA advisor: Jean-Marc Martinez Department

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

IEEE copyright notice

IEEE copyright notice IEEE copyright notice Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for

More information

Blind separation of instantaneous mixtures of dependent sources

Blind separation of instantaneous mixtures of dependent sources Blind separation of instantaneous mixtures of dependent sources Marc Castella and Pierre Comon GET/INT, UMR-CNRS 7, 9 rue Charles Fourier, 9 Évry Cedex, France marc.castella@int-evry.fr, CNRS, I3S, UMR

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Self Adaptive Particle Filter

Self Adaptive Particle Filter Self Adaptive Particle Filter Alvaro Soto Pontificia Universidad Catolica de Chile Department of Computer Science Vicuna Mackenna 4860 (143), Santiago 22, Chile asoto@ing.puc.cl Abstract The particle filter

More information

PARAMETER ESTIMATION AND ORDER SELECTION FOR LINEAR REGRESSION PROBLEMS. Yngve Selén and Erik G. Larsson

PARAMETER ESTIMATION AND ORDER SELECTION FOR LINEAR REGRESSION PROBLEMS. Yngve Selén and Erik G. Larsson PARAMETER ESTIMATION AND ORDER SELECTION FOR LINEAR REGRESSION PROBLEMS Yngve Selén and Eri G Larsson Dept of Information Technology Uppsala University, PO Box 337 SE-71 Uppsala, Sweden email: yngveselen@ituuse

More information

Case study: stochastic simulation via Rademacher bootstrap

Case study: stochastic simulation via Rademacher bootstrap Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic

More information

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS

UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS UNIFORMLY MOST POWERFUL CYCLIC PERMUTATION INVARIANT DETECTION FOR DISCRETE-TIME SIGNALS F. C. Nicolls and G. de Jager Department of Electrical Engineering, University of Cape Town Rondebosch 77, South

More information

A Note on the Expectation-Maximization (EM) Algorithm

A Note on the Expectation-Maximization (EM) Algorithm A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization

More information

A Gaussian state-space model for wind fields in the North-East Atlantic

A Gaussian state-space model for wind fields in the North-East Atlantic A Gaussian state-space model for wind fields in the North-East Atlantic Julie BESSAC - Université de Rennes 1 with Pierre AILLIOT and Valï 1 rie MONBET 2 Juillet 2013 Plan Motivations 1 Motivations 2 Context

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES

BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES BLIND SEPARATION OF INSTANTANEOUS MIXTURES OF NON STATIONARY SOURCES Dinh-Tuan Pham Laboratoire de Modélisation et Calcul URA 397, CNRS/UJF/INPG BP 53X, 38041 Grenoble cédex, France Dinh-Tuan.Pham@imag.fr

More information

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:

More information

THE BAYESIAN ABEL BOUND ON THE MEAN SQUARE ERROR

THE BAYESIAN ABEL BOUND ON THE MEAN SQUARE ERROR THE BAYESIAN ABEL BOUND ON THE MEAN SQUARE ERROR Alexandre Renaux, Philippe Forster, Pascal Larzabal, Christ Richmond To cite this version: Alexandre Renaux, Philippe Forster, Pascal Larzabal, Christ Richmond

More information

LTI Systems, Additive Noise, and Order Estimation

LTI Systems, Additive Noise, and Order Estimation LTI Systems, Additive oise, and Order Estimation Soosan Beheshti, Munther A. Dahleh Laboratory for Information and Decision Systems Department of Electrical Engineering and Computer Science Massachusetts

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

Statistical and Adaptive Signal Processing

Statistical and Adaptive Signal Processing r Statistical and Adaptive Signal Processing Spectral Estimation, Signal Modeling, Adaptive Filtering and Array Processing Dimitris G. Manolakis Massachusetts Institute of Technology Lincoln Laboratory

More information

Fundamentals of Statistical Signal Processing Volume II Detection Theory

Fundamentals of Statistical Signal Processing Volume II Detection Theory Fundamentals of Statistical Signal Processing Volume II Detection Theory Steven M. Kay University of Rhode Island PH PTR Prentice Hall PTR Upper Saddle River, New Jersey 07458 http://www.phptr.com Contents

More information

EIE6207: Estimation Theory

EIE6207: Estimation Theory EIE6207: Estimation Theory Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: Steven M.

More information

Research Article Optimal Portfolio Estimation for Dependent Financial Returns with Generalized Empirical Likelihood

Research Article Optimal Portfolio Estimation for Dependent Financial Returns with Generalized Empirical Likelihood Advances in Decision Sciences Volume 2012, Article ID 973173, 8 pages doi:10.1155/2012/973173 Research Article Optimal Portfolio Estimation for Dependent Financial Returns with Generalized Empirical Likelihood

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Two results in statistical decision theory for detecting signals with unknown distributions and priors in white Gaussian noise.

Two results in statistical decision theory for detecting signals with unknown distributions and priors in white Gaussian noise. Two results in statistical decision theory for detecting signals with unknown distributions and priors in white Gaussian noise. Dominique Pastor GET - ENST Bretagne, CNRS UMR 2872 TAMCIC, Technopôle de

More information

MANY papers and books are devoted to modeling fading

MANY papers and books are devoted to modeling fading IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 16, NO. 9, DECEMBER 1998 1809 Hidden Markov Modeling of Flat Fading Channels William Turin, Senior Member, IEEE, Robert van Nobelen Abstract Hidden

More information

9 Multi-Model State Estimation

9 Multi-Model State Estimation Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 9 Multi-Model State

More information

A CUSUM approach for online change-point detection on curve sequences

A CUSUM approach for online change-point detection on curve sequences ESANN 22 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges Belgium, 25-27 April 22, i6doc.com publ., ISBN 978-2-8749-49-. Available

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

Order Selection for Vector Autoregressive Models

Order Selection for Vector Autoregressive Models IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 2, FEBRUARY 2003 427 Order Selection for Vector Autoregressive Models Stijn de Waele and Piet M. T. Broersen Abstract Order-selection criteria for vector

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

A New Subspace Identification Method for Open and Closed Loop Data

A New Subspace Identification Method for Open and Closed Loop Data A New Subspace Identification Method for Open and Closed Loop Data Magnus Jansson July 2005 IR S3 SB 0524 IFAC World Congress 2005 ROYAL INSTITUTE OF TECHNOLOGY Department of Signals, Sensors & Systems

More information

A New Method of Generating Multivariate Weibull Distributed Data

A New Method of Generating Multivariate Weibull Distributed Data A New Method of Generating Multivariate Weibull Distributed Data Justin G Metcalf, K. James Sangston, Muralidhar Rangaswamy, Shannon D Blunt, Braham Himed Sensors Directorate, Air Force Research Laboratory

More information

A NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

A NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A. A EW IFORMATIO THEORETIC APPROACH TO ORDER ESTIMATIO PROBLEM Soosan Beheshti Munther A. Dahleh Massachusetts Institute of Technology, Cambridge, MA 0239, U.S.A. Abstract: We introduce a new method of model

More information

Bayesian Methods in Positioning Applications

Bayesian Methods in Positioning Applications Bayesian Methods in Positioning Applications Vedran Dizdarević v.dizdarevic@tugraz.at Graz University of Technology, Austria 24. May 2006 Bayesian Methods in Positioning Applications p.1/21 Outline Problem

More information

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll

More information

Signal Detection Basics - CFAR

Signal Detection Basics - CFAR Signal Detection Basics - CFAR Types of noise clutter and signals targets Signal separation by comparison threshold detection Signal Statistics - Parameter estimation Threshold determination based on the

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Estimation, Detection, and Identification CMU 18752

Estimation, Detection, and Identification CMU 18752 Estimation, Detection, and Identification CMU 18752 Graduate Course on the CMU/Portugal ECE PhD Program Spring 2008/2009 Instructor: Prof. Paulo Jorge Oliveira pjcro @ isr.ist.utl.pt Phone: +351 21 8418053

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

A Root-MUSIC-Like Direction Finding Method for Cyclostationary Signals

A Root-MUSIC-Like Direction Finding Method for Cyclostationary Signals EURASIP Journal on Applied Signal Processing 25:1, 69 73 c 25 Hindawi Publishing Corporation A Root-MUSIC-Like Direction Finding Method for Cyclostationary Signals Pascal Chargé LESIA, DGEI, INSA Toulouse,

More information

2.7 The Gaussian Probability Density Function Forms of the Gaussian pdf for Real Variates

2.7 The Gaussian Probability Density Function Forms of the Gaussian pdf for Real Variates .7 The Gaussian Probability Density Function Samples taken from a Gaussian process have a jointly Gaussian pdf (the definition of Gaussian process). Correlator outputs are Gaussian random variables if

More information

EUSIPCO

EUSIPCO EUSIPCO 03 56974375 ON THE RESOLUTION PROBABILITY OF CONDITIONAL AND UNCONDITIONAL MAXIMUM LIKELIHOOD DOA ESTIMATION Xavier Mestre, Pascal Vallet, Philippe Loubaton 3, Centre Tecnològic de Telecomunicacions

More information

Review of some mathematical tools

Review of some mathematical tools MATHEMATICAL FOUNDATIONS OF SIGNAL PROCESSING Fall 2016 Benjamín Béjar Haro, Mihailo Kolundžija, Reza Parhizkar, Adam Scholefield Teaching assistants: Golnoosh Elhami, Hanjie Pan Review of some mathematical

More information

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001 411 A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces Paul M. Baggenstoss, Member, IEEE

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

On Information Maximization and Blind Signal Deconvolution

On Information Maximization and Blind Signal Deconvolution On Information Maximization and Blind Signal Deconvolution A Röbel Technical University of Berlin, Institute of Communication Sciences email: roebel@kgwtu-berlinde Abstract: In the following paper we investigate

More information

Riccati difference equations to non linear extended Kalman filter constraints

Riccati difference equations to non linear extended Kalman filter constraints International Journal of Scientific & Engineering Research Volume 3, Issue 12, December-2012 1 Riccati difference equations to non linear extended Kalman filter constraints Abstract Elizabeth.S 1 & Jothilakshmi.R

More information

Independent Component Analysis. Contents

Independent Component Analysis. Contents Contents Preface xvii 1 Introduction 1 1.1 Linear representation of multivariate data 1 1.1.1 The general statistical setting 1 1.1.2 Dimension reduction methods 2 1.1.3 Independence as a guiding principle

More information

1 What is a hidden Markov model?

1 What is a hidden Markov model? 1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and

More information

2.2 Classical Regression in the Time Series Context

2.2 Classical Regression in the Time Series Context 48 2 Time Series Regression and Exploratory Data Analysis context, and therefore we include some material on transformations and other techniques useful in exploratory data analysis. 2.2 Classical Regression

More information

Performance Comparison of Two Implementations of the Leaky. LMS Adaptive Filter. Scott C. Douglas. University of Utah. Salt Lake City, Utah 84112

Performance Comparison of Two Implementations of the Leaky. LMS Adaptive Filter. Scott C. Douglas. University of Utah. Salt Lake City, Utah 84112 Performance Comparison of Two Implementations of the Leaky LMS Adaptive Filter Scott C. Douglas Department of Electrical Engineering University of Utah Salt Lake City, Utah 8411 Abstract{ The leaky LMS

More information

MODELS OF RADAR CLUTTER FROM THE WEIBULL DISTRIBUTION

MODELS OF RADAR CLUTTER FROM THE WEIBULL DISTRIBUTION Proceedings of the 5th WSEAS Int. Conf. on Signal Processing, Robotics and Automation, Madrid, Spain, February -17, 6 (pp376-38) MODELS OF RADAR CLUTTER FROM THE WEIBULL DISTRIBUTION R. VICEN-BUENO, M.

More information

Optimal Time Division Multiplexing Schemes for DOA Estimation of a Moving Target Using a Colocated MIMO Radar

Optimal Time Division Multiplexing Schemes for DOA Estimation of a Moving Target Using a Colocated MIMO Radar Optimal Division Multiplexing Schemes for DOA Estimation of a Moving Target Using a Colocated MIMO Radar Kilian Rambach, Markus Vogel and Bin Yang Institute of Signal Processing and System Theory University

More information

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata

PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE. Noboru Murata ' / PROPERTIES OF THE EMPIRICAL CHARACTERISTIC FUNCTION AND ITS APPLICATION TO TESTING FOR INDEPENDENCE Noboru Murata Waseda University Department of Electrical Electronics and Computer Engineering 3--

More information

WE study the capacity of peak-power limited, single-antenna,

WE study the capacity of peak-power limited, single-antenna, 1158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 3, MARCH 2010 Gaussian Fading Is the Worst Fading Tobias Koch, Member, IEEE, and Amos Lapidoth, Fellow, IEEE Abstract The capacity of peak-power

More information

Widely Linear Estimation with Complex Data

Widely Linear Estimation with Complex Data Widely Linear Estimation with Complex Data Bernard Picinbono, Pascal Chevalier To cite this version: Bernard Picinbono, Pascal Chevalier. Widely Linear Estimation with Complex Data. IEEE Transactions on

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information