Data Fusion with Entropic Priors

Francesco PALMIERI, Domenico CIUONZO
Dipartimento di Ingegneria dell'Informazione, Seconda Università di Napoli, Italy

Corresponding Author: Dipartimento di Ingegneria dell'Informazione, Seconda Università di Napoli, via Roma 29, 81031 Aversa (CE), Italy; francesco.palmieri@unina2.it.

Abstract. In classification problems, lack of knowledge of the prior distribution may make the application of Bayes' rule inadequate. Uniform or arbitrary priors may provide classification answers that, even in simple examples, end up contradicting our common sense about the problem. Entropic priors, obtained via the maximum entropy principle, seem to provide a much better answer; they are easily derived and can be applied to classification tasks when no more than the likelihood functions are available. In this paper we present an application example in which the use of entropic priors is compared with the results of applying Dempster-Shafer theory.

Keywords. Bayesian Networks, Artificial Intelligence, Propagation of Belief, Data Fusion

Introduction

In providing solutions to data fusion problems we often have to deal with incomplete data. More specifically, even when the likelihood functions are available, lack of knowledge of the priors may strongly affect the results of our classifications. The classical use of uniform priors, which can be traced back to the work of Laplace [1], may prove inappropriate in many applications [17][12]. The basic question is: what information about the priors is carried by the likelihoods alone? Or better: what prior distribution, when used in the Bayes fusion formula, ensures that no information beyond that contained in the likelihoods is injected into the problem? Various attempts have appeared in the literature to obtain such a least informative solution as an alternative to the inadequacy of uniform, or arbitrary, priors [7][8][9][10]. Evidence theory [11] and alternative theories to probability [17] have been developed in the attempt to provide answers when Bayes' theorem seems to lead to solutions that are inadequate, or contradictory with respect to our common sense. The Dempster-Shafer approach to data fusion [13][14][15] is becoming very popular in some domain areas [12][17], and to many it seems a promising alternative to standard Bayesian approaches. We do not necessarily make classical probability theory our flagship, but we try to fill our lack of knowledge about the model within the context of classical information theory [3]. More specifically, we show that, when prior probabilities are not available, global maximization of the entropy yields a solution for the priors that is easy to compute and that, when applied to application scenarios, seems to fix the inadequacies of Bayes' rule based on uniform or arbitrary priors. Entropy maximization has a long history [4][5][6] and entropic priors [7][8][9][10] have been advocated mostly within the context of theoretical physics. To our knowledge, however, their full potential has not yet been exploited in the data fusion context. In this paper we derive the entropic priors and apply them to an example taken from [12] and [17], where Dempster-Shafer (DS) theory was applied to a target classification problem. Our application of the maximum entropy priors shows strikingly similar results, providing good common-sense answers to a problem where Bayes' rule with uniform priors is clearly inadequate.

1. Bayes Data Fusion

In classification, or pattern recognition, problems the typical scenario is one in which a sequence of N vectors of random attributes

$X = \{X[1], X[2], \ldots, X[N]\}, \qquad X[n] \in \mathcal{X}_n, \quad n = 1, \ldots, N,$  (1)

is directly linked to a set of discrete labels

$S = \{S[1], S[2], \ldots, S[M]\}, \qquad S[m] \in \mathcal{S}_m, \quad m = 1, \ldots, M.$  (2)

Each vector X[n] is a D-dimensional random vector that contains continuous or discrete attributes. They may represent the results of measurements or discrete information coming from other subsystems. The classification consists in associating the measurements to the labels, which may represent different classes, or other discrete attributes. More compactly,

$(X, S) \in \mathcal{X}^N \times \mathcal{S}^M, \qquad \mathcal{X}^N = \mathcal{X}_1 \times \cdots \times \mathcal{X}_N, \qquad \mathcal{S}^M = \mathcal{S}_1 \times \cdots \times \mathcal{S}_M.$  (3)

The classification problem consists in estimating S from a realization

$X = x = \{x[1], x[2], \ldots, x[N]\}.$  (4)

Detailed inference about S could be obtained if we were able to compute, for each $s \in \mathcal{S}$, the a posteriori probabilities via application of Bayes' theorem

$\Pr\{S = s \mid X = x\} = \frac{f(x \mid S = s)\,\Pr\{S = s\}}{f(x)},$  (5)

where $f(x \mid S = s)$ are the likelihood functions and $\Pr\{S = s\}$ are the a priori probabilities (priors). In many practical cases the system designer has the likelihood functions available, since they represent the data generative model, but is not sure about the priors. Incorrect use of the priors may strongly affect the results of the classification, and this has often cast doubts on the effectiveness of using Bayes' theorem [17]. Assuming uniform priors is very common in statistical methods and appears reasonable when such information is not available; after all, the well-known maximum likelihood method is equivalent to using uniform priors. Unfortunately, many cases have been shown in the literature in which the use of uniform priors produces results that contradict our intuitive understanding of the problem [12][15][17]. The assumption of uniform priors may sometimes be too strong, or completely arbitrary.
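As a concrete illustration of (5), the following minimal Python sketch evaluates the posterior for a single scalar observation under two zero-mean Gaussian likelihoods; the two-class setup, the variances and the observed value are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def bayes_posterior(x, likelihoods, priors):
    """Eq. (5): posterior proportional to likelihood times prior, then normalized."""
    joint = np.array([f(x) * p for f, p in zip(likelihoods, priors)])
    return joint / joint.sum()

# Hypothetical two-class example with scalar zero-mean Gaussian likelihoods.
likelihoods = [lambda x: norm.pdf(x, loc=0.0, scale=1.0),   # class 1
               lambda x: norm.pdf(x, loc=0.0, scale=3.0)]   # class 2
uniform_priors = [0.5, 0.5]        # Laplace's choice when nothing else is known
print(bayes_posterior(0.2, likelihoods, uniform_priors))    # favours the narrower class
```

Replacing uniform_priors with any other vector changes the answer, which is precisely the sensitivity to the choice of priors discussed above.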

2. Entropic Priors

We approach the problem by searching for the prior distribution that, given the likelihood functions, maximizes the joint entropy [3]

$H(X, S) = \sum_{s \in \mathcal{S}} \int_{x \in \mathcal{X}} f(x \mid S = s)\,\Pr\{S = s\} \log \frac{1}{f(x \mid S = s)\,\Pr\{S = s\}}\, dx.$  (6)

The rationale behind this choice is that entropy effectively represents the degree of uncertainty about the model [4][5][6] and appears to be the natural choice when no further hypotheses can be injected into the system. Maximum entropy methods have been used extensively in the literature with great success in many application areas. Well known is the exponential solution to entropy maximization [3][4] when constraints are imposed in the form of moments. We focus here on the classification problem and point to a solution in which no more than the likelihood functions are needed.

Proposition 1 (Entropic Priors): The prior distribution that maximizes the joint entropy H(X, S) for given likelihoods $f(x \mid S = s)$, $s \in \mathcal{S}$, is

$\Pr_E\{S = s\} = \frac{e^{H(X \mid S = s)}}{\sum_{y \in \mathcal{S}} e^{H(X \mid S = y)}},$  (7)

where $H(X \mid S = s)$ is the conditional entropy

$H(X \mid S = s) = -\int_{x \in \mathcal{X}} f(x \mid S = s) \log \left[ f(x \mid S = s) \right] dx.$  (8)

Entropic priors have been proposed in different forms [7][8][9][10][16], mostly within the context of theoretical physics. In [7][9][10] constraints on Fisher's information matrix are also imposed. Our derivation differs from Neumann's and Skilling's [8] because we maximize the joint entropy, while they maximize the prior entropy with a constraint on the average likelihood entropy. Their result contains a scale parameter that needs to be resolved by other means, while in our case we obtain a closed-form solution that is totally self-contained. In any case, to our knowledge, the real power of entropic priors has not yet found full application in data fusion problems. For completeness we report here our proof.

Proof: For the maximization of H(X, S), the Lagrange function is

$L = H(X, S) - \lambda \Big( \sum_s P(s) - 1 \Big) = H(X \mid S) + H(S) - \lambda \Big( \sum_s P(s) - 1 \Big) = \sum_s H(X \mid s) P(s) - \sum_s P(s) \log P(s) - \lambda \Big( \sum_s P(s) - 1 \Big),$  (9)

where $P(s) = \Pr\{S = s\}$, $H(X \mid s) = H(X \mid S = s)$, and the only constraint is $\sum_s P(s) = 1$. Taking the derivative with respect to P(s) we get

$\frac{\partial L}{\partial P(s)} = H(X \mid s) - \log P(s) - 1 - \lambda.$  (10)

Setting the derivative to zero we get

$P(s) = e^{H(X \mid s) - 1 - \lambda}.$  (11)

By imposing the constraint we get

$e^{1 + \lambda} = \sum_y e^{H(X \mid y)},$  (12)

which gives the solution (7). The joint, conditional and marginal entropies with entropic priors are, respectively,

$H_E(X, S) = \log C, \qquad H_E(X \mid S) = \frac{1}{C} \sum_s H(X \mid s)\, e^{H(X \mid s)}, \qquad H_E(S) = \log C - \frac{1}{C} \sum_s H(X \mid s)\, e^{H(X \mid s)},$  (13)

where $C = \sum_s e^{H(X \mid s)}$. Note that the entropies are expressed in nats for easier notation, but the base of the logarithm is totally irrelevant. Entropic priors can be computed numerically when we have empirical or discrete distributions, or analytically for some canonical densities. In the example that follows we compute the entropic priors for Gaussian likelihoods of scalar variables; a more complete account for other densities will be reported elsewhere. The final fusion formula with entropic priors is then

$\Pr_E\{S = s \mid X = x\} = \frac{f(x \mid S = s)\, e^{H(X \mid S = s)}}{\sum_{y \in \mathcal{S}} f(x \mid S = y)\, e^{H(X \mid S = y)}}.$  (14)

2.1. A geometric interpretation

Entropic priors have a nice geometric interpretation. Recall that the volume of the typical set for a sequence of n i.i.d. random variables $X_1, X_2, \ldots, X_n$ is $A_n(X) = e^{n H(X)}$, where H(X) is the common entropy of the $X_i$. In our case the entropic priors are just the relative volumes of the conditional typical sets, i.e.

$\Pr_E\{S = s\} = \frac{A_n(X \mid S = s)}{\sum_{y \in \mathcal{S}} A_n(X \mid S = y)}.$  (15)

Likelihoods with larger volumes have the larger entropic priors.

3. Application Example

We consider here an application with a single scalar attribute (D = 1) observed N times independently, where the class to be estimated, $s \in \{c_1, c_2, c_3\}$, is constant over the whole set (M = 1). The example is extracted from [12] and [17], where the attribute x is an acceleration parameter and the classes are naval targets, $\{c_1, c_2, c_3\} = \{\text{slow ship}, \text{fast ship}, \text{fast raft}\}$. The various crafts are assumed to have accelerations in the ranges $[-0.5g, 0.5g]$, $[-g, g]$ and $[-1.5g, 1.5g]$, respectively. Following the approach taken in [12] and [17], the accelerations in the three cases are assumed to be zero-mean Gaussian with $\sigma_1 = g/6$, $\sigma_2 = g/3$ and $\sigma_3 = g/2$. The likelihoods are then

$f(x[1], \ldots, x[N] \mid c_j) = \prod_{i=1}^{N} f(x[i] \mid c_j) = \prod_{i=1}^{N} \mathcal{N}(x[i]; 0, \sigma_j), \qquad j = 1, 2, 3.$  (16)
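Before turning to the comparison, here is a minimal numerical sketch (assuming g = 1 and an illustrative discretization grid) of how the entropic priors can be computed for the likelihoods in (16): it evaluates the conditional entropies (8) by a Riemann sum, forms the priors (7), and applies the fusion rule (14) to a single observation. For a single sample the priors come out proportional to the standard deviations, consistent with the closed form derived in Section 3.3.

```python
import numpy as np
from scipy.stats import norm

g = 1.0                                   # acceleration unit (illustrative)
sigmas = np.array([g/6, g/3, g/2])        # slow ship, fast ship, fast raft
x_grid = np.linspace(-4*g, 4*g, 4001)     # discretization of the scalar attribute
dx = x_grid[1] - x_grid[0]

# Likelihoods f(x | c_j) evaluated on the grid.
lik = np.array([norm.pdf(x_grid, 0.0, s) for s in sigmas])

# Eq. (8): conditional (differential) entropies by a Riemann sum.
H = np.array([-np.sum(f * np.log(np.clip(f, 1e-300, None))) * dx for f in lik])

# Eq. (7): entropic priors, computed in a numerically stable way.
priors = np.exp(H - H.max())
priors /= priors.sum()
print("entropic priors (numerical):", priors)
print("sigma_j / sum(sigma_l)     :", sigmas / sigmas.sum())   # single-sample closed form

# Eq. (14): entropic-prior fusion for one observation (here x = 0.1 g, illustrative).
x = 0.1 * g
post = norm.pdf(x, 0.0, sigmas) * priors
print("posterior:", post / post.sum())
```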

We compare the results obtained with the classical Bayesian approach based on uniform priors, with those obtained from the application of Dempster-Shafer theory, and with those resulting from the entropic priors.

3.1. Classical Bayesian Approach

If the priors are assumed to be uniform (Laplace's principle [1]), the posterior probabilities are

$\Pr\{S = c_j \mid x[1], \ldots, x[N]\} = \frac{\prod_{i=1}^{N} \mathcal{N}(x[i]; 0, \sigma_j)}{\sum_{l=1}^{3} \prod_{i=1}^{N} \mathcal{N}(x[i]; 0, \sigma_l)}.$  (17)

3.2. Dempster-Shafer Approach

The DS approach to posterior calculation can be summarized by the following four-step procedure. A complete description of the DS technique is beyond the scope of this paper; we report only the essential steps in support of the comparison outlined in the following section (a compact implementation sketch of these four steps is given below).

1. For every likelihood $f(x[i] \mid c_j)$ compute the respective plausibility function $Pl_{X[i]}(x[i] \mid c_j)$ corresponding to the q_LC isopignistic belief density function [12][15], which in the case of a Gaussian density $\mathcal{N}(x[i]; 0, \sigma_j)$ is

$Pl_{X[i]}(x[i] \mid c_j) = \frac{2 y_j}{\sqrt{2\pi}}\, e^{-y_j^2/2} + \operatorname{erfc}\!\left(\frac{y_j}{\sqrt{2}}\right), \qquad y_j = \frac{|x[i]|}{\sigma_j}.$

2. For every plausibility function $Pl_{X[i]}(x[i] \mid c_j)$ derive the conditional Basic Belief Assignment (BBA) $m^{\mathcal{S}}(A \mid x[i])$ of each possible subset $A \subseteq \mathcal{S}$ (the collection of all these subsets is the power-set of $\mathcal{S}$) through the Generalized Bayes Theorem [13]:

$m^{\mathcal{S}}(A \mid x[i]) = K \prod_{c_j \in A} Pl_{X[i]}(x[i] \mid c_j) \prod_{c_j \in \bar{A}} \left[ 1 - Pl_{X[i]}(x[i] \mid c_j) \right],$

where K is a normalization constant that makes the BBAs sum to one over the whole power-set.

3. Compute the total BBA $m^{\mathcal{S}}(A \mid x[1], \ldots, x[N])$ from all the evidence $m^{\mathcal{S}}(A \mid x[i])$ by the Dempster combination rule (denoted by $\oplus$) [11]:

$m^{\mathcal{S}}(A \mid x[1], \ldots, x[N]) = m^{\mathcal{S}}(A \mid x[1]) \oplus \cdots \oplus m^{\mathcal{S}}(A \mid x[N]).$

4. From the total BBA $m^{\mathcal{S}}(A \mid x[1], \ldots, x[N])$ compute the posterior probabilities $\Pr_{DS}\{S = c_j \mid x[1], \ldots, x[N]\}$ through the pignistic transformation [14]:

$\Pr_{DS}\{S = c_j \mid x[1], \ldots, x[N]\} = \sum_{c_j \in A \subseteq \mathcal{S}} \frac{1}{|A|}\, m^{\mathcal{S}}(A \mid x[1], \ldots, x[N]).$

3.3. Bayesian Approach with Entropic Priors

For Gaussian likelihoods [3]

$H(X \mid S = c_j) = \sum_{i=1}^{N} H(X[i] \mid S = c_j) = \frac{N}{2} \log\left(2 \pi e \sigma_j^2\right),$  (18)

and the entropic priors are

$\Pr_E\{S = c_j\} = \frac{\sigma_j^N}{\sum_{l=1}^{3} \sigma_l^N}.$  (19)
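The sketch below, referenced at the beginning of Section 3.2, implements the four-step DS procedure in Python. The plausibility expression is the one reported in step 1; the Generalized Bayes Theorem, Dempster's rule (with conflict renormalized away) and the pignistic transformation are used in standard textbook forms. The closed-world handling of the empty set, the value g = 1 and the example readings are simplifying assumptions and need not coincide with the exact variant used in [12][17].

```python
from itertools import combinations
import numpy as np
from scipy.special import erfc

classes = ("c1", "c2", "c3")
g = 1.0
sigma = {"c1": g/6, "c2": g/3, "c3": g/2}

def plausibility(x, s):
    """Step 1: isopignistic plausibility of reading x under the zero-mean Gaussian of class s."""
    y = abs(x) / sigma[s]
    return 2.0 * y / np.sqrt(2.0 * np.pi) * np.exp(-y**2 / 2.0) + erfc(y / np.sqrt(2.0))

def gbt_bba(x):
    """Step 2: conditional BBA over the non-empty subsets of S via the Generalized Bayes Theorem."""
    pl = {s: plausibility(x, s) for s in classes}
    m = {}
    for r in range(1, len(classes) + 1):
        for A in combinations(classes, r):
            A = frozenset(A)
            m[A] = np.prod([pl[s] if s in A else 1.0 - pl[s] for s in classes])
    K = sum(m.values())                        # normalization constant
    return {A: v / K for A, v in m.items()}

def dempster(m1, m2):
    """Step 3: Dempster's combination rule; conflicting mass is renormalized away."""
    out, conflict = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                out[C] = out.get(C, 0.0) + a * b
            else:
                conflict += a * b
    return {C: v / (1.0 - conflict) for C, v in out.items()}

def pignistic(m):
    """Step 4: pignistic transformation BetP(c_j) = sum over A containing c_j of m(A)/|A|."""
    betp = {s: 0.0 for s in classes}
    for A, v in m.items():
        for s in A:
            betp[s] += v / len(A)
    return betp

# Usage: fuse a few illustrative acceleration readings (in units of g).
readings = [0.05, -0.1, 0.8]
m_tot = gbt_bba(readings[0])
for x in readings[1:]:
    m_tot = dempster(m_tot, gbt_bba(x))
print(pignistic(m_tot))
```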

Figure 1. Acceleration profile of a specific target (top left); posterior probabilities with varying N up to 34 estimated with: (a) the Bayesian approach with uniform priors (top right); (b) Dempster-Shafer theory (bottom left); (c) the Bayesian approach with entropic priors (bottom right).

With the entropic priors (19), Bayes' rule becomes

$\Pr\{S = c_j \mid x[1], \ldots, x[N]\} = \frac{\sigma_j^N \prod_{i=1}^{N} \mathcal{N}(x[i]; 0, \sigma_j)}{\sum_{l=1}^{3} \sigma_l^N \prod_{i=1}^{N} \mathcal{N}(x[i]; 0, \sigma_l)}.$  (20)

3.4. Simulations

To visualize the results of the three approaches, we have simulated the identification process for a target of type c_3 (fast raft) that follows a specific acceleration profile over 34 time slots, $x = \{x[1], x[2], \ldots, x[34]\}$. In particular, the target moves at constant speed for 15 time slots, then accelerates at g for 2 time slots, decelerates at -g for 2 time slots, and finally moves again at constant speed for the remaining 15 time slots. The acceleration profile is shown in Figure 1 (top left). The posterior probabilities computed with the three approaches for N varying up to 34 are shown in the rest of Figure 1. As the figure shows, the classical Bayesian approach with uniform priors favours, in the first 15 slots, the target with the smallest variance, i.e. class c_1. After the 4 slots of manoeuvre the result changes, because target c_1 cannot reach an acceleration of magnitude g and its probability rapidly falls to zero. Unfortunately, in the following steps, even though class c_3 is initially favoured, in the final 15 slots more confidence is definitely given to class c_2, which is the wrong result.
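The simulation just described is easy to reproduce. The following sketch (assuming g = 1 and that the readings take exactly the nominal values 0 and ±g in the respective slots, with no measurement noise) tracks the uniform-prior posterior (17) and the entropic-prior posterior (20) slot by slot in the log domain.

```python
import numpy as np
from scipy.stats import norm

g = 1.0
sigmas = np.array([g/6, g/3, g/2])        # c1 slow ship, c2 fast ship, c3 fast raft

# Acceleration profile of the simulated fast raft: 15 constant-speed slots,
# 2 slots at +g, 2 slots at -g, 15 more constant-speed slots (34 in total).
profile = np.concatenate([np.zeros(15), g*np.ones(2), -g*np.ones(2), np.zeros(15)])

log_lik = np.zeros(3)                     # running sums of log-likelihoods per class
for n, x in enumerate(profile, start=1):
    log_lik += norm.logpdf(x, 0.0, sigmas)
    # Eq. (17): uniform priors.
    pu = np.exp(log_lik - log_lik.max()); pu /= pu.sum()
    # Eq. (20): the entropic priors contribute an extra N*log(sigma_j) term.
    le = log_lik + n * np.log(sigmas)
    pe = np.exp(le - le.max()); pe /= pe.sum()
    print(f"N={n:2d}  uniform={pu.round(3)}  entropic={pe.round(3)}")
```

With these idealized zero readings, the sigma_j^N factor in (20) exactly cancels the Gaussian normalization accumulated during the constant-speed slots, so the entropic-prior posterior does not drift while no informative data arrive, whereas the uniform-prior posterior moves towards c_1; this matches the qualitative behaviour discussed in the text and in Figure 1.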

With the DS approach the results are quite different. In the first 15 slots of constant speed the posterior probabilities remain equal, correctly reflecting the ambiguity of the observation. During the 4 slots of manoeuvre, class c_3 becomes the favourite, for the same reason given for the classical Bayesian approach. In the final 15 slots the constant speed does not make the probabilities evolve any further, correctly reflecting the lack of new information and thus giving the correct result. The Bayesian approach with entropic priors, simply by using different priors, reflects all the common sense that seems to underlie the (complicated) DS approach: the result is strikingly similar and also gives the correct answer. We have carried out other simulations whose results seem to consistently confirm the validity of the maximum entropy idea.

4. Conclusions

In this paper we have shown that, in classification problems, our lack of knowledge about the prior distribution can be effectively replaced by a maximum entropy solution (entropic priors). The entropy method, very popular in many application areas, has yet to show its potential in classification, where we seem to find strikingly effective results in many scenarios. We have derived the entropic prior distribution for a multi-class problem and have applied it to a simple identification example based on a set of measurements of a scalar variable with Gaussian likelihoods. The results are preliminary, but they are very promising, also because the method seems to provide answers that are consistent with our common sense, the same common sense that has motivated the development of alternatives to probability theory.

References

[1] P. S. Laplace, Théorie analytique des probabilités, Veuve Courcier, Paris, 1812.
[2] J. Pearl, Probabilistic Reasoning in Intelligent Systems, 2nd ed., Morgan Kaufmann, San Francisco, 1988.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), 2nd ed., Wiley-Interscience, 2006.
[4] E. T. Jaynes, Information Theory and Statistical Mechanics, Physical Review, Vol. 106, No. 4, pp. 620-630, American Physical Society, May 1957.
[5] E. T. Jaynes, Information Theory and Statistical Mechanics. II, Physical Review, Vol. 108, No. 2, pp. 171-190, American Physical Society, Oct. 1957.
[6] E. T. Jaynes, Probability Theory: The Logic of Science, Cambridge University Press, 2003.
[7] A. Caticha and R. Preuss, Maximum Entropy and Bayesian Data Analysis: Entropic Priors, Physical Review E, Vol. 70, 2004.
[8] T. Neumann, Bayesian Inference Featuring Entropic Priors, Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 27th International Workshop, AIP Conference Proceedings, Vol. 954, No. 1, AIP, 2007.
[9] A. Caticha, Maximum entropy, fluctuations and priors, Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 20th International Workshop, AIP Conference Proceedings, Vol. 568, 2001.
[10] C. C. Rodriguez, Entropic priors for discrete probabilistic networks and for mixtures of Gaussians models, Bayesian Inference and Maximum Entropy Methods in Science and Engineering, AIP Conference Proceedings, Vol. 617, 2002.
[11] G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976.
[12] B. Ristic and P. Smets, Target classification approach based on the belief function theory, IEEE Transactions on Aerospace and Electronic Systems, Vol. 41, No. 2, April 2005.
[13] P. Smets, Belief functions: The disjunctive rule of combination and the generalized Bayesian theorem, International Journal of Approximate Reasoning, Vol. 9, No. 1, pp. 1-35, August 1993.
[14] P. Smets, Decision making in the TBM: the necessity of the pignistic transformation, International Journal of Approximate Reasoning, Vol. 38, No. 2, February 2005.
[15] B. Ristic and P. Smets, Belief function theory on the continuous space with an application to model based classification, Proceedings of the 10th International Conference IPMU, July 2004.
[16] A. Giffin and A. Caticha, Updating Probabilities with Data and Moments, Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 27th International Workshop, AIP Conference Proceedings, Vol. 954, No. 1, AIP, 2007.
[17] A. Benavoli, L. Chisci, B. Ristic, A. Farina and A. Graziano, Reasoning under uncertainty: from Bayes to Valuation Based Systems. Applications to target classification and threat evaluation, Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Italy, and Selex Sistemi Integrati, Sept. 2007.
