Introduction to Estimation and Data Fusion Part I: Probability, State and Information Models


1 Introduction to Estimation and Data Fusion Part I: Probability, State and Information Models Hugh Durrant-Whyte ARC Centre of Excellence for Autonomous Systems Australian Centre for Field Robotics The University of Sydney Introduction to Estimation and Data Fusion Slide 1

2 Introduction Estimation is the problem of determining the value of an unknown quantity from one or more observations. Data fusion is the process of combining information from a number of different sources to provide a robust and complete description of an environment or process of interest. This course provides a practical introduction to estimation and data fusion methods. The focus is on mathematical, probabilistic and decision-theoretic methods. The course is a cut-down version of a full five-day course and includes computer-based laboratories which provide the opportunity to implement and evaluate algorithms.

3 Rules of Engagement This course is developed to enable you to get up to speed with basic methods as quickly as possible. It is your course: ask questions, make suggestions, etc. If you do not understand something, please ask; if something is already well known to you, ask me to move on. Use the labs effectively: these are the best means of understanding the mathematics. Introduction to Estimation and Data Fusion Slide 3

4 Course Content Probabilistic Models Probabilistic Methods Data Fusion with Bayes Theorem Information Measures and Information Fusion State models and noise Estimation The Linear Kalman Filter The Extended Kalman Filter Localisation and Map Building Probabilistic (Monte Carlo) Filters Data Fusion The Multi-Sensor Kalman Filter The Inverse Covariance Filter Decentralised Data Fusion Methods Introduction to Estimation and Data Fusion Slide 4

5 Laboratory Sessions Laboratory 1: Probabilistic and Information Data Fusion Methods Laboratory 2: The Linear Kalman Filter (tracking) Laboratory 3: The extended Kalman Filter (localisation) Laboratory 4: The SLAM algorithm Laboratory 5: Particle Filters Laboratory 6: Multi-sensor multi-target tracking Laboratory 7: Decentralised tracking and Sensor Networks Introduction to Estimation and Data Fusion Slide 5

6 Recommended Reference Material Maybeck: the best practical book in the field. Brown and Hwang: good introductory text. Bar-Shalom: issues of data association and tracking. Gelb: useful components. Grewal and Andrews: advanced elements. Data fusion: Blackman; Waltz and Llinas; Bar-Shalom. DDF methods: books by Manyika and Mutambara, papers in the open literature. These course notes. Papers included as part of this course. Introduction to Estimation and Data Fusion Slide 6

7 Probabilistic Models Introduction to Estimation and Data Fusion Slide 7

8 Probabilistic Models Uncertainty lies at the heart of all descriptions of the sensing and data fusion process. Probabilistic models provide a powerful and consistent means of describing uncertainty and lead naturally into ideas of information fusion and decision making. Introduction to Estimation and Data Fusion Slide 8

9 Probabilistic Models Familiarity with essential probability theory is assumed. A probability density function (pdf) $P_y(\cdot)$ is defined on a random variable $y$, generally written as $P_y(y)$ or simply $P(y)$. The random variable may be a scalar or vector quantity, and may be either discrete or continuous in measure. The pdf is a (probabilistic) model of the quantity $y$; observation or state. The pdf $P(y)$ is considered valid if: 1. it is positive, $P(y) \geq 0$ for all $y$, and 2. it sums (integrates) to a total probability of 1, $\int_y P(y)\,dy = 1$. The joint distribution $P_{xy}(x,y)$ is defined in a similar manner. Introduction to Estimation and Data Fusion Slide 9

10 Joint Probabilistic Models Integrating the pdf $P_{xy}(x,y)$ over the variable $x$ gives the marginal pdf $P_y(y)$ as $P_y(y) = \int P_{xy}(x,y)\,dx$, and similarly integrating over $y$ gives the marginal pdf $P_x(x)$. The joint pdf over $n$ variables, $P(x_1, \dots, x_n)$, may also be defined with analogous properties to the joint pdf of two variables. The conditional pdf $P(x|y)$ is defined by $P(x|y) = \frac{P(x,y)}{P(y)}$ and is a pdf on $x$ for each value of $y$; $P(x|y)$ is not a pdf on $y$. Introduction to Estimation and Data Fusion Slide 10
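For discrete random variables these operations reduce to sums over a joint table. A minimal MATLAB sketch, using an arbitrary example joint distribution (not one from the course):

% Marginalisation and conditioning on a discrete joint distribution.
% Rows index x, columns index y; the entries sum to one.
Pxy = [0.10 0.20;
       0.30 0.15;
       0.05 0.20];

Px = sum(Pxy, 2);         % marginal P(x): sum over y
Py = sum(Pxy, 1);         % marginal P(y): sum over x

% conditional P(x|y) = P(x,y)/P(y); each column then sums to one
Px_given_y = Pxy ./ Py;   % uses implicit expansion (MATLAB R2016b or later)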

11 The Total Probability Theorem The chain-rule can be used to expand a joint pdf in terms of conditional and marginal distributions: $P(x,y) = P(x|y)P(y)$. The chain-rule can be extended to any number of variables: $P(x_1,\dots,x_n) = P(x_1|x_2,\dots,x_n)\cdots P(x_{n-1}|x_n)P(x_n)$. The expansion may be taken in any convenient order. The Total Probability Theorem: $P_y(y) = \int P_{y|x}(y|x)P_x(x)\,dx$. The total probability in a state $y$ can be obtained by considering the ways in which $y$ can occur given that the state $x$ takes a specific value (this is encoded in $P_{y|x}(y|x)$), weighted by the probability that each of these values of $x$ is true (encoded in $P_x(x)$). Introduction to Estimation and Data Fusion Slide 11

12 Independence and Conditional Independence If knowledge of $y$ provides no information about $x$, then $x$ and $y$ are independent: $P(x|y) = P(x)$, or equivalently $P(x,y) = P(x)P(y)$. Conditional independence: given three random variables $x$, $y$ and $z$, if knowledge of the value of $z$ makes the value of $x$ independent of the value of $y$, then $P(x|y,z) = P(x|z)$. This holds if $z$ indirectly contains all the information contributed by $y$ to the value of $x$ (for example). It implies the intuitive result $P(x,y|z) = P(x|z)P(y|z)$. Introduction to Estimation and Data Fusion Slide 12

13 Independence and Conditional Independence Conditional independence underlies many data fusion algorithms. Consider the state of a system $x$ and two observations of this state, $z_1$ and $z_2$. It should be clear that the two observations are not independent, $P(z_1,z_2) \neq P(z_1)P(z_2)$, as they must both depend on the common state $x$. However, the observations usually are conditionally independent given the state: $P(z_1,z_2|x) = P(z_1|x)P(z_2|x)$. For data fusion purposes this is a good definition of state. Introduction to Estimation and Data Fusion Slide 13

14 Bayes Theorem Consider two random variables $x$ and $z$ on which is defined a joint probability density function $P(x,z)$. The chain-rule of conditional probabilities can be used to expand this density function in two ways: $P(x,z) = P(x|z)P(z) = P(z|x)P(x)$. Bayes theorem is obtained as $P(x|z) = \frac{P(z|x)P(x)}{P(z)}$. This computes the posterior $P(x|z)$ given the prior $P(x)$ and an observation model $P(z|x)$. $P(z|x)$ takes the role of a sensor model. First, building a sensor model: fix $x = x$ and ask what pdf on $z$ results. Then, using a sensor model: observe $z = z$ and ask what the pdf on $x$ is. Practically, $P(z|x)$ is constructed as a function of both variables (or a matrix in discrete form). For each fixed value of $x$, a distribution in $z$ is defined; therefore as $x$ varies, a family of distributions in $z$ is created. Introduction to Estimation and Data Fusion Slide 14

15 Bayes Theorem Example I A continuous-valued state $x$ (the range to a target, for example), and an observation $z$ of this state. A Gaussian observation model, a function of both $z$ and $x$: $P(z|x) = \frac{1}{\sqrt{2\pi\sigma_z^2}}\exp\left(-\frac{1}{2}\frac{(z-x)^2}{\sigma_z^2}\right)$. Building the model: the state is fixed, $x = x$, and the distribution is a function of $z$. Using the model: an observation is made, $z = z$, and the distribution is a function of $x$. Prior: $P(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{1}{2}\frac{(x-x_p)^2}{\sigma_x^2}\right)$. Introduction to Estimation and Data Fusion Slide 15

16 Bayes Theorem Example I Posterior after taking an observation: $P(x|z) = C\,\frac{1}{\sqrt{2\pi\sigma_z^2}}\exp\left(-\frac{1}{2}\frac{(x-z)^2}{\sigma_z^2}\right)\frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left(-\frac{1}{2}\frac{(x-x_p)^2}{\sigma_x^2}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\frac{(x-\bar{x})^2}{\sigma^2}\right)$, where $\bar{x} = \frac{\sigma_x^2}{\sigma_x^2+\sigma_z^2}z + \frac{\sigma_z^2}{\sigma_x^2+\sigma_z^2}x_p$ and $\sigma^2 = \frac{\sigma_z^2\sigma_x^2}{\sigma_z^2+\sigma_x^2} = \left(\frac{1}{\sigma_z^2}+\frac{1}{\sigma_x^2}\right)^{-1}$. Introduction to Estimation and Data Fusion Slide 16
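The fused mean and variance can be checked numerically by multiplying the two densities on a grid and normalising. A minimal MATLAB sketch; the numerical values of the means and variances below are arbitrary examples:

% Numerical check of the Gaussian product (Bayes update) formulas.
sz = 2.0;  sx = 3.0;           % observation and prior standard deviations
z  = 4.0;  xp = 1.0;           % observed value and prior mean
x  = linspace(-20, 20, 4001);  dx = x(2) - x(1);

likelihood = exp(-0.5*(z - x).^2/sz^2) / sqrt(2*pi*sz^2);
prior      = exp(-0.5*(x - xp).^2/sx^2) / sqrt(2*pi*sx^2);

posterior = likelihood .* prior;
posterior = posterior / (sum(posterior)*dx);             % normalise

xbar_num = sum(x .* posterior) * dx;                     % numerical mean
var_num  = sum((x - xbar_num).^2 .* posterior) * dx;     % numerical variance

xbar = (sx^2*z + sz^2*xp) / (sx^2 + sz^2);               % analytic mean
s2   = (sz^2*sx^2) / (sz^2 + sx^2);                      % analytic variance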

17 Bayes Theorem Example IIa A single state $x$ which can take on one of three values: $x_1$: $x$ is a type 1 target; $x_2$: $x$ is a type 2 target; $x_3$: no visible target. A single sensor observes $x$ and returns three possible values: $z_1$: observation of a type 1 target; $z_2$: observation of a type 2 target; $z_3$: no target observed. Introduction to Estimation and Data Fusion Slide 17

18 The sensor model is described by the likelihood matrix $P_1(z|x)$ (rows indexed by the true state $x$, columns by the observation $z$):

         z_1    z_2    z_3
x_1     0.45   0.45   0.10
x_2     0.45   0.45   0.10
x_3     0.15   0.15   0.70

The likelihood matrix is a function of both $x$ and $z$. For a fixed state it describes the probability of a particular observation being made (the rows of the matrix). For a given observation it describes a probability distribution over the values of the true state (the columns) and is then the likelihood function $\Lambda(x)$. Introduction to Estimation and Data Fusion Slide 18

19 Bayes Theorem Example IIb The posterior distribution of the true state $x$ after making an observation $z = z_i$ is given by $P(x|z_i) = \alpha P_1(z_i|x)P(x)$, where $\alpha$ is a normalizing constant so that the sum, over $x$, of the posteriors is 1. Assume a non-informative prior: $P(x) = (0.333, 0.333, 0.333)$. Observe $z = z_1$; then the likelihood is $P_1(z_1|x) = (0.45, 0.45, 0.15)$ and the posterior is $P(x|z_1) = (0.4286, 0.4286, 0.1429)$. Make this posterior the new prior and again observe $z = z_1$; then $P(x|z_1) = \alpha P_1(z_1|x)P(x) = \alpha\,(0.45, 0.45, 0.15)\cdot(0.4286, 0.4286, 0.1429) = (0.4737, 0.4737, 0.0526)$. Note the result is to increase the probability in both the type 1 and type 2 targets at the expense of the no-target hypothesis. Introduction to Estimation and Data Fusion Slide 19
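This update can be implemented in a few lines. A minimal MATLAB sketch using the sensor 1 likelihood matrix given above:

% Recursive Bayes update with the sensor 1 likelihood matrix.
% Rows of P1 are states x1..x3, columns are observations z1..z3.
P1 = [0.45 0.45 0.10;
      0.45 0.45 0.10;
      0.15 0.15 0.70];

prior = [1/3; 1/3; 1/3];                    % non-informative prior on x

z = 1;                                      % observe z = z1
posterior = P1(:, z) .* prior;              % likelihood column times prior
posterior = posterior / sum(posterior)      % (0.4286, 0.4286, 0.1429)

% make this posterior the new prior and observe z1 again
posterior2 = P1(:, z) .* posterior;
posterior2 = posterior2 / sum(posterior2)   % (0.4737, 0.4737, 0.0526)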

20 Data Fusion using Bayes Theorem Consider the set of observations $Z^n = \{z_1 \in \mathcal{Z}_1, \dots, z_n \in \mathcal{Z}_n\}$. The posterior distribution given the observation set is, naively, $P(x|Z^n) = \frac{P(Z^n|x)P(x)}{P(Z^n)} = \frac{P(z_1,\dots,z_n|x)P(x)}{P(z_1,\dots,z_n)}$. This is not easy to evaluate directly as the joint distribution $P(z_1,\dots,z_n|x)$ must be known completely. Assume conditional independence: $P(z_1,\dots,z_n|x) = P(z_1|x)\cdots P(z_n|x) = \prod_{i=1}^n P(z_i|x)$. Introduction to Estimation and Data Fusion Slide 20

21 Data Fusion using Bayes Theorem So the update becomes $P(x|Z^n) = [P(Z^n)]^{-1}P(x)\prod_{i=1}^n P(z_i|x)$. This is the independent likelihood pool. In practice, the conditional probabilities $P(z_i|x)$ are stored a priori as functions of both $z_i$ and $x$. When an observation sequence $Z^n = \{z_1, z_2, \dots, z_n\}$ is made, the observed values are instantiated in this probability distribution and the likelihood functions $\Lambda_i(x)$ are constructed. Introduction to Estimation and Data Fusion Slide 21

22 The Independent Likelihood Pool [Figure: the independent likelihood pool. Each sensor $i$ instantiates its model $P(z_i|x)$ with the observation $z_i$ to form a likelihood $\Lambda_i(x)$; a central processor combines these with the prior as $P(x|Z^n) = C\,P(x)\prod_{i=1}^n \Lambda_i(x)$.] Introduction to Estimation and Data Fusion Slide 22

23 Data Fusion using Bayes Theorem The effectiveness of fusion relies on the assumption that the information obtained from different information sources is independent when conditioned on the true underlying state of the world. Clearly $P(z_1,\dots,z_n) \neq P(z_1)\cdots P(z_n)$, as each piece of information depends on a common underlying state $x \in \mathcal{X}$. Conversely, it is generally quite reasonable to assume that the underlying state is the only thing in common between information sources, and so once the state has been specified it is correspondingly reasonable to assume that the information gathered is conditionally independent given this state. Introduction to Estimation and Data Fusion Slide 23

24 Data Fusion using Bayes Theorem Example Ia A second sensor makes the same three observations as the first sensor, but its likelihood matrix $P_2(z_2|x)$ is described by

         z_1    z_2    z_3
x_1     0.45   0.10   0.45
x_2     0.10   0.45   0.45
x_3     0.45   0.45   0.10

Whereas the first sensor was good at detecting targets but not at distinguishing between different target types, this second sensor has poor overall detection probabilities but good target discrimination capabilities. With a uniform prior, observe $z = z_1$; then the posterior is (the first column of the likelihood matrix) $P(x|z_1) = (0.45, 0.1, 0.45)$. Introduction to Estimation and Data Fusion Slide 24

25 Data Fusion using Bayes Theorem Example Ib It makes sense to combine the information from both sensors to provide both good detection and good discrimination capabilities. The overall likelihood function for the combined system is $P_{12}(z_1, z_2|x) = P_1(z_1|x)P_2(z_2|x)$. For $x = x_1$ (rows indexed by $z_1$, columns by $z_2$):

x = x_1        z_2 = z_1   z_2 = z_2   z_2 = z_3
z_1 = z_1       0.2025      0.0450      0.2025
z_1 = z_2       0.2025      0.0450      0.2025
z_1 = z_3       0.0450      0.0100      0.0450

Introduction to Estimation and Data Fusion Slide 25

26 Continuing for the remaining states:

x = x_2        z_2 = z_1   z_2 = z_2   z_2 = z_3
z_1 = z_1       0.0450      0.2025      0.2025
z_1 = z_2       0.0450      0.2025      0.2025
z_1 = z_3       0.0100      0.0450      0.0450

x = x_3        z_2 = z_1   z_2 = z_2   z_2 = z_3
z_1 = z_1       0.0675      0.0675      0.0150
z_1 = z_2       0.0675      0.0675      0.0150
z_1 = z_3       0.3150      0.3150      0.0700

Introduction to Estimation and Data Fusion Slide 26

27 Data Fusion using Bayes Theorem Example Ic For each state $x = \{x_1, x_2, x_3\}$, each sub-matrix represents the joint probability of the pair of observations $\{z_i \in \mathcal{Z}_1, z_j \in \mathcal{Z}_2\}$ being made. Note that each sub-matrix sums to one and so is indeed a valid pdf. Observe $z_1 = z_1$ and $z_2 = z_1$ and assume a uniform prior; then the posterior is $P(x|z_1, z_1) = \alpha P_{12}(z_1, z_1|x) = \alpha P_1(z_1|x)P_2(z_1|x) = \alpha\,(0.45, 0.45, 0.15)\cdot(0.45, 0.1, 0.45) = (0.6429, 0.1429, 0.2143)$. Sensor 2 adds substantial target discrimination power at the cost of a slight loss of detection performance for the same number of observations. Introduction to Estimation and Data Fusion Slide 27
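The same fusion can be written directly as a product of likelihood columns. A minimal MATLAB sketch using the two likelihood matrices assumed above:

% Fusing one observation from each sensor, with a uniform prior.
P1 = [0.45 0.45 0.10; 0.45 0.45 0.10; 0.15 0.15 0.70];   % sensor 1
P2 = [0.45 0.10 0.45; 0.10 0.45 0.45; 0.45 0.45 0.10];   % sensor 2

prior = [1/3; 1/3; 1/3];
z1 = 1;  z2 = 1;                          % both sensors report z1

posterior = P1(:, z1) .* P2(:, z2) .* prior;
posterior = posterior / sum(posterior)    % (0.6429, 0.1429, 0.2143)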

28 Data Fusion using Bayes Theorem Example Id Repeating this calculation for each $(z_1, z_2)$ observation pair gives the fused posteriors (rows indexed by the state $x$, columns by the sensor 2 observation):

z_1 = z_1      z_2 = z_1   z_2 = z_2   z_2 = z_3
x_1             0.6429      0.1429      0.4821
x_2             0.1429      0.6429      0.4821
x_3             0.2143      0.2143      0.0357

z_1 = z_2      z_2 = z_1   z_2 = z_2   z_2 = z_3
x_1             0.6429      0.1429      0.4821
x_2             0.1429      0.6429      0.4821
x_3             0.2143      0.2143      0.0357

Introduction to Estimation and Data Fusion Slide 28

29 Continuing:

z_1 = z_3      z_2 = z_1   z_2 = z_2   z_2 = z_3
x_1             0.1216      0.0270      0.2813
x_2             0.0270      0.1216      0.2813
x_3             0.8514      0.8514      0.4375

Introduction to Estimation and Data Fusion Slide 29

30 Data Fusion using Bayes Theorem Example Ie The combined sensor provides substantial improvements in overall system performance. For example, observe $z_1 = z_1$ and $z_2 = z_1$: $P(x|z_1, z_2) = (0.6429, 0.1429, 0.2143)$; target 1 is most likely, as expected. However, observe $z_1 = z_1$ and $z_2 = z_2$: $P(x|z_1, z_2) = (0.1429, 0.6429, 0.2143)$; target type 2 has high probability because sensor 1 does detection while sensor 2 does discrimination. If we now observe no target with sensor 2, having detected target type 1 (or 2) with the first sensor, the posterior is $(0.4821, 0.4821, 0.0357)$. That is, there is a target (because we know sensor 1 is much better at target detection than sensor 2), but we still have no idea whether it is of type 1 or type 2 as sensor 2 did not make a valid detection. Introduction to Estimation and Data Fusion Slide 30

31 Data Fusion using Bayes Theorem Example If Finally, if sensor 1 gets no detection but sensor 2 detects target type 1, then the posterior is $(0.1216, 0.0270, 0.8514)$. That is, we still believe there is no target (sensor 1 is better at providing this information) and, perversely, sensor 2 confirms this. Practically, the joint likelihood matrix is never constructed (it is easy to see why). Rather, a likelihood matrix is constructed for each sensor and these are only combined when instantiated with an observation. Introduction to Estimation and Data Fusion Slide 31

32 Recursive Bayes Updating Bayes Theorem allows incremental or recursive addition of new information. With $Z^k = \{z_k, Z^{k-1}\}$, expand two ways: $P(x, Z^k) = P(x|Z^k)P(Z^k) = P(z_k, Z^{k-1}|x)P(x) = P(z_k|x)P(Z^{k-1}|x)P(x)$, where conditional independence of the observation sequence has been assumed. Equating both sides gives $P(x|Z^k)P(Z^k) = P(z_k|x)P(Z^{k-1}|x)P(x) = P(z_k|x)P(x|Z^{k-1})P(Z^{k-1})$. Introduction to Estimation and Data Fusion Slide 32

33 Recursive Bayes Updating Noting that $P(Z^k)/P(Z^{k-1}) = P(z_k|Z^{k-1})$ and rearranging gives $P(x|Z^k) = \frac{P(z_k|x)P(x|Z^{k-1})}{P(z_k|Z^{k-1})}$. Only $P(x|Z^{k-1})$ need be computed and stored; it contains a complete summary of all past information. On arrival of a new likelihood $P(z_k|x)$, the old posterior takes on the role of the current prior and the product of the two becomes the new posterior. Introduction to Estimation and Data Fusion Slide 33

34 Recursive Bayes Updating: An Example Ia An example of observations in independent Gaussian noise. Scalar $x$, zero-mean noise with variance $\sigma^2$: $P(z_k|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\frac{(z_k-x)^2}{\sigma^2}\right)$. Assume the posterior distribution after the first $k-1$ observations is Gaussian with mean $\bar{x}_{k-1}$ and variance $\sigma_{k-1}^2$: $P(x|Z^{k-1}) = \frac{1}{\sqrt{2\pi\sigma_{k-1}^2}}\exp\left(-\frac{1}{2}\frac{(\bar{x}_{k-1}-x)^2}{\sigma_{k-1}^2}\right)$. Introduction to Estimation and Data Fusion Slide 34

35 Recursive Bayes Updating: An Example Ib Then the posterior distribution after $k$ observations is $P(x|Z^k) = K\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2}\frac{(z_k-x)^2}{\sigma^2}\right)\frac{1}{\sqrt{2\pi\sigma_{k-1}^2}}\exp\left(-\frac{1}{2}\frac{(\bar{x}_{k-1}-x)^2}{\sigma_{k-1}^2}\right) = \frac{1}{\sqrt{2\pi\sigma_k^2}}\exp\left(-\frac{1}{2}\frac{(\bar{x}_k-x)^2}{\sigma_k^2}\right)$, where $K$ is a constant and $\bar{x}_k = \frac{\sigma_{k-1}^2}{\sigma_{k-1}^2+\sigma^2}z_k + \frac{\sigma^2}{\sigma_{k-1}^2+\sigma^2}\bar{x}_{k-1}$, $\sigma_k^2 = \frac{\sigma^2\sigma_{k-1}^2}{\sigma^2+\sigma_{k-1}^2}$. Gaussian distributions are conjugate. Introduction to Estimation and Data Fusion Slide 35
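A minimal MATLAB sketch of this scalar recursive update; the true state, noise variance and prior below are arbitrary example values:

% Scalar recursive Gaussian (Bayes) update with independent observation noise.
sigma2 = 0.5^2;                 % observation noise variance
xbar = 0.0;  s2 = 4.0;          % prior mean and variance
x_true = 2.0;                   % constant true state

for k = 1:20
    zk = x_true + sqrt(sigma2)*randn;               % simulated observation
    xbar = (s2*zk + sigma2*xbar) / (s2 + sigma2);   % updated mean
    s2   = (sigma2*s2) / (sigma2 + s2);             % updated variance
end
fprintf('estimate %.3f, variance %.4f\n', xbar, s2);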

36 Recursive Bayes Updating: An Example IIa Quite general prior and likelihood distributions can be handled by direct application of Bayes Theorem defined on a spatial grid. Consider the problem in which we are required to determine the location $(x,y)$ of a target in a defined area. The distribution $P(x,y)$ is simply defined as a set of probability values $P(x_i, y_j)$ defined at grid points $x_i, y_j$ of area $\delta x_i\,\delta y_j$. The only constraints placed on this distribution are that $P(x_i, y_j) > 0$ for all $x_i, y_j$, and that $\sum_i\sum_j P(x_i, y_j)\,\delta x_i\,\delta y_j = 1$. Introduction to Estimation and Data Fusion Slide 36

37 Recursive Bayes Updating: An Example IIb A passive sensor (sensor 1) located at $x_{s1} = 15$ km, $y_{s1} = 0$ km measures bearings to the target. The sensor is modeled by a conditional probability distribution $P_1(z_1|x,y)$ which describes, for each possible true target location $(x = x_i, y = y_j)$, a probability distribution on observed bearings. In general this conditional density requires a function on $z_1$ to be defined for each possible target location. In the discrete case this requires a three-dimensional matrix defining a probability $P_1(z_1 = z_1|x = x_i, y = y_j)$ for each combination of $(z_1, x_i, y_j)$. Introduction to Estimation and Data Fusion Slide 37

38 Recursive Bayes Updating: Example IIc In practice, however, these functions normally take $x$ and $y$ as parametric inputs. For example, defining the true bearing $\Theta = \arctan\left(\frac{y - y_{s1}}{x - x_{s1}}\right)$, a possible sensor model might be $P(z_1|x,y) = \frac{\alpha_a}{\sqrt{2\pi\sigma_a^2}}\exp\left(-\frac{1}{2}\frac{(z_1-\Theta)^2}{\sigma_a^2}\right) + \frac{\alpha_b}{\sqrt{2\pi\sigma_b^2}}\exp\left(-\frac{1}{2}\frac{(z_1-\Theta-B)^2}{\sigma_b^2}\right)$. This density consists of a weighted sum of two Gaussians, each with a different variance, one with a fixed bias $B$. The model is parametrised by the true target bearing $\Theta$. Introduction to Estimation and Data Fusion Slide 38

39 Recursive Bayes Updating: An Example IId As in earlier examples, when a specific observation $z_1 = z_1$ is made, it is substituted into the sensor model, which then becomes the likelihood function on $\Theta$, or on $x$ and $y$, only. Practically, in this example the likelihood function is obtained by simply substituting in the value of the observation, then computing and assigning the probability value $P_1(z_1|x_i, y_j)$ for each possible $(x_i, y_j)$ combination by direct substitution into the sensor model. The Figure shows a likelihood function computed in this manner. The likelihood shows that the bearing resolution of the sensor is high, whereas it has almost no range accuracy (the likelihood is long and thin, with probability mass concentrated on a line running from sensor to target). Introduction to Estimation and Data Fusion Slide 39

40 Recursive Bayes Updating: An Example IIe The posterior distribution can now be computed by simply taking the product of the prior probability $P(x,y)$ with the likelihood $P_1(z_1|x,y)$ at each of the discrete locations $(x = x_i, y = y_j)$ and normalising. The result shows that the distribution defining target location is now approximately constrained to a line along the detected bearing. The posterior $P(x,y|z_1,z_2)$ following a second observation $z_2$ by the same sensor provides little improvement in the location density; this is to be expected as there is no range data available. Introduction to Estimation and Data Fusion Slide 40
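A minimal MATLAB sketch of this grid-based update. The grid extent, sensor position and bearing noise are illustrative assumptions, and the sensor model is reduced to a single Gaussian in bearing for brevity:

% Grid-based Bayes update for a bearing-only sensor.
[X, Y] = meshgrid(0:0.5:100, 0:0.5:100);    % location grid (km)
xs = 15;  ys = 0;                           % sensor 1 position (km)
sigma_b = 2*pi/180;                         % bearing noise, radians

prior = ones(size(X));                      % uniform prior over the grid
prior = prior / sum(prior(:));

z = atan2(40 - ys, 60 - xs);                % an observed bearing (radians)

Theta = atan2(Y - ys, X - xs);              % true bearing at each grid point
likelihood = exp(-0.5*(z - Theta).^2 / sigma_b^2);

posterior = likelihood .* prior;            % point-wise product
posterior = posterior / sum(posterior(:));  % normalise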

41 Recursive Bayes Updating: Example IIf [Figure panels: (a) Prior Location Density; (b) Location likelihood from Sensor 1. Axes: X Range (km), Y Range (km).] Figure 1: Generalised Bayes Theorem. The figures show plots of two-dimensional distribution functions defined on a grid of x and y points: (a) prior distribution; (b) likelihood function for the first sensor. Introduction to Estimation and Data Fusion Slide 41

42 Recursive Bayes Updating: Example IIg [Figure panels: (c) Posterior location density after one observation from sensor 1; (d) Posterior location density after two observations from sensor 1. Axes: X Range (km), Y Range (km).] Figure 2: Generalised Bayes Theorem. The figures show plots of two-dimensional distribution functions defined on a grid of x and y points: (c) posterior after one application of Bayes Theorem; (d) posterior after two applications. Introduction to Estimation and Data Fusion Slide 42

43 Recursive Bayes Updating: An Example IIh A second sensor (sensor 2) now takes observations of the target from a location $x_{s2} = 50$ km, $y_{s2} = 20$ km. The Figure shows the target likelihood $P_2(z_3|x,y)$ following an observation $z_3$ by this sensor. It can be seen that this sensor (like sensor 1) has high bearing resolution but almost no range resolution. However, because the sensor is located at a different site, we would expect the combination of bearing information from the two sensors to provide accurate location data. Indeed, following point-wise multiplication of the second sensor likelihood with the new prior (the posterior $P(x,y|z_1,z_2)$ from the previous two observations of sensor 1), we obtain the posterior $P(x,y|z_1,z_2,z_3)$ shown in the Figure, which shows all probability mass highly concentrated around a single target location. Introduction to Estimation and Data Fusion Slide 43

44 Recursive Bayes Updating: Example IIi [Figure panels: (e) Location likelihood from Sensor 2; (f) Posterior location density following update from sensor 2. Axes: X Range (km), Y Range (km).] Figure 3: Generalised Bayes Theorem. The figures show plots of two-dimensional distribution functions defined on a grid of x and y points: (e) likelihood function for the second sensor; (f) final posterior. Introduction to Estimation and Data Fusion Slide 44

45 Recursive Bayes Updating: An Example IIj The general approach demonstrated in this example has broad appeal in situations where case-specific prior knowledge can be obtained. For example, if the problem of interest is tracking in an underwater environment with passive acoustics, we could add knowledge about no-go areas (such as land forms) by simply setting the prior to zero in these areas: $P(x = x_{land}, y = y_{land}) = 0$. In other examples, constraints such as road-ways could be used. Introduction to Estimation and Data Fusion Slide 45

46 Generalised Bayesian Filtering: Problem Statement $x_k$: the state vector to be estimated at time $k$. $u_k$: a control vector, assumed known, applied at time $k-1$ to drive the state from $x_{k-1}$ to $x_k$ at time $k$. $z_k$: an observation taken of the state $x_k$ at time $k$. In addition, the following sets are also defined: the history of states $X^k = \{x_0, x_1, \dots, x_k\} = \{X^{k-1}, x_k\}$; the history of control inputs $U^k = \{u_1, u_2, \dots, u_k\} = \{U^{k-1}, u_k\}$; the history of state observations $Z^k = \{z_1, z_2, \dots, z_k\} = \{Z^{k-1}, z_k\}$. The aim is to recursively estimate the posterior $P(x_k|Z^k, U^k, x_0)$. Introduction to Estimation and Data Fusion Slide 46

47 Sensor and Motion Models The observation model describes the probability of making an observation $z_k$ when the true state $x_k$ is known: $P(z_k|x_k)$. Assume conditional independence: $P(Z^k|X^k) = \prod_{i=1}^k P(z_i|X^k) = \prod_{i=1}^k P(z_i|x_i)$. Assume the vehicle model is Markov: $P(x_k|x_{k-1}, u_k)$. Introduction to Estimation and Data Fusion Slide 47

48 Observation Update Step Expand the joint distribution in terms of the state, $P(x_k, z_k|Z^{k-1}, U^k, x_0) = P(x_k|z_k, Z^{k-1}, U^k, x_0)P(z_k|Z^{k-1}, U^k, x_0) = P(x_k|Z^k, U^k, x_0)P(z_k|Z^{k-1}, U^k)$, and in terms of the observation, $P(x_k, z_k|Z^{k-1}, U^k, x_0) = P(z_k|x_k, Z^{k-1}, U^k, x_0)P(x_k|Z^{k-1}, U^k, x_0) = P(z_k|x_k)P(x_k|Z^{k-1}, U^k, x_0)$. Rearranging: $P(x_k|Z^k, U^k, x_0) = \frac{P(z_k|x_k)P(x_k|Z^{k-1}, U^k, x_0)}{P(z_k|Z^{k-1}, U^k)}$. Introduction to Estimation and Data Fusion Slide 48

49 Observation Update Step [Figure: the observation model $P(z_k|x_k)$ as a surface over $z$ and $x$, with slices $P(z_k|x_k = x_1)$ and $P(z_k|x_k = x_2)$, the likelihood $P(z_k = z_1|x_k)$, and the prior $P(x_k)$.] Introduction to Estimation and Data Fusion Slide 49

50 Time Update Step Using the Total Probability Theorem: $P(x_k|Z^{k-1}, U^k, x_0) = \int P(x_k, x_{k-1}|Z^{k-1}, U^k, x_0)\,dx_{k-1} = \int P(x_k|x_{k-1}, Z^{k-1}, U^k, x_0)P(x_{k-1}|Z^{k-1}, U^k, x_0)\,dx_{k-1} = \int P(x_k|x_{k-1}, u_k)P(x_{k-1}|Z^{k-1}, U^{k-1}, x_0)\,dx_{k-1}$. Introduction to Estimation and Data Fusion Slide 50

51 Time Update Step [Figure: the joint density $P(x_{k-1}, x_k)$, the prior $P(x_{k-1})$, the transition density $P(x_k|x_{k-1})$ along $x_k = f(x_{k-1}, u_k)$, and the prediction $P(x_k) = \int P(x_k, x_{k-1})\,dx_{k-1}$.] Introduction to Estimation and Data Fusion Slide 51

52 Recursive Solution Prediction: $P(x_k|Z^{k-1}, U^k, x_0) = \int P(x_k|x_{k-1}, u_k)P(x_{k-1}|Z^{k-1}, U^{k-1}, x_0)\,dx_{k-1}$. Update: $P(x_k|Z^k, U^k, x_0) = K\,P(z_k|x_k)P(x_k|Z^{k-1}, U^k, x_0)$. Introduction to Estimation and Data Fusion Slide 52
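For a scalar state the recursion can be implemented directly on a grid. A minimal MATLAB sketch; the random-walk motion model, noise levels and measurement below are illustrative assumptions (uses implicit array expansion, MATLAB R2016b or later):

% Direct grid implementation of the Bayes filter for a scalar state.
x = linspace(-10, 10, 401);  dx = x(2) - x(1);
q = 0.5;                     % process (motion) noise standard deviation
r = 1.0;                     % observation noise standard deviation

% Transition matrix T(i,j) = P(x_k = x(i) | x_{k-1} = x(j)), random-walk model
T = exp(-0.5*(x' - x).^2 / q^2);
T = T ./ (sum(T, 1) * dx);   % each column integrates to one over x_k

p = exp(-0.5*(x - 0).^2 / 2^2);   % Gaussian prior on the state
p = p / (sum(p) * dx);

z = 3.2;                     % a measurement of the state

p_pred = (T * p' * dx)';                 % prediction: integrate over x_{k-1}
lik    = exp(-0.5*(z - x).^2 / r^2);     % likelihood P(z_k | x_k)
p_post = lik .* p_pred;                  % observation update (point-wise product)
p_post = p_post / (sum(p_post) * dx);    % normalise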

53 Generalised Bayesian Filtering: Example Ia For low-dimensional problems it is possible, and instructive, to implement the general Bayesian filter in a direct form. Consider a scalar-valued state $x_k$ indexed by time $k$. Assume that the state transition is Markovian with a defined state-transition probability $P(x_k|x_{k-1}, u_k)$, where $u_k$ is a known control applied to drive $x_{k-1}$ to $x_k$. Assume also a prior probability $P(x_{k-1})$. The prior information can be predicted forward to time $k$ as $P(x_k|u_k) = \int P(x_k|x_{k-1}, u_k)P(x_{k-1})\,dx_{k-1}$. Introduction to Estimation and Data Fusion Slide 53

54 Generalised Bayesian Filtering: Example Ib This time-prediction step is essentially a convolution of two probability densities, $P(x_{k-1})$ and $P(x_k|x_{k-1}, u_k)$. The process of convolution acts to blur or spread the prior density with the uncertainty arising from the state transition. The Figure is an example of this process. The time-prediction step clearly represents a loss of information as the prediction is more widely spread than the prior. More generally, convolution results in information loss. Introduction to Estimation and Data Fusion Slide 54

55 Generalised Bayesian Filtering: Example Ic [Figure: prediction step for the Bayes filter, showing the prior, the motion model and the resulting prediction.] Introduction to Estimation and Data Fusion Slide 55

56 Generalised Bayesian Filtering: Example Id For low-dimensional problems, the convolution for the time-prediction can be implemented directly. The direct approach scales exponentially with state dimension. An efficient implementation of convolution is to use multiplication in the frequency domain:

% input: two N-length vectors fx, fy holding the two densities
N = length(fx);
% time-reverse one to get fxy(tau - t)
fy = fliplr(fy);
% convolution in time is multiplication in frequency
X = fft(fx); Y = fft(fy);
Fxy = X .* conj(Y) / N;
% take the inverse; only use the real part, and centre
fxy = ifftshift(real(ifft(Fxy)));
% normalise so the result integrates to one (gnorm is a course helper function)
fxy = gnorm(fxy, x);

Introduction to Estimation and Data Fusion Slide 56

57 Generalised Bayesian Filtering: Example Ie For the observation update step an observation model $P(z_k|x_k)$ is required. This is generally a two-dimensional function of both $z_k$ and $x_k$. When an observation $z_k = z$ is made, a likelihood function $P(z_k = z|x_k)$, defined on the state only, is generated. The observation update is then simply the normalised product of the prediction with this likelihood: $P(x_k|u_k, z_k = z) = C\,P(z_k = z|x_k)P(x_k|u_k)$. This computation can be implemented as a point-wise product of the arrays $P(x_k|u_k)$ and $P(z_k = z|x_k)$, both defined only on $x_k$. The Figure shows that the peak of the posterior distribution is a weighted sum of the peaks of the prediction and likelihood, and that the spread of the posterior distribution is less than the spread of either the prediction or the likelihood. The observation-update step clearly represents a gain of information as the updated density is more compact than the prediction. More generally, multiplication of probability densities results in information gain. Introduction to Estimation and Data Fusion Slide 57

58 Generalised Bayesian Filtering: Example If [Figure: update step for the Bayes filter, showing the prediction, the observation model and the resulting update.] Introduction to Estimation and Data Fusion Slide 58

59 Distributed Data Fusion with Bayes Theorem Provided the basic rules of conditional probability are followed, it is not difficult to construct data fusion architectures. Three possible approaches: communicate observations; communicate likelihoods; communicate local posteriors. Introduction to Estimation and Data Fusion Slide 59

60 The Independent Likelihood Pool Figure 4: The distributed implementation of the independent likelihood pool. Each sensor maintains its own model in the form of a conditional probability distribution $P_i(z_i|x)$. On arrival of a measurement $z_i$, the sensor model is instantiated with the associated observation to form a likelihood $\Lambda_i(x)$. This is transmitted to a central fusion centre where the normalised product of likelihoods and prior, $P(x|Z^n) = C\,P(x)\prod_{i=1}^n \Lambda_i(x)$, yields the posterior distribution $P(x|Z^n)$. Introduction to Estimation and Data Fusion Slide 60

61 The Independent Likelihood Pool I The independent likelihood pool: $P(x|Z^n) = C\,P(x)\prod_i \Lambda_i(x)$, with, for example, local models $P_1(z_1|x)$:

         z_1    z_2    z_3
x_1     0.45   0.45   0.10
x_2     0.45   0.45   0.10
x_3     0.15   0.15   0.70

and $P_2(z_2|x)$:

         z_1    z_2    z_3
x_1     0.45   0.10   0.45
x_2     0.10   0.45   0.45
x_3     0.45   0.45   0.10

Introduction to Estimation and Data Fusion Slide 61

62 The Independent Likelihood Pool II If $z_1 = z_1$, communicate $\Lambda_1(x) = (0.45, 0.45, 0.15)$; if $z_2 = z_1$, communicate $\Lambda_2(x) = (0.45, 0.1, 0.45)$. At the fusion processor, the information is combined by multiplication: $P(x|z_1 = z_1, z_2 = z_1) = C\,\Lambda_1(x)\Lambda_2(x)P(x) = C\,(0.45, 0.45, 0.15)\cdot(0.45, 0.1, 0.45)\cdot(1/3, 1/3, 1/3) = (0.6429, 0.1429, 0.2143)$. The sensors have become anonymous; they are simply devices that communicate probability distributions on the common state. Introduction to Estimation and Data Fusion Slide 62

63 Distributed Data Fusion with Bayes Theorem Figure 5: A distributed implementation of the independent opinion pool in which each sensor maintains both its own model and also computes a local posterior. The complete posterior is made available to all sensors and so they become, in some sense, autonomous. The figure shows Bayes Theorem in a recursive form. Introduction to Estimation and Data Fusion Slide 63

64 Pool with Local Posteriors Assume the prior $P(x) = (1/3, 1/3, 1/3)$ is communicated to the two sensors. Sensor 1 observes, say, $z_1 = z_1$, so the local posterior is $P_1(x|z_1) = (0.4286, 0.4286, 0.1429)$. Sensor 2 observes, say, $z_2 = z_1$, so the local posterior is $P_2(x|z_2) = (0.45, 0.1, 0.45)$. Posterior fusion: $P_{12}(x|z_1, z_2) = P(x)\,\frac{P_1(x|z_1)}{P(x)}\,\frac{P_2(x|z_2)}{P(x)} = (0.6429, 0.1429, 0.2143)$. A further local observation $z_2 = z_1$ gives $P_{12}(x|z_1 = z_1, z_2 = z_1, z_2 = z_1) = C\,P_2(z_2 = z_1|x)P_{12}(x|z_1 = z_1, z_2 = z_1) = C\,(0.45, 0.1, 0.45)\cdot(0.6429, 0.1429, 0.2143) = (0.7232, 0.0357, 0.2411)$. Introduction to Estimation and Data Fusion Slide 64

65 A Note on The Expectation Operator Expected value of a function of a random variable: $E\{G(x)\} = \int G(x)f(x)\,dx$ in the continuous case, or $E\{G(x)\} = \sum_{x\in\mathcal{X}} G(x)f(x)$ in the discrete case. Of note are $E\{x^n\}$, the $n$th moment, and $E\{(x - E\{x\})^n\}$, the $n$th central moment. For example, the second central moment $\sigma^2 = E\{(x-\bar{x})^2\}$ is the variance. When $x$ is a vector, the variance is defined as $\Sigma = E\{(x-\bar{x})(x-\bar{x})^T\}$. Expectation is a linear operator: $E\{AG(x) + BH(x)\} = A\,E\{G(x)\} + B\,E\{H(x)\}$. Introduction to Estimation and Data Fusion Slide 65

66 Log-Likelihoods and Information Methods Introduction to Estimation and Data Fusion Slide 66

67 Data Fusion with Log-Likelihoods Log-likelihoods have the advantage of computational efficiency and are also more closely related to formal definitions of information. The log-likelihood and conditional log-likelihood are defined as $l(x) = \log P(x)$ and $l(x|y) = \log P(x|y)$. The log-likelihood is always less than or equal to zero: $l(x) \leq 0$. The log-likelihood is a useful and efficient means of implementing probability calculations; for example, Bayes theorem becomes $l(x|z) = l(z|x) + l(x) - l(z)$. Introduction to Estimation and Data Fusion Slide 67

68 Log-Likelihood Example I Two-sensor discrete target identification example. The log-likelihood matrix for the first sensor (using natural logs) is

         z_1      z_2      z_3
x_1    -0.799   -0.799   -2.303
x_2    -0.799   -0.799   -2.303
x_3    -1.897   -1.897   -0.357

and for the second

         z_1      z_2      z_3
x_1    -0.799   -2.303   -0.799
x_2    -2.303   -0.799   -0.799
x_3    -0.799   -0.799   -2.303

Introduction to Estimation and Data Fusion Slide 68

69 The posterior log-likelihood (given a uniform prior) following observation of target 1 by sensor 1 and target 1 by sensor 2 is the sum of the first columns of each of the log-likelihood matrices: $l(x|z_1, z_1) = l_1(z_1|x) + l_2(z_1|x) + C = (-0.799, -0.799, -1.897) + (-0.799, -2.303, -0.799) + C = (-1.597, -3.101, -2.696) + C = (-0.442, -1.946, -1.540)$, where the constant $C = 1.155$ is found through normalisation (which in this case requires that the anti-logs sum to one). Note the ease of computation: this is obviously simpler than working in probability, and is sufficient to indicate relative likelihoods. Introduction to Estimation and Data Fusion Slide 69
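A minimal MATLAB sketch of this log-domain fusion, using the log-likelihood matrices assumed above:

% Fusion by summing log-likelihoods, then normalising.
L1 = log([0.45 0.45 0.10; 0.45 0.45 0.10; 0.15 0.15 0.70]);   % sensor 1
L2 = log([0.45 0.10 0.45; 0.10 0.45 0.45; 0.45 0.45 0.10]);   % sensor 2

l = L1(:,1) + L2(:,1);             % both sensors observe z1
C = -log(sum(exp(l)));             % normalising constant (uniform prior)
l_post = l + C                     % (-0.442, -1.946, -1.540)
p_post = exp(l_post)               % back to probability: (0.6429, 0.1429, 0.2143)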

70 Log-Likelihood Example II In the case of a Gaussian, completing squares gives $l(x|Z^k) = -\frac{1}{2}\frac{(\bar{x}_k-x)^2}{\sigma_k^2} = -\frac{1}{2}\left[\frac{(z_k-x)^2}{\sigma^2} + \frac{(\bar{x}_{k-1}-x)^2}{\sigma_{k-1}^2}\right] + C$, where $\bar{x}_k = \frac{\sigma_{k-1}^2}{\sigma_{k-1}^2+\sigma^2}z_k + \frac{\sigma^2}{\sigma_{k-1}^2+\sigma^2}\bar{x}_{k-1}$ and $\sigma_k^2 = \frac{\sigma^2\sigma_{k-1}^2}{\sigma^2+\sigma_{k-1}^2}$. Thus the log-likelihood is quadratic in $x$: for each value of $x$, a log-likelihood is specified as $-\frac{1}{2}\frac{(\bar{x}_k-x)^2}{\sigma_k^2}$, modulo addition of a constant $C$. Introduction to Estimation and Data Fusion Slide 70

71 Data Fusion with Log-Likelihoods Log-likelihoods are a convenient way of implementing distributed data fusion architectures. Fusion of information is simply a matter of summing log-likelihoods. Examples: fully centralised fusion; the independent likelihood pool; the independent opinion pool with local posterior log-likelihoods. Introduction to Estimation and Data Fusion Slide 71

72 Data Fusion with Log-Likelihoods Figure 6: A log-likelihood implementation of a fully centralised data fusion architecture. Introduction to Estimation and Data Fusion Slide 72

73 Data Fusion with Log-Likelihoods Figure 7: A log-likelihood implementation of the independent likelihood pool architecture. Introduction to Estimation and Data Fusion Slide 73

74 Data Fusion with Log-Likelihoods Figure 8: A log-likelihood implementation of the independent opinion pool architecture. Introduction to Estimation and Data Fusion Slide 74

75 Information Measures Probabilities and log-likelihoods are defined on states or observations. It is often valuable to also measure the amount of information contained in a given probability distribution. Formally, information is a measure of the compactness of a distribution; logically, if a probability distribution is spread evenly across many states, then its information content is low, and conversely, if a probability distribution is highly peaked on a few states, then its information content is high. Information is thus a function of the distribution, rather than the underlying state. Information measures play an important role in designing and managing data fusion systems. Two probabilistic measures of information are of particular value in data fusion problems: the Shannon information (or entropy) and the Fisher information. Introduction to Estimation and Data Fusion Slide 75

76 Entropy Consider a discrete random variable $x$ which has $N$ possible outcomes $\{x_1, \dots, x_N\}$. Define $p_i = P(x = x_i)$ as the probability that a realisation of $x$ is $x_i$, $i = 1, \dots, N$. The Shannon information content of an outcome $x_i$ is defined (usually measured in bits, but here we use nats) as $h(x_i) = \log\frac{1}{p_i} = -\log p_i$. As the $p_i$ are always less than one, the $h(x_i)$ are always positive. As the probability $p_i$ becomes smaller, $h(x_i)$, the information content, becomes larger. Essentially, $h(x_i)$ measures surprise: the more unlikely an event, the more surprising and informative it is when it occurs. Introduction to Estimation and Data Fusion Slide 76

77 Entropy of a Discrete Distribution The entropy or Shannon information $H_P(x)$ associated with a probability distribution $P(x)$, defined on a random variable $x$, is the ensemble average of the Shannon information content of the outcomes, or equivalently the expected value of minus the log-likelihood: $H_P(x) = E\left\{\log\frac{1}{P(x)}\right\} = -E\{\log P(x)\} = \sum_{x\in\mathcal{X}} P(x)\log\frac{1}{P(x)} = -\sum_{x\in\mathcal{X}} P(x)\log P(x) = \sum_i p_i\log\frac{1}{p_i} = -\sum_i p_i\log p_i$ for discrete-valued random variables. Note that, following convention, we use $x$ as an argument for $H_P(\cdot)$ even though the sum is taken over values of $x$, so $H_P(\cdot)$ is not strictly a function of $x$ but is rather a function of the distribution $P(\cdot)$. Introduction to Estimation and Data Fusion Slide 77
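A one-line MATLAB check of the discrete entropy in nats, using an arbitrary example distribution and guarding against zero-probability outcomes:

% Entropy (nats) of a discrete distribution p (a vector summing to one).
p = [0.45 0.45 0.10];
H = -sum(p(p > 0) .* log(p(p > 0)))    % approximately 0.949 nats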

78 Entropy of a Continuous Distribution I For continuous-valued random variables $x$, an entropy (the Boltzmann-Shannon entropy) may also be defined: $H_P(x) = -E\{\log P(x)\} = -\int P(x)\log P(x)\,dx$. The relationship between continuous and discrete entropy is not immediate. Consider a scalar random variable $x$ and let $\int_{x_{i-1}}^{x_i} P(x)\,dx = p_i$, so $P(x) \approx \frac{p_i}{x_i - x_{i-1}}$ for $x_{i-1} \le x < x_i$. Then the Boltzmann-Shannon entropy may be written $H_P(x) = -\sum_{i=1}^n \int_{x_{i-1}}^{x_i} P(x)\log P(x)\,dx \approx -\sum_{i=1}^n p_i\log\frac{p_i}{x_i - x_{i-1}}$, the discrete analogue of the continuous entropy. Introduction to Estimation and Data Fusion Slide 78

79 Entropy of a Continuous Distribution II A distinction between continuous and discrete entropy is that in the discrete case the variables are unequivocal, but in the continuous case they may be chosen with some freedom. In particular, a transformation of continuous variables $x$ to $y$ such as $y = g(x)$ may be effected. In this case the entropy on $y$ may be found in terms of the entropy on $x$ as $H(y) = H(x) + \int_x P(x)\log|g_x(x)|\,dx$, where $|\cdot|$ is the determinant operation and $g_x(x) = \left.\frac{\partial g}{\partial x}\right|_{x=x_i}$ is the Jacobian of $g$ with respect to $x$ evaluated at the roots $x_i = g^{-1}(y)$. Of particular interest are changes in coordinate systems. In this case, with $y = Ax$, the entropy relation is $H(y) = H(x) + \log|A|$; for rotations and other measure-preserving (orthonormal) transformations $|A| = 1$ and so $H(y) = H(x)$. Introduction to Estimation and Data Fusion Slide 79

80 The Meaning of Entropic Information I The entropic or Shannon information is both subtle and powerful in its meaning and application. Fundamentally, the entropy H P ( ) measures the compactness of a density P ( ) on a state space. It achieves a minimum of zero when all probability mass is assigned to a single value of x. It achieves a maximum when probability mass is uniformly distributed over all states. In an estimation-theoretic context, it is most natural to think of the most informative probability distribution as that which assigns all probability to a single state; logically the most compact of probabilities. Conversely the least informative distribution is one in which probability is spread uniformly over all states and so entropy is a maximum. Introduction to Estimation and Data Fusion Slide 80

81 The Meaning of Entropic Information II Maximum entropy distributions are often used as prior distributions when no useful prior information is available. For example, if the random variable $x$ can take on at most $n$ discrete values in the set $\mathcal{X}$, then the least informative (maximum entropy) distribution on $x$ is one which assigns a uniform probability $1/n$ to each value. This distribution clearly has an entropy of $\log n$. When $x$ is continuous-valued, the least informative distribution is also uniform, although strictly improper as $P(x) = 1$ does not integrate to 1. Introduction to Estimation and Data Fusion Slide 81

82 The Meaning of Entropic Information III However, this reasoning is reversed in the context of communication and experimental design. Imagine a simple experiment in which a finite number of outcomes are possible, or equally imagine the receiving channel of a communications link with a finite alphabet of transmission symbols. If the result of the experiment is known with high probability a priori, then the actual occurrence of the event is not very informative. Conversely, if the possibility of each outcome is uniformly distributed, then the outcome itself is most informative. Succinctly: the outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform. Mathematically, maximising the Shannon information corresponds to this second interpretation of information. However, in this course we will tend to think of information maximisation as the process of compacting a density, which is thus strictly equivalent to maximising the negative of the Shannon information. Introduction to Estimation and Data Fusion Slide 82

83 The Meaning of Entropic Information IV Up to a constant factor, entropy turns out to be the only reasonable definition of informativeness. Informally, three conditions on an information measure lead to this conclusion: 1. Continuity: the measure should be continuous in the $p_i$; thus the measure must be a continuous function. 2. Choice: the measure should be a monotonically increasing function of the number of possible outcomes of a random event. In particular, imagine a storage device consisting of $N$ binary switches. The number of possible states is clearly $2^N$ and the logarithm of the number of states is proportional to $N$. As the number of switches increases we wish the information measure to also increase monotonically; a function proportional to the logarithm of the number of states clearly achieves this. 3. Composition: if a choice of outcome is broken down into two successive stages, the resulting information measure should be a weighted sum of the measures from each stage separately. As we have seen, this is true for log-likelihoods and will be true for linear operators, such as expectation, on log-likelihoods. Introduction to Estimation and Data Fusion Slide 83

84 The Meaning of Entropic Information V The implications of this in data fusion problems are manifold. The fact that information is, by definition, linearly additive makes computation particularly simple and is fundamental in developing efficient decentralised data fusion algorithms. Introduction to Estimation and Data Fusion Slide 84

85 The Entropy of English (Part I) A classic and intuitive example is measuring the information content of the English language. Written language is not random. Different letters have different probabilities of occurrence; we are not so surprised to see an 'a' or 'b', but are relatively more surprised when we see a 'q' or 'z'. This is readily captured by the Shannon information measure. In this example, the text of Flatland by A. Square (Edwin Abbott) is employed as a sample of the English language. The sample comprises approximately 200,000 letters and spaces. To make plotting easier, numerical assignments are made for letters with a = 1, b = 2, ..., z = 26 and space = 27. All other characters are ignored. The probability of occurrence and information content of each character is plotted. It is clear that letters such as j, q and z are least likely and therefore provide most information. Introduction to Estimation and Data Fusion Slide 85

86 The entropy for the ensemble as a whole is $H_P(x) = \sum_i p_i\log\frac{1}{p_i} = 2.83$ nats. Interestingly, most English texts have numerically similar information content. It is also interesting to compare this to the information content of letters (and space) chosen randomly: $\log N = \log 27 = 3.30$. The redundancy of a sample is defined as $R(x) = 1 - \frac{H_P(x)}{\log N}$. The redundancy of the example text is 0.14; this means that approximately 14% of letters are redundant. Introduction to Estimation and Data Fusion Slide 86
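A minimal MATLAB sketch of this calculation; 'flatland.txt' is a placeholder file name for the sample text:

% Letter-frequency entropy of an English text sample, in nats.
txt = lower(fileread('flatland.txt'));     % placeholder file name
symbols = ['a':'z' ' '];                   % a=1, ..., z=26, space=27
txt = txt(ismember(txt, symbols));         % ignore all other characters

counts = zeros(1, numel(symbols));
for i = 1:numel(symbols)
    counts(i) = sum(txt == symbols(i));    % occurrences of each symbol
end
p = counts / sum(counts);

H = -sum(p(p > 0) .* log(p(p > 0)));       % entropy, approximately 2.8 nats
R = 1 - H / log(27);                       % redundancy, approximately 0.14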

87 The Entropy of English (Part I) Figure 9: The probability and information content of letters in an English text. (a) The probability of occurrence of each letter. Highly probable letters are space, e, t and a (letters 27, 5, 20 and 1 respectively). (b) The information content of each letter. Highly informative letters are z, j and q (letters 26, 10 and 17 respectively). Introduction to Estimation and Data Fusion Slide 87

88 Joint Entropy The joint entropy of one or more outcomes is defined through the joint probability density of these events. If $x$ and $y$ are two random variables with joint probability density $P(x,y)$, then the joint entropy is defined as $H(x,y) = -\sum_{\mathcal{X}}\sum_{\mathcal{Y}} P(x,y)\log P(x,y)$ in the discrete case, and $H(x,y) = -\int_{\mathcal{X}}\int_{\mathcal{Y}} P(x,y)\log P(x,y)\,dx\,dy$ in the continuous case. The definition of entropy can be extended to any number of random variables and outcomes in the obvious manner. Introduction to Estimation and Data Fusion Slide 88

89 The Entropy of English (Part II) English language example: pairs of successive letters are sampled, so that event $x$ is the selection of the first letter and $y$ the selection of the second letter in a sequence. These pairs of letters are termed bigrams. The joint entropy is found to be $H(x,y) \approx 5.14$ nats, and the redundancy in this case is $R(x,y) = 1 - \frac{H_P(x,y)}{\log N^2} = 1 - \frac{H_P(x,y)}{2\log 27} = 0.22$. Taking letters in pairs, the English language has 22% redundancy. Taking letters in triples and so on leads to the conclusion that approximately 50% of written English is redundant (no surprises there!). Introduction to Estimation and Data Fusion Slide 89

90 The Entropy of English (Part II) Figure 10: The joint probability of a sequence of two letters in the example text. Two-letter sequences are known as bigrams. Note that sequences of high probability include the pair 'th' (point [20, 8]) and 'n' (letter 14) followed by a vowel (a, e, i, etc). Introduction to Estimation and Data Fusion Slide 90

91 Entropy of Gaussian The entropy of an $n$-dimensional Gaussian $P(x) = N(\bar{x}, P) = |2\pi P|^{-1/2}\exp\left(-\frac{1}{2}(x-\bar{x})^T P^{-1}(x-\bar{x})\right)$ is $H_P(x) = -E\{\log P(x)\} = \frac{1}{2}E\{(x-\bar{x})^T P^{-1}(x-\bar{x})\} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}E\left\{\sum_{ij}(x_i-\bar{x}_i)P^{-1}_{ij}(x_j-\bar{x}_j)\right\} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}\sum_{ij}E\{(x_j-\bar{x}_j)(x_i-\bar{x}_i)\}P^{-1}_{ij} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}\sum_j\sum_i P_{ji}P^{-1}_{ij} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}\sum_j(PP^{-1})_{jj} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}\sum_j 1 + \frac{1}{2}\log[(2\pi)^n|P|]$ Introduction to Estimation and Data Fusion Slide 91

92 $= \frac{n}{2} + \frac{1}{2}\log[(2\pi)^n|P|] = \frac{1}{2}\log[(2\pi e)^n|P|]$. The entropy is defined only by the vector length $n$ and the covariance $P$; it is proportional to the log of the determinant of the covariance. The determinant of a matrix is a volume measure (the determinant is the product of the eigenvalues), so the entropy is a measure of the volume enclosed by the covariance matrix and consequently of the compactness of the probability distribution. If the Gaussian is scalar with variance $\sigma^2$, then the entropy is simply given by $H(x) = \log\sigma\sqrt{2\pi e}$. Entropy increases with increasing variance. Introduction to Estimation and Data Fusion Slide 92
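A minimal MATLAB check of this formula; the covariance matrix below is an arbitrary example:

% Entropy of an n-dimensional Gaussian from its covariance matrix P.
P = [4.0 1.0;
     1.0 2.0];
n = size(P, 1);
H = 0.5 * log((2*pi*exp(1))^n * det(P))    % approximately 3.81 nats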

93 Conditional Entropy The definition of entropy can be extended to include conditional entropy. Consider the information (entropy) about a state $x$ contained in the distribution $P(x|y)$ given that the outcome $y = y_j$ has already been observed. For discrete random variables this is $H_P(x|y_j) = -E\{\log P(x|y_j)\} = -\sum_x P(x|y_j)\log P(x|y_j)$, and for continuous-valued random variables $H_P(x|y_j) = -E\{\log P(x|y_j)\} = -\int P(x|y_j)\log P(x|y_j)\,dx$. Introduction to Estimation and Data Fusion Slide 93

94 Conditional Entropy The conditional entropy is defined as the average or expected value of this entropy over all possible realisations of $y$: $H(x|y) = \sum_j P(y = y_j)H(x|y = y_j) = -\sum_j\sum_x P(y = y_j)P(x|y = y_j)\log P(x|y = y_j) = -\sum_j\sum_x P(x, y = y_j)\log P(x|y = y_j) = -\sum_y\sum_x P(x,y)\log P(x|y)$ for discrete random variables, and similarly $H(x|y) = E\{H(x|y)\} = -\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x,y)\log P(x|y)\,dx\,dy$ for continuous random variables. Note that $H(x|y)$ is not a function of either $x$ or $y$; rather it is a measure of the information that will be obtained about $x$ given knowledge of $y$, on the average, before a specific value of $y$ has been determined. Introduction to Estimation and Data Fusion Slide 94
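A minimal MATLAB sketch computing the conditional entropy from a discrete joint distribution; the joint table is an arbitrary example:

% Conditional entropy H(x|y) from a discrete joint distribution Pxy.
% Rows index x, columns index y; the entries sum to one.
Pxy = [0.30 0.10;
       0.10 0.20;
       0.05 0.25];

Py    = sum(Pxy, 1);            % marginal P(y)
Px_y  = Pxy ./ Py;              % conditional P(x|y), column-wise
terms = Pxy .* log(Px_y);       % P(x,y) log P(x|y)
Hxy   = -sum(terms(:))          % conditional entropy, in nats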

95 Conditional Entropy The chain-rule for conditional probabilities can be employed to obtain a chain-rule for conditional entropy. Taking logs of the chain-rule: $\log P(x,y) = \log P(x|y) + \log P(y) = \log P(y|x) + \log P(x)$. Taking expected values of both sides of this equation over $P(x,y)$ yields $H(x,y) = H(x|y) + H(y) = H(y|x) + H(x)$. This quite naturally states that the entropy of the combined outcome is the sum of the entropy of the first outcome plus the entropy of the second outcome given the first. The chain-rule for conditional entropy can be extended to any number of random variables: $H(x_1, x_2, \dots, x_N) = H(x_1|x_2,\dots,x_N) + H(x_2|x_3,\dots,x_N) + \cdots + H(x_N)$. Introduction to Estimation and Data Fusion Slide 95

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Simultaneous Localization and Mapping (SLAM) Corso di Robotica Prof. Davide Brugali Università degli Studi di Bergamo

Simultaneous Localization and Mapping (SLAM) Corso di Robotica Prof. Davide Brugali Università degli Studi di Bergamo Simultaneous Localization and Mapping (SLAM) Corso di Robotica Prof. Davide Brugali Università degli Studi di Bergamo Introduction SLAM asks the following question: Is it possible for an autonomous vehicle

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1

CHAPTER 3. P (B j A i ) P (B j ) =log 2. j=1 CHAPTER 3 Problem 3. : Also : Hence : I(B j ; A i ) = log P (B j A i ) P (B j ) 4 P (B j )= P (B j,a i )= i= 3 P (A i )= P (B j,a i )= j= =log P (B j,a i ) P (B j )P (A i ).3, j=.7, j=.4, j=3.3, i=.7,

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:

More information

Dynamic System Identification using HDMR-Bayesian Technique

Dynamic System Identification using HDMR-Bayesian Technique Dynamic System Identification using HDMR-Bayesian Technique *Shereena O A 1) and Dr. B N Rao 2) 1), 2) Department of Civil Engineering, IIT Madras, Chennai 600036, Tamil Nadu, India 1) ce14d020@smail.iitm.ac.in

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein

Kalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time

More information

Probabilistic and Bayesian Machine Learning

Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Lecture 1: Introduction to Probabilistic Modelling Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Why a

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

Lecture 8: Channel Capacity, Continuous Random Variables

Lecture 8: Channel Capacity, Continuous Random Variables EE376A/STATS376A Information Theory Lecture 8-02/0/208 Lecture 8: Channel Capacity, Continuous Random Variables Lecturer: Tsachy Weissman Scribe: Augustine Chemparathy, Adithya Ganesh, Philip Hwang Channel

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Inference and estimation in probabilistic time series models

Inference and estimation in probabilistic time series models 1 Inference and estimation in probabilistic time series models David Barber, A Taylan Cemgil and Silvia Chiappa 11 Time series The term time series refers to data that can be represented as a sequence

More information

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30

Problem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30 Problem Set 2 MAS 622J/1.126J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain

More information

Signal Processing - Lecture 7

Signal Processing - Lecture 7 1 Introduction Signal Processing - Lecture 7 Fitting a function to a set of data gathered in time sequence can be viewed as signal processing or learning, and is an important topic in information theory.

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Gaussian Processes for Sequential Prediction

Gaussian Processes for Sequential Prediction Gaussian Processes for Sequential Prediction Michael A. Osborne Machine Learning Research Group Department of Engineering Science University of Oxford Gaussian processes are useful for sequential data,

More information

Exercises with solutions (Set D)

Exercises with solutions (Set D) Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

2 (Statistics) Random variables

2 (Statistics) Random variables 2 (Statistics) Random variables References: DeGroot and Schervish, chapters 3, 4 and 5; Stirzaker, chapters 4, 5 and 6 We will now study the main tools use for modeling experiments with unknown outcomes

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

More information

Consider the joint probability, P(x,y), shown as the contours in the figure above. P(x) is given by the integral of P(x,y) over all values of y.

Consider the joint probability, P(x,y), shown as the contours in the figure above. P(x) is given by the integral of P(x,y) over all values of y. ATMO/OPTI 656b Spring 009 Bayesian Retrievals Note: This follows the discussion in Chapter of Rogers (000) As we have seen, the problem with the nadir viewing emission measurements is they do not contain

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Introduction to Machine Learning

Introduction to Machine Learning What does this mean? Outline Contents Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola December 26, 2017 1 Introduction to Probability 1 2 Random Variables 3 3 Bayes

More information

2 Functions of random variables

2 Functions of random variables 2 Functions of random variables A basic statistical model for sample data is a collection of random variables X 1,..., X n. The data are summarised in terms of certain sample statistics, calculated as

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

AUTOMOTIVE ENVIRONMENT SENSORS

AUTOMOTIVE ENVIRONMENT SENSORS AUTOMOTIVE ENVIRONMENT SENSORS Lecture 5. Localization BME KÖZLEKEDÉSMÉRNÖKI ÉS JÁRMŰMÉRNÖKI KAR 32708-2/2017/INTFIN SZÁMÚ EMMI ÁLTAL TÁMOGATOTT TANANYAG Related concepts Concepts related to vehicles moving

More information

Introduction to Mobile Robotics Probabilistic Robotics

Introduction to Mobile Robotics Probabilistic Robotics Introduction to Mobile Robotics Probabilistic Robotics Wolfram Burgard 1 Probabilistic Robotics Key idea: Explicit representation of uncertainty (using the calculus of probability theory) Perception Action

More information

9 Multi-Model State Estimation

9 Multi-Model State Estimation Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 9 Multi-Model State

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Ch. 8 Math Preliminaries for Lossy Coding. 8.4 Info Theory Revisited

Ch. 8 Math Preliminaries for Lossy Coding. 8.4 Info Theory Revisited Ch. 8 Math Preliminaries for Lossy Coding 8.4 Info Theory Revisited 1 Info Theory Goals for Lossy Coding Again just as for the lossless case Info Theory provides: Basis for Algorithms & Bounds on Performance

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Lecture 4: Extended Kalman filter and Statistically Linearized Filter

Lecture 4: Extended Kalman filter and Statistically Linearized Filter Lecture 4: Extended Kalman filter and Statistically Linearized Filter Department of Biomedical Engineering and Computational Science Aalto University February 17, 2011 Contents 1 Overview of EKF 2 Linear

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

Modeling and state estimation Examples State estimation Probabilities Bayes filter Particle filter. Modeling. CSC752 Autonomous Robotic Systems

Modeling and state estimation Examples State estimation Probabilities Bayes filter Particle filter. Modeling. CSC752 Autonomous Robotic Systems Modeling CSC752 Autonomous Robotic Systems Ubbo Visser Department of Computer Science University of Miami February 21, 2017 Outline 1 Modeling and state estimation 2 Examples 3 State estimation 4 Probabilities

More information

A review of probability theory

A review of probability theory 1 A review of probability theory In this book we will study dynamical systems driven by noise. Noise is something that changes randomly with time, and quantities that do this are called stochastic processes.

More information

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology

More information

EKF and SLAM. McGill COMP 765 Sept 18 th, 2017

EKF and SLAM. McGill COMP 765 Sept 18 th, 2017 EKF and SLAM McGill COMP 765 Sept 18 th, 2017 Outline News and information Instructions for paper presentations Continue on Kalman filter: EKF and extension to mapping Example of a real mapping system:

More information

Robotics. Lecture 4: Probabilistic Robotics. See course website for up to date information.

Robotics. Lecture 4: Probabilistic Robotics. See course website   for up to date information. Robotics Lecture 4: Probabilistic Robotics See course website http://www.doc.ic.ac.uk/~ajd/robotics/ for up to date information. Andrew Davison Department of Computing Imperial College London Review: Sensors

More information

Some Concepts of Probability (Review) Volker Tresp Summer 2018

Some Concepts of Probability (Review) Volker Tresp Summer 2018 Some Concepts of Probability (Review) Volker Tresp Summer 2018 1 Definition There are different way to define what a probability stands for Mathematically, the most rigorous definition is based on Kolmogorov

More information

p(z)

p(z) Chapter Statistics. Introduction This lecture is a quick review of basic statistical concepts; probabilities, mean, variance, covariance, correlation, linear regression, probability density functions and

More information

conditional cdf, conditional pdf, total probability theorem?

conditional cdf, conditional pdf, total probability theorem? 6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence

More information

Preliminary statistics

Preliminary statistics 1 Preliminary statistics The solution of a geophysical inverse problem can be obtained by a combination of information from observed data, the theoretical relation between data and earth parameters (models),

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

1 Introduction to information theory

1 Introduction to information theory 1 Introduction to information theory 1.1 Introduction In this chapter we present some of the basic concepts of information theory. The situations we have in mind involve the exchange of information through

More information

ECE531: Principles of Detection and Estimation Course Introduction

ECE531: Principles of Detection and Estimation Course Introduction ECE531: Principles of Detection and Estimation Course Introduction D. Richard Brown III WPI 22-January-2009 WPI D. Richard Brown III 22-January-2009 1 / 37 Lecture 1 Major Topics 1. Web page. 2. Syllabus

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CS491/691: Introduction to Aerial Robotics

CS491/691: Introduction to Aerial Robotics CS491/691: Introduction to Aerial Robotics Topic: State Estimation Dr. Kostas Alexis (CSE) World state (or system state) Belief state: Our belief/estimate of the world state World state: Real state of

More information

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout :. The Multivariate Gaussian & Decision Boundaries..15.1.5 1 8 6 6 8 1 Mark Gales mjfg@eng.cam.ac.uk Lent

More information

Kalman Filter. Predict: Update: x k k 1 = F k x k 1 k 1 + B k u k P k k 1 = F k P k 1 k 1 F T k + Q

Kalman Filter. Predict: Update: x k k 1 = F k x k 1 k 1 + B k u k P k k 1 = F k P k 1 k 1 F T k + Q Kalman Filter Kalman Filter Predict: x k k 1 = F k x k 1 k 1 + B k u k P k k 1 = F k P k 1 k 1 F T k + Q Update: K = P k k 1 Hk T (H k P k k 1 Hk T + R) 1 x k k = x k k 1 + K(z k H k x k k 1 ) P k k =(I

More information

Sensor Tasking and Control

Sensor Tasking and Control Sensor Tasking and Control Sensing Networking Leonidas Guibas Stanford University Computation CS428 Sensor systems are about sensing, after all... System State Continuous and Discrete Variables The quantities

More information

Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber

Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2017 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Introduction to Probability and Stocastic Processes - Part I

Introduction to Probability and Stocastic Processes - Part I Introduction to Probability and Stocastic Processes - Part I Lecture 2 Henrik Vie Christensen vie@control.auc.dk Department of Control Engineering Institute of Electronic Systems Aalborg University Denmark

More information

Random Variables and Their Distributions

Random Variables and Their Distributions Chapter 3 Random Variables and Their Distributions A random variable (r.v.) is a function that assigns one and only one numerical value to each simple event in an experiment. We will denote r.vs by capital

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Markov localization uses an explicit, discrete representation for the probability of all position in the state space.

Markov localization uses an explicit, discrete representation for the probability of all position in the state space. Markov Kalman Filter Localization Markov localization localization starting from any unknown position recovers from ambiguous situation. However, to update the probability of all positions within the whole

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by

More information

Vector Derivatives and the Gradient

Vector Derivatives and the Gradient ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 10 ECE 275A Vector Derivatives and the Gradient ECE 275AB Lecture 10 Fall 2008 V1.1 c K. Kreutz-Delgado, UC San Diego

More information

Example: Letter Frequencies

Example: Letter Frequencies Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

CIS 390 Fall 2016 Robotics: Planning and Perception Final Review Questions

CIS 390 Fall 2016 Robotics: Planning and Perception Final Review Questions CIS 390 Fall 2016 Robotics: Planning and Perception Final Review Questions December 14, 2016 Questions Throughout the following questions we will assume that x t is the state vector at time t, z t is the

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

Example: Letter Frequencies

Example: Letter Frequencies Example: Letter Frequencies i a i p i 1 a 0.0575 2 b 0.0128 3 c 0.0263 4 d 0.0285 5 e 0.0913 6 f 0.0173 7 g 0.0133 8 h 0.0313 9 i 0.0599 10 j 0.0006 11 k 0.0084 12 l 0.0335 13 m 0.0235 14 n 0.0596 15 o

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

ECE295, Data Assimila0on and Inverse Problems, Spring 2015

ECE295, Data Assimila0on and Inverse Problems, Spring 2015 ECE295, Data Assimila0on and Inverse Problems, Spring 2015 1 April, Intro; Linear discrete Inverse problems (Aster Ch 1 and 2) Slides 8 April, SVD (Aster ch 2 and 3) Slides 15 April, RegularizaFon (ch

More information

Lecture 17: Differential Entropy

Lecture 17: Differential Entropy Lecture 17: Differential Entropy Differential entropy AEP for differential entropy Quantization Maximum differential entropy Estimation counterpart of Fano s inequality Dr. Yao Xie, ECE587, Information

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Probability, CLT, CLT counterexamples, Bayes. The PDF file of this lecture contains a full reference document on probability and random variables.

Probability, CLT, CLT counterexamples, Bayes. The PDF file of this lecture contains a full reference document on probability and random variables. Lecture 5 A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2015 http://www.astro.cornell.edu/~cordes/a6523 Probability, CLT, CLT counterexamples, Bayes The PDF file of

More information

ECE531 Lecture 8: Non-Random Parameter Estimation

ECE531 Lecture 8: Non-Random Parameter Estimation ECE531 Lecture 8: Non-Random Parameter Estimation D. Richard Brown III Worcester Polytechnic Institute 19-March-2009 Worcester Polytechnic Institute D. Richard Brown III 19-March-2009 1 / 25 Introduction

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring /

Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring / Machine Learning Ensemble Learning I Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1 / Agenda Combining Classifiers Empirical view Theoretical

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization

Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization Wolfram Burgard, Cyrill Stachniss, Maren Bennewitz, Kai Arras 1 Motivation Recall: Discrete filter Discretize the

More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables Joint Probability Density Let X and Y be two random variables. Their joint distribution function is F ( XY x, y) P X x Y y. F XY ( ) 1, < x

More information