Signal Processing - Lecture 7


1 Introduction

Fitting a function to a set of data gathered in time sequence can be viewed as signal processing or learning, and is an important topic in information theory. In this process, a signal is measured along with noise of known functional form. The estimator then filters the input, converging to a measure of the signal as the data are collected. Schematically the problem appears in Figure 1, where $Z$ is the signal, $Y$ is the measured signal, $W$ is the noise, $\hat{Z}$ is the estimated signal, and $E = Z - \hat{Z}$.

The signal estimator is designed as an optimum filter to obtain $\hat{Z}$ in the presence of the noise, $W$. Some knowledge of the signal generation is assumed, as well as an understanding of the noise. Estimators must be designed to work with a variety of inputs, and they must work with the inverted model equations; thus they need to avoid discontinuities or infinities. Note that this development is inspired by Bayes theorem, which allows probability updating.

The best known example of such an estimator is the Kalman filter. In its simplest case, the generator is linear and time invariant, and the noise is distributed normally about 0. The filter minimizes the mean square of the estimator error at each data step.

[Figure 1: An illustration of a time-sequenced measurement of a signal with an estimator which improves the measurement by learning. A stochastic input drives a generator producing the signal $Z$; the measurement $Y$ includes the noise $W$; the estimator produces $\hat{Z}$, and a comparator forms the error $E = Z - \hat{Z}$.]

$H_\infty$ optimal estimators are another type of time-invariant filter. In these estimators the quantity

$$\sum_{k=0}^{\infty} E\left[\|w(k)\|^2\right]$$

is interpreted as the mean energy in a signal, $w$. This estimator minimizes the largest mean energy gain from the noise input, $w$, into the estimator error.

2 Kalman Filter

The best known estimator of this type is the Kalman filter. This filter is a set of mathematical equations that provides an efficient method to estimate parameters in an iterative process. It uses previous information to influence future calculations in the spirit of Bayes theorem.

As an example, the free 3-D motion of a particle in space can be written as a linear set of equations given by the change of a state vector $(x, y, z, p_x, p_y, p_z)$ to the vector $(x + (p_x/m)\Delta t,\; y + (p_y/m)\Delta t,\; z + (p_z/m)\Delta t,\; p_x, p_y, p_z)$. Generalization to more complicated motion is straightforward. The state vector is obtained by the linear operation

$$x_k = A\, x_{k-1} + B\, u_{k-1} + w_{k-1}$$

In the above, $k$ numbers the order of the measurements, $u$ represents a control input to the system, and $w$ the noise input, which is assumed to be random. An actual measurement of $x$ at step $k$ is given by $z_k$, written $z_k = H x_k + \nu_k$. In this latter equation, $\nu_k$ is a normally distributed random variable associated with the noise in the measurement process. In this example $A$, $B$, and $H$ are tunable parameters. We then proceed to take an average, removing the random variables and smoothing the variations:

$$\hat{x}_k = \hat{x}_{k-1} + K(z_k - H\hat{x}_{k-1})$$

The second term corrects the value due to a measurement. Note that if $z_k$ and $H\hat{x}_{k-1}$ agree, the correction vanishes. The expectation of the square of the difference between the predicted value and the mean is the square of the standard deviation, $\sigma_k^2$. This is to be minimized by the choice of $K$. The solution is independent of the initial choice of starting values, and is iterated to convergence. However, the convergence speed depends on the selection of the filter parameters.

Examples of the use of the Kalman filter are shown in Figures 2 and 3. The Kalman filter has mainly been applied to signal processing, but it has also been used as a sophisticated way to determine the trajectory of particles in a background of many position measurements. Note that tuning of the filter parameters can be important for the speed of convergence, and will differ with the problem and the noise input to the data.
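To make the state propagation concrete, here is a minimal numerical sketch of the free-particle transition above; the mass, time step, initial state, and noise scale are all assumed for illustration.

```python
import numpy as np

# Free 3-D motion: state = (x, y, z, px, py, pz).
# Positions advance by (p/m) * dt; momenta are unchanged.
m, dt = 1.0, 0.1                      # assumed mass and time step
A = np.eye(6)
A[0:3, 3:6] = (dt / m) * np.eye(3)    # x += (px/m) dt, etc.

x_prev = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 3.0])  # assumed initial state
w = np.random.normal(scale=0.01, size=6)           # process noise w_{k-1}
x_k = A @ x_prev + w                 # x_k = A x_{k-1} + w  (B u = 0 here)
print(x_k)
```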

[Figure 2: An illustration of the application of the Kalman filter to fit a set of voltage measurements, showing the data points which fluctuate with noise and the convergence of the filter to the data.]

The Kalman filter has the following properties:

1. The average value of the estimate equals the average value of the state.

2. The Kalman algorithm minimizes the expected value of the squared estimation error. Thus, on average, the algorithm gives the smallest possible error in the estimator.

The Kalman filter may be viewed as a predictor-corrector series. The filter recursively conditions the current estimate on all past measurements. In implementation, the measurement noise covariance is usually determined prior to the filter operation. On the other hand, the generator (process) noise is more difficult to characterize, as it cannot be directly measured. These covariances could be viewed as parameters and tuned during the analysis, or on another set of data.

The Kalman filter can be extended to include processes where the noise is not constant and/or the process is not linear. In the latter case one linearizes the equations and restricts changes to small values (i.e., uses a Taylor expansion, solves the linear equations for small steps, and iterates).

In the simplest case, a measure of the variable $z$ may be obtained from the state variables, $x_k$. In the case of a linear relationship,

$$z_k = H x_k + v_k$$

In the above, $v_k$ is a random variable, which for simplicity is assumed to be normally distributed with 0 mean. The assumption is that the generator and measurement noise are independent. Write the noise covariance matrix for $w$ as $Q$ and that for $v$ as $R$. These are also assumed constant as $k$ changes.

[Figure 3: The same as in Figure 2, but with different filter values showing slower convergence.]

$$R = E[v_k v_k^T] \qquad Q = E[w_k w_k^T]$$

The matrix $A$ relates the state at the previous step, $k-1$, to the state at $k$. The matrix $H$ relates the state $x_k$ to the measurement $z_k$. The prior and posterior estimation errors are

$$a_k^- = x_k - \hat{x}_k^- \qquad a_k = x_k - \hat{x}_k$$

The estimated error covariances are

$$P_k^- = E[a_k^- (a_k^-)^T] \;\;\text{(prior)} \qquad P_k = E[a_k a_k^T] \;\;\text{(posterior)}$$

$$P_k^- = A P_{k-1} A^T + Q$$

The covariance matrix in general has the form

$$P_k = \begin{pmatrix} \langle (x_1 - \bar{x}_1)(x_1 - \bar{x}_1) \rangle & \langle (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) \rangle \\ \langle (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) \rangle & \langle (x_2 - \bar{x}_2)(x_2 - \bar{x}_2) \rangle \end{pmatrix}$$

We are to obtain a posterior estimate, $\hat{x}_k$, using the prior estimate, $\hat{x}_k^-$, and the actual measurement, $z_k$. Thus, using $z_k = H x_k + v_k$,

$$\hat{x}_k = \hat{x}_k^- + K(z_k - H\hat{x}_k^-)$$

In the above, $K$ is a weighting factor (the gain) multiplying the difference between the measured value and the prediction.

Now we wish to minimize the error covariance, $P_k$. A solution is

$$K = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}$$

The solution is intuitive. As the measurement error covariance $R \to 0$, the gain weights the residual more heavily: $\lim_{R \to 0} K_k = H^{-1}$, so that $\hat{x}_k$ reproduces $z_k = H x_k$. If the prior estimate error covariance $P_k^- \to 0$, the residual is weighted less heavily: $\lim_{P_k^- \to 0} K_k = 0$. Thus as $R \to 0$ the actual measurement is trusted more, but as $P_k^- \to 0$ the prediction is trusted more.

As an example, measure the value of a constant voltage on which is imposed a normal distribution of noise. Write

$$x_k = A x_{k-1} + B u_{k-1} + w_k = x_{k-1} + w_k$$

The measurement is

$$z_k = H x_k + v_k = x_k + v_k$$

In this case the state does not change, so $A = 1$; there is no control input, so $B = 0$; and the state is directly measured, so $H = 1$. The averages of the update equations are

$$\hat{x}_k^- = \hat{x}_{k-1}$$

The error covariance is

$$P_k^- = P_{k-1} + Q$$

The resulting equations are

$$K_k = P_k^- / (P_k^- + R)$$

$$\hat{x}_k = \hat{x}_k^- + K_k (z_k - \hat{x}_k^-)$$

Look at Figure 2. In this calculation the noise covariance was $R = 0.01$, and the filter was slower to believe the measurements, relying instead on the calculation. In Figure 3 the noise covariance was chosen as $R = 0.0001$, so that the filter was quick to believe the (noisy) measurements. In both cases convergence to the same result occurs, but in the latter case more slowly. Thus, an appropriate choice of parameters is important for convergence.
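The scalar filter above is short enough to sketch directly in code. This is a minimal illustration, not a reference implementation; the true voltage, process noise $Q$, measurement count, and starting values are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

x_true = 0.5            # assumed constant voltage
R = 0.01                # measurement noise covariance (as in Figure 2)
Q = 1e-5                # assumed small process noise
z = x_true + np.sqrt(R) * rng.normal(size=50)   # noisy measurements

x_hat, P = 0.0, 1.0     # arbitrary starting values
for z_k in z:
    # Predict: A = 1, B = 0, so the prior is the previous estimate.
    x_prior = x_hat
    P_prior = P + Q
    # Update: H = 1.
    K = P_prior / (P_prior + R)
    x_hat = x_prior + K * (z_k - x_prior)
    P = (1 - K) * P_prior

print(x_hat)   # converges toward x_true as data accumulate
```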

[Figure 4: An example of a 2-D grid cell used to derive the interpolation equations, with corners 1-4 at $(x_1(L), x_2(L))$, $(x_1(L+1), x_2(L))$, $(x_1(L+1), x_2(L+1))$, and $(x_1(L), x_2(L+1))$, and the interpolation point $(x, y)$ in the interior.]

3 Interpolation

We have been estimating functions by expansions about a point, and attempting to find values of a function using predictor-corrector methods. However, in most cases this is just a more complicated process than using interpolation. Note immediately that interpolation and extrapolation are not the same. In interpolation the function is bounded, by assuming it is analytic within the bounds of the known values. When extrapolating, one finds the value of a function outside the known values, and errors in this latter case can be large.

Suppose we want to find the value of $y(x_1, x_2, \ldots, x_n)$ if we know the values on a grid. Suppose the grid is Cartesian and, to simplify, use 2-D. Thus one knows the values

$$y = y(x_i, x_j) \qquad x_i = x_1(i) \qquad x_j = x_2(j)$$

To find $y$ at some point within the grid, consider the diagram shown in Figure 4. Find the fractional displacements from the corners:

$$t = \frac{x_1 - X_1(L)}{X_1(L+1) - X_1(L)} \qquad u = \frac{x_2 - X_2(L)}{X_2(L+1) - X_2(L)}$$

Combining these,

$$y(x_1, x_2) = (1-t)(1-u)\,y_1 + t(1-u)\,y_2 + tu\,y_3 + (1-t)u\,y_4$$

This equation can be developed easily by using a Taylor expansion in the variables and keeping the lowest terms. A better representation of the value is possible by keeping higher terms.
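A short sketch of the bilinear formula, with an assumed unit cell and corner values; the corner numbering follows Figure 4.

```python
def bilinear(x1, x2, X1, X2, y_corner):
    """Bilinear interpolation inside the cell [X1[0], X1[1]] x [X2[0], X2[1]].

    y_corner = (y1, y2, y3, y4) are the values at the corners, numbered
    counterclockwise from (X1[0], X2[0]) as in Figure 4.
    """
    t = (x1 - X1[0]) / (X1[1] - X1[0])   # fractional displacement in x1
    u = (x2 - X2[0]) / (X2[1] - X2[0])   # fractional displacement in x2
    y1, y2, y3, y4 = y_corner
    return (1 - t) * (1 - u) * y1 + t * (1 - u) * y2 \
        + t * u * y3 + (1 - t) * u * y4

# Example: at the cell center the result is the average of the corners.
print(bilinear(0.5, 0.5, (0.0, 1.0), (0.0, 1.0), (1.0, 2.0, 3.0, 4.0)))  # 2.5
```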

Now suppose we consider the smoothness of the function (applying the derivatives as well as the values of the function at the grid points). This gives a spline fit. Suppose we have not only the values of the function on the grid, but also the gradient. The derivative terms, if not given analytically, can be determined numerically. A 1-D linear interpolation, as obtained from the 2-D development above, is

$$y = A y_L + B y_{L+1}$$

$$A = \frac{x_{L+1} - x}{x_{L+1} - x_L} \qquad B = 1 - A = \frac{x - x_L}{x_{L+1} - x_L}$$

$$y = y_L + B\,(y_{L+1} - y_L)$$

Now suppose we also have the second derivative, $y''$. Then

$$y = A y_L + B y_{L+1} + C y''_L + D y''_{L+1}$$

$$C = \frac{1}{6}(A^3 - A)(x_{L+1} - x_L)^2 \qquad D = \frac{1}{6}(B^3 - B)(x_{L+1} - x_L)^2$$

Then we can find $y''$, the second derivative of the interpolating polynomial. Write all terms to 0th and 1st order in the difference coefficients and solve the simultaneous equations. A spline fit (or interpolation) not only gives the interpolated value but also assures that the first derivative at that point is continuous.
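A sketch evaluating the spline form above on a single interval; the endpoint values and second derivatives are assumed. With both second derivatives set to zero it reduces to linear interpolation.

```python
def spline_eval(x, xL, xR, yL, yR, d2L, d2R):
    """Evaluate the cubic spline on [xL, xR], given the function values
    (yL, yR) and second derivatives (d2L, d2R) at the endpoints."""
    A = (xR - x) / (xR - xL)
    B = 1.0 - A
    C = (A**3 - A) * (xR - xL)**2 / 6.0
    D = (B**3 - B) * (xR - xL)**2 / 6.0
    return A * yL + B * yR + C * d2L + D * d2R

# With zero second derivatives this reduces to linear interpolation:
print(spline_eval(0.25, 0.0, 1.0, 1.0, 3.0, 0.0, 0.0))  # 1.5
```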

4 Measuring Information

In probability there is some uncertainty, and a measure of uncertainty in knowledge is entropy. Here we are more interested in a measure of information, as opposed to thermodynamic entropy; these are not quite the same. If we know a source will transmit a poem by Frost or by Eliot, and we receive a poem by Frost, then we have obtained 1 bit of information. The presumption, of course, is that we know the poems. However, both information entropy and thermodynamic entropy depend on probability distributions.

Note that two observers may assign different values to the information and entropy of a source. Two students listening to a lecture can perceive different levels of information. One who has no prior knowledge of the subject could assign a completely random distribution to the information content, while another might be more prepared to assimilate the information. Thus, think about a data stream as a measurement process. This is information transfer, and we could assign a prior to the process which, with a likelihood, would produce a posterior. This process could then be iterated, as previously discussed.

We introduce entropy as a measure of the uncertainty contained in a probability distribution. Suppose a discrete distribution has a set of outcomes, $A_x = \{a_1, \ldots, a_N\}$. Then, if $P(a_i) = p_i$, we have completeness, $\sum_{x \in A_x} P(x) = 1$. Thus there is a set of corresponding probabilities, $\{p_1, \ldots, p_N\}$. A fundamental theorem in information theory is Shannon's theorem, which defines the measure of information as $S(p_1, \ldots, p_N)$:

$$S = -\sum_{i=1}^{N} p_i \ln(p_i)$$

Here $S$ is the information entropy. The principle of maximum entropy states that the set of probabilities (a distribution) is expected to occur in such a way as to maximize the entropy. If there are no constraints on the probabilities (other than $\sum_i p_i = 1$), then maximum entropy is equivalent to the principle of indifference. This principle states that, without prior knowledge, we assign equal probabilities to all allowed values.

The entropy of a probability distribution is the expectation value of its information content. Suppose one has a die whose average value over a number of throws was 3.7 instead of 3.5:

$$\sum_i i\, p_i = 3.7$$

We would need to maximize $S$ subject to this constraint. For indifference we have $p_i = 1/6$ and find

$$\sum_i i\, p_i = 3.5$$

Now, as an example, use the value of the entropy $S = -\sum_i p_i \ln(p_i)$ with the constraint $\sum_i p_i = 1$. Use this to write the entropy as

$$S = -\sum_{i=1}^{N-1} p_i \ln(p_i) - \left(1 - \sum_{i=1}^{N-1} p_i\right) \ln\!\left(1 - \sum_{j=1}^{N-1} p_j\right)$$

For maximum entropy,

$$\frac{\partial S}{\partial p_i} = -\ln(p_i) - 1 + \ln\!\left(1 - \sum_{j=1}^{N-1} p_j\right) + 1 = 0$$

$$\ln(p_i) = \ln\!\left(1 - \sum_{j=1}^{N-1} p_j\right)$$

Thus for maximum entropy, all probabilities are equal.

For example, let $p_1 + p_2 + p_3 = 1$; eliminating the constraint gives, e.g., $p_1 = 1 - p_2 - p_3$ and $p_2 = 1 - p_1 - p_3$, and the variation above yields $p_1 = p_2 = p_3 = 1/3$.

5 Example

Return to the example of the asymmetric die introduced above. The average value over the trials is found to be

$$\sum_i i\, p_i = 3.7$$

A completely symmetric die would give the average value 3.5, with a probability of $1/6$ for any face to show after a toss. For this case all probabilities are equal, and the information entropy is

$$S = \sum_{i=1}^{6} (1/6)\ln(6) = \ln(6) = 1.79$$

If the ordering of the states is unimportant, we find that the number of possible states is

$$\text{Number of States} = 6!$$

Now suppose we choose a set of probabilities which give an average of 3.7. This is not the only possible set of probabilities with this average, of course:

$$(a_1 = 1,\; p_1 = 0.1),\quad (a_2 = 2,\; p_2 = 0.1),\quad (a_3 = 3,\; p_3 = 0.1)$$
$$(a_4 = 4,\; p_4 = 0.5),\quad (a_5 = 5,\; p_5 = 0.1),\quad (a_6 = 6,\; p_6 = 0.1)$$

$$S = 5\,(1/10)\ln(10) + (1/2)\ln(2) = 1.5$$

$$\langle i \rangle = \sum_i i\, p_i = 3.7$$

The number of states is then $5!$, and the entropy has decreased.
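A quick numerical check of the two die distributions above:

```python
import numpy as np

faces = np.arange(1, 7)

def entropy(p):
    """Information entropy S = -sum p ln p."""
    return -np.sum(p * np.log(p))

p_uniform = np.full(6, 1 / 6)
p_skewed = np.array([0.1, 0.1, 0.1, 0.5, 0.1, 0.1])

print(np.dot(faces, p_uniform), entropy(p_uniform))  # 3.5, ln(6) = 1.79
print(np.dot(faces, p_skewed), entropy(p_skewed))    # 3.7, 1.50
```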

6 Information

Information is defined, for our purposes here, in terms of probability. It is measured by information entropy. The expectation value of a function $f$ is

$$\langle f \rangle = \sum_i f_i\, p_i$$

Entropy will be defined as the expectation value of the information content. Thus the information in an outcome of probability $p_i$ is $-\ln(p_i)$. From the above development, the probability assignment which maximizes the entropy for $N$ items, found from $\partial S / \partial p_i = 0$, is $p_i = 1/N$ for all $i$. Thus

$$0 \le S(p) \le \ln(N)$$

Then $S(p) = 0$ if one $p_i = 1$ and all others vanish, and $S(p) = \ln(N)$ if $p_i = 1/N$ for all $i$.

There are different bases for the log operation, but these only provide different units of information:

1. For $\log_2$ the units are bits.

2. For $\log_3$ the units are trits.

3. For $\log_e$ the units are nats.

4. For $\log_{10}$ the units are Hartleys.

Information is non-negative, and if an event has probability 1 it carries no information; thus $I(1) = 0$. If two independent events occur, the probability is the product of the separate probabilities, and the information content is the sum of the information from each event:

$$I(p_1 p_2) = I(p_1) + I(p_2)$$

Information is a continuous, monotonic function of probability. Suppose we have $N$ symbols, $\{a_1, \ldots, a_N\}$, which represent outcomes with probabilities $\{p_1, \ldots, p_N\}$. Define the average amount of information per symbol as follows. For each symbol, $a_i$, there is information $\ln(1/p_i)$. In $N$ observations there will be $N p_i$ occurrences of the symbol $a_i$. The total information is

$$I = \sum_i (N p_i) \ln(1/p_i)$$

This is just the number of observations of the symbol $a_i$ times the information per occurrence, summed over all symbols, i.e., $N$ times the expectation value of the information content. The average information per symbol is

$$\frac{I}{N} = -\sum_i p_i \ln(p_i)$$
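A small sketch of these unit conventions and the additivity property; the example probabilities are assumed.

```python
import math

def info(p, base=math.e):
    """Information content -log(p); the base sets the units
    (2 -> bits, e -> nats, 10 -> Hartleys)."""
    return -math.log(p) / math.log(base)

# Choosing between two equally likely poems carries 1 bit:
print(info(0.5, base=2))          # 1.0 bit
# The same event measured in nats:
print(info(0.5))                  # 0.693 nats
# Information adds for independent events:
print(info(0.5 * 0.25, base=2), info(0.5, 2) + info(0.25, 2))  # 3.0, 3.0
```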

In the continuum limit,

$$S(p) = -\int dx\; p(x) \ln(p(x))$$

7 Review

Assume an ensemble of random variables, $x_i$, which produce a set of observables, $A$, with elements $a_i$. Each element has probability $p_i$. The constraint is $\sum_{i=1}^{N} p_i = 1$. Also assume a joint probability where the outcome is the ordered pair $(x, y)$. This means

$$P(x) = \sum_{y \in A} P(x, y)$$

This expresses the marginalization of the probability with respect to $y$. The conditional probability of $x$ given $y$ is

$$P(x|y) = \frac{P(x, y)}{P(y)}$$

The product rule for the probability $P(x, y)$ given $z$ is

$$P(x, y|z) = P(x|y, z)\, P(y|z)$$

which leads to Bayes theorem:

$$P(y|x, z) = \frac{P(x|y, z)\, P(y|z)}{P(x|z)}$$

Entropy is then defined as

$$S(X) = -\sum_{x \in A} P(x) \ln[P(x)]$$

We must define $P(x)\ln[P(x)] = 0$ when $P(x) = 0$. The joint entropy is

$$S(X, Y) = -\sum_{x, y \in A} P(x, y) \ln[P(x, y)]$$

Entropy is additive. For $P(x, y) = P(x)P(y)$,

$$S(X, Y) = -\sum_{x, y} P(x)P(y) \ln[P(x)P(y)] = S(X) + S(Y)$$

The conditional entropy of $X$ given $Y$ is the sum of the values of $S(X|y)$ weighted by the probability of $y$:

$$S(X|y) = -\sum_{x \in A} P(x|y) \ln[P(x|y)]$$

$$S(X|Y) = \sum_{y \in A} P(y)\, S(X|y) = -\sum_{x, y \in A} P(x, y) \ln[P(x|y)]$$

This measures the uncertainty in $x$ when $y$ is known. There is a chain rule for combining entropies, written as

$$S(X, Y) = S(X) + S(Y|X) = S(Y) + S(X|Y)$$

The average reduction in uncertainty of $X$ that results from learning the value of $Y$ (the mutual information) is

$$I(X;Y) = S(X) - S(X|Y)$$

Finally, entropy is maximized by variation of the probabilities subject to the constraints on the data.
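A numerical check of the chain rule on an assumed 2x2 joint distribution; the mutual information $S(X) - S(X|Y)$ defined above comes out non-negative.

```python
import numpy as np

# Assumed joint distribution P(x, y); rows index x, columns index y.
P = np.array([[0.4, 0.1],
              [0.2, 0.3]])

def H(p):
    """Entropy of a probability array, with 0 ln 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

Px, Py = P.sum(axis=1), P.sum(axis=0)           # marginals
S_XY, S_X, S_Y = H(P.ravel()), H(Px), H(Py)

# Conditional entropy S(X|Y) = sum_y P(y) S(X|y), computed directly:
S_X_given_Y = sum(Py[j] * H(P[:, j] / Py[j]) for j in range(2))

print(np.isclose(S_XY, S_Y + S_X_given_Y))   # chain rule S(X,Y) = S(Y) + S(X|Y)
print(S_X - S_X_given_Y)                     # mutual information I(X;Y) >= 0
```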

8 Connection to thermodynamics and statistical mechanics

Thermodynamics measures the macroscopic properties of systems that are microscopically complex. Suppose a macroscopic system has states $\{x_1, \ldots, x_N\}$ with probabilities $\{p_1, \ldots, p_N\}$, which are assigned by the principle of maximum entropy. In practice this is not possible directly, because the number of microscopic states is huge. We can, of course, use expectation values, for example the system energy.

The maximum entropy principle is useful in combinatorial problems. To illustrate, consider levels of description of a complex system. For a cube of sugar, the first level would consist of all possible sizes and orientations of the crystals and the assignment of probabilities to these observables. At the next level the crystals are composed of molecules, and one could study all possible molecular arrangements. In this case many arrangements are identical, but probabilities could describe departures from the norm. Predictions of cleavage or heat conductivity are possible. At the third level, the configurations of the molecular arrangements, their rotations and vibrations, could be considered. At this level classical physics would need to be replaced with quantum mechanics. At the final level all quantum states would be studied, and the equilibrium macrostates would be the ones with the greatest multiplicity, i.e., the states of greatest entropy. This is the level of quantum statistical mechanics. [At each level, probability is used to replace imperfect knowledge, and this makes the connection to entropy.]

Statistical mechanics is a probabilistic theory dealing with variables at the peak of the probability distributions, i.e., at maximum entropy. Thus statistical mechanics is a mixed micro-macroscopic theory. Data on the microscopic level are paired with information at the macroscopic level, as encoded in the partition function. All predictions are probabilistic, and states of maximum entropy (the peak of the probability distribution) are chosen. This works because of the large number of states, which means the variance is small.

8.1 Classical thermodynamics

Suppose we define a discrete stochastic variable, $\epsilon_i$, having probability $p_i$, subject to the constraint $\sum_{i=1}^{N} p_i = 1$. Choose two functions of $\epsilon$, $r(\epsilon)$ and $\eta(\epsilon)$, with expectation values

$$\langle x \rangle = \sum_i p_i\, r(\epsilon_i) \qquad \langle y \rangle = \sum_i p_i\, \eta(\epsilon_i)$$

Assign probabilities constrained by the expectation values and normalization. Then define a general function of the form

$$F(\lambda_1, \lambda_2, \lambda_3) = -\sum_i p_i \ln(p_i) - \lambda_1\!\left[\sum_i p_i - 1\right] - \lambda_2\!\left[\sum_i p_i\, r(\epsilon_i) - \langle x \rangle\right] - \lambda_3\!\left[\sum_i p_i\, \eta(\epsilon_i) - \langle y \rangle\right]$$

In the above, the $\lambda_i$ are Lagrange multipliers used to apply the constraints. We interpret the probability as the likelihood of the occurrence of an event given all possible outcomes. From this we obtain a probability distribution using the maximum entropy principle. To eliminate the Lagrange multipliers, apply the calculus of variations. Setting

$$\frac{\partial F}{\partial p_i} = -\ln(p_i) - 1 - \lambda_1 - \lambda_2\, r(\epsilon_i) - \lambda_3\, \eta(\epsilon_i) = 0$$

gives

$$p_i = \frac{e^{-(\lambda_2 r(\epsilon_i) + \lambda_3 \eta(\epsilon_i))}}{\text{Normalization}} \qquad \text{Normalization} = \sum_i e^{-(\lambda_2 r(\epsilon_i) + \lambda_3 \eta(\epsilon_i))} = e^{1 + \lambda_1}$$

The normalization is $Z$, the partition function. Therefore we have found the solution to the following problem. Given a set of $N$ discrete observables, $\{x_i\}$ (outcomes of an experiment), and the expectation values of $m$ variables, $\{g_k(x_i)\}$, we have a probability distribution, $\{p_i\}$, which describes the known information about the observables. That is, given $\langle g_k(x_i) \rangle = \sum_i p_i\, g_k(x_i)$, we have obtained the best unbiased use of the available information in the choice of $\{p_i\}$.

Now suppose we allow $\epsilon_i \to \epsilon$, i.e., a continuous variable. The sum in the partition function is replaced by an integral.

$$Z = \int d\epsilon\; w(\epsilon)\, e^{-(\lambda_2 r(\epsilon) + \lambda_3 \eta(\epsilon))}$$

In the above, $w(\epsilon)$ expresses the degeneracy of the states in the sum, i.e., the density of states: the number of states between $\epsilon$ and $\epsilon + d\epsilon$. Let

$$r = \epsilon \qquad \eta = \epsilon^2$$

$$Z = \int d\epsilon\; w(\epsilon)\, e^{-(\lambda_2 \epsilon + \lambda_3 \epsilon^2)}$$

The probability distribution is then

$$P(\epsilon) \propto w(\epsilon)\, e^{-(\lambda_2 \epsilon + \lambda_3 \epsilon^2)}$$

The Lagrange multipliers $\lambda_2$ and $\lambda_3$ are found by evaluation (taking the partials of $F$ with respect to the constrained expectation values and setting the results equal to zero); these two variables are conjugate to $r$ and $\eta$. Thus knowing the expectation values is equivalent to providing the entropy and partition function. This can be interpreted to mean that the Normal distribution is the best choice of a probability distribution given the mean and variance.

As a further example, suppose that the number of events in a time interval is $k$. Divide the interval into $N$ sub-intervals so that only 0 or 1 events occur in each sub-interval. The number of ways to distribute $k$ events in $N$ sub-intervals is

$$\frac{N!}{(N-k)!\; k!} \approx \frac{N^k}{k!}$$

The partition function with one constraint is

$$Z = \sum_{k=0}^{N} \frac{N^k}{k!}\, e^{-k\lambda} = e^{N e^{-\lambda}}$$

Solve for the Lagrange multiplier. Define the mean of $k$ to be $m$ (the mean number of events in a time interval, which is equivalent to the expectation value $\langle k \rangle = \sum_k p_k\, k$). Let $m = N e^{-\lambda}$, so

$$\lambda = -\ln(m/N) \qquad Z = e^m$$

$$P(k \,|\, \langle k \rangle = m) = \frac{e^{-m}\, m^k}{k!}$$

This illustrates that for an event-counting distribution where only the mean is known, the best choice for the probability distribution is the Poisson distribution.
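A quick check that the resulting Poisson distribution is normalized and reproduces the constrained mean; the value of $m$ is assumed.

```python
import math

m = 3.7   # assumed mean number of events per interval

def poisson(k, m):
    """Maximum-entropy distribution for a counting process with mean m."""
    return math.exp(-m) * m**k / math.factorial(k)

# The distribution is normalized and its mean equals m:
ks = range(60)
print(sum(poisson(k, m) for k in ks))        # ~1.0
print(sum(k * poisson(k, m) for k in ks))    # ~3.7
```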

Finally, look at the case of Fermi-Dirac statistics, where the occupancy of a state is either 0 or 1. The constraints are:

1. The occupancy of a state is $j = 0$ or $j = 1$.

2. The average occupancy over all states is $\langle j \rangle = a$.

3. The average energy of the system is $\langle j E_i \rangle = b$, where $E_i$ is the energy of the $i^{th}$ state.

We maximize the entropy subject to the above conditions. This leads to

$$p_{ij} = \frac{e^{-j(\lambda_1 + \lambda_2 E_i)}}{Z_i}$$

$$Z_i = e^{-0(\lambda_1 + \lambda_2 E_i)} + e^{-1(\lambda_1 + \lambda_2 E_i)} = 1 + e^{-(\lambda_1 + \lambda_2 E_i)}$$

Evaluating the multipliers gives

$$\lambda_1 = -\mu/kT$$

which is the definition of the chemical potential, and

$$\lambda_2 = 1/kT$$

which is the definition of temperature. In the above, $k$ is the Boltzmann constant. Thus the expected occupancy of the $i^{th}$ state is

$$\langle j_i \rangle = 0 \cdot p_{i0} + 1 \cdot p_{i1}$$

$$\langle j_i \rangle = \frac{1}{e^{(E_i - \mu)/kT} + 1}$$
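A short sketch of the resulting occupancy formula; the energies, chemical potential, and temperature are assumed, in units where the Boltzmann constant is 1.

```python
import math

def fermi_dirac(E, mu, kT):
    """Expected occupancy <j> of a state of energy E."""
    return 1.0 / (math.exp((E - mu) / kT) + 1.0)

mu, kT = 1.0, 0.1   # assumed chemical potential and temperature
for E in (0.5, 1.0, 1.5):
    print(E, fermi_dirac(E, mu, kT))
# States well below mu are nearly filled (~1), at E = mu the occupancy
# is exactly 1/2, and states well above mu are nearly empty (~0).
```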