Linear Prediction Theory
Joseph A. O'Sullivan
ESE 524, Spring 2009
March 3, 2009

1 Overview

The problem of estimating a value of a random process given other values of the random process is pervasive. Many problems in forecasting fall into this category. The Kalman filter applies if there is a known underlying state space model for the system. If no such state space model exists, then some approximation must be used. If the covariance function for a zero mean random process is known, then the use of a linear predictor is often a good choice. In linear prediction, the next value is estimated as a linear combination of past values, the parameters being chosen to minimize the mean square error between the next value and its estimate. If the true distribution is Gaussian, then the linear estimator of a given order minimizes the mean square error over all estimators of that order. This ability to use the algorithms independent of the true distribution makes them universal in the sense that they achieve good performance for any true distribution. When the true distribution is an autoregressive model of order k, then the kth order linear filter is optimal. In that case, the k previous inputs can be considered the state, and the Kalman filter applied. The steady state version of the Kalman filter is equivalent to the kth order linear filters here.

The choice of the order of a linear estimator (that is, the number of past values used to predict the current value) is important. The mean square error monotonically decreases with increasing order of the estimator. Including too many coefficients increases complexity unnecessarily. In addition, if there is some uncertainty in the covariance function, then increasing the model order beyond some critical value may lead to overfitting the data, increasing the mean square error. When the model order increases, the coefficients used may be computed recursively.
These recursive computations are classical, being based on efficient inversion that exploits the Toeplitz structure of the data covariance matrix. For the theoretical analysis, we assume Gaussian statistics and derive the optimal estimators. The linear predictors defined here form the basis for much of modern adaptive signal processing, including the least mean square (LMS) and recursive least squares (RLS) algorithms and their many variants. The algorithms are also instructive for many other estimation problems including array signal processing. The filter structures that result include the transversal and lattice filters.

2 Summary of Recursive Estimation Equations

Let r_n be a stationary, zero mean Gaussian random process with covariance function

    c_l = E[r_n r_{n-l}].    (1)
The problem of interest is estimating (predicting) one value of the random process given the k previous values, E[r_n | r_{n-1}, r_{n-2}, ..., r_{n-k}]. Linearity and Gaussian statistics yield

    r̂_n = w_1 r_{n-1} + w_2 r_{n-2} + ... + w_k r_{n-k}    (2)
        = w^T r_k(n-1).    (3)

The coefficients do not depend on time due to stationarity. In this equation,

    r_k(n-1) = [ r_{n-1}  r_{n-2}  ...  r_{n-k} ]^T    (4)
    w = [ w_1  w_2  ...  w_k ]^T.    (5)

The orthogonality principle states that

    E[ (r_n - w^T r_k(n-1)) r_k(n-1)^T ] = 0,    (6)

and thus that

    γ_k^T = w^T Γ_k    (7)
    w = Γ_k^{-1} γ_k.    (8)

Here, Γ_k is a k × k array with (i, j) element equal to c_{i-j}. The vector γ_k is a k × 1 vector

    γ_k = [ c_1  c_2  ...  c_k ]^T.    (9)

Note the following recursive structures:

    γ_{k+1} = [ γ_k ; c_{k+1} ]    (10)

and

    Γ_{k+1} = [ c_0  γ_k^T ; γ_k  Γ_k ]    (11)
            = [ Γ_k  Jγ_k ; (Jγ_k)^T  c_0 ],    (12)

where J is called an exchange matrix and has ones along its antidiagonal and zeros elsewhere; the matrix J has the property that J times a vector equals that vector with its entries reordered from bottom to top. These two decompositions of Γ_{k+1} yield two different, but closely related, ways of recursively computing its inverse. These two ways rely on defining the (k+1) × 1 vector

    a_k = [ 1 ; -w ].    (13)

Using this vector, the forward estimation (prediction) error may be written

    F_k(n) = r_n - Σ_{l=1}^{k} w_l r_{n-l}    (14)
           = a_k^T r_{k+1}(n).    (15)
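As a small numerical sketch of the normal equations (not from the notes; the covariance values below are made up for illustration), consider a process whose covariance decays geometrically, c_l = (1/2)^l. Solving (8) shows that only the first past sample carries predictive weight, and the residual covariance in (6) vanishes:

```python
import numpy as np

# Covariances of a hypothetical process with c_l = (1/2)^l (illustrative only).
c = np.array([1.0, 0.5, 0.25, 0.125])
k = 3

# Gamma_k has (i, j) element c_{i-j}; gamma_k = [c_1, ..., c_k]^T as in (9).
Gamma = np.array([[c[abs(i - j)] for j in range(k)] for i in range(k)])
gamma = c[1:k + 1]

# Normal equations (8): w = Gamma_k^{-1} gamma_k.
w = np.linalg.solve(Gamma, gamma)

# Orthogonality (6)-(7) in covariance form: gamma_k - Gamma_k w = 0.
residual = gamma - Gamma @ w
```

For this covariance, w comes out as (1/2, 0, 0), consistent with a first order autoregressive structure.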
The inverse of Γ_{k+1} using (11) may be written as

    Γ_{k+1}^{-1} = [ 0  0^T ; 0  Γ_k^{-1} ] + (1/p_k) a_k a_k^T,    (16)

where

    p_k = c_0 - γ_k^T Γ_k^{-1} γ_k    (17)
        = E[ (r_n - w^T r_k(n-1))^2 ]    (18)
        = E[ F_k(n)^2 ]    (19)

is the kth order forward prediction error variance. Note that

    Γ_{k+1} a_k = [ c_0  γ_k^T ; γ_k  Γ_k ] [ 1 ; -w ]    (20)
                = [ c_0 - γ_k^T w ; γ_k - Γ_k w ]    (21)
                = [ p_k ; 0 ].    (22)

This result actually verifies the inversion formula above because

    Γ_{k+1}^{-1} Γ_{k+1} = ( [ 0  0^T ; 0  Γ_k^{-1} ] + (1/p_k) a_k a_k^T ) Γ_{k+1}    (23)
                         = [ 0  0^T ; Γ_k^{-1} γ_k  I ] + (1/p_k) a_k [ p_k  0^T ]    (24)
                         = [ 0  0^T ; w  I ] + [ 1  0^T ; -w  0 ]    (25)
                         = I.    (26)

To write the inverse related to the second decomposition of Γ_{k+1} in a similar form, we define a new vector b_k to have the elements of a_k in the opposite order. Then we have the following equations:

    J Γ_{k+1} J = Γ_{k+1}    (27)
    J a_k = b_k    (28)
    J J = I.    (29)

The vector b_k determines the backward prediction error G_k(n),

    G_k(n) = r_{n-k} - E[ r_{n-k} | r_{n-k+1}, ..., r_{n-1}, r_n ]    (30)
           = r_{n-k} - (Jw)^T r_k(n)    (31)
           = r_{k+1}(n)^T b_k.    (32)

This vector satisfies

    b_k = [ -Jw ; 1 ]    (33)
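The identities (16) and (20)–(22) can be checked numerically. The sketch below (illustrative covariance values, not from the notes) verifies that Γ_{k+1} a_k has p_k in its first entry and zeros elsewhere, and that the block-plus-rank-one formula reproduces the true inverse:

```python
import numpy as np

# Hypothetical covariance sequence (c_0 first; values for illustration only).
c = np.array([1.0, 0.5, 0.25, 0.125])
k = 3

Gamma_k = np.array([[c[abs(i - j)] for j in range(k)] for i in range(k)])
gamma_k = c[1:k + 1]
w = np.linalg.solve(Gamma_k, gamma_k)

a = np.concatenate(([1.0], -w))        # (13): a_k = [1; -w]
p = c[0] - gamma_k @ w                 # (17): p_k = c_0 - gamma_k^T Gamma_k^{-1} gamma_k

# (20)-(22): Gamma_{k+1} a_k = [p_k; 0].
Gamma_k1 = np.array([[c[abs(i - j)] for j in range(k + 1)] for i in range(k + 1)])
lhs = Gamma_k1 @ a

# (16): block of Gamma_k^{-1} plus a rank-one term gives Gamma_{k+1}^{-1}.
block = np.zeros((k + 1, k + 1))
block[1:, 1:] = np.linalg.inv(Gamma_k)
inv_formula = block + np.outer(a, a) / p
```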
and

    J Γ_{k+1} a_k = J Γ_{k+1} J J a_k    (34)
                  = J Γ_{k+1} J b_k    (35)
                  = Γ_{k+1} b_k.    (36)

We also have

    J Γ_{k+1} a_k = J [ p_k ; 0 ]    (37)
                  = [ 0 ; p_k ],    (38)

so

    Γ_{k+1} b_k = [ 0 ; p_k ].    (39)

The inverse of Γ_{k+1} using (12) may be written as

    Γ_{k+1}^{-1} = [ Γ_k^{-1}  0 ; 0^T  0 ] + (1/p_k) b_k b_k^T.    (40)

The recursive structure is further clarified through

    w_{k+1} = Γ_{k+1}^{-1} γ_{k+1}    (41)
            = ( [ Γ_k^{-1}  0 ; 0^T  0 ] + (1/p_k) b_k b_k^T ) [ γ_k ; c_{k+1} ]    (42)
            = [ w ; 0 ] + (Δ_k / p_k) b_k,    (43)

where

    Δ_k = b_k^T γ_{k+1}    (44)
        = a_k^T J γ_{k+1}.    (45)

Plugging this last form for w_{k+1} into the definition of a_{k+1} yields

    a_{k+1} = [ 1 ; -w_{k+1} ]    (46)
            = [ 1 ; -w ; 0 ] - (Δ_k / p_k) [ 0 ; b_k ]    (47)
            = [ a_k ; 0 ] - (Δ_k / p_k) [ 0 ; b_k ].    (48)

Similarly, plugging into the definition of b_{k+1} yields

    b_{k+1} = J a_{k+1}    (49)
            = J [ a_k ; 0 ] - (Δ_k / p_k) J [ 0 ; b_k ]    (50)
            = [ 0 ; b_k ] - (Δ_k / p_k) [ a_k ; 0 ].    (51)

2.1 Recursive Transversal Filter Coefficient Computation

Inputs: c_0, c_1, c_2, ...
Outputs: prediction error filters in transversal and lattice forms.

1. Initialization step: k = 0; p_0 = c_0; a_0 = b_0 = 1; γ_1 = c_1; Γ_1 = c_0.

2. Reflection coefficient and prediction variance computation:

    Δ_k = a_k^T J γ_{k+1}    (52)
    p_k = a_k^T [ c_0 ; γ_k ].    (53)

3. Update forward and backward prediction error filters:

    a_{k+1} = [ a_k ; 0 ] - (Δ_k / p_k) [ 0 ; b_k ]    (54)
    b_{k+1} = [ 0 ; b_k ] - (Δ_k / p_k) [ a_k ; 0 ].    (55)

4. Recursion step: k = k + 1;

    γ_{k+1} = [ γ_k ; c_{k+1} ];    (56)

return to the reflection coefficient and prediction variance computation.

The computational complexity of this algorithm is determined by the computations for the reflection coefficient and the filter update steps. In the reflection coefficient and prediction variance computation, there are 2k multiplies and 2k additions. In the two filter update equations, the multiplies are all the same and the resulting values are just reordered versions of each other (reordered using J). Thus, there is one division and there are k multiplies and k additions. The number of computations is thus 3k multiplications, 1 division, and 3k additions per stage. The total number of computations from stage 1 through stage k is 3k(k+1)/2 multiplies, k divisions, and 3k(k+1)/2 additions. Some divisions can be avoided by using as input the sequence of correlation coefficients 1, c_1/c_0, c_2/c_0, ... instead of c_0, c_1, c_2, ..., the sequence of correlations.

2.2 Lattice Filter Structure

Two equivalent representations of the filters described above are the transversal and lattice filter representations. The transversal filters are described in terms of the coefficients a_1, a_2, ..., a_k and b_1, b_2, ..., b_k. The lattice filters are described in terms of the coefficients Δ_0/p_0, Δ_1/p_1, ..., Δ_{k-1}/p_{k-1}.
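The four steps above can be sketched directly in code. This is an illustrative implementation (names and the test covariance sequence are made up, not from the notes); it follows (52)–(56) and is cross-checked against the direct solution of the normal equations (8):

```python
import numpy as np

def transversal_filters(c, K):
    """Run the recursion above on covariances c = [c_0, ..., c_K].
    Returns the order-K filters a_K, b_K and the error variance p_K."""
    a = np.array([1.0])                        # initialization: a_0 = 1
    b = np.array([1.0])                        # initialization: b_0 = 1
    for k in range(K):
        p = a @ c[:k + 1]                      # (53): p_k = a_k^T [c_0; gamma_k]
        Delta = a @ c[1:k + 2][::-1]           # (52): Delta_k = a_k^T J gamma_{k+1}
        a, b = (np.concatenate((a, [0.0])) - (Delta / p) * np.concatenate(([0.0], b)),
                np.concatenate(([0.0], b)) - (Delta / p) * np.concatenate((a, [0.0])))
    p = a @ c[:K + 1]                          # final prediction error variance p_K
    return a, b, p

# Covariances of a hypothetical AR(1)-like process, c_l = (1/2)^l (illustrative).
c = np.array([1.0, 0.5, 0.25, 0.125])
a, b, p = transversal_filters(c, 3)

# Cross-check against the direct solution w = Gamma_k^{-1} gamma_k of (8).
Gamma = np.array([[c[abs(i - j)] for j in range(3)] for i in range(3)])
w = np.linalg.solve(Gamma, c[1:4])
```

Note that b comes out as a with its entries reversed, as (28) requires.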
The update equations (54) and (55) may be used to describe the lattice structure in terms of the forward prediction errors F_k(n) and the backward prediction errors G_k(n):

    F_{k+1}(n) = a_{k+1}^T r_{k+2}(n)    (57)
               = [ a_k ; 0 ]^T r_{k+2}(n) - (Δ_k / p_k) [ 0 ; b_k ]^T r_{k+2}(n)    (58)
               = a_k^T r_{k+1}(n) - (Δ_k / p_k) b_k^T r_{k+1}(n-1)    (59)
               = F_k(n) - (Δ_k / p_k) G_k(n-1),    (60)

where the key step is recognizing that the zeros in [ a_k ; 0 ]^T and [ 0 ; b_k ]^T correspond to reducing the length of r_{k+2}(n) from k+2 to k+1, with a shift of one time unit in the latter case. Similarly,

    G_{k+1}(n) = b_{k+1}^T r_{k+2}(n)    (61)
               = [ 0 ; b_k ]^T r_{k+2}(n) - (Δ_k / p_k) [ a_k ; 0 ]^T r_{k+2}(n)    (62)
               = b_k^T r_{k+1}(n-1) - (Δ_k / p_k) a_k^T r_{k+1}(n)    (63)
               = G_k(n-1) - (Δ_k / p_k) F_k(n).    (64)

In matrix form, one stage of the lattice filter has the form

    [ F_{k+1}(n) ; G_{k+1}(n) ] = [ 1  -Δ_k/p_k ; -Δ_k/p_k  1 ] [ F_k(n) ; G_k(n-1) ].    (65)

If the lattice filter structure is used in the implementation, then the multiplications needed to update a_k and b_k are not needed. The computational complexity of the algorithm may be reduced even further as described below.

2.3 Faster Computations

In the recursive algorithm, the computations associated with the reflection coefficient and the prediction variance can essentially be eliminated. To see this, consider the forward and backward filters used with inputs equal to the correlations c_n. At time 0 (noting that c_{-n} = c_n), the output of the forward prediction error filter of order k equals c_0 - w^T γ_k, which equals p_k. At time -1, the output of the backward prediction error filter of order k equals

    c_{k+1} - (Jw)^T γ_k = b_k^T γ_{k+1}    (66)
                         = Δ_k.    (67)

In this interpretation of the computations, the filters up through order k may be used to compute the quantities needed for the next update by simply running the correlation coefficients through the filters. This saves computations by using the lattice filter structure:

    [ p_k ; Δ_k ] = [ 0  b_k^T ; a_k^T  0 ] [ c_{k+1}  c_k  ...  c_1  c_0 ]^T,    (68)
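The equivalence of the lattice stages (65) and the transversal filters can be demonstrated numerically. In this sketch the input data and reflection ratios Δ_k/p_k are arbitrary illustrative values (not from the notes); with zero initial conditions the lattice outputs match convolution with a_k and b_k exactly, transients included:

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.standard_normal(50)     # arbitrary test input (illustrative)

kappa = [0.5, -0.2]             # hypothetical reflection ratios Delta_k/p_k

# Order recursion (54)-(55) for the transversal coefficients.
a = np.array([1.0])
b = np.array([1.0])
for kap in kappa:
    a, b = (np.concatenate((a, [0.0])) - kap * np.concatenate(([0.0], b)),
            np.concatenate(([0.0], b)) - kap * np.concatenate((a, [0.0])))

# Lattice stages (65): F_{k+1}(n) = F_k(n) - kap*G_k(n-1),
#                      G_{k+1}(n) = G_k(n-1) - kap*F_k(n).
F = r.copy()
G = r.copy()
for kap in kappa:
    G_prev = np.concatenate(([0.0], G[:-1]))   # one-sample delay, zero initial state
    F, G = F - kap * G_prev, G_prev - kap * F

# Transversal outputs: convolutions with a and b (zero initial conditions).
F_direct = np.convolve(r, a)[:len(r)]
G_direct = np.convolve(r, b)[:len(r)]
```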
where the zeros in the matrix on the right side are scalars; that is, the matrix is 2 × (k + 2) and has a zero in the upper left and bottom right corners. With the correlation sequence as the input, each stage of this computation again takes the lattice form

    [ F_{k+1}(n) ; G_{k+1}(n) ] = [ 1  -Δ_k/p_k ; -Δ_k/p_k  1 ] [ F_k(n) ; G_k(n-1) ].    (69)

2.4 Key Properties

2.5 Autoregressive Gaussian Processes

A zero mean, stationary, Gaussian random process r_1, r_2, ... is an mth order autoregressive process if

    r_n = -a_1 r_{n-1} - a_2 r_{n-2} - ... - a_m r_{n-m} + w_n    (70)

for all n, where the w_n are independent and identically distributed Gaussian random variables with zero mean and variance σ². An mth order autoregressive process is mth order Markov in the sense that the probability density function of r_n given r_{n-1}, r_{n-2}, ..., r_1 equals the probability density function of r_n given r_{n-1}, r_{n-2}, ..., r_{n-m}. Defining the vector

    a_m = [ 1  a_1  a_2  ...  a_m ]^T,

(70) may be rewritten as

    a_m^T r_{m+1}(n) = w_n.    (71)

Let the covariance function for the random process be C_k, so

    C_k = E{ r_n r_{n-k} }.    (72)

Comment: In order for this equation to model a stationary random process and to be viewed as a generative model for the data, the corresponding discrete time system must be stable. That is, if one were to compute the transfer function in the Z-transform domain, then all of the poles of the transfer function must be inside the unit disk in the complex plane. These poles are the roots of the characteristic equation with coefficients a_j.

a. Using the autoregressive model in equation (70), show that the covariance function satisfies the equations

    C_0 + a_1 C_1 + a_2 C_2 + ... + a_m C_m = σ²    (73)
    C_k + a_1 C_{k-1} + a_2 C_{k-2} + ... + a_m C_{k-m} = 0,    (74)

where the second equation holds for all k > 0. Hint: Multiply both sides of (70) by a value of the random sequence and take expected values. Use the symmetry property of covariance functions for the first equality.

2.6 Background and Understanding of Autoregressive Models

Suppose that r_1, r_2, ... is a stationary sequence of Gaussian random variables with zero mean. The covariance function is determined by an autoregressive model which the random variables satisfy.
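The Yule-Walker equations (73)–(74) can be checked numerically. In this sketch (coefficients chosen arbitrarily for illustration, not from the notes), the equations for k = 0, 1, 2 form a linear system that determines C_0, C_1, C_2 for an m = 2 model, and (74) then extends the covariance to higher lags:

```python
import numpy as np

# Hypothetical AR(2) coefficients and noise variance (illustrative only).
a1, a2, sigma2 = -0.5, 0.2, 1.0   # model: r_n = -a1 r_{n-1} - a2 r_{n-2} + w_n

# (73) and (74) for k = 1, 2 as a linear system in (C_0, C_1, C_2).
M = np.array([[1.0, a1, a2],        # (73): C_0 + a1 C_1 + a2 C_2 = sigma^2
              [a1, 1.0 + a2, 0.0],  # k=1:  C_1 + a1 C_0 + a2 C_1 = 0 (uses C_{-1} = C_1)
              [a2, a1, 1.0]])       # k=2:  C_2 + a1 C_1 + a2 C_0 = 0
C0, C1, C2 = np.linalg.solve(M, [sigma2, 0.0, 0.0])

# (74) gives each further lag recursively:
C3 = -(a1 * C2 + a2 * C1)
```

For these values the poles are inside the unit disk, so the solution yields a valid (positive) variance C_0.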
The autoregressive model is an mth order Markov model, meaning that

    p(r_n | r_{n-1}, r_{n-2}, ..., r_1) = p(r_n | r_{n-1}, r_{n-2}, ..., r_{n-m}).

More specifically, suppose that the observations satisfy the model (70).

b. Derive a recursive structure for computing the logarithm of the probability density function of r_n, r_{n-1}, ..., r_1. More specifically, let

    v_n = ln p(r_1, r_2, ..., r_n).    (75)
Derive an expression for v_n in terms of v_{n-1} and an update. Focus on the case where n > m. Hint: This is a key part of the problem, so make sure you do it correctly. It relates directly to the Markov property expressed through the autoregressive model in (70).

c. Consider the special case of m = 1. Suppose that C_0 = 1. Find a relationship between a_1 and σ² (essentially you must solve (73) and (74) in this special case). Comment: Note that the stability requirement implies that |a_1| < 1.

2.7 Recursive Detection for Autoregressive Models

Suppose that one has to decide whether data arise from an autoregressive model or from white noise. In this problem, the log-likelihood ratio is computed recursively. Under hypothesis H_1, the data arise from the autoregressive model (70). Under hypothesis H_0, the data R_n are i.i.d. Gaussian with zero mean and variance C_0. That is, under either hypothesis the marginal distribution on any sample R_n is the same. The only difference between the two models is in the covariance structure.

a. Find the log-likelihood ratio for n samples. Call this log-likelihood ratio l_n. Derive a recursive expression for l_n in terms of l_{n-1} and an update. Focus on the case n > m.

b. Consider the special case of m = 1. Write down the recursive structure for this case.

c. The performance increases as n grows. This can be quantified in various ways. One way is to compute the information rate functions for each n. In this problem, you will compute a special case. Consider again m = 1. Find the log-moment generating function for the difference between l_n and l_{n-1} conditioned on each hypothesis, and conditioned on previous measurements; call these two log-moment generating functions m_0(s) and m_1(s):

    m_i(s) = ln E{ e^{s(l_n - l_{n-1})} | H_i, r_1, r_2, ..., r_{n-1} }.    (76)

Compute and plot the information rate functions I_0(x) and I_1(x) for these two log-moment generating functions. Comment: These two functions quantify the increase in information for detection provided by the new measurement.
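For the m = 1 special case in part c of the Background problem above, the claimed relationship can be checked numerically. The coefficient value below is an arbitrary illustration (any |a_1| < 1 works); with C_0 = 1, equations (73)–(74) force σ² = 1 - a_1² and C_k = (-a_1)^k:

```python
# Check of the m = 1 special case: with C_0 = 1, the Yule-Walker equations
# (73)-(74) give sigma^2 = 1 - a_1^2. The coefficient below is arbitrary.
a1 = 0.6            # any value with |a1| < 1 (stability)
C0 = 1.0

C1 = -a1 * C0                # (74) with k = 1: C_1 + a1 C_0 = 0
sigma2 = C0 + a1 * C1        # (73): C_0 + a1 C_1 = sigma^2

# Higher lags follow the same recursion, so C_k = (-a1)^k.
C = [C0, C1]
for k in range(2, 6):
    C.append(-a1 * C[k - 1])     # (74): C_k + a1 C_{k-1} = 0
```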
2.8 Recursive Estimation for Autoregressive Models

In this problem, you will estimate the parameters in an autoregressive model given observations of the data r_n, r_{n-1}, ..., r_1.

a. First, assume that the maximum likelihood estimate for the parameters given data r_n, r_{n-1}, ..., r_1 satisfies

    B_n â_n = d_n,    (77)

where the vector â_n is the maximum likelihood estimate of the parameter vector

    a = [ a_1  a_2  ...  a_m ]^T.    (78)

Find the update equations for B_n and d_n. These may be obtained by writing down the likelihood equation using the recursive update for the log-likelihood function, and taking the derivative with respect to the parameter vector.

b. The computation for â_n may also be written in recursive form. This is accomplished using the matrix inversion lemma. The matrix inversion lemma states that a rank-one update to a matrix yields a rank-one
update to its inverse. More specifically, if A is an m × m symmetric, invertible matrix and f is an m × 1 vector, then

    (A + f f^T)^{-1} = A^{-1} - (A^{-1} f f^T A^{-1}) / (1 + f^T A^{-1} f).    (79)

Use this equation to derive an equation for the estimate â_n in terms of â_{n-1}. Hint: The final form should look like

    â_n = â_{n-1} + g_n [ r_n + â_{n-1}^T ( r_{n-1}  r_{n-2}  ...  r_{n-m} )^T ],    (80)

where an auxiliary equation defines the vector g_n in terms of B_n and the appropriate definition of f.

2.9 Recursive Detection: Order 1 Versus Order 2 Autoregressive Model

A decision must be made between two models for a sequence of Gaussian distributed random variables. Each model is an autoregressive model. The first model is autoregressive of order one, while the second is autoregressive of order two. There are two goals here as outlined below. First, the optimal test statistic for a Neyman-Pearson test must be computed for a fixed number N of consecutive samples of a realization. Second, an efficient update of this test statistic to the case with N + 1 samples must be derived.

Consider the following two hypotheses. Under H_1, the model for the measurements is

    y_i = 0.75 y_{i-1} + w_i,    (81)

where the w_i are independent and identically distributed Gaussian random variables with zero mean and variance equal to 7/4 = 1.75; the w_i are independent of y_0 for all i; and y_0 is Gaussian distributed with zero mean and variance 4. Under H_2, the model for the measurements is

    y_i = 0.75 y_{i-1} + 0.2 y_{i-2} + w_i,    (82)

where the w_i are independent and identically distributed Gaussian random variables with zero mean and variance equal to 1.75; the w_i are independent of y_0 for all i; y_0 is Gaussian distributed with zero mean and variance 4; and y_1 = 0.75 y_0 + w_1, where w_1 is a zero mean Gaussian random variable with variance 1.75.

a. Given y_0, y_1, ..., y_N, find the optimal test statistic for a Neyman-Pearson test. Simplify the expression as much as possible. Interpret your answer.

b. Denote the test statistic computed in part a by l_N. The optimal test statistic for N + 1 measurements is l_{N+1}.
Find an efficient update rule for computing l_{N+1} from l_N.
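As a sanity check on the H_1 model parameters (the noise variance 7/4 is the reconstruction used above), the stated variance of y_0 should equal the stationary variance of the AR(1) recursion, so that the process is stationary from the first sample:

```python
phi = 0.75          # AR(1) coefficient under H_1
sigma2 = 7.0 / 4.0  # innovation variance under H_1 (as given above)

# Stationary variance of y_i = phi*y_{i-1} + w_i: C_0 = sigma2 / (1 - phi^2).
C0 = sigma2 / (1.0 - phi ** 2)

# Propagating var(y_i) = phi^2 var(y_{i-1}) + sigma2 from var(y_0) = 4
# leaves the variance fixed, confirming stationarity from the start.
v = 4.0
for _ in range(20):
    v = phi ** 2 * v + sigma2
```

Both computations give 4, matching the stated variance of y_0.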