Probability and Random Processes


1 Probability and Random Processes

Christian Schlegel, co-author: Ali M. Bassam
Ultra Maritime Digital Communications Center, website: umdcc.ca

"So far as mathematics do not tend to make men more sober and rational thinkers, wiser and better men, they are only to be considered an amusement, which ought not to take us off from serious business." -- Bayes, 1760

2 The Normal (Gaussian) Distribution

Let us revisit the binomial distribution for $p = q = 0.5$. Also, let $n = 2\nu$. Then we have for the binomial term
$$a_k = \binom{2\nu}{\nu + k}\,2^{-2\nu}$$
$a_k$ is the $k$-th term from the center of the symmetric binomial distribution and $a_0$ is the central term. Expanding we obtain
$$a_k = \frac{(2\nu)!}{(\nu+k)!\,(\nu-k)!}\,2^{-2\nu} = \underbrace{\frac{(2\nu)!}{\nu!\,\nu!}\,2^{-2\nu}}_{a_0}\,\cdot\,\frac{\nu(\nu-1)\cdots(\nu-k+1)}{(\nu+1)(\nu+2)\cdots(\nu+k)}$$
For large $\nu$ we can simplify as follows:
$$a_k \approx a_0 \prod_{j=1}^{k}\frac{1 - j/\nu}{1 + j/\nu} \approx a_0\,\frac{\prod_{j=1}^{k} e^{-j/\nu}}{\prod_{j=1}^{k} e^{j/\nu}} = a_0 \exp\left(-\frac{2}{\nu}\sum_{j=1}^{k} j\right) \approx a_0 \exp\left(-\frac{k^2}{\nu}\right)$$
where we used the limiting expression for the exponential function, i.e., $1 \pm j/\nu \approx e^{\pm j/\nu}$ for large $\nu$.

We now also need a limit expression for the term $a_0$, which is afforded us by Stirling's formula:
$$n! \approx \sqrt{2\pi}\;n^{n+1/2}\,e^{-n}$$
Applying Stirling to the central term we obtain $a_0 \approx \dfrac{1}{\sqrt{\pi\nu}}$, and
$$a_k \approx \frac{1}{\sqrt{\pi\nu}}\exp\left(-\frac{k^2}{\nu}\right)$$

3 The Normal (Gaussian) Distribution

Normalizing, we can relate the binomial coefficients to a real-valued function in $x$ by letting $x = k/\sqrt{\nu/2}$, and we obtain
$$a_k\,\sqrt{\frac{\nu}{2}} \approx N(x), \qquad\text{where}\qquad N(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$$
The probability density function (PDF) $N(x)$ is called the Gaussian or normal probability distribution function. We have our first limit theorem, which is a version of the Central Limit Theorem:
$$a_k\,\sqrt{\frac{\nu}{2}} \;\xrightarrow{\nu\to\infty}\; N(x)$$

[Figure: example for small $n$ illustrating the close fit between the scaled binomial terms $a_k\sqrt{\nu/2}$ and $N(x)$.]
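The limit can be checked numerically. The following short Python/NumPy sketch (not part of the original notes; the value of $\nu$ is chosen only for illustration) compares the scaled central binomial terms $a_k\sqrt{\nu/2}$ against $N(x)$ at $x = k/\sqrt{\nu/2}$:

```python
import numpy as np
from scipy.stats import binom, norm

nu = 50                      # n = 2*nu coin flips with p = 0.5
n = 2 * nu
k = np.arange(-20, 21)       # offsets from the central term

# a_k = C(2*nu, nu + k) * 2^(-2*nu), the k-th term from the center
a_k = binom.pmf(nu + k, n, 0.5)

# normalization x = k / sqrt(nu/2); the claim is a_k * sqrt(nu/2) -> N(x)
x = k / np.sqrt(nu / 2)
approx = norm.pdf(x)         # N(x) = exp(-x^2/2) / sqrt(2*pi)

print(np.max(np.abs(a_k * np.sqrt(nu / 2) - approx)))   # small for large nu
```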

4 Random Walks

Let the random variable $S_n$ be defined as the sum
$$S_n = \sum_{i=1}^{n} X_i, \qquad X_i = \begin{cases} +1, & p_{X_i}(1) = 0.5 \\ -1, & p_{X_i}(-1) = 0.5 \end{cases}$$
The partial sum $S_n$ is a discrete, non-stationary random process.

[Figure: examples of random walks; three sample processes of $S_n$ versus $n$.]

Applications: A random process can model the behavior of a phase-locked loop outside the pull-in range. This model describes the random motion of particles observed under thermal agitation, called Brownian motion. The output of an integrator driven by zero-mean random noise is described by a random walk.
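Sample paths of this random walk are easy to generate; the following sketch (illustrative only, not from the notes) draws the $\pm 1$ steps, accumulates the partial sums $S_n$, and exhibits the non-stationarity through the linearly growing variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths = 1000, 5000

# X_i = +1 or -1 with probability 0.5 each
steps = rng.choice([-1, 1], size=(n_paths, n_steps))
S = np.cumsum(steps, axis=1)       # each row is one sample path S_1, ..., S_n

# non-stationarity: the variance of S_n grows linearly in n
for n in (10, 100, 1000):
    print(n, S[:, n - 1].var())    # approximately equal to n
```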

5 The Laws of Large Numbers

Weak Law of Large Numbers: Let $S_n = \sum_{i=1}^{n} X_i$, where the $X_i$ are independently and identically distributed random variables and $E[X_i] = m_X$ and $\mathrm{var}(X_i) = \sigma_X^2$ are the mean and variance of $X_i$. Then
$$\Pr\left(\left|\frac{S_n}{n} - m_X\right| < \epsilon\right) \;\xrightarrow{n\to\infty}\; 1$$
where $\epsilon > 0$ is arbitrarily small.

Proof: We need Chebyshev's inequality. Given $X$ with mean $m_X$ and finite variance $\sigma_X^2$, we find that
$$\sigma_X^2 = \int_{-\infty}^{\infty}(x - m_X)^2 f_X(x)\,dx \ge \int_{|x - m_X|\ge\epsilon}(x - m_X)^2 f_X(x)\,dx \ge \epsilon^2\,\Pr\left(|X - m_X| \ge \epsilon\right)$$
and hence $\Pr(|X - m_X| \ge \epsilon) \le \sigma_X^2/\epsilon^2$. Applying this to $S_n/n$ we find
$$E\left[\frac{S_n}{n}\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = m_X$$
$$\mathrm{var}\left(\frac{S_n}{n}\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[X_i X_j] - m_X^2 = \frac{1}{n}E[X^2] + \frac{n-1}{n}m_X^2 - m_X^2 = \frac{1}{n}\left(E[X^2] - m_X^2\right) = \frac{\sigma_X^2}{n}$$
Applying these terms in the Chebyshev inequality we obtain
$$\Pr\left(\left|\frac{S_n}{n} - m_X\right| \ge \epsilon\right) \le \frac{\sigma_X^2}{n\epsilon^2} \;\xrightarrow{n\to\infty}\; 0 \qquad\text{Q.E.D.}$$
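A quick empirical check (an illustrative sketch, not part of the notes): the fraction of sample averages $S_n/n$ that deviate from $m_X$ by more than $\epsilon$ shrinks with $n$, and the Chebyshev bound $\sigma_X^2/(n\epsilon^2)$ holds throughout. Here the $X_i$ are taken Uniform(0,1) purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m_X, sigma2, eps = 0.5, 1.0 / 12, 0.05   # Uniform(0,1): mean 1/2, variance 1/12

for n in (10, 100, 1000):
    X = rng.uniform(0, 1, size=(20000, n))
    dev = np.abs(X.mean(axis=1) - m_X)
    # empirical Pr(|S_n/n - m_X| >= eps) versus the Chebyshev bound
    print(n, (dev >= eps).mean(), "<=", sigma2 / (n * eps**2))
```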

6 Strong Law of Large Numbers

Define the event $A_k$ as $|S_k/k - m_X| < \epsilon$, that is, the $k$-th partial sum $S_k/k$ is within $\epsilon$ of the mean $m_X$.

[Figure: the band $(m_X - \epsilon,\, m_X + \epsilon)$ around the mean; beyond some index $r$ all partial sums $S_k/k$ remain inside the band.]

We are looking at the simultaneous occurrence of $A_{r+1}, \ldots, A_{r+\nu}$, for $\nu \to \infty$. The strong law of large numbers says the following:

Strong Law of Large Numbers: Given the conditions as for the weak law of large numbers, there exists an $r < \infty$ such that
$$\Pr\left(\bigcap_{i=1}^{\infty} A_{r+i}\right) > 1 - \rho$$
where $\rho > 0$ can be made as small as desired. The average $S_n/n$ tends to the mean $m_X$ with probability 1:
$$\frac{S_n}{n} \;\xrightarrow{\text{with probability }1}\; m_X$$

7 Strong Law of Large Numbers: Proof

The proof of this statement has several steps and involves the union bound and Chernoff's inequality. First we apply the union bound:
$$\Pr\left(\overline{\bigcap_{i=1}^{\infty} A_{r+i}}\right) = \Pr\left(\bigcup_{i=1}^{\infty}\bar A_{r+i}\right) \le \sum_{i=1}^{\infty}\Pr\left(\bar A_{r+i}\right)$$
In order to proceed we need a bound for
$$\Pr\left(|X - m_X| \ge \epsilon\right) = \int_{|x - m_X|\ge\epsilon} f_X(x)\,dx = \int_{-\infty}^{\infty} I(x)\,f_X(x)\,dx$$
where $I(x)$ is the indicator function of the event $\{|x - m_X| \ge \epsilon\}$.

[Figure: the indicator function $I(x)$, equal to 1 outside $(m_X - \epsilon,\, m_X + \epsilon)$, together with the exponential overbounds $\exp(\lambda(x - m_X - \epsilon))$ and $\exp(-\lambda(x - m_X + \epsilon))$.]

The indicator function itself is not the big invention; however, we are free to apply any suitable upper bound $g(x) \ge I(x)$ on the indicator function and write
$$\Pr\left(|X - m_X| \ge \epsilon\right) \le \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx$$
The key is that evaluating this integral may be much simpler than the exact evaluation.

8 Strong Law of Large Numbers: Proof

We can overbound using the Chernoff bound, which uses the exponential function as an overbound of $I(x)$:
$$I(x) \le \exp\left(\lambda(x - m_X - \epsilon)\right) + \exp\left(-\lambda(x - m_X + \epsilon)\right)$$
where $\lambda \ge 0$ is the Chernoff factor, which can be optimized to tighten the bound (see the previous page). Concentrating on the upper half of the indicator function, we write
$$\Pr\left(X - m_X \ge \epsilon\right) \le \int_{-\infty}^{\infty}\exp\left(\lambda(x - m_X - \epsilon)\right)f_X(x)\,dx = E\left[\exp\left(\lambda(X - m_X - \epsilon)\right)\right]$$
The parameter $\lambda$ is arbitrary and can be chosen to minimize the right-hand side and tighten the bound accordingly:
$$0 = \frac{d}{d\lambda}E\left[\exp\left(\lambda(X - m_X - \epsilon)\right)\right] = E\left[\frac{d}{d\lambda}\exp\left(\lambda(X - m_X - \epsilon)\right)\right] = E\left[(X - m_X - \epsilon)\exp\left(\lambda(X - m_X - \epsilon)\right)\right]$$
which leads to the implicit equation for the minimizing $\lambda_0$, given by
$$\frac{E\left[(X - m_X)\exp\left(\lambda_0(X - m_X)\right)\right]}{E\left[\exp\left(\lambda_0(X - m_X)\right)\right]} = \epsilon$$
From this we can now find $\lambda_0$ and obtain
$$\Pr\left(X - m_X \ge \epsilon\right) \le E\left[\exp\left(\lambda_0(X - m_X - \epsilon)\right)\right] = \beta(\epsilon)$$

9 Strong Law of Large Numbers: Proof

In the next step we apply the Chernoff bound to $S_n/n$, whose mean is also $m_X$:
$$\Pr\left(\frac{S_n}{n} - m_X \ge \epsilon\right) \le E\left[\exp\left(\lambda\left(\frac{S_n}{n} - m_X - \epsilon\right)\right)\right] \;\overset{\lambda = n\lambda_0}{=}\; \prod_{i=1}^{n} E\left[\exp\left(\lambda_0(X_i - m_X - \epsilon)\right)\right] = \beta^n(\epsilon)$$
We now go back to the union bound term we wish to compute:
$$\Pr\left(\bigcup_{i=1}^{\nu}\bar A_{r+i}\right) \le \sum_{i=1}^{\nu}\beta^{r+i}(\epsilon) = \beta^r(\epsilon)\sum_{i=1}^{\nu}\beta^{i}(\epsilon) \le \beta^r(\epsilon)\,\frac{\beta(\epsilon)}{1 - \beta(\epsilon)} \;\xrightarrow{r\to\infty}\; 0, \qquad\text{iff } \beta(\epsilon) < 1$$
This will prove the strong law of large numbers, as long as we can show that there exists a $\lambda_0$ such that $\beta(\epsilon) < 1$ (strictly less). To show this fact, consider $f(\lambda) \triangleq E\left[\exp\left(\lambda(X - m_X - \epsilon)\right)\right]$ and
$$\frac{d}{d\lambda}f(\lambda)\Big|_{\lambda=0} = E\left[X - m_X - \epsilon\right] = -\epsilon$$
Hence, $f(\lambda)$ has negative slope at $\lambda = 0$, and furthermore $f(0) = 1$. Therefore, there must exist a $\lambda_0 > 0$ such that $f(\lambda_0) < 1$, since $f(\lambda)$ is continuous in $\lambda$.

[Figure: $f(\lambda)$ with $f(0) = 1$, negative slope at $\lambda = 0$, and $\beta(\epsilon) = f(\lambda_0) < 1$ at the minimizing $\lambda_0$.]

10 Transformation of Random Variables

Consider a random variable $X$ that is subject to a transformation, for example $y = g(x) = x^2$. Clearly, $Y$ is also a random variable, but has a different CDF and PDF. We proceed to compute its CDF as follows:
$$\Pr\left(y < Y \le y + \Delta y\right) = F_Y(y + \Delta y) - F_Y(y) \approx f_Y(y)\,\Delta y = \Pr\left(x : y < g(x) \le y + \Delta y\right)$$
For $y = x^2$ there exist two solution branches, $x_1 = -\sqrt{y}$ and $x_2 = +\sqrt{y}$, and
$$\Pr(Y \le y) = \Pr(x_1 \le X \le x_2)$$
$$F_Y(y) = F_X(x_2) - F_X(x_1) = F_X(\sqrt{y}) - F_X(-\sqrt{y})$$
$$f_Y(y) = \frac{dF_Y(y)}{dy} = f_X(\sqrt{y})\,\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\,\frac{1}{2\sqrt{y}}$$

11 Transformation of Random Variables: General Form

If the function $y = g(x)$ can be broken up into $k$ invertible sections, that is, the inverse function $x = g^{-1}(y)$ has $k$ solutions $x_1, \ldots, x_k$, then
$$f_Y(y) = \sum_{i=1}^{k}\frac{f_X(x_i)}{|g'(x_i)|}, \qquad x_i = g^{-1}(y)$$

Example 1: Let $f_X(x) = \frac{1}{\sqrt{2\pi}}\exp(-x^2/2)$, and $Y = X^2$. We apply the formula from above and obtain
$$f_Y(y) = f_X(\sqrt{y})\,\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\,\frac{1}{2\sqrt{y}} = \frac{\exp(-y/2)}{2\sqrt{2\pi y}} + \frac{\exp(-y/2)}{2\sqrt{2\pi y}} = \frac{1}{\sqrt{2\pi y}}\exp(-y/2)$$
This PDF is known as $\chi^2$ with one degree of freedom.

Example 2: Let
$$f_X(x) = \begin{cases} 1/2 & \text{if } |x| \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
and, again, $Y = X^2$. We compute:
$$f_Y(y) = f_X(\sqrt{y})\,\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\,\frac{1}{2\sqrt{y}} = \begin{cases} \dfrac{1}{2\sqrt{y}} & 0 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
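The $\chi^2$ result of Example 1 can be verified by simulation; the sketch below (illustrative, not from the notes) histograms $Y = X^2$ for standard Gaussian $X$ and compares against $f_Y(y) = e^{-y/2}/\sqrt{2\pi y}$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(1_000_000)
Y = X**2

# empirical density of Y on a grid (avoid the singular region near y = 0)
hist, edges = np.histogram(Y, bins=200, range=(0.1, 6), density=True)
y = 0.5 * (edges[:-1] + edges[1:])

f_Y = np.exp(-y / 2) / np.sqrt(2 * np.pi * y)   # chi-square, 1 degree of freedom
print(np.max(np.abs(hist - f_Y)))               # agrees closely
```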

12 Sum of Random Variables

Let $X$ and $Y$ be two independent RVs, and consider $Z = X + Y$.

[Figure: the region $x + y \le z$ in the $(x, y)$-plane, bounded by the line $y = z - x$.]

We proceed via the cumulative distribution function of $Z$ as follows:
$$\Pr(Z \le z) = \Pr(X + Y \le z) = \lim_{\Delta x\to 0}\sum_i \Pr\left(x_i \le X \le x_i + \Delta x\right)\Pr\left(Y \le z - x_i\right) = \int_{-\infty}^{\infty} f_X(x)\,F_Y(z - x)\,dx$$
Taking the derivative on both sides w.r.t. $z$ we obtain the probability density function:
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx$$

Convolution Theorem: The sum of two independent random variables $Z = X + Y$ has PDF
$$f_Z(z) = f_X \ast f_Y = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z - x)\,dx \qquad\Longleftrightarrow\qquad \Phi_Z(\omega) = \Phi_X(\omega)\,\Phi_Y(\omega)$$
where $\Phi$ denotes the characteristic function.
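As a concrete check of the convolution theorem (an illustrative sketch, not in the notes): the sum of two independent Uniform(0,1) variables has the triangular density $f_Z(z) = z$ for $0 \le z \le 1$ and $f_Z(z) = 2 - z$ for $1 \le z \le 2$, which is exactly the convolution of the two unit rectangles.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.uniform(0, 1, 1_000_000) + rng.uniform(0, 1, 1_000_000)

hist, edges = np.histogram(Z, bins=100, range=(0, 2), density=True)
z = 0.5 * (edges[:-1] + edges[1:])
f_Z = np.where(z <= 1, z, 2 - z)     # convolution of two unit rectangles
print(np.max(np.abs(hist - f_Z)))    # close to zero
```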

13 Transformation of Multiple Random Variables

In general, the transformation may be between multiple RVs; for example, let $y_1 = g_1(x_1, x_2)$ and $y_2 = g_2(x_1, x_2)$, where $X_1, X_2$ are the two original random variables and $Y_1, Y_2$ are the two dependent random variables.

[Figure: an infinitesimal rectangle of sides $\Delta x_1, \Delta x_2$ in the $(x_1, x_2)$-plane maps into a parallelogram spanned by the vectors $z$ and $w$ in the $(y_1, y_2)$-plane.]

We proceed by matching areas and therefore probability mass, that is, the events $A_x$ and $A_y$ carry the same probability. $A_x$ has area (probability mass) given by $\Delta x_1\,\Delta x_2$. $A_y$ is a parallelogram with area
$$A_y = |z|\,|w|\sin\alpha = |z_1 w_2 - z_2 w_1|$$
The two tangential vectors bounding the parallelogram in the $(y_1, y_2)$-space are given as
$$z = \left[\frac{\partial g_1}{\partial x_1},\,\frac{\partial g_2}{\partial x_1}\right]^T\Delta x_1, \qquad w = \left[\frac{\partial g_1}{\partial x_2},\,\frac{\partial g_2}{\partial x_2}\right]^T\Delta x_2$$
The transformed area $A_y$ is related to the original area $A_x$ by the Jacobian:
$$A_y = \left|\frac{\partial g_1}{\partial x_1}\frac{\partial g_2}{\partial x_2} - \frac{\partial g_1}{\partial x_2}\frac{\partial g_2}{\partial x_1}\right|\Delta x_1\,\Delta x_2 = |J(x_1, x_2)|\;\Delta x_1\,\Delta x_2$$

14 Transformation of Multiple Random Variables

The amount of probability that comes to lie in the infinitesimal areas in both the $(x_1, x_2)$-space as well as the $(y_1, y_2)$-space can be computed from the respective two-dimensional density functions as
$$\Pr\left(x_1 < X_1 \le x_1 + \Delta x_1,\; x_2 < X_2 \le x_2 + \Delta x_2\right) = f_{X_1,X_2}(x_1, x_2)\,\Delta x_1\,\Delta x_2$$
and
$$\Pr\left(y_1 < Y_1 \le y_1 + \Delta y_1,\; y_2 < Y_2 \le y_2 + \Delta y_2\right) = f_{Y_1,Y_2}(y_1, y_2)\,\Delta y_1\,\Delta y_2$$

[Figure: as before, the rectangle $A_x$ in the $(x_1, x_2)$-plane and its image parallelogram $A_y$ in the $(y_1, y_2)$-plane.]

The probability mass in area $A_x$ transforms into all of area $A_y$, hence
$$f_{X_1,X_2}(x_1, x_2)\,\Delta x_1\,\Delta x_2 = f_{Y_1,Y_2}(y_1, y_2)\,\Delta y_1\,\Delta y_2 = f_{Y_1,Y_2}(y_1, y_2)\,|J(x_1, x_2)|\,\Delta x_1\,\Delta x_2$$
As a consequence, the probability densities are related as
$$f_{Y_1,Y_2}(y_1, y_2) = \sum_{i=1}^{k}\frac{f_{X_1,X_2}\!\left(x_1^{(i)}, x_2^{(i)}\right)}{\left|J\!\left(x_1^{(i)}, x_2^{(i)}\right)\right|}$$
where we assumed that there are $k$ solutions to $y_1 = g_1(x_1, x_2)$, $y_2 = g_2(x_1, x_2)$.

15 Polar Transformation of Random Variables

As an example let us study the ubiquitous polar coordinate transformation, where $y_1$ and $y_2$ are given by
$$y_1(x_1, x_2) = \sqrt{x_1^2 + x_2^2} = g_1(x_1, x_2) = r, \qquad y_2(x_1, x_2) = \tan^{-1}\frac{x_2}{x_1} = g_2(x_1, x_2) = \phi$$
We calculate the Jacobian of this transformation as follows:
$$J(x_1, x_2) = \frac{\partial g_1}{\partial x_1}\frac{\partial g_2}{\partial x_2} - \frac{\partial g_1}{\partial x_2}\frac{\partial g_2}{\partial x_1} = \frac{x_1}{\sqrt{x_1^2 + x_2^2}}\cdot\frac{1/x_1}{1 + (x_2/x_1)^2} + \frac{x_2}{\sqrt{x_1^2 + x_2^2}}\cdot\frac{x_2/x_1^2}{1 + (x_2/x_1)^2} = \frac{1}{\sqrt{x_1^2 + x_2^2}} = \frac{1}{r}$$
Consequently
$$f_{Y_1,Y_2}(y_1, y_2) = f_{X_1,X_2}(x_1, x_2)\,\sqrt{x_1^2 + x_2^2}$$
and we find the distribution in terms of polar coordinates as
$$f_{R,\Phi}(r, \phi) = f_{X_1,X_2}(r\cos\phi,\, r\sin\phi)\; r$$

Example 3: Let $f_{X_1,X_2}(x_1, x_2)$ be a product of Gaussian PDFs:
$$f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{x_1^2 + x_2^2}{2\sigma^2}\right) \quad\Longrightarrow\quad f_{R,\Phi}(r, \phi) = \frac{1}{2\pi}\cdot\frac{r}{\sigma^2}\exp\left(-\frac{r^2}{2\sigma^2}\right) = f_\Phi(\phi)\,f_R(r)$$
and phase and amplitude are independently distributed with respective PDFs:
$$f_R(r) = \begin{cases} \dfrac{r}{\sigma^2}\exp\left(-\dfrac{r^2}{2\sigma^2}\right) & r \ge 0 \\ 0 & \text{otherwise} \end{cases} \qquad f_\Phi(\phi) = \begin{cases} \dfrac{1}{2\pi} & 0 \le \phi \le 2\pi \\ 0 & \text{otherwise} \end{cases}$$
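Example 3 is easy to confirm numerically. The sketch below (illustrative only) transforms pairs of independent zero-mean Gaussians to polar coordinates and checks simple moments of the Rayleigh amplitude and the uniform phase:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5
x1 = sigma * rng.standard_normal(1_000_000)
x2 = sigma * rng.standard_normal(1_000_000)

r = np.hypot(x1, x2)                          # amplitude
phi = np.mod(np.arctan2(x2, x1), 2 * np.pi)   # phase folded into [0, 2*pi)

# Rayleigh check: E[R^2] = 2*sigma^2; uniform-phase check: var = (2*pi)^2 / 12
print(np.mean(r**2), 2 * sigma**2)
print(np.var(phi), (2 * np.pi)**2 / 12)
```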

16 Random Processes: Definition

A random process is simply a random variable that is indexed by a time variable, either discrete or continuous.

[Figure: statistics of N = 500 Wiener process traces sliced at a fixed time t; the histogram of the slice is approximately Gaussian, $x \sim N(0, \cdot)$.]

A complete characterization of a random process would involve giving the joint PDF at all possible sample points:
$$f_{X[t_1],\ldots,X[t_k]}(x_1, \ldots, x_k) \qquad\text{for all sample points } t_1, \ldots, t_k \text{ and all } k$$

17 A Simple Example: Random Sine Wave

Consider the following random process
$$X(t) = A\cos(2\pi f t + \Theta)$$
where $A$ is an unknown random amplitude and $\Theta$ is an unknown random phase with uniform distribution in $[0, 2\pi]$. After sampling the random process $X(t)$ we obtain the discrete random process
$$X[i] = A\cos(\Omega i + \Theta) = X_1 X_2$$
where $\Omega = 2\pi f t_0$ ($t_0$ is the sampling interval), $X_1 = A$, and $X_2 = \cos(\Omega i + \Theta)$. The density of $X_2$, the cosine of a uniformly distributed phase, has to be found via transformation:
$$f_{X_2}(x) = \sum_{\theta:\,\cos\theta = x}\frac{f_\Theta(\theta)}{|\sin\theta|} = \frac{2\cdot 1/(2\pi)}{\sin(\cos^{-1}x)} = \frac{1}{\pi\sqrt{1 - x^2}}, \qquad |x| < 1$$
We find the PDF of $X[i]$ via the transformation $y_1(x_1, x_2) = x_1 x_2$ and $y_2(x_1, x_2) = x_2$, whose Jacobian is $|J| = |x_2|$:
$$f_{Y_1,Y_2}(y_1, y_2) = \frac{f_{X_1}(y_1/y_2)\,f_{X_2}(y_2)}{|y_2|} \quad\Longrightarrow\quad f_{Y_1}(y_1) = \int f_{X_1}\!\left(\frac{y_1}{y_2}\right)\frac{1}{\pi\sqrt{1 - y_2^2}}\,\frac{dy_2}{|y_2|}$$
Special case: if the amplitude is fixed at $A = 1$, the integral collapses and
$$f_{Y_1}(y) = \frac{1}{\pi\sqrt{1 - y^2}}, \qquad |y| < 1$$

18 Statistical Descriptions of Random Processes

Since it is in general exceedingly difficult to completely characterize a random process by the complete joint PDF of all possible sample points, one often resorts to partial statistical descriptions of the process.

The mean of a random process is defined as $\mu_X(t) = E[X(t)]$, or $\mu_X[k] = E[X[k]]$.

The autocorrelation of a random process is defined as $R_{XX}(t_1, t_2) = E[X(t_1)X(t_2)]$, or $R_{XX}[l, m] = E[X[l]X[m]]$.

The autocovariance of a random process can be defined in terms of the mean and autocorrelation as
$$C_{XX}(t_1, t_2) = E\left[(X(t_1) - \mu_X(t_1))(X(t_2) - \mu_X(t_2))\right] = R_{XX}(t_1, t_2) - \mu_X(t_1)\mu_X(t_2)$$

As an example, consider the random sine wave from before, where we obtain
$$\mu_X(t) = E[A]\,E[\cos(2\pi f t + \Theta)] = 0$$
The autocorrelation is given as
$$R_{XX}(t_1, t_2) = E\left[A\cos(2\pi f t_1 + \Theta)\;A\cos(2\pi f t_2 + \Theta)\right] = \frac{E[A^2]}{2}\,E\left[\cos(2\pi f(t_1 + t_2) + 2\Theta) + \cos(2\pi f(t_1 - t_2))\right] = \frac{E[A^2]}{2}\cos(2\pi f\tau), \qquad \tau = t_1 - t_2$$

19 Stationarity of Random Processes

In the previous example, we note that neither $\mu_X(t)$ nor $R_{XX}(t_1, t_2)$ depend on the specific time samples $t, t_1, t_2$. This property is called wide-sense stationarity.

A random process is wide-sense stationary (WSS) if and only if $\mu_X(t) = \mu_X$, independent of $t$, and $R_{XX}(t_1, t_2) = R_X(t_1 - t_2)$, dependent only on the difference $\tau = t_1 - t_2$.

A random process is strict-sense stationary (SSS) if and only if the complete joint PDF of the process is invariant under a time translation $t_0$:
$$f_{X[t_1 + t_0],\ldots,X[t_k + t_0]}(x_1, \ldots, x_k) = f_{X[t_1],\ldots,X[t_k]}(x_1, \ldots, x_k)$$

Basic properties:
1. Strict-sense stationarity implies wide-sense stationarity.
2. The autocorrelation function is even, that is, $R_X(\tau) = R_X(-\tau)$.
3. The autocorrelation achieves its maximum at $\tau = 0$, i.e., $|R_X(\tau)| \le R_X(0)$.

Finding $\mu_X(t)$ and $R_{XX}(t_1, t_2)$ is often the only hope one has in characterizing random processes.

20 Power Spectral Density of a Stationary Random Process

The power spectral density (PSD) of a wide-sense stationary process is an extremely important and powerful characterization and analysis tool. It is simply defined as the Fourier transform of the autocorrelation function.

The PSD of a wide-sense stationary (WSS) random process is given as
$$S_X(f) = \int_{-\infty}^{\infty} R_X(\tau)\,e^{-j2\pi f\tau}\,d\tau \quad\text{for a continuous process, and}\quad S_X(\Omega) = \sum_{k=-\infty}^{\infty} R_X[k]\,e^{-jk\Omega} \quad\text{for a discrete process.}$$

Basic properties:
1. $S_X(f)$ is real, non-negative, and even.
2. The Fourier transform property allows the autocorrelation to be recaptured from the PSD:
$$R_X(\tau) = \int_{-\infty}^{\infty} S_X(f)\,e^{j2\pi f\tau}\,df \quad\text{for a continuous process, and}\quad R_X[k] = \frac{1}{2\pi}\int_{-\pi}^{\pi} S_X(\Omega)\,e^{jk\Omega}\,d\Omega \quad\text{for a discrete process.}$$
3. $R_X(0) = E[X^2(t)] = \int_{-\infty}^{\infty} S_X(f)\,df$ is the average power of the process $X(t)$.
4. The integral $\int_{f_1}^{f_2} S_X(f)\,df$ gives the average power of the process in the frequency band $[f_1, f_2]$.

21 Stationary Random Process Example: White Noise

Thermal noise can be modeled as a Gaussian random process. We approximate the noise waveform by a sequence of random pulses $w(t)$ of duration $T_s$,
$$n(t) = \sum_{i=-\infty}^{\infty} n_i\,w(t - iT_s - \delta)$$
where the weighting coefficients $n_i$ are independent Gaussian random variables and $T_s$ is the discrete time increment; $\delta$ is a random delay, uniformly distributed in $[0, T_s)$.

[Figure: a sample staircase noise waveform $n(t)$ versus $t$.]

We compute its autocorrelation function as
$$R(\tau) = E\left[n(t)\,n(t + \tau)\right] = E\left[\sum_i\sum_j n_i n_j\,w(t - iT_s - \delta)\,w(t + \tau - jT_s - \delta)\right] = \sigma^2\,E_\delta\left[\sum_i w(t - iT_s - \delta)\,w(t + \tau - iT_s - \delta)\right] = \frac{\sigma^2}{T_s}\left(1 - \frac{|\tau|}{T_s}\right), \quad |\tau| \le T_s$$
and $R(\tau) = 0$ for $|\tau| > T_s$, where $\sigma^2 = E[n_i^2]$ and the pulses $w(t)$ are taken to have unit energy.

22 White Noise (continued)

$R(\tau)$ is a triangular function as shown below.

[Figure: the triangular autocorrelation $R(\tau)$, of height $\sigma^2/T_s$ and support $[-T_s, T_s]$, and its Fourier transform $N(f)$, which has nulls at multiples of $1/T_s$.]

As $T_s \to 0$, in the limit $N(f) \to \sigma^2$. In the limit, therefore, white noise has an even distribution of power over all frequencies. As $n(t) \to n_w(t)$, the correlation function $R(\tau)$ degenerates into an impulse of width zero and infinite height as $T_s \to 0$, that is, $R(\tau) \to \sigma^2\delta(\tau)$, where $\delta(\tau)$ is known as Dirac's impulse function. We typically only need the sifting property of $\delta(t)$:
$$\int_{-\infty}^{\infty}\delta(t - \alpha)\,f(t)\,dt = f(\alpha)$$
where $f(t)$ is an arbitrary function which is continuous at $t = \alpha$.

White noise $n_w(t)$
1. is a stationary process with $E[n_w(t)] = 0$ and autocorrelation $R(\tau) = \sigma^2\delta(\tau)$;
2. has infinite power, $E[n_w^2(t)] = \infty$, and can thus only be used in a system when filtered;
3. has a Gaussian distribution, and is therefore not only WSS but also strict-sense stationary;
4. is an excellent description of thermal noise and other widespread noise sources.

23 Filtering of Random Processes

Let $h(t)$ be the impulse response of a linear time-invariant system (see lecture notes ECED 35), that is, an input signal $x(t)$ undergoes the convolution $y(t) = \int x(\alpha)\,h(t - \alpha)\,d\alpha$. Now if $X(t)$ is a random process, the output
$$Y(t) = \int_{-\infty}^{\infty} X(\alpha)\,h(t - \alpha)\,d\alpha$$
is also a random process. Its mean is
$$m_Y(t) = E\left[\int_{-\infty}^{\infty} X(\alpha)\,h(t - \alpha)\,d\alpha\right] = \int_{-\infty}^{\infty} m_X(\alpha)\,h(t - \alpha)\,d\alpha$$
Likewise, its autocorrelation function is
$$R_{YY}(t_1, t_2) = E\left[\int X(\alpha)\,h(t_1 - \alpha)\,d\alpha\int X(\beta)\,h(t_2 - \beta)\,d\beta\right] = \iint E[X(\alpha)X(\beta)]\,h(t_1 - \alpha)\,h(t_2 - \beta)\,d\alpha\,d\beta = \iint R_{XX}(\alpha, \beta)\,h(t_1 - \alpha)\,h(t_2 - \beta)\,d\alpha\,d\beta$$

If the input process $X(t)$ to a filter $h(t)$ is stationary, the output process $Y(t) = \int X(\alpha)\,h(t - \alpha)\,d\alpha$ is also stationary, with
$$m_Y = m_X\int_{-\infty}^{\infty} h(\alpha)\,d\alpha, \qquad R_Y(\tau) = \iint R_X(\tau + \alpha - \beta)\,h(\alpha)\,h(\beta)\,d\alpha\,d\beta$$

24 The Power Spectrum of a Stationary Random Process

For a wide-sense stationary random process we can define the power spectrum. For a filtered stationary process $Y(t)$ we obtain
$$R_Y(\tau) = \iint R_X(\tau + \alpha - \beta)\,h(\alpha)\,h(\beta)\,d\alpha\,d\beta = \iiint S_X(f)\,e^{j2\pi f(\tau + \alpha - \beta)}\,df\,h(\alpha)\,h(\beta)\,d\alpha\,d\beta = \int S_X(f)\,e^{j2\pi f\tau}\underbrace{\int h(\alpha)\,e^{j2\pi f\alpha}\,d\alpha}_{H^*(f)}\;\underbrace{\int h(\beta)\,e^{-j2\pi f\beta}\,d\beta}_{H(f)}\,df = \int S_X(f)\,|H(f)|^2\,e^{j2\pi f\tau}\,df$$
Consequently, the output power spectral density of a filtered stationary process with power spectral density $S_X(f)$ is given as
$$S_Y(f) = S_X(f)\,|H(f)|^2$$
This illustrates the meaning of the term power spectral density, since the power of the output signal $y(t)$ is given by
$$R_Y(0) = \int_{-\infty}^{\infty} S_X(f)\,|H(f)|^2\,df = E[Y^2(t)]$$
and so $S_X(f)$ describes the mean power as a function of the frequency of the signal $X(t)$.
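The relation $S_Y(f) = S_X(f)|H(f)|^2$ can be checked with a simple discrete-time experiment (an illustrative sketch, not part of the notes; the FIR filter is an arbitrary choice): white noise with flat two-sided PSD $\sigma^2$ is filtered, and the output PSD estimated with Welch's method is compared to $\sigma^2|H(f)|^2$.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(5)
sigma2 = 1.0
x = np.sqrt(sigma2) * rng.standard_normal(1_000_000)   # white noise, S_X(f) = sigma2

h = signal.firwin(64, 0.25)                   # some low-pass FIR filter (illustrative)
y = signal.lfilter(h, 1.0, x)                 # filtered process Y

# two-sided PSD estimate of Y and the filter response on the same frequency grid
f, S_Y = signal.welch(y, fs=1.0, nperseg=4096, return_onesided=False)
_, H = signal.freqz(h, worN=f, fs=1.0)

mask = np.abs(H)**2 > 0.1                     # compare inside the passband
print(np.max(np.abs(S_Y[mask] / (sigma2 * np.abs(H[mask])**2) - 1)))   # small estimation error
```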

25 White Gaussian Noise Through a Brickwall Filter

White noise $n_w(t)$ is a theoretical concept with no real-world analog. In order to obtain useful models, we pass white noise with power spectral density $N_0/2$ through a filter with characteristic
$$H(f) = \begin{cases} 1 & |f| \le W \\ 0 & \text{otherwise} \end{cases}$$
and we then generate the filtered noise $n(t) = n_w(t) \ast h(t)$. We compute the autocorrelation function as
$$R_N(\tau) = \int_{-\infty}^{\infty} S_{n_w}(f)\,|H(f)|^2\,e^{j2\pi f\tau}\,df = \frac{N_0}{2}\int_{-W}^{W}\cos(2\pi f\tau)\,df = W N_0\,\frac{\sin(2\pi W\tau)}{2\pi W\tau}$$
We compute the power of the process as $R_N(0) = W N_0$.

Now let us sample the noise at times which are multiples of $1/(2W)$, that is,
$$n_i = n(t_i), \qquad t_i = \frac{i}{2W}$$
The sequence $\{n_i\}$ of samples is a discrete random process consisting of the sequence of random variables $n_i$. The correlation of these random variables can be computed as follows:
$$E[n_i n_j] = R_N(t_j - t_i) = R_N\!\left(\frac{j - i}{2W}\right) = W N_0\,\frac{\sin(\pi(j - i))}{\pi(j - i)} = \begin{cases} W N_0 & i = j \\ 0 & i \ne j \end{cases}$$

26 Ergodicity of Random Processes

Ergodicity is a very important concept which connects experimental observation with the stationary properties of a random process. First, for a process to be ergodic in a parameter, it has to be stationary in that parameter. Consider for example the mean of an ergodic process $X(t)$:
$$\mu_X(t) = E[X(t)] = \mu_X, \qquad\text{if the process is stationary in the mean.}$$
If we observe a given sample process $x(t)$, we can alternately compute the sample mean
$$\langle x(t)\rangle_T = \frac{1}{T}\int_{-T/2}^{T/2} x(t)\,dt$$
A random process $X(t)$ is ergodic in the mean if and only if $\langle x(t)\rangle_T \to \mu_X$, or, more formally, if
$$\lim_{T\to\infty} E\left[\left(\langle x(t)\rangle_T - \mu_X\right)^2\right] = 0 \qquad\Longleftrightarrow\qquad \underset{T\to\infty}{\mathrm{l.i.m.}}\;\frac{1}{T}\int_{-T/2}^{T/2} x(t)\,dt = \mu_X$$
where l.i.m. stands for limit in the mean square.

Let us consider the random variable $\langle x(t)\rangle_T$:
$$E\left[\langle x(t)\rangle_T^2\right] = \frac{1}{T^2}\int_{-T/2}^{T/2}\int_{-T/2}^{T/2} E[x(t_1)x(t_2)]\,dt_1\,dt_2$$
$$\mathrm{var}\left(\langle x(t)\rangle_T\right) = \frac{1}{T^2}\int_{-T/2}^{T/2}\int_{-T/2}^{T/2} C_{XX}(t_1 - t_2)\,dt_1\,dt_2 = \frac{1}{T}\int_{-T}^{T}\left(1 - \frac{|\nu|}{T}\right)C_{XX}(\nu)\,d\nu \le \frac{1}{T}\int_{-T}^{T}\left|C_{XX}(\nu)\right|\,d\nu$$
with the substitution $t_1 - t_2 = \nu$.

27 Ergodicity of Random Processes (continued)

From above we note that a stationary random process can be ergodic in the mean only if its autocovariance function averages out to zero. This means that the process has limited memory, or that the memory disappears as $T$ becomes large.

Ergodicity is an important concept. It means that we can replace statistical ensemble averages with averages over a single observed process. Ergodicity has to be established for each parameter separately.

Ergodicity of the autocorrelation: Let the sample autocorrelation function be given by
$$\langle R_{XX}(\alpha)\rangle_T = \frac{1}{T}\int_{-T/2}^{T/2} x(t)\,x(t + \alpha)\,dt = Y(\alpha)$$
$\langle R_{XX}(\alpha)\rangle_T$ is a random variable, whose expectation is given by
$$E\left[\langle R_{XX}(\alpha)\rangle_T\right] = \frac{1}{T}\int_{-T/2}^{T/2} E[x(t)\,x(t + \alpha)]\,dt = \frac{1}{T}\int_{-T/2}^{T/2} R_{XX}(\alpha)\,dt = R_{XX}(\alpha)$$
To show convergence in the mean-square sense we need the variance of $Y(\alpha)$ to vanish:
$$\mathrm{var}(Y(\alpha)) = \frac{1}{T^2}\int_{-T/2}^{T/2}\int_{-T/2}^{T/2}\Big(E\big[\underbrace{x(t_1)x(t_1 + \alpha)}_{z(t_1)}\;\underbrace{x(t_2)x(t_2 + \alpha)}_{z(t_2)}\big] - R_{XX}^2(\alpha)\Big)\,dt_1\,dt_2 = \frac{1}{T^2}\iint C_{ZZ}(t_1, t_2)\,dt_1\,dt_2 \approx \frac{1}{T}\int_{-T}^{T} C_{ZZ}(\tau)\,d\tau \;\to\; 0$$
In order for a random process to be ergodic in the autocorrelation function, this fourth-order statistic, the covariance $C_{ZZ}$ of the product process $z(t)$, must average out to zero.

28 Brownian Motion

Let us revisit the discrete random process $S_n = \sum_{i=1}^{n} X_i$ and compute
$$\mu_S = 0, \qquad R_{SS}[l, m] = \min(l, m)$$
This is shown as follows, assuming $l \le m$:
$$R_{SS}[l, m] = E[S_l S_m] = E\left[S_l\left(S_l + \sum_{i=l+1}^{m} X_i\right)\right] = E[S_l^2] + E[S_l]\,E\left[\sum_{i=l+1}^{m} X_i\right] = E[S_l^2] = l$$
From this we can define the classical Wiener process as
$$X(t) = \begin{cases} 0, & t = 0 \\ d\,S_n, & (n-1)T < t \le nT \end{cases}$$
$$E[X(t)] = 0, \qquad E[X^2(t)] = \frac{t}{T}\,d^2 = \alpha t \quad\left(= n\,d^2 \text{ at } t = nT\right)$$
If we let $T \to 0$, $d \to 0$ with $d^2/T = \alpha$ held constant, we obtain the Wiener process $w(t)$:
$$f_W(w; t) = \frac{1}{\sqrt{2\pi\alpha t}}\exp\left(-\frac{w^2}{2\alpha t}\right)$$
The Wiener process is a non-stationary random process with
1. $E[w(t)] = 0$, $E[w^2(t)] = \alpha t$
2. $R_{WW}(t_1, t_2) = \alpha\min(t_1, t_2)$
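A discrete simulation of the Wiener process (an illustrative sketch, not from the notes): scaled random-walk increments with $d^2/T = \alpha$ produce sample paths whose variance grows as $\alpha t$ and whose correlation follows $\alpha\min(t_1, t_2)$.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, T = 2.0, 5e-3                 # step interval T, step size d with d^2 / T = alpha
d = np.sqrt(alpha * T)
n_steps, n_paths = 400, 10000

steps = d * rng.choice([-1.0, 1.0], size=(n_paths, n_steps))
W = np.cumsum(steps, axis=1)         # sample paths of X(t) = d * S_n, with t = n*T

t1, t2 = 0.5, 1.5                    # check E[w^2(t)] = alpha*t and R(t1,t2) = alpha*min(t1,t2)
i1, i2 = int(t1 / T) - 1, int(t2 / T) - 1
print(np.var(W[:, i2]), alpha * t2)
print(np.mean(W[:, i1] * W[:, i2]), alpha * min(t1, t2))
```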

29 Bayesian Estimation

Contrary to classical estimation, which uses no prior information about the parameters to be estimated, Bayesian estimation makes use of probabilistic prior information. This enables Bayesian estimation to significantly improve the estimation accuracy on average.

[Figure: block diagrams. Classical: a parameter $\theta$ produces the observation $x$ ($x$ is a statistic), which is fed to an estimator $g(x)$ producing $\hat\theta$. Bayesian: in addition, a PDF on $\theta$ is available to the estimator.]

In the Bayesian approach, the parameter $\theta$ is considered to be a random variable whose PDF is known, or approximately known. We hence work with the joint PDF
$$f_{X,\Theta}(x, \theta) = f_{\Theta|X}(\theta|x)\,f_X(x)$$

Minimum mean-square error (MMSE) estimation: What was not possible in classical estimation theory is now feasible: minimize
$$E\left[(g(X) - \theta)^2\right] = \iint (g(x) - \theta)^2 f_{X,\Theta}(x, \theta)\,dx\,d\theta = \int f_X(x)\left[\int (g(x) - \theta)^2 f_{\Theta|X}(\theta|x)\,d\theta\right]dx$$
We can now minimize the inner integral above w.r.t. $g$ for each $x$, and thus minimize the entire expression and therefore the squared error.

30 MMSE Estimator

We proceed with the minimization as follows:
$$\int (g(x) - \theta)^2 f_{\Theta|X}(\theta|x)\,d\theta = g^2(x) - 2g(x)\underbrace{\int\theta\,f_{\Theta|X}(\theta|x)\,d\theta}_{\text{conditional estimator: } E[\theta|x]} + \int\theta^2 f_{\Theta|X}(\theta|x)\,d\theta = \left(g(x) - E[\theta|x]\right)^2 + \mathrm{var}(\theta|x)$$
The conditional variance $\mathrm{var}(\theta|x)$ represents an irreducible error, but the other term is minimized by setting

Conditional expectation: $g_{\mathrm{MMSE}}(x) = E[\theta|x]$

that is, the conditional expectation minimizes the mean-squared error of a Bayesian estimator. However, the conditional expectation is often difficult to find, and in practice another estimator, the maximum-likelihood estimator, is often easier to find.

Maximum-likelihood estimator: $g_{\mathrm{ML}}(x) = \arg\max_\theta f_{\Theta|X}(\theta|x)$

[Figure: a skewed conditional density $f_{\Theta|X}(\theta|x)$ with the conditional-expectation estimate and the maximum-likelihood estimate marked.]

For symmetrical conditional density functions the two estimators coincide, but in general $g_{\mathrm{ML}}(x) \ne g_{\mathrm{MMSE}}(x)$.

31 MMSE Estimator: Example

Example: Consider the familiar example of observing an unknown amplitude $a$ in Gaussian noise, i.e.,
$$x_i = a + n_i; \qquad 1 \le i \le n$$
In this case the PDF of the observation vector $x = (x_1, \ldots, x_n)$ is now interpreted as a conditional probability density function, given by
$$f_{X|A}(x|a) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - a)^2\right)$$
The critical assumption in Bayesian estimation is now the PDF of $a$. Let us assume here that $a$ is distributed according to an a-priori Gaussian PDF following
$$f_A(a) = \frac{1}{\sqrt{2\pi\sigma_A^2}}\exp\left(-\frac{(a - \mu_A)^2}{2\sigma_A^2}\right)$$
where $\mu_A$ is, of course, the mean of the distribution and $\sigma_A^2$ is its variance. We now need to compute the conditional expectation
$$\hat A = E[A|x]$$
We deviate briefly and discuss the form of jointly distributed Gaussian random variables (GRVs). If two GRVs $X$ and $Y$ are uncorrelated, then their joint PDF is simply the product
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(y - \mu_y)^2}{2\sigma_y^2}\right)$$
However, if they are correlated, the product is no longer separable; a correlation coefficient
$$\rho = \frac{E\left[(X - \mu_x)(Y - \mu_y)\right]}{\sigma_x\sigma_y}$$
connects the two, and
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - \rho^2}}\exp\left(-\frac{1}{2(1 - \rho^2)}\left[\frac{(x - \mu_x)^2}{\sigma_x^2} - 2\rho\,\frac{(x - \mu_x)(y - \mu_y)}{\sigma_x\sigma_y} + \frac{(y - \mu_y)^2}{\sigma_y^2}\right]\right)$$

32 Multivariate Gaussian Random Variables

A multivariate Gaussian random variable distribution is one where $n$ RVs are all jointly Gaussian distributed, as
$$f_X(x) = \frac{1}{(2\pi)^{n/2}\,|C_{xx}|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_x)^T C_{xx}^{-1}(x - \mu_x)\right)$$
In this equation, $x = (x_1, \ldots, x_n)$ is the vector of $n$ GRVs, and $C_{xx}$ is the pairwise covariance matrix
$$C_{xx} = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1n} \\ C_{21} & C_{22} & \cdots & C_{2n} \\ \vdots & & \ddots & \vdots \\ C_{n1} & C_{n2} & \cdots & C_{nn} \end{bmatrix}$$
and $C_{ij} = E\left[(X_i - \mu_{x_i})(X_j - \mu_{x_j})\right]$ is the covariance between $x_i$ and $x_j$.

The following extremely important formula applies to vectors of jointly Gaussian random variables. Separate the vector $x = [y, z]$ into two partial vectors, with rearrangements if required. The conditional expectation of one part, given knowledge of the other, is given by

Conditional expectation of Gaussian random variables:
$$E[Y|z] = E[Y] + C_{yz}C_{zz}^{-1}\left(z - E[z]\right), \qquad C_{y|z} = C_{yy} - C_{yz}C_{zz}^{-1}C_{zy}$$
where $C_{y|z}$ is the conditional covariance matrix
$$C_{y|z} = E\left[\left(Y - \mu_{y|z}\right)\left(Y - \mu_{y|z}\right)^T\,\middle|\,z\right] = E\left[YY^T\,\middle|\,z\right] - \mu_{y|z}\,\mu_{y|z}^T$$

Note: The conditional expectation in the case of jointly Gaussian random vectors is a linear transformation of the statistic.

33 MMSE Estimator: Example Continued

We now let our vector of joint GRVs be $x = [A, x_1, \ldots, x_n]$ and apply the conditional expectation formula from above. With $\mathbf{1}$ denoting the all-ones vector, we obtain
$$C_{zz} = \sigma_A^2\,\mathbf{1}\mathbf{1}^T + \sigma^2 I, \qquad C_{yy} = \sigma_A^2, \qquad C_{yz} = \sigma_A^2\,\mathbf{1}^T$$
keeping in mind that the first part $y$ is only the scalar RV $A$. A straightforward application of the conditional expectation formula for GRVs now gives the optimal estimator as
$$\hat A = E[A|x] = \mu_A + \sigma_A^2\,\mathbf{1}^T\left(\sigma_A^2\,\mathbf{1}\mathbf{1}^T + \sigma^2 I\right)^{-1}\left(x - \mu_A\mathbf{1}\right)$$
Using the Woodbury identity, a version of the matrix inversion lemma, we can avoid the matrix inverse above.

Woodbury's identity:
$$\left(I + c\,\mathbf{1}\mathbf{1}^T\right)^{-1} = I - \frac{c\,\mathbf{1}\mathbf{1}^T}{1 + nc}$$
We obtain
$$\hat A = E[A|x] = \mu_A + \frac{\sigma_A^2}{\sigma^2}\,\mathbf{1}^T\left(I - \frac{(\sigma_A^2/\sigma^2)\,\mathbf{1}\mathbf{1}^T}{1 + n\sigma_A^2/\sigma^2}\right)\left(x - \mu_A\mathbf{1}\right)$$
which can be manipulated into
$$\hat A = \underbrace{\alpha\,\bar x}_{\text{data part}} + \underbrace{(1 - \alpha)\,\mu_A}_{\text{prior knowledge}}; \qquad \alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/n}$$
where $\bar x = \frac{1}{n}\sum_i x_i$ is the sample mean. The estimator thus naturally weighs the information coming from the observation, and what is known about the parameter to be estimated, according to the reliabilities of both.
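The shrinkage form $\hat A = \alpha\bar x + (1-\alpha)\mu_A$ is easy to exercise numerically. The sketch below (illustrative, with made-up parameter values) draws $A$ from its prior, generates noisy observations, and compares the Bayesian estimator with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(7)
mu_A, sigma_A, sigma = 1.0, 0.5, 2.0     # prior mean/std of A, observation noise std
n, trials = 10, 50000

A = mu_A + sigma_A * rng.standard_normal(trials)             # A drawn from its prior
x = A[:, None] + sigma * rng.standard_normal((trials, n))    # x_i = A + n_i

alpha = sigma_A**2 / (sigma_A**2 + sigma**2 / n)
xbar = x.mean(axis=1)
A_bayes = alpha * xbar + (1 - alpha) * mu_A                  # Bayesian MMSE estimator
A_class = xbar                                               # classical estimator (sample mean)

print("Bayesian MSE  :", np.mean((A_bayes - A)**2))   # approx. alpha * sigma^2 / n
print("classical MSE :", np.mean((A_class - A)**2))   # approx. sigma^2 / n
```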

34 MMSE Estimator: Example Continued

We can also quite easily compute the mean squared error of our estimator, for a given value of $A$, as
$$E\left[(\hat A - A)^2\right] = E\left[\left(\alpha\frac{1}{n}\sum_{i=1}^{n}(A + n_i) + (1 - \alpha)\mu_A - A\right)^2\right] = E\left[\left((1 - \alpha)(\mu_A - A) + \alpha\frac{1}{n}\sum_{i=1}^{n} n_i\right)^2\right] = (1 - \alpha)^2(A - \mu_A)^2 + \alpha^2\frac{\sigma^2}{n}$$
Therefore, the Bayesian estimator can give a substantially lower error than the unbiased classical estimator, whose minimum mean-square error equals $\sigma^2/n$.

[Figure: mean-square error (MSE) as a function of the actual parameter $A$. Classical: $\sigma^2/n$, constant. Bayesian: $\alpha^2\sigma^2/n$ at $A = \mu_A$, growing quadratically away from $\mu_A$.]

$$\alpha = \frac{\sigma_A^2}{\sigma_A^2 + \sigma^2/n} < 1; \qquad \sigma_A^2 \to 0 \Rightarrow \alpha \to 0 \text{ (if } A \text{ is known well)}; \qquad n \to \infty \Rightarrow \alpha \to 1 \text{ (if we trust the measurements).}$$

The figure illustrates the error as a function of the actual parameter $A$; if $A$ happens to be too far away from its expectation $\mu_A$, the Bayesian estimator can actually generate an error which is worse than that of the classical estimator, but the average and the minimum error can both be substantially smaller.

35 Linear Estimator

A linear estimator for a single parameter $\theta$ is one that is a linear combination of the observed data, i.e.,
$$\hat\theta = \sum_{i=1}^{n} a_i x_i + b$$
where the coefficients $a_i$ and $b$ are constants to be optimized for minimum error in the estimate. We first find $b$ by differentiation:
$$\frac{\partial}{\partial b}E\left[(\theta - \hat\theta)^2\right] = -2\,E\left[\theta - \sum_{i=1}^{n} a_i x_i - b\right] = 0 \quad\Longrightarrow\quad b = \mu_\theta - \sum_{i=1}^{n} a_i\mu_{x_i}$$
This takes care of the bias, that is, $E[\hat\theta] = \mu_\theta$. The coefficients $a_i$ are now also found via partial differentiation:
$$\frac{\partial}{\partial a_i}E\left[(\theta - \hat\theta)^2\right] = -2\,E\Big[\underbrace{\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big)}_{\text{error}}\;\underbrace{x_i}_{\text{data}}\Big] = 0$$
This leads to the famous orthogonality principle:

Orthogonality Principle: In an optimal linear estimator the data is orthogonal to the estimation error:
$$E\left[\Big(\theta - \sum_{j=1}^{n} a_j x_j - b\Big)\,x_i\right] = 0, \qquad 1 \le i \le n$$

36 Linear Prediction

Note that the parameter $\theta$ is arbitrary; it does not necessarily have to be a causal parameter of the random process $x$, it simply has to be probabilistically correlated with $x$.

Linear prediction: we choose $\theta = x_{n+1}$, the next RV in the random process $x = (x_1, x_2, x_3, \ldots)$, that is,
$$\hat x_{n+1} = \sum_{j=1}^{N} a_j\,x_{n+1-j}$$
where $N$ is the predictor order. By a direct application of the orthogonality principle we find the optimal coefficients as
$$a = R^{-1} r$$
where $r = [R_{n+1,n}, \ldots, R_{n+1,n-N+1}]^T$ and $R$ is an $N \times N$ matrix of similar values $R_{i,j}$, where $R_{i,j} = E[X_i X_j]$ is the correlation between the $i$-th and the $j$-th element in the sequence. These are the Wiener-Hopf equations. For a stationary process $x$ we have $R_{i,j} = R_{j-i} = R_{i-j}$, and these equations become

Wiener-Hopf equations for stationary processes:
$$\begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_N \end{bmatrix} = \begin{bmatrix} R_0 & R_1 & \cdots & R_{N-1} \\ R_1 & R_0 & \cdots & R_{N-2} \\ \vdots & & \ddots & \vdots \\ R_{N-1} & R_{N-2} & \cdots & R_0 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix} \qquad\Longleftrightarrow\qquad r = R\,a$$
These equations can be solved very efficiently, due to the Toeplitz structure of $R$, by the Levinson-Durbin algorithm.
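A minimal numerical sketch of the Wiener-Hopf solution (illustrative only, assuming the autocorrelation sequence $R_k$ is known): the Toeplitz system $Ra = r$ is solved with SciPy's Toeplitz solver, a Levinson-type $O(N^2)$ routine.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

# assumed example: a process with autocorrelation R_k = rho^|k| (AR(1)-like)
rho, N = 0.9, 5
R_full = rho ** np.arange(N + 1)        # R_0, R_1, ..., R_N

c = R_full[:N]                          # first column of the Toeplitz matrix R (R_0 ... R_{N-1})
r = R_full[1:]                          # right-hand side (R_1 ... R_N)

a = solve_toeplitz(c, r)                # predictor coefficients a_1 ... a_N
mse = R_full[0] - a @ r                 # sigma_x^2 - r^T R^{-1} r (see next page)
print(a)                                # for R_k = rho^|k| this is [rho, 0, ..., 0]
print(mse)                              # equals 1 - rho^2
```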

37 Linear Prediction: Mean-Square Error

To compute the mean-square error (MSE) of the predictor we calculate
$$E\left[(x_{n+1} - \hat x_{n+1})^2\right] = E[x_{n+1}^2] - 2E[x_{n+1}\hat x_{n+1}] + E[\hat x_{n+1}^2] = \sigma_x^2 - E[x_{n+1}\hat x_{n+1}] = \sigma_x^2 - \sum_{j=1}^{N} a_j\,E[x_{n+1}x_{n+1-j}] = \sigma_x^2 - \sum_{j=1}^{N} a_j R_j = \sigma_x^2 - a^T r = \sigma_x^2 - r^T R^{-1} r$$
Since $0 \le r^T R^{-1} r \le \sigma_x^2$, the power of the prediction error depends on the correlation between successive samples in $x$. The example predictor below tracks a fading process with Doppler frequency $f_d = 5$ Hz (Matlab code: LinearPredictor.m).

[Figure: process amplitude versus time in seconds; the predicted traces for two predictor orders, N = 5 and a lower order, are overlaid on the fading process.]

38 Linear Estimators: The Linear MMSE

If we need to estimate $m$ parameters $\theta_j$ from the same data, we proceed as
$$\hat\theta_j = \sum_{i=1}^{n} a_{ji}\,x_i + b_j; \qquad 1 \le j \le m$$
which we can succinctly collect into a single matrix equation
$$\hat\theta = A x + b$$
where $A$ is now an $m \times n$ matrix of coefficients. Proceeding analogously to the single-parameter case we find
$$b = E[\theta] - A\,\mu_x$$
and from there we find
$$\frac{\partial}{\partial a_j}E\left[\left(\theta_j - \mu_{\theta_j} - a_j^T(x - \mu_x)\right)^2\right] = -2\,E\left[\left(\theta_j - \mu_{\theta_j} - a_j^T(x - \mu_x)\right)(x - \mu_x)^T\right] = 0$$
$$E\left[(\theta_j - \mu_{\theta_j})(x - \mu_x)^T\right] = E\left[a_j^T(x - \mu_x)(x - \mu_x)^T\right]$$
from which we compute the optimal coefficients to estimate $\theta_j$ as
$$a_j^T = C_{\theta_j x}\,C_{xx}^{-1}$$
and the correlation matrices are defined implicitly above. Putting all the pieces together, we obtain for the optimal linear estimator:

Linear minimum mean-square error estimator:
$$\hat\theta = A x + b = E[\theta] + C_{\theta x}C_{xx}^{-1}\left(x - \mu_x\right)$$

Note: The linear LMMSE estimator has exactly the same mathematical form as the optimal estimator for jointly Gaussian RVs.

39 Linear Estimators: The Mean-Square Error

The mean-square error (MSE) of the linear estimator $\hat\theta = Ax + b$ can be computed quite easily as follows:
$$\mathrm{MSE}_\Theta = E\left[(\Theta - \hat\Theta)(\Theta - \hat\Theta)^T\right] = E\left[\left(\Theta - \mu_\Theta - A(X - \mu_X)\right)\left(\Theta - \mu_\Theta - A(X - \mu_X)\right)^T\right] = C_{\theta\theta} + A C_{xx} A^T - A C_{x\theta} - C_{\theta x} A^T$$
Noting that $C_{xx} = E\left[(X - \mu_X)(X - \mu_X)^T\right]$, and that $C_{x\theta}$ and $C_{\theta x}$ are defined analogously, we use
$$A = C_{\theta x}C_{xx}^{-1}$$
to obtain
$$\mathrm{MSE}_\Theta = C_{\theta\theta} - C_{\theta x}C_{xx}^{-1}C_{x\theta}$$
Note that $\mathrm{MSE}_\Theta$ measures not only the expected squared error of all $\hat\theta_j$, but also the correlation of errors $E\left[(\theta_j - \hat\theta_j)(\theta_{j'} - \hat\theta_{j'})\right]$.

Putting all the pieces together, we obtain for the optimal linear estimator:

Linear minimum mean-square error estimator:
$$\hat\theta = A x + b = E[\theta] + C_{\theta x}C_{xx}^{-1}\left(x - \mu_x\right) \qquad\text{with error}\qquad \mathrm{MSE}_\Theta = C_{\theta\theta} - C_{\theta x}C_{xx}^{-1}C_{x\theta}$$

40 Example: Channel Estimation

Consider the case of an unknown discrete channel as shown here:

[Figure: the input sequence $u_i$ passes through a tapped delay line with taps $h_0, h_1, \ldots, h_P$; noise $w_i$ is added to produce the observation $x_i$.]

The unknown parameters are the tap values $h_p$ of the channel, while the observed output data is given by
$$x = \begin{bmatrix} u_0 & & & \\ u_1 & u_0 & & \\ u_2 & u_1 & u_0 & \\ \vdots & & & \ddots \\ u_{n-1} & u_{n-2} & \cdots & u_{n-1-P} \end{bmatrix}\begin{bmatrix} h_0 \\ h_1 \\ h_2 \\ \vdots \\ h_P \end{bmatrix} + w = U h + w$$
The data covariance matrix required in the Bayesian approach is given by
$$C_{xx} = E\left[(X - \mu_x)(X - \mu_x)^T\right] = E\left[(Uh - U\mu_h)(Uh - U\mu_h)^T\right] + E[ww^T] = U C_{hh} U^T + C_{ww}$$
The optimal Bayesian estimator for this problem is given by
$$\hat h = C_{hh}U^T\left(U C_{hh} U^T + C_{ww}\right)^{-1}\left(x - U\mu_h\right) + \mu_h$$
If the channel gains (taps) are uncorrelated with zero mean, $C_{hh} = I$, $\mu_h = 0$, then
$$\hat h = U^T\left(U U^T + C_{ww}\right)^{-1} x$$
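A small numerical sketch of the channel estimator (illustrative; the channel, training sequence, and noise level are made up): build the convolution matrix U from a known training sequence, then apply the Bayesian formula with $C_{hh} = I$, $\mu_h = 0$, and $C_{ww} = \sigma^2 I$.

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(8)
P, n, sigma2 = 4, 100, 0.1                    # P+1 channel taps, n observations

h = rng.standard_normal(P + 1)                # unknown channel taps
u = rng.choice([-1.0, 1.0], size=n)           # known training sequence

# convolution matrix: row i is [u_i, u_{i-1}, ..., u_{i-P}] (zeros before the start)
U = toeplitz(u, np.r_[u[0], np.zeros(P)])

x = U @ h + np.sqrt(sigma2) * rng.standard_normal(n)   # observations x = U h + w

# Bayesian estimator with C_hh = I, mu_h = 0, C_ww = sigma2 * I
h_hat = U.T @ np.linalg.solve(U @ U.T + sigma2 * np.eye(n), x)
print(np.max(np.abs(h_hat - h)))              # small for a long training sequence
```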

41 Linear Models

The previous example is an instance of a linear model (not to be confused with a linear estimator). In a linear model, the vector of observed values, the statistic, is given by the linear relation from the previous page, $x = U\theta + w$. If no statistical information about the parameters $\theta$ is known, we apply classical estimation and compute the CRLB as follows:
$$\frac{\partial}{\partial\theta}\ln p(x;\theta) = \frac{\partial}{\partial\theta}\left[-\frac{1}{2\sigma^2}(x - U\theta)^T(x - U\theta)\right] = \frac{1}{\sigma^2}\left(U^T x - U^T U\theta\right) = \frac{U^T U}{\sigma^2}\left[\left(U^T U\right)^{-1}U^T x - \theta\right]$$
The last equation is in the canonical CRLB form, which allows us to immediately extract both the minimum-variance unbiased estimator as well as its error:

Minimum variance estimator:
$$\hat\theta = \left(U^T U\right)^{-1}U^T x \qquad\text{with error covariance}\qquad E\left[(\hat\theta - \theta)(\hat\theta - \theta)^T\right] = \sigma^2\left(U^T U\right)^{-1}$$
Note that in this derivation we have assumed that the noise $w$ is white with variance $\sigma^2$ per coordinate. The Bayesian estimator for the equivalent problem, also assuming uncorrelated parameters $C_{\theta\theta} = I$, is
$$\hat\theta = U^T\left(U U^T + \sigma^2 I\right)^{-1} x$$
which, with the help of the matrix inversion lemma, can be brought into the form
$$\hat\theta = \left(U^T U + \sigma^2 I\right)^{-1}U^T x$$

42 An Alternate Approach and Its Geometric View

The method of least squares (LS) computes the solution $\theta$ with the smallest squared distance to the received measurements, i.e.,
$$\hat\theta = \arg\min_\theta\;\|x - U\theta\|^2$$
It has exactly the same solution as the MVUE from the classical consideration, namely
$$\hat\theta = \left(U^T U\right)^{-1}U^T x \quad\Longrightarrow\quad U^T U\hat\theta = U^T x \quad\Longrightarrow\quad U^T\left(U\hat\theta - x\right) = 0$$
The first product, $U\hat\theta$, generates a vector which is a linear combination of the columns $\{u_k\}$ of $U$; this vector lives in the span of $U$, as shown below.

[Figure: the observation $x$, its orthogonal projection $Px = U\hat\theta$ onto the plane $\mathrm{span}\{u_k\}$, and the orthogonal error vector.]

It is easy to see that $U\hat\theta$ is orthogonal to the error $U\hat\theta - x$ (second expansion above), and therefore
$$U\hat\theta = U\left(U^T U\right)^{-1}U^T x$$
is the operation of orthogonally projecting $x$ onto the span of $U$. The matrix
$$P = U\left(U^T U\right)^{-1}U^T = P^2$$
is the orthogonal projection matrix onto $U$.
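A quick numeric illustration of the projection view (a sketch only, with random data): the fitted vector $U\hat\theta = Px$ is the orthogonal projection of $x$ onto the span of the columns of $U$, the residual is orthogonal to that span, and $P$ is idempotent.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 20, 3
U = rng.standard_normal((n, p))
x = rng.standard_normal(n)

theta_hat = np.linalg.solve(U.T @ U, U.T @ x)     # least-squares solution
P = U @ np.linalg.solve(U.T @ U, U.T)             # projection matrix onto span(U)

print(np.allclose(P @ P, P))                      # idempotent: P^2 = P
print(np.allclose(U.T @ (U @ theta_hat - x), 0))  # residual orthogonal to the columns of U
print(np.allclose(P @ x, U @ theta_hat))          # P x equals the fitted vector
```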

43 Linear Channel Model: Revisited

Revisiting the linear channel model in the classical approach, we know that the CRLB error bound can be achieved in this case, and the error is given by
$$C_{\hat h\hat h} = \sigma^2\left(U^T U\right)^{-1}$$
If we use the unit coordinate vector $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a single 1 at the $i$-th position, then the variance of the estimate of the $i$-th channel gain is given as
$$\mathrm{var}(\hat h_i) = e_i^T C_{\hat h\hat h}\,e_i = e_i^T D^T D\,e_i$$
where we have used a symmetric decomposition, such as the Cholesky decomposition, to obtain $C_{\hat h\hat h} = D^T D$. This can always be done with a (semi-)definite matrix such as $C_{\hat h\hat h}$. We now use a form of the famous Cauchy-Schwarz inequality,
$$\left(\sum_i x_i y_i\right)^2 \le \left(\sum_i x_i^2\right)\left(\sum_i y_i^2\right)$$
with $x = D e_i$ and $y = D^{-T} e_i$, to obtain
$$1 = \left(e_i^T e_i\right)^2 = \left(e_i^T D^T D^{-T} e_i\right)^2 \le \left(e_i^T D^T D\,e_i\right)\left(e_i^T D^{-1}D^{-T} e_i\right) = \left(e_i^T C_{\hat h\hat h}\,e_i\right)\left(e_i^T C_{\hat h\hat h}^{-1}\,e_i\right)$$
and therefore
$$\mathrm{var}(\hat h_i) = e_i^T C_{\hat h\hat h}\,e_i \ge \frac{1}{e_i^T C_{\hat h\hat h}^{-1}\,e_i} = \frac{\sigma^2}{\left[U^T U\right]_{ii}}$$
that is, the tap-estimate variance is never smaller than $\sigma^2$ divided by the input energy associated with that tap, and in general it is strictly larger. This effect is known as noise enhancement of the pseudo-inverse.

44 Wiener Filtering

In the cases considered so far (with the exception of the predictor), the parameters to be estimated were fixed and affected the random process from outside. However, any random quantity which is correlated with the observation can serve as a parameter, and therefore let us consider the case where these parameters evolve continuously:
$$x = \theta + w = s + w$$
We now have as many parameters as measurements, and successful estimation is only possible if there is a strong correlation among the components of $s$. Using the optimal linear estimator we obtain the Wiener estimator
$$\hat s = C_{ss}\left(C_{ss} + C_{ww}\right)^{-1} x$$

Filter interpretation: The equation above is basically a matrix-vector multiplication, i.e.,
$$\hat s = C_{ss}\left(C_{ss} + C_{ww}\right)^{-1} x = F x$$
As the process evolves, we compute
$$\hat s_n = \sum_{k=0}^{n} f^{(n)}_{n-k}\,x_k = r_{ss}^T\left(C_{ss} + C_{ww}\right)^{-1} x$$
where $r_{ss}^T$ is the last row of $C_{ss}$. This equation can be brought into the form of the Wiener-Hopf equations:
$$\left(C_{ss} + C_{ww}\right)f^{(n)} = r_{ss}$$
For stationary processes the correlation values must decay as $n \to \infty$, which leads to stationary coefficients $f$ that no longer evolve with $n$. The equation above for $\hat s_n$ then describes a time-invariant discrete filter, the Wiener filter.

45 Sequential Linear Minimum Mean-Square Error Estimator

The LMMSE is an important step towards the modern recursive estimation and tracking algorithms such as the Kalman and the recursive least-squares estimators. Starting point is the minimum mean-square error estimator for the linear system $x = U\theta + w$. We are interested in updating the estimate after one new observation, i.e.,
$$\begin{bmatrix} x \\ x[n] \end{bmatrix} = \begin{bmatrix} U \\ u^T[n] \end{bmatrix}\theta + w$$
The orthogonality principle states that the data is orthogonal to the estimation error:
$$E\left[\left(\theta_i - \hat\theta_i\right)x\right] = 0$$
We now propose the following form for the updated estimator at time $n$:
$$\hat\theta_i[n] = \hat\theta_i[n-1] + c[n]$$
The orthogonality principle leads to the following conditions:
(a) $E\left[c[n]\,x[i]\right] = 0$, for $i \in [0, n-1]$
(b) $E\left[\left(\theta_i - \hat\theta_i[n-1] - c[n]\right)x[n]\right] = 0$

One-step prediction: We start with a one-step predictor for $x[n]$, denoted by $\hat x[n]$. From before, we know that the conditional expectation is the best estimator:
$$\hat x[n] = E\left[x[n]\,\middle|\,x\right] \quad\Longrightarrow\quad c[n] = k_i[n]\left(x[n] - \hat x[n]\right)$$
Furthermore, since $\hat x[n]$ and $x$ are jointly Gaussian:
$$\hat x[n] = E[x[n]] + C_{x[n]x}C_{xx}^{-1}\left(x - \mu_x\right) = u^T[n]\mu_\theta + u^T[n]C_{\theta\theta}U^T C_{xx}^{-1}\left(x - U\mu_\theta\right) = u^T[n]\,\hat\theta[n-1]$$

46 Sequential Linear Minimum Mean-Square Error Estimator

Condition (b) is now used to determine the correction term $c[n]$, which is a multiple of the prediction error. The multiplication factor $k_i[n]$ is the Kalman gain and needs to be evaluated. From condition (b) we proceed as
$$E\left[\left(\theta_i - \hat\theta_i[n-1] - k_i[n]\left(x[n] - \hat x[n]\right)\right)\left(x[n] - \hat x[n]\right)\right] = 0$$
Using $x[n] - \hat x[n] = u^T[n]\left(\theta - \hat\theta[n-1]\right) + w[n]$, this gives
$$k_i[n] = \frac{E\left[\left(\theta_i - \hat\theta_i[n-1]\right)\left(\theta - \hat\theta[n-1]\right)^T\right]u[n]}{u^T[n]\,M[n-1]\,u[n] + \sigma^2[n]}$$
If we combine all the individual Kalman gains into a gain vector $k[n]$ we obtain
$$k[n] = \begin{bmatrix} k_0[n] \\ k_1[n] \\ \vdots \\ k_P[n] \end{bmatrix} = \frac{M[n-1]\,u[n]}{u^T[n]\,M[n-1]\,u[n] + \sigma^2[n]}$$
The matrix $M[n]$ is the error covariance matrix at time $n$ and is needed in the algorithm. Both $M[n]$ and $k[n]$ are computed recursively in the algorithm, which is summarized here.

The sequential linear minimum mean-square error estimator update equations are given by

Gain update: $\quad k[n] = \dfrac{M[n-1]\,u[n]}{u^T[n]\,M[n-1]\,u[n] + \sigma^2[n]}$

Error variance update: $\quad M[n] = \left(I - k[n]\,u^T[n]\right)M[n-1]$

Estimator update: $\quad \hat\theta[n] = \hat\theta[n-1] + k[n]\left(x[n] - u^T[n]\hat\theta[n-1]\right)$

The complexity of this one-step update is dominated by the matrix multiplications in the gain and error variance update steps. Both of these involve $P \times P$ matrices with a complexity of $O(P^2)$ per update, as opposed to the $O(n^3)$ cost of a direct batch solution.

47 The Kalman Filter: The State-Space Model

The primary difference between the Kalman filter and the sequential MMSE estimator is that the linear system model is no longer driven by a fixed parameter vector, but by the update equation
$$x[n] = u^T[n]\,s[n] + w[n]$$
that is, the formerly fixed parameter $\theta$ is now also evolving with time and is denoted by $s[n]$. The key to the Kalman filter is the model for $s[n]$, the so-called state-space model. Its accuracy determines how well the filter works, and its general form is given by

The Kalman state-space model is defined as
$$s[n] = A\,s[n-1] + B\,u[n]$$
where $A$ is a $P \times P$ model evolution description, $u[n]$ is white innovation noise of unit variance, and $B$ is a noise impact matrix.

Example: Vehicle Tracking. We wish to track a vehicle in the $x$-$y$ plane, where $r_x[n], r_y[n]$ are the $x$-$y$ coordinates at time $n$, and $v_x[n], v_y[n]$ are the speeds in the $x$-$y$ directions. Our system state is defined as the four parameters $s^T[n] = [r_x[n], r_y[n], v_x[n], v_y[n]]$, and the state-space model is given by
$$s[n] = \begin{bmatrix} r_x[n] \\ r_y[n] \\ v_x[n] \\ v_y[n] \end{bmatrix} = \begin{bmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} r_x[n-1] \\ r_y[n-1] \\ v_x[n-1] \\ v_y[n-1] \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ u_x[n] \\ u_y[n] \end{bmatrix}$$
Here, the state-space model captures the motion of the vehicle via position and speed; uncertainties, i.e., accelerations, are added to the model via the innovations in the speed components. Note that uncertainties in the measurements of the positions, for example, are modeled in the linear system equation at the top of the page.

48 The Kalman Filter: The Algorithm

The actual algorithm development for the Kalman filter is identical to that of the sequential LMMSE. We need the first-order measurement prediction
$$\hat x[n] = C_{x[n]x}C_{xx}^{-1} x = u^T[n]\,A\,C_{s[n-1]x}C_{xx}^{-1} x = u^T[n]\,\hat s[n|n-1]$$
which requires the one-step prediction $\hat s[n|n-1]$ of the system state. Similar to the parameter update of the sequential LMMSE, the system state update is given as a correction of the state prediction at time $n$:
$$\hat s[n|n] = \hat s[n|n-1] + k[n]\left(x[n] - u^T[n]\hat s[n|n-1]\right)$$
The remainder of the development is identical to the one discussed before, and we obtain the complete Kalman filter equations:

State prediction: $\quad \hat s[n|n-1] = A\,\hat s[n-1|n-1]$

MSE prediction: $\quad M[n|n-1] = A\,M[n-1|n-1]\,A^T + BB^T$

Gain update: $\quad k[n] = \dfrac{M[n|n-1]\,u[n]}{u^T[n]\,M[n|n-1]\,u[n] + \sigma^2[n]}$

Error variance update: $\quad M[n|n] = \left(I - k[n]\,u^T[n]\right)M[n|n-1]$

Estimator update: $\quad \hat s[n|n] = \hat s[n|n-1] + k[n]\left(x[n] - u^T[n]\hat s[n|n-1]\right)$

The only functional difference between the Kalman filter and the sequential LMMSE is that in the former, both the new state and the estimated output are predicted, and therefore there are also an update and a prediction equation for the error covariance.
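The vehicle-tracking example can be turned into a short simulation (an illustrative sketch with made-up noise levels; it assumes scalar position-only measurements, so that u[n] selects one coordinate per step, alternating between the x and y positions):

```python
import numpy as np

rng = np.random.default_rng(10)
dt, steps = 0.1, 200
sigma_w, sigma_u = 0.5, 0.2          # measurement noise std, innovation (acceleration) std

# state s = [r_x, r_y, v_x, v_y]; constant-velocity model from the notes
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])
B = np.vstack([np.zeros((2, 2)), sigma_u * np.eye(2)])   # innovations enter the speeds

s = np.array([0.0, 0.0, 1.0, 0.5])   # true initial state
s_hat = np.zeros(4)                  # filter state estimate
M = 10.0 * np.eye(4)                 # initial error covariance

for n in range(steps):
    # true state evolution and a scalar position measurement (alternating x / y)
    s = A @ s + B @ rng.standard_normal(2)
    u = np.array([1.0, 0, 0, 0]) if n % 2 == 0 else np.array([0, 1.0, 0, 0])
    x = u @ s + sigma_w * rng.standard_normal()

    # Kalman recursion: predict, gain, covariance update, state update
    s_pred = A @ s_hat
    M_pred = A @ M @ A.T + B @ B.T
    k = M_pred @ u / (u @ M_pred @ u + sigma_w**2)
    M = (np.eye(4) - np.outer(k, u)) @ M_pred
    s_hat = s_pred + k * (x - u @ s_pred)

print("true position     :", s[:2])
print("estimated position:", s_hat[:2])
```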


Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering. Stochastic Processes and Linear Algebra Recap Slides

Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering. Stochastic Processes and Linear Algebra Recap Slides Prof. Dr.-Ing. Armin Dekorsy Department of Communications Engineering Stochastic Processes and Linear Algebra Recap Slides Stochastic processes and variables XX tt 0 = XX xx nn (tt) xx 2 (tt) XX tt XX

More information

TSKS01 Digital Communication Lecture 1

TSKS01 Digital Communication Lecture 1 TSKS01 Digital Communication Lecture 1 Introduction, Repetition, and Noise Modeling Emil Björnson Department of Electrical Engineering (ISY) Division of Communication Systems Emil Björnson Course Director

More information

5 Kalman filters. 5.1 Scalar Kalman filter. Unit delay Signal model. System model

5 Kalman filters. 5.1 Scalar Kalman filter. Unit delay Signal model. System model 5 Kalman filters 5.1 Scalar Kalman filter 5.1.1 Signal model System model {Y (n)} is an unobservable sequence which is described by the following state or system equation: Y (n) = h(n)y (n 1) + Z(n), n

More information

Chp 4. Expectation and Variance

Chp 4. Expectation and Variance Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE)

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) 1 ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) Jingxian Wu Department of Electrical Engineering University of Arkansas Outline Minimum Variance Unbiased Estimators (MVUE)

More information

ECE 650 Lecture #10 (was Part 1 & 2) D. van Alphen. D. van Alphen 1

ECE 650 Lecture #10 (was Part 1 & 2) D. van Alphen. D. van Alphen 1 ECE 650 Lecture #10 (was Part 1 & 2) D. van Alphen D. van Alphen 1 Lecture 10 Overview Part 1 Review of Lecture 9 Continuing: Systems with Random Inputs More about Poisson RV s Intro. to Poisson Processes

More information

Probability and Statistics for Final Year Engineering Students

Probability and Statistics for Final Year Engineering Students Probability and Statistics for Final Year Engineering Students By Yoni Nazarathy, Last Updated: May 24, 2011. Lecture 6p: Spectral Density, Passing Random Processes through LTI Systems, Filtering Terms

More information

Lecture Notes 4 Vector Detection and Estimation. Vector Detection Reconstruction Problem Detection for Vector AGN Channel

Lecture Notes 4 Vector Detection and Estimation. Vector Detection Reconstruction Problem Detection for Vector AGN Channel Lecture Notes 4 Vector Detection and Estimation Vector Detection Reconstruction Problem Detection for Vector AGN Channel Vector Linear Estimation Linear Innovation Sequence Kalman Filter EE 278B: Random

More information

26. Filtering. ECE 830, Spring 2014

26. Filtering. ECE 830, Spring 2014 26. Filtering ECE 830, Spring 2014 1 / 26 Wiener Filtering Wiener filtering is the application of LMMSE estimation to recovery of a signal in additive noise under wide sense sationarity assumptions. Problem

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

ECE353: Probability and Random Processes. Lecture 18 - Stochastic Processes

ECE353: Probability and Random Processes. Lecture 18 - Stochastic Processes ECE353: Probability and Random Processes Lecture 18 - Stochastic Processes Xiao Fu School of Electrical Engineering and Computer Science Oregon State University E-mail: xiao.fu@oregonstate.edu From RV

More information

Stochastic Processes. A stochastic process is a function of two variables:

Stochastic Processes. A stochastic process is a function of two variables: Stochastic Processes Stochastic: from Greek stochastikos, proceeding by guesswork, literally, skillful in aiming. A stochastic process is simply a collection of random variables labelled by some parameter:

More information

ECE 541 Stochastic Signals and Systems Problem Set 11 Solution

ECE 541 Stochastic Signals and Systems Problem Set 11 Solution ECE 54 Stochastic Signals and Systems Problem Set Solution Problem Solutions : Yates and Goodman,..4..7.3.3.4.3.8.3 and.8.0 Problem..4 Solution Since E[Y (t] R Y (0, we use Theorem.(a to evaluate R Y (τ

More information

Fourier Analysis Linear transformations and lters. 3. Fourier Analysis. Alex Sheremet. April 11, 2007

Fourier Analysis Linear transformations and lters. 3. Fourier Analysis. Alex Sheremet. April 11, 2007 Stochastic processes review 3. Data Analysis Techniques in Oceanography OCP668 April, 27 Stochastic processes review Denition Fixed ζ = ζ : Function X (t) = X (t, ζ). Fixed t = t: Random Variable X (ζ)

More information

Random Processes Why we Care

Random Processes Why we Care Random Processes Why we Care I Random processes describe signals that change randomly over time. I Compare: deterministic signals can be described by a mathematical expression that describes the signal

More information

Continuous Random Variables

Continuous Random Variables 1 / 24 Continuous Random Variables Saravanan Vijayakumaran sarva@ee.iitb.ac.in Department of Electrical Engineering Indian Institute of Technology Bombay February 27, 2013 2 / 24 Continuous Random Variables

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Financial Econometrics / 49

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Financial Econometrics / 49 State-space Model Eduardo Rossi University of Pavia November 2013 Rossi State-space Model Financial Econometrics - 2013 1 / 49 Outline 1 Introduction 2 The Kalman filter 3 Forecast errors 4 State smoothing

More information

Chapter 2 Random Processes

Chapter 2 Random Processes Chapter 2 Random Processes 21 Introduction We saw in Section 111 on page 10 that many systems are best studied using the concept of random variables where the outcome of a random experiment was associated

More information

ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering

ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering ECE531 Lecture 12: Linear Estimation and Causal Wiener-Kolmogorov Filtering D. Richard Brown III Worcester Polytechnic Institute 16-Apr-2009 Worcester Polytechnic Institute D. Richard Brown III 16-Apr-2009

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

EIE6207: Maximum-Likelihood and Bayesian Estimation

EIE6207: Maximum-Likelihood and Bayesian Estimation EIE6207: Maximum-Likelihood and Bayesian Estimation Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak

More information

GATE EE Topic wise Questions SIGNALS & SYSTEMS

GATE EE Topic wise Questions SIGNALS & SYSTEMS www.gatehelp.com GATE EE Topic wise Questions YEAR 010 ONE MARK Question. 1 For the system /( s + 1), the approximate time taken for a step response to reach 98% of the final value is (A) 1 s (B) s (C)

More information

UCSD ECE153 Handout #34 Prof. Young-Han Kim Tuesday, May 27, Solutions to Homework Set #6 (Prepared by TA Fatemeh Arbabjolfaei)

UCSD ECE153 Handout #34 Prof. Young-Han Kim Tuesday, May 27, Solutions to Homework Set #6 (Prepared by TA Fatemeh Arbabjolfaei) UCSD ECE53 Handout #34 Prof Young-Han Kim Tuesday, May 7, 04 Solutions to Homework Set #6 (Prepared by TA Fatemeh Arbabjolfaei) Linear estimator Consider a channel with the observation Y XZ, where the

More information

Northwestern University Department of Electrical Engineering and Computer Science

Northwestern University Department of Electrical Engineering and Computer Science Northwestern University Department of Electrical Engineering and Computer Science EECS 454: Modeling and Analysis of Communication Networks Spring 2008 Probability Review As discussed in Lecture 1, probability

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

PROBABILITY AND RANDOM PROCESSESS

PROBABILITY AND RANDOM PROCESSESS PROBABILITY AND RANDOM PROCESSESS SOLUTIONS TO UNIVERSITY QUESTION PAPER YEAR : JUNE 2014 CODE NO : 6074 /M PREPARED BY: D.B.V.RAVISANKAR ASSOCIATE PROFESSOR IT DEPARTMENT MVSR ENGINEERING COLLEGE, NADERGUL

More information

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 2011 Reading Chapter 5 (continued) Lecture 8 Key points in probability CLT CLT examples Prior vs Likelihood Box & Tiao

More information

EE4601 Communication Systems

EE4601 Communication Systems EE4601 Communication Systems Week 2 Review of Probability, Important Distributions 0 c 2011, Georgia Institute of Technology (lect2 1) Conditional Probability Consider a sample space that consists of two

More information

4 Derivations of the Discrete-Time Kalman Filter

4 Derivations of the Discrete-Time Kalman Filter Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof N Shimkin 4 Derivations of the Discrete-Time

More information

ECE Homework Set 3

ECE Homework Set 3 ECE 450 1 Homework Set 3 0. Consider the random variables X and Y, whose values are a function of the number showing when a single die is tossed, as show below: Exp. Outcome 1 3 4 5 6 X 3 3 4 4 Y 0 1 3

More information

5. Random Vectors. probabilities. characteristic function. cross correlation, cross covariance. Gaussian random vectors. functions of random vectors

5. Random Vectors. probabilities. characteristic function. cross correlation, cross covariance. Gaussian random vectors. functions of random vectors EE401 (Semester 1) 5. Random Vectors Jitkomut Songsiri probabilities characteristic function cross correlation, cross covariance Gaussian random vectors functions of random vectors 5-1 Random vectors we

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions

ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions ECE 541 Stochastic Signals and Systems Problem Set 9 Solutions Problem Solutions : Yates and Goodman, 9.5.3 9.1.4 9.2.2 9.2.6 9.3.2 9.4.2 9.4.6 9.4.7 and Problem 9.1.4 Solution The joint PDF of X and Y

More information

Signals and Spectra (1A) Young Won Lim 11/26/12

Signals and Spectra (1A) Young Won Lim 11/26/12 Signals and Spectra (A) Copyright (c) 202 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version.2 or any later

More information

Problems on Discrete & Continuous R.Vs

Problems on Discrete & Continuous R.Vs 013 SUBJECT NAME SUBJECT CODE MATERIAL NAME MATERIAL CODE : Probability & Random Process : MA 61 : University Questions : SKMA1004 Name of the Student: Branch: Unit I (Random Variables) Problems on Discrete

More information