DS-GA 1002 Lecture notes 6 Fall 2016

Convergence of random processes

1 Introduction

In these notes we study convergence of discrete random processes. This allows us to characterize phenomena such as the law of large numbers, the central limit theorem and the convergence of Markov chains, which are fundamental in statistical estimation and probabilistic modeling.

2 Types of convergence

Convergence for a deterministic sequence of real numbers x_1, x_2, ... is simple to define:

  lim_{i→∞} x_i = x    (1)

if x_i is arbitrarily close to x as i grows. More formally, for any ε > 0 there is an i_0 such that for all i > i_0 we have |x_i − x| < ε. This allows us to define convergence for a realization of a discrete random process X̃(ω, i), i.e. when we fix the outcome ω and X̃(ω, i) is just a deterministic function of i.

It is more challenging to define convergence of the random process X̃ to a random variable X, since both of these objects are only defined through their distributions. In this section we describe several alternative definitions of convergence for random processes.

2.1 Convergence with probability one

Consider a discrete random process X̃ and a random variable X defined on the same probability space. If we fix an element ω of the sample space Ω, then X̃(i, ω) is a deterministic sequence and X(ω) is a constant. It is consequently possible to verify whether X̃(i, ω) converges deterministically to X(ω) as i → ∞ for that particular value of ω. In fact, we can ask: what is the probability that this happens? To be precise, this would be the probability that if we draw ω we have

  lim_{i→∞} X̃(i, ω) = X(ω).    (2)

If this probability equals one then we say that X̃(i) converges to X with probability one.

Definition 2.1 (Convergence with probability one). A discrete random process X̃ converges with probability one to a random variable X belonging to the same probability space (Ω, F, P)
Figure 1: Convergence to zero of the discrete random process D̃ defined in Example 2.2 of Lecture Notes 5.

if

  P({ω | ω ∈ Ω, lim_{i→∞} X̃(ω, i) = X(ω)}) = 1.    (3)

Recall that in general the sample space Ω is very difficult to define and manipulate explicitly, except for very simple cases.

Example 2.2 (Puddle (continued)). Let us consider the discrete random process D̃ defined in Example 2.2 of Lecture Notes 5. If we fix ω ∈ (0, 1),

  lim_{i→∞} D̃(ω, i) = lim_{i→∞} ω/i    (4)
                     = 0.    (5)

It turns out the realizations tend to zero for all possible values of ω in the sample space. This implies that D̃ converges to zero with probability one.

2.2 Convergence in mean square and in probability

To verify convergence with probability one we fix the outcome ω and check whether the corresponding realizations of the random process converge deterministically. An alternative viewpoint is to fix the indexing variable i and consider how close the random variable X̃(i) is to another random variable X as we increase i.
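To build intuition, the deterministic convergence of individual realizations can be checked numerically. The short Python sketch below is our own illustration, not part of the notes; it assumes the puddle process has the form D̃(ω, i) = ω/i, consistent with Example 2.2.

```python
import random

def puddle_realization(omega, n):
    # Realization D~(omega, i) = omega / i for i = 1, ..., n
    # (the form assumed for the puddle process in Example 2.2).
    return [omega / i for i in range(1, n + 1)]

random.seed(0)
omega = random.random()              # fix an outcome omega in (0, 1)
path = puddle_realization(omega, 10**4)
print(path[0], path[-1])             # the realization decays toward zero
```

Every fixed ω in (0, 1) produces a deterministic sequence tending to zero, which is exactly why the process converges to zero with probability one.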
A possible measure of the distance between two random variables is the mean square of their difference. Recall that if E((X − Y)²) = 0 then X = Y with probability one by Chebyshev's inequality. The mean square deviation between X̃(i) and X is a deterministic quantity (a number), so we can evaluate its convergence as i → ∞. If it converges to zero then we say that the random sequence converges in mean square.

Definition 2.3 (Convergence in mean square). A discrete random process X̃ converges in mean square to a random variable X belonging to the same probability space if

  lim_{i→∞} E((X̃(i) − X)²) = 0.    (6)

Alternatively, we can consider the probability that X̃(i) is separated from X by a certain fixed ε > 0. If for any ε, no matter how small, this probability converges to zero as i → ∞, then we say that the random sequence converges in probability.

Definition 2.4 (Convergence in probability). A discrete random process X̃ converges in probability to another random variable X belonging to the same probability space if for any ε > 0

  lim_{i→∞} P(|X̃(i) − X| > ε) = 0.    (7)

Note that as in the case of convergence in mean square, the limit in this definition is deterministic, as it is a limit of probabilities, which are just real numbers.

As a direct consequence of Markov's inequality, convergence in mean square implies convergence in probability.

Theorem 2.5. Convergence in mean square implies convergence in probability.

Proof. We have

  lim_{i→∞} P(|X̃(i) − X| > ε) = lim_{i→∞} P((X̃(i) − X)² > ε²)    (8)
    ≤ lim_{i→∞} E((X̃(i) − X)²)/ε²   by Markov's inequality    (9)
    = 0,    (10)

if the sequence converges in mean square.

It turns out that convergence with probability one also implies convergence in probability. However, convergence with probability one does not imply convergence in mean square, or vice versa. The difference between these three types of convergence is not very important for the purposes of this course.
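Definitions 2.3 and 2.4 can be explored numerically. The sketch below is our own illustration: it uses a hypothetical process X̃(i) := X + Z/i (with X, Z independent standard Gaussians), which converges to X in mean square since E((X̃(i) − X)²) = 1/i², and estimates the probability in Definition 2.4 by Monte Carlo alongside the Markov bound from the proof of Theorem 2.5.

```python
import random

def prob_far(i, eps, trials=10000, seed=1):
    # Monte Carlo estimate of P(|X~(i) - X| > eps) for the toy process
    # X~(i) := X + Z/i, with X and Z independent standard Gaussians.
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        x = rng.gauss(0, 1)
        xi = x + rng.gauss(0, 1) / i
        if abs(xi - x) > eps:
            count += 1
    return count / trials

# E[(X~(i) - X)^2] = 1/i^2, so Markov's inequality bounds
# P(|X~(i) - X| > eps) by 1/(i * eps)^2; both tend to zero as i grows.
for i in [1, 10, 100]:
    print(i, prob_far(i, 0.5), 1 / (i * 0.5) ** 2)
```

The estimated probabilities shrink with i, as Definition 2.4 requires, and always stay below the (often loose) Markov bound.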
2.3 Convergence in distribution

In some cases, a random process X̃ does not converge to the value of any random variable, but the pdf or pmf of X̃(i) converges pointwise to the pdf or pmf of another random variable X. In that case, the actual values of X̃(i) and X will not necessarily be close, but in the limit they have the same distribution. We say that X̃ converges in distribution to X.

Definition 2.6 (Convergence in distribution). A discrete-state discrete random process X̃ converges in distribution to a discrete random variable X belonging to the same probability space if

  lim_{i→∞} p_X̃(i)(x) = p_X(x)   for all x ∈ R_X,    (11)

where R_X is the range of X.

A continuous-state discrete random process X̃ converges in distribution to a continuous random variable X belonging to the same probability space if

  lim_{i→∞} f_X̃(i)(x) = f_X(x)   for all x ∈ R,    (12)

assuming the pdfs are well defined (otherwise we can use the cdfs¹).

Note that convergence in distribution is a much weaker notion than convergence with probability one, in mean square or in probability. If a discrete random process X̃ converges to a random variable X in distribution, this only means that as i becomes large the distribution of X̃(i) tends to the distribution of X, not that the values of the two random variables are close. However, convergence in probability (and hence convergence with probability one or in mean square) does imply convergence in distribution.

Example 2.7 (Binomial converges to Poisson). Let us define a discrete random process X̃(i) such that the distribution of X̃(i) is binomial with parameters i and p := λ/i. X̃(i) and X̃(j) are independent for i ≠ j, which completely characterizes the n-th order distributions of the process for all n > 1. Consider a Poisson random variable X with parameter λ that is independent of X̃(i) for all i. Do you expect the values of X and X̃(i) to be close as i → ∞?

No! In fact even X̃(i) and X̃(i + 1) will not be close in general. However, X̃ converges in

¹ One can also define convergence in distribution of a discrete-state random process to a continuous random variable through the deterministic convergence of the cdfs.
distribution to X, as established in Example 3.7 of Lecture Notes 2:

  lim_{i→∞} p_X̃(i)(x) = lim_{i→∞} (i choose x) p^x (1 − p)^{i−x}    (13)
    = λ^x e^{−λ}/x!    (14)
    = p_X(x).    (15)

3 Law of Large Numbers

Let us define the average of a discrete random process.

Definition 3.1 (Moving average). The moving or running average Ã of a discrete random process X̃, defined for i = 1, 2, ... (i.e. 1 is the starting point), is equal to

  Ã(i) := (1/i) Σ_{j=1}^{i} X̃(j).    (16)

Consider an iid sequence. A very natural interpretation for the moving average is that it is a real-time estimate of the mean. In fact, in statistical terms the moving average is the empirical mean of the process up to time i (we will discuss the empirical mean later on in the course). The notorious law of large numbers establishes that the average does indeed converge to the mean of the iid sequence.

Theorem 3.2 (Weak law of large numbers). Let X̃ be an iid discrete random process with mean µ_X̃ := µ such that the variance of X̃(i), σ², is bounded. Then the average Ã of X̃ converges in mean square to µ.

Proof. First, we establish that the mean of Ã(i) is constant and equal to µ:

  E(Ã(i)) = E((1/i) Σ_{j=1}^{i} X̃(j))    (17)
    = (1/i) Σ_{j=1}^{i} E(X̃(j))    (18)
    = µ.    (19)
Due to the independence assumption, the variance scales linearly in i. Recall that for independent random variables the variance of the sum equals the sum of the variances:

  Var(Ã(i)) = Var((1/i) Σ_{j=1}^{i} X̃(j))    (20)
    = (1/i²) Σ_{j=1}^{i} Var(X̃(j))    (21)
    = σ²/i.    (22)

We conclude that

  lim_{i→∞} E((Ã(i) − µ)²) = lim_{i→∞} E((Ã(i) − E(Ã(i)))²)   by (19)    (23)
    = lim_{i→∞} Var(Ã(i))    (24)
    = lim_{i→∞} σ²/i   by (22)    (25)
    = 0.    (26)

By Theorem 2.5 the average also converges to the mean of the iid sequence in probability. In fact, one can also prove convergence with probability one under the same assumptions. This result is known as the strong law of large numbers, but the proof is beyond the scope of these notes. We refer the interested reader to more advanced texts in probability theory.

Figure 2 shows averages of realizations of several iid sequences. When the iid sequence is Gaussian or geometric we observe convergence to the mean of the distribution; however, when the sequence is Cauchy the moving average diverges. The reason is that, as we saw in Example 3.2 of Lecture Notes 4, the Cauchy distribution does not have a well defined mean! Intuitively, extreme values have non-negligible probability under the Cauchy distribution, so from time to time the iid sequence takes values with very large magnitudes and this makes the moving average diverge.

4 Central Limit Theorem

In the previous section we established that the moving average of a sequence of iid random variables converges to the mean of their distribution (as long as the mean is well defined and the variance is finite). In this section, we characterize the distribution of the average Ã(i)
Figure 2: Realization of the moving average of an iid standard Gaussian sequence (top), an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom). Each panel plots the moving average against the mean (or, in the Cauchy case, the median) of the iid sequence.
as i increases. It turns out that Ã converges to a Gaussian random variable in distribution, which is very useful in statistics as we will see later on.

This result, known as the central limit theorem, justifies the use of Gaussian distributions to model data that are the result of many different independent factors. For example, the distribution of height or weight of people in a certain population often has a Gaussian shape, as illustrated by Figure 1 of Lecture Notes 2, because the height and weight of a person depend on many different factors that are roughly independent. In many signal-processing applications noise is well modeled as having a Gaussian distribution for the same reason.

Theorem 4.1 (Central Limit Theorem). Let X̃ be an iid discrete random process with mean µ_X̃ := µ such that the variance of X̃(i), σ², is bounded. The random process √n (Ã(n) − µ), which corresponds to the centered and scaled moving average of X̃, converges in distribution to a Gaussian random variable with mean 0 and variance σ².

Proof. The proof of this remarkable result is beyond the scope of these notes. It can be found in any advanced text on probability theory. However, we would still like to provide some intuition as to why the theorem holds.

In Theorem 3.18 of Lecture Notes 3, we establish that the pdf of the sum of two independent random variables is equal to the convolution of their individual pdfs. The same holds for discrete random variables: the pmf of the sum is equal to the convolution of the pmfs, as long as the random variables are independent.

If each of the entries of the iid sequence has pdf f, then the pdf of the sum of the first i elements can be obtained by convolving f with itself i times,

  f_{Σ_{j=1}^{i} X̃(j)}(x) = (f ∗ f ∗ ⋯ ∗ f)(x).    (27)

If the sequence has a discrete state and each of the entries has pmf p, the pmf of the sum of the first i elements can be obtained by convolving p with itself i times,

  p_{Σ_{j=1}^{i} X̃(j)}(x) = (p ∗ p ∗ ⋯ ∗ p)(x).    (28)

Normalizing by i just results in scaling the result of the convolution, so the pmf or pdf of the moving average Ã is the result of repeated convolutions of a fixed function.
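The repeated convolutions in (28) are easy to carry out numerically. The short sketch below is our own illustration; the irregular pmf it starts from is an arbitrary example.

```python
def convolve(p, q):
    # Discrete convolution of two pmfs given as lists of probabilities.
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out

# A deliberately irregular pmf; repeated self-convolution (the pmf of a
# sum of iid copies) smooths it out, as the next paragraph discusses.
pmf = [0.4, 0.05, 0.2, 0.05, 0.3]
conv = pmf
for _ in range(4):
    conv = convolve(conv, pmf)
print([round(v, 4) for v in conv])  # still a valid pmf, now bell-shaped
```

After only a few self-convolutions the result is already far smoother than the original pmf, which is the mechanism behind the Gaussian limit.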
These convolutions have a smoothing effect, which eventually transforms the pmf/pdf into a Gaussian! We show this numerically in Figure 3 for two very different distributions: a uniform distribution and a very irregular one. Both converge to Gaussian-like shapes after just 3 or 4 convolutions. The central limit theorem makes this precise, establishing that the shape of the pmf or pdf becomes Gaussian asymptotically.

In statistics the central limit theorem is often invoked to justify treating averages as if they have a Gaussian distribution. The idea is that for large enough n, √n (Ã(n) − µ) is
Figure 3: Result of convolving two different distributions with themselves several times (i = 1, 2, 3, 4, 5). The shapes quickly become Gaussian-like.
Figure 4: Empirical distribution of the moving average of an iid exponential sequence with parameter λ = 2 (top), an iid geometric sequence with parameter p = 0.4 (center) and an iid Cauchy sequence (bottom), for i = 10², 10³ and 10⁴. The empirical distribution is computed from 10⁴ samples in all cases. For the first two rows the estimate provided by the central limit theorem is plotted in red.
approximately Gaussian with mean 0 and variance σ², which implies that Ã is approximately Gaussian with mean µ and variance σ²/n. It is important to remember that we have not established this rigorously. The rate of convergence will depend on the particular distribution of the entries of the iid sequence.

In practice convergence is usually very fast. Figure 4 shows the empirical distribution of the moving average of an exponential and a geometric iid sequence. In both cases the approximation obtained by the central limit theorem is very accurate even for an average of 100 samples. The figure also shows that for a Cauchy iid sequence, the distribution of the moving average does not become Gaussian, which does not contradict the central limit theorem, as the distribution does not have a well defined mean.

To close this section we derive a useful approximation to the binomial distribution using the central limit theorem.

Example 4.2 (Gaussian approximation to the binomial distribution). Let X have a binomial distribution with parameters n and p, such that n is large. Computing the probability that X is in a certain interval requires summing its pmf over all the values in that interval. Alternatively, we can obtain a quick approximation using the fact that for large n the distribution of a binomial random variable is approximately Gaussian. Indeed, we can write X as the sum of n independent Bernoulli random variables with parameter p,

  X = Σ_{i=1}^{n} B_i.    (29)

The mean of B_i is p and its variance is p(1 − p). By the central limit theorem, (1/n) X is approximately Gaussian with mean p and variance p(1 − p)/n. Equivalently, by Lemma 6.1 in Lecture Notes 2, X is approximately Gaussian with mean np and variance np(1 − p).

Assume that a basketball player makes each shot she takes with probability p = 0.4. If we assume that each shot is independent, what is the probability that she makes more than 420 shots out of 1000? We can model the shots made as a binomial X with parameters 1000 and 0.4. The exact answer is

  P(X ≥ 420) = Σ_{x=420}^{1000} p_X(x)    (30)
    = Σ_{x=420}^{1000} (1000 choose x) 0.4^x 0.6^{1000−x}    (31)
    ≈ 1.04 · 10⁻¹.
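The sum in (31) can be evaluated directly; the following sketch is our own check, computing the exact tail together with the Gaussian tail suggested by the central limit theorem for comparison.

```python
from math import comb, erf, sqrt

def binom_tail(n, p, k):
    # Exact P(X >= k) for a binomial X with parameters n and p,
    # i.e. the sum in equation (31).
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

def gauss_tail(mu, sigma, t):
    # P(G >= t) for a Gaussian G with mean mu and standard deviation sigma.
    return 0.5 * (1 - erf((t - mu) / (sigma * sqrt(2))))

n, p = 1000, 0.4
exact = binom_tail(n, p, 420)
approx = gauss_tail(n * p, sqrt(n * p * (1 - p)), 420)
print(exact, approx)  # the two tails agree to roughly two digits
```

The Gaussian tail is slightly smaller than the exact one (a continuity correction would close most of the gap), but the approximation is already good for n = 1000.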
(32)

If we apply the Gaussian approximation, by Lemma 6.1 in Lecture Notes 2, X being larger than 420 is the same as a standard Gaussian U being larger than (420 − µ)/σ, where µ and σ are the
mean and standard deviation of X, equal to np = 400 and √(np(1 − p)) ≈ 15.5 respectively:

  P(X ≥ 420) ≈ P(√(np(1 − p)) U + np ≥ 420)    (33)
    = P(U ≥ 1.29)    (34)
    = 1 − Φ(1.29)    (35)
    = 9.85 · 10⁻².    (36)

5 Convergence of Markov chains

In this section we study under what conditions a finite-state time-homogeneous Markov chain X̃ converges in distribution. If a Markov chain converges in distribution, then its state vector p_X̃(i), which contains the first-order pmf of X̃, converges to a fixed vector p_∞,

  p_∞ := lim_{i→∞} p_X̃(i).    (37)

This implies that the probability of the Markov chain being in each state tends to a specific value. By Lemma 4.1 in Lecture Notes 5, we can express (37) in terms of the initial state vector and the transition matrix of the Markov chain:

  p_∞ = lim_{i→∞} T_X̃^i p_X̃(0).    (38)

Computing this limit analytically for a particular T_X̃ and p_X̃(0) may seem challenging at first sight. However, it is often possible to leverage the eigendecomposition of the transition matrix (if it exists) to find p_∞. This is illustrated in the following example.

Example 5.1 (Mobile phones). A company that makes mobile phones wants to model the sales of a new model they have just released. At the moment 90% of the phones are in stock, 10% have been sold locally and none have been exported. Based on past data, the company determines that each day a phone in stock is sold with probability 0.2 and exported with probability 0.1. The initial state vector and the transition matrix of the Markov chain, with states ordered as (in stock, sold, exported), are

  a := [0.9; 0.1; 0],   T_X̃ = [0.7 0 0; 0.2 1 0; 0.1 0 1].    (39)
Figure 5: State diagram of the Markov chain described in Example 5.1 (top). Below we show three realizations of the Markov chain.
Figure 6: Evolution of the state vector of the Markov chain in Example 5.1 for different values of the initial state vector p_X̃(0).

We have used a to denote p_X̃(0) because later we will consider other possible initial state vectors. Figure 5 shows the state diagram and some realizations of the Markov chain. The company is interested in the fate of the new model. In particular, it would like to compute what fraction of mobile phones will end up exported and what fraction will be sold locally. This is equivalent to computing

  lim_{i→∞} p_X̃(i) = lim_{i→∞} T_X̃^i p_X̃(0)    (40)
    = lim_{i→∞} T_X̃^i a.    (41)

The transition matrix T_X̃ has three eigenvectors

  q_1 := [0; 0; 1],   q_2 := [0; 1; 0],   q_3 := [0.80; −0.53; −0.27].    (42)

The corresponding eigenvalues are λ_1 := 1, λ_2 := 1 and λ_3 := 0.7. We gather the eigenvectors and eigenvalues into two matrices

  Q := [q_1 q_2 q_3],   Λ := diag(λ_1, λ_2, λ_3),    (43)

so that the eigendecomposition of T_X̃ is

  T_X̃ := Q Λ Q⁻¹.    (44)
It will be useful to express the initial state vector a in terms of the different eigenvectors. This is achieved by computing

  Q⁻¹ p_X̃(0) = [0.3; 0.7; 1.122],    (45)

so that

  a = 0.3 q_1 + 0.7 q_2 + 1.122 q_3.    (46)

We conclude that

  lim_{i→∞} T_X̃^i a = lim_{i→∞} T_X̃^i (0.3 q_1 + 0.7 q_2 + 1.122 q_3)    (47)
    = lim_{i→∞} (0.3 T_X̃^i q_1 + 0.7 T_X̃^i q_2 + 1.122 T_X̃^i q_3)    (48)
    = lim_{i→∞} (0.3 λ_1^i q_1 + 0.7 λ_2^i q_2 + 1.122 λ_3^i q_3)    (49)
    = lim_{i→∞} (0.3 q_1 + 0.7 q_2 + 1.122 · 0.7^i q_3)    (50)
    = 0.3 q_1 + 0.7 q_2    (51)
    = [0; 0.7; 0.3].    (52)

This means that eventually the probability that each phone has been sold locally is 0.7 and the probability that it has been exported is 0.3. The left graph in Figure 6 shows the evolution of the state vector. As predicted, it eventually converges to the vector in equation (52).

In general, because of the special structure of the two eigenvectors with eigenvalues equal to one in this example, we have

  lim_{i→∞} T_X̃^i p_X̃(0) = (Q⁻¹ p_X̃(0))_1 q_1 + (Q⁻¹ p_X̃(0))_2 q_2.    (53)

This is illustrated in Figure 6, where you can see the evolution of the state vector if it is initialized to these other two distributions:

  b := [0.6; 0; 0.4],   Q⁻¹ b = [0.6; 0.4; 0.75],    (54)
  c := [0.4; 0.5; 0.1],   Q⁻¹ c = [0.23; 0.77; 0.5].    (55)
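The limits in (52), (54) and (55) can be verified numerically by simply iterating the transition matrix; the sketch below is our own check.

```python
def evolve(T, p, steps=200):
    # Repeatedly apply p <- T p, i.e. compute T^steps p.
    # T is stored row by row; its columns sum to one.
    for _ in range(steps):
        p = [sum(T[j][k] * p[k] for k in range(len(p))) for j in range(len(T))]
    return p

# Transition matrix from (39); states ordered as (in stock, sold, exported).
T = [[0.7, 0.0, 0.0],
     [0.2, 1.0, 0.0],
     [0.1, 0.0, 1.0]]

for p0 in ([0.9, 0.1, 0.0], [0.6, 0.0, 0.4], [0.4, 0.5, 0.1]):
    print([round(v, 2) for v in evolve(T, p0)])
    # prints [0.0, 0.7, 0.3], then [0.0, 0.4, 0.6], then [0.0, 0.77, 0.23]
```

Each initial state vector converges to the limit predicted by keeping only the components along the two eigenvalue-one eigenvectors, as in (53).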
The transition matrix of the Markov chain in Example 5.1 has two eigenvectors with eigenvalue equal to one. If we set the initial state vector equal to either of these eigenvectors (note that we must make sure to normalize them so that the state vector contains a valid pmf) then

  T_X̃ p_X̃(0) = p_X̃(0),    (56)

so that

  p_X̃(i) = T_X̃^i p_X̃(0)    (57)
    = p_X̃(0)    (58)

for all i. In particular,

  lim_{i→∞} p_X̃(i) = p_X̃(0),    (59)

so X̃ converges in distribution to a random variable with pmf p_X̃(0). A distribution that satisfies (59) is called a stationary distribution of the Markov chain.

Definition 5.2 (Stationary distribution). Let X̃ be a finite-state time-homogeneous Markov chain and let p_stat be a state vector containing a valid pmf over the possible states of X̃. If p_stat is an eigenvector associated to an eigenvalue equal to one, so that

  T_X̃ p_stat = p_stat,    (60)

then the distribution corresponding to p_stat is a stationary or steady-state distribution of X̃.

Establishing whether a distribution is stationary by checking whether (60) holds may be challenging computationally if the state space is very large. We now derive an alternative condition that implies stationarity. Let us first define reversibility of Markov chains.

Definition 5.3 (Reversibility). Let X̃ be a finite-state time-homogeneous Markov chain with s states and transition matrix T_X̃. Assume that X̃(i) is distributed according to the state vector p ∈ R^s. If

  P(X̃(i) = x_j, X̃(i + 1) = x_k) = P(X̃(i) = x_k, X̃(i + 1) = x_j)    (61)

for all 1 ≤ j, k ≤ s, then we say that X̃ is reversible with respect to p. This is equivalent to the detailed-balance condition

  (T_X̃)_kj p_j = (T_X̃)_jk p_k   for all 1 ≤ j, k ≤ s.    (62)
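The detailed-balance condition (62) is straightforward to check numerically. The sketch below is our own illustration on a hypothetical two-state chain that happens to be reversible with respect to its stationary pmf.

```python
def is_reversible(T, p, tol=1e-9):
    # Check detailed balance: T[k][j] * p[j] == T[j][k] * p[k] for all
    # pairs of states (T stored row by row, columns summing to one).
    s = len(p)
    return all(abs(T[k][j] * p[j] - T[j][k] * p[k]) <= tol
               for j in range(s) for k in range(s))

# Hypothetical two-state chain of our own choosing and its stationary pmf.
T = [[0.9, 0.2],
     [0.1, 0.8]]
p_stat = [2 / 3, 1 / 3]
print(is_reversible(T, p_stat))   # True: p_stat is stationary by Theorem 5.4
print(is_reversible(T, [0.5, 0.5]))  # False: the uniform pmf is not stationary
```

For this chain the probability flow from state 1 to state 2 (0.1 · 2/3) matches the flow back (0.2 · 1/3), which is exactly what (62) demands.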
As proved in the following theorem, reversibility implies stationarity, but the converse does not hold: a Markov chain is not necessarily reversible with respect to a stationary distribution (and often will not be). The detailed-balance condition therefore only provides a sufficient condition for stationarity.

Theorem 5.4 (Reversibility implies stationarity). If a time-homogeneous Markov chain X̃ is reversible with respect to a distribution p_X, then p_X is a stationary distribution of X̃.

Proof. Let p be the state vector containing p_X. By assumption T_X̃ and p satisfy (62), so for 1 ≤ j ≤ s

  (T_X̃ p)_j = Σ_{k=1}^{s} (T_X̃)_jk p_k    (63)
    = Σ_{k=1}^{s} (T_X̃)_kj p_j    (64)
    = p_j Σ_{k=1}^{s} (T_X̃)_kj    (65)
    = p_j.    (66)

The last step follows from the fact that the columns of a valid transition matrix must add up to one (the chain always has to go somewhere).

In Example 5.1 the Markov chain has two stationary distributions. It turns out that this is not possible for irreducible Markov chains.

Theorem 5.5. Irreducible Markov chains have a single stationary distribution.

Proof. This follows from the Perron-Frobenius theorem, which states that the transition matrix of an irreducible Markov chain has a single eigenvector with eigenvalue equal to one and nonnegative entries.

If in addition the Markov chain is aperiodic, then it is guaranteed to converge in distribution to a random variable with its stationary distribution for any initial state vector. Such Markov chains are called ergodic.

Theorem 5.6 (Convergence of Markov chains). If a discrete-time time-homogeneous Markov chain X̃ is irreducible and aperiodic, its state vector converges to the stationary distribution p_stat of X̃ for any initial state vector p_X̃(0). This implies that X̃ converges in distribution to a random variable with pmf given by p_stat.
Figure 7: Evolution of the state vector of the Markov chain in Example 5.7 for three different initial state vectors (the states are SF, LA and SJ).

The proof of this result is beyond the scope of the course.

Example 5.7 (Car rental (continued)). The Markov chain in the car rental example is irreducible and aperiodic. We will now check that it indeed converges in distribution. Its transition matrix has the following eigenvectors:

  q_1 := [0.273; 0.545; 0.182],   q_2 := [0.577; −0.789; 0.211],   q_3 := [0.577; 0.211; −0.789].    (67)

The corresponding eigenvalues are λ_1 := 1, λ_2 := 0.573 and λ_3 := 0.227. As predicted by Theorem 5.5, the Markov chain has a single stationary distribution. For any initial state vector, the component that is collinear with q_1 will be preserved by the transitions of the Markov chain, but the other two components will become negligible after a while. The chain consequently converges in distribution to a random variable with pmf q_1 (note that q_1 has been normalized to be a valid pmf), as predicted by Theorem 5.6. This is illustrated in Figure 7. No matter how the company allocates the new cars, eventually 27.3% will end up in San Francisco, 54.5% in LA and 18.2% in San Jose.
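Theorem 5.6 can also be illustrated numerically: an irreducible aperiodic chain forgets its initial state vector. The three-state matrix below is a hypothetical example of our own (the actual car-rental transition matrix is defined in Lecture Notes 5); every initialization converges to the same stationary pmf.

```python
def step(T, p):
    # One transition: p <- T p (T stored row by row, columns sum to one).
    return [sum(T[j][k] * p[k] for k in range(len(p))) for j in range(len(T))]

# Hypothetical irreducible, aperiodic three-state chain (our own example).
T = [[0.6, 0.2, 0.2],
     [0.3, 0.7, 0.3],
     [0.1, 0.1, 0.5]]

for p in ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3]):
    for _ in range(100):
        p = step(T, p)
    print([round(v, 4) for v in p])   # each line prints [0.3333, 0.5, 0.1667]
```

The common limit (1/3, 1/2, 1/6) is the unique stationary distribution of this particular matrix, mirroring the behavior of the car rental chain in Figure 7.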