RANDOM WALKS, LARGE DEVIATIONS, AND MARTINGALES


Chapter 7

RANDOM WALKS, LARGE DEVIATIONS, AND MARTINGALES

7.1 Introduction

Definition 7.1. Let {X_i; i ≥ 1} be a sequence of IID random variables, and let S_n = X_1 + X_2 + ··· + X_n. The integer-time stochastic process {S_n; n ≥ 1} is called a random walk, or, more precisely, the random walk based on {X_i; i ≥ 1}.

For any given n, S_n is simply a sum of IID random variables, but here the behavior of the entire random walk process, {S_n; n ≥ 1}, is of interest. Thus, for a given real number α > 0, we might try to find the probability that S_n ≥ α for any n, or, given that S_n ≥ α for one or more values of n, we might want to find the distribution of the smallest n such that S_n ≥ α. Typical questions about random walks, then, are finding the smallest n such that S_n reaches or exceeds a threshold, and finding the probability that the threshold is ever reached or crossed. Since S_n tends to drift downward with increasing n if E[X] = X̄ < 0, and tends to drift upward if X̄ > 0, the results to be obtained depend critically on whether X̄ < 0, X̄ > 0, or X̄ = 0. Since results for X̄ < 0 can be easily translated into results for X̄ > 0 by considering {−S_n; n ≥ 0}, we will focus on the case X̄ < 0. As one might expect, both the results and the techniques have a very different flavor when X̄ = 0, since here the random walk does not drift but typically wanders around in a rather aimless fashion.

The following three subsections discuss three special cases of random walks. The first two, simple random walks and integer random walks, will be useful throughout as examples, since they can be easily visualized and analyzed. The third special case is that of renewal processes, which we have already studied and which will provide additional insight into the general study of random walks. After this, Sections 7.2 and 7.3 show how two major application areas, G/G/1 queues and

hypothesis testing, can be treated in terms of random walks. These sections also show why questions related to threshold crossings are so important in random walks. Section 7.4 then develops the theory of threshold crossing for general random walks, and Section 7.5 extends and in many ways simplifies these results through the use of stopping rules and a powerful generalization of Wald's equality known as Wald's identity.

The remainder of the chapter is devoted to a rather general type of stochastic process called martingales. The topic of martingales is both a subject of interest in its own right and also a tool that provides additional insight into random walks, laws of large numbers, and other basic topics in probability and stochastic processes.

7.1.1 Simple random walks

Suppose X_1, X_2, ... are IID binary random variables, each taking on the value 1 with probability p and −1 with probability q = 1 − p. Letting S_n = X_1 + ··· + X_n, the sequence of sums {S_n; n ≥ 1} is called a simple random walk. S_n is the difference between positive and negative occurrences in the first n trials. Thus, if there are j positive occurrences for 0 ≤ j ≤ n, then S_n = 2j − n, and

    Pr{S_n = 2j − n} = [n! / (j!(n−j)!)] p^j (1−p)^{n−j}.    (7.1)

This distribution allows us to answer questions about S_n for any given n, but it is not very helpful in answering such questions as the following: for any given integer k > 0, what is the probability that the sequence S_1, S_2, ... ever reaches or exceeds k? This probability can be expressed¹ as Pr{⋃_{n=1}^∞ {S_n ≥ k}} and is referred to as the probability that the random walk crosses a threshold at k. Exercise 7.1 demonstrates the surprisingly simple result that for a simple random walk with p < 1/2, this threshold crossing probability is

    Pr{⋃_{n=1}^∞ {S_n ≥ k}} = (p / (1−p))^k.    (7.2)

Sections 7.4 and 7.5 treat this same question for general random walks.
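The closed form (7.2) is easy to check numerically. The sketch below is not part of the text's development: the parameters p and k are arbitrary choices, and each simulated path is truncated at a finite number of steps, which is a reasonable approximation here because the walk drifts downward when p < 1/2.

```python
import random

def crossing_prob(p, k, trials=5000, n_max=200, seed=1):
    """Estimate Pr{S_n >= k for some n} for a simple random walk with
    steps +1 (prob p) and -1 (prob 1-p), truncating each path at n_max steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = 0
        for _ in range(n_max):
            s += 1 if rng.random() < p else -1
            if s >= k:
                hits += 1
                break
    return hits / trials

p, k = 0.25, 3
exact = (p / (1 - p)) ** k      # (7.2): (1/3)^3, about 0.037
print(exact, crossing_prob(p, k))
```

The estimate should agree with (7.2) to within Monte Carlo error, since paths that ever reach k = 3 almost always do so long before the truncation point.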
They also treat questions such as the overshoot given a threshold crossing, the time at which the threshold is crossed given that it is crossed, and the probability of crossing such a positive threshold before crossing a given negative threshold.

7.1.2 Integer-valued random walks

Suppose next that X_1, X_2, ... are arbitrary IID integer-valued random variables. We can again ask for the probability that such an integer-valued random walk crosses a threshold

¹ This same probability is often expressed as Pr{sup_{n≥1} S_n ≥ k}. For a general random walk, the event ⋃_{n≥1} {S_n ≥ k} is slightly different from {sup_{n≥1} S_n ≥ k}. The latter event can include sample sequences s_1, s_2, ... in which a subsequence of values s_n approach k as a limit but never quite reach k. This is impossible for a simple random walk since all s_k must be integers. It is possible, but can be shown to have probability zero, for general random walks. We will avoid this silliness by not using the sup notation to refer to threshold crossings.

at k, i.e., that the event ⋃_{n=1}^∞ {S_n ≥ k} occurs, but the question is considerably harder than for simple random walks. Since this random walk takes on only integer values, it can be represented as a Markov chain with the set of integers forming the state space. In the Markov chain representation, threshold crossing problems are first-passage-time problems. These problems can be attacked by the Markov chain tools we already know, but the special structure of the random walk provides new approaches and simplifications that will be explained in Sections 7.4 and 7.5.

7.1.3 Renewal processes as special cases of random walks

If X_1, X_2, ... are IID positive random variables, then {S_n; n ≥ 1} is both a special case of a random walk and also the sequence of arrival epochs of a renewal counting process, {N(t); t ≥ 0}. In this special case, the sequence {S_n; n ≥ 1} must eventually cross a threshold at any given positive value α, and the question of whether the threshold is ever crossed becomes uninteresting. However, the trial on which a threshold is crossed and the overshoot when it is crossed are familiar questions from the study of renewal theory. For the renewal counting process, N(α) is the largest n for which S_n ≤ α and N(α) + 1 is the smallest n for which S_n > α, i.e., the smallest n for which the threshold at α is strictly exceeded. Thus the trial at which α is crossed is a central issue in renewal theory. Also the overshoot, which is S_{N(α)+1} − α, is familiar as the residual life at α. Figure 7.1 illustrates the difference between general random walks and positive random walks, i.e., renewal processes. Note that the renewal process is illustrated with the axes reversed from the usual representation.
We usually view each renewal epoch as a time (epoch) and view N(α) as the number of trials up to time α, whereas with random walks, we usually view the number of trials as a discrete-time variable and view the sum of rv's as some kind of amplitude or cost. Mathematically this makes no difference, and it is often valuable to move from one point of view to the other.

7.2 The waiting time in a G/G/1 queue

This section and the next introduce two important problems that are best solved by viewing them as random walks. In this section we represent the waiting time in a G/G/1 queue as a threshold crossing problem in a random walk. In the next section, we represent the error probability in a standard type of detection problem as a random walk problem. This detection problem will later be generalized to a sequential detection problem based on threshold crossings in a random walk.

Consider a G/G/1 queue with first-come-first-serve (FCFS) service. We shall find how to associate the probability that a customer must wait more than some given time α in the queue with the probability that a certain random walk crosses a threshold at α. Let X_1, X_2, ... be the interarrival times of a G/G/1 queueing system; thus these variables are IID with a given distribution function F_X(x) = Pr{X_i ≤ x}. Assume that arrival 0 enters an empty system at time 0, so that S_n = X_1 + X_2 + ··· + X_n is the epoch of the n-th arrival after time 0. Let Y_0, Y_1, ..., be the service times of the successive customers. These are IID

Figure 7.1: The sample function in (a) illustrates a random walk with arbitrary (positive and negative) step sizes {X_i; i ≥ 1}. The sample function in (b) illustrates a random walk restricted to positive step sizes {X_i > 0; i ≥ 1}, i.e., a renewal process. Note that the axes are reversed from the usual depiction of a renewal process. The same sample function is shown in part (c) using the customary axes for a renewal process. For both the arbitrary random walk of part (a) and the random walk with positive step sizes of parts (b) and (c), a threshold at α is crossed on trial 4 with an overshoot S_4 − α.

with some given distribution function F_Y(y) and are independent of {X_i; i ≥ 1}. Figure 7.2 shows a sample path of arrivals and departures and illustrates the waiting time in queue for each arrival.

To analyze the waiting time, note that the system time, i.e., the time in queue plus the time in service, for any given customer n is W_n + Y_n, where W_n is the queueing time and Y_n is the service time. As illustrated in Figure 7.2, customer n+1 arrives X_{n+1} time units after the beginning of this interval, i.e., after the arrival of customer n. If X_{n+1} < W_n + Y_n, then customer n+1 arrives while customer n is still in the system, and thus must wait in the queue until n finishes service (in the figure, for example, customer 2 arrives while customer 1 is still in the queue). Thus

    W_{n+1} = W_n + Y_n − X_{n+1}    if X_{n+1} ≤ W_n + Y_n.    (7.3)

On the other hand, if X_{n+1} > W_n + Y_n, then customer n (and all earlier customers) have departed when n+1 arrives. Thus n+1 starts service immediately and W_{n+1} = 0. These two cases can be combined in the single equation

    W_{n+1} = max[W_n + Y_n − X_{n+1}, 0]    for n ≥ 0;  W_0 = 0.    (7.4)
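The recursion (7.4) can be iterated directly in code. The following is a minimal sketch; the sample interarrival and service times are arbitrary toy numbers, not taken from the figure.

```python
def waiting_times(interarrivals, services):
    """Iterate W_{n+1} = max(W_n + Y_n - X_{n+1}, 0) of (7.4), starting from
    W_0 = 0.  interarrivals = [x_1, x_2, ...], services = [y_0, y_1, ...].
    Returns the queueing times [w_0, w_1, ..., w_n]."""
    w = [0.0]
    for x_next, y in zip(interarrivals, services):
        w.append(max(w[-1] + y - x_next, 0.0))
    return w

# Arbitrary toy sample path; the third customer finds the system empty.
print(waiting_times([1.0, 0.5, 2.0, 0.3], [1.2, 0.8, 0.4, 1.0]))
# approximately [0.0, 0.2, 0.5, 0.0, 0.7]
```

Note that a zero appearing in the returned list marks an arrival to an empty system, i.e., the start of a new busy period, which is exactly the structure exploited in the maximization below.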
Since Y_n and X_{n+1} are coupled together in this equation for each n, it is convenient to define U_{n+1} = Y_n − X_{n+1}. Note that {U_i; i ≥ 1} is a sequence of IID random variables. From (7.4), W_n = max[W_{n−1} + U_n, 0], and iterating on this equation,

    W_n = max[max[W_{n−2} + U_{n−1}, 0] + U_n, 0]    (7.5)
        = max[(W_{n−2} + U_{n−1} + U_n), U_n, 0]
        = max[(W_{n−3} + U_{n−2} + U_{n−1} + U_n), (U_{n−1} + U_n), U_n, 0]
        = ···

Figure 7.2: Sample path of arrivals and departures from a G/G/1 queue. Customer 0 arrives at time 0 and enters service immediately. Customer 1 arrives at time s_1 = x_1. For the case shown above, customer 0 has not yet departed, i.e., x_1 < y_0, so customer 1's time in queue is w_1 = y_0 − x_1. As illustrated, customer 1's system time (queueing time plus service time) is w_1 + y_1. Customer 2 arrives at s_2 = x_1 + x_2. For the case shown above, this is before customer 1 departs at y_0 + y_1. Thus, customer 2's wait in queue is w_2 = y_0 + y_1 − x_1 − x_2. As illustrated above, x_2 + w_2 is also equal to customer 1's system time, so w_2 = w_1 + y_1 − x_2. Customer 3 arrives when the system is empty, so it enters service immediately with no wait in queue, i.e., w_3 = 0.

        = max[(U_1 + U_2 + ··· + U_n), (U_2 + U_3 + ··· + U_n), ..., (U_{n−1} + U_n), U_n, 0].    (7.6)

It is not necessary for the theorem below, but we can understand this maximization better by realizing that if the maximization is achieved at U_i + U_{i+1} + ··· + U_n, then a busy period must start with the arrival of customer i − 1 and continue at least through the service of customer n. To see this intuitively, note that the analysis above starts with the arrival of customer 0 to an empty system at time 0, but the choice of time 0 and customer number 0 has nothing to do with the analysis, and thus the analysis is valid for any arrival to an empty system. Choosing the largest customer number before n that starts a busy period must then give the correct waiting time, and thus maximize (7.5). Exercise 7.2 provides further insight into this maximization.

Define Z^n_1 = U_n, define Z^n_2 = U_n + U_{n−1}, and in general, for i ≤ n, define Z^n_i = U_n + U_{n−1} + ··· + U_{n−i+1}. Thus Z^n_n = U_n + ··· + U_1. With these definitions, (7.5) becomes

    W_n = max[0, Z^n_1, Z^n_2, ..., Z^n_n].    (7.7)

Note that the terms in {Z^n_i; 1 ≤ i ≤ n} are the first n terms of a random walk, but it is not the random walk based on U_1, U_2, ..., but rather the random walk going backward, starting with U_n. Note also that W_{n+1}, for example, is the maximum of a different set of variables, i.e., it is the walk going backward from U_{n+1}. Fortunately, this doesn't matter for the analysis, since the reversed variables (U_n, U_{n−1}, ..., U_1) are statistically identical to (U_1, ..., U_n). The probability that the wait is greater than or equal to a given value α is

    Pr{W_n ≥ α} = Pr{max[0, Z^n_1, Z^n_2, ..., Z^n_n] ≥ α}.    (7.8)

This says that, for the n-th customer, Pr{W_n ≥ α} is equal to the probability that the random walk {Z^n_i; 1 ≤ i ≤ n} crosses a threshold at α by the n-th trial. Because of the

initialization used in the analysis, we see that W_n is the waiting time in queue of the n-th arrival after the beginning of a busy period (although this n-th arrival might belong to a later busy period than that initial busy period). As noted above, (U_n, U_{n−1}, ..., U_1) is statistically identical to (U_1, ..., U_n), and thus Pr{W_n ≥ α} is the same as the probability that the first n terms of the random walk based on {U_i; i ≥ 1} cross a threshold at α. Since the first n+1 terms of this random walk provide one more opportunity to cross α than the first n terms, we see that

    Pr{W_n ≥ α} ≤ Pr{W_{n+1} ≥ α} ≤ ··· ≤ 1.    (7.9)

Since this sequence of probabilities is non-decreasing, it must have a limit as n → ∞, and this limit is denoted Pr{W ≥ α}. Mathematically,² this limit is the probability that a random walk based on {U_i; i ≥ 1} ever crosses a threshold at α. Physically, this limit is the probability that the waiting time in queue is at least α for any given very large-numbered customer (i.e., for customer n when the influence of a busy period starting n customers earlier has died out). These results are summarized in the following theorem.

Theorem 7.1. Let {X_i; i ≥ 1} be the interarrival intervals of a G/G/1 queue, let {Y_i; i ≥ 0} be the service times, and assume that the system is empty at time 0 when customer 0 arrives. Let W_n be the time that the n-th customer waits in the queue, and let U_n = Y_{n−1} − X_n for n ≥ 1. Then for any α > 0 and n ≥ 1, W_n is given by (7.7). Also, Pr{W_n ≥ α} is equal to the probability that the random walk based on {U_i; i ≥ 1} crosses a threshold at α by the n-th trial. Finally, Pr{W ≥ α} = lim_{n→∞} Pr{W_n ≥ α} is equal to the probability that the random walk based on {U_i; i ≥ 1} ever crosses a threshold at α.

Note that the theorem specifies the distribution function of W_n for each n, but says nothing about the joint distribution of successive waiting times.
These are not the same as the distribution of successive terms in a random walk because of the reversal of terms above.

We shall find a relatively simple solution for the probability that a random walk crosses a positive threshold in Section 7.4. From Theorem 7.1, this also solves for the distribution of queueing delay for the G/G/1 queue (and thus also for the M/G/1 and M/M/1 queues).

7.3 Detection, decisions, and hypothesis testing

Consider a situation in which we make n noisy observations of the outcome of a single binary random variable H and then guess, on the basis of the observations alone, which binary outcome occurred. In communication technology, this is called a detection problem. It models, for example, the situation in which a single binary digit is transmitted over some time interval but a noisy vector depending on that binary digit is received. It similarly models the problem of detecting whether or not a target is present in a radar observation. In control theory, such situations are usually referred to as decision problems, whereas in statistics, they are referred to as hypothesis testing.

² More precisely, the sequence of waiting times W_1, W_2, ..., has distribution functions F_{W_n} that converge to F_W, the generic distribution of the given threshold crossing problem with unlimited trials. As n increases, the distribution of W_n approaches F_W, and we refer to W as the waiting time in steady state.

Specifically, let H_0 and H_1 be the names for the two possible values of the binary random variable H, and let p_0 = Pr{H_0} and p_1 = 1 − p_0 = Pr{H_1}. Thus p_0 and p_1 are the a priori probabilities³ for the random variable H. Let Y_1, Y_2, ..., Y_n be the n observations. We assume that, conditional on H_0, the observations Y_1, ..., Y_n are IID random variables. Suppose, to be specific, that these variables have a density f(y|H_0). Conditional on H_0, the joint density of a sample n-tuple y = (y_1, y_2, ..., y_n) of observations is given by

    f(y|H_0) = ∏_{i=1}^n f(y_i|H_0).    (7.10)

Similarly, conditional on H_1, we assume that Y_1, ..., Y_n are IID random variables with a conditional joint density given by (7.10) with H_1 in place of H_0. In summary, then, the model is that H is a rv with PMF {p_0, p_1}, and conditional on H, Y = (Y_1, ..., Y_n) is an n-tuple of IID rv's.

Given a particular sample of n observations y = (y_1, y_2, ..., y_n), we can evaluate Pr{H_1|y} as

    Pr{H_1|y} = p_1 ∏_{i=1}^n f(y_i|H_1) / [p_1 ∏_{i=1}^n f(y_i|H_1) + p_0 ∏_{i=1}^n f(y_i|H_0)].    (7.11)

We can evaluate Pr{H_0|y} in the same way, and the ratio of these quantities is given by

    Pr{H_1|y} / Pr{H_0|y} = [p_1 ∏_{i=1}^n f(y_i|H_1)] / [p_0 ∏_{i=1}^n f(y_i|H_0)].    (7.12)

If we observe y and choose H_0, then Pr{H_1|y} is the resulting probability of error, and conversely, if we choose H_1, then Pr{H_0|y} is the resulting probability of error. Thus the probability of error is minimized, for a given y, by evaluating the above ratio and choosing H_1 if the ratio is greater than 1 and choosing H_0 otherwise. If the ratio is equal to 1, the error probability is the same whether H_0 or H_1 is chosen. The above rule for choosing H_0 or H_1 is called the maximum a posteriori probability detection rule, usually abbreviated as the MAP rule.
The rule has a more attractive form (and also brings us back to random walks) if we take the logarithm of each side of (7.12), getting

    ln[Pr{H_1|y} / Pr{H_0|y}] = ln(p_1/p_0) + Σ_{i=1}^n z_i,  where  z_i = ln[f(y_i|H_1) / f(y_i|H_0)].    (7.13)

The quantity z_i in (7.13) is called a log likelihood ratio. Note that z_i is a function only of y_i, and that this same function is used for each i. For simplicity, we assume that this

³ Statisticians have argued since the beginning of statistics about the validity of choosing a priori probabilities for a hypothesis to be tested. Bayesian statisticians are comfortable with this practice and non-Bayesians are not. Both are comfortable with choosing a probability model for the observations conditional on each hypothesis. We take a Bayesian approach here, partly to take advantage of the power of a complete probability model, and partly because non-Bayesian results, i.e., results that do not depend on the a priori probabilities, are much easier to derive and interpret within a full probability model. As will be seen, the Bayesian approach also makes it natural to incorporate the results of early observations into updated a priori probabilities for analyzing later observations.

function is finite for all y. The MAP rule is to choose H_1 or H_0 depending on whether the quantity on the right is positive or negative, i.e.,

    Σ_{i=1}^n z_i  > ln(p_0/p_1) :  choose H_1;
    Σ_{i=1}^n z_i  < ln(p_0/p_1) :  choose H_0;    (7.14)
    Σ_{i=1}^n z_i  = ln(p_0/p_1) :  don't care; choose either.

Conditional on H_0, the rv's {Y_i; 1 ≤ i ≤ n} are IID. Since Z_i = ln[f(Y_i|H_1)/f(Y_i|H_0)] for 1 ≤ i ≤ n, and since Z_i is the same finite function of Y_i for all i, we see that each Z_i is a rv and that Z_1, ..., Z_n are IID conditional on H_0. Similarly, Z_1, ..., Z_n are IID conditional on H_1. Without conditioning on H_0 or H_1, neither the rv's Y_1, ..., Y_n nor the rv's Z_1, ..., Z_n are IID. Thus it is important to keep in mind the basic structure of this problem: initially a sample value is chosen for H; then n observations, IID conditional on H, are made. Naturally the observer does not observe the original selection for H.

Conditional on H_0, the sum on the left in (7.14) is thus the sample value of the n-th term in the random walk S_n = Z_1 + ··· + Z_n based on the rv's {Z_i; i ≥ 1} conditional on H_0. The MAP rule chooses H_1, thus making an error conditional on H_0, if S_n is greater than the threshold ln(p_0/p_1). Similarly, conditional on H_1, S_n = Z_1 + ··· + Z_n is the n-th term in a random walk with the conditional probabilities from H_1, and an error is made, conditional on H_1, if S_n is less than the threshold ln(p_0/p_1).

It is interesting to observe that Σ_i z_i in (7.14) depends only on the observations and not on p_0, whereas the threshold ln(p_0/p_1) depends only on p_0 and not on the observations. Naturally the marginal probability distribution of Σ_i Z_i does depend on p_0 (and on the conditioning), but Σ_i z_i is a function only of the observations, so its value does not depend on p_0. The decision rule in (7.14) is called a threshold test in the sense that Σ_i z_i is compared with a threshold to make a decision.
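As a concrete illustration of (7.13) and (7.14), suppose (hypothetically; this example is not part of the text) that conditional on H_0 the observations are N(−1, 1) and conditional on H_1 they are N(+1, 1). The log likelihood ratio then reduces to z_i = 2y_i, and the MAP rule is a threshold test on Σ_i z_i. The sketch below simulates the whole model, with p_0 = 0.6 and n = 10 chosen arbitrarily:

```python
import math
import random

def llr(y, m0=-1.0, m1=1.0, sigma=1.0):
    """Log likelihood ratio z = ln[f(y|H1)/f(y|H0)] for equal-variance Gaussians."""
    return ((y - m0) ** 2 - (y - m1) ** 2) / (2 * sigma ** 2)

def map_decide(ys, p0):
    """MAP rule (7.14): choose H1 iff the sum of the z_i exceeds ln(p0/p1)."""
    return 1 if sum(llr(y) for y in ys) > math.log(p0 / (1 - p0)) else 0

# Simulate: first draw H with Pr{H0} = p0, then n IID observations given H.
rng = random.Random(0)
p0, n, trials = 0.6, 10, 2000
errors = 0
for _ in range(trials):
    h = 0 if rng.random() < p0 else 1
    ys = [rng.gauss(2 * h - 1.0, 1.0) for _ in range(n)]
    errors += (map_decide(ys, p0) != h)
print(errors / trials)   # empirical error probability; small at this separation
```

Note how the structure of the simulation mirrors the structure emphasized above: a single sample value of H is drawn first, and only then are the n conditionally IID observations generated.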
There are a number of other formulations of the problem that also lead to threshold tests. For example, maximum likelihood (ML) detection chooses the hypothesis i that maximizes f(y|H_i), and thus corresponds to a threshold at 0. The ML rule has the property that it minimizes the maximum of Pr{H_0|Y} and Pr{H_1|Y}; this has obvious benefits when one is unsure of the a priori probabilities. In many detection situations there are unequal costs associated with the two kinds of errors. For example, one kind of error in a medical test could lead to death of the patient and the other to an unneeded medical procedure. A minimum-cost decision minimizes the expected cost over the two types of errors. As shown in Exercise 7.5, this is also a threshold test. Finally, one might impose the constraint that Pr{error|H_1} must be less than some tolerable limit α, and then minimize Pr{error|H_0} subject to this constraint. The solution to this is called a Neyman-Pearson threshold test (see Exercise 7.6). The Neyman-Pearson test is of particular interest since it does not require any assumptions about a priori probabilities.

So far we have assumed that a decision is made after n observations. In many situations there is a cost associated with observations, and one would prefer, after a given number of observations, to make a decision if the resulting probability of error is small enough, and to

continue with more observations otherwise. Common sense dictates such a strategy, and the branch of probability theory analyzing such strategies is called sequential analysis, which is based on the results in the next section. Essentially, we will see that the appropriate way to vary the number of observations based on the result of the observations is as follows. The probability of error under either hypothesis is based on S_n = Z_1 + ··· + Z_n. Thus we will see that the appropriate rule is to choose H_0 if the sample value of S_n is less than some negative threshold β, to choose H_1 if the sample value of S_n is at least some positive threshold α, and to continue testing if the sample value has not exceeded either threshold.

The previous examples have all involved random walks crossing thresholds, and we now turn to the systematic study of threshold crossing problems. First we look at single thresholds, so that one question of interest is to find Pr{S_n ≥ α} for an arbitrary integer n ≥ 1 and arbitrary α > 0. Another question is whether S_n ≥ α for any n ≥ 1. We then turn to random walks with both a positive and a negative threshold. Here, some questions of interest are to find the probability that the positive threshold is crossed before the negative threshold, to find the distribution of the threshold crossing time given the particular threshold crossed, and to find the overshoot when a threshold is crossed.

7.4 Threshold crossing probabilities in random walks

Let {X_i; i ≥ 1} be a sequence of IID random variables with the distribution function F_X(x), and let {S_n; n ≥ 1} be a random walk with S_n = X_1 + ··· + X_n. We assume throughout that E[X] exists and is finite. The reader should focus on the case E[X] = X̄ < 0 on a first reading, and consider X̄ = 0 and X̄ > 0 later. For X̄ < 0 and α > 0, we shall develop upper bounds on Pr{S_n ≥ α} that are exponentially decreasing in n and α.
These bounds, and many similar results to follow, are examples of large deviation theory, i.e., the theory of the probabilities of highly unlikely events.

We assume throughout this section that X has a moment generating function g(r) = E[e^{rX}] = ∫ e^{rx} dF_X(x), and that g(r) is finite in some open interval around r = 0. As pointed out in Chapter 1, X must then have moments of all orders, and the tails of its distribution function F_X(x) must decay at least exponentially in x as x → −∞ and as x → +∞. Note that e^{rx} is increasing in r for x > 0, so that if ∫_0^∞ e^{rx} dF_X(x) blows up for some r_+ > 0, it remains infinite for all r > r_+. Similarly, for x < 0, e^{rx} is decreasing in r, so that if ∫_{−∞}^0 e^{rx} dF_X(x) blows up at some r_− < 0, it is infinite for all r < r_−. Thus if r_− and r_+ are the smallest and largest values such that g(r) is finite for r_− < r < r_+, then g(r) is infinite for r > r_+ and for r < r_−. The end points r_− and r_+ can each be finite or infinite, and the values g(r_+) and g(r_−) can each be finite or infinite.

Note that if X is bounded, in the sense that Pr{X < −B} = 0 and Pr{X > B} = 0 for some B < ∞, then g(r) is finite for all r. Such rv's are said to have finite support and include all discrete rv's with a finite set of possible values. Another simple example is that if X is a non-negative rv with F_X(x) = 1 − exp(−αx) for x ≥ 0, then r_+ = α. Similarly, if X is a negative rv with F_X(x) = exp(βx) for x < 0, then r_− = −β. Exercise 7.7 provides further examples of these possibilities.

The moment generating function of S_n = X_1 + ··· + X_n is given by

    g_{S_n}(r) = E[exp(rS_n)] = E[exp(r(X_1 + ··· + X_n))] = {E[exp(rX)]}^n = {g(r)}^n.    (7.15)

It follows that g_{S_n}(r) is finite in the same interval (r_−, r_+) as g(r).

First we look at the probability, Pr{S_n ≥ α}, that the n-th step of the random walk satisfies S_n ≥ α for some threshold α > 0. We could actually find the distribution of S_n either by convolving the density of X with itself n times or by going through the transform domain. This would not give us much insight, however, and would be computationally tedious for large n. Instead, we explore the exponential bound, (1.38). For any r ≥ 0 in the region where g(r) is finite, i.e., for 0 ≤ r < r_+, we have

    Pr{S_n ≥ α} ≤ g_{S_n}(r) e^{−rα} = [g(r)]^n e^{−rα}.    (7.16)

It is convenient to rewrite (7.16) in terms of the semi-invariant moment generating function γ(r) = ln[g(r)]:

    Pr{S_n ≥ α} ≤ exp[nγ(r) − rα]    for any r, 0 ≤ r < r_+.    (7.17)

The first two derivatives of γ with respect to r are given by

    γ′(r) = g′(r)/g(r);    γ′′(r) = [g(r)g′′(r) − (g′(r))²] / [g(r)]².    (7.18)

Recall from (1.32) that g′(0) = E[X] and g′′(0) = E[X²]. Substituting this into (7.18), we can evaluate γ′(0) and γ′′(0) as

    γ′(0) = X̄ = E[X];    γ′′(0) = σ²_X.    (7.19)

The fact that γ′′(0) is the second central moment of X is why γ is called a semi-invariant moment generating function. Unfortunately, the higher-order derivatives of γ, evaluated at r = 0, are not equal to the higher-order central moments.

Over the range of r where g(r) < ∞, it is shown in Exercise 7.8 that γ′′(r) ≥ 0, with strict inequality except in the very special (and uninteresting) case where X is deterministic. If X is deterministic, then S_n is also, and there is no point to considering a probabilistic model.
We thus assume in what follows that X is non-deterministic, and thus γ′′(r) > 0 for all r between r_− and r_+. Figure 7.3 sketches γ(r) assuming that X̄ < 0 and r_+ = ∞.

We can now minimize the exponent in (7.17) over r ≥ 0. For simplicity, first assume that r_+ = ∞. Since γ′′(r) > 0, the exponent is minimized by setting its derivative equal to 0. The minimum (if it exists) occurs at the r, say r_o, for which γ′(r) = α/n. As seen from Figure 7.3, this is satisfied with r ≥ 0 only if α/n ≥ X̄. Thus

    Pr{S_n ≥ α} ≤ exp{n[γ(r_o) − r_o γ′(r_o)]}    where γ′(r_o) = α/n ≥ E[X]    (7.20)

                 = exp{−α[r_o − γ(r_o)/γ′(r_o)]}.    (7.21)
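The identities in (7.19) are easy to check numerically. The sketch below uses the simple-random-walk step of Section 7.1.1 (X = +1 with probability p, −1 otherwise, with p = 1/4 chosen arbitrarily so that X̄ < 0) and compares finite-difference derivatives of γ at r = 0 with the mean and variance:

```python
import math

p = 0.25   # arbitrary; X = +1 w.p. p, -1 w.p. 1-p, so E[X] = 2p - 1 < 0

def gamma(r):
    """Semi-invariant MGF gamma(r) = ln g(r) for the binary step above."""
    return math.log(p * math.exp(r) + (1 - p) * math.exp(-r))

h = 1e-5
d1 = (gamma(h) - gamma(-h)) / (2 * h)                 # ~ gamma'(0)
d2 = (gamma(h) - 2 * gamma(0.0) + gamma(-h)) / h**2   # ~ gamma''(0)
mean = 2 * p - 1                                      # E[X] = -0.5
var = 1 - mean ** 2                                   # E[X^2] - mean^2 = 0.75
print(d1, mean, d2, var)   # d1 ~ mean and d2 ~ var, as (7.19) asserts
```

The same finite-difference check fails at higher orders, consistent with the remark that derivatives of γ beyond the second do not equal the corresponding central moments.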

Figure 7.3: Semi-invariant moment generating function γ(r) for a rv X such that E[X] < 0 and r_+ = ∞. Note that γ(r) is tangent to the line of slope E[X] < 0 at r = 0 and has a positive second derivative everywhere.

Figure 7.4: Graphical minimization of γ(r) − (α/n)r. For any r, γ(r) − (α/n)r is found by drawing a line of slope α/n from the point (r, γ(r)) to the vertical axis. The minimum occurs when the line of slope α/n is tangent to the curve.

The first of these inequalities shows how Pr{S_n ≥ α} decreases exponentially with n for fixed α/n = γ′(r_o), and the second shows how it decreases with α for the same ratio α/n = γ′(r_o). We now give a graphical interpretation, in Figure 7.4, of what these exponents mean, and return subsequently to discuss whether α/n = γ′(r) actually has a solution.

The function γ(r) has a strictly positive second derivative, and thus any tangent to the function must lie below the function everywhere except at the point of tangency. The particular tangent shown is tangent at the point r = r_o where γ′(r) = α/n. Thus this tangent line has slope α/n = γ′(r_o) and meets the vertical axis at the point γ(r_o) − r_o γ′(r_o). As illustrated, this vertical axis intercept is smaller than γ(r) − (α/n)r for any other choice of r. This is the exponent in (7.20). This exponent is negative and shows that for a fixed ratio α/n, Pr{S_n ≥ α} decays exponentially in n.

Our primary interest is in the probability that S_n exceeds a positive threshold, α > 0, but it can be seen that both the algebraic and graphical arguments above apply whenever α > nE[X]. Since E[X] < 0, we might also be interested in the probability that S_n exceeds the mean by some amount, while also being negative. Figure 7.4 also gives a geometric interpretation of (7.21) for the case α > 0.
The exponent in α is given by (7.21) to be −r_o + γ(r_o)/γ′(r_o), where r_o satisfies γ′(r_o) = α/n. The negative

of this is seen to be the horizontal axis intercept of the tangent to γ(r) at r_o, and thus this intercept gives the exponential decay rate of Pr{S_n ≥ α} in α for fixed α/n.

It is interesting to observe what happens to (7.21) as n is changed while holding α > 0 fixed. This is an important question for threshold crossings, since it provides an upper bound on crossing a fixed α for different values of n. For the α and n illustrated in Figure 7.4, note that as n increases with fixed α, the slope of the tangent decreases, moving the horizontal axis intercept to the right, i.e., increasing the exponential decay rate in α. Conversely, as n is decreased, the intercept moves to the left, decreasing the exponential decay rate. Note, however, that when the slope increases to the point where the intercept reaches the point where γ(r) = 0, i.e., the point labelled r* in Figure 7.4, then further reductions in n move the tangent point to where γ(r) is positive. At this point, the intercept starts to move to the right again. This means that for all n, an upper bound to Pr{S_n ≥ α} is given by

    Pr{S_n ≥ α} ≤ exp(−r*α)    for arbitrary α > 0, n ≥ 1.    (7.22)

We now must return to the question of whether the equation α/n = γ′(r) has a solution. From the assumption that E[X] < 0, we know that γ′(0) < 0. We have not yet shown why γ′(r) should become positive as r increases. To see this in the simplest case, assume that X is discrete and takes on positive values (if X were a non-positive rv, there would be no point in discussing the probability of crossing a positive threshold). Let x_max be the largest such value. Then g(r) = Σ_x p(x)e^{rx} ≥ p(x_max)e^{r x_max}. It follows that γ(r) ≥ r x_max + ln(p(x_max)). Since γ has a positive second derivative, it follows that γ′(r) must be increasing with r and must approach x_max in the limit as r → ∞. Thus α/n = γ′(r) has a solution whenever α/n < x_max.
It is also clear that Pr{S_n ≥ α} = 0 for α/n > x_max. Thus γ′(r) = α/n has a solution over the range of interest. One can extend this argument to the case where X has an arbitrary distribution function with negative mean.

Although we have only established (7.20), (7.21), and (7.22) as upper bounds, Exercise 7.10 shows that for any fixed ratio a = α/n and any ε > 0, there is an n_0(ε) such that for all n ≥ n_0(ε),

Pr{S_n ≥ n(a − ε)} > exp{−n[ra − γ(r) + ε]},

where r satisfies γ′(r) = a. This means that for fixed a = α/n, (7.20) is exponentially tight, i.e., Pr{S_n ≥ na} decays exponentially with increasing n at the asymptotic rate ra − γ(r), where r satisfies γ′(r) = a.

The above discussion has treated only the case where r_+ = ∞. Figure 7.5 illustrates the minimization of (7.17) for the case where r_+ < ∞. We have assumed that γ(r) < 0 for r < r_+, since the previous argument applies if γ(r) crosses 0 at some r < r_+. To include this case, (7.20) is generalized to

Pr{S_n ≥ α} ≤ exp{n[γ(r_o) − r_o α/n]}    if α/n = γ′(r_o) for some r_o < r_+;
Pr{S_n ≥ α} ≤ exp{n lim_{r→r_+} [γ(r) − rα/n]}    otherwise.    (7.23)
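The quantities above are easy to evaluate numerically. The following sketch (an illustration added here, not part of the text) takes X binary with Pr{X=1} = 1/4 and Pr{X=−1} = 3/4 (so E[X] = −1/2), finds r_o and r* by bisection, and confirms that the exact tail probability lies below the Chernoff bound of (7.21), which in turn lies below the uniform bound exp(−r*α) of (7.22).

```python
import math

p, q = 0.25, 0.75          # Pr{X=1}, Pr{X=-1};  E[X] = -0.5 < 0

def gamma(r):
    # semi-invariant MGF  gamma(r) = ln E[e^{rX}]
    return math.log(p * math.exp(r) + q * math.exp(-r))

def gamma_prime(r):
    # derivative gamma'(r)
    num = p * math.exp(r) - q * math.exp(-r)
    return num / (p * math.exp(r) + q * math.exp(-r))

def bisect(f, lo, hi, tol=1e-12):
    # simple bisection; assumes f(lo) < 0 < f(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, alpha = 40, 10          # slope alpha/n = 0.25 < x_max = 1
r_o = bisect(lambda r: gamma_prime(r) - alpha / n, 0.0, 5.0)
r_star = bisect(gamma, 0.5, 5.0)   # positive root of gamma (the root at 0 is excluded)

chernoff = math.exp(n * gamma(r_o) - r_o * alpha)   # bound (7.21)
uniform = math.exp(-r_star * alpha)                 # bound (7.22)

# exact tail: S_n >= alpha iff the number k of +1 steps satisfies 2k - n >= alpha
k_min = (n + alpha + 1) // 2
exact = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(k_min, n + 1))

print(exact, chernoff, uniform)    # exact <= chernoff <= uniform
```

For this binary X, the root can also be found by hand: γ(r*) = 0 gives e^{r*} = q/p·... in fact r* = ln 3, which the bisection reproduces.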

Figure 7.5: Graphical minimization of γ(r) − (α/n)r for the case where r_+ < ∞.

As before, for any r < r_+, γ(r) − rα/n is found by drawing a line of slope α/n from the point (r, γ(r)) to the vertical axis. The minimum occurs either when the line of slope α/n is tangent to the curve or when it touches the curve at r = r_+. If we extend the definition of r* to be the supremum of r such that γ(r) ≤ 0, then Pr{S_n ≥ α} ≤ exp(−r*α) still holds for arbitrary α > 0, n ≥ 1.

The next section establishes Wald's identity, which shows, among other things, that if X̄ < 0, then exp(−r*α) is an upper bound (and a reasonable approximation) to the probability that the walk ever crosses a threshold at α > 0. Note that we have already found an upper bound to Pr{S_n ≥ α} for any α > 0, n ≥ 1, but this new result bounds Pr{⋃_n {S_n ≥ α}} for any α > 0. Both the threshold-crossing bounds in this section and Wald's identity in the next suggest that for large n or large α, the most important parameter of the IID rv's X making up the walk is the positive root r* of γ(r), rather than the mean, variance, or other moments of X. As a prelude to developing these large-deviation results about threshold crossings, we define stopping rules in a way that is both simpler and more general than the treatment in Chapter 3.

7.5 Thresholds, stopping rules, and Wald's identity

The following lemma shows that a random walk with two thresholds, say α > 0 and β < 0, eventually crosses one of the thresholds. Figure 7.6 illustrates two sample paths and how they cross thresholds. More specifically, the random walk first crosses a threshold at trial n if β < S_i < α for 1 ≤ i < n and either S_n ≥ α or S_n ≤ β. The lemma shows that this random number of trials N is finite with probability 1 (i.e., N is a rv) and that N has moments of all orders.

Lemma 7.1. Let {X_i; i ≥ 1} be IID and not identically 0. For each n ≥ 1, let S_n = X_1 + ··· + X_n.
Let α > 0 and β < 0 be arbitrary real numbers, and let N be the smallest n for which either S_n ≥ α or S_n ≤ β. Then N is a random variable (i.e., lim_{m→∞} Pr{N ≥ m} = 0) and N has finite moments of all orders.

Figure 7.6: Two sample paths of a random walk with two thresholds. In the first, the threshold at α is crossed at N = 5. In the second, the threshold at β is crossed at N = 4.

Proof: Since X is not identically 0, there is some n for which either Pr{S_n ≤ β − α} > 0 or Pr{S_n ≥ α − β} > 0. For any such n, let

ε = max[Pr{S_n ≤ β − α}, Pr{S_n ≥ α − β}].

For any integer k ≥ 1, given that N > n(k−1), and given any value of S_{n(k−1)} in (β, α), a threshold will be crossed by time nk with probability at least ε. Thus,

Pr{N > nk | N > n(k−1)} ≤ 1 − ε.

Iterating on k,

Pr{N > nk} ≤ (1 − ε)^k.

This shows that N is finite with probability 1 and that Pr{N ≥ j} goes to 0 at least geometrically in j. It follows that the moment generating function g_N(r) of N is finite in a region around r = 0, and that N has moments of all orders.

7.5.1 Stopping rules

In this section, we start with a definition of stopping rules that is more fundamental and quite different from that in Chapter 3. We then use this definition to establish Wald's identity, which is the basis for all of our subsequent results about random walks and threshold crossings.

First consider a simple example. Consider a sequence {X_n; n ≥ 1} of binary random variables taking on only the values ±1. Suppose we are interested in the first occurrence of the string (+1, −1), and we view this condition as a stopping rule. Figure 7.7 illustrates this stopped process by viewing it as the truncation of a tree of possible sequences. Aside from the complexity of the tree, the same approach can be taken when considering a random walk with a stopping rule that stops at the first trial at which the random walk reaches either α > 0 or β < 0. In this case also, the stopping node is the initial segment for which the first crossing occurs at the final trial of that segment.
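As a concrete illustration (added here, not part of the text), the stopping rule above can be simulated directly. For a fair ±1 sequence, the expected time to the first occurrence of the string (+1, −1) is the classical value E[N] = 4, and a seeded Monte Carlo run reproduces it.

```python
import random

random.seed(1)

def stop_time_plus_minus():
    """Trials until the string (+1, -1) first occurs in an IID fair +/-1 sequence."""
    n, prev = 0, 0
    while True:
        n += 1
        x = random.choice((1, -1))
        if prev == 1 and x == -1:   # stopping node reached
            return n
        prev = x

trials = 20000
mean_N = sum(stop_time_plus_minus() for _ in range(trials)) / trials
print(mean_N)   # sample mean; the exact value is E[N] = 4
```

Note that the stopping rule itself uses only the sample sequence, not the probability measure; the measure enters only when we ask about the distribution of N.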

Figure 7.7: A tree representing the collection of binary (+1, −1) sequences, with a stopping rule viewed as a pruning of the tree. The particular stopping rule here is to stop on the first occurrence of the string (+1, −1). The leaves of the tree (i.e., the nodes at which stopping occurs) are marked with large dots and the intermediate nodes (the other nodes) with small dots. Note that each leaf in the tree has a one-to-one correspondence with an initial segment of the tree, so the stopping nodes can be unambiguously viewed either as leaves of the tree or as initial segments of the sample sequences.

Note that in both of these examples, the stopping rule determines which initial segment of any given sample sequence satisfies the rule. The distribution of each X_n, and even whether or not the sequence is IID, is usually not relevant for defining these stopping rules. In other words, the conditions about statistical independence used in Chapter 3 for the indicator functions of stopping rules are quite unnatural for most applications. The essence of a stopping rule, however, is illustrated quite well in Figure 7.7. If one stops at some initial segment of a sample sequence, then one cannot stop again at some longer initial segment of the same sample sequence. This leads us to the following definitions of stopping nodes, stopping rules, and stopping times.

Definition 7.2 (Stopping nodes). Given a sequence {X_n; n ≥ 1} of rv's, a collection of stopping nodes is a collection of initial segments of the sample sequences of {X_n; n ≥ 1}. If an initial segment of one sequence is a stopping node, then it is a stopping node for all sequences with that same initial segment. Also, no stopping node can be an initial segment of any other stopping node.

This definition is less abstract when each X_n is discrete with a finite number, say m, of possible values.
In this case, as illustrated in Figure 7.7, the set of sequences is represented by a tree in which each node has one branch coming in from the root and m branches going out. Each stopping node corresponds to pruning the tree at that node. All the sequences with that given initial segment can then be ignored, since they all have that same initial segment, i.e., stopping node. In this sense, every pruning of the tree corresponds to a collection of stopping nodes.

In information theory, such a collection of stopping nodes is called a prefix-free source code. Each segment corresponding to a stopping node is used as a codeword for some given message. If a sequence of consecutive segments is transmitted, a receiver can parse the incoming letters into segments by using the fact that no stopping node is an initial segment of any other stopping node.

Definition 7.3 (Stopping rule and stopping time). A stopping rule for {X_n; n ≥ 1} is a rule that determines a collection of stopping nodes. A stopping time is a possibly defective rv whose value, for a sample sequence with a stopping node, is the length of the initial segment for that node. Its value, for a sample sequence with no stopping node, is infinite.

For most interesting stopping rules, sample sequences exist that have no stopping nodes. For the example of a random walk with two thresholds, there are many sequences that stay inside the thresholds forever. As shown by Lemma 7.1, however, this set of sequences has zero probability and thus the stopping time is a (non-defective) rv. We see from this that, although stopping rules are generally defined without the use of a probability measure, and the mapping from sample sequences to stopping nodes is similarly independent of the probability measure, the question of whether the stopping time is defective and whether it has moments is very dependent on the probability measure.

Theorem 7.2 (Wald's identity). Let {X_i; i ≥ 1} be IID and let γ(r) = ln{E[e^{rX}]} be the semi-invariant moment generating function of each X_i. Assume γ(r) is finite in an open interval (r_−, r_+) with r_− < 0 < r_+. For each n ≥ 1, let S_n = X_1 + ··· + X_n. Let α > 0 and β < 0 be arbitrary real numbers, and let N be the smallest n for which either S_n ≥ α or S_n ≤ β.
Then for all r ∈ (r_−, r_+),

E[exp(rS_N − Nγ(r))] = 1.    (7.24)

We first show how to use and interpret this theorem, and then prove it. The proof is quite simple, but will mean more after understanding the surprising power of this result. Wald's identity can be thought of as a generating-function form of Wald's equality as established in Theorem 3.3. First note that the trial N at which a threshold is crossed in the theorem is a stopping time in the terminology of Chapter 3. Also, if we take the derivative with respect to r of both sides of (7.24), we get

E[(S_N − Nγ′(r)) exp{rS_N − Nγ(r)}] = 0.

Setting r = 0 and recalling that γ(0) = 0 and γ′(0) = X̄, this becomes Wald's equality,

E[S_N] = E[N] X̄.    (7.25)

Note that this derivation of Wald's equality is restricted to a random walk with two thresholds (and this automatically satisfies the constraint in Wald's equality that E[N] < ∞). The result in Chapter 3 was more general, applying to any stopping time such that E[N] < ∞.
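Both (7.24) and (7.25) are easy to check by simulation. The sketch below (an added illustration with arbitrarily chosen parameters Pr{X=1} = 0.3, α = 5, β = −5) estimates E[S_N], E[N], and the left side of Wald's identity at a sample value of r.

```python
import math
import random

random.seed(2)
p = 0.3                      # Pr{X=1}; Pr{X=-1} = 0.7, so the mean is -0.4
alpha, beta = 5, -5

def run_walk():
    # run one two-threshold random walk; return (S_N, N)
    s, n = 0, 0
    while beta < s < alpha:
        s += 1 if random.random() < p else -1
        n += 1
    return s, n

samples = [run_walk() for _ in range(20000)]
mean_SN = sum(s for s, _ in samples) / len(samples)
mean_N = sum(n for _, n in samples) / len(samples)

xbar = 2 * p - 1             # E[X] = -0.4
print(mean_SN, mean_N * xbar)   # Wald's equality (7.25): these should agree

def gamma(r):
    return math.log(p * math.exp(r) + (1 - p) * math.exp(-r))

# Wald's identity (7.24) at a test point r = 0.1
r = 0.1
ident = sum(math.exp(r * s - n * gamma(r)) for s, n in samples) / len(samples)
print(ident)   # should be close to 1 (it is exactly 1 in expectation)
```

The identity holds for every r in (r_−, r_+); the single test point here is only a spot check.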

The second derivative of (7.24) with respect to r is

E[((S_N − Nγ′(r))² − Nγ″(r)) exp{rS_N − Nγ(r)}] = 0.

At r = 0, this is

E[S_N² − 2N S_N X̄ + N² X̄²] = E[N] σ_X².    (7.26)

This equation is often difficult to use because of the cross term between S_N and N, but its main application comes in the case where X̄ = 0. In this case, Wald's equality provides no information about E[N], but (7.26) simplifies to

E[S_N²] = E[N] σ_X².    (7.27)

Example (Simple random walks again). As an example, consider again the simple random walk with Pr{X=1} = Pr{X=−1} = 1/2, and assume that α > 0 and β < 0 are integers. Since S_n takes on only integer values and changes only by ±1, it takes on the value α or β before exceeding either of these values. Thus S_N = α or S_N = β. Let q_α denote Pr{S_N = α}. The expected value of S_N is then αq_α + β(1 − q_α). From Wald's equality, E[S_N] = 0, so

q_α = −β/(α − β);    1 − q_α = α/(α − β).    (7.28)

From (7.27),

E[N] σ_X² = E[S_N²] = α² q_α + β² (1 − q_α).    (7.29)

Using the value of q_α from (7.28) and recognizing that σ_X² = 1,

E[N] = −βα/σ_X² = −βα.    (7.30)

As a sanity check, note that if α and β are each multiplied by some large constant k, then E[N] increases by k². Since the variance of S_n is n, we would expect S_n to fluctuate with increasing n, with typical values growing as √n, and thus it is reasonable for the time to reach a threshold to increase with the square of the distance to the threshold. We also notice that if β is decreased toward −∞, while holding α constant, then q_α → 1 and E[N] → ∞, which helps explain the possibility of winning one coin with probability 1 in a coin-tossing game, assuming we have infinite capital to risk and infinite time to wait. For more general random walks with X̄ = 0, there is usually an overshoot when a threshold is crossed.
If the magnitudes of α and β are large relative to the range of X, however, it is often reasonable to ignore the overshoots, and then −βα/σ_X² becomes a good approximation to E[N]. If one wants to include the overshoot, then its effect must be taken into account in both (7.28) and (7.29). We next apply Wald's identity to upper bound Pr{S_N ≥ α} for the case where X̄ < 0.
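Before doing so, note that the closed-form answers (7.28) and (7.30) for the simple random walk are easy to confirm by simulation. The sketch below (an added illustration, with the arbitrary choice α = 4, β = −6, so q_α = 0.6 and E[N] = 24) estimates q_α and E[N] from seeded Monte Carlo runs.

```python
import random

random.seed(3)
alpha, beta = 4, -6      # integer thresholds for the fair simple random walk

def run_walk():
    # one fair +/-1 walk until a threshold is hit; return (S_N, N)
    s, n = 0, 0
    while beta < s < alpha:
        s += random.choice((1, -1))
        n += 1
    return s, n

trials = 20000
samples = [run_walk() for _ in range(trials)]
q_est = sum(s == alpha for s, _ in samples) / trials
N_est = sum(n for _, n in samples) / trials

q_exact = -beta / (alpha - beta)   # (7.28): 6/10 = 0.6
N_exact = -beta * alpha            # (7.30): 24
print(q_est, N_est)                # close to 0.6 and 24
```

Since the walk moves in unit steps, there is no overshoot here, so the agreement is limited only by sampling error.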

Corollary 7.1. Under the conditions of Theorem 7.2, assume that γ(r) has a root at r* > 0. Then

Pr{S_N ≥ α} ≤ exp(−r*α).    (7.31)

Proof: Wald's identity, with r = r*, reduces to E[exp(r*S_N)] = 1. We can express this as

Pr{S_N ≥ α} E[exp(r*S_N) | S_N ≥ α] + Pr{S_N ≤ β} E[exp(r*S_N) | S_N ≤ β] = 1.    (7.32)

Since the second term on the left is non-negative,

Pr{S_N ≥ α} E[exp(r*S_N) | S_N ≥ α] ≤ 1.    (7.33)

Given that S_N ≥ α, we see that exp(r*S_N) ≥ exp(r*α). Thus

Pr{S_N ≥ α} exp(r*α) ≤ 1,    (7.34)

which is equivalent to (7.31).

This bound is valid for all β < 0, and thus is also valid in the limit β → −∞ (see Exercise 7.12 for a more careful demonstration that (7.31) is valid without a lower threshold). Equation (7.31) is also valid for the case of Figure 7.5, where γ(r) < 0 for all r ∈ (0, r_+). The exponential bound in (7.22) shows that Pr{S_n ≥ α} ≤ exp(−r*α) for each n; (7.31) is stronger than this. It shows that Pr{⋃_n {S_n ≥ α}} ≤ exp(−r*α). This also holds in the limit β → −∞.

When Corollary 7.1 is applied to the G/G/1 queue in Theorem 7.1, (7.31) is referred to as the Kingman bound.

Corollary 7.2 (Kingman bound). Let {X_i; i ≥ 1} and {Y_i; i ≥ 0} be the interarrival intervals and service times of a G/G/1 queue that is empty at time 0 when customer 0 arrives. Let {U_i = Y_{i−1} − X_i; i ≥ 1}, and let γ(r) = ln{E[e^{rU}]} be the semi-invariant moment generating function of each U_i. Assume that γ(r) has a root at r* > 0. Then W_n, the waiting time of the nth arrival, and W, the steady-state waiting time, satisfy

Pr{W_n ≥ α} ≤ Pr{W ≥ α} ≤ exp(−r*α)    for all α > 0.    (7.35)

In most applications, a positive threshold crossing for a random walk with a negative drift corresponds to some exceptional, and usually undesirable, circumstance (for example an error in the hypothesis testing problem or an overflow in the G/G/1 queue).
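As an aside, the Kingman bound is simple to evaluate in special cases. The sketch below (an added illustration, not from the text) assumes exponential interarrival and service distributions with rates λ < μ, for which γ(r) = ln[μ/(μ−r)] + ln[λ/(λ+r)]; bisection then locates the root r*, which for this case works out to μ − λ.

```python
import math

lam, mu = 1.0, 2.0    # hypothetical choice: arrivals Exp(lam), services Exp(mu)

def gamma_U(r):
    # gamma(r) = ln E[e^{rU}] with U = Y - X, Y ~ Exp(mu), X ~ Exp(lam), independent;
    # finite for -lam < r < mu
    return math.log(mu / (mu - r)) + math.log(lam / (lam + r))

# bisection for the positive root r* of gamma_U; for these rates,
# gamma_U(0.1) < 0 < gamma_U(mu - 0.1)
lo, hi = 0.1, mu - 0.1
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if gamma_U(mid) < 0:
        lo = mid
    else:
        hi = mid
r_star = (lo + hi) / 2

def kingman(a):
    # bound (7.35) on Pr{W >= a}
    return math.exp(-r_star * a)

print(r_star)    # for this exponential case the root is mu - lam = 1.0
```

For this special case the exact steady-state tail is known to be (λ/μ)exp(−(μ−λ)α), so the bound is off only by the factor λ/μ.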
Thus an upper bound such as (7.31) provides an assurance of a certain level of performance and is often more useful than either an approximation or an exact expression that is very difficult to evaluate. For a random walk with X̄ > 0, the exceptional circumstance is Pr{S_N ≤ β}. This can be analyzed by changing the sign of X and β and using the results for a negative expected value. These exponential bounds do not work for X̄ = 0, and we will not analyze that case here.

Note that (7.31) is an upper bound because, first, the effect of the second threshold in (7.32) was set to 0, and, second, the overshoot in the threshold crossing at α was set to 0 in going from (7.33) to (7.34). It is easy to account for the second threshold by recognizing that Pr{S_N ≤ β} = 1 − Pr{S_N ≥ α}. Then (7.32) can be solved, getting

Pr{S_N ≥ α} = (1 − E[exp(r*S_N) | S_N ≤ β]) / (E[exp(r*S_N) | S_N ≥ α] − E[exp(r*S_N) | S_N ≤ β]).    (7.36)

Accounting for the overshoots is much more difficult. For the case of the simple random walk, overshoots never occur, since the random walk always changes in unit steps. Thus, for α and β integers, we have E[exp(r*S_N) | S_N ≤ β] = exp(r*β) and E[exp(r*S_N) | S_N ≥ α] = exp(r*α). Substituting this in (7.36) yields the exact solution

Pr{S_N ≥ α} = exp(−r*α)[1 − exp(r*β)] / (1 − exp[−r*(α − β)]).    (7.37)

Solving the equation γ(r*) = 0 for the simple random walk with probabilities p = Pr{X = 1} and q = Pr{X = −1} yields r* = ln(q/p). This is also valid if X takes on the three values −1, 0, and +1 with p = Pr{X = 1}, q = Pr{X = −1}, and 1 − p − q = Pr{X = 0}. It can be seen that if α and −β are large positive integers, then the simple bound of (7.31) is almost exact for this example. Equation (7.37) is sometimes taken as an approximation for (7.36). Unfortunately, for many applications, the overshoots are more significant than the effect of the opposite threshold, so that (7.37) is only negligibly better than (7.31) as an approximation, and has the disadvantage of not being a bound. If Pr{S_N ≥ α} must actually be calculated, then the overshoots in (7.36) must be taken into account. See Chapter 12 of [9] for a treatment of overshoots.

7.5.2 Joint distribution of N and barrier

Next we look at Pr{N ≥ n, S_N ≥ α}, where again we assume that X̄ < 0 and that γ(r*) = 0 for some r* > 0. For any r in the region where γ(r) ≤ 0 (i.e., for 0 ≤ r ≤ r*), we have −Nγ(r) ≥ −nγ(r) for N ≥ n.
Thus, from the Wald identity, we have

1 ≥ E[exp(rS_N − Nγ(r)) | N ≥ n, S_N ≥ α] Pr{N ≥ n, S_N ≥ α}
  ≥ exp[rα − nγ(r)] Pr{N ≥ n, S_N ≥ α},

so that

Pr{N ≥ n, S_N ≥ α} ≤ exp[−rα + nγ(r)]    for all r such that 0 ≤ r ≤ r*.    (7.38)

Under our assumption that X̄ < 0, we have γ(r) ≤ 0 in the range 0 ≤ r ≤ r*, and (7.38) is valid for all r in this range. To obtain the tightest bound of this form, we should minimize the right-hand side of (7.38). This is the same minimization (except for the constraint r ≤ r*) as in Figure 7.4, and the result, if α/n ≤ γ′(r*), is

Pr{N ≥ n, S_N ≥ α} ≤ exp[−r_o α + nγ(r_o)],    (7.39)

where r_o satisfies γ′(r_o) = α/n. This is the same as the bound on Pr{S_n ≥ α} in (7.20), except that r_o ≤ r* in (7.39). For the special case described in Figure 7.5 where γ(r) < 0 for all r < r_+, (7.39) is modified in the same way as used in (7.23).
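Returning to the simple random walk, the exact solution (7.37) is easy to check numerically. The sketch below (an added illustration, with the arbitrary choice p = 0.4, α = 3, β = −3) compares (7.37) against a seeded Monte Carlo estimate and against the simpler bound (7.31).

```python
import math
import random

random.seed(4)
p, q = 0.4, 0.6          # Pr{X=1}, Pr{X=-1};  r* = ln(q/p)
alpha, beta = 3, -3
r_star = math.log(q / p)

# exact two-threshold crossing probability (7.37) for the simple random walk
exact = (math.exp(-r_star * alpha) * (1 - math.exp(r_star * beta))
         / (1 - math.exp(-r_star * (alpha - beta))))

# the simpler single-threshold bound (7.31)
bound = math.exp(-r_star * alpha)

def crosses_alpha():
    # run one walk; True if the upper threshold is the one crossed
    s = 0
    while beta < s < alpha:
        s += 1 if random.random() < p else -1
    return s == alpha

trials = 20000
est = sum(crosses_alpha() for _ in range(trials)) / trials
print(exact, est, bound)   # exact and est agree to within sampling error
```

For these small thresholds the gap between (7.37) and (7.31) is visible; as α and −β grow, the two coalesce, as noted above.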

DISCRETE STOCHASTIC PROCESSES, Draft of 2nd Edition, R. G. Gallager, January 31, 2011.


More information

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers ALGEBRA CHRISTIAN REMLING 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers by Z = {..., 2, 1, 0, 1,...}. Given a, b Z, we write a b if b = ac for some

More information

IEOR 6711, HMWK 5, Professor Sigman

IEOR 6711, HMWK 5, Professor Sigman IEOR 6711, HMWK 5, Professor Sigman 1. Semi-Markov processes: Consider an irreducible positive recurrent discrete-time Markov chain {X n } with transition matrix P (P i,j ), i, j S, and finite state space.

More information

Sequential Decisions

Sequential Decisions Sequential Decisions A Basic Theorem of (Bayesian) Expected Utility Theory: If you can postpone a terminal decision in order to observe, cost free, an experiment whose outcome might change your terminal

More information

Polynomial Expressions and Functions

Polynomial Expressions and Functions Hartfield College Algebra (Version 2017a - Thomas Hartfield) Unit FOUR Page - 1 - of 36 Topic 32: Polynomial Expressions and Functions Recall the definitions of polynomials and terms. Definition: A polynomial

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

BRANCHING PROCESSES 1. GALTON-WATSON PROCESSES

BRANCHING PROCESSES 1. GALTON-WATSON PROCESSES BRANCHING PROCESSES 1. GALTON-WATSON PROCESSES Galton-Watson processes were introduced by Francis Galton in 1889 as a simple mathematical model for the propagation of family names. They were reinvented

More information

V. Graph Sketching and Max-Min Problems

V. Graph Sketching and Max-Min Problems V. Graph Sketching and Max-Min Problems The signs of the first and second derivatives of a function tell us something about the shape of its graph. In this chapter we learn how to find that information.

More information

Proof Techniques (Review of Math 271)

Proof Techniques (Review of Math 271) Chapter 2 Proof Techniques (Review of Math 271) 2.1 Overview This chapter reviews proof techniques that were probably introduced in Math 271 and that may also have been used in a different way in Phil

More information

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16

EE5139R: Problem Set 4 Assigned: 31/08/16, Due: 07/09/16 EE539R: Problem Set 4 Assigned: 3/08/6, Due: 07/09/6. Cover and Thomas: Problem 3.5 Sets defined by probabilities: Define the set C n (t = {x n : P X n(x n 2 nt } (a We have = P X n(x n P X n(x n 2 nt

More information

The Liapunov Method for Determining Stability (DRAFT)

The Liapunov Method for Determining Stability (DRAFT) 44 The Liapunov Method for Determining Stability (DRAFT) 44.1 The Liapunov Method, Naively Developed In the last chapter, we discussed describing trajectories of a 2 2 autonomous system x = F(x) as level

More information

Chapter 3 Representations of a Linear Relation

Chapter 3 Representations of a Linear Relation Chapter 3 Representations of a Linear Relation The purpose of this chapter is to develop fluency in the ways of representing a linear relation, and in extracting information from these representations.

More information

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the

More information

1 Gambler s Ruin Problem

1 Gambler s Ruin Problem Coyright c 2017 by Karl Sigman 1 Gambler s Ruin Problem Let N 2 be an integer and let 1 i N 1. Consider a gambler who starts with an initial fortune of $i and then on each successive gamble either wins

More information

2905 Queueing Theory and Simulation PART III: HIGHER DIMENSIONAL AND NON-MARKOVIAN QUEUES

2905 Queueing Theory and Simulation PART III: HIGHER DIMENSIONAL AND NON-MARKOVIAN QUEUES 295 Queueing Theory and Simulation PART III: HIGHER DIMENSIONAL AND NON-MARKOVIAN QUEUES 16 Queueing Systems with Two Types of Customers In this section, we discuss queueing systems with two types of customers.

More information

Queueing Theory and Simulation. Introduction

Queueing Theory and Simulation. Introduction Queueing Theory and Simulation Based on the slides of Dr. Dharma P. Agrawal, University of Cincinnati and Dr. Hiroyuki Ohsaki Graduate School of Information Science & Technology, Osaka University, Japan

More information

Stochastic process. X, a series of random variables indexed by t

Stochastic process. X, a series of random variables indexed by t Stochastic process X, a series of random variables indexed by t X={X(t), t 0} is a continuous time stochastic process X={X(t), t=0,1, } is a discrete time stochastic process X(t) is the state at time t,

More information

Chapter 3 Representations of a Linear Relation

Chapter 3 Representations of a Linear Relation Chapter 3 Representations of a Linear Relation The purpose of this chapter is to develop fluency in the ways of representing a linear relation, and in extracting information from these representations.

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

An analogy from Calculus: limits

An analogy from Calculus: limits COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm

More information

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 7 02 December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about Two-Player zero-sum games (min-max theorem) Mixed

More information

Class 11 Non-Parametric Models of a Service System; GI/GI/1, GI/GI/n: Exact & Approximate Analysis.

Class 11 Non-Parametric Models of a Service System; GI/GI/1, GI/GI/n: Exact & Approximate Analysis. Service Engineering Class 11 Non-Parametric Models of a Service System; GI/GI/1, GI/GI/n: Exact & Approximate Analysis. G/G/1 Queue: Virtual Waiting Time (Unfinished Work). GI/GI/1: Lindley s Equations

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

CIS 2033 Lecture 5, Fall

CIS 2033 Lecture 5, Fall CIS 2033 Lecture 5, Fall 2016 1 Instructor: David Dobor September 13, 2016 1 Supplemental reading from Dekking s textbook: Chapter2, 3. We mentioned at the beginning of this class that calculus was a prerequisite

More information

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for

MARKOV CHAINS A finite state Markov chain is a sequence of discrete cv s from a finite alphabet where is a pmf on and for MARKOV CHAINS A finite state Markov chain is a sequence S 0,S 1,... of discrete cv s from a finite alphabet S where q 0 (s) is a pmf on S 0 and for n 1, Q(s s ) = Pr(S n =s S n 1 =s ) = Pr(S n =s S n 1

More information

STAT2201. Analysis of Engineering & Scientific Data. Unit 3

STAT2201. Analysis of Engineering & Scientific Data. Unit 3 STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random

More information

RENEWAL THEORY STEVEN P. LALLEY UNIVERSITY OF CHICAGO. X i

RENEWAL THEORY STEVEN P. LALLEY UNIVERSITY OF CHICAGO. X i RENEWAL THEORY STEVEN P. LALLEY UNIVERSITY OF CHICAGO 1. RENEWAL PROCESSES A renewal process is the increasing sequence of random nonnegative numbers S 0,S 1,S 2,... gotten by adding i.i.d. positive random

More information

A little context This paper is concerned with finite automata from the experimental point of view. The behavior of these machines is strictly determin

A little context This paper is concerned with finite automata from the experimental point of view. The behavior of these machines is strictly determin Computability and Probabilistic machines K. de Leeuw, E. F. Moore, C. E. Shannon and N. Shapiro in Automata Studies, Shannon, C. E. and McCarthy, J. Eds. Annals of Mathematics Studies, Princeton University

More information

Parameter estimation Conditional risk

Parameter estimation Conditional risk Parameter estimation Conditional risk Formalizing the problem Specify random variables we care about e.g., Commute Time e.g., Heights of buildings in a city We might then pick a particular distribution

More information

West Windsor-Plainsboro Regional School District Algebra Grade 8

West Windsor-Plainsboro Regional School District Algebra Grade 8 West Windsor-Plainsboro Regional School District Algebra Grade 8 Content Area: Mathematics Unit 1: Foundations of Algebra This unit involves the study of real numbers and the language of algebra. Using

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

2.2 Some Consequences of the Completeness Axiom

2.2 Some Consequences of the Completeness Axiom 60 CHAPTER 2. IMPORTANT PROPERTIES OF R 2.2 Some Consequences of the Completeness Axiom In this section, we use the fact that R is complete to establish some important results. First, we will prove that

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

5 + 9(10) + 3(100) + 0(1000) + 2(10000) =

5 + 9(10) + 3(100) + 0(1000) + 2(10000) = Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

1 Introduction (January 21)

1 Introduction (January 21) CS 97: Concrete Models of Computation Spring Introduction (January ). Deterministic Complexity Consider a monotonically nondecreasing function f : {,,..., n} {, }, where f() = and f(n) =. We call f a step

More information

Information measures in simple coding problems

Information measures in simple coding problems Part I Information measures in simple coding problems in this web service in this web service Source coding and hypothesis testing; information measures A(discrete)source is a sequence {X i } i= of random

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

May 2015 Timezone 2 IB Maths Standard Exam Worked Solutions

May 2015 Timezone 2 IB Maths Standard Exam Worked Solutions May 015 Timezone IB Maths Standard Exam Worked Solutions 015, Steve Muench steve.muench@gmail.com @stevemuench Please feel free to share the link to these solutions http://bit.ly/ib-sl-maths-may-015-tz

More information

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk

Theory and Applications of Stochastic Systems Lecture Exponential Martingale for Random Walk Instructor: Victor F. Araman December 4, 2003 Theory and Applications of Stochastic Systems Lecture 0 B60.432.0 Exponential Martingale for Random Walk Let (S n : n 0) be a random walk with i.i.d. increments

More information

Tail Inequalities Randomized Algorithms. Sariel Har-Peled. December 20, 2002

Tail Inequalities Randomized Algorithms. Sariel Har-Peled. December 20, 2002 Tail Inequalities 497 - Randomized Algorithms Sariel Har-Peled December 0, 00 Wir mssen wissen, wir werden wissen (We must know, we shall know) David Hilbert 1 Tail Inequalities 1.1 The Chernoff Bound

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Discrete-event simulations

Discrete-event simulations Discrete-event simulations Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/elt-53606/ OUTLINE: Why do we need simulations? Step-by-step simulations; Classifications;

More information

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory

Bayesian decision theory Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Bayesian decision theory 8001652 Introduction to Pattern Recognition. Lectures 4 and 5: Bayesian decision theory Jussi Tohka jussi.tohka@tut.fi Institute of Signal Processing Tampere University of Technology

More information

Lecture - 30 Stationary Processes

Lecture - 30 Stationary Processes Probability and Random Variables Prof. M. Chakraborty Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 30 Stationary Processes So,

More information

Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk

Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk Stability and Rare Events in Stochastic Models Sergey Foss Heriot-Watt University, Edinburgh and Institute of Mathematics, Novosibirsk ANSAPW University of Queensland 8-11 July, 2013 1 Outline (I) Fluid

More information

Discrete Event Systems Exam

Discrete Event Systems Exam Computer Engineering and Networks Laboratory TEC, NSG, DISCO HS 2016 Prof. L. Thiele, Prof. L. Vanbever, Prof. R. Wattenhofer Discrete Event Systems Exam Friday, 3 rd February 2017, 14:00 16:00. Do not

More information

The Derivative of a Function Measuring Rates of Change of a function. Secant line. f(x) f(x 0 ) Average rate of change of with respect to over,

The Derivative of a Function Measuring Rates of Change of a function. Secant line. f(x) f(x 0 ) Average rate of change of with respect to over, The Derivative of a Function Measuring Rates of Change of a function y f(x) f(x 0 ) P Q Secant line x 0 x x Average rate of change of with respect to over, " " " " - Slope of secant line through, and,

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information