Random Walks

Suppose $X_0$ is fixed at some value, and $X_1, X_2, \ldots$ are iid Bernoulli trials with $P(X_i = 1) = p$ and $P(X_i = -1) = q = 1 - p$. Let $Z_n = X_0 + X_1 + \cdots + X_n$ (so $Z_0 = X_0$). The sequence $Z_n$ is called a simple random walk. Geometrically, $Z_n$ can be viewed as determining a path in the integer lattice from $(0, Z_0)$ to $(n, Z_n)$. The path is continuous, and is formed entirely from diagonal segments of the form $(a, b) \to (a+1, b+1)$ or $(a, b) \to (a+1, b-1)$. A random walk is a Markov chain, with transition probabilities

$$P(Z_n = a \mid Z_{n-1} = b) = p\,I(a = b+1) + q\,I(a = b-1).$$

Suppose we are interested in a hitting time $T_a = \inf\{n : Z_n = a\}$ (i.e. the first time that the path $(n, Z_n)$ hits the horizontal line $y = a$). Given two integers $a < b$, a natural question is to determine the probability $P(T_b < T_a \mid Z_0 = z)$ (i.e. the probability that $b$ is reached before $a$, starting from a specified point $z$). The interesting case is when $a \le z \le b$.
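To make this concrete, here is a minimal simulation sketch (an illustration added here, not part of the derivation); the values $p = 0.4$, $z = 2$, $a = 0$, $b = 5$ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def first_hit(z0, a, b, p):
    """Run a simple random walk from z0 until it hits a or b; return the barrier hit."""
    z = z0
    while a < z < b:
        z += 1 if rng.random() < p else -1
    return z

# Hypothetical parameters: p = 0.4, start at z = 2, barriers a = 0 and b = 5.
hits = np.array([first_hit(2, 0, 5, 0.4) for _ in range(20_000)])
print("estimated P(T_b < T_a | Z_0 = 2):", np.mean(hits == 5))
```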
Conditioning on the first step,

$$\begin{aligned}
P(T_b < T_a \mid Z_0 = z) &= P(T_b < T_a, X_1 = 1 \mid Z_0 = z) + P(T_b < T_a, X_1 = -1 \mid Z_0 = z)\\
&= P(T_b < T_a \mid Z_0 = z, X_1 = 1)P(X_1 = 1) + P(T_b < T_a \mid Z_0 = z, X_1 = -1)P(X_1 = -1)\\
&= p\,P(T_b < T_a \mid Z_0 = z, X_1 = 1) + q\,P(T_b < T_a \mid Z_0 = z, X_1 = -1)\\
&= p\,P(T_b < T_a \mid Z_1 = z+1) + q\,P(T_b < T_a \mid Z_1 = z-1)\\
&= p\,P(T_b < T_a \mid Z_0 = z+1) + q\,P(T_b < T_a \mid Z_0 = z-1).
\end{aligned}$$

If we let $W_z = P(T_b < T_a \mid Z_0 = z)$, then we see that the $W_z$ satisfy the recurrence relation

$$W_z = pW_{z+1} + qW_{z-1},$$

subject to the boundary conditions $W_a = 0$, $W_b = 1$.
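Note that for fixed $a$ and $b$ the recurrence plus boundary conditions is just a linear system in the unknowns $W_{a+1}, \ldots, W_{b-1}$, so it can be solved numerically as a check on the analytic solution derived below; a sketch with the same hypothetical parameters:

```python
import numpy as np

# Solve W_z = p W_{z+1} + q W_{z-1} with W_a = 0, W_b = 1 as a linear system
# over the interior points z = a+1, ..., b-1 (hypothetical p, a, b).
p, a, b = 0.4, 0, 5
q = 1 - p
n = b - a - 1
A_mat = np.eye(n)
rhs = np.zeros(n)
for i in range(n):
    if i + 1 < n:
        A_mat[i, i + 1] = -p      # -p W_{z+1}
    else:
        rhs[i] = p                # boundary term from W_b = 1
    if i - 1 >= 0:
        A_mat[i, i - 1] = -q      # -q W_{z-1}; boundary W_a = 0 contributes nothing
W = np.linalg.solve(A_mat, rhs)
print({z: round(w, 4) for z, w in zip(range(a + 1, b), W)})
```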
Difference equations often have solutions of the form $W_z = \exp(cz)$. Moreover, any solution is a linear combination of solutions of this form (and conversely, any linear combination of solutions of this form is also a solution). In the current case, we would have

$$\exp(cz) = p\exp(c(z+1)) + q\exp(c(z-1)),$$

which is equivalent to

$$1 = p\exp(c) + q\exp(-c),$$

or

$$\exp(c) = p\exp(2c) + q.$$

This is quadratic in $\exp(c)$, so we can solve it explicitly:

$$\exp(c) = \frac{1 \pm \sqrt{1 - 4p(1-p)}}{2p} = \frac{1 \pm (2p - 1)}{2p},$$

and we get $\exp(c) = 1$ or $\exp(c) = q/p$ (these are distinct as long as $p \neq 1/2$). This gives $c = 0$ or $c = \log(q/p)$, which in turn gives $W_z = 1$ or $W_z = \exp(z\log(q/p))$. A general solution is of the form $W_z = C_1 + C_2\exp(z\log(q/p))$, and using the boundary conditions $W_a = 0$ and $W_b = 1$, we get the unique solution

$$W_z = \frac{\exp(z\log(q/p)) - \exp(a\log(q/p))}{\exp(b\log(q/p)) - \exp(a\log(q/p))}.$$
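A quick sketch of this closed form (assuming $p \neq 1/2$), which should agree with the numerical and Monte Carlo values above:

```python
import numpy as np

def W_closed(z, a, b, p):
    """Closed-form W_z for a biased simple walk; assumes p != 1/2 so q/p != 1."""
    r = (1 - p) / p                        # q/p = exp(log(q/p))
    return (r**z - r**a) / (r**b - r**a)

print(W_closed(2, 0, 5, 0.4))   # should match the linear-system and Monte Carlo values
```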
Now extend the definition of the hitting time so that $T_{a,b} = \inf\{n : Z_n = a \text{ or } Z_n = b\}$. We would like to find the expectation $E(T_{a,b} \mid Z_0 = z)$. Denote this value $M_z$. Conditioning on the first step as before,

$$\begin{aligned}
M_z &= E(T_{a,b} \mid X_1 = 1)P(X_1 = 1) + E(T_{a,b} \mid X_1 = -1)P(X_1 = -1)\\
&= p(1 + M_{z+1}) + q(1 + M_{z-1})\\
&= 1 + pM_{z+1} + qM_{z-1}.
\end{aligned}$$

This equation is inhomogeneous because of the unit additive constant. The associated homogeneous recurrence is $M_z = pM_{z+1} + qM_{z-1}$. It is easy to verify that the sum of a solution to the inhomogeneous recurrence and any linear combination of solutions to the associated homogeneous recurrence is also a solution to the inhomogeneous recurrence. It is also easy to verify that $z/(q-p)$ is a particular solution to the inhomogeneous recurrence: writing the recurrence as $M_z - 1 = pM_{z+1} + qM_{z-1}$, the left side is $z/(q-p) - 1$, and the right side is

$$p(z+1)/(q-p) + q(z-1)/(q-p) = (pz + p + qz - q)/(q-p) = (z + p - q)/(q-p) = z/(q-p) - 1.$$

The homogeneous recurrence is the same as the recurrence from the previous problem (in different notation). Therefore its general solution has the form $C_1 + C_2\exp(z\log(q/p))$, hence a general solution to the inhomogeneous recurrence has the form $z/(q-p) + C_1 + C_2\exp(z\log(q/p))$. The boundary conditions are $M_a = M_b = 0$, which leads to the solution

$$M_z = \frac{z-a}{q-p} - \frac{(b-a)\exp(z\log(q/p)) - (b-a)\exp(a\log(q/p))}{(q-p)\exp(b\log(q/p)) - (q-p)\exp(a\log(q/p))}.$$

This can be re-written as

$$M_z = \frac{W_z(b - z) + (1 - W_z)(a - z)}{p - q}.$$
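As a check, here is a sketch comparing the closed form for $M_z$ with a Monte Carlo estimate (hypothetical parameters, $p \neq 1/2$):

```python
import numpy as np

rng = np.random.default_rng(1)

def M_closed(z, a, b, p):
    """Closed-form M_z = (W_z (b - z) + (1 - W_z)(a - z)) / (p - q); assumes p != 1/2."""
    q = 1 - p
    r = q / p
    Wz = (r**z - r**a) / (r**b - r**a)
    return (Wz * (b - z) + (1 - Wz) * (a - z)) / (p - q)

def sample_T(z, a, b, p):
    t = 0
    while a < z < b:
        z += 1 if rng.random() < p else -1
        t += 1
    return t

print(M_closed(2, 0, 5, 0.4))
print(np.mean([sample_T(2, 0, 5, 0.4) for _ in range(20_000)]))
```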
Moment generating functions (MGFs) are another way to derive this type of result. The moment generating function of a random variable $Z$ is defined to be the expectation $M(\theta) = E\exp(\theta Z)$. This is equal to $\int \exp(\theta z)\pi(z)\,dz$ or $\sum_z \exp(\theta z)\pi(z)$, depending on whether $Z$ is continuous or discrete. A key property of the MGF is that (under fairly general conditions)

$$d^k M(\theta)/d\theta^k = E\,Z^k\exp(\theta Z),$$

hence the $k$-th derivative evaluated at $\theta = 0$ is $EZ^k$, the $k$-th moment of $Z$. Two special cases are $k = 0$, giving $M(0) = 1$, and $k = 1$, giving $M'(0) = EZ$.

Another important property of the MGF is that if $X$ and $Y$ are independent with MGFs $M_X(\theta)$ and $M_Y(\theta)$, and $Z = X + Y$, then $M_Z(\theta) = M_X(\theta)M_Y(\theta)$. In particular, if $X_1, X_2, \ldots$ are iid and $Z_n = \sum_{i=1}^n X_i$, then $M_{Z_n}(\theta) = M_X(\theta)^n$.
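The product property is easy to verify by simulation; a minimal sketch for the $\pm 1$ step with hypothetical $p$ and $\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, theta = 0.4, 0.3
x = rng.choice([1, -1], p=[p, 1 - p], size=(200_000, 2))
# The MGF of X1 + X2 should equal the square of the one-step MGF.
print(np.mean(np.exp(theta * x.sum(axis=1))))
print((p * np.exp(theta) + (1 - p) * np.exp(-theta)) ** 2)
```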
Suppose that $Z$ is discrete, $EZ \neq 0$, and there exist $a < 0$ and $b > 0$ such that $P(Z = a) > 0$ and $P(Z = b) > 0$. We will prove that there is a unique value $\theta^*$ distinct from $0$ such that $M(\theta^*) = 1$.

The moment generating function is bounded below by $P(Z = a)\exp(\theta a)$ and by $P(Z = b)\exp(\theta b)$. Therefore as $\theta$ approaches $\pm\infty$, $M(\theta)$ approaches $+\infty$. The second derivative of the MGF is $E\,Z^2\exp(\theta Z)$, which is positive, hence $M(\theta)$ is convex. Since $M'(0) = EZ \neq 0$, $M(\theta)$ is either strictly increasing or strictly decreasing as it passes through $\theta = 0$, where $M(0) = 1$. By continuity, and by the limits at $\pm\infty$, there must be a second solution to $M(\theta) = 1$. By convexity, this solution is unique.

Review

$X_1, X_2, \ldots$ iid, $P(X_i = 1) = p$, $P(X_i = -1) = q = 1 - p$. The process $Z_n = X_0 + X_1 + \cdots + X_n$ is a random walk, where $X_0$ is a constant. $T_a = \inf\{n : Z_n = a\}$ is the random number of steps before $Z_n$ hits $a$, and $T_{a,b} = \inf\{n : Z_n = a \text{ or } Z_n = b\}$. We derived formulas for $M_z \equiv E(T_{a,b} \mid X_0 = z)$ and $W_z \equiv P(T_b < T_a \mid X_0 = z)$ using difference equations.

The moment generating function (MGF) for a random variable $X$ is $E\exp(\theta X)$. The MGF has three key properties:

1. If $M_X^{(k)}(\theta) \equiv d^k M_X(\theta)/d\theta^k$, then $M_X^{(k)}(0) = EX^k$.

2. If $A$ and $B$ are independent, then $M_{A+B}(\theta) = M_A(\theta)M_B(\theta)$.

3. If $X$ is (i) discrete, (ii) has positive probability of being both positive and negative, and (iii) has nonzero mean, then the MGF $M_X(\theta)$ crosses $1$ at exactly two distinct points. One of the points is $\theta = 0$; the other is a point $\theta^* \neq 0$.
Continuation

The MGF for each $X_i$ is $M_X(\theta) = p\exp(\theta) + q\exp(-\theta)$. Since $X_i$ satisfies conditions (i), (ii), and (iii) in 3 above, there exists $\theta^* \neq 0$ such that $M_X(\theta^*) = 1$. Solving directly yields $\theta^* = \log(q/p)$.

The random variable $Z_n - X_0$ is the signed net vertical displacement of the random walk after $n$ steps. The MGF for $Z_n - X_0$ is $M_{Z_n - X_0}(\theta) = (p\exp(\theta) + q\exp(-\theta))^n$, and $M_{Z_n - X_0}(\theta^*) = 1$.

Now consider $Z_{T_{a,b}} - X_0$. This is the net vertical displacement of the random walk at the point when it has first hit $a$ or $b$. Therefore it can only take on two values, $b - X_0$ (with probability $W_z$) and $a - X_0$ (with probability $1 - W_z$). Now consider the MGF of $Z_{T_{a,b}} - X_0$. This looks like the MGF of $Z_n - X_0$ with $n = T_{a,b}$, but it is actually quite different, because $T_{a,b}$ is a random variable, and the final expression for the MGF should not depend on it. Therefore the MGF of $Z_{T_{a,b}} - X_0$ is different from $(p\exp(\theta) + q\exp(-\theta))^{T_{a,b}}$. It is true, however, that

$$M_{Z_{T_{a,b}} - X_0}(\theta) = E\,(p\exp(\theta) + q\exp(-\theta))^{T_{a,b}}.$$

Therefore, even though we do not yet have the MGF of $Z_{T_{a,b}} - X_0$, we do know that it equals $1$ when evaluated at $\theta = \theta^*$, since $p\exp(\theta^*) + q\exp(-\theta^*) = 1$.
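When no closed form for $\theta^*$ is available (as in the generalization below), the crossing point can be found numerically; a sketch using SciPy's brentq root finder, with hypothetical $p$:

```python
import numpy as np
from scipy.optimize import brentq

p = 0.4
q = 1 - p
M = lambda t: p * np.exp(t) + q * np.exp(-t)
# For p < 1/2, M dips below 1 just to the right of 0 and grows without bound,
# so the nonzero crossing theta* > 0 can be bracketed and found numerically.
theta_star = brentq(lambda t: M(t) - 1.0, 1e-6, 50.0)
print(theta_star, np.log(q / p))   # the two values should agree
```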
Since $Z_{T_{a,b}} - X_0$ can take on only $b - X_0$ (with probability $W_z$) or $a - X_0$ (with probability $1 - W_z$), the MGF of $Z_{T_{a,b}} - X_0$ satisfies

$$M_{Z_{T_{a,b}} - X_0}(\theta) = W_z\exp(\theta(b - X_0)) + (1 - W_z)\exp(\theta(a - X_0)).$$

This expression must evaluate to $1$ at $\theta^* = \log(q/p)$. Therefore we can solve for $W_z$:

$$W_z = \frac{\exp(z\theta^*) - \exp(a\theta^*)}{\exp(b\theta^*) - \exp(a\theta^*)}.$$

Wald's Identity:

$$E\left[\frac{\exp(\theta(Z_{T_{a,b}} - X_0))}{M_X(\theta)^{T_{a,b}}}\right]
= E_{T_{a,b}}\,E\left[\frac{\exp(\theta(Z_{T_{a,b}} - X_0))}{M_X(\theta)^{T_{a,b}}}\,\Big|\,T_{a,b}\right]
= E_{T_{a,b}}\left[\frac{M_X(\theta)^{T_{a,b}}}{M_X(\theta)^{T_{a,b}}}\right] = 1.$$

Differentiate Wald's identity with respect to $\theta$ to get that the following is equal to $0$:

$$E\left[M_X(\theta)^{-T_{a,b}}(Z_{T_{a,b}} - X_0)\exp(\theta(Z_{T_{a,b}} - X_0)) - T_{a,b}\,M_X(\theta)^{-T_{a,b}-1}M_X'(\theta)\exp(\theta(Z_{T_{a,b}} - X_0))\right].$$

Evaluating at $\theta = 0$ yields

$$E(Z_{T_{a,b}} - X_0) - E(T_{a,b})\,EX_i = 0,$$

from which we conclude $E(Z_{T_{a,b}} - X_0) = E(T_{a,b})\,EX_i$.
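Both consequences of Wald's identity can be checked by simulation; a sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
p, z, a, b = 0.4, 2, 0, 5
q = 1 - p
theta_star = np.log(q / p)

def run(z):
    t = 0
    while a < z < b:
        z += 1 if rng.random() < p else -1
        t += 1
    return z, t

ends, times = np.array([run(z) for _ in range(50_000)]).T
print(np.mean(np.exp(theta_star * (ends - z))))          # should be close to 1
print(np.mean(ends - z), np.mean(times) * (p - q))       # equal by Wald's identity
```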
We know that

$$E(Z_{T_{a,b}} - X_0) = W_z(b - X_0) + (1 - W_z)(a - X_0),$$

and that $EX_i = p - q$. Therefore

$$M_z = E(T_{a,b}) = \frac{W_z(b - X_0) + (1 - W_z)(a - X_0)}{p - q}.$$

Connection to a simple sequence alignment model: Suppose that we are aligning two random sequences against each other without gaps, and the scoring model gives $+1$ for a match and $-1$ for a mismatch. Then the score through position $n$ is a random walk in which $p$ is the null probability of a match (which will ordinarily be less than $1/2$). Suppose we choose as our alignment score the highest score that is ever reached before the first time that we reach a negative score. We would like to know the null distribution of this alignment score. In the notation used earlier, $z = 0$, since we always start at a score of $0$, and $a = -1$, since we always stop when the score becomes negative. The probability that the maximal score is at least $b$ is $P(T_b < T_{-1})$, which is $W_0$ as computed above. Specifically, it is

$$W_0 = \frac{1 - \exp(-\theta^*)}{\exp(b\theta^*) - \exp(-\theta^*)}.$$
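A sketch checking this formula against simulation, with a hypothetical match probability $p = 0.25$:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.25                     # hypothetical null match probability
theta_star = np.log((1 - p) / p)
b = 4

def hits_b_first():
    z = 0
    while -1 < z < b:
        z += 1 if rng.random() < p else -1
    return z == b

sim = np.mean([hits_b_first() for _ in range(100_000)])
exact = (1 - np.exp(-theta_star)) / (np.exp(b * theta_star) - np.exp(-theta_star))
print(sim, exact)
```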
Approximating further, $\exp(b\theta^*)$ typically dominates $\exp(-\theta^*)$, so we get

$$P(S_{\max} \ge b) \approx C\exp(-\theta^* b),$$

which is a geometric-type tail probability.

We now want to generalize beyond the simple random walk. Suppose that each $X_i$ has support contained in $\{-c, -c+1, \ldots, d\}$, where $c$ and $d$ are positive integers, and let $p_{-c}, \ldots, p_d$ denote the probabilities with which $X_i$ takes on each of these values. The MGF of $X_i$ is $m_X(\theta) = \sum_j p_j\exp(j\theta)$. We assume that $p_{-c}$ and $p_d$ are positive, that $EX_i = \sum_j jp_j < 0$, and that no integer greater than $1$ divides every step size $j$ with $p_j \neq 0$. Under these conditions, there exists $\theta^* > 0$ such that $m_X(\theta^*) = 1$.

We now generalize $T_{a,b}$ for $a < b$ as follows:

$$T_{a,b} = \inf\{n : Z_n \ge b \text{ or } Z_n \le a\}.$$

The MGF of $Z_{T_{a,b}} - X_0$ also crosses $1$ at $\theta^*$: $m_{Z_{T_{a,b}} - X_0}(\theta^*) = 1$. From now on we take $X_0 = 0$, and we fix $a = -1$, so the walk stops the first time it is negative or the first time it reaches $b$ or higher. We observe that $Z_{T_{a,b}}$ must terminate at one of $b, b+1, \ldots, b+d-1$ or at one of $-c, -c+1, \ldots, -1$. Thus we can write

$$m_{Z_{T_{a,b}}}(\theta) = \sum_{k=-c}^{-1} P_k\exp(k\theta) + \sum_{k=b}^{b+d-1} P_k\exp(k\theta),$$

where $P_k$ is the probability that $Z_{T_{a,b}} = k$.
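In this generality $\theta^*$ has no closed form, but it can be computed numerically as sketched earlier; here is a version for a hypothetical score distribution:

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical step distribution on {-2, -1, 1, 2} with negative drift and GCD 1.
steps = np.array([-2, -1, 1, 2])
probs = np.array([0.30, 0.30, 0.25, 0.15])
assert np.isclose(probs.sum(), 1.0) and probs @ steps < 0

m = lambda t: probs @ np.exp(steps * t)            # m_X(theta)
dm = lambda t: (probs * steps) @ np.exp(steps * t)
# m(0) = 1 with m'(0) < 0 and m -> infinity, so theta* lies past m's minimizer.
t_min = brentq(dm, 1e-9, 50.0)
theta_star = brentq(lambda t: m(t) - 1.0, t_min, 50.0)
print(theta_star, m(theta_star))                   # m(theta_star) should be 1
```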
Based on Wald's identity, we have $E(T_{a,b}) = E(Z_{T_{a,b}})/EX$. With $a = -1$ fixed, we let $b \to \infty$ to get an asymptotic estimate for $A \equiv E(T_{a,b})$, the expected number of steps before the walk first becomes negative. Since $E(Z_{T_{a,b}}) = \sum_k kP_k$, as $b \to \infty$ this converges to $\sum_{k=-c}^{-1} kR_k$, where $R_k$ is the limit as $b \to \infty$ of the probability that $Z_{T_{a,b}} = k$ (note that for positive $k$ the $P_k$ vanish in the limit, because the walk has negative drift). Thus we get

$$A = \frac{\sum_{k=-c}^{-1} kR_k}{\sum_{k=-c}^{d} kp_k}$$

as an approximation to $E(T_{a,b})$. Ultimately this can be used in inference for sequence alignment, since if an optimal local alignment has length around $A$, it is unlikely to reflect a true biological relationship. At this point we cannot evaluate $A$, because we do not yet know how to compute the $R_k$.
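The $R_k$ and $A$ can, however, be estimated by simulation; a sketch using the same hypothetical score distribution, which also checks the formula for $A$:

```python
import numpy as np

rng = np.random.default_rng(5)
steps = np.array([-2, -1, 1, 2])
probs = np.array([0.30, 0.30, 0.25, 0.15])   # same hypothetical scores as above

def run_to_negative():
    """Walk from 0 until the first negative value; return (stopping value, steps taken)."""
    z, n = 0, 0
    while z >= 0:
        z += rng.choice(steps, p=probs)
        n += 1
    return z, n

stops, lengths = np.array([run_to_negative() for _ in range(50_000)]).T
R = {k: np.mean(stops == k) for k in (-1, -2)}
A_direct = lengths.mean()
A_formula = sum(k * R[k] for k in R) / (probs @ steps)
print(R, A_direct, A_formula)                # the two estimates of A should agree
```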
Now consider the case where the stopping occurs at $T_{-L,1}$, where $L > 0$, i.e. the walk stops the first time it is positive or the first time it is $\le -L$. Evaluated at $\theta^*$, the MGF of $Z_{T_{-L,1}}$ becomes

$$\sum_{k=-L-c+1}^{-L} Q_k(L)\exp(k\theta^*) + \sum_{k=1}^{d} Q_k(L)\exp(k\theta^*) = 1.$$

It is a fact that the limits $Q_k = \lim_{L\to\infty} Q_k(L)$ exist. The sum $\sum_{k=1}^d Q_k$ is less than $1$; the deficit $Q = 1 - \sum_{k=1}^d Q_k$ is the probability that the random walk is never positive. Furthermore, the terms $Q_k(L)\exp(k\theta^*)$ with $k \le -L$ vanish as $L \to \infty$ (since we know that $\theta^* > 0$). Therefore we get

$$\sum_{k=1}^{d} Q_k\exp(k\theta^*) = 1.$$

Let $F(y)$ be the probability that the walk never exceeds $y$. We decompose this event according to the first positive value of the walk, if any:

$$F(y) = P(\text{no positive value is ever reached}) + \sum_{k=1}^{d} P(k \text{ is the first positive value reached})\,P(y \text{ is never exceeded} \mid k \text{ is the first positive value reached}),$$

so that

$$F(y) = Q + \sum_{k=1}^{d} Q_k F(y - k).$$

Next we apply the renewal theorem, which states that if $b_i$, $f_i$, and $u_i$ ($i \ge 0$) are non-negative and satisfy

1. $B = \sum_i b_i < \infty$,

2. $\sum_i f_i = 1$,

3. $\mu = \sum_i if_i < \infty$,

4. the GCD of $\{i : f_i > 0\}$ is $1$,

5. $u_y - b_y = \sum_{k=0}^{y} f_k u_{y-k}$,

then $u_y \to B/\mu$ as $y \to \infty$.
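A toy numerical illustration of the renewal theorem (the sequences $f$ and $b$ below are hypothetical choices satisfying conditions 1-4, with $f_0 = 0$):

```python
# Iterate the renewal equation u_y = b_y + sum_k f_k u_{y-k} and watch u_y -> B/mu.
f = {1: 0.5, 2: 0.5}          # sums to 1, GCD of {1, 2} is 1
b = {0: 0.3, 1: 0.2}          # nonnegative with finite sum B
B = sum(b.values())
mu = sum(i * fi for i, fi in f.items())
u = []
for y in range(200):
    u.append(b.get(y, 0.0) + sum(f.get(k, 0.0) * u[y - k] for k in range(1, y + 1)))
print(u[-1], B / mu)          # u_y converges to B / mu = 1/3 here
```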
Let $V(y) = (1 - F(y))\exp(y\theta^*)$. Substituting $F(y) = 1 - V(y)\exp(-y\theta^*)$ into the recursion for $F$ (and noting that $F(x) = 0$ for $x < 0$, so terms with $k > y$ drop out), we get

$$1 - V(y)\exp(-y\theta^*) = Q + \sum_{k=1}^{\min(y,d)} Q_k\left(1 - V(y-k)\exp(-(y-k)\theta^*)\right),$$

which when $y < d$ can be rewritten

$$V(y) = \exp(y\theta^*)(Q_{y+1} + \cdots + Q_d) + \sum_{k=1}^{y} Q_k\exp(k\theta^*)V(y-k),$$

and when $y \ge d$ can be written

$$V(y) = \sum_{k=1}^{d} Q_k\exp(k\theta^*)V(y-k).$$

Now if we let $f_k = Q_k\exp(k\theta^*)$, and $b_y = \exp(y\theta^*)(Q_{y+1} + \cdots + Q_d)$ if $y < d$ and $b_y = 0$ if $y \ge d$, then the renewal theorem can be applied with $u_y = V(y)$. We can compute $\mu = \sum_{k=1}^d kQ_k\exp(k\theta^*)$ directly, as it is a finite sum. To make use of the renewal theorem, we must also compute $B = \sum_y b_y$. It is easy to verify that

$$(\exp(\theta^*) - 1)B = \sum_k Q_k\exp(k\theta^*) - \sum_k Q_k = 1 - (1 - Q) = Q.$$

Thus $B = Q/(\exp(\theta^*) - 1)$, and the renewal theorem states that

$$V(y) \to \frac{Q}{(\exp(\theta^*) - 1)\sum_k kQ_k\exp(k\theta^*)},$$

and using the relationship $F(y) = 1 - V(y)\exp(-y\theta^*)$ we are done.
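The $Q_k$ have no simple closed form, but they can be estimated by simulation (truncating long excursions as a stand-in for "never positive", which the negative drift justifies); a sketch that also checks $\sum_k Q_k e^{k\theta^*} = 1$ and evaluates the limit $V$:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(6)
steps = np.array([-2, -1, 1, 2])
probs = np.array([0.30, 0.30, 0.25, 0.15])   # same hypothetical scores as above
m = lambda t: probs @ np.exp(steps * t)
dm = lambda t: (probs * steps) @ np.exp(steps * t)
theta_star = brentq(lambda t: m(t) - 1.0, brentq(dm, 1e-9, 50.0), 50.0)

def first_positive(max_steps=2_000):
    """First positive value, or None if the walk stays <= 0 for max_steps
    (a truncation standing in for 'never positive')."""
    z = 0
    for _ in range(max_steps):
        z += rng.choice(steps, p=probs)
        if z > 0:
            return z
    return None

vals = [first_positive() for _ in range(50_000)]
Qk = {k: np.mean([v == k for v in vals]) for k in (1, 2)}   # here d = 2
Q = np.mean([v is None for v in vals])
print(sum(Qk[k] * np.exp(k * theta_star) for k in Qk))      # should be close to 1
V = Q / ((np.exp(theta_star) - 1) * sum(k * Qk[k] * np.exp(k * theta_star) for k in Qk))
print("V ~", V)
```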
Review: Let $F(y)$ denote the probability that a random walk starting at zero never exceeds $y$. We found that

$$F(y) = 1 - V(y)\exp(-y\theta^*),$$

and that

$$V(y) \to \frac{Q}{(\exp(\theta^*) - 1)\sum_{k=1}^d kQ_k\exp(k\theta^*)} \equiv V,$$

where $Q_k$ is the probability that the first positive value of the walk occurs at height $k$, and $Q = 1 - \sum_{k=1}^d Q_k$ is the probability that the walk is never positive. Thus we write

$$1 - F(y) \approx V\exp(-y\theta^*),$$

or equivalently $(1 - F(y))\exp(y\theta^*) \to V$.

Now we are interested in the case where the walk stops when it first reaches a negative value. We would like to obtain an approximation to the probability $G(y)$ that such a walk never exceeds $y$. Let $\bar F(y) = 1 - F(y)$ and $\bar G(y) = 1 - G(y)$ denote the probabilities that the two walks exceed $y$. Let $R_j$ denote the probability that $j$ is the first negative value reached by the walk that stops when it reaches a negative score (so $j$ must be one of $-1, -2, \ldots, -c$). Decomposing the event that the unrestricted walk exceeds $y$ according to whether this happens before or after the first negative value, we can write

$$\bar F(y) = \bar G(y) + \sum_{j=-c}^{-1} R_j\bar F(y - j).$$
Multiplying by $\exp(y\theta^*)$,

$$\exp(y\theta^*)\bar F(y) = \exp(y\theta^*)\bar G(y) + \sum_{j=-c}^{-1} R_j\exp(j\theta^*)\,\exp((y-j)\theta^*)\bar F(y-j),$$

which as $y \to \infty$ leads to

$$V \approx \exp(y\theta^*)\bar G(y) + V\sum_{j=-c}^{-1} R_j\exp(j\theta^*).$$

Using $V$ as was computed earlier, we have

$$\bar G(y) \approx \exp(-\theta^*(y+1))\,C,$$

where

$$C = \frac{Q\left(1 - \sum_{j=-c}^{-1} R_j\exp(j\theta^*)\right)}{(1 - \exp(-\theta^*))\sum_{k=1}^d kQ_k\exp(k\theta^*)},$$

and we conclude that the probability that the walk exceeds $y$ before it becomes negative is approximated by $C\exp(-\theta^*(y+1))$, a constant multiple of $\exp(-y\theta^*)$.
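The predicted geometric decay of this probability is easy to see in simulation; a sketch with the same hypothetical score distribution:

```python
import numpy as np

rng = np.random.default_rng(7)
steps = np.array([-2, -1, 1, 2])
probs = np.array([0.30, 0.30, 0.25, 0.15])

def max_before_negative():
    z, best = 0, 0
    while z >= 0:
        z += rng.choice(steps, p=probs)
        best = max(best, z)
    return best

peaks = np.array([max_before_negative() for _ in range(100_000)])
for y in range(1, 8):
    print(y, np.mean(peaks > y))    # should decay geometrically, like C exp(-theta*(y+1))
```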
Extreme Values of iid Sequences

Let $X_1, \ldots, X_n$ denote an iid sequence of random variables. Then

$$P(\max\{X_1, \ldots, X_n\} > t) = 1 - P(\text{all } X_i \le t) = 1 - P(X_1 \le t)^n.$$

Two examples where this can be used to give a simple expression for probabilities involving extreme values are:

1. Uniform distribution on $(a, b)$:

$$P(\max\{X_1, \ldots, X_n\} > t) = 1 - \left(\frac{t-a}{b-a}\right)^n.$$

2. Exponential distribution with mean $\lambda$:

$$P(\max\{X_1, \ldots, X_n\} > t) = 1 - (1 - \exp(-t/\lambda))^n.$$

Now suppose that the $X_i$ have a density $\pi(\cdot)$. Let $F(t) = \int_{-\infty}^t \pi(u)\,du$ denote the cumulative distribution function, and note that $P(\max\{X_1, \ldots, X_n\} \le t) = F(t)^n$, and therefore the density of $\max\{X_1, \ldots, X_n\}$ is

$$dF(t)^n/dt = nF(t)^{n-1}F'(t) = nF(t)^{n-1}\pi(t).$$
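A quick numerical check of example 1 above (hypothetical $n$ and $t$):

```python
import numpy as np

rng = np.random.default_rng(8)
n, t, a, b = 10, 0.9, 0.0, 1.0
x = rng.uniform(a, b, size=(200_000, n))
print(np.mean(x.max(axis=1) > t))             # Monte Carlo
print(1 - ((t - a) / (b - a)) ** n)           # closed form
```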
Similarly, the minimum satisfies

$$P(\min\{X_1, \ldots, X_n\} > t) = (1 - F(t))^n,$$

and hence the density of the minimum is $n(1 - F(t))^{n-1}\pi(t)$. For the exponential distribution with mean $\lambda$, the density of the maximum is

$$n(1 - \exp(-t/\lambda))^{n-1}\exp(-t/\lambda)/\lambda,$$

and the density of the minimum is

$$n\exp(-t/\lambda)^{n-1}\exp(-t/\lambda)/\lambda = n\exp(-nt/\lambda)/\lambda,$$

which is itself an exponential density, with mean $\lambda/n$.

The exponential distribution has an important property called the memoryless property. Suppose $X_1, X_2, \ldots, X_n$ are exponential, and let $X_{(1)} \le \cdots \le X_{(n)}$ be the same set of values in increasing order. Let $I_1 = X_{(1)}$, and for $k > 1$ let $I_k = X_{(k)} - X_{(k-1)}$. A consequence of the memoryless property is that the $I_k$ are independent. Furthermore, $I_k$ has the distribution of the smallest of $n - k + 1$ exponential random variables, so $I_k$ is exponential with mean $\lambda_k = \lambda/(n - k + 1)$. Therefore $EI_1 = \lambda/n$, $EI_2 = \lambda/(n-1)$, and so on. Since $X_{\max} = \sum_k I_k$, we have

$$EX_{\max} = \lambda(1 + 1/2 + 1/3 + \cdots + 1/n),$$

$$\mathrm{Var}\,X_{\max} = \lambda^2(1 + 1/4 + \cdots + 1/n^2).$$
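A sketch verifying both moment formulas by simulation (hypothetical $\lambda$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(9)
lam, n = 2.0, 50
x = rng.exponential(lam, size=(100_000, n)).max(axis=1)
print(x.mean(), lam * sum(1 / k for k in range(1, n + 1)))        # harmonic sum
print(x.var(), lam**2 * sum(1 / k**2 for k in range(1, n + 1)))   # bounded as n grows
```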
Since $\sum_{k=1}^\infty 1/k^2 = \pi^2/6$, we can approximate the variance of $X_{\max}$ with $\lambda^2\pi^2/6$. It is a fact that $\sum_{j=1}^n 1/j - \log n$ has a finite limit called Euler's constant, denoted $\gamma$. Thus the mean of $X_{\max}$ can be approximated as $\lambda(\gamma + \log n)$. Note the unusual nature of $X_{\max}$: as $n$ grows, the mean becomes infinite while the variance stays bounded.

The extreme value theory for the geometric distribution is very similar to that of the exponential distribution. For a geometric random variable with mass function $(1-p)p^k$ on $k = 0, 1, \ldots$, we have

$$P(Y \ge t) = p^t = \exp(-t\log(1/p)).$$

If we set $\lambda = 1/\log(1/p)$, then the results for the exponential distribution are approximately true for the geometric distribution.

Returning to the basic fact for the exponential distribution,

$$P(\max(Y_1, \ldots, Y_n) \le t) = (1 - \exp(-t/\lambda))^n,$$

changing variables $t \to t + \lambda\log n$ yields

$$P(\max(Y_1, \ldots, Y_n) \le t + \lambda\log n) = (1 - \exp(-t/\lambda - \log n))^n = (1 - \exp(-t/\lambda)/n)^n \to \exp(-\exp(-t/\lambda)).$$
P (max(y 1,..., Y n ) t) exp( exp( t/λ + log n/λ)). Next we will get a slightly better approximation. Let F (t) = exp( exp( t)). This is clearly a cumulative distribution function on t (0, ) (it is positive, non-decreasing, and has left and right limits at 0 and 1 respectively). The corresponding density is f(t) = exp( t exp( t)). This is called the double exponential distribution. It is a fact that for a large class of random variables (including all geometric-like distributions), the maximum of n iid realizations comes from a distribution that converges to F as n. Specifically, let µ n = λ(log n + γ), and σ 2 n = λ 2 π 2 /6. Then the distribution function for Y max for n iid geometric-like random variables is P (Y max t) = exp( exp( π(t µ n )/σ n 6)). To get a p-value we would use P (Y max t) = 1 exp( exp( π(t µ n )/σ n 6)). 19
Application to Inference for Ungapped Alignment Scores

Suppose we have an ungapped alignment of two random sequences, where the following conditions hold:

1. The GCD of the possible scores is $1$.

2. At any given position, there is a positive probability of receiving a negative score and a positive probability of receiving a positive score.

3. The expected score is negative; that is, $\sum_{x,y} p(x)p(y)s(x,y) < 0$.

We know condition 3 is automatically satisfied if $s(x, y) = \log\left(p(x,y)/(p(x)p(y))\right)$ is defined as a log-likelihood ratio, since the expected null score is then the negative of a Kullback-Leibler divergence (a numerical check appears below). Let $S_i = s(X_i, Y_i)$ be the score for position $i$ alone. The global alignment score up to position $n$ is $Z_n = S_1 + \cdots + S_n$. $Z_n$ is a random walk that goes to $-\infty$ with probability $1$. The local alignment score for positions $m+1, \ldots, n$ is $Z_n - Z_m$.

Define a ladder point to be a new minimum in the walk; that is, $Z_k$ is a ladder point if $Z_k < \min(Z_1, \ldots, Z_{k-1})$. Let $L_1, L_2, \ldots$ be the ladder points, and let

$$U_k = \max\{Z_{L_k} - Z_{L_k},\, Z_{L_k+1} - Z_{L_k},\, \ldots,\, Z_{L_{k+1}} - Z_{L_k}\}.$$

The $U_k$ are independent, and each $U_k$ has the distribution of the maximum value of a random walk that starts at $0$ and is absorbed at any negative state (which was worked out above).
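As promised above, condition 3 can be checked numerically for a log-likelihood-ratio score; a sketch with a hypothetical two-letter alphabet:

```python
import numpy as np

# Hypothetical two-letter alphabet with uniform background frequencies.
px = np.array([0.5, 0.5])                      # background p(x) = p(y)
pxy = np.array([[0.35, 0.15],
                [0.15, 0.35]])                 # hypothetical joint match model p(x, y)
s = np.log(pxy / np.outer(px, px))             # log-likelihood-ratio scores s(x, y)
print((np.outer(px, px) * s).sum())            # expected null score: negative
```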
Because of the negative drift, the optimal local alignment score $Y_{\max}$ is approximated by the maximum of the $U_k$. Let $A$ be the expected number of steps before a negative score is reached (as above). If $N$ is the length of the two sequences being compared, the number of ladder points is approximated by $n = N/A$. Since $P(U_k > u) \approx C\exp(-\theta^*(u+1))$, substituting into the approximate extreme value distribution for geometric-like random variables yields

$$P(U_{\max} \le u) \approx \left(1 - C\exp(-\theta^*(u+1))\right)^{N/A} \approx \exp(-KN\exp(-u\theta^*)),$$

where $K = C\exp(-\theta^*)/A$.
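Putting the pieces together, a sketch of the resulting p-value computation; all constants here are hypothetical placeholders rather than values derived from a real scoring scheme:

```python
import numpy as np

# Illustrative only: theta*, C, and A would in practice be computed from the
# score distribution as above; N is the sequence length.
theta_star, C, A, N = 0.8, 0.6, 2.5, 10_000
K = C * np.exp(-theta_star) / A

def p_value(u):
    """Approximate P(U_max >= u) = 1 - exp(-K N exp(-u theta*))."""
    return 1 - np.exp(-K * N * np.exp(-u * theta_star))

print(p_value(15.0))
```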