EXTREME VALUE DISTRIBUTIONS FOR RANDOM COUPON COLLECTOR AND BIRTHDAY PROBLEMS

Lars Holst
Royal Institute of Technology

September 6, 2000

Abstract

Take $n$ independent copies of a strictly positive random variable $X$ and divide each copy by the sum of the copies, thus obtaining $n$ random probabilities summing to one. These probabilities are used in independent multinomial trials with $n$ outcomes. Let $N_n$ ($N_n'$) be the number of trials needed until each (some) outcome has occurred at least $c$ times. By embedding the sampling procedure in a Poisson point process the distributions of $N_n$ and $N_n'$ can be expressed using extremes of independent identically distributed random variables. Using this, asymptotic distributions as $n \to \infty$ are obtained from classical extreme value theory. The limits are determined by the behaviour of the Laplace transform of $X$ close to the origin or at infinity. Some examples are studied in detail.

Keywords: Poisson embedding; point process; Polya urn; inverse Gaussian; lognormal; gamma distribution; repeat time

AMS 1991 subject classification: primary 60G70; secondary 60C05

1 Introduction

Consider a random experiment with $n$ outcomes having probabilities $p_1, \dots, p_n$. Independent trials are performed until each outcome has occurred at least $c$ times. Let $N_n$ be the number of trials needed and let $N_n' \le N_n$ be the number of trials when some unspecified outcome has occurred $c$ times.

Dept. Mathematics, KTH, SE-100 44 Stockholm, Sweden. E-mail: lholst@math.kth.se
To find the distribution of $N_n$ for $c = 1$ and $p_1 = \dots = p_n = 1/n$ is usually called the coupon collector's problem. The approach by embedding in Poisson point processes given in Section 2 below gives the relation
\[ \sum_{i=1}^{N_n} Z_i = n \max(Y_1, \dots, Y_n), \]
where the random variables $N_n, Z_1, Z_2, \dots$ are independent, the $Z$'s being Exp(1) (density $e^{-z}$ for $z > 0$) and the $Y$'s independent and Exp(1). This implies
\[ E N_n = n\, E \max(Y_1, \dots, Y_n) = n \sum_{j=1}^{n} \frac{1}{j} \sim n \log n, \quad n \to \infty, \]
and the limit distribution
\[ \lim_{n \to \infty} P(N_n/n - \log n \le x) = e^{-e^{-x}}, \]
see Section 4.1 below.

To find the distribution of $N_n'$ for $c = 2$ and equal $p$'s is the birthday problem. In this case the embedding approach gives
\[ \sum_{i=1}^{N_n'} Z_i = n \min(Y_1, \dots, Y_n), \]
where $N_n', Z_1, Z_2, \dots$ are independent, the $Z$'s Exp(1), and the $Y$'s independent and $\Gamma(2,1)$ (we denote by $\Gamma(c,1)$ a gamma distribution with density $y^{c-1} e^{-y}/(c-1)!$ for $y > 0$). We have
\[ E N_n' = n\, E \min(Y_1, \dots, Y_n) = n \int_0^\infty (1+y)^n e^{-ny}\, dy \sim \sqrt{\pi n/2}, \quad n \to \infty, \]
and the limit distribution
\[ \lim_{n \to \infty} P(N_n'/\sqrt{2n} \le x) = 1 - e^{-x^2}, \quad x > 0, \]
see Section 5.1 below.

Combinatorial approaches to the coupon collector's problem or the birthday problem have a long history and can be found in many texts, see e.g. Feller (1968) or Blom, Holst and Sandell (1994). In Holst (1995) it is proved that the distribution functions of $N_n'$ can be partially ordered in the $p$'s by Schur-convexity and that $N_n'$ is stochastically largest in the symmetric case. A slight modification of the argument shows partial ordering in the collector problem and that $N_n$ is stochastically smallest in the symmetric case.
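For equal probabilities the classical formulas above are easy to check numerically. The following is a minimal sketch (ours, not from the paper; the function names are our own): it computes $E N_n = n \sum_{j=1}^n 1/j$ exactly for $c = 1$, and compares the exact distribution function of $N_n$, obtained by inclusion-exclusion over the outcomes still missing after $m$ trials, with the Gumbel limit.

```python
import math

EULER_GAMMA = 0.5772156649015329

def coupon_mean(n):
    # E N_n = n * (1 + 1/2 + ... + 1/n) for c = 1 and p_i = 1/n
    return n * sum(1.0 / j for j in range(1, n + 1))

def coupon_cdf(n, m):
    # P(N_n <= m) = sum_k (-1)^k C(n,k) (1 - k/n)^m  (inclusion-exclusion)
    return sum((-1) ** k * math.comb(n, k) * (1 - k / n) ** m
               for k in range(n + 1))

n = 100
# mean versus the approximation n (log n + gamma)
print(coupon_mean(n), n * (math.log(n) + EULER_GAMMA))
# x = 0 in the Gumbel limit: P(N_n/n - log n <= 0) versus e^{-e^0} = e^{-1}
m = round(n * math.log(n))
print(coupon_cdf(n, m), math.exp(-1))
```

Even at $n = 100$ the Gumbel approximation of the distribution function is accurate to a few percent, while the mean approximation is off only by the missing $1/2$-order terms of the harmonic sum.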
Papanicolaou, Kokolakis and Boneh (1998) studied a random coupon collector problem for the case $c = 1$ by letting the $p$'s be random and given by
\[ \frac{X_1}{X_1 + \dots + X_n}, \dots, \frac{X_n}{X_1 + \dots + X_n}, \]
where $X_1, X_2, \dots$ are independent identically distributed positive random variables. Applications of the model were given and asymptotic results for $E N_n$ as $n \to \infty$ were derived. Note that $X_1 = \dots = X_n = 1$ gives the classical case.

In our paper asymptotic results are obtained for the random coupon collector problem, both for the distribution and the mean of $N_n$ for $c \ge 1$, generalizing those of Papanicolaou et al (1998) for the mean. We prove our results by embedding in Poisson point processes. A similar approach is used in Holst (1995) to study birthday problems. By this device distributional problems on $N_n$ are transformed so that classical extreme value theory for independent identically distributed random variables can be applied, cf. Resnick (1987). In a similar way we study $N_n'$ for random $p$'s given as above. Other recent papers using Poisson embedding on problems of a similar flavour as ours are Steinsaltz (1999) and Camarri and Pitman (2000).

In the following $X_1, X_2, \dots$ denote independent copies of a strictly positive random variable $X$ with mean $\mu = EX < \infty$. We will see that the limit behaviour of $N_n$ as $n \to \infty$ is determined by the behaviour of the distribution function $F_X(x) = P(X \le x)$ for small $x$, or equivalently by the behaviour of the Laplace transform $g_X(s) = E e^{-sX}$ for large $s$. The limit behaviour of $N_n'$ is determined by the behaviour of $g_X(s)$ for small $s$.

The organization of the paper is as follows. In Section 2 the embedding of $N_n$ in a Poisson point process is constructed. Using this an expression for $E N_n$ is derived. In Section 3 extreme value distributions of Fréchet type $\exp(-y^{-\alpha})$ occur as limiting distributions of $N_n$ and an example with the gamma distribution is analyzed.
In Section 4 extremes of Gumbel type $\exp(-e^{-y})$ are considered; examples discussed involve $X$ having one-point, inverse Gaussian and lognormal distributions. In Section 5 birthday problems are studied and limit distributions of Weibull type $1 - \exp(-y^{\alpha})$ are obtained.
2 Embedding and $E N_n$

Let $\Pi$ be a Poisson point process with intensity one in the first quadrant of the plane. Independent of $\Pi$ let $X_1, X_2, \dots$ be independent identically distributed strictly positive random variables with finite mean $\mu$. Introduce random strips and set
\[ I_i(t) = \Pi \cap \Big\{ (x, s) : \sum_{j=1}^{i-1} X_j < x \le \sum_{j=1}^{i} X_j,\ 0 < s \le t \Big\}, \]
for $i = 1, 2, \dots$ and $t > 0$. Here $I_i(t)$ is the set of points of $\Pi$ in the $i$:th strip up to time $t$. Let $|I_i(t)|$ denote the number of these points. As $\Pi$ is a Poisson process with intensity one we have
\[ \min\{t : |I_i(t)| = c\} = Y_i/X_i, \]
where the independent random variables $Y_1, Y_2, \dots$ are $\Gamma(c,1)$ and independent of $X_1, X_2, \dots$. The first time the first $n$ strips all contain at least $c$ points can be written
\[ M_n = \max(Y_1/X_1, \dots, Y_n/X_n). \]
Given $X_1, \dots, X_n$, the projection on the $s$-axis of the points in these strips is a Poisson process with intensity $\sum_1^n X_j$. The total number of points in the $n$ strips up to time $M_n$ can be identified with $N_n$, because the probability that a point occurs in the $i$:th strip is $X_i/\sum_1^n X_j$ and the points are independent of each other. Thus with independent $Z_1, Z_2, \dots$ all being Exp(1) and independent of $N_n$, we have the basic relation:
\[ \sum_{i=1}^{N_n} Z_i = M_n \sum_{j=1}^{n} X_j. \]
Using this, different quantities of the distribution of $N_n$ can be expressed in terms of the random variables $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$.

Theorem 2.1 With notation as above:
\[ E N_n/n = \mu\, E M_{n-1} + \sum_{j=0}^{c-1} \frac{c-j}{j!}\, E\big[ (X_n M_{n-1})^j e^{-X_n M_{n-1}} \big], \]
\[ E N_n/n = \mu\, E M_{n-1} + o(1), \quad n \to \infty, \]
\[ E N_n < \infty \iff E(1/X) < \infty, \]
and for $c = 1$
\[ E N_n = n\mu\, E M_{n-1} + 1 = n\mu \int_0^\infty \big[ 1 - (1 - g_X(s))^{n-1} \big]\, ds + 1. \]
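Before turning to the proof, the $c = 1$ formula can be sanity-checked numerically in the classical case $X \equiv 1$, where $g_X(s) = e^{-s}$, $\mu = 1$, and $E N_n$ should reduce to the harmonic sum $n \sum_{j=1}^n 1/j$ of the Introduction. A minimal sketch (ours; trapezoidal quadrature stands in for the exact integral):

```python
import math

def E_M(n, g, upper=60.0, steps=100000):
    # E M_n = integral_0^inf [1 - (1 - g_X(s))^n] ds, by the trapezoidal
    # rule; the integrand is negligible beyond `upper` for the case below
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        s = i * h
        v = 1.0 - (1.0 - g(s)) ** n
        total += v / 2 if i in (0, steps) else v
    return total * h

n = 50
g = lambda s: math.exp(-s)          # X = 1: g_X(s) = e^{-s}, mu = 1
EN = n * 1.0 * E_M(n - 1, g) + 1    # E N_n = n mu E M_{n-1} + 1
harmonic = n * sum(1.0 / j for j in range(1, n + 1))
print(EN, harmonic)
```

Both numbers agree to the accuracy of the quadrature, reflecting the identity $n\,E M_{n-1} + 1 = n H_{n-1} + 1 = n H_n$ in this case.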
Proof. The embedding implies $E N_n = E\big[ M_n \sum_1^n X_j \big]$. Thus symmetry and independence give
\[ E N_n = n\, E[M_n X_n] = n\, E\Big[ X_n \int_0^\infty \big( 1 - P(M_{n-1} \le s)\, P(Y_n/X_n \le s \mid X_n) \big)\, ds \Big] \]
\[ = n\, E\Big[ X_n \int_0^\infty \big( 1 - P(M_{n-1} \le s) \big)\, ds \Big] + n\, E\Big[ X_n \int_0^\infty P(M_{n-1} \le s)\, P(Y_n/X_n > s \mid X_n)\, ds \Big] \]
\[ = n\mu\, E M_{n-1} + n \int_0^\infty P(M_{n-1} \le s)\, E\big[ X_n P(Y_n > X_n s \mid X_n) \big]\, ds. \]
As $Y_n$ is $\Gamma(c,1)$ and independent of $X_n$ we have
\[ P(Y_n > X_n s \mid X_n) = \sum_{l=0}^{c-1} \frac{(X_n s)^l}{l!}\, e^{-X_n s}. \]
Thus
\[ \int_0^\infty P(M_{n-1} \le s)\, E\big[ X_n P(Y_n > X_n s \mid X_n) \big]\, ds = \sum_{l=0}^{c-1} E \int_0^\infty P(M_{n-1} \le s)\, X_n^{l+1} \frac{s^l}{l!}\, e^{-X_n s}\, ds \]
\[ = \sum_{l=0}^{c-1} E \int_0^\infty P(M_{n-1} \le s/X_n)\, \frac{s^l e^{-s}}{l!}\, ds = \sum_{l=0}^{c-1} P(X_n M_{n-1} \le V_l), \]
where $V_l$ is $\Gamma(l+1,1)$. Hence
\[ \sum_{l=0}^{c-1} P(X_n M_{n-1} \le V_l) = E \sum_{l=0}^{c-1} \sum_{j=0}^{l} \frac{(X_n M_{n-1})^j}{j!}\, e^{-X_n M_{n-1}} = \sum_{j=0}^{c-1} \frac{c-j}{j!}\, E\big[ (X_n M_{n-1})^j e^{-X_n M_{n-1}} \big]. \]
Combining the results above proves the first assertion. As $M_n \to \infty$ a.s. as $n \to \infty$ the second assertion follows. It is readily seen that the third assertion holds for any $c$ if it holds for $c = 1$.
Let $c = 1$. Then the $Y$'s are Exp(1) and we have
\[ P(Y/X > s) = E\, P(Y > sX \mid X) = E e^{-sX} = g_X(s). \]
Thus $P(M_{n-1} \le s) = (1 - g_X(s))^{n-1}$, and therefore
\[ E e^{-X_n M_{n-1}} = E g_X(M_{n-1}) = \int_0^\infty g_X(s)\, (n-1) (1 - g_X(s))^{n-2} \big( -g_X'(s) \big)\, ds = \frac{1}{n}, \]
proving the last formula in the assertion. Furthermore,
\[ E M_n = \int_0^\infty \big[ 1 - (1 - g_X(s))^n \big]\, ds = \int_0^\infty g_X(s) \sum_{k=0}^{n-1} (1 - g_X(s))^k\, ds. \]
Therefore $E M_n < \infty$ if and only if
\[ \int_0^\infty g_X(s)\, ds = \int_0^\infty E e^{-sX}\, ds = E\Big( \frac{1}{X} \Big) < \infty, \]
proving the third assertion for $c = 1$ and therefore for all positive integers $c$.

The distribution function $F_c(s) = P(Y/X \le s)$ is important for studying $N_n$. The following result will be useful later on.

Proposition 2.1 Let $X$ and $Y$ be independent positive random variables, $X$ with distribution function $F_X$ and Laplace transform $g_X$, and $Y$ being $\Gamma(c,1)$. Then for $s > 0$:
\[ g_X^{(k)}(s) = (-1)^k E\big[ X^k e^{-sX} \big], \]
\[ 1 - F_c(s) = P(Y/X > s) = \sum_{k=0}^{c-1} (-1)^k \frac{s^k}{k!}\, g_X^{(k)}(s) = \int_0^\infty F_X(x/s)\, \frac{x^{c-1} e^{-x}}{(c-1)!}\, dx, \]
\[ F_c'(s) = (-1)^c \frac{s^{c-1}}{(c-1)!}\, g_X^{(c)}(s) = \frac{c}{s}\big( F_c(s) - F_{c+1}(s) \big) = \frac{1}{s} \int_0^\infty F_X(x/s)\, (x - c)\, \frac{x^{c-1} e^{-x}}{(c-1)!}\, dx, \]
\[ F_c''(s) = \frac{1}{s^2} \Big[ (-1)^c \frac{s^c}{(c-2)!}\, g_X^{(c)}(s) - (-1)^{c+1} \frac{s^{c+1}}{(c-1)!}\, g_X^{(c+1)}(s) \Big] = -\frac{1}{s^2} \int_0^\infty F_X(x/s)\, \big[ (x-c)^2 - c \big]\, \frac{x^{c-1} e^{-x}}{(c-1)!}\, dx. \]
Proof. As $Y$ is $\Gamma(c,1)$ we have
\[ P(Y/X > s) = E\, P(Y > sX \mid X) = \sum_{k=0}^{c-1} \frac{s^k}{k!}\, E\big[ X^k e^{-sX} \big], \]
and also
\[ P(Y/X > s) = E\, P(X < Y/s \mid Y) = \int_0^\infty F_X(x/s)\, \frac{x^{c-1} e^{-x}}{(c-1)!}\, dx. \]
By differentiation the other formulas follow by straightforward calculations.

3 Extremes of Fréchet type for $N_n$

In this section we consider $X$ such that for some $\alpha > 0$ and for some slowly varying function $L$
\[ P(X \le x) = x^\alpha L(x), \quad x \downarrow 0. \]
Recall that $L$ is slowly varying at $0$ if $L(tx)/L(x) \to 1$ as $x \downarrow 0$ for every fixed $t > 0$. A special case is the gamma distribution. The limiting distributions of $N_n$ are extreme value distributions of Fréchet or $\Phi_\alpha$ type, cf. Resnick (1987).

Theorem 3.1 Let $a_n \to \infty$ be such that
\[ n\, a_n^{-\alpha} L(1/a_n)\, \Gamma(\alpha + c)/(c-1)! \to 1. \]
Then
\[ P(N_n/(n a_n \mu) \le y) \to e^{-y^{-\alpha}}, \quad y > 0, \]
\[ E N_n/(n a_n \mu) \to \Gamma(1 - 1/\alpha), \quad \alpha > 1, \]
and $E N_n = +\infty$ for $\alpha < 1$.

Proof. Using Proposition 2.1 we get as $s \to \infty$
\[ P(Y/X > s) = \int_0^\infty (x/s)^\alpha L(x/s)\, \frac{x^{c-1} e^{-x}}{(c-1)!}\, dx \sim \frac{s^{-\alpha} L(1/s)}{(c-1)!} \int_0^\infty x^{\alpha + c - 1} e^{-x}\, dx = s^{-\alpha} L(1/s)\, \Gamma(\alpha + c)/(c-1)!. \]
Hence for $y > 0$
\[ n\, P(Y/X > a_n y) \sim n\, y^{-\alpha} a_n^{-\alpha} L(1/a_n)\, \Gamma(\alpha + c)/(c-1)! \to y^{-\alpha}. \]
Poisson convergence gives
\[ \sum_{j=1}^{n} I(Y_j/X_j > a_n y) \to \mathrm{Poisson}(y^{-\alpha}). \]
Thus for $y > 0$
\[ P(M_n/a_n \le y) = P\Big( \sum_{j=1}^{n} I(Y_j/X_j > a_n y) = 0 \Big) \to e^{-y^{-\alpha}}. \]
From the behaviour of $P(Y/X > s)$ as $s \to \infty$ we have for any integer $0 < k < \alpha$ that $E(Y/X)^k < \infty$. Hence by Resnick (1987), p. 77,
\[ E(M_n/a_n)^k \to \Gamma(1 - k/\alpha), \]
and Theorem 2.1 gives for $\alpha > 1$
\[ E N_n/(n a_n \mu) = E M_{n-1}/a_n + o(1/a_n) \to \Gamma(1 - 1/\alpha), \quad n \to \infty. \]
If $\alpha < 1$ then $E(1/X) = +\infty$, implying $E N_n = +\infty$. Thus the second and third assertions are proved.

By the embedding we have
\[ E \exp\Big( -t M_n \sum_1^n X_j \Big) = E\Big[ E\big( e^{-t \sum_{i=1}^{N_n} Z_i} \mid N_n \big) \Big] = E (1+t)^{-N_n}. \]
Therefore for $s \ge 0$ and $t = e^{s/(n a_n \mu)} - 1$ we get
\[ E e^{-s N_n/(n a_n \mu)} = E \exp\Big( -s\, \frac{e^{s/(n a_n \mu)} - 1}{s/(n a_n \mu)} \cdot \frac{M_n}{a_n} \cdot \frac{\sum_1^n X_j}{n\mu} \Big). \]
As
\[ \frac{e^{s/(n a_n \mu)} - 1}{s/(n a_n \mu)} \to 1, \qquad P(M_n/a_n \le y) \to e^{-y^{-\alpha}}, \qquad \frac{\sum_1^n X_j}{n\mu} \to 1 \text{ in probability}, \]
it follows that
\[ P\Big( \frac{e^{s/(n a_n \mu)} - 1}{s/(n a_n \mu)} \cdot \frac{M_n}{a_n} \cdot \frac{\sum_1^n X_j}{n\mu} \le y \Big) - P\Big( \frac{M_n}{a_n} \le y \Big) \to 0, \quad n \to \infty, \]
so the left-hand side converges to $e^{-y^{-\alpha}}$. Thus, by the continuity theorem for Laplace transforms we have for $s \ge 0$ that
\[ E e^{-s N_n/(n a_n \mu)} \to \int_0^\infty e^{-sy}\, d e^{-y^{-\alpha}}, \]
from which the first assertion of the theorem follows.
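The convergence just proved can be made completely explicit in one special case. For $c = 1$ and $X$ Exp(1) one has $P(Y/X > s) = g_X(s) = 1/(1+s)$, so $P(M_n \le t) = (1 - 1/(1+t))^n$ in closed form, and with $a_n = n$ the Fréchet limit is $e^{-1/y}$. A small numerical sketch (ours, not from the paper):

```python
import math

def M_cdf(n, t):
    # c = 1, X ~ Exp(1): P(Y/X > s) = 1/(1+s), hence
    # P(M_n <= t) = (1 - 1/(1+t))^n = (t/(1+t))^n
    return (t / (1.0 + t)) ** n

n = 10**6
for y in (0.5, 1.0, 2.0):
    # P(M_n/n <= y) versus the Fréchet limit e^{-1/y}
    print(y, M_cdf(n, n * y), math.exp(-1.0 / y))
```

At $n = 10^6$ the exact values and the limit agree to four decimal places for each $y$, consistent with the $O(1/n)$ error one expects from $(1 - 1/(1+ny))^n$.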
3.1 Example: gamma distribution

Let $X$ be $\Gamma(\alpha,1)$. Then
\[ g_X(s) = E e^{-sX} = (1+s)^{-\alpha}, \quad s > -1, \qquad P(X \le x) \sim x^\alpha/\Gamma(\alpha+1), \quad x \downarrow 0. \]
In Theorem 3.1 we have $\mu = \alpha$ and take
\[ a_n = \big[ n\, \Gamma(\alpha + c) \big/ \big( \Gamma(\alpha + 1)(c-1)! \big) \big]^{1/\alpha}, \]
where $a_n = n^{1/\alpha}$ for $c = 1$. For $X_1, \dots, X_n$ independent and $\Gamma(\alpha,1)$ the sum $X_1 + \dots + X_n$ is $\Gamma(n\alpha,1)$ and independent of $(X_1, \dots, X_n)/(X_1 + \dots + X_n)$, which has the symmetric Dirichlet distribution $D(\alpha, \dots, \alpha)$. Hence for $\alpha > 1$ it follows by the embedding that
\[ E N_n = E\Big[ M_n \sum_1^n X_j \Big] = E M_n \Big/ E\Big[ 1 \Big/ \sum_1^n X_j \Big] = (n\alpha - 1)\, E M_n \sim n^{1 + 1/\alpha}\, \alpha \big[ \Gamma(\alpha + c) \big/ \big( \Gamma(\alpha+1)(c-1)! \big) \big]^{1/\alpha}\, \Gamma(1 - 1/\alpha), \]
where we used that $M_n \sum_1^n X_j$ is a function of the $Y$'s and the Dirichlet vector and hence independent of $\sum_1^n X_j$, and that $E\big[ 1/\sum_1^n X_j \big] = 1/(n\alpha - 1)$. The mean is infinite for $\alpha \le 1$. Note that $(Y/c)/(X/\alpha)$ has an $F$-distribution. For the exponential case $\alpha = 1$ we take $a_n = cn$ and get the limit
\[ P(N_n/(c n^2) \le y) \to e^{-1/y}, \quad n \to \infty. \]
The probabilities $X_k/(X_1 + \dots + X_n)$ for $k = 1, \dots, n$ can be interpreted as the spacings of a random sample of size $n-1$ from a uniform distribution on the unit interval. This corresponds to a $D(1, \dots, 1)$ prior distribution on the drawing probabilities. Unconditionally the drawing procedure is a Polya urn scheme starting with $n$ balls of different colours, replacing each drawn ball together with one new ball of the same colour. A general Polya scheme corresponds to having some $\alpha > 0$.

4 Extremes of Gumbel type for $N_n$

In this section we consider distributions such that $P(X \le x) \to 0$ faster than any power of $x$ as $x \downarrow 0$. Extreme value distributions will be of Gumbel or $\Lambda$ type, see Resnick (1987). Assume for the Laplace transform $g_X(s) = E e^{-sX}$ and its derivatives that for $k = 0, 1, 2, \dots$ and as $s \to \infty$
\[ h_k(s) := \frac{s\, E[X^{k+1} e^{-sX}]}{E[X^k e^{-sX}]} \to \infty, \qquad \frac{h_{k+1}(s)}{h_k(s)} = \frac{E[X^{k+2} e^{-sX}]\, E[X^k e^{-sX}]}{\big( E[X^{k+1} e^{-sX}] \big)^2} \to 1. \]
Using Proposition 2.1 this implies for $Y$ being $\Gamma(c,1)$ that
\[ 1 - F_c(s) = P(Y/X > s) \sim \frac{s^{c-1}}{(c-1)!}\, E\big[ X^{c-1} e^{-sX} \big], \qquad F_c'(s) = \frac{s^{c-1}}{(c-1)!}\, E\big[ X^{c} e^{-sX} \big], \qquad F_c''(s) \sim -\frac{s^{c-1}}{(c-1)!}\, E\big[ X^{c+1} e^{-sX} \big]. \]
Thus
\[ \frac{\big( 1 - F_c(s) \big)\, F_c''(s)}{F_c'(s)^2} \to -1, \]
and $F_c''(s) < 0$ for $s$ sufficiently large. Then from classical extreme value theory, see Resnick (1987), Prop. 1.1 and 2.1,
\[ P\big( (M_n - b_n)/a_n \le y \big) \to e^{-e^{-y}}, \qquad E\big[ (M_n - b_n)/a_n \big] \to \gamma, \quad n \to \infty, \]
where $M_n = \max(Y_1/X_1, \dots, Y_n/X_n)$, $\gamma$ is Euler's constant, and the norming constants are given from
\[ \frac{1}{n} = 1 - F_c(b_n), \qquad a_n = \big( 1 - F_c(b_n) \big) \big/ F_c'(b_n). \]
The limit behaviour of $N_n$ will now be obtained by the embedding.

Theorem 4.1 Let $X$ satisfy the conditions above and $E X^2 < \infty$. Then with $a_n$ and $b_n$ as above
\[ P\big( (N_n/(n\mu) - b_n)/a_n \le y \big) \to e^{-e^{-y}}, \qquad E\big[ (N_n/(n\mu) - b_n)/a_n \big] \to \gamma. \]

Proof. By the embedding we have
\[ \sum_{i=1}^{N_n} Z_i = M_n \sum_1^n X_j. \]
Using the estimates above it follows that $b_n \to \infty$ and
\[ \frac{b_n^2}{n a_n^2} = \frac{b_n^2\, F_c'(b_n)^2}{1 - F_c(b_n)} \sim -b_n^2 F_c''(b_n) \sim \frac{1}{(c-1)!}\, E\big[ (b_n X)^{c+1} e^{-b_n X} \big] \to 0. \]
Thus
\[ \mathrm{Var}\Big( \frac{b_n}{a_n n} \sum_1^n X_j \Big) = \frac{b_n^2}{n a_n^2}\, \mathrm{Var}(X) \to 0, \]
and we get that
\[ \frac{M_n \sum_1^n X_j}{n a_n \mu} - \frac{b_n}{a_n} = \frac{M_n - b_n}{a_n} \cdot \frac{\sum_1^n X_j}{n\mu} + \frac{b_n}{a_n} \Big( \frac{\sum_1^n X_j}{n\mu} - 1 \Big) \]
has the same asymptotic behaviour as $(M_n - b_n)/a_n$. Furthermore, by Theorem 2.1 and the estimates above
\[ \mathrm{Var}\Big( \Big( \sum_{i=1}^{N_n} Z_i - N_n \Big) \Big/ (n a_n) \Big) = E N_n/(n a_n)^2 = \big( \mu\, E M_{n-1} + o(1) \big) \big/ (n a_n^2) \to 0. \]
Hence
\[ \frac{M_n \sum_1^n X_j}{n a_n \mu} - \frac{b_n}{a_n} = \frac{\sum_{i=1}^{N_n} Z_i}{n a_n \mu} - \frac{b_n}{a_n} = \frac{\sum_{i=1}^{N_n} Z_i - N_n}{n a_n \mu} + \frac{1}{a_n} \Big( \frac{N_n}{n\mu} - b_n \Big) \]
has the same asymptotic distribution as
\[ \frac{1}{a_n} \Big( \frac{N_n}{n\mu} - b_n \Big), \]
that is, the same as that of $(M_n - b_n)/a_n$. The convergence of the mean also follows from Resnick (1987), Prop. 2.1.

4.1 Example: constant probabilities

For $X \equiv \mu$ we have $E e^{-sX} = e^{-s\mu}$, $h_k(s) = \mu s$ and $h_{k+1}(s)/h_k(s) = 1$. Furthermore,
\[ \frac{1}{n} = \sum_{k=0}^{c-1} \frac{(b_n \mu)^k}{k!}\, e^{-b_n \mu}, \qquad \frac{1}{a_n} = n\mu\, \frac{(b_n \mu)^{c-1}}{(c-1)!}\, e^{-b_n \mu}, \]
implies
\[ b_n \mu = \log n + (c-1) \log\log n - \log(c-1)! + o(1), \qquad a_n \mu = 1 + o(1). \]
Hence
\[ P\big( N_n/n - \log n - (c-1)\log\log n + \log(c-1)! \le y \big) \to e^{-e^{-y}}, \]
\[ E N_n/n = \log n + (c-1)\log\log n - \log(c-1)! + \gamma + o(1). \]
For $c = 1$ this is the result for the classical coupon collector's problem given in the Introduction. Recall that $N_n$ is stochastically smallest among all positive distributions of $X$ when $X$ is constant.
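The norming constants above are only asymptotic. A minimal numerical sketch (ours, not from the paper) solves $1 - F_c(b_n) = 1/n$ exactly by bisection for $X \equiv 1$, $c = 2$, and compares with $\log n + (c-1)\log\log n - \log(c-1)!$; the $o(1)$ term is still clearly visible at moderate $n$:

```python
import math

def survival(b, c):
    # P(Y > b) for Y ~ Gamma(c, 1): sum_{k < c} b^k e^{-b} / k!
    return sum(b ** k * math.exp(-b) / math.factorial(k) for k in range(c))

def solve_bn(n, c, lo=0.0, hi=200.0):
    # bisection for survival(b, c) = 1/n; survival is decreasing in b
    for _ in range(200):
        mid = (lo + hi) / 2
        if survival(mid, c) > 1.0 / n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, c = 10**6, 2
exact = solve_bn(n, c)
asymptotic = (math.log(n) + (c - 1) * math.log(math.log(n))
              - math.log(math.factorial(c - 1)))
print(exact, asymptotic)
```

For $n = 10^6$ and $c = 2$ the exact root is roughly a quarter of a unit above the asymptotic expression, which is the expected slow $o(1)$ convergence through iterated logarithms.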
4.2 Example: inverse Gaussian distribution

Let $X$ be inverse Gaussian with mean $\mu = 1$ and variance $\sigma^2 = 1/(2\psi)$, that is
\[ E e^{-sX} = e^{2\psi - 2\psi\sqrt{1 + s/\psi}}, \qquad P(X \le x) = \Phi\Big( \sqrt{2\psi/x}\, (x - 1) \Big) + e^{4\psi}\, \Phi\Big( -\sqrt{2\psi/x}\, (x + 1) \Big), \]
where $\Phi$ is the standard normal distribution function. For $s \to \infty$ we have
\[ s^k\, E\big[ X^k e^{-sX} \big] \sim (\psi s)^{k/2}\, E e^{-sX}, \]
\[ 1 - F_c(s) = \frac{(\psi s)^{(c-1)/2}}{(c-1)!} \Big( 1 + O\big( 1/\sqrt{s} \big) \Big)\, E e^{-sX}, \qquad F_c'(s) = \frac{(\psi s)^{c/2}}{s\, (c-1)!} \Big( 1 + O\big( 1/\sqrt{s} \big) \Big)\, E e^{-sX}. \]
Thus the assumptions of Theorem 4.1 are satisfied. From $1 - F_c(b_n) = 1/n$ we get
\[ 2\sqrt{\psi b_n} = \log n + (c-1)\log\log n - (c-1)\log 2 - \log(c-1)! + 2\psi + o(1), \]
and
\[ a_n = \big( 1 - F_c(b_n) \big) \big/ F_c'(b_n) \sim \frac{\log n}{2\psi}, \qquad \frac{b_n}{a_n} = \frac{1}{2}\Big[ \log n + (c-1)\log\log n - \log\big( (c-1)!\, 2^{c-1} \big) + 2\psi \Big] + o(1). \]
Recalling $\sigma^2 = 1/(2\psi)$ we obtain
\[ P\big( N_n/(\sigma^2 n \log n) - b_n/a_n \le y \big) \to e^{-e^{-y}}, \qquad E N_n/(n \log n) = \sigma^2 \big( b_n/a_n + \gamma \big) + o(1). \]

4.3 Example: lognormal distribution

Let $X$ have a lognormal distribution. Without loss of generality let $X = e^{\sigma Z}$, where $Z$ is standard normal and $\mu = E X = e^{\sigma^2/2}$. For $s > 0$ we have
\[ E\big[ X^k e^{-sX} \big] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{k\sigma z - s e^{\sigma z} - z^2/2}\, dz. \]
With $z_s$ such that
\[ \frac{z_s}{\sigma}\, e^{\sigma z_s} = s, \]
we find after some calculations of saddlepoint type that
\[ E\big[ X^k e^{-sX} \big] \sim \frac{1}{\sqrt{1 + \sigma z_s}}\, e^{-k\sigma z_s - z_s/\sigma - z_s^2/2}, \quad s \to \infty. \]
With $y_n$ such that
\[ \frac{y_n^{c - 3/2}}{\sigma^{c - 1/2}\, (c-1)!}\, e^{-y_n^2/2 - y_n/\sigma} = \frac{1}{n}, \]
that is roughly $y_n \approx \sqrt{2 \log n}$, and
\[ b_n = \frac{y_n}{\sigma}\, e^{\sigma y_n}, \qquad 1 - F_c(b_n) \sim \frac{1}{n}, \qquad a_n = \big( 1 - F_c(b_n) \big) \big/ F_c'(b_n) \sim e^{\sigma y_n}, \]
we obtain the limit
\[ P\big( N_n \big/ \big( e^{\sigma^2/2}\, n\, e^{\sigma y_n} \big) - y_n/\sigma \le y \big) \to e^{-e^{-y}}, \quad n \to \infty. \]

4.4 Example: strictly positive support

Let $X \ge d > 0$ where $d = \inf\{ x : P(X \le x) > 0 \}$. Set $X_d = X - d$. Then
\[ E\big[ X^k e^{-sX} \big] \sim e^{-sd}\, d^k \int_0^\infty P(X_d \le x/s)\, e^{-x}\, dx = d^k\, e^{-sd}\, E e^{-s X_d}, \quad s \to \infty. \]
Hence $h_k(s) \sim sd$ and $h_{k+1}(s)/h_k(s) \to 1$, implying
\[ 1 - F_c(s) \sim \frac{(sd)^{c-1}}{(c-1)!}\, e^{-sd}\, E e^{-s X_d}, \qquad F_c'(s) \sim d\, \frac{(sd)^{c-1}}{(c-1)!}\, e^{-sd}\, E e^{-s X_d}. \]
The assumptions of Theorem 4.1 are fulfilled and the norming constants can be determined from
\[ \frac{1}{n} = \sum_{k=0}^{c-1} \frac{(b_n d)^k}{k!}\, e^{-b_n d}\, E e^{-b_n X_d}, \qquad a_n d = 1. \]
For $X \equiv d$ we get the example with constant probabilities. Other cases are modifications of it. For example, let $X_d$ be $\Gamma(\alpha, 1)$; then $\mu = d + \alpha$, $E e^{-s X_d} = (1+s)^{-\alpha}$ and the norming constants can be chosen as
\[ b_n d = \log n + (c - 1 - \alpha) \log\log n + \log\big( d^\alpha/(c-1)! \big), \qquad a_n d = 1, \]
giving the limit
\[ P\Big( d\, \frac{N_n}{(d + \alpha) n} - b_n d \le y \Big) \to e^{-e^{-y}}, \quad n \to \infty. \]
5 Extremes of Weibull type for $N_n'$

In this section we consider $N_n'$, the number of trials until some unspecified outcome has occurred $c \ge 2$ times. As in Section 2 we get by embedding
\[ \sum_{i=1}^{N_n'} Z_i = M_n' \sum_1^n X_j, \qquad \text{where} \qquad M_n' = \min(Y_1/X_1, \dots, Y_n/X_n). \]
With a proof similar to that of Theorem 2.1 we obtain:

Theorem 5.1 We have
\[ E N_n'/n = \mu\, E M_{n-1}' - \sum_{j=c+1}^{\infty} \frac{j - c}{j!}\, E\big[ (X_n M_{n-1}')^j e^{-X_n M_{n-1}'} \big]. \]

In a similar way as before we get asymptotic results for $N_n'$ from extreme value theory. A crucial quantity is
\[ F_c(s) = P(Y/X \le s) = \sum_{k=c}^{\infty} \frac{s^k}{k!}\, E\big[ X^k e^{-sX} \big], \quad s \to 0. \]
If
\[ \frac{s F_c'(s)}{F_c(s)} = \frac{s^c\, E[X^c e^{-sX}]/(c-1)!}{\sum_{k=c}^{\infty} s^k\, E[X^k e^{-sX}]/k!} \to c, \quad s \to 0, \]
and $a_n$ is such that $n F_c(a_n) \to 1$, then for $y > 0$
\[ P(M_n'/a_n \le y) \to 1 - e^{-y^c}, \quad n \to \infty, \]
see Resnick (1987), Prop. 1.13, 1.16. Now small modifications of the proof of Theorem 3.1 give limits of Weibull type.

Theorem 5.2 If $s F_c'(s)/F_c(s) \to c$ as $s \to 0$, $n F_c(a_n) \to 1$ and $n a_n \to \infty$ as $n \to \infty$, then
\[ P(N_n'/(n a_n \mu) \le y) \to 1 - e^{-y^c}, \quad y > 0, \]
and $E N_n'/(n a_n \mu) \to \Gamma(1 + 1/c)$.
5.1 Example: exponential moments

Suppose that the Laplace transform $g_X(s) = E e^{-sX}$ is finite in a neighborhood of the origin. Then
\[ \sum_{k=c+1}^{\infty} \frac{s^k}{k!}\, E\big[ X^k e^{-sX} \big] = O(s^{c+1}), \quad s \to 0. \]
Hence
\[ F_c(s) = \frac{s^c}{c!}\, E\big[ X^c e^{-sX} \big] + O(s^{c+1}) \sim \frac{s^c}{c!}\, E X^c, \quad s \to 0, \]
and therefore we can take $a_n = \big( c!/(n\, E X^c) \big)^{1/c}$. $X \equiv \mu$ and $c = 2$ give the limit in the Introduction for the birthday problem
\[ P(N_n'/\sqrt{2n} \le y) \to 1 - e^{-y^2}. \]
Recall that $N_n'$ is stochastically largest when $X$ is constant. If $X$ is Exp(1), then $\mu = 1$, $a_n = n^{-1/c}$ and we get the limit
\[ P(N_n'/n^{1 - 1/c} \le y) \to 1 - e^{-y^c}, \]
cf. Subsection 3.1 and the Polya urn scheme.

5.2 Example: lognormal distribution

Let $X = e^{\sigma Z}$ where $Z$ is standard normal. Then $E X^k = e^{k^2 \sigma^2/2}$ and we have
\[ \sum_{k=c+1}^{\infty} \frac{s^{k-c}}{k!}\, E\big[ X^k e^{-sX} \big] = \sum_{k=c+1}^{\infty} \frac{s^{k-c}}{k!}\, e^{k^2 \sigma^2/2}\, E e^{-s e^{k\sigma^2} X} = \sum_{k=c+1}^{\infty} \frac{e^{k^2 \sigma^2/2}}{k!}\, e^{-k(k-c)\sigma^2}\, \big( s e^{k\sigma^2} \big)^{k-c}\, E e^{-s e^{k\sigma^2} X} \to 0, \quad s \to 0. \]
Hence
\[ F_c(s) = \frac{s^c}{c!}\, E\big[ X^c e^{-sX} \big] + o(s^c) \sim \frac{s^c}{c!}\, E X^c = \frac{s^c}{c!}\, e^{c^2 \sigma^2/2}, \quad s \to 0, \]
which gives
\[ a_n = \big( c!\, e^{-c^2 \sigma^2/2}/n \big)^{1/c} \]
and the limit
\[ P\big( N_n' \big/ \big( n^{1 - 1/c}\, (c!)^{1/c}\, e^{(1-c)\sigma^2/2} \big) \le y \big) \to 1 - e^{-y^c}. \]
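Returning to the classical birthday case of Subsection 5.1 ($X \equiv \mu$, $c = 2$), the Weibull limit can be compared with the exact distribution, since elementarily $P(N_n' > k) = \prod_{i=1}^{k-1}(1 - i/n)$ for equal $p$'s. A small sketch (ours, not from the paper), using the familiar $n = 365$:

```python
import math

def birthday_cdf(n, k):
    # P(N'_n <= k) for c = 2 and equal p's: one minus the probability
    # that the first k outcomes are all distinct
    p_all_distinct = 1.0
    for i in range(1, k):
        p_all_distinct *= 1.0 - i / n
    return 1.0 - p_all_distinct

n, k = 365, 23
y = k / math.sqrt(2 * n)
# exact P(N'_n <= k) versus the Weibull limit 1 - e^{-y^2} at y = k / sqrt(2n)
print(birthday_cdf(n, k), 1.0 - math.exp(-y * y))
```

At $k = 23$ the exact probability is the classical value just above one half, and the Weibull approximation $1 - e^{-k^2/(2n)}$ already agrees with it to within about one percentage point.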
References

[1] Blom, G., Holst, L. and Sandell, D. (1994). Problems and Snapshots from the World of Probability. Springer, New York.

[2] Camarri, M. and Pitman, J. (2000). Limit distributions and random trees derived from the birthday problem with unequal probabilities. Electronic Journal of Probability 5, Paper no. 1, 1-19.

[3] Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. I, 3rd edn. John Wiley, New York.

[4] Holst, L. (1995). The general birthday problem. Random Structures and Algorithms 6, 201-207.

[5] Papanicolaou, V.G., Kokolakis, G.E. and Boneh, S. (1998). Asymptotics for the random coupon collector problem. Journal of Computational and Applied Mathematics 93, 95-105.

[6] Resnick, S. (1987). Extreme Values, Regular Variation, and Point Processes. Springer, New York.

[7] Steinsaltz, D. (1999). Random time changes for sock-sorting and other stochastic process limit theorems. Electronic Journal of Probability 4, Paper no. 14, 1-25.