arXiv: v5 [cs.IT] 28 Feb 2015


Sampling with arbitrary precision

Luc Devroye (McGill University), Claude Gravel (Université de Montréal)

October, 28

Abstract

We study the problem of the generation of a continuous random variable when a source of independent fair coins is available. We first motivate the choice of a natural criterion for measuring accuracy, the Wasserstein $L_\infty$ metric, and then show a universal lower bound for the expected number of required fair coins as a function of the accuracy. In the case of an absolutely continuous random variable with finite differential entropy, several algorithms are presented that match the lower bound up to a constant, which can be eliminated by generating random variables in batches.

Keywords: random number generation, random bit model, differential entropy, partition entropy, inversion, probability integral transform, tree-based algorithms, random sampling

1 Introduction

Knuth and Yao [1] showed that the expected number of independent Bernoulli(1/2) random bits needed to generate an integer-valued random variable $X$ whose distribution is given by $p_i = \mathbb{P}\{X = i\}$, where $\sum_i p_i = 1$, is at least equal to the binary entropy of $X$:

$$\mathcal{E} \equiv \mathcal{E}(X) \equiv \mathcal{E}\{(p_i)_{i \in \mathbb{N}}\} \stackrel{\text{def}}{=} \sum_i p_i \log_2 \frac{1}{p_i}.$$

They also exhibited an algorithm, dubbed the DDG tree algorithm, for which the expected number of random Bernoulli(1/2) bits is not more than $\mathcal{E} + 2$. By grouping, one can thus develop algorithms for generating batches of $m$ independent copies of $X$ such that the expected number of Bernoulli(1/2) random bits per random variable does not exceed $\mathcal{E} + 2/m$. While these results settle the discrete random variate case quite satisfactorily, the generation of continuous or mixed random variables has not been treated satisfactorily in the literature. The objective of this note is to study the number of Bernoulli(1/2) random bits needed to generate a random variable $X \in \mathbb{R}^d$ with a given precision $\varepsilon > 0$, provided that we can define precision in a satisfactory manner.

Note that any algorithm that takes as input the accuracy parameter $\varepsilon > 0$ returns a random variable $Y = f_T(B_1, \ldots, B_T)$, where $B_1, B_2, \ldots, B_T$ are independent identically distributed (i.i.d.) Bernoulli(1/2) bits, $T$ is the number of bits needed, and $f_1, f_2, \ldots$ is a given sequence of functions. For a vector $v \in \mathbb{R}^d$,

let $\|v\|_p$ denote the $\ell_p$-norm of $v$ for $p \ge 1$: $\|v\|_p = \left(\sum_{i=1}^d |v_i|^p\right)^{1/p}$. For $p = \infty$, the $\infty$-norm is $\|v\|_\infty = \sup_{1 \le i \le d} |v_i|$. With $d = 1$, all $p$-norms are the same for $p \in [1, \infty]$. An algorithm with accuracy $\varepsilon$ is such that for some coupling of $X$ (the target random variable) and $Y$, $\|X - Y\|_p \le \varepsilon$, where $p$ is usually 2 or $\infty$. This natural notion of accuracy corresponds to the Wasserstein $L_\infty$ metric between two probability measures.

For a random variable $X$, we denote by $\mathcal{L}(X)$ the distribution of $X$. Let $\mathcal{M}$ denote the space of all distributions of pairs of random variables $(X, Y) \in \mathbb{R}^d \times \mathbb{R}^d$ with fixed marginal distributions $F$ and $G$, respectively. Then the Wasserstein $L_\infty$ distance between $X$ and $Y$, or between $F$ and $G$, is

$$W_p(F, G) = \inf\left\{ \operatorname{ess\,sup} \|X - Y\|_p : \mathcal{L}(X, Y) \in \mathcal{M} \right\},$$

where ess sup denotes the essential supremum. This is a distance metric: $\text{dist}(X, Y) \stackrel{\text{def}}{=} W_p(F, G)$. If $\text{dist}(X, Y) < \varepsilon$, then there exists a random variable $Y$ (output) coupled with $X$ (target) such that $\operatorname{ess\,sup} \|X - Y\|_p < \varepsilon$, i.e., with probability one, $\|X - Y\|_p < \varepsilon$.

This definition of distance should satisfy simulation professionals in the sense that if their calculations require the evaluation of $\Psi(X_1, \ldots, X_d)$, where $X_1, \ldots, X_d$ are given independent random variables and $\Psi$ is a real-valued function, then, with probability one,

$$|\Psi(Y_1, \ldots, Y_d) - \Psi(X_1, \ldots, X_d)| \le \sup_{y : \|y - X\|_p < \varepsilon} |\Psi(y_1, \ldots, y_d) - \Psi(X_1, \ldots, X_d)|,$$

which is usually a quantity that can easily be controlled. We believe that software packages should have the capability of accepting $\varepsilon$ as an input parameter in random variate generation.

It is interesting that $\mathbb{E}(T)$, the expected value of $T$, can be related to the entropy almost in the way Knuth and Yao did for discrete random variables in [1]. Our note provides the foundational background for such a study in terms of universal lower bounds and various useful upper bounds for particular algorithms. We include examples for the main distributions.

Several authors have addressed the problem of arbitrary precision in sampling algorithms.
These include Flajolet and Saheb [2], who explain how to generate the first $k$ bits of an exponential random variable for an integer $k$. Karney [3] describes an algorithm for the standard normal distribution. Lumbroso's thesis [4] also discusses arbitrary precision sampling.

The quantity that appears in our lower bound is the partition entropy. More precisely, let $\mathcal{A}$ be a partition of $\mathbb{R}^d$ and let $\varepsilon > 0$ be a fixed parameter for the precision. The partition entropy of $X$ with respect to $\mathcal{A}$ is the quantity

$$\mathcal{E}_{\mathcal{A}}(X) = \sum_{A \in \mathcal{A}} \mathbb{P}\{X \in A\} \log_2 \frac{1}{\mathbb{P}\{X \in A\}}.$$

While our results apply to all distributions, we will mainly focus on absolutely continuous distributions, i.e., random variables $X$ with density $f$. We recall the definition of differential entropy:

$$\mathcal{E}(f) \stackrel{\text{def}}{=} \int_{x \in \mathbb{R}^d} f(x) \log_2 \frac{1}{f(x)} \, dx.$$

The differential entropy can be ill-defined, $-\infty$, finite or $+\infty$. For more information on differential entropy and entropy in general, one can read Cover and Thomas [5]. When $X$ has compact support, the case $+\infty$ cannot occur. When $f$ is bounded, the case $-\infty$ is excluded. When $\mathcal{E}(\lfloor X_1 \rfloor, \ldots, \lfloor X_d \rfloor) < \infty$, it can be shown that $\mathcal{E}(f)$ is well-defined and either finite or $-\infty$; see Rényi [6], Csiszár [7] for a proof. Our main result is the following:

Theorem. Let $X \in \mathbb{R}^d$ be a target random vector with density $f$, and assume that $\mathcal{E}(\lfloor X_1 \rfloor, \ldots, \lfloor X_d \rfloor) < \infty$. Consider any algorithm that for given $\varepsilon > 0$ outputs $Y \equiv Y_\varepsilon$ using $T$ random fair coins, such that $W_p(X, Y) \le \varepsilon$. Then

$$\mathbb{E}(T) \ge \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} - \log_2 \left( \frac{\left(2\Gamma\left(\frac{1}{p} + 1\right)\right)^d}{\Gamma\left(\frac{d}{p} + 1\right)} \right).$$

For $p = 1$ and $p = \infty$, the third term in the lower bound is $-\log_2(2^d / d!)$ and $-d$, respectively. For $d = 1$, it is $-1$.

The second part of this note describes several algorithms that come within a constant term of this lower bound, and therefore are essentially optimal if grouping is used for generation. Before tackling all that, we introduce a brief section in which we recall the main exact sampling algorithms for discrete random variables, and their properties, as these will be essential for the understanding of the main algorithms.

2 Bounds for the discrete case

In this section, we give simple proofs of two important results for generating discrete random variables: the optimal algorithm of Knuth and Yao [1] and the more practical but slightly suboptimal algorithm of Han and Hoshi [8]. We recall that we want to sample $X$ with probability vector $(p_1, p_2, \ldots)$. Every $p_i$ is decomposed into its binary representation $p_i = \sum_{j \ge 1} b_{i,j} / 2^j$, where $b_{i,j} \in \{0, 1\}$ for all $i, j$.
Consider the new random variable $Z$ with probability vector

$$\left( \frac{b_{1,1}}{2}, \frac{b_{1,2}}{2^2}, \ldots, \frac{b_{2,1}}{2}, \frac{b_{2,2}}{2^2}, \ldots \right) \stackrel{\text{def}}{=} (p_{1,1}, p_{1,2}, \ldots, p_{2,1}, p_{2,2}, \ldots), \qquad (2)$$

with $p_{i,j} = b_{i,j} / 2^j$. Any algorithm for generating a discrete random variable using a source of i.i.d. fair coins $B_1, B_2, \ldots$ and that is based on a stopping time $T$ when it returns an output can be viewed as a binary tree, where $B_1, B_2, \ldots$ uniquely determines an infinite path in the tree by the rule (0 is left and 1 is right). We refer to this general class of algorithms as tree-based algorithms; they include all possible practical algorithms. Leaves in this tree correspond to outputs. The algorithm

of Knuth and Yao can be implemented by a binary tree, a DDG tree, in which each leaf at level $j$ corresponds to a nonzero bit $b_{i,j}$ in (2). One randomly walks down this tree starting at the root and reaches the leaf for $b_{i,j}$ with probability $1/2^j = p_{i,j}$. At that point, the value $i$ is returned, and indeed, $\mathbb{P}\{X = i\} = \sum_j p_{i,j} = p_i$, as required. A tree-based algorithm is optimal if it minimizes $\mathbb{E}(T)$ for a given probability vector.

Theorem 1 (Knuth and Yao [1]). The expected number of bits of an optimum tree-based algorithm for sampling $(p_1, p_2, \ldots)$ is bounded from below by $\mathcal{E}\{(p_i)_i\}$ and from above by $\mathcal{E}\{(p_i)_i\} + 2$.

Proof of Theorem 1. Given a probability vector $(p_1, p_2, \ldots, p_n)$ with $n$ possibly infinite, let the binary expansion of $p_i$ for $i \in \{1, \ldots, n\}$ be

$$p_i = \sum_{j \ge 1} \frac{b_{i,j}}{2^j}, \qquad b_{i,j} \in \{0, 1\}.$$

If $T$ denotes the number of bits required by an optimal algorithm for sampling $(p_1, p_2, \ldots, p_n)$, then for $t \ge 1$,

$$\mathbb{P}\{T = t\} = \frac{\text{number of leaves at level } t}{2^t} = \sum_{i=1}^n \frac{b_{i,t}}{2^t}.$$

Therefore,

$$\mathbb{E}(T) = \sum_{t \ge 1} t \, \mathbb{P}\{T = t\} = \sum_{t \ge 1} t \sum_{i=1}^n \frac{b_{i,t}}{2^t} = \sum_{i=1}^n \left( \sum_{t \ge 1} \frac{t \, b_{i,t}}{2^t} \right). \qquad (3)$$

We now show that the quantity within parentheses in (3) is lower bounded by $p_i \log_2(1/p_i)$ and upper bounded by $p_i \log_2(1/p_i) + 2 p_i$, and then the result follows. For convenience, let $x \in [0, 1]$ and let its binary expansion be $x = \sum_{j \ge 1} x_j / 2^j$. To complete the proof, it remains to prove that

$$x \log_2 \frac{1}{x} \le \sum_{j \ge 1} \frac{j x_j}{2^j} \le x \log_2 \frac{1}{x} + 2x. \qquad (4)$$

If $m$ is the position of the first non-zero coefficient of the binary expansion of $x$, then there are two cases: either (1) $x = 2^{-m}$ or (2) $2^{-m} < x < 2 \cdot 2^{-m}$. The inequalities are strict for case 2 since $x \ne 2^{-m}$. For the first case, $x_m = 1$ is the only non-zero coefficient, and (4) is obviously true. For the second case, $2^{-m} < x < 2 \cdot 2^{-m}$ or, $0 < m + \log_2 x < 1$.

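Then for the upper bound, writing $\delta = m + \log_2 x \in (0, 1)$,

$$\sum_{j \ge 1} \frac{j x_j}{2^j} = m x + \sum_{j > m} \frac{(j - m) x_j}{2^j} \le m x + \frac{1}{2^m} \sum_{k \ge 1} \frac{k}{2^k} = m x + \frac{2}{2^m} \le x \left( 2 + \log_2 \frac{1}{x} \right),$$

where the last step uses $2 \le 2^\delta (2 - \delta)$ for $\delta \in [0, 1]$, and for the lower bound,

$$\sum_{j \ge 1} \frac{j x_j}{2^j} = \sum_{j \ge m} \frac{j x_j}{2^j} \ge m \sum_{j \ge m} \frac{x_j}{2^j} = m x > x \log_2 \frac{1}{x}.$$

We now recall the Han-Hoshi algorithm published in [8] that implements the inversion method. Given $(p_1, \ldots, p_n)$ with $n$ finite or countably infinite, the algorithm partitions the interval $[0, 1)$ into a countable collection of disjoint subintervals $[Q_{i-1}, Q_i)$ where $Q_0 = 0$ and

$$Q_i = \sum_{k=1}^i p_k, \qquad i \in \{1, \ldots, n\}.$$

The idea behind the algorithm is to iteratively refine a random interval $I \subseteq [0, 1)$ and to stop when $I \subseteq [Q_{i-1}, Q_i)$ for a certain $i \in \{1, \ldots, n\}$. The inversion principle says that if $U$ is uniformly distributed on $[0, 1]$, then the unique $i \in \{1, \ldots, n\}$ such that $Q_{i-1} \le U < Q_i$ is distributed according to $(p_1, \ldots, p_n)$. For a binary random source of unbiased i.i.d. bits, their algorithm is as follows:

Algorithm 1 The Han-Hoshi algorithm using a binary source
1: $T \leftarrow 0$
2: $\alpha_T \leftarrow 0$
3: $\beta_T \leftarrow 1$
4: repeat
5:   $T \leftarrow T + 1$
6:   $B \leftarrow$ Random Bit
7:   $\alpha_T \leftarrow \alpha_{T-1} + (\beta_{T-1} - \alpha_{T-1}) B / 2$
8:   $\beta_T \leftarrow \alpha_{T-1} + (\beta_{T-1} - \alpha_{T-1}) (B + 1) / 2$
9:   $I \leftarrow [\alpha_T, \beta_T)$
10: until $I \subseteq [Q_{i-1}, Q_i)$ for some $i$
11: Return $i$.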
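Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `han_hoshi`, the callable `next_bit` standing in for the fair-coin source, and the use of exact `Fraction` arithmetic are all assumptions made here for clarity. The example vector is the one from Figure 1.

```python
from fractions import Fraction
import random

def han_hoshi(probs, next_bit):
    """Return (i, T): a 1-based index drawn from probs and the number of bits used.

    probs is a list of Fractions summing to 1; next_bit() returns a fair bit (0 or 1).
    """
    # Cumulative boundaries Q_0 = 0 < Q_1 < ... < Q_n = 1.
    q = [Fraction(0)]
    for p in probs:
        q.append(q[-1] + p)
    alpha, beta, t = Fraction(0), Fraction(1), 0
    while True:
        # One iteration of the repeat loop: read a bit and halve the interval.
        t += 1
        mid = (alpha + beta) / 2
        if next_bit() == 0:
            beta = mid      # keep the left half
        else:
            alpha = mid     # keep the right half
        # Stop as soon as [alpha, beta) lies inside one cell [Q_{i-1}, Q_i).
        for i in range(1, len(q)):
            if q[i - 1] <= alpha and beta <= q[i]:
                return i, t

# Hypothetical usage with the probability vector of Figure 1.
figure1 = [Fraction(1, 16), Fraction(5, 32), Fraction(5, 32), Fraction(9, 32),
           Fraction(3, 16), Fraction(1, 32), Fraction(1, 8)]
rng = random.Random(0)
draws = [han_hoshi(figure1, lambda: rng.getrandbits(1)) for _ in range(1000)]
print("average number of bits:", sum(t for _, t in draws) / len(draws))
```

Exact rationals avoid any rounding in the containment test; a production sampler would more likely track the interval with integers or fixed-width words.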

Figures 1 and 2 are examples that show the underlying DDG tree during the execution of the Han and Hoshi algorithm.

Figure 1: Illustration of the algorithm of Han and Hoshi on the vector $(p_1, p_2, p_3, p_4, p_5, p_6, p_7) = \left(\frac{1}{16}, \frac{5}{32}, \frac{5}{32}, \frac{9}{32}, \frac{3}{16}, \frac{1}{32}, \frac{1}{8}\right)$. The cumulative values are $q_1 = \frac{1}{16}$, $q_2 = \frac{7}{32}$, $q_3 = \frac{3}{8}$, $q_4 = \frac{21}{32}$, $q_5 = \frac{27}{32}$, $q_6 = \frac{7}{8}$, and $q_7 = 1$.

Figure 2: Illustration of the Han and Hoshi algorithm on a vector $(p_1, p_2, p_3, p_4)$, with the cut points $p_1$, $p_1 + p_2$, and $p_1 + p_2 + p_3$ marked.

Let $T$ be the number of random coins needed, which is also the number of iterations of the repeat loop. For $T \ge 0$, $[\alpha_T, \beta_T) \supseteq [\alpha_{T+1}, \beta_{T+1})$. To every node (internal or external) corresponds an interval $[\alpha_T, \beta_T)$. The root corresponds to the interval $[0, 1)$. Each internal node corresponds to an interval $[\alpha_T, \beta_T)$ that is not contained in any of the intervals $[Q_{i-1}, Q_i)$; if the source produces $B = 0$, then the left child corresponds to the interval $[\alpha_T, (\alpha_T + \beta_T)/2) = [\alpha_{T+1}, \beta_{T+1})$ and, if $B = 1$, then the right child corresponds to $[(\alpha_T + \beta_T)/2, \beta_T) = [\alpha_{T+1}, \beta_{T+1})$. Each leaf (external node) corresponds to an interval $[\alpha_T, \beta_T)$ entirely contained in some $[Q_{i-1}, Q_i)$, upon which the integer $i$ is returned; overall, $i$ is returned with probability $Q_i - Q_{i-1} = p_i$. The following theorem was proved by Han and Hoshi [8]:

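Theorem 2. For the Han-Hoshi algorithm,

$$\sum_i p_i \log_2 \frac{1}{p_i} \le \mathbb{E}(T) \le \sum_i p_i \log_2 \frac{1}{p_i} + 3.$$

Proof of Theorem 2. Our new proof partitions the leaves $L_i$ for symbol $i$ in the DDG tree arbitrarily into two sets, $A_i$ and $B_i$, such that $A_i$ and $B_i$ each possess at most one leaf per level. Let

$$\alpha_i = \sum_{u \in A_i} p(u), \qquad \beta_i = \sum_{u \in B_i} p(u),$$

where $p(u)$ is the probability attached to leaf $u$, i.e., $1/2^{\text{depth}(u)}$. We have $p_i = \alpha_i + \beta_i$. By nesting and elementary calculations,

$$\sum_i p_i \log_2 \frac{1}{p_i} \le \sum_i \alpha_i \log_2 \frac{1}{\alpha_i} + \sum_i \beta_i \log_2 \frac{1}{\beta_i} \le \sum_i p_i \log_2 \frac{1}{p_i} + 1. \qquad (5)$$

Let $\alpha_i(j)$ be the $j$-th bit in the binary expansion of $\alpha_i$, and let $\beta_i(j)$ be the $j$-th bit for $\beta_i$. Then

$$\mathbb{E}(T) = \sum_i \sum_j \frac{j \, \alpha_i(j)}{2^j} + \sum_i \sum_j \frac{j \, \beta_i(j)}{2^j} \stackrel{\text{def}}{=} I + II.$$

As in the proof of Theorem 1, we have

$$\sum_i \alpha_i \log_2 \frac{1}{\alpha_i} \le I \le \sum_i \left( \alpha_i \log_2 \frac{1}{\alpha_i} + 2 \alpha_i \right), \qquad \sum_i \beta_i \log_2 \frac{1}{\beta_i} \le II \le \sum_i \left( \beta_i \log_2 \frac{1}{\beta_i} + 2 \beta_i \right),$$

so that, using (5) and $\sum_i (\alpha_i + \beta_i) = 1$,

$$\sum_i p_i \log_2 \frac{1}{p_i} \le \mathbb{E}(T) \le \sum_i p_i \log_2 \frac{1}{p_i} + 1 + 2 = \sum_i p_i \log_2 \frac{1}{p_i} + 3.$$

3 Lower bound for generating continuous random vectors

In this section, we give a lower bound for the complexity of sampling any continuous distribution to an arbitrary precision. Let $\mathcal{A}$ be a countable partition of $\mathbb{R}^d$, and let $\varepsilon > 0$ be a fixed precision parameter. Consider the infinite graph $G$ whose vertices are the sets $A \in \mathcal{A}$ and whose edges are all pairs $(A, B) \in \mathcal{A} \times \mathcal{A}$ such that

$$\inf_{x \in A, \, y \in B} \|x - y\|_p < \varepsilon.$$

Therefore, if $(A, B)$ is not an edge of $G$, then $\|x - y\|_p \ge \varepsilon$ for all $x \in A$, $y \in B$. Let $\Delta$ be the maximal degree of any vertex of $G$. We now state a lemma that we shall use in conjunction with the Knuth-Yao result, mentioned and reproved in the previous section, in order to prove our main theorem mentioned in the introduction.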

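Lemma 1. Let $X$ be a target random vector of $\mathbb{R}^d$. Let $Y$ be an output with the property that, with probability one, $\|X - Y\|_p < \varepsilon$. Let $T$ be the number of bits used to generate $Y$ by an algorithm. Then

$$\mathbb{E}(T) \ge \sup_{\mathcal{A}} \left\{ \mathcal{E}_{\mathcal{A}}(X) - \log_2(\Delta + 1) \right\},$$

where $\mathcal{A}$ and $\Delta$ are as above.

We can maximize the bound from Lemma 1, of course, by selecting the most advantageous combination of partition $\mathcal{A}$ and degree $\Delta$. The bound from Lemma 1 coincides with the bound in Shannon [9] when the distribution of $X$ is discrete with a finite number of atoms since, in that case, $\Delta = 0$ by choosing $\varepsilon$ sufficiently small.

Proof of Lemma 1. Let $X$ and $Y$ be two dependent random variables of $\mathbb{R}^d$, and denote $p_{AB} = \mathbb{P}\{X \in A, Y \in B\}$. Note that $p_{AB} = 0$ if $A \ne B$ and $(A, B)$ is not an edge of $G$. Thus

$$\mathcal{E}_{\mathcal{A}}(X) - \mathcal{E}_{\mathcal{A}}(Y) = \sum_{(A, B) \in \mathcal{A} \times \mathcal{A}} \mathbb{P}\{X \in A, Y \in B\} \log_2 \frac{\mathbb{P}\{Y \in B\}}{\mathbb{P}\{X \in A\}}$$

$$\le \log_2 \left( \sum_{(A, B) \in \mathcal{A} \times \mathcal{A}} \mathbb{P}\{X \in A, Y \in B\} \frac{\mathbb{P}\{Y \in B\}}{\mathbb{P}\{X \in A\}} \right) \quad \text{(by Jensen's inequality)}$$

$$= \log_2 \left( \sum_{B \in \mathcal{A}} \mathbb{P}\{Y \in B\} \sum_{A \in \mathcal{A}} \frac{\mathbb{P}\{X \in A, Y \in B\}}{\mathbb{P}\{X \in A\}} \right)$$

$$\le \log_2 \left( (\Delta + 1) \sum_{B \in \mathcal{A}} \mathbb{P}\{Y \in B\} \right) = \log_2(\Delta + 1).$$

If $T$ is the random number of bits needed to generate a discrete random variable $Y$ that outputs a vertex $A$ of $G$ with probability $\mathbb{P}\{Y \in A\}$, then

$$\mathbb{E}(T) \ge \mathcal{E}_{\mathcal{A}}(Y) \quad \text{(by Knuth-Yao)} \quad \ge \mathcal{E}_{\mathcal{A}}(X) - \log_2(\Delta + 1),$$

and therefore

$$\mathbb{E}(T) \ge \sup_{\mathcal{A}} \left\{ \mathcal{E}_{\mathcal{A}}(X) - \log_2(\Delta + 1) \right\}. \qquad (6)$$

It is interesting to recall a general result from Csiszár [7] about the hypercube partition entropy of an absolutely continuous random vector $X$ that will become useful later. Of particular interest to us is the cubic partition $\mathcal{A}_h$ parametrized by $h > 0$. The cells of this partition are of the form

$$\prod_{j=1}^d [i_j h, (i_j + 1) h), \qquad (i_1, \ldots, i_d) \in \mathbb{Z}^d.$$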

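We recall that if $\lfloor X \rfloor \stackrel{\text{def}}{=} (\lfloor X_1 \rfloor, \ldots, \lfloor X_d \rfloor)$ has finite entropy (a condition we refer to as Rényi's condition), then, if $X$ has a density $f$, $\mathcal{E}(f) = \int f \log_2(1/f)$ is well-defined, i.e., it is either finite or $-\infty$. We have:

Lemma 2. Under Rényi's condition, for a general partition $\mathcal{A}$ and a random variable $X \in \mathbb{R}^d$ with density $f$,

$$\mathcal{E}_{\mathcal{A}}(X) \ge \mathcal{E}(f) + \sum_{A \in \mathcal{A}} \mathbb{P}\{X \in A\} \log_2 \frac{1}{\lambda(A)},$$

where $\lambda$ denotes the Lebesgue measure. In particular,

$$\mathcal{E}_{\mathcal{A}_h}(X) \ge \mathcal{E}(f) + d \log_2 \frac{1}{h}.$$

Proof of Lemma 2. Fix $A \in \mathcal{A}$. If $Z$ is uniform on $A$ and $Y = f(Z)$, then $\mathbb{P}\{X \in A\} = \int_A f = \lambda(A) \mathbb{E}(Y)$. Thus

$$\mathbb{P}\{X \in A\} \log_2 \frac{\lambda(A)}{\mathbb{P}\{X \in A\}} = \lambda(A) \, \mathbb{E}(Y) \log_2 \frac{1}{\mathbb{E}(Y)},$$

and, by Jensen's inequality and the concavity of $x \log_2(1/x)$,

$$\mathbb{E}(Y) \log_2 \frac{1}{\mathbb{E}(Y)} \ge \mathbb{E}\left\{ Y \log_2 \frac{1}{Y} \right\} = \frac{1}{\lambda(A)} \int_A f \log_2 \frac{1}{f}.$$

The inequality follows by summing over $A \in \mathcal{A}$.

Lemma 3 (Csiszár [7]). Let $X \in \mathbb{R}^d$ have density $f$, and let Rényi's condition be satisfied. If $\mathcal{E}(f) > -\infty$, then as $h \downarrow 0$,

$$\mathcal{E}_{\mathcal{A}_h}(X) = \mathcal{E}(f) + d \log_2 \frac{1}{h} + o(1).$$

If $\mathcal{E}(f) = -\infty$, then as $h \downarrow 0$,

$$\mathcal{E}_{\mathcal{A}_h}(X) - d \log_2 \frac{1}{h} \to -\infty.$$

Remark 1. The fifth theorem of Csiszár [7] stipulates that if $\mathcal{E}(\lfloor X_1 \rfloor, \ldots, \lfloor X_d \rfloor) < \infty$ and the distribution of $X$ is not absolutely continuous, then, as $h \downarrow 0$,

$$\mathcal{E}_{\mathcal{A}_h}(X) - d \log_2 \frac{1}{h} \to -\infty.$$

For more information about the asymptotic theory for the entropy of partitioned distributions as the partitions become finer, one can consult Rényi [6], Csiszár [7], Csiszár [10], and Linder and Zeger [11].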
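Lemmas 2 and 3 can be checked numerically in a simple case. The sketch below assumes $d = 1$ and a rate-one exponential density, for which $\mathcal{E}(f) = \log_2 e$; the function names and the truncation point `tail` are illustrative choices, not part of the paper.

```python
import math

def partition_entropy_exponential(h, tail=60.0):
    """Entropy in bits of the cell index floor(X / h) for X ~ Exponential(1).

    Cells are [i*h, (i+1)*h); cells beyond `tail` carry negligible mass and are dropped.
    """
    cell_factor = -math.expm1(-h)          # 1 - exp(-h), computed accurately for small h
    total = 0.0
    for i in range(int(tail / h) + 1):
        p = math.exp(-i * h) * cell_factor  # P{X in [i*h, (i+1)*h)}
        if p > 0.0:
            total += p * math.log2(1.0 / p)
    return total

def lemma3_gap(h):
    """Partition entropy minus (E(f) + log2(1/h)) for the exponential density."""
    return partition_entropy_exponential(h) - (math.log2(math.e) + math.log2(1.0 / h))

for h in (0.5, 0.1, 0.01):
    print(h, lemma3_gap(h))
```

The printed gaps should be nonnegative, as Lemma 2 requires, and should shrink as $h$ decreases, in line with the $o(1)$ term of Lemma 3.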

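Theorem 3. Let $X \in \mathbb{R}^d$ have density $f$. Let $Y$ be an output with the property that, with probability one, $\|X - Y\|_p < \varepsilon$. Then, under Rényi's condition,

$$\mathbb{E}(T) \ge \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} - \log_2(V_{d,p}),$$

where

$$V_{d,p} = \frac{\left(2\Gamma\left(\frac{1}{p} + 1\right)\right)^d}{\Gamma\left(\frac{d}{p} + 1\right)}$$

is the volume of the unit $\ell_p$-ball in $\mathbb{R}^d$, and $T$ is the number of random bits needed to generate $Y$.

Proof of Theorem 3. Let $\mathcal{A}_h$ be a cubic partition. Then

$$\mathbb{E}(T) \ge \sup_h \left\{ \mathcal{E}_{\mathcal{A}_h}(X) - \log_2(\Delta_h + 1) \right\},$$

where $\Delta_h$ is the maximal degree in the graph on $\mathcal{A}_h \times \mathcal{A}_h$ defined by connecting $A \in \mathcal{A}_h$ with $B \in \mathcal{A}_h$ if $\inf_{x \in A, y \in B} \|x - y\|_p < \varepsilon$. We set $h = 1/n$ and use

$$\mathbb{E}(T) \ge \limsup_{n \to \infty} \left\{ \mathcal{E}_{\mathcal{A}_{1/n}}(X) - \log_2(\Delta_{1/n} + 1) \right\}.$$

Observe that if $B_r$ denotes the $\ell_p$-ball of radius $r$ centered at 0, then by elementary geometric considerations,

$$\frac{\lambda(B_\varepsilon)}{h^d} \le \Delta_h \le \frac{\lambda\left(B_{\varepsilon + 2 h d^{1/p}}\right)}{h^d},$$

so that as $n \to \infty$,

$$\Delta_{1/n} \sim n^d \lambda(B_\varepsilon) = V_{d,p} \, \varepsilon^d n^d.$$

Also, $\mathcal{E}_{\mathcal{A}_{1/n}}(X) = \mathcal{E}(f) + d \log_2 n + o(1)$, so that

$$\mathcal{E}_{\mathcal{A}_{1/n}}(X) - \log_2(\Delta_{1/n} + 1) = \mathcal{E}(f) + d \log_2 n - \log_2\left(1 + V_{d,p} \, \varepsilon^d n^d (1 + o(1))\right) + o(1) \to \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} - \log_2 V_{d,p}.$$

4 Upper bound for partition-based algorithms

Consider a random variable $X \in \mathbb{R}^d$. We call a partition $\mathcal{A}$ an $\varepsilon$-partition if for every set $A \in \mathcal{A}$, there exists $x_A \in A$ (called the center) such that

$$\sup_{y \in A} \|x_A - y\|_p \le \varepsilon.$$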

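Then any algorithm that selects $A \in \mathcal{A}$ with probability $p(A) \stackrel{\text{def}}{=} \mathbb{P}\{X \in A\} = \int_A f$ can be used to generate a random variable $Y$ that approximates $X$ to within $\varepsilon$. After generating $A$, set $Y = x_A$. Then, necessarily, there is a coupling $(X, Y)$ with $\|X - Y\|_p \le \varepsilon$. If the selection of $A$ is done with the help of the method of Knuth and Yao, then, if $T$ still denotes the number of random bits required,

$$\mathbb{E}(T) \le \mathcal{E}_{\mathcal{A}}(X) + 2.$$

For $p = \infty$, we can take $\mathcal{A} = \mathcal{A}_{2\varepsilon}$, the cubic partition with sides $2\varepsilon$. For $d = 1$, a simple partition into intervals of length $2\varepsilon$ can be used for all values of $p$. If $X$ has a density $f$ and $p = \infty$ or $d = 1$, then the procedure suggested above has, as $\varepsilon \downarrow 0$,

$$\mathbb{E}(T) \le \mathcal{E}_{\mathcal{A}_{2\varepsilon}}(X) + 2 = \mathcal{E}(f) + d \log_2 \frac{1}{2\varepsilon} + 2 + o(1) = \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} + 2 - d + o(1),$$

where in the last step we assume Rényi's condition and $\mathcal{E}(f) > -\infty$. Compare with the lower bound

$$\mathbb{E}(T) \ge \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} - d,$$

and note that the difference is $2 + o(1)$. For later reference, we recall the values of $\mathcal{E}(f)$ for the main distributions:

Uniform[0, 1]: $\mathcal{E}(f) = 0$,
Exponential: $\mathcal{E}(f) = \log_2 e$,
Normal(0, 1): $\mathcal{E}(f) = \log_2 \sqrt{2\pi e}$.

Recall that for $X \in \mathbb{R}$, $a > 0$, a scale factor $a$ shows up as $\log_2 a$ in the upper and lower bounds because $\mathcal{E}(aX) = \mathcal{E}(X) + \log_2 a$.

For general $p \in [1, \infty)$, we can take $\mathcal{A} = \mathcal{A}_{2\varepsilon / d^{1/p}}$, the cubic partition with sides $2\varepsilon / d^{1/p}$. Under Rényi's condition and $\mathcal{E}(f) > -\infty$, we have

$$\mathbb{E}(T) \le \mathcal{E}_{\mathcal{A}_{2\varepsilon / d^{1/p}}}(X) + 2 = \mathcal{E}(f) + d \log_2 \frac{1}{\varepsilon} + 2 - d + \frac{d}{p} \log_2 d + o(1).$$

The difference with the lower bound is

$$D = 2 + \frac{d}{p} \log_2 d + d \log_2 \Gamma\left(\frac{1}{p} + 1\right) - \log_2 \Gamma\left(\frac{d}{p} + 1\right) + o(1).$$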

Using $\Gamma(1 + u) \ge (u/e)^u \sqrt{2\pi u}$, $u > 0$, we obtain

$$D \le 2 + d \log_2\left( \Gamma\left(\frac{1}{p} + 1\right) (ep)^{1/p} \right) - \frac{1}{2} \log_2\left(\frac{2\pi d}{p}\right) + o(1),$$

which unfortunately increases linearly with $d$. To avoid this growing differential (which we did not have for $p = \infty$), it seems necessary to consider partitions that better approximate $\ell_p$-balls.

5 Upper bound for inversion-based algorithms

The inversion method for generating a random variable $X$ with distribution function $F$ uses the property that $X = F^{-1}(U)$ has distribution function $F$, where $F^{-1}$ denotes the inverse, and $U$ is uniform $[0, 1]$. One can use this method as a basis for generating an approximation using only a few random bits. In particular, if

$$U = 0.U_1 U_2 \ldots = \sum_{j \ge 1} \frac{U_j}{2^j},$$

and $U_1, U_2, \ldots$ are independent Bernoulli(1/2) random variables, then setting

$$U_t = 0.U_1 \ldots U_t, \qquad U_t^+ = 0.U_1 \ldots U_t + \frac{1}{2^t},$$

we have $U_t \le U \le U_t^+$. Note that $U_0 = 0$, $U_0^+ = 1$.

Figure 3: Inversion method illustrated. The plot shows $F(x)$, with $U_t \le U \le U_t + 2^{-t}$ on the vertical axis and $X_t = F^{-1}(U_t)$, $X = F^{-1}(U)$, $X_t^+ = F^{-1}(U_t + 2^{-t})$ on the horizontal axis.

The number of random coins is

$$T = \min\left\{ t \ge 0 : F^{-1}(U_t^+) - F^{-1}(U_t) \le 2\varepsilon \right\}.$$
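The stopping rule for $T$ translates directly into code. The following is a minimal sketch under stated assumptions: `quantile` plays the role of the oracle $F^{-1}$ and must accept exact `Fraction` arguments, `next_bit` is a hypothetical fair-coin source, and returning the midpoint of the final interval is one natural choice of output that keeps the error within $\varepsilon$.

```python
import math
from fractions import Fraction

def exp_quantile(u):
    """Inverse distribution function of the rate-one exponential; +inf at u = 1."""
    rest = 1 - u
    return math.inf if rest <= 0 else -math.log(float(rest))

def sample_by_inversion(quantile, eps, next_bit):
    """Return (Y, T): an eps-accurate output and the number of fair bits consumed."""
    lo, width, t = Fraction(0), Fraction(1), 0      # U_0 = 0 and U_0^+ = 1
    # Keep reading bits until F^{-1}(U_t^+) - F^{-1}(U_t) <= 2 * eps.
    while quantile(lo + width) - quantile(lo) > 2 * eps:
        width /= 2
        if next_bit():
            lo += width                              # bit t+1 equals 1: move to the upper half
        t += 1
    # Midpoint of [F^{-1}(U_T), F^{-1}(U_T^+)], at distance at most eps from F^{-1}(U).
    return (quantile(lo) + quantile(lo + width)) / 2, t
```

For the uniform $[0, 1]$ law one can pass `lambda u: float(u)` as `quantile`; the loop then stops after exactly $\lceil \log_2(1/(2\varepsilon)) \rceil$ bits, whatever the bits are.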

If we define

$$Y = \frac{F^{-1}(U_T^+) + F^{-1}(U_T)}{2},$$

then $X$ and $Y$ are coupled in such a way that $|X - Y| \le \varepsilon$. The $T$ defined above is also the number of bits needed to generate $Y$. Observe that the inversion method requires $F^{-1}$ in a black box, also called an oracle. On the other hand, it avoids the cumbersome calculation of the cell probabilities and the set-up of the Knuth-Yao DDG tree, and thus shines by its simplicity. In spirit, the inversion method mimics the method of Han and Hoshi, and indeed, this observation leads to a simple bound.

Let $\mathcal{A}_{2\varepsilon}$ be a cubic partition of $\mathbb{R}$ into intervals of equal length $2\varepsilon$. Denote the probabilities of these intervals by $p(A) = \mathbb{P}\{X \in A\}$, $A \in \mathcal{A}_{2\varepsilon}$. Assume that we select an interval from $\mathcal{A}_{2\varepsilon}$ following this law by the method of Han and Hoshi, using the random bits $U_1, U_2, U_3, \ldots$ also used in the inversion method. It is easy to see that the number of bits needed before halting in the inversion method is not larger. Therefore, for the inversion method, we have

$$\mathbb{E}(T) \le \mathcal{E}_{\mathcal{A}_{2\varepsilon}}(X) + 3.$$

From this and Lemma 3, we conclude:

Theorem 4. If $X$ has a density $f$ satisfying Rényi's condition and if $\int f \log_2(1/f) > -\infty$, then as $\varepsilon \downarrow 0$,

$$\mathbb{E}(T) \le \log_2 \frac{1}{\varepsilon} + \int f \log_2 \frac{1}{f} + 2 + o(1).$$

Remark 2. Comparing Theorem 4 with the universal lower bound $\mathbb{E}(T) \ge \log_2(1/\varepsilon) + \mathcal{E}(f) - 1$, we see that the difference is $3 + o(1)$.

Remark 3. The partition-based method has an upper bound for $\mathbb{E}(T)$ that is one less. However, the simplicity of the inversion method should not be underestimated. In addition, one can tighten the analysis under additional conditions on $f$ such as unimodality or monotonicity, or for specific forms.

Theorem 5. Assume that $X$ has a bounded nonincreasing density $f$ on $[0, \infty)$. Then for the inversion method, as $\varepsilon \downarrow 0$,

$$\mathbb{E}(T) \le \log_2 \frac{1}{\varepsilon} + \int f \log_2 \frac{1}{f} + o(1).$$

Proof of Theorem 5. Define $X_t = F^{-1}(U_t)$, $X_t^+ = F^{-1}(U_t^+)$ as in Figure 3. Then

$$\mathbb{E}(T) = \sum_{t \ge 0} \mathbb{P}\left\{ X_t^+ - X_t > 2\varepsilon \right\}$$

14 t t { P fx t + < } 2 t 2 because X + 2 t t f fx t + X+ t X t X t { P fx < } { 2 t + P fx t + 2 < } 2 t 2 < fx I + II. t Now, { } I E + log 2 2fX log 2 + f log 2 f even if the latter integral is infinite. The Theorem follows if we can show that II o. To this end, note that { II P fx t + < } 2 t 2 fx t. t For a fixed value of t, we see that fx t + < 2 t 2 fx t only if X falls in the interval that captures the value 2 t 2, if such an interval exists. But the probability of each interval is precisely /2 t. However, if 2 t 2 > f, then no such interval exists. Thus, II {t 2 t } log 2 2f t 4f o. For the uniform [, ] density, we have Ef, and so the bound of Theorem 5 is ET log 2 + o. However the o can be omitted in this case as the following simple calculation shows: ET P{T > t} t t t t 2 t i 2 t i {F i+ 2 t F i 2 t >2} 2 t { i+ 2 t i 2 t >2} 2 t {t log2 2 } 4

15 log 2 + log 2 2 log Theorem 5 improves over the bound for the partition method for d by + o. Under other regularity conditions, one can hope to obtain similar bounds that beat the partition bound. For the exponential density, the inversion method yields ET log 2 + Ef + o log 2 + log 2 e + o, where log 2 e Flajolet and Saheb [2] proposed a method for the exponential law that has E{T } log ϕ, where ϕ.2 as. For the normal law, Karney [3] proposes a method that addresses the variable approximation issue but does not offer explicit bounds. Inversion would yield E{T } log 2 + log 2 2πe + o, but the drawback is that this requires the presence of an oracle for F, the inverse gaussian distribution function. Even the partition method requires a nontrivial oracle, namely F. To sidestep this, one can use a slightly more expensive method based on the Box-Müller [2], which states that the pair of random variables 2EV, 2EV 2 with E exponential and V, V 2 uniform on the unit circle, provides a standard gaussian in R 2 of zero mean and unit covariance matrix. The random variable 2E is Maxwell, i.e., it has density re r2 /2, r >, and its differential entropy is Ef Maxwell f Maxwell r log 2 dr log 2 log , f Maxwell r f Maxwell r log r γ log2 + + r2 2 where γ is the Euler-Mascheroni constant. We sketch the procedure, which also serves as an example for more complicated random variate generation problems. Assume that the two normals are required with -accuracy each this corresponds to the choice of d 2 and p. Then we first generate a Maxwell random variable M by inversion, noting that F r e r2 2, dr 5

16 F u 2 log u. The Maxwell random variable M is needed with 2-accuracy. The Maxwell law is unimodal with mode at r. Its left piece has probability e. So we first pick a piece randomly using on average no more than two bits. The we apply inversion on the appropriate piece. By Theorem 5, we use T random bits where 2 ET log 2 + E f Maxwell o. The generated approximation is called M. Next we generate a uniform random variable U [, 2π with accuracy /2 M + /2. The generated value U [, 2π has U U /2 M +/2. Since U has differential entropy log 22π, we see that the expected number of bits, T 2, needed is bounded by M + /2 ET 2 E log 2 /2 M + /2 E log 2 log log 2 2π + o + log 2 2π + o /2 + E log 2 M + log 2 2π + o by the dominated convergence theorem. Then we return M sinu, M cosu, and claim that jointly, To see this, note that and similarly for the cosine. Next, M sinu M sinu M cosu M cosu. sinu sinu U U /2 M + /2 M sinu M sinu M M sinu + M sinu sinu M M + M U U 2 + M /2 M + /

17 Putting everything together, we see that the total expected number of bits is not more than 2 ET + ET 2 2 log 2 + E log 2 M + Ef Maxwell log 2 2π + o 2 log log 2 2πe + o 2 log o. The lower bound for generating two independent gaussians is 4 + o less, i.e., 2 log log 2 2πe 2 2 log Batch generation 6. Randomness extraction Turning a sequence of i.i.d. random variables X, X 2,... into a sequence of i.i.d. Bernoulli /2 bits has been the subject of many papers. The setting of interest to us is the following. Let F, F 2,... be a possibly infinite number of cumulative distributions supported on the positive integers. Let p, p 2,... be a fixed probability vector. Let X, X, X 2,... be i.i.d. random integers drawn from p, p 2,.... Given X, X 2,..., X n for n N, draw Y, Y 2,..., Y n independently from the distributions F X, F X2,..., F Xn. As a special case, we have the classical setting when p and then Y, Y 2,..., Y n are i.i.d. according to F. Let F, F 2,... have binary entropies given by E, E 2,..., all assumed to be finite. In other words the entropy of Y X i is denoted by E i. Assume also that E def p i E i <. i Theorem 6. There exits an algorithm described below that, upon input X, Y, X 2, Y 2,..., X n, Y n outputs a sequence of R n i.i.d. Bernoulli /2 bits where R n n p E as n. Furthermore, these bits are independent of X,..., X n. Theorem 6 describes how many perfect random bits we can extract from Y, Y 2,..., Y n, i.e., R n should be near the information content, the entropy of Y, Y 2,... Y n. Not surprisingly then, the way to achieve this can be inspired by the optimal or near-optimal methods of compression, and, in particular, arithmetic coding. Note that one can assume Y i F X i V i for i {,..., n} where V, V 2,..., V n are i.i.d. uniform random variables on [, ]. For all i N, let the binary expansions of p i be p i j b ij 2 j where b ij {, }. 7

18 def For convenience, p ij b ij for all i, j N N. Also, for all i N, let 2 j { if j, F i j j p i k p ik if j. Following the methodology of arithmetic coding, associate a uniform[, ] random variable U with binary expansion.u U 2 with the infinite data sequence X, Y,..., X n, Y n. The bits U i, i, are i.i.d. Bernoulli/2 and independent of X, X 2, X 3,... To be more precise, consider this algorithm. Algorithm 2 Randomness extraction Input: A sequence of pairs X, Y,..., X n, Y n with X l and Y l as described previously for l {,..., n}. : U 2: U + 3: for l to n do 4: U l U l + U + l U l FXl Y l 5: U + l U l + U + l U l FXl Y l 6: end for 7: R n max { t : 2 t U n 2 t U + n } {R n is the number of bits of the longest prefix common to both U n and U + n.} 8: return 2 Rn U n To verify the correctness of Algorithm 2, the intervals [U l, U + l ] are nested. More precisely, for all l, [U l, U + l ] [U l+, U + l+ ]. Define U lim sup n U n For every iteration, we have lim inf n U + n, and note U [U n, U + n ] U U j U + j U j Since R n max { t : 2 t U n 2 t U + n }, The bits U, U 2,..., U Rn n [U l, U + l ]. l L Uniform[U j, U + j ]. 2 Rn 2Rn U Un U U n + 2 Rn U +. 2 Rn are clearly i.i.d. Proof of Theorem 6. Let t R and consider the two cases {R n t} and {R n < t}. We show that t is concentrated around ne. Before considering the two cases, we compute the useful quantity px E log 2 log p 2 p i + log 2 p ij X Y p i j ij p i log 2 p i + p ij log 2 p ij i i j 8

19 EX + p i EX + def E. i j p i E i + EX i p ij pi log p 2 + log i p 2 ij pi Note that { Rn t } {U + n U n < 2 t } { n l l p Xl Y l p Xl < 2 t } { n pxl log 2 p Xl Y l } > t. 7 The pairs X, Y,..., X n, Y n are i.i.d. and therefore, by the previous calculation, px E log 2 E. p X Y By the law of large numbers, for all >, P{R n > ne + } as n. We have { Rn < t } { } 2 t U n + > 2 t Un. By the law of large numbers, for all >, if t ne, P {U n + Un } { n pxl 2 t P log 2 p Xl Y l For an arbitrary fixed integer k >, { P{R n < t + k} P U + n U n 2 t+k o k, li } t. } { + P U n + Un < } 2 t+k, 2t U n + > 2 t Un { which is as small as desired by the choice of k. The 2/2 k term is due to the fact that the event U + n Un <, 2 t U + 2 t+k n > 2 t Un } occurs only if U m 2 and m N. t 2 t+k 6.2 Batch generation algorithm based on a general DDG tree algorithm Assume given a random variable X N with fixed probability vector p, p 2,... of finite entropy denoted by EX. Assume that we employ a given DDG tree based algorithm for the generation of X. In this tree, let L be the set of leaves and let labelu be the label of leaf u L. Define Let du be the depth of u L. Then we have L i {u L : labelu i} for i N. P{X i} def p i u L i 2 du. 9

If the algorithm returns a variable $X$, then we know that we must have exited via a leaf in $L_X$. Given $X = i$, each exit-leaf has a given probability:

$$\mathbb{P}\{\text{Exit via leaf } u \in L_i\} = \frac{1/2^{d(u)}}{p_i}.$$

Let us call the random exit-leaf $Y$. The DDG tree algorithm thus returns a pair $(X, Y)$. We have

$$\sum_i p_i \, \mathcal{E}(Y \mid X = i) = \sum_i p_i \sum_{u \in L_i} \frac{1}{p_i 2^{d(u)}} \log_2\left(p_i 2^{d(u)}\right) = \sum_i \sum_{u \in L_i} \frac{1}{2^{d(u)}} \log_2 2^{d(u)} + \sum_i p_i \log_2 p_i$$

$$= \sum_{u \in L} \frac{1}{2^{d(u)}} \log_2 2^{d(u)} - \mathcal{E}(X) = \mathcal{E}(Y) - \mathcal{E}(X).$$

For example, for the Knuth-Yao DDG algorithm, we have $\mathcal{E}(Y) - \mathcal{E}(X) \le 2$, while for the Han-Hoshi algorithm, we have $\mathcal{E}(Y) - \mathcal{E}(X) \le 3$. Our method of batch generation will be valid for any DDG tree with finite $\mathcal{E}(Y)$.

The algorithm for batch generating i.i.d. random variables $X_1, X_2, \ldots, X_n$ uses an atomic operation FetchBit that first gets a bit from a queue $Q$ if this queue is not empty, and otherwise gets a bit from a Bernoulli(1/2) generator. It is understood that FetchBit drives the DDG tree algorithm.

Algorithm 3 Batch generation
1: $Q \leftarrow \emptyset$ {Initially, the queue is empty.}
2: $R_0 \leftarrow 0$ {There is no recycled bit initially.}
3: for $i = 1$ to $n$ do
4:   Generate $(X_i, Y_i)$ by a DDG tree algorithm. {The DDG algorithm uses the operation FetchBit to get bits either from the source or from the queue $Q$.}
5:   return $X_i$
6:   Feed $(X_i, Y_i)$ to the retrieval algorithm (randomness extraction procedure), and recover $R_i - R_{i-1}$ bits which are added to $Q$.
7: end for

Theorem 7. The batch generation algorithm uses $N_n$ random bits, where

$$\frac{N_n}{n} \stackrel{p}{\to} \mathcal{E}(X) \quad \text{as } n \to \infty,$$

whenever $\mathcal{E}(Y) < \infty$.

Remark 4. By the Knuth-Yao lower bound, $\mathbb{E}(N_n) \ge n \mathcal{E}(X)$, and therefore, the procedure is asymptotically optimal to within $o_p(n)$ bits. The symbol $o_p(n)$ means a quantity that is $o(n)$ in probability as $n \to \infty$.
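The quantity $\mathcal{E}(Y)$ is easy to evaluate for a Knuth-Yao tree, because symbol $i$ has one leaf at depth $j$ for every 1-bit $b_{i,j}$ of $p_i$, and $\mathcal{E}(Y) = \sum_{u \in L} d(u) / 2^{d(u)}$ is then also the expected number of bits per draw. The sketch below assumes dyadic probabilities (finite binary expansions); the function names and the example vector are illustrative, not taken from the paper.

```python
from fractions import Fraction
import math

def ddg_leaf_depths(p, max_depth=64):
    """Depths of the Knuth-Yao leaves for a symbol of dyadic probability p."""
    depths, rest = [], Fraction(p)
    for j in range(0, max_depth + 1):
        weight = Fraction(1, 2 ** j)
        if rest >= weight:          # bit j of the binary expansion of p is 1
            depths.append(j)
            rest -= weight
    if rest != 0:
        raise ValueError("p is not dyadic within max_depth bits")
    return depths

def leaf_entropy(probs):
    """E(Y): sum over all leaves u of d(u) / 2^d(u)."""
    return sum(Fraction(d, 2 ** d) for p in probs for d in ddg_leaf_depths(p))

def entropy_bits(probs):
    """Binary entropy E(X) of the probability vector."""
    return sum(float(p) * math.log2(1 / float(p)) for p in probs if p > 0)

probs = [Fraction(5, 16), Fraction(5, 16), Fraction(3, 8)]
e_y = leaf_entropy(probs)
e_x = entropy_bits(probs)
print(float(e_y), e_x, float(e_y) - e_x)
```

The difference printed last is the number of bits per draw that the extraction step of Algorithm 3 can hope to recycle.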

21 Proof of Theorem 7. We choose a large integer constant k and look at N nk. Let Q t be the size of the queue at time t, and set Q. For j {,..., nk}, let T j be the number of bits needed for generating X j without extraction. The T j are i.i.d. random variables. Then we have the following simple identity: nk N nk T j R nk + Q nk. By the law of large numbers, j T + T T nk ET + T T nk p as n. We note that ET + T T nk nkey because random variables T j are i.i.d. By Theorem 6, we also have that R nk /nk p EY EX as n. Therefore, N nk nk EY + o p EY EX + o p + Q nk nk EX + o p + Q nk nk. The result follows if Q nk /nk p as n. For this, we need only consider an upper bound, since Q nk, and then Q nj Q nj + R nj R nj min j k {T nj T nj, Q nj }. Since R n /n p EY EX, we have R max nj R nj j k n EY EX p 8 and T max nj T nj j k n EY p. 9 Fix >, and let A be the event that both lefthand sides in 8 and 9 are less than, so that P{A c } o. The critical observation is that on A, Q nj + EY EX + n EY n Q nj if Q nj EY n, Q nj + EY EX + n Q nj else. and therefore, max { Q nj, EY EX + }, if 2 EX, max Q nj EY EX + n j k and Q nk nk EY EX +. k 2

22 If we choose k large enough such that EY EX + /k, then { } Qnk P nk > P{A c } o. If batch generation is applied to the partition method for continuous distributions, and the Knuth-Yao or Han-Hoshi method are used for the discrete part of that method, then the expected number of random bits needed per random variable, under Rényi s condition and Ef >, is bounded from above by d log 2 + Ef d + o, thus matching the lower bound for the case p, and all dimensions. 7 Conclusion and outlook Using the notion of maximal coupling between a generated random variable and a theoretical target random variable, we were able to lay the groundwork for a theory of generating with universal lower bound in terms of the number of random Bernoulli/2 bits needed. In order to grace the world s software libraries with variable accuracy generators, much more work is needed, both algorithmic and theoretic. We have shown that the well-known inversion method is nearly optimal in an information-theoretic sense, and will submit further work and other paradigms in the near future. Appendix: Generation of an exponential by convolution The method given in this appendix is based on taking the sum of independent random variables. Using convolution does not require an oracle for F or F which were required by the algorithms given in this paper. Its range of applicability seems however restricted to the uniform and the exponential as suggested by Kakutani s result [3] that can be stated as follows: Theorem 8 Kakutani 948. For all i N, let p i [, ] and let X i be independent Bernoulli random variables such that P{X i } p i. If X i X i2 i, then X is singular X is absolutely continuous X is discrete i i p i 2 2 diverges, p i 2 2 converges, 2 + pi >. 
First of all, as shown in [14], if $X$ is an exponential random variable with mean 1, then $\lfloor X \rfloor$ is distributed as a geometric random variable with parameter $1/e$, and $\{X\}$, the fractional part of $X$, is distributed as a truncated exponential random variable on the interval $[0, 1)$, and $\lfloor X \rfloor$ and $\{X\}$ are independent. We therefore concentrate on the fractional part. Theorem 9 tells us that the fractional part is the convolution of independent Bernoulli random variables.
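As a quick sanity check of this decomposition, the factorization can be written out directly. This is a routine computation added for convenience; it assumes the geometric law is supported on $\{0, 1, 2, \ldots\}$, which is our reading of "parameter $1/e$". For every integer $n \ge 0$ and every $u \in [0, 1)$:

```latex
\begin{align*}
\mathbb{P}\{\lfloor X\rfloor = n,\ \{X\}\le u\}
 &= \mathbb{P}\{n \le X \le n+u\} = e^{-n} - e^{-(n+u)} \\
 &= \underbrace{(1-e^{-1})\,e^{-n}}_{\mathbb{P}\{\lfloor X\rfloor = n\}}
    \cdot
    \underbrace{\frac{1-e^{-u}}{1-e^{-1}}}_{\mathbb{P}\{\{X\}\le u\}} .
\end{align*}
```

The joint probability is a product of a function of $n$ and a function of $u$, which gives the independence, and differentiating the second factor in $u$ gives the density $e^{-u}/(1 - e^{-1})$ on $[0, 1)$.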

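Theorem 9. Let $X_1, \ldots, X_j, \ldots$ be a sequence of independent Bernoulli distributed random variables with

$$\mathbb{P}\{X_j = 2^{-j}\} = p_j \in [0, 1], \quad \text{and} \quad \mathbb{P}\{X_j = 0\} = 1 - p_j \quad \text{for all } j \in \mathbb{N}.$$

Let $X = \sum_{j=1}^{\infty} X_j$. If

$$\frac{p_j}{1 - p_j} = e^{-1/2^j},$$

then $X$ is a truncated exponential random variable, i.e., the p.d.f. of $X$ is

$$f_X(x) = \frac{e^{-x}}{1 - e^{-1}} \quad \text{for } x \in [0, 1].$$

Proof of Theorem 9. Since

$$p_j = \frac{e^{-1/2^j}}{e^{-1/2^j} + 1},$$

the Fourier transform of $X_j$ is

$$\mathbb{E}\, e^{\imath X_j t} = p_j e^{\imath t/2^j} + 1 - p_j = \frac{e^{(-1+\imath t)/2^j} + 1}{e^{-1/2^j} + 1}.$$

Since $X$ is the sum of the independent $X_j$'s, we have

$$\mathbb{E}\, e^{\imath X t} = \prod_{j=1}^{\infty} \mathbb{E}\, e^{\imath X_j t} = \prod_{j=1}^{\infty} \frac{e^{(-1+\imath t)/2^j} + 1}{e^{-1/2^j} + 1} = \frac{1 - e^{-1+\imath t}}{(1 - e^{-1})(1 - \imath t)},$$

which is the Fourier transform of $f_X(x)$.

We can thus generate $\{X\}$ with precision $\epsilon$ if we set $k = \lceil \log_2(1/\epsilon) \rceil$, and let

$$Y = \sum_{j=1}^{k} X_j = \sum_{j=1}^{k} 2^{-j}\, \mathrm{Bernoulli}(p_j).$$

Since a raw Bernoulli random variable requires 2 bits on average, this simple method, which has accuracy guarantee $|Y - \{X\}| \le \epsilon$, uses on average not more than $2k = 2\lceil \log_2(1/\epsilon) \rceil$ bits. On the other hand, the lower bound for $\{X\}$ is

$$\mathbb{E}\{T\} \ge \log_2 \frac{1}{\epsilon} + \mathcal{E}(\{X\}) - 1, \quad \text{where} \quad \mathcal{E}(\{X\}) = \log_2(e - 1) - \frac{\log_2 e}{e - 1}.$$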
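The convolution method above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bernoulli` and `truncated_exponential` are hypothetical, the Bernoulli($p_j$) draw uses the standard bit-by-bit comparison of a lazily generated uniform with the binary expansion of $p_j$ (one way to reach the 2-bit average mentioned in the text), and each $p_j$ is computed in double precision, so the sketch is a numerical illustration rather than an exact sampler.

```python
import math
import random


def bernoulli(p, coin):
    """Draw Bernoulli(p) from fair bits; return (outcome, bits_used).

    Compares the bits of a uniform U = 0.u1u2... with the binary expansion
    of p and stops at the first disagreement, so the outcome is 1 iff U < p.
    Each comparison stops with probability 1/2: 2 bits on average.
    """
    bits_used = 0
    while True:
        p *= 2.0
        p_bit = 1 if p >= 1.0 else 0   # next binary digit of p
        p -= p_bit
        u_bit = coin()                 # next binary digit of U
        bits_used += 1
        if u_bit != p_bit:
            return (1 if u_bit < p_bit else 0), bits_used


def truncated_exponential(eps, coin):
    """Return (Y, bits_used) with Y = sum_{j<=k} 2^-j * Bernoulli(p_j)."""
    k = max(1, math.ceil(math.log2(1.0 / eps)))
    y, bits_used = 0.0, 0
    for j in range(1, k + 1):
        # p_j solves p_j / (1 - p_j) = exp(-2^-j), as in Theorem 9.
        p_j = 1.0 / (1.0 + math.exp(2.0 ** -j))
        b, used = bernoulli(p_j, coin)
        y += b * 2.0 ** -j
        bits_used += used
    return y, bits_used


if __name__ == "__main__":
    rng = random.Random(2015)          # seeded stand-in for a fair-coin source
    coin = lambda: rng.getrandbits(1)
    y, bits = truncated_exponential(2.0 ** -16, coin)
    print(f"Y = {y:.6f} using {bits} fair bits")
```

Averaged over many calls, the bit count should be close to $2k$, and the sample mean of $Y$ should be close to $(e-2)/(e-1)$, the mean of the truncated exponential on $[0, 1]$.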

The factor 2 in $2 \log_2(1/\epsilon)$ can be avoided when batch generation is used. It can also be eliminated at a tremendous storage cost if the vector $(2X_1, \ldots, 2^k X_k)$ is generated using the algorithm of Knuth-Yao, since we know the individual probabilities, i.e.,

$$\mathbb{P}\big\{ (2X_1, \ldots, 2^k X_k) = (2x_1, \ldots, 2^k x_k) \big\} = \prod_{j=1}^{k} p_j^{x_j 2^j} (1 - p_j)^{1 - x_j 2^j}$$

for all $(2x_1, \ldots, 2^k x_k) \in \{0, 1\}^k$. One can show that

$$\mathcal{E}(2X_1, \ldots, 2^k X_k) = \sum_{j=1}^{k} \mathcal{E}(X_j) = \sum_{j=1}^{k} \left( p_j \log_2 \frac{1}{p_j} + (1 - p_j) \log_2 \frac{1}{1 - p_j} \right) = k + \log_2(e - 1) - \frac{\log_2 e}{e - 1} + o(1).$$

Thus, for the Knuth-Yao method for this vector,

$$\mathbb{E}\{T\} \le \sum_{j=1}^{k} \mathcal{E}(X_j) + 2 = k + \log_2(e - 1) - \frac{\log_2 e}{e - 1} + 2 + o(1).$$

Again using Knuth-Yao, a geometric($1/e$) random variable can be generated exactly using not more than

$$\frac{\log_2 e}{e - 1} + \log_2 \frac{e}{e - 1} + 2$$

random bits. Adding the two bounds, and using $k = \lceil \log_2(1/\epsilon) \rceil \le \log_2(1/\epsilon) + 1$, an exponential random variable generated by this method has an overall expected complexity

$$\mathbb{E}\{T\} \le k + \log_2 e + 4 + o(1) \le \log_2 \frac{1}{\epsilon} + \log_2 e + 5 + o(1) \quad \text{as } \epsilon \to 0.$$

Acknowledgment

Both authors thank Tamas Linder for his help. Claude Gravel wants to thank Gilles Brassard from Université de Montréal for his financial support.

References

[1] D. E. Knuth and A. C.-C. Yao, "The complexity of nonuniform random number generation," in Algorithms and Complexity: New Directions and Recent Results, J. F. Traub, Ed. New York: Carnegie-Mellon University, Computer Science Department, Academic Press, 1976, reprinted in Knuth's Selected Papers on Analysis of Algorithms (CSLI, 2000).

[2] P. Flajolet and N. Saheb, "The complexity of generating an exponentially distributed variate," Journal of Algorithms, vol. 7, 1986.
[3] C. F. F. Karney, "Sampling exactly from the normal distribution," 2013, arXiv.
[4] J. Lumbroso, "Probabilistic algorithms for data streaming and random generation," Ph.D. dissertation, Université Pierre et Marie Curie - Paris 6, 2012.
[5] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[6] A. Rényi, "On the dimension and entropy of probability distributions," Acta Mathematica Academiae Scientiarum Hungarica, vol. 10, 1959.
[7] I. Csiszár, "Some remarks on the dimension and entropy of random variables," Acta Mathematica Academiae Scientiarum Hungarica, vol. 12, 1961.
[8] T. S. Han and M. Hoshi, "Interval algorithm for random number generation," IEEE Transactions on Information Theory, vol. 43, no. 2, March 1997.
[9] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, 1948.
[10] I. Csiszár, "On the dimension and entropy of order α of the mixture of probability distributions," Acta Mathematica Academiae Scientiarum Hungarica, vol. 13, 1962.
[11] T. Linder and K. Zeger, "Asymptotic entropy-constrained performance of tessellating and universal randomized lattice quantization," IEEE Transactions on Information Theory, vol. 40, no. 2, 1994.
[12] G. E. Box and M. E. Muller, "A note on the generation of random normal deviates," Ann. Math. Stat., vol. 29, pp. 610-611, 1958.
[13] S. Kakutani, "On equivalence of infinite product measures," Annals of Mathematics, 1948.
[14] L. Devroye, Non-Uniform Random Variate Generation. Springer.
