MATH5835 Statistical Computing

Jochen Voss

September 27, 2011

Copyright © 2011 Jochen Voss <J.Voss@leeds.ac.uk>

This text is work in progress and may still contain typographical and factual mistakes. Reports about problems and suggestions for improvements are most welcome.

Contents

1 Introduction

2 Random Number Generation
  2.1 Pseudo Random Number Generators
    2.1.1 The Linear Congruential Generator
    2.1.2 Quality of Pseudo Random Numbers
    2.1.3 Pseudo Random Number Generators in Practice
  2.2 The Inverse Transform Method
  2.3 Rejection Sampling
    2.3.1 Uniform Distribution on Sets
    2.3.2 General Densities
  2.4 Other Methods
  2.5 Exercises

3 Monte Carlo Methods
  3.1 Basic Monte Carlo Integration
  3.2 Variance Reduction Methods
    3.2.1 Importance Sampling
    3.2.2 Antithetic Variables
    3.2.3 Control Variates
  3.3 Applications to Statistical Inference
  3.4 Exercises

4 Resampling Methods
  4.1 Bootstrap Methods
  4.2 Jackknife Methods
  4.3 Exercises

5 Markov Chain Monte Carlo Methods
  5.1 Markov Chains
  5.2 The Metropolis-Hastings Method
    5.2.1 Description of the Method
    5.2.2 Random Walk Metropolis Sampling
    5.2.3 Independence Sampler
  5.3 Exercises

6 The EM Algorithm

A Probability Reminders
  A.1 Events and Probability
  A.2 Conditional Probabilities
  A.3 Expectation

B Programming in R
  B.1 General advice
  B.2 R as a calculator
    B.2.1 Mathematical Operations
    B.2.2 Variables
    B.2.3 Data Types
  B.3 Programming Principles
    B.3.1 Principle 1: Don't Repeat Yourself
    B.3.2 Principle 2: Divide and Conquer
    B.3.3 Principle 3: Test your Code
  B.4 Random Number Generation
  B.5 Exercises

C Answers to the Exercises

Bibliography

Index

Chapter 1
Introduction

The use of computers in mathematics and statistics has opened up a wide range of techniques for studying otherwise intractable problems and for analysing very large data sets. Statistical computing is the branch of mathematics which concerns these techniques for situations which either directly involve randomness, or where randomness is used as part of a mathematical model. This text gives an overview of the foundations of and basic methods in statistical computing.

An area of mathematics closely related to statistical computing is computational statistics. In the literature there is no clear consensus about the differentiation between these two areas; sometimes the two terms are even used interchangeably. Here we take the point of view that statistical computing concerns computational methods which involve randomness as part of the method. Such methods are what this text is mainly interested in; typical examples of such techniques are Monte Carlo methods. In contrast, we take computational statistics to be the larger area of computational methods which can be applied in statistics, even if the methods themselves do not involve randomness. Examples of such techniques include methods to find the maximum of a density function in maximum likelihood estimation. Here, we will give such methods only a cursory treatment, and refer to the literature instead (see, for example, Thisted (1988) for an extensive discussion of such methods).

One of the most important ideas in statistical computing is that often properties of a stochastic model can be found experimentally, by using a computer to generate many random instances of the model, and then statistically analysing the resulting sample. The resulting methods are called Monte Carlo methods, and discussion of such methods forms the main body of this text.

This text is mostly self-contained; only basic knowledge of the beginnings of statistics and probability is required. The text should be accessible to an MSc or PhD student in Mathematics, Statistics, Physics or related areas. For reference, appendix A summarises the most important concepts and results from probability. Since we study computational methods, programming skills are required to get maximal benefit out of this text. Previous programming experience would of course be useful, but is not required: appendix B gives an introduction to the aspects of programming we require here and, starting from there, we provide many exercises for the reader to train his or her skills. While the main part of the text is independent of any specific programming language, the exercises use the statistical software package R (R Development Core Team, 2011) for implementation of our algorithms, and the text includes an introduction to programming in R in appendix B.

Notation

For reference, the following table summarises some of the notation used throughout this text.

N              the natural numbers: N = {1, 2, 3, ...}
R              the real numbers
(a_n)_{n∈N}    a sequence of (possibly random) numbers: (a_n)_{n∈N} = (a_1, a_2, ...)
[a, b]         an interval of real numbers: [a, b] = { x ∈ R | a ≤ x ≤ b }
{a, b}         the set containing a and b
U[0, 1]        the uniform distribution on the interval [0, 1]
U{-1, 1}       the uniform distribution on the two-element set {-1, 1}
A^c            the complement of a set: A^c = { x | x ∉ A }
A × B          the Cartesian product of the sets A and B: A × B = { (a, b) | a ∈ A, b ∈ B }
1_{X∈A}        indicator function of the event that the random variable X takes a value in the set A: 1_{X∈A} = 1 if X ∈ A and 0 else (see page 68)
1_A(x)         indicator function of the set A: 1_A(x) = 1 if x ∈ A and 0 else (see page 68)
X ~ µ          indicates that a random variable X is distributed according to a probability distribution µ
|S|            the number of elements in a finite set; in section 2.3 also the volume of subsets S ⊆ R^d
R^{S×S}        matrices where the columns and rows are indexed by elements of S (see page 45)
R^S            vectors where the components are indexed by elements of S (see page 45)

Chapter 2
Random Number Generation

The basis of all stochastic algorithms is formed by methods to generate random samples from a given model, e.g. to generate throws of a coin or of a die, Gaussian random numbers, random functions (e.g. in finance), or random graphs (e.g. in epidemiology). In this chapter we will discuss how to generate random samples from a given distribution on a computer.

Since computer programs are inherently deterministic, some care is needed when trying to generate random numbers on a computer. Usually the task is split into two steps:

a) Generate a sequence X_1, X_2, X_3, ... of independent, identically distributed (i.i.d.) random variables, uniformly distributed on the interval [0, 1] or on a finite set like {1, 2, ..., n}. Methods for this will be discussed in section 2.1.

b) Starting with the result of the first step, transform the random variables to construct samples of more complicated distributions. Different methods for such transformations are discussed later in this chapter, starting in section 2.2.

This split allows the process of generating randomness (which has its own set of concerns, discussed in section 2.1) to be separated from the process of transforming the randomness to have the correct distribution (which can often be done exactly, with mathematical rigour).

2.1 Pseudo Random Number Generators

There are two fundamentally different classes of methods to generate random numbers:

a) True random numbers are generated using some physical phenomenon which is random. Generating such numbers requires specialised hardware and can be expensive and slow. Classical examples of this include tossing a coin or throwing dice. Modern methods utilise quantum effects, thermal noise in electric circuits, the timing of radioactive decay, etc.

b) Pseudo random numbers are generated by computer programs. While these methods are normally fast and resource effective, a challenge with this approach is that computer programs are inherently deterministic and therefore cannot produce truly random output.

In this text we will only consider pseudo random number generators.

Definition 2.1. A pseudo random number generator (PRNG) is an algorithm which outputs a sequence of numbers that can be used as a replacement for an i.i.d. sequence of true random numbers.

In the definition, and throughout this text, the shorthand "i.i.d." is used as an abbreviation for the term "independent and identically distributed".

2.1.1 The Linear Congruential Generator

A (simple) example of a pseudo random number generator is given in the following algorithm:

Algorithm LCG (linear congruential generator)
input:
  m > 1 (the modulus),
  a ∈ {1, 2, ..., m-1} (the multiplier),
  c ∈ {0, 1, ..., m-1} (the increment),
  X_0 ∈ {0, 1, ..., m-1} (the seed).
output: a sequence X_1, X_2, X_3, ... of pseudo-random numbers.
1: for n = 1, 2, 3, ... do
2:   X_n ← (a X_{n-1} + c) mod m
3:   output X_n
4: end for

The sequence generated by the algorithm LCG consists of integers X_n ∈ {0, 1, 2, ..., m-1}. The output depends on the parameters m, a, c and on the seed X_0. If m, a and c are carefully chosen, the resulting sequence behaves similarly to a sequence of independent, uniformly distributed random variables. This is illustrated by the following example and, more extensively, by exercise 2.2.

Example 2.2. For parameters m = 8, a = 5, c = 1 and seed X_0 = 0, algorithm LCG gives the following output:

    n               1   2   3   4   5   6   7   8   9  ...
    5 X_{n-1} + 1   1   6  31  36  21  26  11  16   1  ...
    X_n             1   6   7   4   5   2   3   0   1  ...

The output

    1, 6, 7, 4, 5, 2, 3, 0, 1, 6, ...    (2.1)

shows no obvious pattern and could be considered to be a sample of a random sequence.

Since each new value X_n only depends on X_{n-1}, the generated series will repeat over and over again once it reaches a value X_n which has been generated before. In example 2.2 this happens for X_8 = X_0, and we get X_9 = X_1, X_10 = X_2 and so on. Since X_n can take only m different values, the output of a linear congruential generator starts repeating itself after at most m steps; the generated sequence is eventually periodic.

To work around this problem, typical values of m are very large, for example on the order of m = 2^32 as in example 2.4 below. The values for a and c are then chosen such that the generator actually achieves the maximally possible period length of m. A criterion for the choice of m, a and c is given in the following theorem (Knuth, 1981).

Theorem 2.3. The LCG has period m if and only if the following three conditions are satisfied:

a) m and c are relatively prime,
b) a - 1 is divisible by every prime factor of m, and
c) a - 1 is a multiple of 4 if m is a multiple of 4.

In the situation of the theorem, the period length does not depend on the seed X_0 and usually this parameter is left to be chosen by the user of the pseudo random number generator.

Example 2.4. Let m = 2^32, a = ... and c = ... . Since the only prime factor of m is 2 and c is odd, the values m and c are relatively prime and condition (a) of the theorem is satisfied. Similarly, condition (b) is satisfied, since a - 1 is even and thus divisible by 2. Finally, since m is a multiple of 4, we have to check condition (c); since a - 1 is a multiple of 4, this condition also holds. Therefore the linear congruential generator with these parameters m, a and c has period 2^32 for every seed X_0.

If the period length is longer than the number of random numbers we use, one can use the output of the linear congruential generator as a replacement for a sequence of i.i.d. random numbers which are uniformly distributed on the set {0, 1, ..., m-1}. The applications of random number generators that will be discussed in the remaining parts of this text mostly require a sequence of real-valued i.i.d. random variables, e.g. uniformly distributed on the interval [0, 1]. We can get a sequence (U_n)_{n∈N} of pseudo random numbers to replace an i.i.d. sequence of U[0, 1] random variables by setting U_n = X_n / m, where (X_n)_{n∈N} is the output of the linear congruential generator.
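As a minimal illustration, the recursion of algorithm LCG and the rescaling U_n = X_n / m can be written in a few lines of R; this is only a sketch, and exercise 2.1 below develops it into a proper function.

lcg <- function(n, m, a, c, X0) {
  X <- numeric(n)
  x <- X0
  for (i in 1:n) {
    x <- (a * x + c) %% m   # X_n = (a X_{n-1} + c) mod m
    X[i] <- x
  }
  X
}

lcg(10, m = 8, a = 5, c = 1, X0 = 0)        # 1 6 7 4 5 2 3 0 1 6, as in (2.1)
U <- lcg(10, m = 8, a = 5, c = 1, X0 = 0) / 8   # pseudo U[0,1) values via U_n = X_n / m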

2.1.2 Quality of Pseudo Random Numbers

Pseudo random number generators used in modern software packages like R or Matlab are more sophisticated (and more complicated) than the linear congruential generator presented here, but they still share the following characteristics:

- The sequence of random numbers generated by a pseudo random number generator depends on a seed. One can get different random sequences for different runs by choosing the seed based on variable quantities like the current time of day. Conversely, one can get reproducible results by setting the seed to a known, fixed value.

- The property that the output is eventually periodic is shared by all pseudo random number generators implemented in software. The period length is a measure of the quality of a pseudo random number generator.

- Another problem of the output of the LCG algorithm is that the generated random numbers are not independent (since each value is a deterministic function of the previous value). Again, to some extent this problem is shared by all PRNGs. There are methods available to quantify the effect and there are PRNGs which suffer less from this problem, e.g. the Mersenne Twister algorithm (Matsumoto and Nishimura, 1998).

- Finally, since pseudo random numbers are meant to be used as a replacement for i.i.d. sequences of true random numbers, a common way to test pseudo random numbers is to apply statistical tests (e.g. for testing independence) to the generated sequences. Random number generators used in practice pass such tests without problems.

2.1.3 Pseudo Random Number Generators in Practice

This section contains advice on using pseudo random number generators in practice. First, it is almost always a bad idea to implement your own pseudo random number generator: finding a good algorithm for pseudo random number generation is a difficult problem, and even when an algorithm is available, given the nature of the generated output, it can be a challenge to spot and remove all mistakes in the implementation. Therefore, it is normally advisable to use a well-established method for random number generation, typically the random number generator built into a well-known software package or provided by a well-established library.

A second consideration concerns the rôle of the seed. While different pseudo random number generators differ greatly in implementation details, they all use a seed (like the value X_0 in algorithm LCG) to initialise the state of the random number generator. Often, when non-predictability is required, it is useful to set the seed to some volatile quantity (like the current time) to get a different sequence of random numbers for different runs of the program. At other times it can be more useful to get reproducible results, for example to aid debugging or to ensure repeatability of published results. In these cases, the seed should be set to a known, fixed value.
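In R the seed is set with the set.seed function; a minimal sketch of the two uses of the seed described above is:

set.seed(12345)          # fixed seed: reproducible output on every run
x1 <- runif(5)
set.seed(12345)
x2 <- runif(5)
identical(x1, x2)        # TRUE

set.seed(as.integer(Sys.time()))   # volatile seed from the clock: output varies between runs
runif(5)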

Figure 2.1: Scatter plots to illustrate the correlation between consecutive outputs X_i and X_{i+1} of different pseudo random number generators. The random number generators used are the runif function in R (top left), the LCG with m = 81, a = 1 and c = 8 (top right), the LCG with m = 1024, a = 401, c = 101 (bottom left) and finally the LCG with parameters m = 2^32, a = ..., c = ... (bottom right). Clearly the output in the second and third example does not behave like an i.i.d. sequence of random variables. See exercise 2.2 for details.

2.2 The Inverse Transform Method

The inverse transform method is a method for generating random variables with values in the real numbers, using only U[0, 1]-distributed values as input. In this section we will use the concepts of probability densities and of (cumulative) distribution functions. See appendix A for a quick summary of these concepts.

Theorem 2.5. Let F be a distribution function. Define the inverse of F by

    F^{-1}(u) = inf{ x ∈ R | F(x) ≥ u }    for all u ∈ (0, 1)

and let U ~ U[0, 1]. Define X = F^{-1}(U). Then X has distribution function F.

Proof. Using the definitions of X and F^{-1} we find

    P(X ≤ a) = P(F^{-1}(U) ≤ a) = P(min{ x | F(x) ≥ U } ≤ a).

Since min{ x | F(x) ≥ U } ≤ a holds if and only if F(a) ≥ U, we can conclude

    P(X ≤ a) = P(F(a) ≥ U) = F(a),

where the final equality comes from the definition of the uniform distribution on [0, 1]. (q.e.d.)

Once F^{-1} is determined, this method is very simple to apply and the resulting algorithm is trivial. We state the formal algorithm only for completeness:

Algorithm INV (inverse transform method)
input: the inverse F^{-1} of a CDF F.
randomness used: U ~ U[0, 1].
output: X ~ F.
1: output X = F^{-1}(U)

Example 2.6. Let X have density f(x) = 3x^2 for x ∈ [0, 1], and f(x) = 0 else. Then

    F(a) = ∫_{-∞}^{a} f(x) dx = 0 if a < 0,  a^3 if 0 ≤ a < 1,  and 1 for 1 ≤ a.

Since F maps (0, 1) into (0, 1) injectively, F^{-1} is given by the usual inverse function and thus F^{-1}(u) = u^{1/3} for all u ∈ (0, 1). Thus, by theorem 2.5, if U ~ U[0, 1], the cube root U^{1/3} has the same distribution as X.
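For example 2.6 the method takes only a couple of lines of R; a minimal sketch:

set.seed(1)
U <- runif(10000)
X <- U^(1/3)                              # inverse transform: F^{-1}(u) = u^(1/3)
hist(X, breaks = 50, freq = FALSE)        # compare the histogram ...
curve(3 * x^2, from = 0, to = 1, add = TRUE)   # ... with the density f(x) = 3 x^2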

Figure 2.2: Illustration of the inverse F^{-1} of a CDF F. At level u the function F is continuous and injective; here F^{-1} coincides with the usual inverse of a function. The value v falls in the middle of a jump of F and thus has no preimage; F^{-1}(v) is the preimage of the right-hand limit of F and F(F^{-1}(v)) ≥ v. At level w the function F is not injective, several points map to w; the preimage F^{-1}(w) is the left-most of these points and we have, for example, F^{-1}(F(a)) ≠ a.

Example 2.7. Let X be discrete with P(X = 0) = 0.6 and P(X = 1) = 0.4. Then

    F(a) = 0 if a < 0,  0.6 if 0 ≤ a < 1,  and 1 if 1 ≤ a.

Using the definition of F^{-1} we find

    F^{-1}(u) = 0 if 0 < u ≤ 0.6,  and 1 else.

By theorem 2.5 we can construct a random variable X with the correct distribution from U ~ U[0, 1] by setting X = 0 if U ≤ 0.6, and X = 1 if U > 0.6.

The inverse transform method can always be applied when F^{-1} can be computed explicitly. For some distributions like the normal distribution this is not possible, and the inverse transform method cannot be applied directly. The method can be applied (but may not be very useful) for discrete distributions as in example 2.7.

2.3 Rejection Sampling

Rejection sampling is a method to generate samples of a given target distribution using samples from some related auxiliary distribution. Different from the previous method, rejection sampling may use more than one sample from the auxiliary distribution to generate one sample of the target distribution. The method is not restricted to the generation of random numbers, but works for sampling on arbitrary probability spaces. We formulate the method here for distributions on the Euclidean space R^d.

2.3.1 Uniform Distribution on Sets

In this section we consider a special case of rejection sampling, namely a rejection algorithm to sample from the uniform distribution on a set. This algorithm, while being useful in itself, will be used as the basis for the general rejection sampling method in the following section.

When A ⊆ R^d is a set, we write |A| for the d-dimensional volume of A: the cube Q = [a, b]^3 ⊆ R^3 has |Q| = (b - a)^3, the unit circle C = { x ∈ R^2 | x_1^2 + x_2^2 ≤ 1 } has 2-dimensional volume (area) π, and the line segment [a, b] ⊆ R has 1-dimensional volume (length) b - a.

Definition 2.8. A random variable X with values in R^d is uniformly distributed on a set A ⊆ R^d with |A| < ∞, if

    P(X ∈ B) = |A ∩ B| / |A|    for all B ⊆ R^d.    (2.3)

As for real intervals, we use the notation X ~ U(A) to indicate that X is uniformly distributed on A.

In general, the set B in (2.3) may not be a subset of A. For the case B ⊆ A we have A ∩ B = B and thus (2.3) simplifies to P(X ∈ B) = |B| / |A|. Similarly, if B ⊆ A, we have

    P(X ∉ B) = P(X ∈ R^d \ B) = |A \ B| / |A| = (|A| - |B|) / |A| = 1 - |B| / |A|.

One of the basic ideas of rejection sampling is described in the following lemma, which turns the uniform distribution on a set into the uniform distribution on a subset.

Lemma 2.9. Let B ⊆ A ⊆ R^d with |A| < ∞ and let (X_k)_{k∈N} be a sequence of independent random variables, uniformly distributed on A. Furthermore let N = min{ k ∈ N | X_k ∈ B }. Then X_N ~ U(B).

Proof. The lemma can be proved by verifying that condition (2.3) is satisfied. For C ⊆ R^d we have

    P(X_N ∈ C) = Σ_{k=1}^∞ P(X_k ∈ C, N = k)
               = Σ_{k=1}^∞ P(X_k ∈ B ∩ C) P(X_{k-1} ∉ B) ⋯ P(X_1 ∉ B)
               = Σ_{k=1}^∞ (|A ∩ (B ∩ C)| / |A|) (1 - |B| / |A|)^{k-1},

where we used the independence of the X_k. Using the geometric series formula Σ_{k=0}^∞ q^k = 1/(1 - q) we get

    P(X_N ∈ C) = (|B ∩ C| / |A|) · 1 / (1 - (1 - |B| / |A|)) = |B ∩ C| / |B|

for all C ⊆ R^d. This is the required condition for X_N to be uniformly distributed on B. (q.e.d.)

The result of lemma 2.9 is easily turned into an algorithm for sampling from U(B): we can generate the X_k one by one, for each generated value check the condition X_k ∈ B, and stop once the condition is true. In the context of the rejection sampling algorithm, the random variables X_k are called proposals and we say that the proposals X_1, ..., X_{N-1} are rejected and X_N is accepted.
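A minimal R sketch of this procedure, for the illustrative choice A = [-1, 1]^2 and B the unit disc (this concrete example is an assumption made here for illustration, not an example from the text):

runif_disc <- function() {
  repeat {
    x <- runif(2, min = -1, max = 1)   # proposal X_k, uniform on A = [-1, 1]^2
    if (sum(x^2) <= 1) return(x)       # accept the first proposal that falls into B
  }
}
Z <- t(replicate(2000, runif_disc()))
plot(Z, asp = 1)   # the accepted points fill the disc uniformly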

2.3.2 General Densities

In this section we present the general rejection sampling method. The following lemma, which makes a connection between distributions with general densities and uniform distributions on sets, forms the basis of this method.

Lemma 2.10. Let f: R^d → [0, ∞) be a probability density and let

    A = { (x, v) ∈ R^d × [0, ∞) | 0 ≤ v < f(x) } ⊆ R^{d+1}.

Then the following two statements are equivalent:

a) X has probability density f on R^d and V = f(X) U where U ~ U[0, 1]. The random variables X and U are independent.

b) (X, V) is uniformly distributed on A.

Proof. The volume of the set A can be found by integrating the height f(x) over all of R^d. Since f is a probability density, we get

    |A| = ∫_{R^d} f(x) dx = 1.    (2.4)

Assume first that the random variables X with density f and U ~ U[0, 1] are independent, and let V = f(X) U. Furthermore let C ⊆ R^d, D ⊆ [0, ∞) and B = C × D. Then we get

    P((X, V) ∈ B) = P(X ∈ C, V ∈ D)
                  = ∫_C P(V ∈ D | X = x) f(x) dx
                  = ∫_C P(f(x) U ∈ D) f(x) dx
                  = ∫_C (|D ∩ [0, f(x)]| / f(x)) f(x) dx
                  = ∫_C |D ∩ [0, f(x)]| dx.

On the other hand we have

    |A ∩ B| = ∫_{R^d} ∫_0^{f(x)} 1_{(x,v) ∈ B} dv dx
            = ∫_{R^d} 1_{x ∈ C} ∫_0^{f(x)} 1_{v ∈ D} dv dx
            = ∫_C |D ∩ [0, f(x)]| dx,

and thus P((X, V) ∈ B) = |A ∩ B|. This shows that (X, V) is uniformly distributed on A.

For the converse statement assume now that (X, V) is uniformly distributed on A and define U = V / f(X). Since (X, V) ∈ A, we have f(X) > 0 almost surely and thus there is no problem in dividing by f(X). Given sets C ⊆ R^d and D ⊆ R we find

    P(X ∈ C, U ∈ D) = P((X, V) ∈ { (x, v) | x ∈ C, v/f(x) ∈ D })
                    = |A ∩ { (x, v) | x ∈ C, v/f(x) ∈ D }|
                    = ∫_{R^d} ∫_0^{f(x)} 1_{x ∈ C} 1_{v/f(x) ∈ D} dv dx.

Figure 2.3: Illustration of the rejection sampling method where the graph of the target density is contained in a rectangle R and the proposals are uniformly distributed on R.

Using the substitution u = v/f(x) in the inner integral we get

    P(X ∈ C, U ∈ D) = ∫_{R^d} 1_{x ∈ C} ∫_0^1 1_{u ∈ D} f(x) du dx
                    = ∫_C f(x) dx · ∫_D 1_{[0,1]}(u) du.

Therefore X and U are independent with densities f and 1_{[0,1]}, respectively. (q.e.d.)

An easy application of the lemma is to use the implication from b) to a) to convert a uniform distribution in R^2 to a distribution on R with a given density f: [a, b] → R. For simplicity, we assume here that f lives on a bounded interval [a, b]. Furthermore, assume that f satisfies f(x) ≤ M for all x ∈ [a, b]. We can generate samples from the distribution with density f as follows:

a) Let X_k ~ U[a, b] and Y_k ~ U[0, M] for k ∈ N, independently. Then the (X_k, Y_k) are i.i.d., uniformly distributed on the rectangle R = [a, b] × [0, M].

b) Consider the set A = { (x, y) ∈ R | y ≤ f(x) } and let N = min{ k ∈ N | (X_k, Y_k) ∈ A }. By lemma 2.9, (X_N, Y_N) is uniformly distributed on A.

c) By lemma 2.10, the value X_N is distributed with density f.

This procedure is visualised in figure 2.3, and a small R sketch is given below.
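A minimal R sketch of steps a) to c), using the density f(x) = 3 x^2 from example 2.6 on [a, b] = [0, 1]; the bound M = 3 is an illustrative choice (it is the maximum of f on [0, 1]):

f <- function(x) 3 * x^2
rf_rect <- function() {
  repeat {
    x <- runif(1, 0, 1)       # X_k ~ U[a, b]
    y <- runif(1, 0, 3)       # Y_k ~ U[0, M]
    if (y <= f(x)) return(x)  # accept once (X_k, Y_k) falls below the graph of f
  }
}
X <- replicate(5000, rf_rect())
hist(X, breaks = 50, freq = FALSE)
curve(f, 0, 1, add = TRUE)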

In the general case, when f is defined on an unbounded set, we cannot use proposals which are uniformly distributed on a rectangle anymore. A solution to this problem is to replace the rectangle R with a different, unbounded area which has finite volume (so that the uniform distribution exists) and to use lemma 2.10 a second time to obtain a uniform distribution on this set. This idea is implemented in the following algorithm.

Figure 2.4: Illustration of the rejection sampling method from algorithm REJ. The proposal (X_k, c g(X_k) U_k) is distributed uniformly on the area under the graph of cg. The proposal is accepted if it falls into the area underneath the graph of f.

Algorithm REJ (rejection sampling)
input:
  a probability density f (the target density),
  a probability density g (the proposal density),
  a constant c > 0 such that f(x) ≤ c g(x) for all x.
randomness used:
  X_k i.i.d. with density g (the proposals),
  U_k ~ U[0, 1] i.i.d.
output: i.i.d. random variables Y_j with density f.
1: let j ← 1
2: for k = 1, 2, 3, ... do
3:   generate X_k with density g
4:   generate U_k ~ U[0, 1]
5:   if c g(X_k) U_k ≤ f(X_k) then
6:     output Y_j = X_k
7:     j ← j + 1
8:   end if
9: end for

The assumption in the algorithm is that we can already sample from the distribution with probability density g, but we would like to generate samples from the distribution with density f instead. The theorem below shows that the method works whenever we can find the required constant c. This condition implies, for example, that the support of f cannot be bigger than the support of g, i.e. we need f(x) to be 0 whenever g(x) = 0. Since f and g are both probability densities, we find

    1 = ∫ f(x) dx ≤ ∫ c g(x) dx = c,

i.e. the constant c always satisfies c ≥ 1, with equality only being possible for f = g.

Theorem 2.11. Let (Y_j)_{j∈N} be the sequence of random variables generated by algorithm REJ. Then the following statements hold:

a) The Y_j are i.i.d. with density f.

b) The number N_j of proposals required to generate Y_j is geometrically distributed with parameter p = 1/c. In particular, E(N_j) = c.

Proof. Since the X_k have density g, we know from lemma 2.10 that (X_k, g(X_k) U_k) is uniformly distributed on the set { (x, v) | 0 ≤ v < g(x) }. Consequently, Z_k = (X_k, c g(X_k) U_k) is uniformly distributed on A = { (x, y) | 0 ≤ y < c g(x) }. By lemma 2.9, the accepted values are then uniformly distributed on the set B = { (x, y) | 0 ≤ y < f(x) } ⊆ A and, applying lemma 2.10 again, we find that the X_k, conditional on being accepted, have density f. This completes the proof of the first statement.

Proposals are accepted if and only if Z_i ∈ B. Since |A| = ∫ c g(x) dx = c and |B| = ∫ f(x) dx = 1, the probability of accepting proposal i is

    P(Z_i ∈ B) = |B| / |A| = 1/c,

and the events are independent. Since N_j is the time until the first success of independent trials which succeed with probability p = 1/c, the values N_j are geometrically distributed with parameter p. This completes the proof of the second statement. (q.e.d.)

The average cost of generating one sample is given by the average number of proposals required times the cost for generating each proposal. Therefore the algorithm is efficient if the following two conditions are satisfied:

a) There is an efficient method to generate the proposals X_i.

b) f and g are similar, in the sense that the constant c is close to 1.
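A minimal R sketch of algorithm REJ; the concrete target f(x) = 6 x (1 - x) on [0, 1], the proposal g = U[0, 1] and the constant c = 1.5 are illustrative choices (not examples from the text), with c = 1.5 valid because f(x) ≤ 1.5 for all x:

rej <- function(n, f, rg, dg, c) {
  out <- numeric(n)
  j <- 1
  while (j <= n) {
    x <- rg(1)                      # proposal X_k with density g
    u <- runif(1)                   # U_k ~ U[0, 1]
    if (c * dg(x) * u <= f(x)) {    # accept if c g(X_k) U_k <= f(X_k)
      out[j] <- x
      j <- j + 1
    }
  }
  out
}
f <- function(x) 6 * x * (1 - x)
Y <- rej(5000, f, rg = runif, dg = dunif, c = 1.5)
hist(Y, breaks = 50, freq = FALSE)
curve(f, 0, 1, add = TRUE)

The acceptance probability is 1/c = 2/3 here, in line with statement b) of theorem 2.11.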

2.4 Other Methods

There are many specialised methods to generate samples from specific distributions. These are often faster than the generic methods described in the previous sections, but can typically only be used for a single distribution. Such specialised methods (optimised for speed and often quite complex) form the basis of the built-in random number generators in software packages. In contrast, the methods discussed in the previous sections are general purpose methods which can be used for a wide range of distributions when no pre-existing method is available.

Further Reading

A lot of information about linear congruential generators and about testing of random number generators can be found in Knuth (1981). The Mersenne Twister, a popular modern pseudo random number generator, is described in Matsumoto and Nishimura (1998). Rejection sampling and its extensions are described in Robert and Casella (2004, section 2.3). Specialised methods for generating normally distributed random variables can be found in Box and Muller (1958) and Marsaglia and Tsang (2000).

Figure 2.5: The density of the mixture distribution (1/2) N(1, 0.04) + (1/2) N(4, 1), together with a histogram generated from 4000 samples.

2.5 Exercises

Exercise E-2.1. Write an R function to implement the linear congruential generator. The function should have the following form:

LCG <- function(n, m, a, c, X0) {
  ...
  return(X)
}

The return value should be the vector X = (X_1, X_2, ..., X_n). Test your function LCG by first manually computing X_1, ..., X_6 for m = 5, a = 1, c = 3 and X_0 = 0 and then comparing the result to the output of LCG(6,5,1,3,0).

Exercise E-2.2. Given a sequence X = (X_1, ..., X_n) of U[0, 1]-distributed (pseudo) random numbers, we can use a scatter plot of (X_i, X_{i+1}) for i = 1, ..., n-1 in order to try to assess whether the X_i are independent.

a) Try creating such a plot using the built-in random-number generator of R:

X <- runif(1000)
plot(X[1:999], X[2:1000], asp=1)

Can you explain the resulting plot?

b) Create a similar plot, using your function LCG from exercise 2.1:

m <- 81
a <- 1
c <- 8
seed <- 0
X <- LCG(1000, m, a, c, seed)/m
plot(X[1:999], X[2:1000], asp=1)

Discuss the resulting picture.

c) Repeat the experiment from part b using the parameters m = 1024, a = 401, c = 101 and m = 2^32, a = ..., c = ... (the parameters used in figure 2.1). Discuss the results.

Exercise E-2.3. Let X, Y ~ U[0, 1] be independent. Using definition 2.8 of the uniform distribution on a set, show that (X, Y) is uniformly distributed on the square Q = [0, 1] × [0, 1].

Exercise E-2.4. Let f and g be two probability densities and c ∈ R with f(x) ≤ c g(x) for all x. Show that c ≥ 1 and that c = 1 is only possible for f = g (except possibly on sets with measure 0).

Exercise E-2.5 (rejection sampling for the normal distribution).

a) Work out the optimal constant c for the rejection sampling algorithm when the proposals are Exp(1)-distributed and the target density is given by

    f(x) = √(2/π) exp(-x^2/2)  if x ≥ 0,  and  f(x) = 0  else.

b) Implement the resulting method. Test your program by generating a histogram of the output and by comparing the histogram to the density f.

c) How can the program from part b be modified to generate standard normally distributed random numbers? Hint: Have a look at exercise 2.5.

Exercise E-2.6. In this exercise we will see that the rejection sampling algorithm can still be applied when the density of the target distribution is only known up to a constant. Let f̃: R → [0, ∞) be a function. Define

    f(x) = (1/Z) f̃(x)    for all x ∈ R,

where Z = ∫_R f̃(x) dx. Then f is a probability density.

a) Assume that we can generate proposals X_1, X_2, ... with density g where f̃(x) ≤ c g(x) for all x ∈ R. State the rejection sampling algorithm for sampling from the density f. Write the algorithm in terms of f̃ instead of f and show that the method can be applied even if the value of Z is unknown.

b) Write a program to generate samples of the probability distribution with density f(x) = (1/Z) exp(cos(x)) on the interval [0, 2π]. How can you test your program?

Exercise E-2.7. Consider n probability distributions P_1, ..., P_n with cumulative distribution functions F_1, ..., F_n. Furthermore let θ_1, ..., θ_n > 0 such that Σ_{i=1}^n θ_i = 1. Then the mixture of P_1, ..., P_n with weights θ_1, ..., θ_n is the distribution P_θ with CDF

    F_θ(x) = Σ_{i=1}^n θ_i F_i(x)    for all x ∈ R.

a) Convince yourself that P_θ is the distribution obtained by the following two-step procedure: first choose randomly one of the distributions P_1, ..., P_n, where P_i is chosen with probability θ_i, and then, independently, take a sample of the chosen distribution.

b) Show that, if P_1, ..., P_n have densities f_1, ..., f_n, then P_θ also has a density, which is given by

    f_θ = Σ_{i=1}^n θ_i f_i.

c) Write an R program which generates samples from the mixture of P_1 = N(1, 0.01), P_2 = N(2, 0.04) and P_3 = N(4, 0.01) with weights θ_1 = 0.5, θ_2 = 0.2 and θ_3 = 0.3. Generate a plot, similar to the one in figure 2.5, showing a histogram of 4000 samples of this distribution, together with the density of the mixture. Hint: relevant R functions include rnorm, dnorm, sample, hist, and curve.


Chapter 3
Monte Carlo Methods

Monte Carlo methods are computational methods where one examines properties of a probability distribution by generating a large sample from the given distribution and then studying the statistical properties of this sample. In this chapter we consider various Monte Carlo methods to numerically approximate an expectation E(f(X)) where f is a real-valued function.

3.1 Basic Monte Carlo Integration

Let X be a random variable and f be a real-valued function. Then there are several different methods to compute the expectation E(f(X)):

a) Sometimes we can find the answer analytically. For example, when the distribution of X has a density ϕ, we can use the relation

    E(f(X)) = ∫ f(x) ϕ(x) dx    (3.1)

to obtain the value of the expectation (see appendix A.3). This method only works if we can solve the resulting integral.

b) If the integral in (3.1) cannot be solved analytically, we can try to use numerical integration to get an approximation to the value of the integral. When X takes values in a low-dimensional space, this method often works well, but for higher dimensional spaces numerical approximation can become very expensive and the resulting method may no longer be efficient. Since numerical integration is outside the topic of statistical computing, we will not follow this approach here.

c) The technique we will study in this chapter is called Monte Carlo integration. This technique is based on the strong law of large numbers (see theorem A.9 in the appendix): if (X_i)_{i∈N} is a sequence of i.i.d. random variables with the same distribution as X, then

    lim_{N→∞} (1/N) Σ_{i=1}^N f(X_i) = E(f(X))    (3.2)

almost surely. While exact equality only holds in the limit N → ∞, we can use the approximation

    E(f(X)) ≈ (1/N) Σ_{i=1}^N f(X_i)    (3.3)

when N is large. Note that the estimate on the right-hand side is constructed from the random values X_i and thus is a random quantity itself.

The term "Monte Carlo" in the name of the last of these methods is in reference to the Monte Carlo casino (roulette tables are random number generators, in a sense), and the word "integration" refers to the equivalence (3.1) between computing expectations and computing integrals.

The random variables X_i in (3.3) are sometimes called i.i.d. copies of X. Since, formally, X itself is not used to compute the Monte Carlo estimate, sometimes the random variable X is not explicitly introduced and one writes E(f(X_1)) instead of E(f(X)). Since X_1 has the same distribution as X, these two expressions have the same value. Of course any other X_i could also be used instead of X_1.

For completeness we reformulate formula (3.3) as an algorithm:

Algorithm MC (Monte Carlo integration)
input: a function f, N ∈ N.
randomness used: an i.i.d. sequence (X_i)_{i∈N} of random variables.
output: an estimate for E(f(X_1)).
1: s ← 0
2: for i = 1, 2, ..., N do
3:   s ← s + f(X_i)
4: end for
5: return s/N

Example 3.1. For f(x) = x, the Monte Carlo estimate (3.3) reduces to

    E(X) ≈ (1/N) Σ_{i=1}^N X_i = X̄.

This is just the usual estimator for the mean.

Example 3.2. Assume that for some reason we need to compute E(sin(X)) where X ~ N(µ, σ^2). Obtaining this value analytically will be difficult, but we can easily get an approximation using Monte Carlo integration: if we choose N big and generate independent, N(µ, σ^2)-distributed random variables X_1, X_2, ..., X_N, then by the strong law of large numbers we have

    E(sin(X)) ≈ (1/N) Σ_{i=1}^N sin(X_i).

The right-hand side of this approximation can be easily evaluated using a computer program, giving an estimate for the required expectation.
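A minimal R sketch of example 3.2; the values µ = 1 and σ = 2 are arbitrary illustrative choices:

set.seed(1)
N <- 100000
mu <- 1
sigma <- 2
X <- rnorm(N, mean = mu, sd = sigma)
mean(sin(X))   # Monte Carlo estimate of E(sin(X))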

The basic Monte Carlo method above is presented only for the case of computing an expectation. The following arguments show how the method can be applied to compute probabilities or to determine the value of an integral.

We can compute probabilities using Monte Carlo integration by rewriting them as expectations: if X is a random variable, we have P(X ∈ A) = E(1_A(X)) (see appendix A.3). Thus, we can estimate P(X ∈ A) using Monte Carlo integration by

    P(X ∈ A) = E(1_A(X)) ≈ (1/N) Σ_{i=1}^N 1_A(X_i),    (3.4)

where the X_i are i.i.d. copies of X and N is big.

We can compute integrals like ∫_a^b f(x) dx using Monte Carlo integration by utilising the relation (3.1): let U_i ~ U[a, b]. Then the density of U_i is ϕ(x) = (1/(b-a)) 1_{[a,b]}(x). We get

    ∫_a^b f(x) dx = (b-a) ∫ f(x) ϕ(x) dx = (b-a) E(f(U)) ≈ ((b-a)/N) Σ_{i=1}^N f(U_i)    (3.5)

for big N.

Example 3.3. Let X ~ N(0, 1) and a ∈ R. Then the probability P(X ≥ a) cannot be computed analytically, but

    P(X ≥ a) ≈ (1/N) Σ_{i=1}^N 1_{X_i ≥ a}    (3.6)

can be used as an approximation.

Example 3.4. The method from (3.5) allows us to derive a different method for computing the probability P(X ≥ a) from example 3.3. Assuming a ≥ 0, we know

    P(X ≥ a) = ∫_a^∞ (1/√(2π)) e^{-x^2/2} dx = 1/2 - (1/√(2π)) ∫_0^a e^{-x^2/2} dx.

Using (3.5), we can approximate this by

    P(X ≥ a) ≈ 1/2 - (a / (√(2π) N)) Σ_{i=1}^N e^{-U_i^2/2},

where U_i ~ U[0, a] are i.i.d. For the case a < 0 we can use the relation P(X ≥ a) = 1 - P(X ≥ -a).

With the method discussed so far, an open question is still how big a value of N we should choose. On the one hand, the bigger N is, the more accurate our estimates get; as N → ∞ the Monte Carlo approximation converges to the correct answer. But on the other hand, the bigger N is, the more expensive the method becomes, because we need to generate and process more samples X_i. Key to choosing N are the following two equations:

- The expectation of the Monte Carlo estimate (3.3) for E(f(X)) is given by

    E((1/N) Σ_{i=1}^N f(X_i)) = (1/N) Σ_{i=1}^N E(f(X_i)) = E(f(X)).

  Therefore the estimate is unbiased for every N.

- Since the X_i are independent, the variance of the Monte Carlo estimate is given by

    Var((1/N) Σ_{i=1}^N f(X_i)) = (1/N^2) Σ_{i=1}^N Var(f(X_i)) = (1/N) Var(f(X)).    (3.7)

  Therefore the variance, which controls the fluctuations of our estimates around the correct value, converges to 0 as N → ∞.

Since the estimate is unbiased, the magnitude of a typical estimation error will be given by the standard deviation of the estimate; therefore we expect the error of a Monte Carlo estimate to decay like 1/√N. The fact that the error of Monte Carlo integration decays only like 1/√N means that in practice huge numbers of samples can be required. To increase accuracy by a factor of 10, i.e. to get one more significant digit of the result, one needs to increase the number of samples by a factor of 100.

Example 3.5. Assume that U ~ U[0, 1] and that we want to estimate E(U^2) using Monte Carlo integration. Then we have

    Var((1/N) Σ_{i=1}^N f(U_i)) = (1/N) Var(U^2).

To compute this variance we note

    E(U^2) = ∫_0^1 u^2 du = [u^3/3]_{u=0}^1 = 1/3

and

    E((U^2)^2) = ∫_0^1 u^4 du = [u^5/5]_{u=0}^1 = 1/5.

Therefore, Var(U^2) = 1/5 - (1/3)^2 = 4/45 and the variance of the Monte Carlo estimate is

    Var((1/N) Σ_{i=1}^N f(X_i)) = (1/N) · (4/45) = 4/(45 N).

We can use relation (3.7) to determine which value of N is required to achieve a given precision. We can either directly prescribe the standard deviation of the estimate, or alternatively we can use estimates like the following: by Chebyshev's inequality (see theorem A.8) we have

    P(|(1/N) Σ_{i=1}^N f(X_i) - E(f(X))| ≥ ε) ≤ (1/ε^2) Var((1/N) Σ_{i=1}^N f(X_i))

and thus, by (3.7),

    P(|(1/N) Σ_{i=1}^N f(X_i) - E(f(X))| ≥ ε) ≤ Var(f(X)) / (N ε^2).

Consequently, by choosing N ≥ Var(f(X)) / (α ε^2) we can achieve

    P(|(1/N) Σ_{i=1}^N f(X_i) - E(f(X))| ≥ ε) ≤ α.
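The variance formula (3.7) and the value 4/45 from example 3.5 can be checked empirically; a minimal R sketch:

set.seed(1)
N <- 1000
estimates <- replicate(2000, mean(runif(N)^2))  # 2000 independent Monte Carlo estimates of E(U^2)
sd(estimates)          # observed spread of the estimates
sqrt(4 / (45 * N))     # theoretical standard deviation, approximately 0.00943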

Example 3.6. Assume Var(f(X)) = 1. In order to estimate E(f(X)) so that the error is at most ε = 0.01 with probability at least 1 - α = 95%, we can use a Monte Carlo estimate with

    N ≥ Var(f(X)) / (α ε^2) = 1 / (0.05 · (0.01)^2) = 200 000

samples.

Example 3.7 (estimating probabilities I). Let X be a real-valued random variable and A ⊆ R. Then we can use (3.4) to estimate P(X ∈ A). The error of this method is determined by the variance

    Var(1_A(X)) = E(1_A(X)^2) - E(1_A(X))^2 = P(X ∈ A) - P(X ∈ A)^2.

In the preceding estimates we assumed that the value Var(f(X)) is known, whereas in practice this is normally not the case (since we need to use Monte Carlo integration to estimate the expectation of f(X), it seems unlikely that the variance is known explicitly). There are various ways to work around this problem:

- Sometimes an upper bound for Var(f(X)) is known, which can be used in place of the true variance.

- One can try a two-step procedure where first Monte Carlo integration with a fixed N (say N = 1000) is used to estimate Var(f(X)). Then, in a second step, one can use estimates as above to determine a value of N for use in determining the required expectation. (A small R sketch of this procedure is given at the end of this section.)

- Sequential methods can be used where one generates samples X_i one by one and estimates the standard deviation of the generated f(X_i) from time to time. The procedure is continued until the estimated standard deviation falls below a prescribed limit.

In all the error estimates above, we used the fact that f(X) has finite variance. As Var(f(X)) gets bigger, convergence to the correct result gets slower and the required number of samples to obtain a given error increases. Since the strong law of large numbers does not require the random variables to have finite variance, equation (3.2) still holds for Var(f(X)) = ∞, but convergence in (3.3) will be extremely slow and the resulting method will not be very useful in this case.
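A minimal R sketch of the two-step procedure from the list above; the choices f = sin and X ~ N(0, 1) are illustrative assumptions:

set.seed(1)
f <- sin
rX <- function(n) rnorm(n)          # sampler for X (assumed here for illustration)

pilot <- f(rX(1000))                # step 1: pilot run with fixed N = 1000
v <- var(pilot)                     # estimate of Var(f(X))

eps <- 0.01; alpha <- 0.05
N <- ceiling(v / (alpha * eps^2))   # step 2: choose N from the Chebyshev bound
mean(f(rX(N)))                      # final Monte Carlo estimate of E(f(X))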

3.2 Variance Reduction Methods

As we have seen, the efficiency of Monte Carlo integration is determined by the variance of the estimate: the higher the variance, the more samples are required to obtain a given accuracy. This section describes methods to improve efficiency by considering modified Monte Carlo methods which, for a given sample size N, achieve a lower variance of the Monte Carlo estimate.

3.2.1 Importance Sampling

The importance sampling method is based on the following argument. Assume that X is a random variable with density ϕ, that f is a real-valued function and that ψ is another probability density with ψ(x) > 0 whenever f(x)ϕ(x) > 0. Then we have

    E(f(X)) = ∫ f(x) ϕ(x) dx = ∫ (f(x) ϕ(x) / ψ(x)) ψ(x) dx,

where we define the fraction to be 0 whenever the denominator (and thus the numerator) equals 0. Since ψ is a probability density, the integral on the right can be written as an expectation again: if Y has density ψ, we have

    E(f(X)) = E(f(Y) ϕ(Y) / ψ(Y)).

Using basic Monte Carlo integration, we can estimate this expectation as

    E(f(X)) ≈ (1/N) Σ_{i=1}^N f(Y_i) ϕ(Y_i) / ψ(Y_i),    (3.8)

where the Y_i are i.i.d. copies of Y. This approximation can be used as an alternative to the basic Monte Carlo approximation (3.3). We can write the resulting method as an algorithm as follows:

Algorithm IMP (importance sampling)
input: a function f, the density ϕ of X, an auxiliary density ψ, N ∈ N.
randomness used: an i.i.d. sequence (Y_i)_{i∈N} with density ψ.
output: an estimate for E(f(X)).
1: s ← 0
2: for i = 1, 2, ..., N do
3:   s ← s + f(Y_i) ϕ(Y_i) / ψ(Y_i)
4: end for
5: return s/N

This method is a generalisation of the basic Monte Carlo method: if we choose ψ = ϕ, the two densities in (3.8) cancel, the Y_i are i.i.d. copies of X and, for this case, the method is identical to basic Monte Carlo integration. The usefulness of importance sampling lies in the fact that we can choose the density ψ (and thus the Y_i) in order to maximise efficiency. As we have seen, the error of the Monte Carlo estimate is determined by the variance

    Var((1/N) Σ_{i=1}^N f(Y_i) ϕ(Y_i) / ψ(Y_i)) = (1/N) Var(f(Y) ϕ(Y) / ψ(Y)).

Therefore, the method is efficient if both of the following criteria are satisfied:

a) The Y_i can be generated efficiently.

b) The variance Var(f(Y) ϕ(Y) / ψ(Y)) is small.

To find out how ψ can be chosen to maximise efficiency, we first consider the extreme case where fϕ/ψ is constant: in this case, the variance of the Monte Carlo estimate is 0 and therefore there is no error at all! In this case we have

    ψ(x) = (1/c) f(x) ϕ(x)    (3.9)

for all x, where c is the constant value of fϕ/ψ. We can find the value of c by using the fact that ψ is a probability density:

    c = c ∫ ψ(x) dx = ∫ f(x) ϕ(x) dx = E(f(X)).

Therefore, in order to choose ψ as in (3.9) we would have to already have solved the problem of computing the expectation E(f(X)), and thus we don't get a useful method for this case. Still, this boundary case offers some guidance: since we get optimal efficiency if ψ is chosen proportional to fϕ, we can expect that we will get good efficiency if we choose ψ to be approximately proportional to the function fϕ.

Example 3.8 (estimating probabilities II). Let X be a real-valued random variable and A ⊆ R. Then the importance sampling method for estimating the probability P(X ∈ A) is given by

    P(X ∈ A) = E(1_A(X)) ≈ (1/N) Σ_{i=1}^N 1_A(Y_i) ϕ(Y_i) / ψ(Y_i),

where ψ is a probability density with ψ(x) > 0 for all x ∈ A with ϕ(x) > 0, and (Y_i)_{i∈N} is a sequence of i.i.d. random variables with density ψ. The error of the method is determined by the variance

    Var(1_A(Y) ϕ(Y) / ψ(Y)) = E(1_A(Y)^2 ϕ(Y)^2 / ψ(Y)^2) - E(1_A(Y) ϕ(Y) / ψ(Y))^2
                            = ∫_A (ϕ(x)^2 / ψ(x)^2) ψ(x) dx - (∫_A (ϕ(x) / ψ(x)) ψ(x) dx)^2
                            = ∫_A (ϕ(x) / ψ(x)) ϕ(x) dx - (∫_A ϕ(x) dx)^2
                            = E(1_A(X) ϕ(X) / ψ(X)) - P(X ∈ A)^2.    (3.10)

In example 3.7 we found that the error for the basic Monte Carlo method is determined by

    Var(1_A(X)) = E(1_A(X)) - P(X ∈ A)^2.

Comparing this to (3.10), we see that the importance sampling method will have a smaller variance than basic Monte Carlo integration if we can choose ψ such that ψ > ϕ on the set A.
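A minimal R sketch of importance sampling in the setting of example 3.8; the target probability P(X > 4) for X ~ N(0, 1) and the proposal density ψ = N(4, 1) are illustrative choices, made so that ψ is large where the indicator equals 1:

set.seed(1)
N <- 10000
Y <- rnorm(N, mean = 4, sd = 1)          # proposals with density psi
w <- dnorm(Y, 0, 1) / dnorm(Y, 4, 1)     # weights phi(Y_i) / psi(Y_i)
mean((Y > 4) * w)                        # importance sampling estimate of P(X > 4)
pnorm(4, lower.tail = FALSE)             # exact value, about 3.17e-05
mean(rnorm(N) > 4)                       # basic Monte Carlo estimate, typically 0 for this N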

3.2.2 Antithetic Variables

The antithetic variables method (also called the antithetic variates method) reduces the variance, and thus the error, of Monte Carlo estimates by using pairwise dependent samples X_i instead of independent samples.

For illustration, we first consider the case N = 2: assume that X_1 and X_2 are identically distributed random variables, which are not independent. As for the independent case we have

    E((f(X_1) + f(X_2)) / 2) = (E(f(X_1)) + E(f(X_2))) / 2 = E(f(X)),

but for the variance we get

    Var((f(X_1) + f(X_2)) / 2) = (Var(f(X_1)) + 2 Cov(f(X_1), f(X_2)) + Var(f(X_2))) / 4
                               = (1/2) Var(f(X)) + (1/2) Cov(f(X_1), f(X_2)).

Compared to the independent case, an additional covariance term (1/2) Cov(f(X_1), f(X_2)) is present. The idea of the antithetic variables method is to construct X_1 and X_2 such that Cov(f(X_1), f(X_2)) is negative, thereby reducing the total variance.

If we construct the X_i using the inverse transform method, we can proceed as follows: let F be the distribution function of X and let U ~ U[0, 1]. Then 1 - U ~ U[0, 1] and we can use X_1 = F^{-1}(U) and X_2 = F^{-1}(1 - U) to generate X_1, X_2 ~ F. Since F^{-1} is monotonically increasing, X_1 increases as U increases while X_2 decreases as U increases. Therefore one would expect X_1 and X_2 to be negatively correlated (the following lemma shows that this is indeed the case), and consequently the variance of (X_1 + X_2)/2 will be smaller than one would expect for independent random variables.

Lemma 3.9. Let g: R → R be monotonically increasing and U ~ U[0, 1]. Then Cov(g(U), g(1 - U)) ≤ 0.

Proof. Let V ~ U[0, 1], independent of U. We distinguish two cases: if U ≤ V we have g(U) ≤ g(V) and g(1 - U) ≥ g(1 - V). Otherwise, if U > V, we have g(U) ≥ g(V) and g(1 - U) ≤ g(1 - V). Thus, in both cases we have

    (g(U) - g(V)) (g(1 - U) - g(1 - V)) ≤ 0

and consequently

    Cov(g(U), g(1 - U)) = E(g(U) g(1 - U)) - E(g(U)) E(g(1 - U))
                        = (1/2) E(g(U) g(1 - U) + g(V) g(1 - V) - g(U) g(1 - V) - g(V) g(1 - U))
                        = (1/2) E((g(U) - g(V)) (g(1 - U) - g(1 - V)))
                        ≤ 0.

This completes the proof. (q.e.d.)

If f is monotonically increasing, we can apply this lemma for g(u) = f(F^{-1}(u)). In this case we have f(X_1) = f(F^{-1}(U)) = g(U) and f(X_2) = f(F^{-1}(1 - U)) = g(1 - U), and the lemma allows us to conclude Cov(f(X_1), f(X_2)) ≤ 0.

For N > 2, assuming N is even, we group the samples X_i into pairs and construct every pair as above: let U_j ~ U[0, 1] be i.i.d. and for j ∈ N define X_{2j-1} = F^{-1}(U_j) and X_{2j} = F^{-1}(1 - U_j). Then X_i ~ F for all i ∈ N and we can use the approximation

    E(f(X)) ≈ (1/N) Σ_{i=1}^N f(X_i).

We can write the resulting method as an algorithm as follows.

Algorithm ANT (antithetic variables)
input: a function f, the inverse F^{-1} of the distribution function of X, N ∈ N even.
randomness used: a sequence (U_1, U_2, ..., U_{N/2}) of i.i.d. U[0, 1] random variables.
output: an estimate for E(f(X)).
1: s ← 0
2: for j = 1, 2, ..., N/2 do
3:   s ← s + f(F^{-1}(U_j)) + f(F^{-1}(1 - U_j))
4: end for
5: return s/N

While lemma 3.9 only guarantees that the variance for the antithetic variables method is smaller than or equal to the variance of standard Monte Carlo integration, without giving any quantitative bounds, in practice often a significant reduction of variance is observed.

Example 3.10. Assume that X ~ U[0, 1] and that we want to estimate E(X^2) using the antithetic variables method. Since the distribution function of X is F(x) = x, the antithetic samples we get with the method above are X_{2j-1} = U_j and X_{2j} = 1 - U_j, so that f(X_{2j-1}) = U_j^2 and f(X_{2j}) = (1 - U_j)^2. Thus we have

    Var((1/N) Σ_{i=1}^N f(X_i)) = (1/N^2) (N/2) Var(U^2 + (1 - U)^2)
                               = (1/(2N)) Var(2U^2 - 2U + 1)
                               = (2/N) Var(U^2 - U),

where U ~ U[0, 1]. To compute this variance we note

    E(U^2 - U) = ∫_0^1 (u^2 - u) du = [u^3/3 - u^2/2]_{u=0}^1 = 1/3 - 1/2 = -1/6

and

    E((U^2 - U)^2) = ∫_0^1 (u^4 - 2u^3 + u^2) du = [u^5/5 - u^4/2 + u^3/3]_{u=0}^1 = 1/5 - 1/2 + 1/3 = 1/30.

Therefore, Var(U^2 - U) = 1/30 - (1/6)^2 = 1/180 and the variance of the antithetic variables estimate is

    Var((1/N) Σ_{i=1}^N f(X_i)) = (2/N) Var(U^2 - U) = 1/(90 N).

This is an eight-fold reduction compared to the value 4/(45 N) found in example 3.5.
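The variances computed in examples 3.5 and 3.10 can be checked empirically; a minimal R sketch:

set.seed(1)
N <- 1000
mc  <- replicate(2000, mean(runif(N)^2))                          # basic Monte Carlo estimates
ant <- replicate(2000, { U <- runif(N / 2); mean(c(U^2, (1 - U)^2)) })  # antithetic estimates
var(mc)    # close to 4 / (45 N)  = 8.9e-05
var(ant)   # close to 1 / (90 N)  = 1.1e-05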

Uniform Random Number Generators

Uniform Random Number Generators JHU 553.633/433: Monte Carlo Methods J. C. Spall 25 September 2017 CHAPTER 2 RANDOM NUMBER GENERATION Motivation and criteria for generators Linear generators (e.g., linear congruential generators) Multiple

More information

Random Number Generation. Stephen Booth David Henty

Random Number Generation. Stephen Booth David Henty Random Number Generation Stephen Booth David Henty Introduction Random numbers are frequently used in many types of computer simulation Frequently as part of a sampling process: Generate a representative

More information

Simulation. Where real stuff starts

Simulation. Where real stuff starts 1 Simulation Where real stuff starts ToC 1. What is a simulation? 2. Accuracy of output 3. Random Number Generators 4. How to sample 5. Monte Carlo 6. Bootstrap 2 1. What is a simulation? 3 What is a simulation?

More information

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( ) Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca October 22nd, 2014 E. Tanré (INRIA - Team Tosca) Mathematical

More information

( ) ( ) Monte Carlo Methods Interested in. E f X = f x d x. Examples:

( ) ( ) Monte Carlo Methods Interested in. E f X = f x d x. Examples: Monte Carlo Methods Interested in Examples: µ E f X = f x d x Type I error rate of a hypothesis test Mean width of a confidence interval procedure Evaluating a likelihood Finding posterior mean and variance

More information

Simulation. Where real stuff starts

Simulation. Where real stuff starts Simulation Where real stuff starts March 2019 1 ToC 1. What is a simulation? 2. Accuracy of output 3. Random Number Generators 4. How to sample 5. Monte Carlo 6. Bootstrap 2 1. What is a simulation? 3

More information

Random Number Generation. CS1538: Introduction to simulations

Random Number Generation. CS1538: Introduction to simulations Random Number Generation CS1538: Introduction to simulations Random Numbers Stochastic simulations require random data True random data cannot come from an algorithm We must obtain it from some process

More information

2 Random Variable Generation

2 Random Variable Generation 2 Random Variable Generation Most Monte Carlo computations require, as a starting point, a sequence of i.i.d. random variables with given marginal distribution. We describe here some of the basic methods

More information

Lecture 20. Randomness and Monte Carlo. J. Chaudhry. Department of Mathematics and Statistics University of New Mexico

Lecture 20. Randomness and Monte Carlo. J. Chaudhry. Department of Mathematics and Statistics University of New Mexico Lecture 20 Randomness and Monte Carlo J. Chaudhry Department of Mathematics and Statistics University of New Mexico J. Chaudhry (UNM) CS 357 1 / 40 What we ll do: Random number generators Monte-Carlo integration

More information

Generating pseudo- random numbers

Generating pseudo- random numbers Generating pseudo- random numbers What are pseudo-random numbers? Numbers exhibiting statistical randomness while being generated by a deterministic process. Easier to generate than true random numbers?

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky

Monte Carlo. Lecture 15 4/9/18. Harvard SEAS AP 275 Atomistic Modeling of Materials Boris Kozinsky Monte Carlo Lecture 15 4/9/18 1 Sampling with dynamics In Molecular Dynamics we simulate evolution of a system over time according to Newton s equations, conserving energy Averages (thermodynamic properties)

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 43 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA March 1, 2016 2 / 43 Outline 1 Introduction 2 Some Generators We Won t

More information

Random processes and probability distributions. Phys 420/580 Lecture 20

Random processes and probability distributions. Phys 420/580 Lecture 20 Random processes and probability distributions Phys 420/580 Lecture 20 Random processes Many physical processes are random in character: e.g., nuclear decay (Poisson distributed event count) P (k, τ) =

More information

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo

Workshop on Heterogeneous Computing, 16-20, July No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo Workshop on Heterogeneous Computing, 16-20, July 2012 No Monte Carlo is safe Monte Carlo - more so parallel Monte Carlo K. P. N. Murthy School of Physics, University of Hyderabad July 19, 2012 K P N Murthy

More information

Monte Carlo Integration I [RC] Chapter 3

Monte Carlo Integration I [RC] Chapter 3 Aula 3. Monte Carlo Integration I 0 Monte Carlo Integration I [RC] Chapter 3 Anatoli Iambartsev IME-USP Aula 3. Monte Carlo Integration I 1 There is no exact definition of the Monte Carlo methods. In the

More information

CPSC 531: Random Numbers. Jonathan Hudson Department of Computer Science University of Calgary

CPSC 531: Random Numbers. Jonathan Hudson Department of Computer Science University of Calgary CPSC 531: Random Numbers Jonathan Hudson Department of Computer Science University of Calgary http://www.ucalgary.ca/~hudsonj/531f17 Introduction In simulations, we generate random values for variables

More information

Sources of randomness

Sources of randomness Random Number Generator Chapter 7 In simulations, we generate random values for variables with a specified distribution Ex., model service times using the exponential distribution Generation of random

More information

Contents. 1 Probability review Introduction Random variables and distributions Convergence of random variables...

Contents. 1 Probability review Introduction Random variables and distributions Convergence of random variables... Contents Probability review. Introduction..............................2 Random variables and distributions................ 3.3 Convergence of random variables................. 6 2 Monte Carlo methods

More information

Monte Carlo integration (naive Monte Carlo)

Monte Carlo integration (naive Monte Carlo) Monte Carlo integration (naive Monte Carlo) Example: Calculate π by MC integration [xpi] 1/21 INTEGER n total # of points INTEGER i INTEGER nu # of points in a circle REAL x,y coordinates of a point in

More information

Generating Random Variables

Generating Random Variables Generating Random Variables These slides are created by Dr. Yih Huang of George Mason University. Students registered in Dr. Huang's courses at GMU can make a single machine-readable copy and print a single

More information

Scientific Computing: Monte Carlo

Scientific Computing: Monte Carlo Scientific Computing: Monte Carlo Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course MATH-GA.2043 or CSCI-GA.2112, Spring 2012 April 5th and 12th, 2012 A. Donev (Courant Institute)

More information

Markov Chain Monte Carlo Lecture 1

Markov Chain Monte Carlo Lecture 1 What are Monte Carlo Methods? The subject of Monte Carlo methods can be viewed as a branch of experimental mathematics in which one uses random numbers to conduct experiments. Typically the experiments

More information

Monte Carlo Methods. Geoff Gordon February 9, 2006

Monte Carlo Methods. Geoff Gordon February 9, 2006 Monte Carlo Methods Geoff Gordon ggordon@cs.cmu.edu February 9, 2006 Numerical integration problem 5 4 3 f(x,y) 2 1 1 0 0.5 0 X 0.5 1 1 0.8 0.6 0.4 Y 0.2 0 0.2 0.4 0.6 0.8 1 x X f(x)dx Used for: function

More information

Monte Carlo Methods in Engineering

Monte Carlo Methods in Engineering Monte Carlo Methods in Engineering Mikael Amelin Draft version KTH Royal Institute of Technology Electric Power Systems Stockholm 2013 PREFACE This compendium describes how Monte Carlo methods can be

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 41 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA 10/13/16 2 / 41 Outline 1 Introduction 2 Some Lousy Generators We Won t

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

How does the computer generate observations from various distributions specified after input analysis?

How does the computer generate observations from various distributions specified after input analysis? 1 How does the computer generate observations from various distributions specified after input analysis? There are two main components to the generation of observations from probability distributions.

More information

Review of Statistical Terminology

Review of Statistical Terminology Review of Statistical Terminology An experiment is a process whose outcome is not known with certainty. The experiment s sample space S is the set of all possible outcomes. A random variable is a function

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture III (29.10.07) Contents: Overview & Test of random number generators Random number distributions Monte Carlo The terminology Monte Carlo-methods originated around

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Preliminary statistics

Preliminary statistics 1 Preliminary statistics The solution of a geophysical inverse problem can be obtained by a combination of information from observed data, the theoretical relation between data and earth parameters (models),

More information

Class 12. Random Numbers

Class 12. Random Numbers Class 12. Random Numbers NRiC 7. Frequently needed to generate initial conditions. Often used to solve problems statistically. How can a computer generate a random number? It can t! Generators are pseudo-random.

More information

Chapter 4: Monte Carlo Methods. Paisan Nakmahachalasint

Chapter 4: Monte Carlo Methods. Paisan Nakmahachalasint Chapter 4: Monte Carlo Methods Paisan Nakmahachalasint Introduction Monte Carlo Methods are a class of computational algorithms that rely on repeated random sampling to compute their results. Monte Carlo

More information

Generating Uniform Random Numbers

Generating Uniform Random Numbers 1 / 44 Generating Uniform Random Numbers Christos Alexopoulos and Dave Goldsman Georgia Institute of Technology, Atlanta, GA, USA 10/29/17 2 / 44 Outline 1 Introduction 2 Some Lousy Generators We Won t

More information

So we will instead use the Jacobian method for inferring the PDF of functionally related random variables; see Bertsekas & Tsitsiklis Sec. 4.1.

So we will instead use the Jacobian method for inferring the PDF of functionally related random variables; see Bertsekas & Tsitsiklis Sec. 4.1. 2011 Page 1 Simulating Gaussian Random Variables Monday, February 14, 2011 2:00 PM Readings: Kloeden and Platen Sec. 1.3 Why does the Box Muller method work? How was it derived? The basic idea involves

More information

From Random Numbers to Monte Carlo. Random Numbers, Random Walks, Diffusion, Monte Carlo integration, and all that

From Random Numbers to Monte Carlo. Random Numbers, Random Walks, Diffusion, Monte Carlo integration, and all that From Random Numbers to Monte Carlo Random Numbers, Random Walks, Diffusion, Monte Carlo integration, and all that Random Walk Through Life Random Walk Through Life If you flip the coin 5 times you will

More information

Random number generators and random processes. Statistics and probability intro. Peg board example. Peg board example. Notes. Eugeniy E.

Random number generators and random processes. Statistics and probability intro. Peg board example. Peg board example. Notes. Eugeniy E. Random number generators and random processes Eugeniy E. Mikhailov The College of William & Mary Lecture 11 Eugeniy Mikhailov (W&M) Practical Computing Lecture 11 1 / 11 Statistics and probability intro

More information

Nonlife Actuarial Models. Chapter 14 Basic Monte Carlo Methods

Nonlife Actuarial Models. Chapter 14 Basic Monte Carlo Methods Nonlife Actuarial Models Chapter 14 Basic Monte Carlo Methods Learning Objectives 1. Generation of uniform random numbers, mixed congruential method 2. Low discrepancy sequence 3. Inversion transformation

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

F denotes cumulative density. denotes probability density function; (.)

F denotes cumulative density. denotes probability density function; (.) BAYESIAN ANALYSIS: FOREWORDS Notation. System means the real thing and a model is an assumed mathematical form for the system.. he probability model class M contains the set of the all admissible models

More information

Topics in Computer Mathematics

Topics in Computer Mathematics Random Number Generation (Uniform random numbers) Introduction We frequently need some way to generate numbers that are random (by some criteria), especially in computer science. Simulations of natural

More information

. Find E(V ) and var(v ).

. Find E(V ) and var(v ). Math 6382/6383: Probability Models and Mathematical Statistics Sample Preliminary Exam Questions 1. A person tosses a fair coin until she obtains 2 heads in a row. She then tosses a fair die the same number

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

B. Maddah ENMG 622 Simulation 11/11/08

B. Maddah ENMG 622 Simulation 11/11/08 B. Maddah ENMG 622 Simulation 11/11/08 Random-Number Generators (Chapter 7, Law) Overview All stochastic simulations need to generate IID uniformly distributed on (0,1), U(0,1), random numbers. 1 f X (

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

2008 Winton. Review of Statistical Terminology

2008 Winton. Review of Statistical Terminology 1 Review of Statistical Terminology 2 Formal Terminology An experiment is a process whose outcome is not known with certainty The experiment s sample space S is the set of all possible outcomes. A random

More information

Generating the Sample

Generating the Sample STAT 80: Mathematical Statistics Monte Carlo Suppose you are given random variables X,..., X n whose joint density f (or distribution) is specified and a statistic T (X,..., X n ) whose distribution you

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Lecture 5: Random numbers and Monte Carlo (Numerical Recipes, Chapter 7) Motivations for generating random numbers

Lecture 5: Random numbers and Monte Carlo (Numerical Recipes, Chapter 7) Motivations for generating random numbers Lecture 5: Random numbers and Monte Carlo (Numerical Recipes, Chapter 7) Motivations for generating random numbers To sample a function in a statistically controlled manner (i.e. for Monte Carlo integration)

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

(x 3)(x + 5) = (x 3)(x 1) = x + 5. sin 2 x e ax bx 1 = 1 2. lim

(x 3)(x + 5) = (x 3)(x 1) = x + 5. sin 2 x e ax bx 1 = 1 2. lim SMT Calculus Test Solutions February, x + x 5 Compute x x x + Answer: Solution: Note that x + x 5 x x + x )x + 5) = x )x ) = x + 5 x x + 5 Then x x = + 5 = Compute all real values of b such that, for fx)

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Math Stochastic Processes & Simulation. Davar Khoshnevisan University of Utah

Math Stochastic Processes & Simulation. Davar Khoshnevisan University of Utah Math 5040 1 Stochastic Processes & Simulation Davar Khoshnevisan University of Utah Module 1 Generation of Discrete Random Variables Just about every programming language and environment has a randomnumber

More information

Numerical Methods I Monte Carlo Methods

Numerical Methods I Monte Carlo Methods Numerical Methods I Monte Carlo Methods Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 Dec. 9th, 2010 A. Donev (Courant Institute) Lecture

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

MATH Notebook 5 Fall 2018/2019

MATH Notebook 5 Fall 2018/2019 MATH442601 2 Notebook 5 Fall 2018/2019 prepared by Professor Jenny Baglivo c Copyright 2004-2019 by Jenny A. Baglivo. All Rights Reserved. 5 MATH442601 2 Notebook 5 3 5.1 Sequences of IID Random Variables.............................

More information

Monte Carlo Techniques

Monte Carlo Techniques Physics 75.502 Part III: Monte Carlo Methods 40 Monte Carlo Techniques Monte Carlo refers to any procedure that makes use of random numbers. Monte Carlo methods are used in: Simulation of natural phenomena

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Systems Simulation Chapter 7: Random-Number Generation

Systems Simulation Chapter 7: Random-Number Generation Systems Simulation Chapter 7: Random-Number Generation Fatih Cavdur fatihcavdur@uludag.edu.tr April 22, 2014 Introduction Introduction Random Numbers (RNs) are a necessary basic ingredient in the simulation

More information

Uniform random numbers generators

Uniform random numbers generators Uniform random numbers generators Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/tlt-2707/ OUTLINE: The need for random numbers; Basic steps in generation; Uniformly

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

A simple algorithm that will generate a sequence of integers between 0 and m is:

A simple algorithm that will generate a sequence of integers between 0 and m is: Using the Pseudo-Random Number generator Generating random numbers is a useful technique in many numerical applications in Physics. This is because many phenomena in physics are random, and algorithms

More information

A Repetition Test for Pseudo-Random Number Generators

A Repetition Test for Pseudo-Random Number Generators Monte Carlo Methods and Appl., Vol. 12, No. 5-6, pp. 385 393 (2006) c VSP 2006 A Repetition Test for Pseudo-Random Number Generators Manuel Gil, Gaston H. Gonnet, Wesley P. Petersen SAM, Mathematik, ETHZ,

More information

Random Number Generators

Random Number Generators 1/18 Random Number Generators Professor Karl Sigman Columbia University Department of IEOR New York City USA 2/18 Introduction Your computer generates" numbers U 1, U 2, U 3,... that are considered independent

More information

Monte Carlo and cold gases. Lode Pollet.

Monte Carlo and cold gases. Lode Pollet. Monte Carlo and cold gases Lode Pollet lpollet@physics.harvard.edu 1 Outline Classical Monte Carlo The Monte Carlo trick Markov chains Metropolis algorithm Ising model critical slowing down Quantum Monte

More information

11 Division Mod n, Linear Integer Equations, Random Numbers, The Fundamental Theorem of Arithmetic

11 Division Mod n, Linear Integer Equations, Random Numbers, The Fundamental Theorem of Arithmetic 11 Division Mod n, Linear Integer Equations, Random Numbers, The Fundamental Theorem of Arithmetic Bezout s Lemma Let's look at the values of 4x + 6y when x and y are integers. If x is -6 and y is 4 we

More information

Stat 451 Lecture Notes Simulating Random Variables

Stat 451 Lecture Notes Simulating Random Variables Stat 451 Lecture Notes 05 12 Simulating Random Variables Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 22 in Lange, and Chapter 2 in Robert & Casella 2 Updated:

More information

CS 246 Review of Proof Techniques and Probability 01/14/19

CS 246 Review of Proof Techniques and Probability 01/14/19 Note: This document has been adapted from a similar review session for CS224W (Autumn 2018). It was originally compiled by Jessica Su, with minor edits by Jayadev Bhaskaran. 1 Proof techniques Here we

More information

Random number generators

Random number generators s generators Comp Sci 1570 Introduction to Outline s 1 2 s generator s The of a sequence of s or symbols that cannot be reasonably predicted better than by a random chance, usually through a random- generator

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revised on April 24, 2017 Today we are going to learn... 1 Markov Chains

More information

Monte Carlo Simulations

Monte Carlo Simulations Monte Carlo Simulations What are Monte Carlo Simulations and why ones them? Pseudo Random Number generators Creating a realization of a general PDF The Bootstrap approach A real life example: LOFAR simulations

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Chapter 11 - Sequences and Series

Chapter 11 - Sequences and Series Calculus and Analytic Geometry II Chapter - Sequences and Series. Sequences Definition. A sequence is a list of numbers written in a definite order, We call a n the general term of the sequence. {a, a

More information

Lecture 5: Two-point Sampling

Lecture 5: Two-point Sampling Randomized Algorithms Lecture 5: Two-point Sampling Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized Algorithms - Lecture 5 1 / 26 Overview A. Pairwise

More information

6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch

6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch 6.045: Automata, Computability, and Complexity (GITCS) Class 17 Nancy Lynch Today Probabilistic Turing Machines and Probabilistic Time Complexity Classes Now add a new capability to standard TMs: random

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Introducing the Normal Distribution

Introducing the Normal Distribution Department of Mathematics Ma 3/13 KC Border Introduction to Probability and Statistics Winter 219 Lecture 1: Introducing the Normal Distribution Relevant textbook passages: Pitman [5]: Sections 1.2, 2.2,

More information

Recitation 2: Probability

Recitation 2: Probability Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions

More information

Statistics 100A Homework 5 Solutions

Statistics 100A Homework 5 Solutions Chapter 5 Statistics 1A Homework 5 Solutions Ryan Rosario 1. Let X be a random variable with probability density function a What is the value of c? fx { c1 x 1 < x < 1 otherwise We know that for fx to

More information

Lehmer Random Number Generators: Introduction

Lehmer Random Number Generators: Introduction Lehmer Random Number Generators: Introduction Revised version of the slides based on the book Discrete-Event Simulation: a first course LL Leemis & SK Park Section(s) 21, 22 c 2006 Pearson Ed, Inc 0-13-142917-5

More information

Math489/889 Stochastic Processes and Advanced Mathematical Finance Solutions for Homework 7

Math489/889 Stochastic Processes and Advanced Mathematical Finance Solutions for Homework 7 Math489/889 Stochastic Processes and Advanced Mathematical Finance Solutions for Homework 7 Steve Dunbar Due Mon, November 2, 2009. Time to review all of the information we have about coin-tossing fortunes

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

How does the computer generate observations from various distributions specified after input analysis?

How does the computer generate observations from various distributions specified after input analysis? 1 How does the computer generate observations from various distributions specified after input analysis? There are two main components to the generation of observations from probability distributions.

More information

Probability Distributions

Probability Distributions 02/07/07 PHY310: Statistical Data Analysis 1 PHY310: Lecture 05 Probability Distributions Road Map The Gausssian Describing Distributions Expectation Value Variance Basic Distributions Generating Random

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling

More information

CSCI-6971 Lecture Notes: Monte Carlo integration

CSCI-6971 Lecture Notes: Monte Carlo integration CSCI-6971 Lecture otes: Monte Carlo integration Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu February 21, 2006 1 Overview Consider the following

More information

Multivariate Simulations

Multivariate Simulations Multivariate Simulations Katarína Starinská starinskak@gmail.com Charles University Faculty of Mathematics and Physics Prague, Czech Republic 21.11.2011 Katarína Starinská Multivariate Simulations 1 /

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Introducing the Normal Distribution

Introducing the Normal Distribution Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 10: Introducing the Normal Distribution Relevant textbook passages: Pitman [5]: Sections 1.2,

More information

Reminder of some Markov Chain properties:

Reminder of some Markov Chain properties: Reminder of some Markov Chain properties: 1. a transition from one state to another occurs probabilistically 2. only state that matters is where you currently are (i.e. given present, future is independent

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics KTH Royal Institute of Technology jimmyol@kth.se Lecture 2 Random number generation 27 March 2014 Computer Intensive Methods

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information