A Practical Guide to Quasi-Monte Carlo Methods

Frances Y. Kuo and Dirk Nuyens

Abstract. These notes are prepared for the short course on "High-dimensional Integration: the Quasi-Monte Carlo Way", to be held at National Chiao Tung University and National Taiwan University in November 2016. We will cover basic theory and practical usage of quasi-Monte Carlo methods, with a demo of the software packages. Our aim is to make these notes easily accessible to non-experts, including students, practitioners, and potential new collaborators. We discuss only the essential concepts and hide away most of the technical details. We do not cite references in the text, but references for further reading are provided in the final section. The sections marked with * contain more theoretical background and are targeted at potential collaborators who wish to gain a deeper understanding. These sections are not necessary for students and practitioners who just want to try out quasi-Monte Carlo methods for the first time.

Date: 7 November 2016

Frances Y. Kuo, School of Mathematics and Statistics, University of New South Wales, Sydney NSW 2052, Australia. f.kuo@unsw.edu.au

Dirk Nuyens, Department of Computer Science, KU Leuven, Celestijnenlaan 200A, 3001 Leuven, Belgium. dirk.nuyens@cs.kuleuven.be


Contents

1 Introduction
1.1 High dimensional integration
1.2 Monte Carlo method
1.3 Quasi-Monte Carlo methods
2 Lattice points
2.1 Generating vector
2.2 Random shifting and practical error estimation
2.3 Fast component-by-component construction
2.4 Lattice sequences
2.5 A taste of the theoretical error analysis*
3 Digital nets
3.1 Digital net property
3.2 Digital construction
3.3 Sobol' sequences
3.4 Polynomial lattice rules
3.5 Random digital shifting and scrambling
3.6 Higher order nets by interlacing
4 Toy applications
4.1 Transformation to the unit cube
4.2 Option pricing
4.3 Maximum likelihood
4.4 PDE with a random coefficient
5 Software demo
5.1 A simple test function
5.2 The difficulty of our test function
5.3 Some technical details*
5.4 Usage of random number generators
5.5 Monte Carlo approximation
5.6 Quasi-Monte Carlo approximation
5.7 Using standard lattice point generators
5.8 Applying the theory*
5.9 Constructing point sets
5.10 Sobol' sequences, digital sequences, and interlacing
5.11 Small project
6 Further reading

1 Introduction

High dimensional problems are coming to play an ever more important role in applications. They pose immense challenges for practical computation, because of a nearly inevitable tendency for the cost of computation to increase exponentially with dimension. Effective and efficient methods that do not suffer from this curse of dimensionality are in great demand. Quasi-Monte Carlo (QMC) methods can lift this curse and we will show you how.

1.1 High dimensional integration

We begin with an integral formulated over the s-dimensional unit cube [0,1]^s,

    I(f) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_s) \, dx_1 \cdots dx_s = \int_{[0,1]^s} f(x) \, dx,

where the number of integration variables, the dimensionality s, is large, e.g., hundreds or thousands or more. (Note that an expectation can be written as an integral. Later we will discuss the important question of how to transform an integral from practical applications into this form.)

One approach that comes to mind is to approximate this integral by a product rule, i.e., each one-dimensional integral is approximated by your favorite one-dimensional quadrature rule, e.g., rectangle rule, Simpson rule, Gauss rule, etc. But this would not work: with 100 integration variables, even if you have just 2 quadrature points in each coordinate direction, you would already require 2^100 ≈ 10^30 evaluations of the integrand f, and your computation would never finish in your lifetime! So, forget about product rules in high dimensions! (There is a class of methods called sparse grids which cleverly leaves out some product points; that's a story for another day.)

1.2 Monte Carlo method

The Monte Carlo method, or MC method in short, approximates the integral by averaging random samples of the function,

    Q_n(f) = \frac{1}{n} \sum_{k=0}^{n-1} f(t_k),    (1)

where the sample points t_0, \ldots, t_{n-1} are independent and uniformly distributed over the unit cube. This is a very simple and widely used method. It can be deployed as long as the integrand is square integrable. Apart from the ease of use, the Monte Carlo method has the advantage of producing an unbiased estimate of the integral, i.e., E[Q_n(f)] = I(f). It can be easily shown that the root-mean-square error of the Monte Carlo method satisfies

    \sqrt{E\,|I(f) - Q_n(f)|^2} = \frac{\sigma(f)}{\sqrt{n}},

where \sigma^2(f) := I(f^2) - (I(f))^2 is the variance of f. So we say that the Monte Carlo method converges like order 1/\sqrt{n}, and we write O(1/\sqrt{n}). In concrete terms, this means that if you want to reduce your error in half, then you need to use 4 times as many sample points. This convergence rate is often too slow for practical applications. The variance of f is generally not explicitly known, but in practice we can estimate the root-mean-square error by

    \sqrt{E\,|I(f) - Q_n(f)|^2} \approx \sqrt{\frac{1}{n(n-1)} \sum_{k=0}^{n-1} \bigl(f(t_k) - Q_n(f)\bigr)^2}.

1.3 Quasi-Monte Carlo methods

Quasi-Monte Carlo methods, or QMC methods in short, take the same form (1) as the Monte Carlo method in the unit cube, but instead of generating the sample points t_k randomly, we choose them deterministically in a clever way to be more uniformly distributed than random points, so that they have a faster rate of convergence. All QMC theoretical error bounds take the common form of a product

    |I(f) - Q_n(f)| \le D(t_0, \ldots, t_{n-1}) \, V(f),    (2)

with one factor depending only on the points and the other depending only on the integrand. In the classical theory these two factors are called the discrepancy of the points and the variation of f, respectively. If the integrand f has sufficient smoothness, e.g., can be differentiated once with respect to each variable, then classical theory tells us that certain QMC methods can converge like O((\log n)^s / n); they are referred to as low-discrepancy sequences. The convergence rates can be even higher for periodic integrands.

The drawback of the classical QMC theory is that the error bound and implied constant grow exponentially with dimension s, so the theory is not useful when s is very large. A remedy is provided in modern QMC theory by working with weighted function spaces: the error bound can be independent of s as long as the

integrand f has the appropriate property that there is some varying degree of importance between the variables. A taste of this modern theory is given in §2.5. We then have a very similar, but modern, interpretation of (2) in the form

    |I(f) - Q_n(f)| \le e_\gamma(t_0, \ldots, t_{n-1}) \, \|f\|_\gamma,

where the first factor is now called the worst case error of the QMC method in a weighted function space with weights γ, and the second factor is the norm of f in that same weighted space.

There are two main families of QMC methods: lattice rules and digital nets. They represent different approaches to achieving uniformity of the points. Here we will introduce these methods, providing a bit more detail on lattice rules while touching on only some basic principles of digital nets.

2 Lattice points

Lattice rules have been around since the 1950s and they are very easy to specify and use: all you need is one integer vector with s components.

2.1 Generating vector

Given an integer vector z = (z_1, \ldots, z_s) known as the generating vector, a (rank-1) lattice rule with n points takes the form

    Q_n(f) = \frac{1}{n} \sum_{k=0}^{n-1} f\left(\left\{\frac{kz}{n}\right\}\right) = \frac{1}{n} \sum_{k=0}^{n-1} f\left(\frac{kz \bmod n}{n}\right),    (3)

where the braces around a vector indicate that we take the fractional part of each component in the vector, e.g., {(1.8, 2.3)} = (0.8, 0.3), which is clearly equivalent to carrying out the modulo n operation in the numerator as indicated in (3). Figure 1 (left) illustrates a 64-point lattice rule in 2D.

The quality of the lattice rule depends on the choice of the generating vector. Due to the modulo operation, it suffices to consider the values from 1 up to n-1, leaving out 0 which is clearly a bad choice. Furthermore, we restrict the values to those relatively prime to n, to ensure that every one-dimensional projection of the n points yields n distinct values. Thus we write z ∈ U_n^s, with

    U_n := \{ z \in \mathbb{Z} : 1 \le z \le n-1 \text{ and } \gcd(z, n) = 1 \}.

For theoretical analysis we often assume that n is prime to simplify some number theory arguments. For practical application we often take n to be a power of 2. The total number of possible choices for the generating vector is then (n-1)^s and (n/2)^s, respectively. Even if we have a criterion to assess the quality of the generating vectors, there are simply too many choices to carry out an exhaustive search when n and s are large. Later we will return to this issue of constructing a good generating vector.

2.2 Random shifting and practical error estimation

We can shift the points of a lattice rule by any vector of real numbers Δ = (Δ_1, \ldots, Δ_s), to obtain a shifted lattice rule

    Q_n(f) = \frac{1}{n} \sum_{k=0}^{n-1} f\left(\left\{\frac{kz}{n} + \Delta\right\}\right).
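As a minimal Matlab sketch, here is how one might generate the points of formula (3) and of a shifted variant, vectorized as an s-by-n array; the generating vector below is just an illustrative choice, not a recommended one.

% generate a (shifted) rank-1 lattice rule, points stored as columns
n = 64; s = 2;
z = [1; 19];                                % hypothetical generating vector
P = mod(z * (0:n-1), n) / n;                % [s-by-n], kth column is {k z / n}
Delta = [0.1; 0.3];                         % the shift used in Figure 1
Pshift = mod(bsxfun(@plus, P, Delta), 1);   % shifted points, wrapped into [0,1)^s

Here bsxfun expands the shift over all n columns, and taking mod 1 is exactly the fractional part operation indicated by the braces.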

Fig. 1 Applying a (0.1, 0.3)-shift to a 64-point lattice rule in two dimensions: left, the original lattice rule; middle, moving all points by (0.1, 0.3); right, wrapping the points back inside the unit cube.

Due to the fractional part function, we may restrict the shift to Δ ∈ [0,1)^s. Figure 1 (right) illustrates the result of shifting a 64-point lattice rule in 2D by the vector (0.1, 0.3). Clearly we see that the regular structure of the lattice points is preserved.

A randomly shifted lattice rule provides an unbiased approximation of the integral (this applies to all QMC methods), while using multiple shifts allows us to obtain a practical error estimate in the same way as the Monte Carlo method. It works as follows. We generate q independent random shifts Δ^{(i)} for i = 0, \ldots, q-1 from the uniform distribution on [0,1]^s. For the same fixed lattice generating vector z, we compute the q different shifted lattice rule approximations and denote them by Q_n^{(i)}(f) for i = 0, \ldots, q-1. We take the average

    \bar{Q}_{n,q}(f) = \frac{1}{q} \sum_{i=0}^{q-1} Q_n^{(i)}(f) = \frac{1}{q} \sum_{i=0}^{q-1} \left( \frac{1}{n} \sum_{k=0}^{n-1} f\left(\left\{\frac{kz}{n} + \Delta^{(i)}\right\}\right) \right)

as our final approximation to the integral. Then an estimate for the root-mean-square error of \bar{Q}_{n,q}(f) is given by

    \sqrt{E\,|I(f) - \bar{Q}_{n,q}(f)|^2} \approx \sqrt{\frac{1}{q(q-1)} \sum_{i=0}^{q-1} \bigl(Q_n^{(i)}(f) - \bar{Q}_{n,q}(f)\bigr)^2}.

Here the expectation is taken with respect to the random shifts. The total number of function evaluations in \bar{Q}_{n,q}(f) is qn. Typically, we take q to be small, e.g., q = 16 or 32. For a fair comparison with the Monte Carlo method, we should therefore take n_MC = q n_QMC samples in the Monte Carlo method.
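As a minimal sketch, the whole procedure fits in a few lines of Matlab; f is assumed to be vectorized over the columns of an s-by-n array, as in §5.

function [Qbar, stderr] = shifted_lattice(f, z, n, q)
% randomly shifted lattice rule with q shifts and RMS error estimate
s = numel(z);
P = mod(z(:) * (0:n-1), n) / n;                  % plain lattice points, [s-by-n]
Qi = zeros(1, q);
for i = 1:q
    Pi = mod(bsxfun(@plus, P, rand(s, 1)), 1);   % ith randomly shifted point set
    Qi(i) = mean(f(Pi));                         % ith approximation Q_n^{(i)}(f)
end
Qbar = mean(Qi);                                 % the average over all shifts
stderr = sqrt(sum((Qi - Qbar).^2) / (q*(q-1)));  % RMS error estimate
end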

2.3 Fast component-by-component construction

Recall that the components of the generating vector can be restricted to the set U_n, e.g., U_n = {1, 2, 3, \ldots, n-1} when n is prime, and U_n = {1, 3, 5, \ldots, n-1} when n is a power of 2. There are far too many choices in high dimensions. Suppose that we have a computable criterion for assessing the quality of a generating vector in dimension s, denoted by E_s(z_1, \ldots, z_s), where smaller is better. Then we can use the component-by-component construction to find a generating vector:

1. Set z_1 = 1 (because in one dimension all choices are the same).
2. Choose z_2 from the set U_n so that E_2(z_1, z_2) is minimized.
3. Choose z_3 from the set U_n so that E_3(z_1, z_2, z_3) is minimized.
4. Choose z_4 from the set U_n so that E_4(z_1, z_2, z_3, z_4) is minimized.
...

The fact that such a greedy algorithm can produce good generating vectors is justified by theory, and we will say more about this in §2.5. The computational cost of the algorithm depends on the form of the criterion E_s(z_1, \ldots, z_s). We have the fast component-by-component construction: in some favourable situations the cost is O(s n \log n) operations, i.e., linear in the dimension s and almost linear in the number of points n. This means that we can really construct generating vectors in tens of thousands of dimensions with millions of points!

The magic behind the fast component-by-component construction is that in many cases the algorithm requires the evaluation of a matrix-vector multiplication with a matrix of the form

    \left[ \omega\left(\frac{kz \bmod n}{n}\right) \right]_{z \in U_n, \; 1 \le k \le n-1}

for some function ω. When n is prime, we can permute the rows and columns of this matrix to obtain a circulant matrix, so that the matrix-vector multiplication, which typically requires O(n^2) operations, can be done in O(n \log n) operations using Fast Fourier Transforms. When n is not prime it gets more complicated, but similar cost savings can be made. Figure 2 illustrates the structure of such a matrix (left) for n = 53 and the corresponding matrix after permutation (right).

2.4 Lattice sequences

Recall that the formula for obtaining the kth point of an n-point lattice rule with generating vector z is

    t_k = \left\{ \frac{k}{n} z \right\}.    (4)

Fig. 2 Circulant permutation for n = 53 (prime) for the fast component-by-component construction.

This gives rise to a so-called closed QMC method: the generation of the points depends on knowing n in advance. This is inconvenient in practice, because if we want to change the number of points we would need to generate all of the points from scratch. An open QMC method, on the contrary, allows you to keep adding points as you wish while keeping all existing points; such methods are therefore referred to as sequences and are also said to be extensible. In a lattice sequence in base 2, the formula is changed to

    t_k = \{ \phi_2(k) \, z \},    (5)

where \phi_2(\cdot) is the radical inverse function in base 2: loosely speaking, if we have the index k = (k_2 k_1 k_0)_2 in binary representation, then \phi_2(k) = (0.k_0 k_1 k_2)_2 is obtained by mirroring the bits of k around the binary point. For example, if k = 6 = (110)_2 then \phi_2(k) = (0.011)_2 = 0.375. The formula (5) does not require you to know n in advance, and so in practice you can keep adding points to your lattice rule approximation until you are satisfied with the error.

When n = 2^m for any m ≥ 1, the formulas (4) and (5) produce the same set of points; only the ordering of the points is different. Therefore, if you want the points of an extensible lattice rule only at exact powers of 2, you can avoid the radical inverse function and still use the formula (4) to get your points. For example, if you already have n = 2^m points for some m, then to double the number of points all you need to do is use the formula (4) with n replaced by 2n, and then consider only those points generated by the odd indices k. All of the above extends trivially to base b ≥ 2.

We know how to construct a good generating vector for a lattice sequence using the fast component-by-component construction, i.e., this generating vector can be used for many different values of n. Figure 3 illustrates the nested structure in the matrices when we work with powers of 2 and how we exploit the nested circulant structure.
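A minimal Matlab sketch of the radical inverse in base 2, looping over the bits of a nonnegative integer index k:

function v = radical_inverse2(k)
% mirror the binary digits of k around the binary point
v = 0; f = 0.5;
while k > 0
    v = v + f * mod(k, 2);   % append the lowest remaining bit of k
    k = floor(k / 2);
    f = f / 2;
end
end

For example, radical_inverse2(6) returns 0.375, and the kth point of a lattice sequence as in (5) is then mod(radical_inverse2(k) * z, 1) for a generating (column) vector z.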

Fig. 3 Number theoretic permutations on a matrix with n = 128 (a power of 2) for the fast component-by-component construction: (1) natural ordering of the indices; (2) grouping on divisors; (3) generator ordering of the indices; (4) symmetric reduction after application of the B_2 kernel function.

2.5 A taste of the theoretical error analysis*

Here we discuss some key elements of the theory and construction of lattice rules. It is not necessary to understand all of this to be able to use lattice rules. We therefore mark this subsection as optional material (*). The reader may skip to the next section.

Weighted Sobolev space: what kind of integrands can we handle? In the modern analysis of randomly shifted lattice rules, we assume that the integrand f belongs to a weighted Sobolev space of functions whose mixed first derivatives are square-integrable, with the norm given by

    \|f\|_\gamma^2 = \sum_{u \subseteq \{1,\ldots,s\}} \frac{1}{\gamma_u} \int_{[0,1]^{|u|}} \left( \int_{[0,1]^{s-|u|}} \frac{\partial^{|u|} f}{\partial x_u}(x) \, dx_{-u} \right)^2 dx_u.    (6)

There are different variants of the norm, but ultimately it is a way to measure the regularity and variability of the function.

Okay, this is a hell of a formula to take in. Let us explain what it means step by step. There are 2^s possible subsets u of the coordinate indices {1, \ldots, s}. Let us pick a simple example first, say, s = 5 and u = {1, 3, 4}. Then we separate the active variables x_u = (x_1, x_3, x_4) from the inactive variables x_{-u} = (x_2, x_5), and consider

    \frac{\partial^{|u|} f}{\partial x_u} = \frac{\partial^3 f}{\partial x_1 \partial x_3 \partial x_4}.

This is called a mixed first derivative because we never differentiate more than once with respect to each variable, even though it looks like a 3rd order derivative in the regular sense. According to the norm, we should integrate out the inactive variables, square the result, and then integrate out the active variables:

    \int_0^1 \int_0^1 \int_0^1 \left( \int_0^1 \int_0^1 \frac{\partial^3 f}{\partial x_1 \partial x_3 \partial x_4}(x_1, x_3, x_4; x_2, x_5) \, dx_2 \, dx_5 \right)^2 dx_1 \, dx_3 \, dx_4.    (7)

We do this for each of the 2^s subsets of {1, \ldots, s} and then sum up the results, but with weights \gamma_u > 0 acting as relative scaling. A large value for (7) means that f is more variable in the projection onto (x_1, x_3, x_4), and we need a larger weight \gamma_{\{1,3,4\}} to compensate for it in the norm if we want f to have norm 1. We denote the norm with a subscript γ to emphasize the important role played by the weights. We will see below that under appropriate conditions on the weights we can obtain error bounds that are independent of the dimension s. In practice we would choose the weights to match the characteristics of a given integrand.

The simplest form of weights are the so-called product weights: we assume that there is one weight \gamma_j > 0 associated with each variable x_j, so that \gamma_u = \prod_{j \in u} \gamma_j, e.g., \gamma_{\{1,3,4\}} = \gamma_1 \gamma_3 \gamma_4. Typically we also assume that \gamma_1 \ge \gamma_2 \ge \cdots > 0, indicating that the variables are labeled in the order of decreasing importance. Another form of weights that has become popular in recent times is called POD weights, or product and order dependent weights. There is an additional sequence of numbers \Gamma_\ell such that \gamma_u = \Gamma_{|u|} \prod_{j \in u} \gamma_j, i.e., the weights have an extra multiplying factor which depends on the number of elements in the set u, hence the name order dependent. POD weights arise from some PDE applications and often some factorials |u|! appear; we won't discuss them further here.

Worst case error: how do we assess the quality of a lattice rule? The worst case error for a shifted lattice rule in our weighted Sobolev space is defined to be the largest possible error for any function with norm at most 1, i.e.,

    e_\gamma(z, \Delta) := \sup_{\|f\|_\gamma \le 1} |I(f) - Q_n(f)|.

This means that for any given f in our weighted Sobolev space, we have the lattice rule error bound

    |I(f) - Q_n(f)| \le e_\gamma(z, \Delta) \, \|f\|_\gamma.

For a randomly shifted lattice rule, we have the root-mean-square error bound

    \sqrt{E\,|I(f) - Q_n(f)|^2} \le e^{\rm sh}_\gamma(z) \, \|f\|_\gamma,    (8)

where the expectation is with respect to the random shift Δ, and where

    e^{\rm sh}_\gamma(z) := \sqrt{ \int_{[0,1]^s} e_\gamma^2(z, \Delta) \, d\Delta }

is called the shift-averaged worst case error. Notice the separation of the dependence of the error bound on the points from the dependence on the integrand, similarly to (2), but here the weights enter both factors. There is a trade-off: large weights lead to a small norm but a large worst case error, and vice versa.

Our weighted Sobolev space happens to be a reproducing kernel Hilbert space. We will not go into any details here, other than saying that this provides a very powerful set of tools for analysis, and that we have explicit computable formulas for e_\gamma(z, \Delta) and e^{\rm sh}_\gamma(z). In particular, with product weights we know that

    [e^{\rm sh}_\gamma(z)]^2 = -1 + \frac{1}{n} \sum_{k=0}^{n-1} \prod_{j=1}^{s} \left( 1 + \gamma_j B_2\left(\left\{\frac{k z_j}{n}\right\}\right) \right),    (9)

where B_2(x) = x^2 - x + 1/6 for x ∈ [0,1] is the Bernoulli polynomial of degree 2.

Component-by-component construction: how do we find a good lattice generating vector? Given weights \gamma_j and a generating vector z, we can evaluate (9) in O(ns) operations. Theoretically we could do this for each of the (n-1)^s choices of generating vectors when n is prime and then pick the vector with the smallest worst case error. This is however not practically possible when s is large. We will choose the generating vector by the component-by-component construction: given n, s_max, and weights \gamma_u,

1. Set z_1 = 1.
2. For s = 2, 3, \ldots, s_max, choose z_s in U_n to minimize [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1}, z_s)]^2.

A naive implementation of (9) and of this algorithm is sketched below.
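The following is a minimal Matlab sketch, assuming product weights given as a row vector gamma and assuming wce2 is saved in its own file wce2.m. It evaluates the squared shift-averaged worst case error (9) directly, and runs the component-by-component search by brute force over all candidates, at cost O(n^2 s) per dimension; the fast construction replaces this inner search by FFTs but is beyond this sketch.

function e2 = wce2(z, n, gamma)
% squared shift-averaged worst case error (9) for product weights
x = mod((0:n-1)' * z(:)', n) / n;     % {k z_j / n}, [n-by-s]
B2 = x.^2 - x + 1/6;                  % Bernoulli polynomial of degree 2
e2 = -1 + mean(prod(1 + bsxfun(@times, gamma(1:numel(z)), B2), 2));
end

% naive component-by-component search for n a power of 2
n = 2^10; smax = 10; gamma = 0.9.^(1:smax);   % hypothetical weights
z = 1;                                        % z_1 = 1
for s = 2:smax
    cand = 1:2:n-1;                           % U_n = odd numbers for n a power of 2
    e2 = arrayfun(@(zs) wce2([z zs], n, gamma), cand);
    [~, i] = min(e2);
    z = [z cand(i)];                          % keep the minimizer
end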

To prove that this greedy algorithm gives a good lattice rule we use mathematical induction, combined with the so-called averaging argument. First we present a simple result to illustrate the proof technique.

Theorem 1. Let n ≥ \gamma_1/6 be prime. A lattice rule can be constructed by the component-by-component algorithm such that

    [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1}, z_s)]^2 \le \frac{1}{n} \prod_{j=1}^{s} \left( 1 + \frac{\gamma_j}{6} \right).

By taking the square root on both sides, we see that this simple result gives us only the O(1/\sqrt{n}) convergence rate, the same as the Monte Carlo method. On the other hand, since

    \prod_{j=1}^{s} \left( 1 + \frac{\gamma_j}{6} \right) = \exp\left( \sum_{j=1}^{s} \log\left( 1 + \frac{\gamma_j}{6} \right) \right) \le \exp\left( \frac{1}{6} \sum_{j=1}^{s} \gamma_j \right),

we see that our error bound can be independent of the dimension s provided that the sum of the infinite sequence of weights is finite, i.e.,

    \sum_{j=1}^{\infty} \gamma_j < \infty.

Here is a brief synopsis of the proof. From (9) we can write the error expression in s dimensions in terms of the error expression in s-1 dimensions as follows:

    [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1}, z_s)]^2 = [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1})]^2 + \frac{\gamma_s}{n} \sum_{k=0}^{n-1} \left[ B_2\left(\left\{\frac{k z_s}{n}\right\}\right) \prod_{j=1}^{s-1} \left( 1 + \gamma_j B_2\left(\left\{\frac{k z_j}{n}\right\}\right) \right) \right].

Then we take the average of this expression over all choices of z_s from U_n. When n is prime this means that we take

    A(z_1, \ldots, z_{s-1}) = \frac{1}{n-1} \sum_{z_s = 1}^{n-1} [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1}, z_s)]^2.

Since the only dependence on z_s is in the first B_2 factor, we end up having to compute

    \frac{1}{n-1} \sum_{z_s = 1}^{n-1} B_2\left(\left\{\frac{k z_s}{n}\right\}\right),

which equals 1/6 if k = 0 and -1/(6n) otherwise. We combine this with the induction hypothesis on [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1})]^2 to show that

    A(z_1, \ldots, z_{s-1}) \le \frac{1}{n} \prod_{j=1}^{s} \left( 1 + \frac{\gamma_j}{6} \right).

Now since we take z_s to be the value that minimizes [e^{\rm sh}_\gamma(z_1, \ldots, z_{s-1}, z_s)]^2, it must be bounded by the average A(z_1, \ldots, z_{s-1}) and in turn bounded by the required upper bound.

A more sophisticated averaging argument can be used to prove that we can get close to O(1/n) convergence. We state the result below for general weights \gamma_u and general n.

Theorem 2. A lattice rule can be constructed by the component-by-component algorithm such that

    e^{\rm sh}_\gamma(z) \le \left( \frac{1}{|U_n|} \sum_{\emptyset \ne u \subseteq \{1,\ldots,s\}} \gamma_u^\lambda \left( \frac{2\zeta(2\lambda)}{(2\pi^2)^\lambda} \right)^{|u|} \right)^{1/(2\lambda)}

for all λ ∈ (1/2, 1], where \zeta(x) = \sum_{k=1}^{\infty} k^{-x} is the Riemann zeta function. We have |U_n| = n-1 when n is prime, |U_n| = n/2 when n is a power of 2, and more generally |U_n| ≥ n/2 when n is a power of a prime.

The convergence rate close to O(1/n) is obtained by taking the parameter λ arbitrarily close to 1/2. This imposes stronger decay requirements on the weights \gamma_u if we want to end up with a bound that is independent of s. In particular, if we have product weights, then to have a convergence rate close to O(1/n) we need

    \sum_{j=1}^{\infty} \sqrt{\gamma_j} < \infty.

We concede that this theorem is way too technical for the purpose of this introduction, but we just want to provide a taste of the analysis, as the heading of this subsection foreshadowed.

3 Digital nets

Here we take a very informal approach to introduce the family of digital nets.

3.1 Digital net property

Loosely speaking, the general principle of digital nets is all about getting the same number of points in various allowable sub-divisions of the unit cube. This is similar in spirit to the Sudoku game! Figure 4 illustrates the digital net property in 2D with 16 points. We can partition the unit square into 16 rectangles of the same shape and size. There is exactly one point in each rectangle (points on the top and right boundaries count toward the next rectangle), and this must hold for all of the 5 possible ways to sub-divide the unit square.

Fig. 4 Illustration of a (0,4,2)-net in base 2: every elementary interval of volume 1/16 contains exactly one of the 16 points. A point that lies on a dividing line counts toward the interval above or to the right.

This property generalizes to base b ≥ 2: instead of halving each time, we subdivide into b equal partitions. The property also generalizes to include a quality parameter called the t-value: each allowable sub-division (formally called an elementary interval) contains exactly b^t points. The smaller t is, the finer we can sub-divide the unit cube, and the more uniformly distributed the points are. Such a point set with n = b^m points in s dimensions is called a (t,m,s)-net. Figure 4 is an example of a (0,4,2)-net in base 2. A (t,s)-sequence is a sequence of points in s dimensions such that if we chop the sequence into consecutive blocks of b^m points then every block is a (t,m,s)-net.

3.2 Digital construction

Needless to say, we cannot design digital nets in high dimensions by hand drawing rectangles or boxes. We construct digital nets by a digital construction scheme. Recall that to construct lattice rules we need a generating vector of integers, one integer per dimension. To construct a digital net we need a vector of generating matrices C_1, \ldots, C_s, one generating matrix per dimension. Here is how it works in base 2 (it easily generalizes to base b ≥ 2). Suppose we want n = 2^m points. To get the jth component of the kth point, we write k = (k_{m-1} \cdots k_1 k_0)_2 in binary representation, take the m × m binary matrix C_j for dimension j, and compute

    \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = C_j \begin{pmatrix} k_0 \\ k_1 \\ \vdots \\ k_{m-1} \end{pmatrix},    (10)

where all additions and multiplications are carried out modulo 2. Then the jth component of the kth point is (0.y_1 y_2 \cdots y_m)_2.

Just as the choice of generating vector determines the quality of a lattice rule, here the choice of the generating matrices determines the quality of a digital net: the corresponding t-value of the net can be small (good) or large (bad). In the case of a lattice rule we need only one integer value z_j in dimension j, but here we need to specify m^2 entries for the binary matrix C_j. Finding good generating matrices can be a difficult task. Below we discuss two special cases of digital net construction. Before we proceed, note that we can think of C_j in (10) as the top-left hand corner of some bigger matrix, and it does not even have to be a square matrix.
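A minimal Matlab sketch of (10): given an m-by-m binary generating matrix C, compute the jth component of the kth point.

function x = digital_point(C, k)
% one component of the kth digital net point in base 2
m = size(C, 1);
kbits = mod(floor(k ./ 2.^(0:m-1)), 2)';   % bits (k_0, ..., k_{m-1}) as a column
y = mod(C * kbits, 2);                     % digits y_1, ..., y_m, modulo 2
x = y' * 2.^(-(1:m))';                     % assemble x = (0.y_1 y_2 ... y_m)_2
end

As a sanity check, taking C to be the identity matrix reproduces the radical inverse of §2.4: digital_point(eye(4), 6) returns 0.375.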

3.3 Sobol' sequences

Sobol' points are a popular example of digital nets in base 2, and they have been around since long before the general concept of digital nets took shape. (Sobol' is a Russian name; the apostrophe is not a typo! It denotes a soft pronunciation of the letter l.) To generate Sobol' points, we need one primitive polynomial and some initial direction numbers for every dimension. Primitive polynomials have specific properties which we will not go into here, but it is well known how many there are of a given degree and also what they are. Since we need a different primitive polynomial for each dimension, and since the quality of the Sobol' points deteriorates when the degree of the polynomial increases, we arrange all the primitive polynomials in order of increasing degree so that we use up all the lower degree polynomials first. The initial direction numbers are used to kick start a recurrence relation involving the coefficients of the primitive polynomial in each dimension. These eventually lead to the entries of the generating matrix C_j. Again we will not go into the technical details here. Many software packages include implementations of Sobol' generators, e.g., Matlab, NAG, QuantLib. We also provide our own Sobol' generators for more than 20,000 dimensions.

3.4 Polynomial lattice rules

Another good way to get digital nets is by the polynomial lattice rule construction. They are actually digital nets rather than lattice rules, but in some formulation they mimic lattice rules, hence the name. Instead of having one generating vector of integers z_1, \ldots, z_s, we need a generating vector of polynomials q_1(\chi), \ldots, q_s(\chi), one polynomial per dimension. Let p(\chi) be a polynomial of degree m with binary coefficients. In dimension j, we have a polynomial q_j(\chi) of degree at most m-1 with binary coefficients. We find the binary digits u_1, u_2, \ldots in

    \frac{q_j(\chi)}{p(\chi)} = \frac{u_1}{\chi} + \frac{u_2}{\chi^2} + \frac{u_3}{\chi^3} + \cdots

by equating coefficients in q_j(\chi) = (u_1/\chi + u_2/\chi^2 + u_3/\chi^3 + \cdots) \, p(\chi), noting that all additions and multiplications are to be done modulo 2. Then we set

    C_j = \begin{pmatrix}
    u_1 & u_2 & u_3 & \cdots & u_m \\
    u_2 & u_3 & & & u_{m+1} \\
    u_3 & & & & \vdots \\
    \vdots & & & & \\
    u_m & u_{m+1} & u_{m+2} & \cdots & u_{2m-1}
    \end{pmatrix}.
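A minimal Matlab sketch of this construction, assuming polynomials are given as binary coefficient vectors with the highest degree first (e.g., [1 0 0 1 1] for \chi^4 + \chi + 1, so p has length m+1 and q has length at most m): the digits u_1, u_2, \ldots come from long division of q_j(\chi) by p(\chi) over GF(2), and C_j is then the Hankel matrix above.

function C = pollat_genmat(q, p, m)
% generating matrix of a polynomial lattice rule in base 2
u = zeros(1, 2*m-1);                 % digits u_1, ..., u_{2m-1}
r = [zeros(1, m+1-numel(q)), q];     % remainder, aligned to degree m
for i = 1:2*m-1
    r = [r(2:end), 0];               % multiply the remainder by chi
    u(i) = r(1);                     % next digit = coefficient of chi^m
    if u(i)
        r = mod(r + p, 2);           % subtract p over GF(2) (an XOR)
    end
end
C = hankel(u(1:m), u(m:2*m-1));      % C(i,k) = u_{i+k-1}
end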

The polynomial p(\chi) is called the modulus, and it does not play a crucial role. The quality of a polynomial lattice rule is determined by the choice of the generating polynomials q_1(\chi), \ldots, q_s(\chi). Nowadays we have the theory and algorithms to find good polynomials using the fast component-by-component construction, analogously to lattice rules. All of this generalizes to base b ≥ 2.

3.5 Random digital shifting and scrambling

To preserve the digital net property, we need a different kind of randomization strategy than shifting, which preserves the lattice structure. One simple strategy is digital shifting. Instead of taking {t_k + Δ} for the kth point, we compute t_k ⊕ Δ, which means we carry out the exclusive-or operation on the binary bits of the vector components. For example, if x = 0.625 = (0.101)_2 and y = 0.125 = (0.001)_2, then x ⊕ y = (0.100)_2 = 0.5 (see the sketch at the end of this section).

Scrambling is a more sophisticated randomization technique which can improve the convergence rate of digital nets by an extra factor of 1/\sqrt{n} in some circumstances. Figure 5 illustrates the concept of scrambling as a sequence of pictures in 2D where slices are randomly swapped following some allowable conditions that preserve the digital net property. As for lattice rules, randomization of digital nets provides an unbiased estimate of the integral as well as a practical error estimate.

3.6 Higher order nets by interlacing

There is a strategy called interlacing which can turn a regular digital net into a higher order digital net. Higher order digital nets can achieve O(1/n^α) convergence if the integrand is roughly α times differentiable in each variable. We need a different function space setting to the one in §2.5, and the theory is quite challenging. Conceptually, to get a higher order digital net in s dimensions with interlacing factor α, we take a regular digital net in α·s dimensions, and then interlace every block of α dimensions. Interlacing works as follows: if we have x = (0.x_1 x_2 x_3)_2, y = (0.y_1 y_2 y_3)_2 and z = (0.z_1 z_2 z_3)_2, then the result of interlacing these three numbers is (0.x_1 y_1 z_1 x_2 y_2 z_2 x_3 y_3 z_3)_2. This corresponds to an interlacing factor of 3, and we end up with a number that has three times as many bits as the original numbers.
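Two minimal Matlab sketches of the bit manipulations just described. The first mirrors the digital shift example above, XORing numbers with m = 3 binary digits as integers; the second interlaces the digits of alpha numbers, as in the three-number example.

% digital shift via integer XOR: 0.625 xor 0.125 for m = 3 digits
m = 3;
bitxor(0.625 * 2^m, 0.125 * 2^m) / 2^m    % returns 0.5

function v = interlace_digits(xs, m)
% interlace the first m binary digits of the numbers in the vector xs
alpha = numel(xs);
digits = mod(floor(xs(:) * 2.^(1:m)), 2);  % [alpha-by-m] array of binary digits
v = digits(:)' * 2.^(-(1:alpha*m))';       % read the digits column by column
end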

Fig. 5 Owen's scrambling in base 2: (a) original; (b) swap left and right halves; (c) swap 3rd and 4th vertical quarters; (d) swap 3rd and 4th, 7th and last vertical eighths; (e) swap 3rd and 4th, 7th and 8th, 9th and 10th, 15th and last sixteenths; (f) swap 1st and 2nd horizontal quarters; (g) swap 1st and 2nd, 5th and 6th, 7th and last horizontal eighths; (h) swap 3rd and 4th, 7th and 8th, 9th and 10th, 15th and last horizontal sixteenths.

In practice an efficient way to implement higher order digital nets is by interlacing the rows of the generating matrices of the regular digital net, and then generating the points from these expanded matrices by allowing non-square matrices in (10). Note that precision is a practical issue for higher order digital nets: if we want n = 2^m points with interlacing factor α, then under standard double precision machine numbers we can only manage αm ≤ 53. For example, we can only get up to order α = 3 with 2^16 points.
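A minimal sketch of this row interlacing, assuming the α generating matrices of one block of dimensions are collected in a cell array with equal row counts; the rows of the result are taken from the input matrices in turn.

function Cint = interlace_rows(Cs)
% row-interlace the generating matrices of one block of alpha dimensions
alpha = numel(Cs);
Cint = zeros(alpha * size(Cs{1}, 1), size(Cs{1}, 2));
for a = 1:alpha
    Cint(a:alpha:end, :) = Cs{a};   % rows a, a+alpha, a+2*alpha, ...
end
end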

4 Toy applications

In this section we outline three integrals which arise from grossly simplified models of practical applications. We begin with a discussion of the transformation needed to bring the integral into the unit cube.

4.1 Transformation to the unit cube

Question 1. Given an integral \int_{-\infty}^{\infty} g(y) \phi(y) \, dy, where φ : R → R is some univariate probability density function, i.e., \phi(y) \ge 0 for all y ∈ R and \int_{-\infty}^{\infty} \phi(y) \, dy = 1, how do we transform the integral into [0,1]?

Answer. Let Φ : R → [0,1] denote the cumulative distribution function of φ, i.e.,

    \Phi(y) = \int_{-\infty}^{y} \phi(t) \, dt,

and let \Phi^{-1} : [0,1] → R denote its inverse. Then we use the substitution (or change of variables)

    x = \Phi(y), \qquad y = \Phi^{-1}(x),

to obtain

    \int_{-\infty}^{\infty} g(y) \phi(y) \, dy = \int_0^1 g(\Phi^{-1}(x)) \, dx = \int_0^1 f(x) \, dx,

with the transformed integrand f := g \circ \Phi^{-1}.
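A minimal Matlab sketch for the standard normal density: norminv (from the Statistics Toolbox) is the inverse cumulative distribution function \Phi^{-1}, and the integrand g below is just a hypothetical example. The mapping is applied componentwise, anticipating Question 3 below.

s = 4; n = 1000;
x = rand(s, n);                 % points in the unit cube (QMC points also work)
y = norminv(x);                 % y = Phi^{-1}(x), componentwise
g = @(y) exp(-sum(y.^2, 1));    % hypothetical vectorized integrand
Q = mean(g(y));                 % estimate of the original integral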

Question 2. Is this the only way?

Answer. No. We can divide and multiply by any other probability density function \tilde\phi, and then map to [0,1] using its inverse cumulative distribution function \tilde\Phi^{-1}:

    \int_{-\infty}^{\infty} g(y) \phi(y) \, dy = \int_{-\infty}^{\infty} \frac{g(y)\phi(y)}{\tilde\phi(y)} \, \tilde\phi(y) \, dy = \int_{-\infty}^{\infty} \tilde{g}(y) \, \tilde\phi(y) \, dy = \int_0^1 \tilde{g}(\tilde\Phi^{-1}(x)) \, dx = \int_0^1 \tilde{f}(x) \, dx, \qquad \tilde{g}(y) := \frac{g(y)\phi(y)}{\tilde\phi(y)},

giving a different transformed integrand \tilde{f} := \tilde{g} \circ \tilde\Phi^{-1}. Ideally we would like to use a density function which leads to an easy integrand in the unit cube. This is related to the concept of importance sampling for the Monte Carlo method.

Question 3. How does this transformation generalize to s dimensions?

Answer. If we have a product of univariate densities, then we can apply the mapping \Phi^{-1} componentwise,

    y = \Phi^{-1}(x) = (\Phi^{-1}(x_1), \ldots, \Phi^{-1}(x_s)),

to obtain

    \int_{\mathbb{R}^s} g(y) \prod_{j=1}^{s} \phi(y_j) \, dy = \int_{[0,1]^s} g(\Phi^{-1}(x)) \, dx = \int_{[0,1]^s} f(x) \, dx.

Remember that we can always divide and multiply to get such a product:

    \int_{\mathbb{R}^s} (\text{some ugly function of } y) \, dy = \int_{\mathbb{R}^s} \frac{(\text{some ugly function of } y)}{\prod_{j=1}^{s} \phi(y_j)} \prod_{j=1}^{s} \phi(y_j) \, dy.

Question 4. How do we tackle the multivariate normal density which occurs in many integrals from practical models?

Answer. If the multivariate normal density is the dominating part of the entire integrand, then factorize the covariance matrix Σ, i.e., find an s × s matrix A such that

    \Sigma = A A^\top,    (11)

and then use the substitution (treating all vectors as column vectors) y = Az followed by z = \Phi^{-1}(x) to obtain

    \int_{\mathbb{R}^s} g(y) \frac{\exp(-\frac{1}{2} y^\top \Sigma^{-1} y)}{\sqrt{(2\pi)^s \det(\Sigma)}} \, dy    (12)

    = \int_{\mathbb{R}^s} g(Az) \frac{\exp(-\frac{1}{2} z^\top z)}{\sqrt{(2\pi)^s}} \, dz = \int_{\mathbb{R}^s} g(Az) \prod_{j=1}^{s} \frac{\exp(-\frac{1}{2} z_j^2)}{\sqrt{2\pi}} \, dz = \int_{[0,1]^s} g(A \Phi^{-1}(x)) \, dx = \int_{[0,1]^s} f(x) \, dx.

The factorization (11) is not unique. Two obvious choices are

1. the Cholesky factorization with lower triangular matrix A, or
2. the principal components factorization, which is given by A = [\sqrt{\lambda_1}\,\eta_1 \; \cdots \; \sqrt{\lambda_s}\,\eta_s], where (\lambda_j, \eta_j)_{j=1}^{s} denote the eigenpairs of Σ, with ordered eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_s and unit-length column eigenvectors \eta_1, \ldots, \eta_s.

Other choices are possible.
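Both factorizations are one-liners in Matlab; below is a minimal sketch with a hypothetical covariance matrix (norminv again assumed from the Statistics Toolbox).

s = 4;
Sigma = min(repmat(1:s, s, 1), repmat((1:s)', 1, s)) / s;  % hypothetical covariance

A_chol = chol(Sigma, 'lower');            % Cholesky: lower triangular, A*A' = Sigma

[V, D] = eig(Sigma);                      % principal components factorization
[lambda, idx] = sort(diag(D), 'descend');
A_pca = V(:, idx) * diag(sqrt(lambda));   % columns are sqrt(lambda_j)*eta_j

x = rand(s, 1);                           % a point in the unit cube
y = A_pca * norminv(x);                   % normal sample with covariance Sigma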

Question 5. What if the multivariate normal density is not the dominating part of the entire integrand?

Answer. In that case, other transformation steps would be required to capture the main features of the entire integrand.

4.2 Option pricing

Following the Black-Scholes model, the integral arising from pricing an arithmetic average Asian call option takes the general form of (12), with

    g(y) = e^{-rT} \max\left( \frac{1}{s} \sum_{j=1}^{s} S_{t_j}(y_j) - K, \; 0 \right)

and

    S_{t_j}(y_j) = S_0 \exp\left( \left( r - \frac{\sigma^2}{2} \right) \frac{jT}{s} + \sigma y_j \right),

where r is the risk-free interest rate, σ is the volatility, and S_0 is the initial asset price. The variables y = (y_1, \ldots, y_s) correspond to a discretization of the underlying Brownian motion over a time interval [0,T], and the covariance matrix has entries \Sigma_{ij} = (T/s) \min(i,j). The payoff function g(y) compares the average of the asset prices S_{t_j} at the discrete times with the strike price K, and takes their difference if it is positive, or the value 0 if the difference is negative.

It is widely accepted that QMC methods work especially well for such problems if we take the principal components approach to factorize the covariance matrix Σ. The success of QMC for option pricing problems cannot be explained by the standard theory due to the kink in the integrand. Lots of new QMC theory has been developed with this problem in mind.

Parameters for numerical experiment: S_0 = 100 (dollars), K = 100 (dollars), T = 1 (year), r = 0.1, σ = 0.2, and s = 256.
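A minimal sketch of this payoff in vectorized Matlab, assuming the parameters above; each column of y is one discretized Brownian path, obtained, e.g., as y = A*norminv(x) with A a factorization of Σ from the previous subsection.

function val = asian_payoff(y, S0, K, T, r, sigma)
% discounted arithmetic average Asian call payoff, vectorized over columns
s = size(y, 1);
drift = (r - sigma^2/2) * (1:s)' * (T/s);       % (r - sigma^2/2) * t_j
S = S0 * exp(bsxfun(@plus, drift, sigma * y));  % asset prices, [s-by-n]
val = exp(-r*T) * max(mean(S, 1) - K, 0);       % payoff, [1-by-n]
end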

4.3 Maximum likelihood

One example of a time series Poisson likelihood model involves an integral of the form (12), with

    g(y) = \prod_{j=1}^{s} \frac{\exp(\tau_j(\beta + y_j) - e^{\beta + y_j})}{\tau_j!}.

Here β ∈ R is a model parameter, \tau_1, \ldots, \tau_s ∈ {0, 1, \ldots} are the count data, and Σ is a Toeplitz covariance matrix with \Sigma_{ij} = \sigma^2 \kappa^{|i-j|} / (1 - \kappa^2), where \sigma^2 is the variance and κ ∈ (-1, 1) is the autoregression coefficient.

The obvious way to transform this integral into the unit cube by factorizing Σ would yield a very spiky function f. Instead, it is better to consider g(y) and the multivariate normal density together and then perform some change of variables with the effect of recentering and rescaling the whole integrand, before mapping to the unit cube. We have some new QMC theory that can explain the success of this approach.

Parameters for numerical experiment: β = 0.7, \sigma^2 = 0.3, κ = 0.5, and τ = (2,0,3,2,0,2,1,4,2,1,8,5,2,3,6,2,2,0,0,1,0,7,2,5,1) for s = 25 (we have more data beyond 25 dimensions). This example came from Kuo, Dunsmuir, Sloan, Wand & Womersley (2008).
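A minimal sketch of this integrand in vectorized Matlab, assuming the parameters above; gammaln evaluates log(\tau_j!) stably, and the product over dimensions is computed as a sum of logarithms before exponentiating.

function val = poisson_lik(y, beta, tau)
% Poisson time series likelihood integrand, vectorized over columns of y
t = repmat(tau(:), 1, size(y, 2));                    % counts, [s-by-n]
logg = t .* (beta + y) - exp(beta + y) - gammaln(t + 1);
val = exp(sum(logg, 1));                              % product over j, [1-by-n]
end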

4.4 PDE with a random coefficient

We consider the following model of a parametric elliptic Dirichlet problem:

    -\nabla_x \cdot (a(x,y) \nabla_x u(x,y)) = f(x) \quad \text{for } x \in D \text{ and } y \in [-\tfrac{1}{2}, \tfrac{1}{2}]^s,    (13)
    u(x,y) = 0 \quad \text{for } x \in \partial D,

where D ⊂ R^d, d ∈ {1,2,3}, is a bounded spatial domain with a Lipschitz boundary ∂D, and where the parametric diffusion coefficient is (after truncating an infinite series to s terms)

    a(x,y) = \bar{a}(x) + \sum_{j=1}^{s} y_j \, \psi_j(x).    (14)

The parameter vector y is distributed on [-\frac{1}{2}, \frac{1}{2}]^s with the uniform probability measure. We assume that \bar{a} \in L^\infty(D) and \sum_{j=1}^{\infty} \|\psi_j\|_{L^\infty(D)} < \infty, and that 0 < a_min ≤ a(x,y) ≤ a_max < ∞ for all x and y. We refer to this as the uniform case. We are interested in the expected value of some bounded linear functional G of the solution u, which is an integral of the form

    \int_{[-\frac{1}{2},\frac{1}{2}]^s} G(u(\cdot, y)) \, dy.

Note that in this problem we are integrating with respect to the parameter vector y, and not the spatial variables x. A QMC Finite Element approximation to this integral is

    \frac{1}{n} \sum_{k=0}^{n-1} G\bigl(u_h(\cdot, t_k - \tfrac{1}{2})\bigr),

where u_h denotes the Finite Element weak solution. Essentially, we generate a number of QMC points (either deterministic or randomized) and translate them to the cube [-\frac{1}{2}, \frac{1}{2}]^s. Each such translated QMC point gives a different value of y; we solve the corresponding PDE and then apply the functional G. We finally take the average of all solutions.

By now there is a large body of literature on applying QMC methods to these and related problems, including the so-called lognormal case which gives rise to an integral over R^s with a normal density. QMC methods are relatively new to these applications and have proven to be very competitive with other well established methods.

Parameters for numerical experiment: f(x) = 100 x_1, G(u(\cdot, y)) = \int_D u(x,y) \, dx, d = 2, \bar{a}(x) = 1, s = 100 (or any other number), and

    \psi_j(x) = \lambda_j \sin(k_{1,j} \pi x_1) \sin(k_{2,j} \pi x_2),

where the sequence of pairs ((k_{1,j}, k_{2,j}))_{j \ge 1} is an ordering of the elements of \mathbb{Z}_+ \times \mathbb{Z}_+ such that \lambda_j = (k_{1,j}^2 + k_{2,j}^2)^{-2} is a non-increasing sequence. (In other words, we form the pairs of positive integers, order them according to the reciprocal of the sum of the squared components, and then keep the first s pairs.) This example came from Dick, Kuo, Le Gia & Schwab (2016).
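A minimal Matlab sketch of the ordering described in the parameters above, assuming a maximum index K chosen large enough that the first s pairs are found:

K = 32; s = 100;
[k1, k2] = meshgrid(1:K, 1:K);            % candidate pairs of positive integers
lambda = (k1(:).^2 + k2(:).^2).^(-2);
[lambda, idx] = sort(lambda, 'descend');  % non-increasing lambda_j
k1 = k1(idx(1:s)); k2 = k2(idx(1:s));     % keep the first s pairs
lambda = lambda(1:s);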

5 Software demo

We will demonstrate how to apply QMC methods to your favorite integrals/expectations. We will consider a simple test function throughout this section, instead of taking more complicated examples as in the previous section (e.g., where one function evaluation could involve solving a PDE). Matlab will be used as the lingua franca in the examples, but further down you can find Python and C++ code. If you know at least one of these languages (or any similar language) then you should be able to understand what is going on, especially after comparing the three different implementations in §5.7. From a computational point of view it is important that we vectorize our function evaluations; see §5.3 for an explanation. The exposition is such that the better code comes at the end.

All numerical tests in this chapter were run on the same old laptop with a 1.8 GHz Intel Core i7 (2 cores) and 4 GB of memory under Mac OS Sierra, with Matlab R2016a, Python with NumPy, and clang.

5.1 A simple test function

We consider the following example function taken from Gantner & Schwab (2016):

    g(x) := \exp\left( c \sum_{j=1}^{s} x_j \, j^{-b} \right) = \prod_{j=1}^{s} \exp(c \, x_j \, j^{-b}).    (15)

For testing purposes it is nice to know the exact value of the integral. Since g is a product of one-dimensional functions and we can write down the solution of the one-dimensional integrals, we find

    I(g) := \int_{[0,1]^s} g(x) \, dx = \prod_{j=1}^{s} \frac{\exp(c \, j^{-b}) - 1}{c \, j^{-b}}, \qquad c \ne 0.

Let us define g in Matlab as a vectorized inline function taking multiple vectors at once as an s-by-n array and returning a 1-by-n array of results:

% x is an [s-by-n] matrix, n vectors of size s; vectorized g(x):
g = @(x, c, b) exp(c * (1:size(x,1)).^(-b) * x);
% note that vectors are considered to be columns and so the
% product above is an inner product, summing over the dimensions

In fact we will define g slightly differently, as we will pass in not just multiple vectors at once, but also different shifted versions of these vectors as an s-by-n-by-m array:

g = @(x, c, b) reshape( ...
    exp(c * (1:size(x,1)).^(-b) * x(:,:)), ...  % 'as above'
    1, size(x,2), size(x,3) ...                 % 1-by-n-by-[whatever is left]
);

Of course more complicated functions would better be defined in a separate file. E.g., we could define g in a separate file with the same function signature (we call this file gfun.m, and thus this function's name is gfun, to distinguish it from the inline definition above):

function y = gfun(x, c, b)
% function y = gfun(x, c, b)
%
% Vectorized evaluation of the example function
%   \[ g(x) := \exp( c \sum_{j=1}^s x_j j^{-b} ) \]
%
% Inputs:
%   x    array of s-dimensional vectors, [s-by-n] array
%        or [s-by-n-by-m] array (or even deeper nesting,
%        but the first dimension should be the dimension s)
%   c    scaling parameter, scalar
%   b    dimension decay parameter to diminish influence of
%        higher dimensions, scalar, b >= 0 (b = 0 is no decay)
% Outputs:
%   y    function value for each input vector, [1-by-n] array
%        or [1-by-n-by-m] array (or deeper, but same as x)
%
% Note: the array x (and also the resulting array y) can have more
% than two dimensions, e.g., x could be [s-by-n-by-m] and then the
% resulting y will be [1-by-n-by-m]. This is to accommodate
% multiple versions of a point set (e.g., for shifted point sets).
y = reshape( ...
    exp(c * (1:size(x,1)).^(-b) * x(:,:)), ...
    max((1:ndims(x) ~= 1) .* size(x), 1) ...  % set the first dimension to 1
);

This version is even more general as it allows x to have any shape as long as the leading dimension is s (the dimensionality of the points); it will return an array of the same shape but with the first dimension set to 1 (mapping vectors into function values). We are now ready to fix some parameters:

% parameters of the g-function
s = 100;  % number of dimensions
c = 1;    % c-parameter
b = 2;    % decay-parameter

Then we can calculate its exact integral value:

a = c * (1:s).^(-b);
exact = prod(expm1(a) ./ a);
% or as a function (repeating the 'a' twice):
gexact = @(s, c, b) prod(expm1(c*(1:s).^(-b)) ./ (c*(1:s).^(-b)));
% notice we use expm1(a) and not exp(a) - 1

5.2 The difficulty of our test function

Fig. 6 Interpretation of the combined parameters c and b for the function g in a single coordinate, showing the curves exp(x), exp(x/5), exp(-x/5), exp(-x), and exp(-5x) on [0,1]. The effect is multiplied in multiple dimensions.

The parameters c and b specify how difficult the function g will be. We can use Figure 6 as a guideline. It is clear we have a product of such one-dimensional exponential functions. Looking at the figure, we see that if the argument to the exponential function is a small number then the function is nearly linear and approaching a constant function. Note that constant functions are ridiculously easy to integrate. The extreme case would be c = 0; in that case both MC and QMC methods will give the exact value of the integral already with one function value. The larger the value of c (positive or negative), the more we deviate from a linear function, and we need more and more samples to determine its integral. For negative c with very large magnitude, the function is essentially 0 except for g(0) = 1, so it is a peaky function and rather hard to integrate. In the multivariate case, the parameter b, with b ≥ 0, modifies how quickly we converge to a constant function as the dimension index j increases. When the number of dimensions increases, the deviation from the constant function is multiplied.

5.3 Some technical details*

Floating point precision

Note the usage of the function expm1 in calculating the exact integral of g. This is useful for small arguments, when exp(x) ≈ 1, since then it is more accurate to compute the right-most expression instead of the middle expression of

    \exp(x) - 1 = \left( 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \right) - 1 = x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots,

because of floating point arithmetic, where 1 + ε is rounded to 1 for ε smaller than the floating point precision.

Vectorization of (interpreted) code

It is often a good idea to vectorize code in interpreted languages (e.g., Matlab and Python). Vectorized code often runs much faster than for-loops. Once you get used to vectorized code, in terms of matrices and vectors, it is less error prone and easier to read. In Matlab we vectorize by using matrix and vector operations, as well as array operations acting on each element, marked by a dot in front of the operator: .*, ./, .^. You can look up vectorization in the Matlab documentation.

A very useful function is reshape, which does not cost any computational or memory effort. It just takes the same block of data but interprets the data as if the elements have a different shape. Matlab uses column-major format, which means that for x an s-by-n-by-m array, the first s consecutive numbers in the data block are the part x(1:s,1,1) = x(:,1,1). This is the same as in Fortran. C and C++, however, use row-major format, in which the last dimension iterates over consecutive elements. In Python, using NumPy, the default is row-major, but it can be chosen on an array-by-array basis. The way we have chosen to lay out our collections of s-dimensional vectors in Matlab is such that the vectors are stored consecutively in memory. In this way the data needed to evaluate one function value is localized in memory. If we had to implement such a vectorization in Python with row-major format, then we would formulate x as an m-by-n-by-s array instead. That is what we will choose when making a Python implementation.

Parallelization

We note that MC and QMC methods are embarrassingly parallel. This is a technical term meaning that all computations are independent and so can be distributed (preferably in blocks) straightforwardly over multiple cores. The accumulation of the results can be done with a simple reduce operation.

5.4 Usage of random number generators

It is good practice to have reproducible results (for debugging, when testing optimizations, or for checking the results in this text). We will use the Mersenne Twister as the random number generator for our MC simulations and we will set

its initial state to a fixed value such that we can repeat our experiment and get exactly the same random numbers. Similarly we will use the combined recursive generator from L'Ecuyer for the random shifting in case of the QMC approximations.

rng_mc = RandStream('mt19937ar', 'Seed', 1234);
rng_shifts = RandStream('mrg32k3a', 'Seed', 1234);

In Matlab you can now draw random numbers from the Mersenne Twister by doing x = rand(rng_mc, s, n) to obtain an s-by-n array.

5.5 Monte Carlo approximation

We are now ready to do a first approximation of the integral using the MC method. We will use 10^5 samples to get a MC approximation. (Of course in our test case we do know the exact value of the integral.)

tic;
N = 1e5;  % number of samples
G = g(rand(rng_mc, s, N), c, b);  % evaluate all at once, memory hungry but easy
MC_Q = mean(G);
MC_std = std(G) / sqrt(N);
t = toc;
fprintf('MC_Q = %g (error=%g, std=%g, N=%d) in %f sec\n', ...
    MC_Q, abs(MC_Q - gexact(s, c, b)), MC_std, N, t);

This gives us

MC_Q = (error= , std= , N=100000) in  sec

Without resetting the seed, re-running the above code 9 more times gives the following output:

MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec
MC_Q = (error= , std= , N=100000) in  sec

In Figure 7 we plot the results of these 10 approximations to obtain estimates of I(g) as well as σ²(g). In Figure 8 we see the standard error plotted in terms of the number of samples used. We can clearly see from the figure that the convergence is O(1/\sqrt{N}), as expected. For our test function we can actually calculate σ²(g), and so we also plotted σ(g)/\sqrt{N} as a dashed reference line. Using 10^5 random samples we observe the estimated standard error printed above. If we wanted to divide this error by 10, then we would need to take 100 times more samples.


More information

Progress in high-dimensional numerical integration and its application to stochastic optimization

Progress in high-dimensional numerical integration and its application to stochastic optimization Progress in high-dimensional numerical integration and its application to stochastic optimization W. Römisch Humboldt-University Berlin Department of Mathematics www.math.hu-berlin.de/~romisch Short course,

More information

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A.

The amount of work to construct each new guess from the previous one should be a small multiple of the number of nonzeros in A. AMSC/CMSC 661 Scientific Computing II Spring 2005 Solution of Sparse Linear Systems Part 2: Iterative methods Dianne P. O Leary c 2005 Solving Sparse Linear Systems: Iterative methods The plan: Iterative

More information

Number Systems III MA1S1. Tristan McLoughlin. December 4, 2013

Number Systems III MA1S1. Tristan McLoughlin. December 4, 2013 Number Systems III MA1S1 Tristan McLoughlin December 4, 2013 http://en.wikipedia.org/wiki/binary numeral system http://accu.org/index.php/articles/1558 http://www.binaryconvert.com http://en.wikipedia.org/wiki/ascii

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Topic 15 Notes Jeremy Orloff

Topic 15 Notes Jeremy Orloff Topic 5 Notes Jeremy Orloff 5 Transpose, Inverse, Determinant 5. Goals. Know the definition and be able to compute the inverse of any square matrix using row operations. 2. Know the properties of inverses.

More information

HOW TO LOOK AT MINKOWSKI S THEOREM

HOW TO LOOK AT MINKOWSKI S THEOREM HOW TO LOOK AT MINKOWSKI S THEOREM COSMIN POHOATA Abstract. In this paper we will discuss a few ways of thinking about Minkowski s lattice point theorem. The Minkowski s Lattice Point Theorem essentially

More information

Divide and Conquer. Maximum/minimum. Median finding. CS125 Lecture 4 Fall 2016

Divide and Conquer. Maximum/minimum. Median finding. CS125 Lecture 4 Fall 2016 CS125 Lecture 4 Fall 2016 Divide and Conquer We have seen one general paradigm for finding algorithms: the greedy approach. We now consider another general paradigm, known as divide and conquer. We have

More information

1. Introductory Examples

1. Introductory Examples 1. Introductory Examples We introduce the concept of the deterministic and stochastic simulation methods. Two problems are provided to explain the methods: the percolation problem, providing an example

More information

Review of Statistical Terminology

Review of Statistical Terminology Review of Statistical Terminology An experiment is a process whose outcome is not known with certainty. The experiment s sample space S is the set of all possible outcomes. A random variable is a function

More information

Math Circle: Recursion and Induction

Math Circle: Recursion and Induction Math Circle: Recursion and Induction Prof. Wickerhauser 1 Recursion What can we compute, using only simple formulas and rules that everyone can understand? 1. Let us use N to denote the set of counting

More information

MAS114: Exercises. October 26, 2018

MAS114: Exercises. October 26, 2018 MAS114: Exercises October 26, 2018 Note that the challenge problems are intended to be difficult! Doing any of them is an achievement. Please hand them in on a separate piece of paper if you attempt them.

More information

Uniform random numbers generators

Uniform random numbers generators Uniform random numbers generators Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/tlt-2707/ OUTLINE: The need for random numbers; Basic steps in generation; Uniformly

More information

7.1 Indefinite Integrals Calculus

7.1 Indefinite Integrals Calculus 7.1 Indefinite Integrals Calculus Learning Objectives A student will be able to: Find antiderivatives of functions. Represent antiderivatives. Interpret the constant of integration graphically. Solve differential

More information

1 Basic Combinatorics

1 Basic Combinatorics 1 Basic Combinatorics 1.1 Sets and sequences Sets. A set is an unordered collection of distinct objects. The objects are called elements of the set. We use braces to denote a set, for example, the set

More information

8 STOCHASTIC SIMULATION

8 STOCHASTIC SIMULATION 8 STOCHASTIC SIMULATIO 59 8 STOCHASTIC SIMULATIO Whereas in optimization we seek a set of parameters x to minimize a cost, or to maximize a reward function J( x), here we pose a related but different question.

More information

THE N-VALUE GAME OVER Z AND R

THE N-VALUE GAME OVER Z AND R THE N-VALUE GAME OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is an easily described mathematical diversion with deep underpinnings in dynamical systems analysis. We examine

More information

Fundamentals of Mathematics I

Fundamentals of Mathematics I Fundamentals of Mathematics I Kent State Department of Mathematical Sciences Fall 2008 Available at: http://www.math.kent.edu/ebooks/10031/book.pdf August 4, 2008 Contents 1 Arithmetic 2 1.1 Real Numbers......................................................

More information

Elementary Linear Algebra

Elementary Linear Algebra Matrices J MUSCAT Elementary Linear Algebra Matrices Definition Dr J Muscat 2002 A matrix is a rectangular array of numbers, arranged in rows and columns a a 2 a 3 a n a 2 a 22 a 23 a 2n A = a m a mn We

More information

Gregory's quadrature method

Gregory's quadrature method Gregory's quadrature method Gregory's method is among the very first quadrature formulas ever described in the literature, dating back to James Gregory (638-675). It seems to have been highly regarded

More information

Math 396. Quotient spaces

Math 396. Quotient spaces Math 396. Quotient spaces. Definition Let F be a field, V a vector space over F and W V a subspace of V. For v, v V, we say that v v mod W if and only if v v W. One can readily verify that with this definition

More information

GALOIS GROUPS OF CUBICS AND QUARTICS (NOT IN CHARACTERISTIC 2)

GALOIS GROUPS OF CUBICS AND QUARTICS (NOT IN CHARACTERISTIC 2) GALOIS GROUPS OF CUBICS AND QUARTICS (NOT IN CHARACTERISTIC 2) KEITH CONRAD We will describe a procedure for figuring out the Galois groups of separable irreducible polynomials in degrees 3 and 4 over

More information

Review of matrices. Let m, n IN. A rectangle of numbers written like A =

Review of matrices. Let m, n IN. A rectangle of numbers written like A = Review of matrices Let m, n IN. A rectangle of numbers written like a 11 a 12... a 1n a 21 a 22... a 2n A =...... a m1 a m2... a mn where each a ij IR is called a matrix with m rows and n columns or an

More information

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School Basic counting techniques Periklis A. Papakonstantinou Rutgers Business School i LECTURE NOTES IN Elementary counting methods Periklis A. Papakonstantinou MSIS, Rutgers Business School ALL RIGHTS RESERVED

More information

Chapter 3: Polynomial and Rational Functions

Chapter 3: Polynomial and Rational Functions Chapter 3: Polynomial and Rational Functions 3.1 Polynomial Functions A polynomial on degree n is a function of the form P(x) = a n x n + a n 1 x n 1 + + a 1 x 1 + a 0, where n is a nonnegative integer

More information

Abstract. 2. We construct several transcendental numbers.

Abstract. 2. We construct several transcendental numbers. Abstract. We prove Liouville s Theorem for the order of approximation by rationals of real algebraic numbers. 2. We construct several transcendental numbers. 3. We define Poissonian Behaviour, and study

More information

1 Introduction & Objective

1 Introduction & Objective Signal Processing First Lab 13: Numerical Evaluation of Fourier Series Pre-Lab and Warm-Up: You should read at least the Pre-Lab and Warm-up sections of this lab assignment and go over all exercises in

More information

CHAPTER 1. REVIEW: NUMBERS

CHAPTER 1. REVIEW: NUMBERS CHAPTER. REVIEW: NUMBERS Yes, mathematics deals with numbers. But doing math is not number crunching! Rather, it is a very complicated psychological process of learning and inventing. Just like listing

More information

Katholieke Universiteit Leuven Department of Computer Science

Katholieke Universiteit Leuven Department of Computer Science Extensions of Fibonacci lattice rules Ronald Cools Dirk Nuyens Report TW 545, August 2009 Katholieke Universiteit Leuven Department of Computer Science Celestijnenlaan 200A B-3001 Heverlee (Belgium Extensions

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

Pseudo-Random Numbers Generators. Anne GILLE-GENEST. March 1, Premia Introduction Definitions Good generators...

Pseudo-Random Numbers Generators. Anne GILLE-GENEST. March 1, Premia Introduction Definitions Good generators... 14 pages 1 Pseudo-Random Numbers Generators Anne GILLE-GENEST March 1, 2012 Contents Premia 14 1 Introduction 2 1.1 Definitions............................. 2 1.2 Good generators..........................

More information

Chapter Four Gelfond s Solution of Hilbert s Seventh Problem (Revised January 2, 2011)

Chapter Four Gelfond s Solution of Hilbert s Seventh Problem (Revised January 2, 2011) Chapter Four Gelfond s Solution of Hilbert s Seventh Problem (Revised January 2, 2011) Before we consider Gelfond s, and then Schneider s, complete solutions to Hilbert s seventh problem let s look back

More information

Linear algebra for MATH2601: Theory

Linear algebra for MATH2601: Theory Linear algebra for MATH2601: Theory László Erdős August 12, 2000 Contents 1 Introduction 4 1.1 List of crucial problems............................... 5 1.2 Importance of linear algebra............................

More information

Mini-project in scientific computing

Mini-project in scientific computing Mini-project in scientific computing Eran Treister Computer Science Department, Ben-Gurion University of the Negev, Israel. March 7, 2018 1 / 30 Scientific computing Involves the solution of large computational

More information

Introduction to Algorithms

Introduction to Algorithms Lecture 1 Introduction to Algorithms 1.1 Overview The purpose of this lecture is to give a brief overview of the topic of Algorithms and the kind of thinking it involves: why we focus on the subjects that

More information

Next topics: Solving systems of linear equations

Next topics: Solving systems of linear equations Next topics: Solving systems of linear equations 1 Gaussian elimination (today) 2 Gaussian elimination with partial pivoting (Week 9) 3 The method of LU-decomposition (Week 10) 4 Iterative techniques:

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Linear Algebra II. 2 Matrices. Notes 2 21st October Matrix algebra

Linear Algebra II. 2 Matrices. Notes 2 21st October Matrix algebra MTH6140 Linear Algebra II Notes 2 21st October 2010 2 Matrices You have certainly seen matrices before; indeed, we met some in the first chapter of the notes Here we revise matrix algebra, consider row

More information

THE MINIMAL POLYNOMIAL AND SOME APPLICATIONS

THE MINIMAL POLYNOMIAL AND SOME APPLICATIONS THE MINIMAL POLYNOMIAL AND SOME APPLICATIONS KEITH CONRAD. Introduction The easiest matrices to compute with are the diagonal ones. The sum and product of diagonal matrices can be computed componentwise

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 1

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 1 CS 70 Discrete Mathematics and Probability Theory Fall 013 Vazirani Note 1 Induction Induction is a basic, powerful and widely used proof technique. It is one of the most common techniques for analyzing

More information

Quasi- Monte Carlo Multiple Integration

Quasi- Monte Carlo Multiple Integration Chapter 6 Quasi- Monte Carlo Multiple Integration Introduction In some sense, this chapter fits within Chapter 4 on variance reduction; in some sense it is stratification run wild. Quasi-Monte Carlo methods

More information

Optimal Randomized Algorithms for Integration on Function Spaces with underlying ANOVA decomposition

Optimal Randomized Algorithms for Integration on Function Spaces with underlying ANOVA decomposition Optimal Randomized on Function Spaces with underlying ANOVA decomposition Michael Gnewuch 1 University of Kaiserslautern, Germany October 16, 2013 Based on Joint Work with Jan Baldeaux (UTS Sydney) & Josef

More information

Basic elements of number theory

Basic elements of number theory Cryptography Basic elements of number theory Marius Zimand By default all the variables, such as a, b, k, etc., denote integer numbers. Divisibility a 0 divides b if b = a k for some integer k. Notation

More information

Basic elements of number theory

Basic elements of number theory Cryptography Basic elements of number theory Marius Zimand 1 Divisibility, prime numbers By default all the variables, such as a, b, k, etc., denote integer numbers. Divisibility a 0 divides b if b = a

More information

Linear Algebra. Min Yan

Linear Algebra. Min Yan Linear Algebra Min Yan January 2, 2018 2 Contents 1 Vector Space 7 1.1 Definition................................. 7 1.1.1 Axioms of Vector Space..................... 7 1.1.2 Consequence of Axiom......................

More information

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.

More information

CS 450 Numerical Analysis. Chapter 8: Numerical Integration and Differentiation

CS 450 Numerical Analysis. Chapter 8: Numerical Integration and Differentiation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

Lecture 12 : Recurrences DRAFT

Lecture 12 : Recurrences DRAFT CS/Math 240: Introduction to Discrete Mathematics 3/1/2011 Lecture 12 : Recurrences Instructor: Dieter van Melkebeek Scribe: Dalibor Zelený DRAFT Last few classes we talked about program correctness. We

More information

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is a dynamical system defined by a method of iterated differences. In this paper, we

More information

Where do pseudo-random generators come from?

Where do pseudo-random generators come from? Computer Science 2426F Fall, 2018 St. George Campus University of Toronto Notes #6 (for Lecture 9) Where do pseudo-random generators come from? Later we will define One-way Functions: functions that are

More information

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations

Sparse Linear Systems. Iterative Methods for Sparse Linear Systems. Motivation for Studying Sparse Linear Systems. Partial Differential Equations Sparse Linear Systems Iterative Methods for Sparse Linear Systems Matrix Computations and Applications, Lecture C11 Fredrik Bengzon, Robert Söderlund We consider the problem of solving the linear system

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

arxiv: v1 [physics.comp-ph] 22 Jul 2010

arxiv: v1 [physics.comp-ph] 22 Jul 2010 Gaussian integration with rescaling of abscissas and weights arxiv:007.38v [physics.comp-ph] 22 Jul 200 A. Odrzywolek M. Smoluchowski Institute of Physics, Jagiellonian University, Cracov, Poland Abstract

More information

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved Fundamentals of Linear Algebra Marcel B. Finan Arkansas Tech University c All Rights Reserved 2 PREFACE Linear algebra has evolved as a branch of mathematics with wide range of applications to the natural

More information

Review Questions REVIEW QUESTIONS 71

Review Questions REVIEW QUESTIONS 71 REVIEW QUESTIONS 71 MATLAB, is [42]. For a comprehensive treatment of error analysis and perturbation theory for linear systems and many other problems in linear algebra, see [126, 241]. An overview of

More information

Lecture Notes in Mathematics. Arkansas Tech University Department of Mathematics. The Basics of Linear Algebra

Lecture Notes in Mathematics. Arkansas Tech University Department of Mathematics. The Basics of Linear Algebra Lecture Notes in Mathematics Arkansas Tech University Department of Mathematics The Basics of Linear Algebra Marcel B. Finan c All Rights Reserved Last Updated November 30, 2015 2 Preface Linear algebra

More information

Some Notes on Linear Algebra

Some Notes on Linear Algebra Some Notes on Linear Algebra prepared for a first course in differential equations Thomas L Scofield Department of Mathematics and Statistics Calvin College 1998 1 The purpose of these notes is to present

More information

Linear Algebra: Lecture Notes. Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway

Linear Algebra: Lecture Notes. Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway Linear Algebra: Lecture Notes Dr Rachel Quinlan School of Mathematics, Statistics and Applied Mathematics NUI Galway November 6, 23 Contents Systems of Linear Equations 2 Introduction 2 2 Elementary Row

More information

, p 1 < p 2 < < p l primes.

, p 1 < p 2 < < p l primes. Solutions Math 347 Homework 1 9/6/17 Exercise 1. When we take a composite number n and factor it into primes, that means we write it as a product of prime numbers, usually in increasing order, using exponents

More information

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark

LU Factorization. Marco Chiarandini. DM559 Linear and Integer Programming. Department of Mathematics & Computer Science University of Southern Denmark DM559 Linear and Integer Programming LU Factorization Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark [Based on slides by Lieven Vandenberghe, UCLA] Outline

More information

CS 542G: Conditioning, BLAS, LU Factorization

CS 542G: Conditioning, BLAS, LU Factorization CS 542G: Conditioning, BLAS, LU Factorization Robert Bridson September 22, 2008 1 Why some RBF Kernel Functions Fail We derived some sensible RBF kernel functions, like φ(r) = r 2 log r, from basic principles

More information

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by SUBGROUPS OF CYCLIC GROUPS KEITH CONRAD 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by g = {g k : k Z}. If G = g, then G itself is cyclic, with g as a generator. Examples

More information

Factoring. there exists some 1 i < j l such that x i x j (mod p). (1) p gcd(x i x j, n).

Factoring. there exists some 1 i < j l such that x i x j (mod p). (1) p gcd(x i x j, n). 18.310 lecture notes April 22, 2015 Factoring Lecturer: Michel Goemans We ve seen that it s possible to efficiently check whether an integer n is prime or not. What about factoring a number? If this could

More information

Mon Jan Improved acceleration models: linear and quadratic drag forces. Announcements: Warm-up Exercise:

Mon Jan Improved acceleration models: linear and quadratic drag forces. Announcements: Warm-up Exercise: Math 2250-004 Week 4 notes We will not necessarily finish the material from a given day's notes on that day. We may also add or subtract some material as the week progresses, but these notes represent

More information

Part II. Number Theory. Year

Part II. Number Theory. Year Part II Year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2017 Paper 3, Section I 1G 70 Explain what is meant by an Euler pseudoprime and a strong pseudoprime. Show that 65 is an Euler

More information

Lecture 7: More Arithmetic and Fun With Primes

Lecture 7: More Arithmetic and Fun With Primes IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 7: More Arithmetic and Fun With Primes David Mix Barrington and Alexis Maciel July

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

1 GSW Sets of Systems

1 GSW Sets of Systems 1 Often, we have to solve a whole series of sets of simultaneous equations of the form y Ax, all of which have the same matrix A, but each of which has a different known vector y, and a different unknown

More information

Linear Algebra. Preliminary Lecture Notes

Linear Algebra. Preliminary Lecture Notes Linear Algebra Preliminary Lecture Notes Adolfo J. Rumbos c Draft date May 9, 29 2 Contents 1 Motivation for the course 5 2 Euclidean n dimensional Space 7 2.1 Definition of n Dimensional Euclidean Space...........

More information