Quasi-Monte Carlo Sampling to improve the Efficiency of Monte Carlo EM


Wolfgang Jank
Department of Decision and Information Technologies
University of Maryland
College Park, MD

November 17, 2003

Abstract

In this paper we investigate an efficient implementation of the Monte Carlo EM algorithm based on Quasi-Monte Carlo sampling. The Monte Carlo EM algorithm is a stochastic version of the deterministic EM (Expectation-Maximization) algorithm in which an intractable E-step is replaced by a Monte Carlo approximation. Quasi-Monte Carlo methods produce deterministic sequences of points that can significantly improve the accuracy of Monte Carlo approximations over purely random sampling. One drawback of deterministic Quasi-Monte Carlo methods is that it is generally difficult to determine the magnitude of the approximation error. However, in order to implement the Monte Carlo EM algorithm in an automated way, the ability to measure this error is fundamental. Recent developments in randomized Quasi-Monte Carlo methods can overcome this drawback. We investigate the implementation of an automated, data-driven Monte Carlo EM algorithm based on randomized Quasi-Monte Carlo methods. We apply this algorithm to a geostatistical model of online purchases and find that it can significantly decrease the total simulation effort, thus showing great potential for improving upon the efficiency of the classical Monte Carlo EM algorithm.

Key words and phrases: Monte Carlo error; low-discrepancy sequence; Halton sequence; EM algorithm; geostatistical model.

1 Introduction

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is a popular tool in statistics and many other fields. One limitation to the use of EM is, however, that quite often the E-step of the algorithm involves an analytically intractable, sometimes high dimensional integral. Hobert (2000), for example, considers a model for which the E-step involves intractable integrals of dimension twenty. The Monte Carlo EM (MCEM) algorithm, proposed by Wei & Tanner (1990), estimates this intractable integral with an empirical average based on simulated data. Typically, the simulated data is obtained by producing random draws from the distribution commanded by EM. By the law of large numbers, this integral-estimate can be made arbitrarily accurate by increasing the size of the simulated data. The MCEM algorithm typically requires a very high accuracy, especially at the later iterations. Booth & Hobert (1999), for example, report sample sizes of over 66,000 at convergence. This suggests that the overall efficiency of MCEM could be improved by using simulation methods that achieve a high accuracy in the integral-estimate with smaller sample sizes.

Recent research has provided evidence that entirely random draws do not necessarily result in the most efficient use of the simulated data. In particular, one criticism of random draws is that they often do not explore the sample space well (Morokoff & Caflisch, 1995; Caflisch et al., 1997). For instance, points drawn at random tend to form clusters, which leads to gaps where the sample space is not explored at all (see Figure 1 for illustration). This criticism has led to the development of a variety of deterministic methods that provide for a better spread of the sample points. These deterministic methods are often classified as Quasi-Monte Carlo (QMC) methods. Theoretical as well as empirical research has shown that QMC methods can significantly increase the accuracy of the integral-estimate over random draws.

Figure 1 about here

In this paper we investigate an implementation of the MCEM algorithm based on QMC methods. Wei & Tanner (1990) point out that for an efficient implementation, the size of the simulated data should be chosen small at the initial stage but increased successively as the algorithm moves along. Early versions of the method require a manual, user-determined increase of the sample size, for instance, by allocating the amount of data to be simulated in each iteration before the start of the algorithm (e.g. McCulloch, 1997).

Implementations of MCEM that determine the necessary sample size in an automated, data-driven fashion have been developed only recently (see Booth & Hobert, 1999; Levine & Casella, 2001; Levine & Fan, 2003). Automated implementations of MCEM base the decision to increase the sample size on the magnitude of the error in the integral-approximation. In their seminal work, Booth & Hobert (1999) use statistical methods to estimate this error when the simulated data is generated at random. However, since QMC methods are deterministic in nature, statistical methods do not apply. Moreover, determining the error of the QMC integral-estimate analytically can be extremely hard (Caflisch et al., 1997). Recently, the development of randomized QMC methods has overcome this early drawback. Randomized Quasi-Monte Carlo (RQMC) methods combine the benefits of deterministic sampling methods, which achieve a more uniform exploration of the sample space, with the statistical advantages of random draws. A survey of recent advances in RQMC methods can be found in L'Ecuyer & Lemieux (2002).

In this work we implement an automated MCEM algorithm based on RQMC methods. Specifically, we demonstrate how to obtain a QMC sample from the distribution commanded by EM, and we use the ideas of RQMC sampling to measure the error of the integral-estimate in every iteration of the algorithm. We implement this Quasi-Monte Carlo EM (QMCEM) algorithm within the framework of the automated MCEM formulation proposed by Booth & Hobert (1999).

The remainder of this paper is organized as follows. In Section 2 we briefly motivate the ideas surrounding QMC and RQMC. In Section 3 we explain how RQMC methods can be used to implement QMCEM in an automated, data-driven fashion. We apply this algorithm to a geostatistical model of online purchases in Section 4 and conclude with final remarks in Section 5.

2 Quasi-Monte Carlo Sampling

Quasi-Monte Carlo methods can be regarded as a deterministic counterpart to classical Monte Carlo. Suppose we want to evaluate an (analytically intractable) integral

I = ∫_{C^d} f(x) dx    (1)

over the d-dimensional unit cube, C^d := [0, 1]^d.

Classical Monte Carlo integration randomly selects points x_k ~ Uniform(C^d), k = 1, ..., m, and approximates (1) by the empirical average

Ĩ = (1/m) Σ_{k=1}^{m} f(x_k).    (2)

Quasi-Monte Carlo methods, on the other hand, select the points deterministically. Specifically, QMC methods produce a deterministic sequence of points that provides the best-possible spread in C^d. These deterministic sequences are often referred to as low-discrepancy sequences (see, for example, Niederreiter, 1992; Fang & Wang, 1994). A variety of different low-discrepancy sequences exist. Examples include the Halton sequence (Halton, 1960), the Sobol sequence (Sobol, 1967), the Faure sequence (Faure, 1982), and the Niederreiter sequence (Niederreiter, 1992), but this list is not exhaustive. In this work we focus our attention on the Halton sequence since it is conceptually very appealing.

2.1 Halton Sequences

Let b be a prime number. Then any integer k, k ≥ 0, can be written in base-b representation as

k = d_j b^j + d_{j-1} b^{j-1} + ... + d_1 b + d_0,

where d_i ∈ {0, 1, ..., b-1} for i = 0, 1, ..., j. Define the base-b radical inverse function, φ_b(k), as

φ_b(k) = d_0 b^{-1} + d_1 b^{-2} + ... + d_j b^{-(j+1)}.

Notice that for every integer k ≥ 0, φ_b(k) ∈ [0, 1]. The kth element of the Halton sequence is obtained via the radical inverse function evaluated at k. Specifically, if b_1, ..., b_d are d different prime numbers, then a d-dimensional Halton sequence of length m is given by {x_1, ..., x_m}, where the kth element of the sequence is

x_k = [φ_{b_1}(k-1), ..., φ_{b_d}(k-1)]^T,    k = 1, ..., m.    (3)

(See Halton (1960) or Wang & Hickernell (2000) for more details.)

Notice that the Halton sequence does not need to be started at the origin. Indeed, for any d-vector of non-negative integers, n = (n_1, ..., n_d)^T say, the Halton sequence with the first n_i elements of the ith coordinate skipped,

x_k = [φ_{b_1}(n_1 + k - 1), ..., φ_{b_d}(n_d + k - 1)]^T,    k = 1, ..., m,    (4)

remains a low-discrepancy sequence (see Pagès, 1992; Bouleau & Lépingle, 1994). We will refer to the sequence defined by (4) as a Halton sequence with starting point n. Figure 1 shows the first 2500 elements of a two-dimensional Halton sequence with n = (0, 0)^T.
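The radical inverse construction translates directly into code. The following is a minimal sketch in Python (the paper's own simulations used Ox); the function and variable names are ours, not the paper's.

```python
import numpy as np

def radical_inverse(k, b):
    """Base-b radical inverse phi_b(k): reflect the base-b digits of k about the radix point."""
    inv, base = 0.0, 1.0 / b
    while k > 0:
        k, digit = divmod(k, b)
        inv += digit * base
        base /= b
    return inv

def halton(m, primes, start=None):
    """First m points of a d-dimensional Halton sequence with optional starting point n, as in (4)."""
    d = len(primes)
    start = [0] * d if start is None else start
    return np.array([[radical_inverse(start[i] + k, primes[i]) for i in range(d)]
                     for k in range(m)])

# Example: the first 5 points of the two-dimensional Halton sequence in bases 2 and 3.
print(halton(5, primes=[2, 3]))
```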

2.2 Randomized Quasi-Monte Carlo

Owen (1998b) points out that the main (practical) disadvantage of QMC is that determining the accuracy of the integral-estimate in (2) is typically very complicated, if not impossible. Moreover, since QMC methods are based on deterministic sequences, statistical procedures for error estimation do not apply. This drawback has led to the development of randomized Quasi-Monte Carlo (RQMC) methods. L'Ecuyer & Lemieux (2002) suggest that any RQMC sequence should have the following two properties: 1) every element of the sequence has a uniform distribution over C^d; 2) the low-discrepancy property of the sequence is preserved under the randomization. The first property guarantees that the approximation Ĩ in (2) is an unbiased estimate of the integral in (1). Moreover, one can estimate its variance by generating r independent copies of Ĩ (which is typically done by generating r independent sequences x_1^(j), ..., x_m^(j), j = 1, ..., r). Given a desired total simulation amount N = rm, smaller values of r (paired with a larger value of m) should result in a better accuracy of the integral-estimate, since this takes better advantage of the low-discrepancy property of each sequence. At the extreme, taking r = N and m = 1 simply reproduces classical Monte Carlo estimation.

2.3 Randomized Halton Sequences

Recall that, regardless of the starting point, the Halton sequence remains a low-discrepancy sequence. Wang & Hickernell (2000) use this fact to show that if the Halton sequence is started at a random point, x_1 ~ Uniform(C^d), then it satisfies the RQMC properties 1) and 2) from Subsection 2.2. In the following sections, we will use RQMC sampling based on the randomized Halton sequence.
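To make the pooling and error estimation concrete, here is a hedged sketch of an RQMC estimate of (1) in Python, reusing the halton helper sketched above. The randomization shown (a large random integer starting index per coordinate) is a simple surrogate for, not the exact form of, the Wang-Hickernell construction; all names are ours.

```python
import numpy as np

def rqmc_estimate(f, d, m, r, primes, rng):
    """Estimate the integral of f over [0,1]^d with r randomized Halton replicates of length m.

    Randomization here is a simple surrogate (random integer starting indices); it is meant
    only to illustrate the pooled estimate and its replicate-based standard error."""
    means = []
    for _ in range(r):
        start = rng.integers(0, 10**6, size=d)      # random starting point n for this replicate
        pts = halton(m, primes, start=start)        # low-discrepancy points of this replicate
        means.append(np.mean([f(x) for x in pts]))
    means = np.asarray(means)
    pooled = means.mean()                           # pooled estimate of the integral
    std_err = means.std(ddof=1) / np.sqrt(r)        # Monte Carlo standard error from the r replicates
    return pooled, std_err

# Example: integrate f(x) = x_1 * x_2 over the 2-dimensional unit cube (true value 0.25).
rng = np.random.default_rng(0)
est, se = rqmc_estimate(lambda x: np.prod(x), d=2, m=512, r=5, primes=[2, 3], rng=rng)
print(est, se)
```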

3 Quasi-Monte Carlo EM

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is an iterative procedure useful to approximate the maximum likelihood estimator (MLE) in incomplete data problems. Let y be a vector of observed data, let u be a vector of unobserved data or random effects, and let θ denote a vector of parameters. Furthermore, let f(y, u; θ) denote the joint density of the complete data, (y, u). Let L(θ; y) = ∫ f(y, u; θ) du denote the (marginal) likelihood function for this model. The MLE, θ̂, maximizes L(·; y).

In each iteration, the EM algorithm performs an expectation and a maximization step. Let θ^(t-1) denote the current parameter value. Then, in the tth iteration of the algorithm, the E-step computes the conditional expectation of the complete data log-likelihood, conditional on the observed data and the current parameter value,

Q(θ | θ^(t-1)) = E[ log f(y, u; θ) | y; θ^(t-1) ].    (5)

The tth EM update, θ^(t), maximizes (5). That is, θ^(t) satisfies

Q(θ^(t) | θ^(t-1)) ≥ Q(θ | θ^(t-1))    (6)

for all θ in the parameter space. This is also known as the M-step. The M-step is often implemented using standard numerical methods like Newton-Raphson (see Lange, 1995). Solutions to overcome a difficult M-step have been proposed in, for example, Meng & Rubin (1993). Given an initial value θ^(0), the EM algorithm produces a sequence {θ^(0), θ^(1), θ^(2), ...} that, under regularity conditions (see Boyles, 1983; Wu, 1983), converges to θ̂.

In this work we focus on the situation when the E-step does not have a closed form solution. Wei & Tanner (1990) proposed to approximate an analytically intractable expectation in (5) by the empirical average

Q(θ | θ^(t-1)) ≈ Q̃(θ | θ^(t-1); u_1, ..., u_{m_t}) = (1/m_t) Σ_{k=1}^{m_t} log f(y, u_k; θ),    (7)

where u_1, ..., u_{m_t} are simulated from the conditional distribution f(u | y; θ^(t-1)). Then, by the law of large numbers, Q̃(θ | θ^(t-1)) will be a reasonable approximation to Q(θ | θ^(t-1)) if m_t is large enough.
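As a concrete illustration of (7), the following sketch (Python; our own function names) shows the Monte Carlo E-step and a generic MCEM iteration. The draws from f(u | y; θ^(t-1)) and the complete-data log-likelihood are assumed to be supplied by the model at hand.

```python
import numpy as np
from scipy.optimize import minimize

def q_tilde(theta, draws, loglik_complete):
    """Monte Carlo E-step (7): average complete-data log-likelihood over the simulated u_k."""
    return np.mean([loglik_complete(theta, u) for u in draws])

def mcem_step(theta_prev, m_t, sample_u, loglik_complete):
    """One MCEM iteration: simulate m_t draws from f(u | y; theta_prev), then maximize Q-tilde."""
    draws = sample_u(theta_prev, m_t)          # user-supplied sampler for f(u | y; theta_prev)
    objective = lambda theta: -q_tilde(theta, draws, loglik_complete)
    return minimize(objective, x0=theta_prev, method="Nelder-Mead").x
```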

We consider a modification of (7) suitable for RQMC sampling. Let u_1^(j), ..., u_{m_t}^(j), j = 1, ..., r, be r independent RQMC sequences of length m_t, each simulated from f(u | y; θ^(t-1)). (The details of how to simulate an RQMC sequence from f(u | y; θ^(t-1)) are deferred until Subsection 3.2.) Then, an unbiased estimate of (5) is given by the pooled estimate

Q̃^P(θ | θ^(t-1)) = (1/r) Σ_{j=1}^{r} Q̃^(j)(θ | θ^(t-1)),    (8)

where Q̃^(j)(θ | θ^(t-1)) = Q̃(θ | θ^(t-1); u_1^(j), ..., u_{m_t}^(j)) as in (7). The tth Quasi-Monte Carlo EM (QMCEM) update, θ̃^(t), maximizes Q̃^P(· | θ^(t-1)).

3.1 Increasing the length of the RQMC sequences

We have pointed out earlier that the Monte Carlo sample sizes m_t should be increased successively as the algorithm moves along. In fact, Booth et al. (2001) argue that MCEM will never converge if m_t is held fixed across iterations because of a persistent Monte Carlo error (see also Chan & Ledolter, 1995). While earlier versions of the method choose the Monte Carlo sample sizes in a deterministic fashion before the start of the algorithm (e.g. McCulloch, 1997), the same deterministic allocation of Monte Carlo resources that works well in one problem may result in a very inefficient (or inaccurate) algorithm in another problem. Thus, data-dependent (and user-independent) sample size rules are necessary in order to implement MCEM in an automated way. Booth & Hobert (1999) base the decision of a sample size increase on the noise in the parameter updates (see also Levine & Casella, 2001; Levine & Fan, 2003).

Let θ^(t-1) denote the current QMCEM parameter value and let θ̃^(t) denote the maximizer of Q̃^P(· | θ^(t-1)) in (8) based on r independent RQMC sequences, each of length m_t. Thus, θ̃^(t) satisfies

F̃^P(θ̃^(t) | θ^(t-1)) = 0,    (9)

where we define F̃^P(θ | θ') = ∂Q̃^P(θ | θ')/∂θ. Let θ^(t) denote the parameter update of the deterministic EM algorithm, that is, θ^(t) satisfies

F(θ^(t) | θ^(t-1)) = 0,    (10)

where, in similar fashion to above, we define F(θ | θ') = ∂Q(θ | θ')/∂θ. Thus, a first order Taylor expansion of F̃^P(θ̃^(t) | θ^(t-1)) about θ^(t) yields

(θ̃^(t) - θ^(t))^T S̃^P(θ^(t) | θ^(t-1)) ≈ -F̃^P(θ^(t) | θ^(t-1)),    (11)

where we define the matrix S̃^P(θ | θ') = ∂²Q̃^P(θ | θ')/∂θ∂θ^T.

Under RQMC sampling, Q̃^P is an unbiased estimate of Q. Assuming mild regularity conditions, it follows that the expectation

E[F̃^P(θ^(t) | θ^(t-1))] = F(θ^(t) | θ^(t-1)) = 0.    (12)

Therefore, the expected value of θ̃^(t) is θ^(t) and its variance-covariance matrix is given by

Var(θ̃^(t)) = [S̃^P(θ^(t) | θ^(t-1))]^{-1} Var(F̃^P(θ^(t) | θ^(t-1))) [S̃^P(θ^(t) | θ^(t-1))]^{-1}.    (13)

Under regular Monte Carlo sampling, it follows that, for a large enough Monte Carlo sample size, θ̃^(t) is approximately normally distributed with the mean and variance specified above. Under RQMC sampling, however, the accuracy of the normal approximation may depend on the number r of independent RQMC sequences. In Section 4 we consider a range of values for r in order to investigate its effect on QMCEM. In our implementations we estimate Var(θ̃^(t)) by substituting θ̃^(t) for θ^(t) in (13) and estimate Var(F̃^P(θ^(t) | θ^(t-1))) via

(1/r²) Σ_{j=1}^{r} [∂Q̃^(j)(θ | θ^(t-1))/∂θ] [∂Q̃^(j)(θ | θ^(t-1))/∂θ]^T, evaluated at θ = θ̃^(t).    (14)

Larger values of r should result in a more accurate estimate of Var(θ̃^(t)). However, we also pointed out that smaller values of r should result in a better accuracy of the Monte Carlo estimate in (8), since this takes better advantage of the low-discrepancy property of each individual sequence u_1^(j), ..., u_{m_t}^(j). We investigate the impact of this trade-off on the overall efficiency of the method in Section 4.

The QMCEM algorithm proceeds as follows. Following Booth & Hobert's recommendation, we measure the noise in the QMCEM update θ̃^(t) by constructing a (1-α)·100% confidence ellipsoid about the deterministic EM update θ^(t), using the normal approximation for θ̃^(t). If this ellipsoid contains the previous parameter value θ^(t-1), then we conclude that the system is too noisy and we increase the length m_t of the RQMC sequences. Booth et al. (2001) argue that the sample sizes should be increased at an exponential rate. Thus, we increase the sample size to m_{t+1} := (1+κ)m_t, where κ is a small number, typically κ = 0.2, 0.3, 0.4. Since stochastic algorithms, like MCEM, can satisfy deterministic stopping rules purely by chance, it is recommended to continue the method until the stopping rule is satisfied for several consecutive iterations (see also Booth & Hobert, 1999). Thus, we stop the algorithm when the relative change in two successive parameter updates is smaller than some small number δ, δ > 0, for 3 consecutive iterations.
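A hedged sketch of this updating rule follows (Python; all names are ours). The covariance of the update is assumed to come from (13)-(14); as a simplification, the confidence ellipsoid is centered at the computed update θ̃^(t) rather than at the unknown deterministic EM update.

```python
import numpy as np
from scipy.stats import chi2

def needs_larger_sample(theta_prev, theta_tilde, cov_tilde, alpha=0.25):
    """Booth-Hobert-style check: does the (1 - alpha) confidence ellipsoid around the update
    (approximated here as centered at theta_tilde) contain the previous parameter value?"""
    diff = np.asarray(theta_prev) - np.asarray(theta_tilde)
    dist = diff @ np.linalg.solve(cov_tilde, diff)        # squared Mahalanobis distance
    return dist <= chi2.ppf(1.0 - alpha, df=len(diff))    # inside the ellipsoid -> too noisy

def next_sample_size(m_t, too_noisy, kappa=0.2):
    """Increase m_t at an exponential rate, m_{t+1} = (1 + kappa) m_t, when the update is too noisy."""
    return int(np.ceil((1.0 + kappa) * m_t)) if too_noisy else m_t

def converged(history, delta=0.01, window=3):
    """Stop once the relative change is below delta for `window` consecutive iterations
    (assumes parameter values bounded away from zero)."""
    if len(history) <= window:
        return False
    rel = [np.max(np.abs((history[-i] - history[-i - 1]) / history[-i - 1]))
           for i in range(1, window + 1)]
    return all(r < delta for r in rel)
```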

3.2 Laplace Importance Sampling to generate RQMC sequences

Recall that the pooled estimate in (8) is based on r independent RQMC sequences u_1^(j), ..., u_{m_t}^(j), j = 1, ..., r, simulated from f(u | y; θ^(t-1)). In this section we demonstrate how to generate randomized Halton sequences using Laplace importance sampling. Laplace importance sampling has proven useful for drawing approximate samples from f(u | y; θ) in many instances (see Booth & Hobert, 1999; Kuk, 1999). Laplace importance sampling attempts to find an importance sampling distribution whose mean and variance match the mode and curvature of f(u | y; θ). More specifically, suppressing the dependence on y, let

l(u; θ) = log f(y, u; θ)    (15)

denote the complete data log likelihood and let l'(u; θ) and l''(u; θ) denote its first and second derivatives in u, respectively. Suppose that ũ denotes the maximizer of l, satisfying l'(ũ; θ) = 0. Then the Laplace approximations to the mean and variance of f(u | y; θ) are µ(θ) = ũ and Σ(θ) = {-l''(ũ; θ)}^{-1}, respectively (e.g. De Bruijn, 1958). Booth & Hobert (1999) as well as Kuk (1999) propose to use a multivariate normal or multivariate t importance sampling distribution, shifted and scaled by µ(θ) and Σ(θ), respectively. Let f_Lap(u | y; θ) denote the resulting Laplace importance sampling distribution.

Recall that by RQMC property 1), every element of an RQMC sequence has a uniform distribution over C^d. Let x_k be the kth element of a randomized Halton sequence. Using a suitable transformation (e.g. Robert & Casella, 1999), we can generate a d-vector of i.i.d. normal or t variates. Shifting and scaling this vector by µ(θ) and Σ(θ) results in a draw u_k from f_Lap(u | y; θ). Thus, using r independent randomized Halton sequences of length m_t, x_1^(j), ..., x_{m_t}^(j), j = 1, ..., r, we obtain r independent sequences u_1^(j), ..., u_{m_t}^(j) from f_Lap(u | y; θ). Booth & Hobert (1999) and Kuk (1999) successfully use Laplace importance sampling for the fitting of generalized linear mixed models. In the following we apply the method to an application of generalized linear mixed models to data exhibiting spatial correlation.
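A hedged sketch of this transformation (Python; only the normal variant of the importance sampler is shown, and the mode ũ and curvature are assumed to have been found by a separate inner optimization; names are ours):

```python
import numpy as np
from scipy.stats import norm

def laplace_importance_draws(uniforms, mode, neg_hessian):
    """Map RQMC uniforms in (0,1)^d to draws from a normal Laplace importance sampler.

    uniforms    : (m, d) array, e.g. a randomized Halton sequence (entries strictly in (0,1))
    mode        : Laplace mean mu(theta), i.e. the maximizer u-tilde of l(u; theta)
    neg_hessian : -l''(u-tilde; theta), whose inverse is the Laplace covariance Sigma(theta)
    """
    z = norm.ppf(uniforms)                          # inverse-CDF transform: uniform -> i.i.d. N(0,1)
    chol = np.linalg.cholesky(np.linalg.inv(neg_hessian))
    return mode + z @ chol.T                        # shift and scale: u_k = mu + Sigma^{1/2} z_k

# In the average (7), draws from f_Lap are typically accompanied by importance weights
# f(y, u; theta) / f_Lap(u | y; theta), as in standard importance sampling.
```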

4 Application: A Geostatistical Model of Online Purchases

In this section we consider sales data from an online book publisher and retailer. The publisher sells online the titles it publishes in print form as well as, more recently, also in PDF form. The publisher has good reason to believe that a customer's preference for either print or PDF form varies significantly due to his or her geographical location. In fact, since the PDF form is directly downloaded from the publisher's web site, it requires a reliable and typically fast internet connection. However, the availability of reliable internet connections varies greatly across different regions. Moreover, directly downloaded PDF files provide content immediately, without having to wait for shipment as in the case of a printed book. Thus, shipping times can also influence a customer's preference. The preference can also be affected by a customer's access to good quality printers or his/her technology readiness, all of which often exhibit strong local variability.

Data exhibiting spatial correlation can be modelled using generalized linear mixed models (e.g. Breslow & Clayton, 1993). Diggle et al. (1998) refer to these spatial applications of generalized linear mixed models as model-based geostatistics. These spatial mixed models are challenging from a computational point of view since they often involve approximating rather high dimensional integrals. In the following we consider a set of data leading to an analytically intractable likelihood-integral of dimension 16.

Let {z_i}_{i=1}^{d}, z_i = (z_{i1}, z_{i2}), denote the spatial coordinates of the observed responses {y_i}_{i=1}^{d}. For example, z_{i1} and z_{i2} could denote the longitude and latitude of the observation y_i. While y_i could represent a variety of response types, we focus here on the binomial case only. For instance, y_i could indicate whether or not a person living at location z_i has a certain disease or whether or not this person has a preference for a certain product. One of the modelling goals is to account for the possibility that two people living in close geographic proximity are more likely to share the same disease or the same preference. Let u = (u_1, ..., u_d) be a vector of random effects. Assume that, conditional on u_i, the responses y_i arise from the model

y_i | u_i ~ Binomial( n_i, exp(β + u_i) / (1 + exp(β + u_i)) ),    (16)

where β is an unknown regression coefficient. Assume furthermore that u follows a multivariate normal distribution with mean zero and a covariance structure such that the correlation between two random effects decays with the geographical distance between the associated two observations.

For example, assume that

Cov(u_i, u_j) = σ² exp{ -α ‖z_i - z_j‖ },    (17)

where ‖·‖ denotes the Euclidean norm. While different modelling alternatives exist (see, for example, Diggle et al., 1998), we will use the above model to investigate the efficiency of Quasi-Monte Carlo MCEM implementations for estimating the parameter vector θ = (β, σ², α).

We analyze a set of online retail data for the Washington, DC, area. Washington is a very diverse area with respect to a variety of aspects like socio-economic factors or infrastructure. This diversity is often expressed in regionally/locally strongly varying customer preferences. The data set consists of 39 customers who accessed the publisher's web site and either purchased the title in print form or in PDF. In addition to a customer's purchasing choice, the publisher also recorded the customer's geographical location. Geographical location can easily be obtained (at least approximately) through the customer's ZIP code. ZIP code information can then be transformed into longitudinal and latitudinal coordinates. After aggregating customers from the same ZIP code with the same preference, we obtained d = 16 distinct geographical locations. Let n_i denote the number of purchases from location i and let y_i denote the number of PDF purchases thereof. Figure 2 displays the data.

Figure 2 about here

Quasi-Monte Carlo has been found to improve upon the efficiency of classical Monte Carlo methods in a variety of settings. For instance, Bhat (2001) reports efficiency gains via the Halton sequence in a logit model for integral dimensions ranging from 1 to 5. Lemieux & L'Ecuyer (1998), on the other hand, consider integral dimensions as large as 120 and find efficiency improvements for the pricing of Asian options. In our example, the correlation structure of the random effects in equation (17) causes the likelihood function (and therefore also the E-step of the EM algorithm) to include an analytically intractable integral of dimension 16. Indeed, the (marginal) likelihood function for the model in (16) and (17) can be written as

L(θ; y) ∝ ∫ [ Π_{i=1}^{d} f(y_i | u_i; θ) ] exp{ -0.5 u^T Σ^{-1} u } / |Σ|^{1/2} du,    (18)

where u = (u_1, ..., u_16)^T contains the random effects corresponding to the 16 distinct locations and Σ is a matrix with elements σ_ij = Cov(u_i, u_j).
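As a concrete illustration of (17) and of the matrix Σ appearing in (18), a minimal sketch in Python (coordinate and parameter names are ours):

```python
import numpy as np

def exponential_covariance(coords, sigma2, alpha):
    """Covariance matrix (17): Cov(u_i, u_j) = sigma2 * exp(-alpha * ||z_i - z_j||)."""
    z = np.asarray(coords)                                   # shape (d, 2): longitude, latitude
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    return sigma2 * np.exp(-alpha * dists)

# Example: covariance among d = 3 hypothetical locations for sigma2 = 1 and alpha = 1.
coords = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0)]
print(exponential_covariance(coords, sigma2=1.0, alpha=1.0))
```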

The evaluation of high dimensional integrals is computationally burdensome. We conducted a simulation study to investigate the efficiency of QMC approaches relative to that of classical Monte Carlo. Table 1 shows the results for three different QMCEM algorithms, using r = 5, r = 10 and r = 30 RQMC sequences, respectively. These are compared with an implementation of MCEM using classical Monte Carlo techniques. We can see that the Monte Carlo standard errors of the parameter estimates of θ = (β, σ², α) are very similar across the estimation methods, indicating that all 4 methods estimate the parameters with (on average) comparable accuracy. However, the total simulation effort required to obtain this accuracy differs greatly. Indeed, while classical Monte Carlo requires an average number of 800,200 simulated vectors (each of dimension 16!), it only takes 20,836 for QMC (using r = 5 RQMC sequences). This is a reduction in the total simulation effort by a factor of almost 40! It is also interesting to note that among the 3 different QMC approaches, choosing r = 30 RQMC sequences results in an (average) total simulation effort of 30,997 simulated vectors compared to only 20,836 for r = 5.

Table 1 about here

The reduction in the total simulation effort that is possible with the use of QMC methods is intriguing. The MCEM algorithm usually spends most of its simulation effort in the final iterations, when the algorithm is in the vicinity of the MLE. This has already been observed by, for example, Booth & Hobert (1999) or McCulloch (1997). The reason for this is the convergence behavior of the underlying deterministic EM algorithm. EM usually takes large steps in the early iterations, but the size of the steps reduces drastically as EM approaches θ̂. The step size in the tth iteration of EM can be thought of as the signal that is transmitted to MCEM. However, due to the error in the Monte Carlo approximation of the E-step in (7), MCEM receives only a noisy version of that signal. While the signal-to-noise ratio is large in the early iterations of MCEM, it declines continuously as MCEM approaches θ̂. This makes larger Monte Carlo sample sizes necessary in order to increase the accuracy of the approximation in (7) and therefore to reduce the noise. Table 1 shows that QMC methods, due to their superior ability to estimate an intractable integral more accurately, manage to reduce that noise with smaller sample sizes. The result is a smaller total simulation effort required by QMC.

Table 1 also shows that among the 3 different QMCEM algorithms, implementations that use fewer but longer low-discrepancy sequences result in a smaller total simulation effort than a large number of short sequences.

Indeed, the simulation effort for r = 30 RQMC sequences is about 50% higher than that for r = 5 or r = 10. We pointed out in Section 2 that for a given total simulation amount r·m, smaller values of r paired with larger values of m should result in a more accurate integral-estimate. On the other hand, the trade-off for using small values of r is a less accurate variance estimate in (14). In order to implement MCEM using randomized Halton sequences, a balance has to be achieved between a more accurate integral-estimate (i.e. less noise) and a more accurate variance estimate. In our example, we found this balance for values of r between 5 and 10. We also experimented with values smaller than 5 and frequently encountered problems with the numerical stability of the estimate of the covariance matrix in (14).

In the final paragraphs of this section we take a closer look at the noise of the QMCEM algorithm and compare it to classical MCEM. Figure 3 visualizes the Monte Carlo error for three different Monte Carlo estimation methods: classical Monte Carlo using random sampling (column 1), randomized Quasi-Monte Carlo with r = 5 RQMC sequences (column 2), and pure Quasi-Monte Carlo without randomization (column 3).

Figure 3 about here

We can see that for classical Monte Carlo, the average parameter update (thick solid line) is very volatile and has wide confidence bounds (dotted lines). This suggests that the Monte Carlo error is huge. This is in strong contrast to QMC. Indeed, for pure QMC sampling the parameter updates are significantly less volatile, with much tighter confidence bounds. Notice that we allocated the same simulation effort to both simulation methods! It takes classical MCEM much larger sample sizes to reduce the noise to the same level as under QMC sampling.

We have argued at the beginning of this paper that in order to implement MCEM in an automated way, the ability to estimate the error in the Monte Carlo approximation is essential. Randomized QMC methods provide this ability. While randomized Halton sequences have the low-discrepancy property (and thus estimate the integral with a higher accuracy than classical Monte Carlo), randomization may not come for free. Indeed, the second column of Figure 3 shows that, while the error reduction is still substantial compared to a classical Monte Carlo approach, the system is noisier than under pure QMC sampling.

5 Conclusion

In this paper we have demonstrated how recent advances in randomized Quasi-Monte Carlo can be used to implement the MCEM algorithm in an automated, data-driven way. The empirical investigations provide encouraging evidence that this Quasi-Monte Carlo EM algorithm can lead to significant efficiency gains over implementations using regular Monte Carlo methods. We focused our investigations in this work on the randomized Halton sequence only. Other randomized Quasi-Monte Carlo methods exist; see, for example, Owen (1998a) or L'Ecuyer & Lemieux (2002). It could be a rewarding topic for future research to investigate the benefits of different Quasi-Monte Carlo methods for the implementation of Monte Carlo EM (and also other stochastic estimation methods that are frequently encountered in the statistics literature).

Acknowledgements

All the simulations in this work are based on the programming language Ox of Doornik (2001).

References

Bhat, C. (2001). Quasi-random maximum simulated likelihood estimation for the mixed multinomial logit model. Transportation Research 35.

Booth, J. G. & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B 61.

Booth, J. G., Hobert, J. P. & Jank, W. (2001). A survey of Monte Carlo algorithms for maximizing the likelihood of a two-stage hierarchical model. Statistical Modelling 1.

Bouleau, N. & Lépingle, D. (1994). Numerical Methods for Stochastic Processes. New York: Wiley.

Boyles, R. A. (1983). On the convergence of the EM algorithm. Journal of the Royal Statistical Society B 45.

Breslow, N. E. & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88.

Caflisch, R., Morokoff, W. & Owen, A. (1997). Valuation of mortgage-backed securities using Brownian bridges to reduce effective dimension. Journal of Computational Finance 1.

Chan, K. S. & Ledolter, J. (1995). Monte Carlo EM estimation for time series models involving counts. Journal of the American Statistical Association 90.

De Bruijn, N. G. (1958). Asymptotic Methods in Analysis. Amsterdam: North-Holland.

Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39.

Diggle, P. J., Tawn, J. A. & Moyeed, R. A. (1998). Model-based geostatistics. Journal of the Royal Statistical Society A 47.

Doornik, J. A. (2001). Ox: Object Oriented Matrix Programming. London: Timberlake.

Fang, K.-T. & Wang, Y. (1994). Number Theoretic Methods in Statistics. New York: Chapman & Hall.

Faure, H. (1982). Discrépance de suites associées à un système de numération (en dimension s). Acta Arithmetica 41.

Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik 2.

Hobert, J. P. (2000). Hierarchical models: A current computational perspective. Journal of the American Statistical Association 95.

Kuk, A. Y. C. (1999). Laplace importance sampling for generalized linear mixed models. Journal of Statistical Computation and Simulation 63.

Lange, K. (1995). A gradient algorithm locally equivalent to the EM algorithm. Journal of the Royal Statistical Society B 57.

L'Ecuyer, P. & Lemieux, C. (2002). Recent advances in randomized Quasi-Monte Carlo methods. In Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications, M. Dror, P. L'Ecuyer & F. Szidarovszki, eds. Kluwer Academic Publishers.

Lemieux, C. & L'Ecuyer, P. (1998). Efficiency improvement by lattice rules for pricing Asian options. In Proceedings of the 1998 Winter Simulation Conference. IEEE Press.

Levine, R. & Fan, J. (2003). An automated (Markov Chain) Monte Carlo EM algorithm. Tech. rep., San Diego State University.

Levine, R. A. & Casella, G. (2001). Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics 10.

McCulloch, C. E. (1997). Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association 92.

Meng, X.-L. & Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80.

Morokoff, W. J. & Caflisch, R. E. (1995). Quasi-Monte Carlo integration. Journal of Computational Physics 122.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia: SIAM.

Owen, A. (1998a). Scrambling Sobol and Niederreiter-Xing points. Journal of Complexity 14.

Owen, A. B. (1998b). Monte Carlo extension of Quasi-Monte Carlo. In 1998 Winter Simulation Conference Proceedings. New York: Springer.

Pagès, G. (1992). Van der Corput sequences, Kakutani transforms and one-dimensional numerical integration. Journal of Computational and Applied Mathematics 44.

Robert, C. P. & Casella, G. (1999). Monte Carlo Statistical Methods. New York: Springer.

Sobol, I. M. (1967). Distribution of points in a cube and approximate evaluation of integrals. U.S.S.R. Computational Mathematics and Mathematical Physics 7.

Wang, X. & Hickernell, F. J. (2000). Randomized Halton sequences. Mathematical and Computer Modelling 32.

Wei, G. C. G. & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association 85.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics 11.

Figure 1: 2500 points in the unit square: The upper plot ("Regular Monte Carlo") shows the result of regular Monte Carlo sampling, that is, 2500 points selected randomly. Random points tend to form clusters, oversampling the unit square in some places; this leads to gaps in other places, where the sample space is not explored at all. The lower plot ("Quasi-Monte Carlo") shows the result of Quasi-Monte Carlo sampling: 2500 points of a two-dimensional Halton sequence.

Figure 2: Geographical distribution of PDF purchases for Washington, DC: The upper plot ("Geographical Distribution of Data", longitude vs. latitude) shows the geographical borders of Washington, DC, as well as the geographical location of the 39 purchases of PDF or print. The lower plot ("Proportion of PDF Purchases per Location") displays the geographical scatter of the relative proportion of PDF purchases.

Figure 3: Monte Carlo error and Quasi-Monte Carlo error (columns: Classical Monte Carlo, Randomized Quasi-Monte Carlo, Pure Quasi-Monte Carlo; rows: Beta, Sigma, Alpha): Starting MCEM near the MLE, we performed 100 iterations using a fixed Monte Carlo sample size of r·m_t = 1000, t = 1, ..., 100. We repeated this experiment 50 times for a) MCEM using classical Monte Carlo sampling (column 1); b) randomized Quasi-Monte Carlo with r = 5 (column 2); c) pure Quasi-Monte Carlo without randomization, i.e. r = 1 (column 3). For each parameter value we plotted the average of the 50 iteration histories (thick, solid lines) as well as pointwise 95% confidence bounds (dotted lines).

Table 1: Spatial model: The table investigates the efficiency of Quasi-Monte Carlo implementations of MCEM for fitting geostatistical models. We investigate three different Quasi-Monte Carlo (QMC) algorithms using r = 5, 10 and 30 independent RQMC sequences, respectively. These RQMC sequences are obtained via randomized Halton sequences using Laplace importance sampling based on a t distribution with 10 degrees of freedom. We benchmark these three QMC algorithms against an implementation of MCEM based on regular Monte Carlo (MC) sampling using the same Laplace importance sampler. We start each algorithm from (β^(0), σ^2(0), α^(0)) = (0, 1, 1) and increase the length of the RQMC sequences according to Section 3.1 using α = 0.25 and κ = 0.2. The algorithm is terminated if the relative difference in two successive parameter updates falls below δ = 0.01 for 3 consecutive iterations. For each of the four MCEM implementations we performed this experiment 50 times, recording the final parameter values β_i, σ_i^2 and α_i and the total number of simulated vectors, N = Σ_{j=1}^{T_i} r·m_j, where T_i denotes the final iteration number (i = 1, ..., 50). The table displays the Monte Carlo average (AVG) and the Monte Carlo standard error (SE) for these values. For instance, for the regression parameter β it displays the average of the β estimates over the 50 replications and the Monte Carlo standard error s_β/√50, where s_β denotes the sample standard deviation over the 50 replicates.

                  β         σ^2       α         N
MC        AVG                                   800,200
          SE
QMC       AVG                                   20,836
(r=5)     SE
QMC       AVG
(r=10)    SE
QMC       AVG                                   30,997
(r=30)    SE


More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Petr Volf. Model for Difference of Two Series of Poisson-like Count Data

Petr Volf. Model for Difference of Two Series of Poisson-like Count Data Petr Volf Institute of Information Theory and Automation Academy of Sciences of the Czech Republic Pod vodárenskou věží 4, 182 8 Praha 8 e-mail: volf@utia.cas.cz Model for Difference of Two Series of Poisson-like

More information

Estimating Gaussian Mixture Densities with EM A Tutorial

Estimating Gaussian Mixture Densities with EM A Tutorial Estimating Gaussian Mixture Densities with EM A Tutorial Carlo Tomasi Due University Expectation Maximization (EM) [4, 3, 6] is a numerical algorithm for the maximization of functions of several variables

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

K-ANTITHETIC VARIATES IN MONTE CARLO SIMULATION ISSN k-antithetic Variates in Monte Carlo Simulation Abdelaziz Nasroallah, pp.

K-ANTITHETIC VARIATES IN MONTE CARLO SIMULATION ISSN k-antithetic Variates in Monte Carlo Simulation Abdelaziz Nasroallah, pp. K-ANTITHETIC VARIATES IN MONTE CARLO SIMULATION ABDELAZIZ NASROALLAH Abstract. Standard Monte Carlo simulation needs prohibitive time to achieve reasonable estimations. for untractable integrals (i.e.

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

The Polya-Gamma Gibbs Sampler for Bayesian. Logistic Regression is Uniformly Ergodic

The Polya-Gamma Gibbs Sampler for Bayesian. Logistic Regression is Uniformly Ergodic he Polya-Gamma Gibbs Sampler for Bayesian Logistic Regression is Uniformly Ergodic Hee Min Choi and James P. Hobert Department of Statistics University of Florida August 013 Abstract One of the most widely

More information

A Review of Basic Monte Carlo Methods

A Review of Basic Monte Carlo Methods A Review of Basic Monte Carlo Methods Julian Haft May 9, 2014 Introduction One of the most powerful techniques in statistical analysis developed in this past century is undoubtedly that of Monte Carlo

More information

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter

More information

Communications in Statistics - Simulation and Computation. Comparison of EM and SEM Algorithms in Poisson Regression Models: a simulation study

Communications in Statistics - Simulation and Computation. Comparison of EM and SEM Algorithms in Poisson Regression Models: a simulation study Comparison of EM and SEM Algorithms in Poisson Regression Models: a simulation study Journal: Manuscript ID: LSSP-00-0.R Manuscript Type: Original Paper Date Submitted by the Author: -May-0 Complete List

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

Parametric fractional imputation for missing data analysis

Parametric fractional imputation for missing data analysis 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (????),??,?, pp. 1 15 C???? Biometrika Trust Printed in

More information

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models

Accounting for Missing Values in Score- Driven Time-Varying Parameter Models TI 2016-067/IV Tinbergen Institute Discussion Paper Accounting for Missing Values in Score- Driven Time-Varying Parameter Models André Lucas Anne Opschoor Julia Schaumburg Faculty of Economics and Business

More information

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17 MCMC for big data Geir Storvik BigInsight lunch - May 2 2018 Geir Storvik MCMC for big data BigInsight lunch - May 2 2018 1 / 17 Outline Why ordinary MCMC is not scalable Different approaches for making

More information

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.

Kernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt. SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference

Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of

More information

The square root rule for adaptive importance sampling

The square root rule for adaptive importance sampling The square root rule for adaptive importance sampling Art B. Owen Stanford University Yi Zhou January 2019 Abstract In adaptive importance sampling, and other contexts, we have unbiased and uncorrelated

More information

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Supplementary Note on Bayesian analysis

Supplementary Note on Bayesian analysis Supplementary Note on Bayesian analysis Structured variability of muscle activations supports the minimal intervention principle of motor control Francisco J. Valero-Cuevas 1,2,3, Madhusudhan Venkadesan

More information

Likelihood-based inference with missing data under missing-at-random

Likelihood-based inference with missing data under missing-at-random Likelihood-based inference with missing data under missing-at-random Jae-kwang Kim Joint work with Shu Yang Department of Statistics, Iowa State University May 4, 014 Outline 1. Introduction. Parametric

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton

Last lecture 1/35. General optimization problems Newton Raphson Fisher scoring Quasi Newton EM Algorithm Last lecture 1/35 General optimization problems Newton Raphson Fisher scoring Quasi Newton Nonlinear regression models Gauss-Newton Generalized linear models Iteratively reweighted least squares

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm 1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable

More information

Bayesian Econometrics

Bayesian Econometrics Bayesian Econometrics Christopher A. Sims Princeton University sims@princeton.edu September 20, 2016 Outline I. The difference between Bayesian and non-bayesian inference. II. Confidence sets and confidence

More information