AN EM ALGORITHM FOR HAWKES PROCESS
Peter F. Halpin
New York University

December 17, 2012

Correspondence should be sent to Dr. Peter F. Halpin, 246 Greene Street, Office 316E, New York, NY.
Abstract

This manuscript addresses the EM algorithm developed in Halpin & De Boeck (in press). The runtime of the algorithm grows quadratically in the number of observations, making its application to large data sets impractical. A strategy for improving efficiency is introduced, and this results in linear growth for many applications. The performance of the modified algorithm is assessed using data simulation.

Key words: Hawkes process; EM algorithm; maximum likelihood; runtime
Introduction

Halpin & De Boeck (in press) considered the time series analysis of bivariate event data in the context of dyadic interaction. They proposed the use of point processes, and in particular Hawkes process, as a way to capture the temporal dependence between the actions of two individuals. Estimation was based on the so-called branching structure representation of Hawkes process, which they showed to be amenable to estimation via the EM algorithm (see also Veen & Schoenberg, 2008). Unfortunately, the runtime of the algorithm grows quadratically in the number of observations, making its application to large data sets impractical.

The present paper provides a modification of the original algorithm that substantially improves its runtime. The modification reduces the number of computations in the algorithm by tolerating a specified degree of rounding error, and this results in linear growth for many applications. The next section outlines Hawkes process in sufficient detail for this paper to be self-contained and gives an intuitive description of the problem to be addressed. The subsequent section presents the modification to the EM algorithm and illustrates some cases where this yields linear growth. The final section uses data simulation to arrive at a magnitude of rounding error that has a negligible effect on parameter recovery.

Hawkes Process

Under mild conditions, a point process can be uniquely defined in terms of its conditional intensity function (CIF). The main reason for specifying a point process in terms of its CIF is that this leads directly to an expression for its likelihood. A general form for the CIF is

    λ(t) = lim_{Δ→0} E(M{(t, t + Δ)} | H_t) / Δ    (1)

where M{(a, b)} is a random counting measure representing the number of events (i.e., isolated points) falling in the interval (a, b), E(M{(a, b)}) is its expected value, and H_t is the σ-algebra generated by the time points t_k, k ∈ N, occurring before time t ∈ R+ (see Daley & Vere-Jones, 2003). In this paper it is assumed that the probability of multiple events occurring simultaneously is negligible, in which case M is said to be orderly. Then for fixed t and sufficiently small values of Δ, λ(t)Δ approximates the Bernoulli probability of an event occurring in the interval (t, t + Δ), conditional on all of the events happening before time t. In applications, this means we are concerned with how the probability of an event changes over continuous time as a function of previous events.

Point processes extend immediately to the multivariate case. M{(a, b)} is then vector-valued and each univariate margin gives the number of a different type of event occurring in the time period (a, b). Although Halpin and De Boeck (in press) considered a bivariate model, this paper focusses on the univariate case since the problem to be addressed can be most simply explained in that situation.

The CIF of Hawkes process can be specified as a linear causal filter:

    λ(t) = μ + ∫_0^t φ(t − s) dM(s).    (2)

The interpretation of equation (2) is unpacked in the following three points.

1. μ > 0 is a baseline, which can be a function of time but is here treated as a constant.

2. φ(u) is a response function that governs how the process depends on its past. Hawkes process requires the following three assumptions: φ(u) ≥ 0 for u ≥ 0; φ(u) = 0 for u < 0; and ∫_0^∞ φ(u) du ≤ 1. Together these assumptions imply that φ can be written as

       φ(u) = α f(u; ξ)    (3)

   where 0 ≤ α ≤ 1 and f(u; ξ) is a probability density function on R+ with parameter ξ. Equation (3) presents a convenient method for parametrizing φ, with some common choices for f(u; ξ) being the exponential (e.g., Ogata, 1988; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005), the two-parameter gamma (Halpin & De Boeck, in press), and the power law distribution (Barabási, 2005; Crane & Sornette, 2008). Under this parameterization, α is referred to as the intensity parameter and f(u; ξ) as the response kernel.

3. In the case that M is orderly, dM(s) is representable as a series of right-shifted Dirac delta functions, and the integral reduces to a sum over all events in [0, t], yielding

       ∫_0^t φ(t − s) dM(s) = Σ_{t_j < t} φ(t − t_j).    (4)

Thus each new time point is associated with a response function describing how that time point affects the future of the process. Under the assumptions of Hawkes process, each new time point increases the probability of further events occurring in the immediate future (i.e., φ(u) is non-negative). The summation shows that the effect of multiple time points on the probability of further events is cumulative. For these reasons, Hawkes process is often referred to as self-exciting: the occurrence of one event increases the probability of further events, whose occurrence in turn increases the probability of even more events. In terms of applications, this means that Hawkes process is appropriate for modelling clustering, which occurs when periods of high event frequency are separated by periods of relative inactivity.

As noted, the CIF leads directly to an expression for the log-likelihood (see Daley & Vere-Jones, 2003):

    ℓ(θ | X) = Σ_k ln(λ(t_k)) − ∫_0^T λ(s) ds    (5)

where [0, T] is the observation period, X = {t_1, t_2, ...} denotes the observed event times, and θ contains the parameters of the model.
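To make equations (2) through (5) concrete, the following minimal sketch evaluates the CIF and the log-likelihood for an exponential response kernel, f(u; ξ) = ξe^{−ξu}, so that φ(u) = αξe^{−ξu}. This is an illustration only, not the implementation used in the paper (which employed a gamma kernel and C code); the function names are hypothetical.

```python
import numpy as np

def hawkes_cif(t, events, mu, alpha, xi):
    """Equations (2)-(4): lambda(t) = mu + sum over t_j < t of
    alpha * xi * exp(-xi * (t - t_j)), for an exponential kernel."""
    past = events[events < t]
    return mu + alpha * xi * np.exp(-xi * (t - past)).sum()

def hawkes_loglik(events, T, mu, alpha, xi):
    """Incomplete-data log-likelihood, equation (5).

    Evaluating lambda(t_k) at every event requires n(n - 1)/2 kernel
    evaluations in total, which is the quadratic growth discussed in
    the text."""
    ll = 0.0
    for k in range(len(events)):
        ll += np.log(hawkes_cif(events[k], events, mu, alpha, xi))
    # integral of lambda over [0, T]: the baseline contributes mu * T and
    # each response function contributes its kernel mass inside the window
    ll -= mu * T + alpha * np.sum(1.0 - np.exp(-xi * (T - events)))
    return ll

# toy usage on a handful of event times
events = np.array([0.7, 1.1, 1.3, 4.2, 4.25, 9.0])
print(hawkes_loglik(events, T=10.0, mu=0.3, alpha=0.4, xi=2.0))
```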
Substitution of equations (2) through (4) into equation (5) shows that the log-likelihood of Hawkes process contains the logarithm of a weighted sum of density functions. A similar situation occurs in finite mixture modelling (e.g., McLachlan & Peel, 2000) and nonlinear regression (e.g., Seber & Wild, 2003), where it is known to lead to numerical optimization problems related to ill-conditioning of, and multiple roots in, the likelihood function. In the present case the problem is aggravated by the fact that the number of densities appearing in the likelihood increases with the number of observations, as shown in equation (4). It is important to note that the number of model parameters does not grow with the number of time points; the densities are simply right-shifted. In general, if there are a total of n observed events, then there are a total of n(n − 1)/2 response functions appearing in the log-likelihood of a univariate Hawkes process, not including the duplicated response functions appearing in the integral. This is the source of the quadratic growth of the optimization problem, which is the issue to be dealt with in this paper.

The quadratic growth is especially problematic because the EM algorithm proposed by Halpin and De Boeck (in press) requires the use of multiple starting values. This means that even moderately sized data sets cannot be estimated in a reasonable amount of time. For example, an actual runtime of over 24 hours was recorded for a problem with N ≈ 1500 events and 50 starting values (implemented in the C language on a machine with a 2 GHz processor). Because one of the most exciting potential applications of Hawkes process is to big data collected via computer-mediated communication (e.g., databases, Twitter), it is important to have an estimation approach that is feasible for large samples. The following section outlines how that can be accomplished.

Reducing Runtime by Introducing Rounding Error

This section outlines the original EM algorithm suggested by Halpin and De Boeck (in press) and then considers how to reduce its runtime. The algorithm is based on an alternative representation of Hawkes process, which is referred to as its branching structure.
In terms of the EM algorithm, the branching structure provides the complete data representation of the model, whereas the causal filter in equation (2) is the incomplete data representation. Taking this approach, the logarithm of the sum of densities in equation (5) is replaced by the sum of their logarithms, which results in better conditioning of the numerical optimization problem and was shown to perform satisfactorily with relatively small data sets (N ≈ 400). Although the considerations of this section could also be made for equation (5), the focus is on the EM approach.

The branching structure representation of Hawkes process is in terms of a cluster Poisson process. It was first proposed by Hawkes and Oakes (1974), who proved it to be equivalent to the representation given in the foregoing section. Their argument was very technical and it served to establish the existence and uniqueness of the process. The branching structure has also found more intuitive applications. For example, in ecology it is used to describe the growth of wildlife populations in terms of subsequent generations of offspring due to each immigrant (e.g., Rasmussen, 2011). In the context of disease control, it is interpreted as the number of people contaminated by each subsequent carrier (e.g., Daley & Vere-Jones, 2003). Veen and Schoenberg (2008) were the first to consider the branching structure as a strategy for obtaining maximum likelihood estimates (MLEs) of Hawkes process.

For the present purpose, the effect of the branching structure is to decompose Hawkes process into n independent Poisson processes whose rate functions are given by the response functions in equation (3). These processes govern the number of offspring of each event. There is also an additional Poisson process governing the number of immigrant events; this process has a rate function given by the baseline parameter μ. Importantly, each event t_k is assumed to be due to one and only one of these independent Poisson processes: either one centered at its parent, t_j, with t_j < t_k, or the baseline process. Consequently, if we knew which process each event belonged to, estimation of Hawkes process would reduce to that for a collection of independent Poisson processes.
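The cluster interpretation also yields a transparent simulation scheme: immigrants arrive as a homogeneous Poisson process with rate μ, and each event independently produces a Poisson(α) number of offspring at lags drawn from the response kernel. The paper's own simulations use the inverse method instead; the sketch below, which assumes an exponential kernel, uses hypothetical function names, and ignores offspring that would fall beyond T, is offered only to illustrate the decomposition.

```python
import numpy as np

def simulate_hawkes_branching(mu, alpha, xi, T, rng):
    """Simulate a univariate Hawkes process on [0, T] via its branching
    structure: a Poisson(mu * T) number of immigrants, each event then
    spawning Poisson(alpha) offspring at exponential(xi) lags."""
    n_immigrants = rng.poisson(mu * T)
    generation = list(rng.uniform(0.0, T, size=n_immigrants))
    events = []
    while generation:                        # breadth-first over generations
        events.extend(generation)
        offspring = []
        for parent in generation:
            n_children = rng.poisson(alpha)  # E(offspring) = integral of phi = alpha
            lags = rng.exponential(1.0 / xi, size=n_children)
            offspring.extend(t for t in parent + lags if t < T)
        generation = offspring
    return np.sort(np.array(events))

rng = np.random.default_rng(1)
events = simulate_hawkes_branching(mu=0.2, alpha=0.4, xi=1.0, T=500.0, rng=rng)
print(len(events))  # on average roughly mu * T / (1 - alpha) events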
It is therefore natural to introduce a missing variable that describes the specific process to which each event t_k belongs, and to proceed by means of the EM algorithm. As with other applications of the EM algorithm, the missing data need not correspond to the hypothesized data generating process; they can be treated merely as a tool for obtaining MLEs.

The following notation is employed to set up the algorithm. Let Z = (Z_1, Z_2, ..., Z_n) denote the missing data. If an event t_k is an offspring of event t_j, t_j < t_k, this is denoted by setting Z_k = j. If an event t_k is an immigrant then Z_k = 0. Also let φ_j(u) denote the response functions governing each Poisson process, where it is understood that φ_0(u) = μ. For j > 0, these response functions are identical to those introduced in equation (3) above, except the subscript serves to make explicit the centering event t_j. Letting ℓ(θ | X, Z) denote the complete data log-likelihood, Halpin and De Boeck (in press) showed that

    Q(θ) = E_{Z|X,θ}[ℓ(θ | X, Z)]
         = Σ_{j=0}^{n} ( Σ_{k>j} ln(φ_j(t_k − t_j)) Prob(Z_k = j | X, θ) − ∫_0^{T − t_j} φ_j(u) du )    (6)

where

    Prob(Z_k = j | X, θ) = φ_j(t_k − t_j) / Σ_{r<k} φ_r(t_k − t_r).    (7)

Equations (6) and (7) provide the necessary components of an EM algorithm for Hawkes process. Equation (7) is readily computed on the E step. On the M step these probabilities are treated as fixed and entered into equation (6). Using this approach, Halpin and De Boeck (in press) provided closed form solutions for the baseline parameter μ and the intensity parameter α. However, in order to obtain the parameters of the response kernel, it is necessary to numerically optimize the Q function. This is the computationally expensive part of the algorithm.
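As an illustration of how equations (6) and (7) translate into an E step and an M step, here is a sketch for the exponential kernel. The closed-form updates for μ and α shown here are a standard simplification that ignores edge effects near T (the paper gives the exact expressions), and for the exponential kernel the rate ξ also happens to have a closed-form update; the two-parameter gamma kernel used in the paper requires numerical optimization of Q at this point. None of this is the authors' code.

```python
import numpy as np

def e_step(events, mu, alpha, xi):
    """Equation (7): p[k, 0] is Prob(t_k is an immigrant) and p[k, j + 1]
    is Prob(t_k is an offspring of t_j), for j < k."""
    n = len(events)
    p = np.zeros((n, n + 1))
    for k in range(n):
        p[k, 0] = mu                                     # phi_0(u) = mu
        lags = events[k] - events[:k]
        p[k, 1:k + 1] = alpha * xi * np.exp(-xi * lags)  # phi_j(t_k - t_j)
        p[k] /= p[k].sum()                               # denominator of (7)
    return p

def m_step(events, T, p):
    """M step with edge effects near T ignored (illustrative only)."""
    n = len(events)
    mu = p[:, 0].sum() / T                    # expected immigrants per unit time
    w_sum, wl_sum = 0.0, 0.0
    for k in range(n):
        lags = events[k] - events[:k]
        w_sum += p[k, 1:k + 1].sum()          # expected total offspring count
        wl_sum += (p[k, 1:k + 1] * lags).sum()  # expected total offspring lag
    alpha = w_sum / n                         # expected offspring per event
    xi = w_sum / wl_sum                       # weighted exponential MLE of rate
    return mu, alpha, xi
```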
Since the sum over k > j is the source of the quadratic growth of the Q function, let us first consider how it can be reduced. Recall that for j ≥ 1, φ_j(u) = α f(u; ξ) is just a weighted density on R+. For usual choices of the response kernel, f(u; ξ) → 0 as u becomes large (i.e., response functions typically have a right tail that asymptotes at zero). Intuitively, this means that when t_k − t_j is large, the contribution of φ_j(t_k − t_j) to equation (6) will be negligible.

In order to make this idea more formal, consider the sets

    W_j = {k > j : f(t_k − t_j; ξ) > w}

for a tolerance w > 0, and let W̄ denote the average of the cardinalities of the W_j. Replacing the sum over k > j with the sum over k ∈ W_j in equation (6) results in W̄·n densities appearing in the double summation. This substitution will be referred to as the modified Q function and denoted Q̃. W̄ is the linear growth factor of Q̃. The relative efficiency of Q̃ over Q is

    R = W̄·n / (n(n − 1)/2) = 2W̄ / (n − 1).

The value of W̄ depends on (a) ξ, which is updated throughout the optimization process, (b) w, which can be determined by the researcher, and (c) the actual observations t_k, which are fixed. This makes it difficult to obtain analytical results on W̄. However, Table 1 provides evidence that it does not grow with n and that it can be much smaller than (n − 1)/2.
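For a monotonically decreasing kernel the sets W_j are simply moving windows, so they can be located without touching all n(n − 1)/2 pairs. The sketch below is an illustration for the exponential kernel, not code from the paper; it counts the retained pairs and reports W̄ and the relative efficiency R.

```python
import numpy as np

def truncation_summary(events, xi, w):
    """Count, for each event j, the later events k with
    f(t_k - t_j; xi) = xi * exp(-xi * (t_k - t_j)) > w.

    Because the exponential kernel is decreasing, f(u) > w is equivalent
    to u < log(xi / w) / xi, so each W_j is a contiguous window that can
    be found by binary search on the sorted event times.  Assumes w < xi,
    so that the window length is positive."""
    n = len(events)
    u_max = np.log(xi / w) / xi                  # lag beyond which f(u) <= w
    # first index at or past t_j + u_max, minus the j + 1 events up to t_j
    counts = np.searchsorted(events, events + u_max) - np.arange(1, n + 1)
    W_bar = counts.mean()                        # average cardinality of W_j
    R = 2.0 * W_bar / (n - 1)                    # efficiency of Q-tilde vs. Q
    return W_bar, R
```

For example, with ξ = 1 and w = 1e-10, u_max ≈ 23, so W̄ is roughly the expected number of events falling in a window of that fixed length; it depends on the local event rate but not on n, which is the linear growth behaviour reported in Table 1.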
=========================
Insert Table 1 about here
=========================

The table was produced by simulating data using the inverse method (see Daley & Vere-Jones, 2003). The causal filter in equation (2) was used for simulation, not the branching structure. Three different sample sizes (N = 500, 1500, and 5000) were simulated from each of three different models. Model 1 and Model 2 used exponential response functions, with Model 1 having moderate intensity (α = .4) and Model 2 having high intensity (α = .8). This means that the data from Model 2 showed a much higher degree of clustering (i.e., a larger number of events occurring in close proximity to one another). Model 3 was also high intensity (α = .8) but used a two-parameter gamma kernel with shape parameter set to .5. The result is heavier-tailed response functions, which have been reported in various applications to human communication data (e.g., Barabási, 2005; Crane & Sornette, 2008; Halpin & De Boeck, in press). The choices of intensity parameter are intended to reflect its possible range rather than realistic values; I have not seen intensity estimates greater than .5 in real data applications. For each simulated data set, Q̃ was computed using the true parameter values and w = 1e-10.

The main point to be taken from Table 1 is that the values of W̄ did not increase with n, and therefore the rate of growth of Q̃ was linear. The exact rate of linear growth depended on the parameters of the data generating model, with more clustered data showing faster growth. However, even at extraordinarily high intensities and even at the smallest sample size, the growth rate was much smaller than (n − 1)/2. Based on these results, it is reasonable to conclude that Q̃ is more efficient to compute than Q. It should be emphasized that this depends on the type of response kernel; the approach outlined here will not work unless the response kernel has a right tail that asymptotes at zero.

Table 1 does not address how the rounding error w affects the MLEs produced by the EM algorithm. That is the topic of the next section. Although this section has focussed only on the computation of the Q function, entirely similar remarks can be made about the computation of equation (7) on the E step, and about the computation of equation (5).

Effect of Rounding Error on the EM Algorithm

This section considers how the rounding error w affects convergence and parameter recovery.
Data were again simulated using the inverse method with the incomplete data model (equation (2)). The data-generating model used a two-parameter gamma density as the response kernel. The parameters of the data generating model are stated in Table 3 and were based on the real data example reported in Halpin and De Boeck (in press). N = 250 data sets of n = 500 time points each were generated from the model. For each data set, the EM algorithm described in Halpin and De Boeck (in press) was implemented using Q̃ in place of Q. The starting values for the estimation algorithm were obtained by randomly disturbing the data generating values, which avoided the need for multiple starting values. Convergence was evaluated using the incomplete data log-likelihood (equation (5)). The convergence criterion was an absolute difference of less than 1e-5 on subsequent M steps.

The simulation compared the rounding errors w = 0, 1e-10, 1e-5, and 1e-3. Because a rounding error of 0 is not possible in practice, this was implemented using w = 2.22e-16, the double precision machine epsilon on most modern computers. Therefore the value w = 0 represents the amount of error that is intrinsic to the specific realization of the estimation process (i.e., with the given sample size, convergence criterion, etc.). The remaining values of w represent the introduction of rounding error for computational efficiency.
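For concreteness, the outer loop of the estimation procedure can be sketched as follows, reusing the illustrative e_step, m_step, and hawkes_loglik functions from the earlier sketches (exponential kernel, whereas the study itself used a gamma kernel) together with the convergence rule just described. In the modified algorithm, the sums inside the E step would additionally be restricted to the truncated index sets W_j determined by w.

```python
import numpy as np

def em_fit(events, T, mu, alpha, xi, tol=1e-5, max_iter=1000):
    """Outer EM loop: iterate E and M steps until the incomplete-data
    log-likelihood (5) changes by less than tol between M steps."""
    ll_old = -np.inf
    for _ in range(max_iter):
        p = e_step(events, mu, alpha, xi)      # E step: probabilities (7)
        mu, alpha, xi = m_step(events, T, p)   # M step: maximize Q (or Q-tilde)
        ll = hawkes_loglik(events, T, mu, alpha, xi)
        if abs(ll - ll_old) < tol:             # criterion used in the study
            break
        ll_old = ll
    return mu, alpha, xi, ll
```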
Let us first consider the role of rounding error in the convergence of the algorithm. Figure 1 shows the relationship between the log-likelihoods evaluated at the MLEs and the log-likelihoods evaluated at the data generating parameters. The relation is quite similar for the three smallest values of w, but is appreciably worse for the largest value. It is important to note that even for w = 0, the relationship is not perfect. The amount of additional error introduced by the two middle values of w is not perceptible in the figure.

=========================
Insert Figure 1 about here
=========================

Table 2 provides a closer look at the log-likelihoods. It reports the mean and standard deviation of the differences between the log-likelihoods of the estimated models and the log-likelihoods computed using the true values. The table entries are reported as percentages of the difference between the log-likelihoods of w = 0 and of the true values (i.e., as percentages of the intrinsic estimation error). If w > 0 did not affect the convergence of the EM algorithm, all values in the table would be 100. Based on the table we can conclude that all values of w > 0 introduced additional error into the convergence of the EM algorithm. For w = 1e-10 this was less than .1 percent of the intrinsic estimation error.

=========================
Insert Table 2 about here
=========================

Turning now to parameter recovery, Table 3 reports the bias and error of the MLEs for each level of rounding error. The entries are reported as percentages of the data generating parameters. It can be seen that bias and error were very similar for the two lowest values of w, but for larger values of w there is increased bias and reduced error. Figure 2 shows the distribution of estimates of the gamma response kernels for w = 0 and w = 1e-10.

=========================
Insert Table 3 about here
=========================

=========================
Insert Figure 2 about here
=========================
Based on this simulation it may be concluded that there is little to distinguish the results obtained using a rounding error of w = 1e-10 from the intrinsic error in the algorithm (i.e., w = 0). On the other hand, w ≥ 1e-5 has a relatively large influence both on the convergence of the algorithm and on the bias and error of the resulting parameter estimates.

Conclusions

The number of computations required by the EM algorithm proposed by Halpin and De Boeck (in press) grows quadratically in the number of observed events, making its application to large data sets infeasible. This paper has shown that the runtime of the algorithm can be reduced by introducing rounding error into the computation of the Q function (i.e., the objective function of the M step of the EM algorithm). In three applications involving response functions with right tails asymptoting at zero, this was shown to result in linear growth. The consequences for convergence of the algorithm and for parameter recovery were also considered. A rounding error of 1e-10 was found to have negligible effects, but larger values did not. While more research can be done to optimize the rounding error for specific applications of the algorithm, it can be concluded that the approach presented here provides an acceptable compromise between runtime and computational accuracy.
References

Barabási, A. L. (2005). The origin of bursts and heavy tails in human dynamics. Nature, 435.

Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105.

Daley, D. J., & Vere-Jones, D. (2003). An introduction to the theory of point processes: Elementary theory and methods (2nd ed., Vol. 1). New York: Springer.

Halpin, P. F., & De Boeck, P. (in press). Modeling dyadic interaction using Hawkes process. Psychometrika.

Hawkes, A. G., & Oakes, D. (1974). A cluster representation of a self-exciting process. Journal of Applied Probability, 11.

McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: John Wiley and Sons.

Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83.

Rasmussen, J. G. (2011). Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability.

Seber, G. A. F., & Wild, C. J. (2003). Non-linear regression (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93.

Veen, A., & Schoenberg, F. P. (2008). Estimation of space-time branching process models in seismology using an EM-type algorithm. Journal of the American Statistical Association, 103.
Tables
Table 1.
Growth of the Q̃ Function in Number of Time Points (Simulated Data)

             n = 500    n = 1500    n = 5000
Model 1         –           –           –
Model 2         –           –           –
Model 3         –           –           –

Note: n is the number of simulated time points and the table entries are the linear growth factor, W̄, of the modified Q function, Q̃, computed using the true parameter values. W̄·n gives the number of computations required for Q̃ and 2W̄/(n − 1) gives the efficiency of Q̃ relative to the original Q function proposed by Halpin and De Boeck (in press). The models are described in the text.
Table 2.
Effect of Rounding Error on Log-likelihoods (Simulated Data)

          w = 0    w = 1e-10    w = 1e-5    w = 1e-3
Mean        –          –            –           –
SD          –          –            –           –

Note: Table entries are means (Mean) and standard deviations (SD) of the differences between the log-likelihoods of the estimated models and the log-likelihoods computed using the true values. The means and standard deviations are reported as percentages of the values for w = 0 (i.e., percentages of the intrinsic estimation error). The MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function and the indicated levels of rounding error, w.
Table 3.
Effect of Rounding Error on Parameter Recovery (Simulated Data)

                  µ              α              κ              β
True values       –              –              –              –
w = 0          – (12.707)    – (14.282)    – (11.812)    – (49.986)
w = 1e-10      – (12.725)    – (14.315)    – (11.824)    – (50.664)
w = 1e-5       – (10.857)    – (11.592)    – (11.215)    – (22.937)
w = 1e-3       – (9.969)     – (8.786)     – (17.114)    – (3.618)

Note: Table entries are bias (error) of maximum likelihood estimates (MLEs) as percentages of the true values. µ denotes the baseline parameter of Hawkes process, α the intensity parameter, κ the shape parameter of the two-parameter gamma response kernel, and β its scale parameter. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function and the indicated levels of rounding error, w.
[Figure 1: four scatterplot panels, one for each of w = 0, w = 1e-10, w = 1e-5, and w = 1e-3; x-axis: MLEs, y-axis: True values; the correlation r = .998 is shown in the w = 0 and w = 1e-5 panels.]

Figure 1. Relation of log-likelihoods at convergence with log-likelihoods computed using the data generating values (simulated data). The model was estimated using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
[Figure 2: four histogram panels of MLEs of the shape (left column) and scale (right column) parameters, for w = 0 (top row) and w = 1e-10 (bottom row); y-axis: Frequency.]

Figure 2. Histograms of maximum likelihood estimates (MLEs) of the two-parameter gamma density kernel (simulated data). Bold vertical line indicates the value of the data generating parameters. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
More information