Model Based Clustering of Count Processes Data
Tin Lok James Ng, Brendan Murphy
Insight Centre for Data Analytics
School of Mathematics and Statistics
CASI, May 15, 2017
Background: Model Based Clustering

Sample observations arise from a distribution that is a mixture of two or more components. Each component is described by a probability density function and an associated mixing probability.

Commonly used to model p-dimensional observations x_1, …, x_n, e.g. a mixture of G components, each of which is multivariate normal with density f_g(x | μ_g, Σ_g), g = 1, …, G.
Background: Model Based Clustering

The majority of research in model based clustering focuses on sample observations consisting of multivariate data taking values in R^p with p ≥ 1 (Fraley and Raftery, 2002; Raftery and Dean, 2006).

Recently, clustering techniques for functional data have been proposed, where each observation is a stochastic process taking values in an infinite dimensional space (Abraham et al., 2003; Giacofci et al., 2013).
Model Based Clustering for Count Processes

In various applications, one observes a collection of count process data, where each count process consists of event times observed over a fixed time interval.

Some examples are the occurrence of natural disasters in various locations, observed failure times of multiple light bulbs, and the temporal sequence of action potentials generated by neurons.

The area of model based clustering techniques for count process data is under-developed.
Model Based Clustering for Count Processes

We assume a collection of N count processes x = {x_i}_{i=1}^N observed on a fixed interval [0, T) ⊂ R.

Each observation x_i = {x_{i,1}, x_{i,2}, …, x_{i,n_i}} consists of a set of event times with x_{i,1} ≤ x_{i,2} ≤ … ≤ x_{i,n_i}, for i = 1, …, N.

We assume a total of G mixture components and that each observation arises from one of the G components.
Model Specification: Poisson Process

The non-homogeneous Poisson process (NHPP) is a popular model in various applications where count process data are encountered.

A temporal NHPP defined on [0, T) is associated with a non-negative and locally integrable intensity function λ: [0, T) → R_+.

For any bounded region B ⊆ [0, T), the number of events N(B) follows a Poisson distribution with rate Λ(B) = ∫_B λ(x) dx.
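As an illustration beyond the slides, event times of an NHPP with a known intensity upper bound can be simulated by Lewis-Shedler thinning; the function name and the example intensity below are hypothetical, a sketch rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_nhpp(intensity, T, lam_max, rng):
    """Simulate event times of an NHPP on [0, T) by Lewis-Shedler thinning.

    Candidates are drawn from a homogeneous Poisson process with rate
    lam_max (an upper bound on the intensity); each candidate is kept
    with probability intensity(t) / lam_max.
    """
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)   # next candidate arrival
        if t >= T:
            break
        if rng.uniform() < intensity(t) / lam_max:
            events.append(t)
    return np.array(events)

# Illustrative intensity that peaks mid-interval (bounded by lam_max = 5)
T = 10.0
intensity = lambda t: 5.0 * np.exp(-0.5 * (t - 5.0) ** 2)
events = simulate_nhpp(intensity, T, lam_max=5.0, rng=rng)
```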
Model Specification: Gaussian Process

A random scalar function g: S → R is said to have a Gaussian process prior if, for any finite collection of points s_1, …, s_N ∈ S, the function values g(s_1), …, g(s_N) follow a multivariate Gaussian distribution.

The Gaussian process prior can be defined by a mean function m(·): S → R and a covariance function k(·,·): S × S → R, which further depend on hyperparameters φ (Rasmussen and Williams, 2006).
Model Specification: Gaussian Cox Process

The combination of a Poisson process and a Gaussian process prior is known as a Gaussian Cox process. It provides an attractive modeling framework for inferring the underlying intensity function, since one only needs to specify the form of the mean and covariance functions.
Model Specification: Mixture of Poisson Processes

We assume each count process observation x_i is governed by one of G underlying normalized intensity functions {λ_g}_{g=1}^G, where λ_g: [0, T) → R_+ and ∫_0^T λ_g(x) dx = 1.

We let τ_g, g = 1, …, G, be the probability that a count process observation is associated with intensity function λ_g.

We introduce a scaling factor α_i for each count process observation, so that x_i ~ NHPP(α_i λ_g) with probability τ_g.

We apply the logistic density transform to obtain the intensity functions {λ_g}_{g=1}^G:

λ_g(x) = exp(f_g(x)) / ∫_0^T exp(f_g(s)) ds    (1)
Model Specification: Mixture of Poisson Processes

A zero-mean Gaussian process prior GP(0, k(x, x′)) is assigned to the random functions f = {f_g}_{g=1}^G.

The one-dimensional squared-exponential covariance function is assumed, which has the form

k(x, x′) = σ² exp(−(x − x′)² / (2l²))    (2)

with hyperparameters (σ, l), where l determines the smoothness of the covariance function.
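A minimal sketch of drawing a zero-mean GP sample on a grid under the squared-exponential covariance (2); the grid size, hyperparameter values, and jitter term are illustrative choices, not part of the slides.

```python
import numpy as np

def sq_exp_kernel(x, xp, sigma=1.0, length=1.0):
    """Squared-exponential covariance: k(x, x') = sigma^2 exp(-(x-x')^2 / (2 l^2))."""
    return sigma**2 * np.exp(-0.5 * (x[:, None] - xp[None, :])**2 / length**2)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)              # grid points on [0, T) with T = 1
K = sq_exp_kernel(t, t, sigma=1.0, length=0.2)
# Draw one zero-mean GP sample; the small jitter keeps K numerically
# positive definite for the internal factorisation.
f = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
```

Smaller values of the length-scale l produce rougher sample paths, which is why (G, σ, l) are tuned jointly later via cross-validated likelihood.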
Model Specification: Mixture of Poisson Processes

To make inference more tractable, we further discretise the interval [0, T) into m equally spaced sub-intervals [T(k−1)/m, Tk/m), k = 1, …, m, and let {t_k}_{k=1}^m be the mid-points of the sub-intervals.

We assume that the random functions {f_g}_{g=1}^G, and hence the intensity functions {λ_g}_{g=1}^G, are piecewise constant on each sub-interval.
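Under this discretisation, the logistic density transform (1) reduces to a normalisation over the grid. A small sketch, with a sine function standing in for a GP draw of f_g (the grid size and T are illustrative):

```python
import numpy as np

T, m = 1.0, 20
width = T / m
t_mid = (np.arange(m) + 0.5) * width       # mid-points t_k of the sub-intervals

f = np.sin(2 * np.pi * t_mid)              # stand-in for a GP draw of f_g

# Piecewise-constant logistic density transform: the Riemann sum
# (exp(f) * width).sum() approximates the integral in (1), so the
# resulting lambda integrates to 1 over [0, T).
lam = np.exp(f) / (np.exp(f) * width).sum()
```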
Inference

Given the observed event times {x_i} of N count processes, our objective is to group them into G clusters while simultaneously estimating the intensity functions of the G Poisson processes.

Let θ denote the set of parameters. The penalised likelihood function is given by

L(θ; x) = ∏_{i=1}^N [ Σ_{g=1}^G τ_g exp(−α_i) ∏_{j=1}^{n_i} α_i λ_g(x_{i,j}) ] × ∏_{g=1}^G p(f_g)    (3)

where p(·) is the density associated with the GP(0, k(x, x′)) prior.

In a non-Bayesian setting, the Gaussian process priors imposed on the random functions f_g can be understood as penalty terms.
EM Algorithm

We let z = {z_i}_{i=1}^N denote the latent assignments of intensity functions to count processes, where z_i = (z_{i,1}, …, z_{i,G})^T and

z_{i,g} = 1 if count process i has intensity function α_i λ_g, and z_{i,g} = 0 otherwise.

We can write the complete data likelihood function as

L(θ; x, z) = ∏_{i=1}^N ∏_{g=1}^G [ τ_g exp(−α_i) ∏_{j=1}^{n_i} (α_i λ_g(x_{i,j})) ]^{z_{i,g}} × ∏_{g=1}^G p(f_g)    (4)
EM Algorithm

For the E-step, we need to compute the probabilities of cluster assignment for each count process,

π_i^g ≡ Pr(Z_{i,g} = 1 | x_i, θ^{(t)}) = [ τ_g^{(t)} exp(−α_i^{(t)}) ∏_{j=1}^{n_i} (α_i^{(t)} λ_g^{(t)}(x_{i,j})) ] / [ Σ_{h=1}^G τ_h^{(t)} exp(−α_i^{(t)}) ∏_{j=1}^{n_i} (α_i^{(t)} λ_h^{(t)}(x_{i,j})) ]

for i = 1, …, N and g = 1, …, G.
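The E-step ratio above is best computed in log-space to avoid underflow when n_i is large. A sketch under the piecewise-constant discretisation, with hypothetical toy inputs (the function name and data layout are assumptions, not the authors' code):

```python
import numpy as np

def e_step(event_bins, tau, log_lam, alpha):
    """Responsibilities pi[i, g] for N count processes and G components.

    event_bins[i]: grid-bin indices of the events of process i;
    log_lam[g, k]: log intensity of component g on bin k;
    alpha[i]:      scaling factor of process i.
    """
    N, G = len(event_bins), len(tau)
    log_pi = np.zeros((N, G))
    for i, bins in enumerate(event_bins):
        for g in range(G):
            log_pi[i, g] = (np.log(tau[g]) - alpha[i]
                            + len(bins) * np.log(alpha[i])
                            + log_lam[g, bins].sum())
        log_pi[i] -= log_pi[i].max()            # stabilise before exponentiating
    pi = np.exp(log_pi)
    return pi / pi.sum(axis=1, keepdims=True)

# Toy check: two components on a 4-bin grid of width 0.25 (each row of
# intensities integrates to 1), one process leaning toward each component.
log_lam = np.log(np.array([[0.4, 0.3, 0.2, 0.1],
                           [0.1, 0.2, 0.3, 0.4]]) * 4)
event_bins = [np.array([0, 0, 1]), np.array([3, 3, 2])]
pi = e_step(event_bins, tau=np.array([0.5, 0.5]),
            log_lam=log_lam, alpha=np.array([3.0, 3.0]))
```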
EM Algorithm

For the M-step, we update the model parameters {τ_g}_{g=1}^G, {α_i}_{i=1}^N and {f_g}_{g=1}^G by maximising the expected complete data likelihood function.

The updates for {α_i}_{i=1}^N and {τ_g}_{g=1}^G are straightforward:

α̂_i = n_i,    τ̂_g = (1/N) Σ_{i=1}^N π_i^g

No closed-form update is available for {f_g}_{g=1}^G; we employ Newton's method to obtain the estimates.
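The two closed-form M-step updates are one-liners on the responsibility matrix; the values below are illustrative placeholders (α̂_i = n_i because each λ_g integrates to 1, so α_i is the expected total event count of process i):

```python
import numpy as np

# Responsibilities pi[i, g] from the E-step and event counts n[i]
pi = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.6, 0.4]])
n = np.array([12, 7, 9])

alpha_hat = n.astype(float)     # alpha_i = n_i
tau_hat = pi.mean(axis=0)       # tau_g = (1/N) sum_i pi[i, g]
```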
Standard Error Estimates for Intensity Functions

It is often desirable to obtain standard errors and construct confidence intervals for the estimated intensity functions {λ̂_g}_{g=1}^G.

For each observation j and each intensity function λ_g, we apply the jackknife method to obtain the estimate λ̂_g^{(j)} with the j-th count process observation removed from the data.
Standard Error Estimates for Intensity Functions

For g = 1, …, G and t ∈ [0, T), we let

λ̄_g(t) = (1/N) Σ_{j=1}^N λ̂_g^{(j)}(t)

The estimated variance of intensity function g at time t is given by

Var_jack(λ̂_g(t)) = ((N − 1)/N) Σ_{j=1}^N ( λ̂_g^{(j)}(t) − λ̄_g(t) )²

The confidence interval for the estimated intensity function λ̂_g is given by

( λ̂_g(t) − c √Var_jack(λ̂_g(t)),  λ̂_g(t) + c √Var_jack(λ̂_g(t)) )

for some c > 0.
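The jackknife variance formula translates directly to a few array operations; the leave-one-out estimates below are simulated placeholders standing in for N refits of the EM algorithm.

```python
import numpy as np

def jackknife_se(lam_loo):
    """Jackknife standard error of an estimated intensity on a grid.

    lam_loo[j, k]: leave-one-out estimate of lambda_g at grid point t_k,
    one row per removed count process (N rows in total).
    """
    N = lam_loo.shape[0]
    lam_bar = lam_loo.mean(axis=0)
    var = (N - 1) / N * ((lam_loo - lam_bar) ** 2).sum(axis=0)
    return np.sqrt(var)

rng = np.random.default_rng(3)
lam_loo = 1.0 + 0.05 * rng.standard_normal((50, 10))  # 50 LOO fits, 10 grid points
se = jackknife_se(lam_loo)

lam_hat = lam_loo.mean(axis=0)
ci_lower = lam_hat - 1.96 * se      # c = 1.96 for a nominal 95% interval
ci_upper = lam_hat + 1.96 * se
```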
Model Selection: Cross Validated Likelihood

We want to determine the number of mixture components G and the hyperparameters (σ, l).

The cross validated likelihood approach (Smyth, 2000) is used to choose the number of components and the hyperparameters in the mixture model.

For a fixed model, the method works by repeatedly partitioning the data set into two sets: one is used to train the model and obtain parameter estimates by maximising the log-likelihood, and the other is used to test the model by evaluating its log-likelihood.
Application: Washington Bike Sharing Scheme Data Set

Information including the start date and time of a trip, the end date and time of a trip, and the start and end stations is publicly available (historical data available at https://www.capitalbikeshare.com/).

For each station in the bike sharing scheme, we consider the end time of a trip as an event. We fit the model to the data over the period 18 March 2016 through 24 March 2016, with 362 active stations and a total of 62234 events.
Application: Washington Bike Sharing Scheme Data Set

[Figure]
Application: Washington Bike Sharing Scheme Data Set

[Figure]
References

Abraham, C., Cornillon, P. A., Matzner-Løber, E., and Molinari, N. (2003), Unsupervised curve clustering using B-splines, Scand. J. Statist., 30, 581–595.

Eagle, N. and Pentland, A. (2006), Reality mining: sensing complex social systems, Personal and Ubiquitous Computing, 10, 255–268.

Fraley, C. and Raftery, A. E. (2002), Model-based clustering, discriminant analysis, and density estimation, J. Amer. Statist. Assoc., 97, 611–631.

Giacofci, M., Lambert-Lacroix, S., Marot, G., and Picard, F. (2013), Wavelet-based clustering for mixed-effects functional models in high dimension, Biometrics, 69, 31–40.

Raftery, A. E. and Dean, N. (2006), Variable selection for model-based clustering, J. Amer. Statist. Assoc., 101, 168–178.

Rasmussen, C. E. and Williams, C. K. I. (2006), Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA.

Smyth, P. (2000), Model selection for probabilistic clustering using cross-validated likelihood, Statistics and Computing, 10, 63–72.
Application: Reality Mining Data Set

The Reality Mining Dataset was collected in 2004 (Eagle and Pentland, 2006). The goal of this experiment was to explore the capabilities of smartphones to enable social scientists to investigate human interactions.

75 students or faculty at the MIT Media Laboratory and 25 students at the MIT Sloan business school were the subjects of the experiment, where the times of phone calls and SMS messages were recorded.

For each subject in the study, we treat the time when an outgoing call was made as an event time. We focus on the core of 86 people who made outgoing calls during the period between 24 September 2004 and 8 January 2005 (data obtained from http://sociograph.blogspot.ie/2011/04/communication-networks-part-2-mit.html).
Application: Reality Mining Data Set

[Figure]
Application: Reality Mining Data Set

Figure: Jackknife Estimates of Standard Errors for Reality Mining Data Set