A Bivariate Point Process Model with Application to Social Media User Content Generation


Emma Jingfei Zhang (ezhang@bus.miami.edu) and Yongtao Guan (yguan@bus.miami.edu)
Department of Management Science, The Miami Business School, University of Miami

Data Description: Sina Weibo Data

Source: Sina Weibo, the largest Twitter-type online social media platform in China.
The dataset contains posts from 5,913 followers of the official Beijing University Guanghua MBA Weibo account.
For each user, all posts published from Jan 1 to Jan 30, 2014 were collected, including the time stamp of each post.
Each post is either a post with original content or a repost.

Data Description: Trump's Twitter Data

Source: Twitter data collected from Donald Trump (@realdonaldtrump) from Jan 2013 to Apr [...].
The Twitter archive of Donald Trump can be downloaded from [...].
Twitter shows the device used for each tweet; devices may be Android, Web Client, iPhone, and others.
We consider the tweets posted from an Android device before the election and from an iPhone after the election.
This results in a total of 17,518 tweets; the average number of monthly tweets is 278.
Each tweet is either an original tweet or a retweet.

Data Description: Sina Weibo Data

Figure: The posting times of three users (Users 1, 2 and 3). Horizontal axis: date, from 01/01 to 01/30.

Data Description: Sina Weibo Data

Figure: Average empirical pair correlation function. Horizontal axis: hour.

Observations from Data

A user's posting activity may alternate between active and inactive states.
During an active state, the user may publish one or more posts, often with short inter-post times.
During an inactive state, no post is produced until the start of the next active state.
There may be daily patterns in posting times.
The data form a bivariate point process (i.e., posts and reposts).

Graphical Illustration: Univariate Process

Episodes: clusters of posting time locations.
Adjacent episodes are nonoverlapping and separated by the inactive period in between.

Graphical Illustration: Bivariate Process

Figure: Two episodes separated by an inactive period; the illustrated episode consists of a post segment, a repost segment and a post segment.

Each episode contains subepisodes of posts and reposts.
Posts (reposts) tend to be followed by posts (reposts).
Reposts may be more clustered than posts.
The number of reposts may be related to the number of followees.

Clustered Point Process

Goal: Model the clustered posting times in social media data (without distinguishing between posts and reposts for now).

Existing methods:
Hawkes process
Neyman-Scott process
Bartlett-Lewis process
Interrupted Poisson process

We propose a new class of clustered temporal point processes that is easy to interpret and can be easily generalized to the bivariate case.

Model Formulation

For each episode, the parent event generates a Poisson number of offspring events with mean $\mu$.
The location of each offspring, relative to the location of the previous event in the same cluster, follows an exponential distribution with rate $\rho$.
Once all the events in an episode have been observed, the parent event of the following episode is generated according to a hazard function $\lambda(t; \beta)$.

Model Formulation

Motivated by the daily cyclic pattern observed in the average pair correlation function, we may assume that

\[ \lambda(t; \beta) = \exp\Big\{ \beta_0 + \sum_{j=1}^{p} \big[ \beta_{j1} \cos(\omega_j t) + \beta_{j2} \sin(\omega_j t) \big] \Big\}, \]

where $\omega_j = 2j\pi$ and $\beta = \{\beta_0, \beta_{j1}, \beta_{j2} : j = 1, \ldots, p\}$.
Other nonparametric models can also be used.
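The cyclic hazard above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `parent_hazard` and the argument layout are hypothetical, and time is assumed to be measured in days so that $\omega_j = 2j\pi$ gives a period of one day.

```python
import math

def parent_hazard(t, beta0, harmonics):
    """Evaluate lambda(t; beta) = exp(beta0 + sum_j [b_j1 cos(w_j t) + b_j2 sin(w_j t)]).

    harmonics is a list of (beta_j1, beta_j2) pairs for j = 1..p (hypothetical layout).
    Assumption: t is in days, so w_j = 2*j*pi yields a daily cycle.
    """
    log_hazard = beta0
    for j, (b1, b2) in enumerate(harmonics, start=1):
        w = 2.0 * j * math.pi
        log_hazard += b1 * math.cos(w * t) + b2 * math.sin(w * t)
    return math.exp(log_hazard)

# One harmonic with (beta_0, beta_11, beta_12) = (-2, -2, 2)
value_at_midnight = parent_hazard(0.0, -2.0, [(-2.0, 2.0)])
```

With these coefficients the hazard at $t = 0$ is $\exp(-4) \approx 0.0183$, and the function repeats with period 1.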

Model Formulation

Define event time locations $\{T_l : l = 1, \ldots, N\}$ and indicator variables $\{Y_l : l = 1, \ldots, N\}$, where $Y_l = 1$ denotes a parent event and $Y_l = 0$ an offspring event.
Let $T_0 = 0$. Define the gap time $D_l = T_l - T_{l-1}$, $l = 1, \ldots, N$.
Let $f_{l0}(x)$ and $f_{l1}(x)$ be the probability density functions of $D_l$ given that $Y_l = 0$ and $Y_l = 1$, respectively.
Assume

\[ f_{l0}(x) = \rho \exp(-\rho x), \qquad f_{l1}(x) = \lambda(t_{l-1} + x; \beta) \exp\Big[ -\int_{t_{l-1}}^{t_{l-1} + x} \lambda(t; \beta)\, dt \Big]. \]
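The two gap-time densities suggest a direct way to simulate the univariate process: offspring gaps are exponential with rate $\rho$, and the gap to the next parent is the waiting time to the first point of an inhomogeneous Poisson process with intensity $\lambda(t; \beta)$. The sketch below uses thinning for the parent gap, which is a standard technique chosen here for illustration and is not stated in the slides. All function names are hypothetical, and events that fall beyond $T$ are simply dropped, which is a simplification.

```python
import math
import random

def parent_hazard(t, beta0, harmonics):
    """Cyclic hazard exp(beta0 + sum_j [b_j1 cos(2 j pi t) + b_j2 sin(2 j pi t)])."""
    s = beta0
    for j, (b1, b2) in enumerate(harmonics, start=1):
        s += b1 * math.cos(2 * j * math.pi * t) + b2 * math.sin(2 * j * math.pi * t)
    return math.exp(s)

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_univariate(T, mu, rho, beta0, harmonics, seed=0):
    """Return (times, labels) on [0, T]; label 1 = parent, 0 = offspring."""
    rng = random.Random(seed)
    # Upper bound on the hazard: each harmonic contributes at most hypot(b1, b2).
    lam_max = math.exp(beta0 + sum(math.hypot(b1, b2) for b1, b2 in harmonics))
    times, labels = [], []
    t = 0.0  # T_0 = 0
    while True:
        # Next parent by thinning: propose at rate lam_max, accept w.p. hazard / lam_max.
        while True:
            t += rng.expovariate(lam_max)
            if t > T or rng.random() * lam_max <= parent_hazard(t, beta0, harmonics):
                break
        if t > T:
            break
        times.append(t)
        labels.append(1)
        # Poisson(mu) offspring, each an Exp(rho) gap after the previous event.
        for _ in range(poisson(rng, mu)):
            t += rng.expovariate(rho)
            times.append(t)
            labels.append(0)
    # Simplification: drop offspring that spill past the observation window.
    kept = [(x, y) for x, y in zip(times, labels) if x <= T]
    return [x for x, _ in kept], [y for _, y in kept]

times, labels = simulate_univariate(T=100.0, mu=0.5, rho=10.0, beta0=-2.0, harmonics=[(-2.0, 2.0)])
```

The parameter values in the last line are illustrative only; they mirror the scale of the configurations listed in the simulation study.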

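Model Formulation

Assume the first event is a parent event and all events in the last episode are contained in $[0, T]$.
The complete-data likelihood can then be written as

\[ L(\theta; t, y) = \prod_{l=1}^{n} \prod_{m=0}^{1} \big[ f_{lm}(d_l; \theta) \big]^{I(y_l = m)} \, \prod_{i=1}^{k} P(N_i = n_i) \; P(D_{n+1} > T - t_n), \]

where $D_{n+1}$ is the gap time between $t_n$ and the next parent event,

\[ P(N_i = n_i) = \frac{\exp(-\mu)\, \mu^{n_i}}{n_i!}, \qquad P(D_{n+1} > T - t_n) = \exp\Big[ -\int_{t_n}^{T} \lambda(t; \beta)\, dt \Big]. \]

Here the product over $i$ runs over the $k$ episodes, with $n_i$ the number of offspring events in episode $i$.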
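The complete-data likelihood can be evaluated term by term on the log scale. The sketch below is one possible implementation under stated assumptions: the hazard integrals are approximated with a composite Simpson rule (a choice made here, not taken from the slides), the interpretation of $k$ and $n_i$ as episode count and offspring counts is assumed, and all function names are hypothetical.

```python
import math

def hazard(t, beta0, harmonics):
    """Cyclic hazard exp(beta0 + sum_j [b_j1 cos(2 j pi t) + b_j2 sin(2 j pi t)])."""
    s = beta0
    for j, (b1, b2) in enumerate(harmonics, start=1):
        s += b1 * math.cos(2 * j * math.pi * t) + b2 * math.sin(2 * j * math.pi * t)
    return math.exp(s)

def integrate(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def complete_loglik(times, labels, T, mu, rho, beta0, harmonics):
    """log L(theta; t, y) with labels 1 = parent, 0 = offspring; first event must be a parent."""
    if not times or labels[0] != 1:
        raise ValueError("the first event is assumed to be a parent event")
    lam = lambda t: hazard(t, beta0, harmonics)
    ll, prev, offspring_counts = 0.0, 0.0, []
    for t, y in zip(times, labels):
        if y == 1:
            # log f_l1(d_l): log hazard at t minus the integrated hazard over the gap
            ll += math.log(lam(t)) - integrate(lam, prev, t)
            offspring_counts.append(0)
        else:
            # log f_l0(d_l): exponential gap with rate rho
            ll += math.log(rho) - rho * (t - prev)
            offspring_counts[-1] += 1
        prev = t
    for n_i in offspring_counts:
        # log Poisson(mu) probability of n_i offspring in episode i
        ll += -mu + n_i * math.log(mu) - math.lgamma(n_i + 1)
    # log P(D_{n+1} > T - t_n): no new parent between the last event and T
    return ll - integrate(lam, prev, T)

# Constant hazard (beta0 = 0, no harmonics), so every integral is just an interval length.
example = complete_loglik([1.0, 1.5, 3.0], [1, 0, 1], T=4.0, mu=1.0, rho=2.0, beta0=0.0, harmonics=[])
```

For this toy input the terms add up to $\log 2 - 6.5$: the two parent gaps contribute $-1$ and $-1.5$, the offspring gap contributes $\log 2 - 1$, the two Poisson terms contribute $-1$ each, and the tail term contributes $-1$.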

Composite Likelihood Estimation

The observed-data likelihood is $\sum_{y} L(\theta; t, y)$, where the summation is over all $2^n$ possible values of $y$.

Divide $W = [0, T]$ into $J$ non-overlapping unit windows of length $s$, i.e., $W = \bigcup_{j=1}^{J} W_j$ where $W_j = [(j-1)s, js)$.
As before, we assume that
the first event in $W_j$ is a parent event, and
all events in the last episode of $W_j$ are contained in $W_j$.

Define $t_j = \{t_i : t_i \in W_j\}$ and $y_j = \{y_i : t_i \in W_j\}$. Then the observed-data likelihood on $W_j$ is $\sum_{y_j} L(\theta; t_j, y_j)$.
We estimate $\theta$ by maximizing the composite likelihood

\[ L(\theta; t) = \prod_{j=1}^{J} \sum_{y_j} L(\theta; t_j, y_j). \]

Composite Likelihood Estimation

Each summation in the composite likelihood is over $2^{n_j}$ terms, where $n_j$ is the number of events in $W_j$.
Note that $\sum_{j=1}^{J} 2^{n_j} \ll 2^n$, so significant computational gains can be achieved.

There is a potential bias problem, since
the first event in $W_j$ may not be a parent event, and
not all events in the last episode of $W_j$ may be contained in $W_j$.
The bias problem can be mitigated if we choose the blocks wisely.

Convergence can be a problem, since multiple parameters need to be estimated simultaneously and the likelihood surface is often quite flat.
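The size of the computational gain is easy to check numerically. The sketch below splits a set of event times into windows $W_j = [(j-1)s, js)$ and compares the number of label configurations in the full likelihood with the number summed over in the composite likelihood. The helper names and the toy event times are hypothetical.

```python
import math

def split_into_windows(times, T, s):
    """Assign each event time to its window W_j = [(j-1)s, js); returns a list of J lists."""
    J = math.ceil(T / s)
    windows = [[] for _ in range(J)]
    for t in times:
        j = min(int(t // s), J - 1)  # an event exactly at T goes to the last window
        windows[j].append(t)
    return windows

def labeling_counts(windows):
    """Return (2^n, sum_j 2^{n_j}) for the given windows."""
    n = sum(len(w) for w in windows)
    return 2 ** n, sum(2 ** len(w) for w in windows)

toy_times = [0.2, 0.4, 1.1, 2.5, 2.6, 2.9]
windows = split_into_windows(toy_times, T=3.0, s=1.0)
full, composite = labeling_counts(windows)
```

For these six toy events the full likelihood sums over $2^6 = 64$ configurations, while the composite likelihood sums over $4 + 2 + 8 = 14$. With 40 events spread evenly over 4 windows the comparison is $2^{40}$ against $4 \times 2^{10} = 4096$.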

A Composite Likelihood EM Algorithm

Let $T_j$ and $Y_j$ be the random versions of $t_j$ and $y_j$.
In the E-step, we take the expectation of the log-likelihood $l(\theta; t_j, Y_j)$ with respect to the conditional distribution of $Y_j \mid T_j = t_j, \hat\theta_{prev}$, i.e.,

\[ Q_j(\theta \mid \hat\theta_{prev}) = E_{Y_j \mid T_j = t_j, \hat\theta_{prev}} \, l(\theta; t_j, Y_j). \]

Define

\[ Q(\theta \mid \hat\theta_{prev}) = \sum_{j=1}^{J} Q_j(\theta \mid \hat\theta_{prev}). \]

In the M-step, $Q(\theta \mid \hat\theta_{prev})$ is maximized with respect to $\theta$.

A Composite Likelihood EM Algorithm

For the expectation, we need to calculate, for $t_l \in W_j$, the probability $P_\theta(Y_l = m \mid T_j = t_j)$, which is

\[ P_\theta(Y_l = m \mid T_j = t_j) = \frac{\sum_{y_j : y_l = m} L(\theta; t_j, y_j)}{\sum_{y_j} L(\theta; t_j, y_j)}. \]

If there are a large number of events in $W_j$, we employ a standard Metropolis-Hastings algorithm to sample from the conditional distribution $Y_j \mid T_j = t_j, \theta$ for the E-step.
Closed-form expressions can be obtained for $\hat\theta$ (except for $\hat\beta$) in the M-step.
Convergence is not an issue.
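For a window with only a few events, the conditional probabilities in the E-step can be computed by brute-force enumeration of the label configurations. The sketch below does this under two simplifying assumptions that are not part of the slides: the parent hazard is taken to be a constant `lam`, so all integrals are available in closed form, and configurations whose first label is not a parent are skipped because the window assumption gives them zero likelihood. Function names are hypothetical.

```python
import itertools
import math

def window_loglik(times, labels, start, end, mu, rho, lam):
    """Complete-data log-likelihood on one window [start, end) with constant hazard lam."""
    ll, prev, offspring_counts = 0.0, start, []
    for t, y in zip(times, labels):
        d = t - prev
        if y == 1:
            ll += math.log(lam) - lam * d      # parent gap density with constant hazard
            offspring_counts.append(0)
        else:
            ll += math.log(rho) - rho * d      # offspring gap density
            offspring_counts[-1] += 1
        prev = t
    for n_i in offspring_counts:
        ll += -mu + n_i * math.log(mu) - math.lgamma(n_i + 1)
    return ll - lam * (end - prev)             # no further parent before the window ends

def parent_posterior(times, start, end, mu, rho, lam):
    """Return P(Y_l = 1 | T_j = t_j) for each event in the window, by enumeration."""
    n = len(times)
    if n == 0:
        return []
    weighted = []
    # The first event is assumed to be a parent, so only 2^(n-1) configurations remain.
    for rest in itertools.product((0, 1), repeat=n - 1):
        y = (1,) + rest
        weighted.append((y, math.exp(window_loglik(times, y, start, end, mu, rho, lam))))
    total = sum(w for _, w in weighted)
    return [sum(w for y, w in weighted if y[l] == 1) / total for l in range(n)]

posterior = parent_posterior([0.5, 0.6], start=0.0, end=1.0, mu=0.5, rho=10.0, lam=1.0)
```

Enumeration grows as $2^{n_j}$, which is why a sampler such as Metropolis-Hastings becomes preferable for windows with many events.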

A Composite Likelihood EM Algorithm

Theorem. The log-composite likelihood $l(\theta; t) = \log L(\theta; t)$ satisfies

\[ l(\theta_p; t) \ge l(\theta_{p-1}; t), \qquad p = 1, 2, \ldots, \]

where $\theta_p$ is the $p$th update from the EM algorithm.

The theorem guarantees that the log-composite likelihood is nondecreasing at each EM iteration.
The convergence of $\hat\theta_p$ to a stationary point as $p \to \infty$ is guaranteed by Theorem 2 in Wu (1983).
Standard techniques, such as running the EM algorithm from multiple starting points, can help locate the global maximum.
Consistency and asymptotic normality can be established for the global maximum (assuming the model is correctly specified).

Extension to Bivariate Case

For each episode, there are a Poisson number of subepisodes with mean $\gamma$.
Post and repost subepisodes alternate. The first subepisode is a post subepisode with probability $\alpha$.
There are a Poisson number of offspring in each post (repost) subepisode with mean $\mu_1$ ($\mu_0$).
For each offspring in a post (repost) subepisode, its location relative to that of the previous event in the same episode follows an exponential distribution with rate $\rho_1$ ($\rho_0$).
Once all the events in an episode have been observed, the parent event of the following episode is generated according to a hazard function $\lambda(t; \beta)$.
The composite likelihood EM algorithm can be modified to fit the model.
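The bivariate mechanism can be sketched as a simulation. The slides leave some details open, so the code below makes explicit assumptions that should be read as illustrative only: the parent event belongs to the first subepisode and has its type, each episode has one subepisode plus a Poisson($\gamma$) number of further ones, and events falling beyond $T$ are dropped. Parent times are drawn by thinning against a user-supplied bound `lam_max`. All names are hypothetical.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_bivariate(T, alpha, gamma, mu, rho, hazard, lam_max, seed=0):
    """Return a list of (time, kind, is_parent); kind 1 = post, 0 = repost.

    mu and rho are dicts keyed by kind, e.g. mu[1] = mu_1 and mu[0] = mu_0.
    hazard is a callable t -> lambda(t); lam_max must bound it from above.
    """
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # Next parent by thinning of a rate-lam_max homogeneous Poisson process.
        while True:
            t += rng.expovariate(lam_max)
            if t > T or rng.random() * lam_max <= hazard(t):
                break
        if t > T:
            break
        kind = 1 if rng.random() < alpha else 0   # first subepisode is a post w.p. alpha
        events.append((t, kind, 1))
        n_sub = 1 + poisson(rng, gamma)           # ASSUMPTION: at least one subepisode
        for _ in range(n_sub):
            for _ in range(poisson(rng, mu[kind])):
                t += rng.expovariate(rho[kind])   # gap from previous event in the episode
                events.append((t, kind, 0))
            kind = 1 - kind                       # post and repost subepisodes alternate
    return [e for e in events if e[0] <= T]       # simplification: truncate at T

def cyclic_hazard(t):
    """exp(beta_01 + beta_11 cos(2 pi t) + beta_12 sin(2 pi t)) with (-2, -2, 2)."""
    return math.exp(-2.0 - 2.0 * math.cos(2 * math.pi * t) + 2.0 * math.sin(2 * math.pi * t))

events = simulate_bivariate(
    T=100.0, alpha=0.6, gamma=0.5,
    mu={1: 0.5, 0: 0.5}, rho={1: 10.0, 0: 15.0},
    hazard=cyclic_hazard, lam_max=math.exp(-2.0 + math.sqrt(8.0)),
)
```

The bound `lam_max` uses the fact that $-2\cos x + 2\sin x$ never exceeds $\sqrt{8}$.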

Application to Trump's Twitter Data

Figure: Parameters estimated from Donald Trump's monthly Twitter data. Panels: $\alpha$, $\gamma$, $\mu_1$, $\mu_0$, $\rho_1$, $\rho_0$, number of tweets per episode, and episode length (in hours). The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.

Figure: Estimated parent event hazard functions from Donald Trump's monthly Twitter data. The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.

Figure: Goodness of fit plots of the model fitted for Jan [...]. From left to right: the envelope plot (first plot), with the upper and lower envelopes marked by red dashed lines, followed by goodness of fit plots for the original offspring post (second plot), offspring repost (third plot) and parent (last plot) inter-event distances. Red solid lines are calculated from the cdf of exponential distributions. The grey bands are the 95% confidence intervals.

Application to Sina Weibo Data

Figure: The posting times of three users (Users 1, 2 and 3). Horizontal axis: date, from 01/01 to 01/30.

Table: Estimated $\alpha$, $\gamma$, $\mu_1$, $\mu_0$, $\rho_1$, $\rho_0$ of Users 1, 2 and 3.

        $\alpha$   $\gamma$   $\mu_1$    $\mu_0$    $\rho_1$    $\rho_0$
User    (0.008)    (0.004)    (0.010)    (0.014)    (7.166)     (6.124)
User    (0.009)    (0.006)    (0.010)    (0.010)    (13.013)    (21.749)
User    (0.006)    (0.008)    (0.013)    (0.012)    (5.882)     (7.477)

Application to Sina Weibo Data

Figure: Parent hazard functions of Users 1, 2 and 3. Vertical axis: intensity; horizontal axis: time of day, from 12 am to 12 am.

Application to Sina Weibo Data

Figure: Plots of the mean and first three eigenfunctions of the estimated daily parent hazard functions. Panels: mean function, first eigenfunction, second eigenfunction, third eigenfunction; each horizontal axis runs from 12am to 12am.

Characterize Sina Weibo User Behavior

Figure: Groups in the average daily parent hazard (left plot), average number of posts per episode (middle plot) and average length (in hours) of an episode (right plot). The percentages at the bottom of the boxplots show the percentage of users in each group; in order of appearance they are [...]%, 26.05%, 66.6%, 4.2%, 20.4%, 75.4%, 3.2%, 15.6%, 81.2%.

Social Effect on Users of Sina Weibo

For each Sina Weibo user, we were also able to collect the number of accounts the user was following and the number of accounts that were following the user.
We find that there is a stronger correlation between the number of accounts a user follows and $\mu_0$ ($r = 0.205$).
These observations indicate that users who follow more accounts are more likely to have more reposts.
One explanation could be that the more accounts a user follows, the more content they can repost from.
Another plausible explanation is that users who act mainly as followers on social media tend to repost more.

Social Effect on Users of Sina Weibo

We find that popular users, i.e., those whose accounts have many followers, tend to post more original content. They are also more likely to initiate their Weibo engagement by posting original content.
We find that users who have strong social ties, i.e., who have many followers or follow many others, are more likely to use Weibo more often.
We find that users with many followers are more likely to spend more time on Weibo once they start an episode of engagement.

Simulation Study

We set the observation window length $T = 100$ and $\alpha = 0.6$.
For each parameter configuration, we simulate 100 event trajectories.
We set the parent event hazard function as

\[ \lambda(t; \beta) = \exp\big[ \beta_{01} + \beta_{11} \cos(2\pi t) + \beta_{12} \sin(2\pi t) \big]. \]

For estimation, we use unit window length $s = 1$ or $s = 5$.
To model $\lambda(t; \beta)$, we consider both the true model and the nonparametric cyclic B-spline model. For the latter, we use the knot vector $(0, 0.2, 0.4, 0.6, 0.8, 1)$.


Simulation Study

($\gamma$, $\mu_1$, $\mu_0$, $\rho_1$, $\rho_0$)   ($\beta_{01}$, $\beta_{11}$, $\beta_{12}$; $s$)   $\alpha$   $\gamma$   $\mu_1$    $\mu_0$    $\rho_1$   $\rho_0$
(0.5,0.5,0.5,10,15)    (-2,-2,2; 5)    (0.010)    (0.013)    (0.014)    (0.014)    (0.261)    (0.365)
(0.5,0.5,0.5,10,15)    (-3,-3,3; 5)    (0.007)    (0.011)    (0.012)    (0.014)    (0.188)    (0.284)
(1.0,0.5,0.5,10,15)    (-2,-2,2; 5)    (0.009)    (0.017)    (0.011)    (0.012)    (0.176)    (0.257)
(0.5,1.0,1.0,10,15)    (-2,-2,2; 5)    (0.008)    (0.010)    (0.016)    (0.017)    (0.171)    (0.309)
(0.5,0.5,0.5,20,30)    (-2,-2,2; 5)    (0.008)    (0.012)    (0.012)    (0.013)    (0.460)    (0.717)
(0.5,0.5,0.5,10,15)    (-2,-2,2; 1)    (0.008)    (0.010)    (0.014)    (0.014)    (0.271)    (0.309)

Summary

We propose a new clustered temporal point process model for user-generated posts on social media.
The proposed model captures both the inhomogeneity in the initial posting time and the clustering pattern in the posts that follow the initial post.
The proposed goodness of fit procedure shows that the model fits the data reasonably well.
The fitted models provide valuable insights into a user's content-generating behavior.
