A Bivariate Point Process Model with Application to Social Media User Content Generation

Emma Jingfei Zhang (ezhang@bus.miami.edu)
Yongtao Guan (yguan@bus.miami.edu)
Department of Management Science, The Miami Business School, University of Miami

Data Description: Sina Weibo Data
Source: Sina Weibo, the largest Twitter-type online social media platform in China.
The dataset contains posts from 5,913 followers of the official Beijing University Guanghua MBA Weibo account.
For each user, all posts from Jan 1st to Jan 30th, 2014 were collected, including the time stamp of each post.
Each post is either a post with original content or a repost.

Data Description: Trump's Twitter Data
Source: tweets posted by Donald Trump (@realdonaldtrump) from Jan 2013 to Apr 2018. The Twitter archive of Donald Trump can be downloaded from http://www.trumptwitterarchive.com/.
Twitter shows the device used for each tweet; devices include Android, Web Client, iPhone, and others.
We consider the tweets posted from an Android device before the election and from an iPhone after the election.
This results in a total of 17,518 tweets; the average number of monthly tweets is 278.
Each tweet is either an original tweet or a retweet.

Data Description: Sina Weibo Data
Figure: The posting times of three users (Users 1, 2 and 3; horizontal axis: date, 01/01 to 01/30).

Data Description: Sina Weibo Data
Figure: Average empirical pair correlation function (horizontal axis: hour, 0 to 70).

Observations from Data
A user's posting activity may alternate between active and inactive states.
During an active state, the user may publish one or more posts (often with short inter-post times).
During an inactive state, no post is produced until the start of the next active state.
There may be daily patterns in posting times.
The data form a bivariate point process (i.e., posts and reposts).

Graphical Illustration: Univariate Process
Episodes: clusters of posting times.
Adjacent episodes are nonoverlapping and separated by the inactive period in between.

Graphical Illustration: Bivariate Process
[Diagram labels: post segment, repost segment, post segment; episode, inactive, episode.]
Each episode contains subepisodes of posts and reposts.
Posts (reposts) tend to be followed by posts (reposts).
Reposts may be more clustered than posts.
The number of reposts may be related to the number of followees.

Clustered Point Process
Goal: model the clustered posting times in social media data (without distinguishing between posts and reposts for now).
Existing methods: the Hawkes process, the Neyman-Scott process, the Bartlett-Lewis process, and the interrupted Poisson process.
We propose a new class of clustered temporal point processes that is easy to interpret and can be easily generalized to the bivariate case.

Model Formulation
For each episode, the parent event generates a Poisson number of offspring events with mean µ.
The location of each offspring, relative to the location of the previous event in the same cluster, follows an exponential distribution with rate ρ.
Once all the events in an episode have been observed, the parent event of the following episode is generated according to a hazard function λ(t; β).
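The generative steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the thinning step used to draw the next parent event, and the upper bound `hazard_max` on the hazard are all introduced here, and events that fall beyond T are simply dropped.

```python
import random

def next_parent(start, hazard, hazard_max, rng):
    """First event after `start` under hazard(t), drawn by thinning.

    Assumes hazard(t) <= hazard_max for all t (an assumption of this sketch).
    """
    t = start
    while True:
        t += rng.expovariate(hazard_max)          # candidate from the dominating process
        if rng.random() * hazard_max <= hazard(t):  # accept with prob hazard(t) / hazard_max
            return t

def poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals in [0, mean]."""
    n, s = 0, rng.expovariate(1.0)
    while s <= mean:
        n += 1
        s += rng.expovariate(1.0)
    return n

def simulate(T, mu, rho, hazard, hazard_max, seed=0):
    """Simulate event times and labels (1 = parent, 0 = offspring) on [0, T]."""
    rng = random.Random(seed)
    times, labels = [], []
    t = 0.0
    while True:
        # The next parent is generated from the last event of the previous episode.
        t = next_parent(t, hazard, hazard_max, rng)
        if t > T:
            break
        times.append(t)
        labels.append(1)
        # Poisson(mu) offspring, each an Exp(rho) gap after the previous event.
        for _ in range(poisson(mu, rng)):
            t += rng.expovariate(rho)
            if t <= T:  # events beyond the window are dropped in this sketch
                times.append(t)
                labels.append(0)
    return times, labels
```

For example, `simulate(T=30.0, mu=0.5, rho=10.0, hazard=lambda t: 2.0, hazard_max=2.0)` returns one trajectory with a constant parent hazard.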

Model Formulation
Motivated by the daily cyclic pattern in the average pair correlation function, we may assume that
\[ \lambda(t; \beta) = \exp\Big\{ \beta_0 + \sum_{j=1}^{p} \big[ \beta_{j1} \cos(\omega_j t) + \beta_{j2} \sin(\omega_j t) \big] \Big\}, \]
where $\omega_j = 2j\pi$ and $\beta = \{\beta_0, \beta_{j1}, \beta_{j2} : j = 1, \ldots, p\}$.
Nonparametric models can also be used.
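As a small illustration, the harmonic hazard and its integral over an interval can be evaluated as below. The function names are hypothetical, time is assumed to be measured in days so that ω_j = 2jπ gives a one-day period, and the integral is approximated with a plain trapezoidal rule rather than any method stated in the slides.

```python
import math

def hazard(t, beta0, harmonics):
    """lambda(t; beta) = exp(beta0 + sum_j [b_j1 cos(2*pi*j*t) + b_j2 sin(2*pi*j*t)]).

    `harmonics` is a list of (b_j1, b_j2) pairs for j = 1, ..., p.
    """
    s = beta0
    for j, (b1, b2) in enumerate(harmonics, start=1):
        omega = 2.0 * j * math.pi
        s += b1 * math.cos(omega * t) + b2 * math.sin(omega * t)
    return math.exp(s)

def cumulative_hazard(a, b, beta0, harmonics, steps=2000):
    """Integral of the hazard over [a, b] by the composite trapezoidal rule."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    total = 0.5 * (hazard(a, beta0, harmonics) + hazard(b, beta0, harmonics))
    for i in range(1, steps):
        total += hazard(a + i * h, beta0, harmonics)
    return total * h
```

The integral is the quantity that appears in the exponent of the parent gap-time density and of the survival term at the end of the observation window.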

Model Formulation
Define event times $\{T_l : l = 1, \ldots, N\}$ and indicator variables $\{Y_l : l = 1, \ldots, N\}$, where $Y_l = 1$ denotes a parent event and $Y_l = 0$ an offspring event.
Let $T_0 = 0$. Define the gap time $D_l = T_l - T_{l-1}$, $l = 1, \ldots, N$.
Let $f_{l0}(x)$ and $f_{l1}(x)$ be the probability density functions of $D_l$ given that $Y_l = 0$ and $Y_l = 1$, respectively. Assume
\[ f_{l0}(x) = \rho \exp(-\rho x), \qquad f_{l1}(x) = \lambda(t_{l-1} + x; \beta) \exp\Big[ -\int_{t_{l-1}}^{t_{l-1}+x} \lambda(t; \beta)\, dt \Big]. \]

Model Formulation
Assume the first event is a parent event and all events in the last episode are contained in $[0, T]$. The complete-data likelihood can then be written as
\[ L(\theta; t, y) = \Big[ \prod_{l=1}^{n} \prod_{m=0}^{1} f_{lm}(d_l; \theta)^{I(y_l = m)} \Big] \Big[ \prod_{i=1}^{k} P(N_i = n_i) \Big] P(D_{n+1} > T - t_n), \]
where $D_{n+1}$ is the gap time between $t_n$ and the next parent event,
\[ P(N_i = n_i) = \frac{\exp(-\mu)\, \mu^{n_i}}{n_i!}, \qquad P(D_{n+1} > T - t_n) = \exp\Big[ -\int_{t_n}^{T} \lambda(t; \beta)\, dt \Big]. \]
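The three factors of the complete-data likelihood (gap-time densities, Poisson offspring counts, and the survival term up to T) can be evaluated on the log scale as sketched below. The function names are hypothetical. The sketch assumes that k is the number of parent events and n_i the number of offspring in episode i, and it integrates the hazard numerically with a trapezoidal rule.

```python
import math

def integrate(f, a, b, steps=1000):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps)))

def complete_loglik(times, labels, T, mu, rho, hazard):
    """log L(theta; t, y) for one trajectory on [0, T].

    labels[l] is 1 for a parent event and 0 for an offspring event;
    the first event must be a parent, as assumed in the model.
    """
    if not labels or labels[0] != 1:
        raise ValueError("the first event must be a parent event")
    loglik, prev, counts = 0.0, 0.0, []
    for t, y in zip(times, labels):
        if y == 1:
            # parent gap density: hazard(t) * exp(-integral of hazard over the gap)
            loglik += math.log(hazard(t)) - integrate(hazard, prev, t)
            counts.append(0)
        else:
            # offspring gap density: rho * exp(-rho * gap)
            loglik += math.log(rho) - rho * (t - prev)
            counts[-1] += 1
        prev = t
    for n_i in counts:
        # Poisson(mu) probability of n_i offspring in an episode
        loglik += -mu + n_i * math.log(mu) - math.lgamma(n_i + 1)
    # no new parent between the last event and T
    loglik -= integrate(hazard, prev, T)
    return loglik
```

With a constant hazard every term is available in closed form, which gives a convenient check of the implementation.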

Composite Likelihood Estimation
The observed-data likelihood is $\sum_{y} L(\theta; t, y)$, where the summation is over all $2^n$ possibilities of $y$!
Divide $W = [0, T]$ into $J$ non-overlapping unit windows of length $s$, i.e., $W = \bigcup_{j=1}^{J} W_j$ where $W_j = [(j-1)s, js)$.
As before, we assume that (i) the first event in $W_j$ is a parent event and (ii) all events in the last episode of $W_j$ are contained in $W_j$.
Define $t_j = \{t_i : t_i \in W_j\}$ and $y_j = \{y_i : t_i \in W_j\}$. Then the observed-data likelihood on $W_j$ is $\sum_{y_j} L(\theta; t_j, y_j)$.
We estimate $\theta$ by maximizing the composite likelihood
\[ L(\theta; t) = \prod_{j=1}^{J} \sum_{y_j} L(\theta; t_j, y_j). \]
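Splitting the event times into the unit windows W_j is straightforward. The helper below is a hypothetical sketch. It assumes T is a multiple of s (otherwise the last window is shorter) and places an event at exactly T in the last window.

```python
import math

def split_into_windows(times, T, s):
    """Group event times into windows W_j = [(j-1)s, js), j = 1, ..., J."""
    J = math.ceil(T / s)
    windows = [[] for _ in range(J)]
    for t in times:
        j = min(int(t // s), J - 1)  # zero-based window index
        windows[j].append(t)
    return windows
```

For instance, with T = 3 and s = 1 the times 0.1, 0.9, 1.0 and 2.5 fall into the first, first, second and third windows, respectively.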

Composite Likelihood Estimation
Each summation in the composite likelihood is over $2^{n_j}$ terms, where $n_j$ is the number of events in $W_j$. Note that $\sum_{j=1}^{J} 2^{n_j} \ll 2^n$, so significant computational gains can be achieved.
There is a potential bias problem, since (i) the first event in $W_j$ may not be a parent event and (ii) not all events in the last episode of $W_j$ may be contained in $W_j$.
The bias problem can be mitigated if we choose the blocks wisely.
Convergence can be a problem since multiple parameters need to be estimated simultaneously and the likelihood surface is often quite flat.
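The size of the computational gain is easy to check numerically. The helper below is only an illustration with made-up window counts; it compares the number of terms enumerated by the composite likelihood with the number required by the full observed-data likelihood.

```python
def enumeration_cost(counts):
    """Return (sum_j 2**n_j, 2**n) for per-window event counts n_j."""
    composite = sum(2 ** n_j for n_j in counts)
    full = 2 ** sum(counts)
    return composite, full

composite, full = enumeration_cost([4, 3, 5])
print(composite, full)  # 56 4096
```

With 30 windows of 2 events each, the composite likelihood enumerates 120 terms, while the full likelihood would need 2**60.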

A Composite Likelihood EM Algorithm
Let $T_j$ and $Y_j$ be the random versions of $t_j$ and $y_j$.
In the E-step, we take the expectation of the log-likelihood $l(\theta; t_j, Y_j)$ with respect to the conditional distribution of $Y_j \mid T_j = t_j, \hat\theta_{prev}$, i.e.,
\[ Q_j(\theta \mid \hat\theta_{prev}) = E_{Y_j \mid T_j = t_j, \hat\theta_{prev}}\, l(\theta; t_j, Y_j). \]
Define
\[ Q(\theta \mid \hat\theta_{prev}) = \sum_{j=1}^{J} Q_j(\theta \mid \hat\theta_{prev}). \]
In the M-step, $Q(\theta \mid \hat\theta_{prev})$ is maximized with respect to $\theta$.
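For µ and ρ the maximization can be done by hand. Differentiating the expected complete-data log-likelihood of the univariate model gives the updates sketched below. These formulas are derived here from the likelihood in these slides and are not quoted from the authors; `parent_probs[l]` stands for the E-step probability that event l is a parent, pooled over all windows, and the function name is hypothetical.

```python
def m_step_mu_rho(gaps, parent_probs):
    """Closed-form updates for mu and rho given E-step parent probabilities.

    gaps[l] is the gap time d_l and parent_probs[l] is P(Y_l = 1 | data).
    Requires at least one expected parent and one expected offspring.
    """
    exp_parents = sum(parent_probs)
    exp_offspring = sum(1.0 - p for p in parent_probs)
    weighted_gap = sum((1.0 - p) * d for p, d in zip(parent_probs, gaps))
    # d/d(mu) of [-mu * E(parents) + E(offspring) * log(mu)] = 0
    mu_hat = exp_offspring / exp_parents
    # d/d(rho) of sum_l (1 - p_l) * [log(rho) - rho * d_l] = 0
    rho_hat = exp_offspring / weighted_gap
    return mu_hat, rho_hat
```

The update for β has no such closed form, so it would still require numerical optimization.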

A Composite Likelihood EM Algorithm
For the expectation, we need to calculate, for $t_l \in W_j$, the probability $P_\theta(Y_l = m \mid T_j = t_j)$, which is
\[ P_\theta(Y_l = m \mid T_j = t_j) = \frac{\sum_{y_j : y_l = m} L(\theta; t_j, y_j)}{\sum_{y_j} L(\theta; t_j, y_j)}. \]
If there are a large number of events in $W_j$, we employ a standard Metropolis-Hastings algorithm to sample from the conditional distribution $Y_j \mid T_j = t_j, \theta$ for the E-step.
Closed-form expressions can be obtained for $\hat\theta$ (except for $\hat\beta$) in the M-step.
Convergence is not an issue.
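For a window with few events, the conditional probabilities can be computed exactly by enumerating every labeling whose first event is a parent. The sketch below does this for the univariate model. It is an illustration under assumptions: the function names are hypothetical, the gap of the first event is measured from the window start, and the hazard is integrated with a trapezoidal rule.

```python
import itertools
import math

def integrate(f, a, b, steps=400):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / steps
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps)))

def window_lik(times, labels, a, b, mu, rho, hazard):
    """Complete-data likelihood L(theta; t_j, y_j) on the window [a, b)."""
    log_l, prev, counts = 0.0, a, []
    for t, y in zip(times, labels):
        if y == 1:   # parent gap density
            log_l += math.log(hazard(t)) - integrate(hazard, prev, t)
            counts.append(0)
        else:        # offspring gap density
            log_l += math.log(rho) - rho * (t - prev)
            counts[-1] += 1
        prev = t
    for n_i in counts:  # Poisson(mu) offspring counts
        log_l += -mu + n_i * math.log(mu) - math.lgamma(n_i + 1)
    log_l -= integrate(hazard, prev, b)  # no parent before the window ends
    return math.exp(log_l)

def parent_probabilities(times, a, b, mu, rho, hazard):
    """P(Y_l = 1 | T_j = t_j) for each event, by summing over labelings."""
    n = len(times)
    if n == 0:
        return []
    weights, total = [0.0] * n, 0.0
    for rest in itertools.product((0, 1), repeat=n - 1):
        labels = (1,) + rest  # the first event in the window is a parent
        lik = window_lik(times, labels, a, b, mu, rho, hazard)
        total += lik
        for l, y in enumerate(labels):
            if y == 1:
                weights[l] += lik
    return [w / total for w in weights]
```

The cost grows as 2**(n_j - 1) in the number of events in the window, which is why a sampler is needed for busy windows.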

A Composite Likelihood EM Algorithm
Theorem. The log-composite likelihood $l(\theta; t) = \log L(\theta; t)$ satisfies
\[ l(\theta_p; t) \ge l(\theta_{p-1}; t), \quad p = 1, 2, \ldots, \]
where $\theta_p$ is the $p$th update from the EM algorithm.
The theorem guarantees that the log-composite likelihood is nondecreasing at each EM iteration.
The convergence of $\hat\theta_p$ to a stationary point as $p \to \infty$ is guaranteed by Theorem 2 in Wu (1983).
Standard techniques such as running the EM algorithm from multiple starting points can help locate the global maximum.
Consistency and asymptotic normality can be established for the global maximum (assuming the model is correct).
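The monotonicity can be seen from the usual EM decomposition applied window by window. The following is a sketch of that standard argument rather than the proof given by the authors; $H_j$ is notation introduced here, and $\theta'$ denotes the previous iterate.

```latex
\begin{align*}
\log \sum_{y_j} L(\theta; t_j, y_j)
  &= Q_j(\theta \mid \theta') - H_j(\theta \mid \theta'), \\
H_j(\theta \mid \theta')
  &= E_{Y_j \mid T_j = t_j,\, \theta'} \log P_\theta(Y_j \mid T_j = t_j), \\
H_j(\theta \mid \theta')
  &\le H_j(\theta' \mid \theta') \quad \text{(Gibbs' inequality)}, \\
l(\theta; t) - l(\theta'; t)
  &\ge Q(\theta \mid \theta') - Q(\theta' \mid \theta').
\end{align*}
```

Taking $\theta' = \theta_{p-1}$ and $\theta = \theta_p$, the maximizer of $Q(\cdot \mid \theta_{p-1})$, makes the right-hand side nonnegative.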

Extension to Bivariate Case
For each episode, there is a Poisson number of subepisodes with mean γ.
Post and repost subepisodes alternate. The first subepisode is a post subepisode with probability α.
There is a Poisson number of offspring in each post (repost) subepisode with mean $\mu_1$ ($\mu_0$).
For each offspring in a post (repost) subepisode, its location relative to that of the previous event in the same episode follows an exponential distribution with parameter $\rho_1$ ($\rho_0$).
Once all the events in an episode have been observed, the parent event of the following episode is generated according to a hazard function λ(t; β).
The composite likelihood EM algorithm can be modified to fit the model.

Application to Trump's Twitter Data
Figure: Parameters estimated from Donald Trump's monthly Twitter data (panels: α, γ, µ_1, µ_0, ρ_1, ρ_0, number of tweets per episode, and episode length in hours; horizontal axis: 2013 to 2018). The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.

Figure: Estimated parent event hazard functions from Donald Trump's monthly Twitter data. The two red dashed lines mark June 2015 (candidacy announcement) and Jan 2017 (assumes office), respectively.

Figure: Goodness of fit plots of the model fitted for Jan 2017. From left to right: the envelope plot (first plot), with the upper and lower envelopes marked by red dashed lines, and goodness of fit plots for the original offspring post (second plot), offspring repost (third plot) and parent (last plot) inter-event distances. Red solid lines are calculated from the cdf of exponential distributions. The grey bands are the 95% confidence intervals.

Application to Sina Weibo Data
Figure: The posting times of three users (Users 1, 2 and 3; horizontal axis: date, 01/01 to 01/30).

          α         γ         µ_1       µ_0       ρ_1        ρ_0
User 1    0.343     0.024     0.099     0.241     14.444     43.442
          (0.008)   (0.004)   (0.010)   (0.014)   (7.166)    (6.124)
User 2    0.387     0.086     0.101     0.614     163.026    618.721
          (0.009)   (0.006)   (0.010)   (0.010)   (13.013)   (21.749)
User 3    0.644     0.227     0.445     0.309     90.983     152.253
          (0.006)   (0.008)   (0.013)   (0.012)   (5.882)    (7.477)
Table: Estimated α, γ, µ_1, µ_0, ρ_1, ρ_0 of Users 1, 2 and 3.

Application to Sina Weibo Data
Figure: Parent hazard functions of Users 1, 2 and 3 (vertical axis: intensity; horizontal axis: time of day, 12 am to 12 am).

Application to Sina Weibo Data
Figure: Plots of the mean and first three eigenfunctions of the estimated daily parent hazard functions (panels: mean function, first eigenfunction, second eigenfunction, third eigenfunction; horizontal axis: time of day, 12 am to 12 am).

Characterize Sina Weibo User Behavior
Figure: Groups in the average daily parent hazard (left plot), average number of posts per episode (middle plot) and average length (in hours) of an episode (right plot). The percentages at the bottom of the boxplots show the percentage of users in each group (7.3%, 26.05%, 66.6%; 4.2%, 20.4%, 75.4%; 3.2%, 15.6%, 81.2%).

Social Effect on Users of Sina Weibo
For each Sina Weibo user, we were also able to collect the number of accounts the user was following (followees) and the number of accounts that were following this user (followers).
We find that there is a stronger correlation between the number of followees and $\mu_0$ (r = 0.205). This indicates that users who follow more accounts are more likely to have more reposts.
One explanation could be that the more accounts a user follows, the more content they can repost from.
Another plausible explanation is that followers on social media tend to repost more.

Social Effect on Users of Sina Weibo
We find that popular users, i.e., those whose accounts have many followers, tend to post more original content. They are also more likely to initiate their Weibo engagement by posting original content.
We find that users who have strong social ties, i.e., who have many followers or follow many others, are more likely to use Weibo more often.
We find that users with many followers are more likely to spend more time on Weibo once they start an episode of engagement.

Simulation Study
We set the observation window length T = 100 and α = 0.6.
With each parameter configuration, we simulate 100 event trajectories.
We set the parent event hazard function as
\[ \lambda(t; \beta) = \exp\big[ \beta_{01} + \beta_{11} \cos(2\pi t) + \beta_{12} \sin(2\pi t) \big]. \]
For estimation, we use unit window length s = 1 or 5.
To model λ(t; β), we consider both the true model and the nonparametric cyclic B-spline model. For the latter, we use the knot vector (0, 0.2, 0.4, 0.6, 0.8, 1).

Simulation Study
(γ, µ_1, µ_0, ρ_1, ρ_0)   (β_01, β_11, β_12; s)   α         γ         µ_1       µ_0       ρ_1       ρ_0
(0.5,0.5,0.5,10,15)       (-2,-2,2; 5)            0.595     0.498     0.489     0.494     10.172    15.604
                                                  (0.010)   (0.013)   (0.014)   (0.014)   (0.261)   (0.365)
(0.5,0.5,0.5,10,15)       (-3,-3,3; 5)            0.594     0.496     0.510     0.518     9.867     15.422
                                                  (0.007)   (0.011)   (0.012)   (0.014)   (0.188)   (0.284)
(1.0,0.5,0.5,10,15)       (-2,-2,2; 5)            0.603     0.993     0.489     0.499     10.012    15.026
                                                  (0.009)   (0.017)   (0.011)   (0.012)   (0.176)   (0.257)
(0.5,1.0,1.0,10,15)       (-2,-2,2; 5)            0.598     0.511     0.990     1.025     10.149    15.084
                                                  (0.008)   (0.010)   (0.016)   (0.017)   (0.171)   (0.309)
(0.5,0.5,0.5,20,30)       (-2,-2,2; 5)            0.600     0.508     0.499     0.488     19.855    30.354
                                                  (0.008)   (0.012)   (0.012)   (0.013)   (0.460)   (0.717)
(0.5,0.5,0.5,10,15)       (-2,-2,2; 1)            0.601     0.468     0.495     0.460     10.795    16.335
                                                  (0.008)   (0.010)   (0.014)   (0.014)   (0.271)   (0.309)

Summary
We propose a new clustered temporal point process model for user-generated posts on social media.
The proposed model captures both the inhomogeneity in the initial posting time and the clustering pattern in the subsequent posts following the initial post.
The proposed goodness of fit procedure shows that the model fits the data reasonably well.
The fitted models provide valuable insights into a user's content-generating behavior.