Beyond ERGMs: Scalable methods for the statistical modeling of networks. David Hunter, Department of Statistics, Penn State University
1 Beyond ERGMs: Scalable methods for the statistical modeling of networks. David Hunter, Department of Statistics, Penn State University. Supported by ONR MURI Award Number N00014-08-1-1015. University of Texas at Austin, May 2013
2 Outline Estimation and the ERGM Framework Statistical Estimation for Large, Time-Varying Networks Model-Based Clustering of Large Networks
3 Outline Estimation and the ERGM Framework Statistical Estimation for Large, Time-Varying Networks Model-Based Clustering of Large Networks
4 A network model is a probability distribution (or family of distributions) on the set of all possible networks. Thus, we assign each possible network a probability. But we'd like to avoid explicit enumeration. (Think of Occam's Razor.) ERGMs are one way to allow the assignment to depend (explicitly) on a relatively small number of parameters. ERGM = Exponential-family Random Graph Model
5 ERGM: Exponential-Family Random Graph Model An ERGM (or p-star model) says P_theta(Y = y) = exp{theta' g(y)} / kappa(theta, Y), for y in Y, where Y is a random network taking values in the set Y of all possible networks; theta is a vector of parameters; g(y) is a known vector of network statistics on y; and kappa(theta, Y) makes all the probabilities sum to 1.
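As a concrete illustration of the definition above (my own sketch, not from the talk), here is a minimal computation of P_theta(Y = y) for a tiny graph, evaluating kappa(theta) by brute-force enumeration; the names `ergm_prob` and `edge_count` are hypothetical:

```python
import itertools
import math

def ergm_prob(theta, g, n, y_obs):
    """P_theta(Y = y_obs) for an ERGM with statistic vector g, computing
    kappa(theta) by brute-force enumeration of all graphs on n nodes."""
    pairs = list(itertools.combinations(range(n), 2))
    kappa = 0.0
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        y = dict(zip(pairs, bits))
        kappa += math.exp(sum(t * s for t, s in zip(theta, g(y))))
    return math.exp(sum(t * s for t, s in zip(theta, g(y_obs)))) / kappa

def edge_count(y):          # a one-dimensional statistic vector g(y)
    return [sum(y.values())]

# On 3 nodes with theta = 0, each of the 2^3 = 8 graphs gets probability 1/8.
empty = {pr: 0 for pr in itertools.combinations(range(3), 2)}
print(ergm_prob([0.0], edge_count, 3, empty))  # 0.125
```

Of course, the point of the next slides is that this enumeration is hopeless for realistic n.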
6 The Gilbert-Erdős-Rényi model: The simplest ERGM The function kappa(theta, Y) can be troublesome, but not always. Consider the following case (Gilbert, Ann. Math. Stat., 1959):
7 The Gilbert-Erdős-Rényi model: The simplest ERGM Let p be some fixed constant between 0 and 1, and set P(Y = y) = p^E(y) (1 - p)^Ebar(y), where E(y) is the number of edges in y and Ebar(y) is the number of non-edges in y. Rewrite using theta = log p - log(1 - p): P(Y = y) = (1 - p)^const (p / (1 - p))^(# of edges) = exp{theta * # of edges} / kappa(theta).
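The algebra above can be checked numerically. This sketch (mine, with illustrative values n = 4 and p = 0.3) verifies that the Bernoulli form p^E(y) (1 - p)^Ebar(y) matches the exponential-family form exp{theta * E(y)} / kappa(theta) with theta = logit(p), using the closed form kappa(theta) = (1 + e^theta)^(n choose 2):

```python
import itertools
import math

n, p = 4, 0.3
theta = math.log(p) - math.log(1 - p)       # theta = logit(p)
n_pairs = len(list(itertools.combinations(range(n), 2)))

# kappa(theta) = sum over graphs of exp(theta * #edges) = (1 + e^theta)^(n choose 2)
kappa = (1 + math.exp(theta)) ** n_pairs

# Any graph with E edges has the same probability under both formulas.
E = 2
bernoulli_form = p**E * (1 - p)**(n_pairs - E)
ergm_form = math.exp(theta * E) / kappa
print(abs(bernoulli_form - ergm_form) < 1e-12)  # True
```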
8 Dyadic independence ERGMs are generally tractable Gilbert-Erdős-Rényi is a special case of dyadic independence: P_theta(Y = y) = prod_{i<j} P_theta(D_ij = d_ij), where D_ij is the dyad for nodes i and j (four states in the directed case, two in the undirected case). Dyadic independence models have drawbacks but they facilitate estimation; facilitate simulation; avoid the degeneracy issue (cf. Schweinberger, 2011).
9 Statistical inference is probability in reverse The ERGM hypothesizes: P_theta(Y = y) = exp{theta' g(y)} / kappa(theta, Y). Probability runs forward from theta through the ERGM to data; statistics runs in reverse. Statistical goal: use observed data to select from the given ERGM class, i.e., to learn about theta. We might search for a best theta or a density p(theta | data).
10 The loglikelihood function is built from L(theta) = P_theta(Y = y_obs) The ERGM hypothesizes: P_theta(Y = y) = exp{theta' g(y)} / kappa(theta, Y). To choose a theta, we might search for a best theta by maximizing L(theta), or equivalently l(theta) = log L(theta) = theta' g(y_obs) - log kappa(theta, Y). Alternatively, a Bayesian approach tries to describe an entire distribution over theta values, the posterior: p(theta | Y = y_obs) proportional to L(theta) pi(theta).
11 Computing the likelihood is sometimes very difficult The likelihood is L(theta) = P_theta(Y = y_obs), viewed as a function of theta. For an undirected, 34-node network, computing l(theta) directly may require summation of 2^561 (roughly 7.5 x 10^168) terms, one for each possible network on 34 nodes.
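The count of terms is just the number of undirected graphs on 34 nodes, 2^(34 choose 2); a two-line check:

```python
from math import comb

n = 34
num_graphs = 2 ** comb(n, 2)   # one term in kappa per possible undirected graph
print(comb(n, 2))              # 561
print(len(str(num_graphs)))    # 169, i.e. the count is about 7.5e168
```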
12 The log-likelihood may be written as an expectation Recall: l(theta) = log L(theta) = theta' g(y_obs) - log kappa(theta, Y). Suppose we fix theta_0. A bit of algebra shows that l(theta) - l(theta_0) = (theta - theta_0)' g(y_obs) - log E_theta_0 [exp{(theta - theta_0)' g(Y)}] = (theta - theta_0)' g(y_obs) - log E_theta_0 [blah blah Y blah]. Thus, randomly sampling networks from P_theta_0 allows approximation of l(theta) - l(theta_0).
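A minimal Monte Carlo sketch of this identity (mine, not the talk's code), using the edges-only model so that kappa has a closed form and the approximation can be compared with the exact value; `g_obs` is a hypothetical observed edge count:

```python
import math
import random

random.seed(0)
n_dyads = 28            # N = (8 choose 2) dyads in an 8-node undirected graph
theta0, theta = -1.0, -0.5
g_obs = 10              # hypothetical observed edge count

# Exact difference, available here because kappa(theta) = (1 + e^theta)^N:
# l(theta) - l(theta0) = (theta - theta0) g_obs - log[kappa(theta)/kappa(theta0)]
exact = (theta - theta0) * g_obs - n_dyads * math.log(
    (1 + math.exp(theta)) / (1 + math.exp(theta0)))

# Monte Carlo: sample networks Y ~ P_theta0 (independent Bernoulli edges)
# and average exp{(theta - theta0) g(Y)}.
p0 = 1 / (1 + math.exp(-theta0))
m = 100_000
mean = sum(
    math.exp((theta - theta0) * sum(random.random() < p0 for _ in range(n_dyads)))
    for _ in range(m)) / m
approx = (theta - theta0) * g_obs - math.log(mean)
print(abs(exact - approx) < 0.05)  # the Monte Carlo estimate is close
```

For general ERGMs the samples come from MCMC rather than independent draws, but the averaging step is the same.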
13 Example Network: High School Friendship Data An edge indicates a mutual friendship. Colored labels give grade level. Circles = female, squares = male, triangles = unknown. N.B.: Missing data ignored here, though this could be altered.
14 Fitting an ERGM to the high school dataset ERGM parameter estimates from Hunter et al. (2008) include structural terms (edges, GWESP, GWD, GWDSP) and covariate terms for grade, race, and sex. NF stands for Node Factor. AD stands for Absolute Difference. DH stands for Differential Homophily. UH stands for Uniform Homophily. Estimates are reported with standard errors and flagged for significance at the .05, .01, and .001 levels.
15 But what about Large Networks? A network of this size does not really qualify as large in this context. The estimation techniques used previously do not scale well.
16 Outline Estimation and the ERGM Framework Statistical Estimation for Large, Time-Varying Networks Model-Based Clustering of Large Networks
17 Idea: Use counting process theory to model networks Goal: Model a dynamically evolving network using counting processes. Methods should be applicable to large network datasets (tens or hundreds of thousands of nodes). Two modeling frameworks (terminology of Butts, 2008): Egocentric: the counting process N_i(t) = cumulative number of events involving the ith node by time t. Relational: the counting process N_ij(t) = cumulative number of events involving the (i, j)th node pair by time t. NB: Events need not be edge additions.
18 Counting processes may be considered multivariate Combine the N_i(t) to give a multivariate counting process N(t) = (N_1(t), ..., N_n(t)). Genuinely multivariate; no assumption about the independence of the N_i(t).
19 Citation Networks may be modeled as egocentric Theoretical physics (hep-th) articles on arXiv: 29,557 articles and 352,807 citations. At arrival, a paper cites others that are already in the network. Main dynamic development: the number of citations received. N_i(t): number of citations to paper i by time t. At-risk indicator R_i(t) = I{t_i^arr < t}.
20 Twitter behavior provides another egocentric example Vaccination-related tweets collected from Aug. 2009 to Jan. 2010, with more than 4 million follower edges among the users. Of particular interest: H1N1 vaccination sentiment. Some users express - or + sentiments regarding H1N1 vaccination; N_i^-(t) and N_i^+(t) are the numbers of such tweets by time t. Question of interest: which predictors (e.g., past behavior of self / followers / followees, position in the directed following network) predict the propensity to tweet - or +? Is tweeting behavior about H1N1 vaccination contagious?
21 A multivariate counting process is a submartingale Each N_i(t) is nondecreasing in time, so N(t) may be considered a submartingale; i.e., it satisfies E[N(t) | past up to time s] >= N(s) for all t > s.
22 The so-called Doob-Meyer Decomposition uniquely decomposes any submartingale: N(t) = integral_0^t lambda(s) ds + M(t), where lambda(t) is the signal at time t, called the intensity function, and M(t) is the noise, a continuous-time martingale. We will model each lambda_i(t) or lambda_ij(t).
23 We use standard models for the intensity processes In the egocentric case, consider the Cox or Aalen model for the node-i process. Cox proportional hazards model, fixed coefficients: lambda_i(t | H_t) = R_i(t) alpha_0(t) exp(beta' s_i(t)). Aalen additive model, time-varying coefficients: lambda_i(t | H_t) = R_i(t) (beta_0(t) + beta(t)' s_i(t)), where R_i(t) = I(t > t_i^arr) is the at-risk indicator; H_t is the past of the network up to but not including time t; alpha_0(t) or beta_0(t) is the baseline hazard function; beta is the p-vector of coefficients to estimate; and s_i(t) = (s_i1(t), ..., s_ip(t)) is a vector of statistics for node i.
24 Relational case is similar (cf. Perry and Wolfe) Cox proportional hazards model, fixed coefficients: lambda_ij(t | H_t) = R_ij(t) alpha_0(t) exp(beta' s(i, j, t)). Aalen additive model, time-varying coefficients: lambda_ij(t | H_t) = R_ij(t) (beta_0(t) + beta(t)' s(i, j, t)), where R_ij(t) = I(max{t_i^arr, t_j^arr} < t < t_e_ij) is the at-risk indicator; H_t is the past of the network up to but not including time t; alpha_0(t) or beta_0(t) is the baseline hazard function; beta or beta(t) is the vector of coefficients to estimate; and s(i, j, t) is a p-vector of statistics for pair (i, j).
25 For large networks, maximizing the partial likelihood in the Cox model requires some computing tricks Recall: the intensity process for node i is lambda_i(t | H_t) = R_i(t) alpha_0(t) exp(beta' s_i(t)). Treat alpha_0 as a nuisance parameter and take a partial likelihood approach: maximize L(beta) = prod_{e=1}^m exp(beta' s_{i_e}(t_e)) / sum_{i=1}^n R_i(t_e) exp(beta' s_i(t_e)) = prod_{e=1}^m exp(beta' s_{i_e}(t_e)) / kappa(t_e). Computational trick: write kappa(t_e) = kappa(t_{e-1}) + Delta kappa(t_e), so the kappa(t_e) calculation is updated incrementally rather than recomputed from scratch.
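One way to realize the incremental-kappa trick is sketched below (my own simplified setup, in which each event increments only the cited node's scalar statistic): keep the per-node exp-terms cached and adjust the running sum by the delta for the one node that changed. A brute-force pass confirms the two computations agree.

```python
import math
import random

random.seed(1)
n, beta = 6, 0.7
events = [random.randrange(n) for _ in range(10)]   # hypothetical event sequence
s0 = {i: 0.0 for i in range(n)}                     # statistic: citations so far

# Incremental pass: kappa(t_e) = kappa(t_{e-1}) + delta, updating only the
# exp-term of the one node whose statistic changed at the event.
s = dict(s0)
terms = {i: math.exp(beta * s[i]) for i in range(n)}
kappa = sum(terms.values())
logpl_inc = 0.0
for node in events:
    logpl_inc += beta * s[node] - math.log(kappa)
    s[node] += 1.0
    new = math.exp(beta * s[node])
    kappa += new - terms[node]
    terms[node] = new

# Brute-force pass for comparison: recompute kappa from scratch at each event.
s = dict(s0)
logpl_full = 0.0
for node in events:
    kappa_full = sum(math.exp(beta * s[i]) for i in range(n))
    logpl_full += beta * s[node] - math.log(kappa_full)
    s[node] += 1.0

print(abs(logpl_inc - logpl_full) < 1e-9)  # True
```

The incremental pass does O(changed nodes) work per event instead of O(n), which is the difference between feasible and infeasible on networks with many thousands of nodes.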
26 Fitting the Aalen model uses weighted least squares Recall: the intensity process for node i is lambda_i(t | H_t) = R_i(t) (beta_0(t) + beta(t)' s_i(t)). We do inference not for the beta_k but rather for their time-integrals B_k(t) = integral_0^t beta_k(s) ds. Then (basically weighted least squares) Bhat(t) = sum_{t_e <= t} J(t_e) [W(t_e)' W(t_e)]^{-1} W(t_e)' Delta N(t_e), where W(t) is N(N-1) x p with (i, j)th row R_ij(t) s(i, j, t)', and J(t) is the indicator that W(t) has full column rank.
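A sketch of one increment of the estimator above (illustrative, with a random design matrix standing in for W(t_e) and `numpy.linalg.lstsq` performing the least-squares step):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_proc = 3, 40   # p covariates, n_proc = number of processes at risk

# One event time t_e: design matrix W(t_e) whose rows are R(t_e) * s(., t_e);
# an intercept column plays the role of the baseline beta_0.
W = np.hstack([np.ones((n_proc, 1)), rng.normal(size=(n_proc, p))])
dN = np.zeros(n_proc)
dN[rng.integers(n_proc)] = 1.0          # the process that jumps at t_e

# Increment of the integrated-coefficient estimator B-hat at t_e:
# J(t_e) [W'W]^{-1} W' dN(t_e), computed stably via least squares.
J = np.linalg.matrix_rank(W) == W.shape[1]   # full-column-rank indicator
dB = J * np.linalg.lstsq(W, dN, rcond=None)[0]
print(dB.shape)  # (4,)
```

Summing these increments over event times t_e <= t gives the step-function estimate Bhat(t).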
27 Example statistics: Preferential Attachment For each cited paper j already in the network... First-order PA: s_j1(t) = sum_{i=1}^N y_ij(t-). "Rich get richer" effect. Second-order PA: s_j2(t) = sum_i sum_k y_ki(t-) y_ij(t-). Effect due to being cited by well-cited papers. Statistics in red are time-dependent. Others are fixed once j joins the network. NB: y(t-) is the network just prior to time t.
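With the network stored as an adjacency matrix, both PA statistics reduce to simple matrix expressions; a small illustrative example on a hypothetical 4-paper citation network:

```python
import numpy as np

# Hypothetical citation adjacency: y[i, j] = 1 if paper i cites paper j.
y = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]])

first_order_pa = y.sum(axis=0)           # s_j1: citations received by paper j
second_order_pa = (y @ y).sum(axis=0)    # s_j2: sum over i, k of y_ki * y_ij

print(first_order_pa.tolist())   # [0, 1, 2, 1]
print(second_order_pa.tolist())  # [0, 0, 1, 2]
```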
28 Example statistics: Recency PA Statistic For each cited paper j already in the network... Recency-based first-order PA (with a fixed window of T_w days): s_j3(t) = sum_{i=1}^N y_ij(t-) I(t - t_i^arr < T_w). Temporary elevation of citation intensity after recent citations. Statistics in red are time-dependent. Others are fixed once j joins the network. NB: y(t-) is the network just prior to time t.
29 Example statistics: Triangle Statistics For each cited paper j already in the network... Seller statistic: s_j4(t) = sum_i sum_k y_ki(t-) y_ij(t-) y_kj(t-). Broker statistic: s_j5(t) = sum_i sum_k y_kj(t-) y_ji(t-) y_ki(t-). Buyer statistic: s_j6(t) = sum_i sum_k y_jk(t-) y_ki(t-) y_ji(t-). The original slide illustrates the seller, broker, and buyer roles with a citation triangle. Statistics in red are time-dependent. Others are fixed once j joins the network. NB: y(t-) is the network just prior to time t.
30 Example statistics: Out-Path Statistics For each cited paper j already in the network... First-order out-degree (OD): s_j7(t) = sum_{i=1}^N y_ji(t-). Second-order OD: s_j8(t) = sum_i sum_k y_jk(t-) y_ki(t-). Statistics in red are time-dependent. Others are fixed once j joins the network. NB: y(t-) is the network just prior to time t.
31 Example statistics: Topic Modeling Statistics Additional statistics use abstract text when available: a latent Dirichlet allocation (LDA) model (Blei et al., 2003) is learned on the training set. (Figure from the Wikipedia entry for latent Dirichlet allocation: N = words per document, M = documents, theta_i = topic distribution for paper i.) We construct a vector of similarity statistics, one per topic: s_j^LDA(t_i^arr) = theta_i o theta_j, where o denotes the element-wise product of two vectors.
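The similarity statistic is just an element-wise product of topic distributions; a tiny sketch with hypothetical 3-topic vectors:

```python
import numpy as np

# Hypothetical 3-topic distributions for a citing paper i and a cited paper j.
theta_i = np.array([0.7, 0.2, 0.1])
theta_j = np.array([0.1, 0.6, 0.3])

s_lda = theta_i * theta_j   # element-wise product: one statistic per topic
print(s_lda)                # approximately [0.07 0.12 0.03]
```

Each coordinate is large only when both papers put substantial weight on the same topic, so the vector measures topic-by-topic similarity rather than a single summary score.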
32 Coefficient Estimates for LDA + PPTR Model Fitted coefficients (beta) are reported for s_1 (first-order PA), s_2 (second-order PA), s_3 (recency PA), s_4 (Seller), s_5 (Broker), s_6 (Buyer), s_7 (first-order OD), and s_8 (second-order OD); the Seller, Broker, and Buyer coefficients are negative. All coefficient estimates are significant at the .001 level. Diverse seller effect: D is more likely cited than A. Diverse buyer effect: E is more likely cited than C. (The letters refer to the triangle diagrams on the original slide.)
33 Twitter and H1N1 Vaccination Sentiments Salathé and Khandelwal (2011) collected over 4 million tweets from Twitter users. For vaccination sentiments measured online to be meaningful, they need to be compared to empirical data for validation. Here, the counting process of interest is not the formation of ties; it is the expression of H1N1 vaccination sentiments: N_i^+(t) = # of positive tweets by i before time t; N_i^-(t) = # of negative tweets by i before time t. Figure (from Salathé and Khandelwal 2011, PLoS Computational Biology): (A) total numbers of negative (red), positive (green), and neutral (blue) tweets relating to influenza A(H1N1) vaccination during the fall wave of the pandemic; (B) daily and moving-average sentiment score, which turned positive in mid-October (as the vaccine became available) and remained positive for the rest of the year; (C) correlation between estimated vaccination rates and sentiment score per HHS region and per state.
34 Cox Model Coefficients for Twitter Dataset Coefficients are estimated separately for the intensity of positive tweeting and the intensity of negative tweeting, for four statistics: f_1^+ (# friends who tweet +), f_2^+ ((+ tweets) x (+ friends)), f_1^- (# friends who tweet -), and f_2^- ((- tweets) x (- friends)). The friends'-positive statistics are highly significant (p < 10^-3) for the intensity of negative tweeting, the friends'-negative statistics are highly significant for the intensity of positive tweeting, and f_2^- is highly significant for both. Here f_1^+(i, t) = sum_{j in F(i,t)} (# of + tweets by j up to time t) / (total # of tweets by j up to time t), f_2^+(i, t) is the corresponding interaction with i's own + tweeting, and f_1^-, f_2^- are defined analogously with - in place of +.
35 Sensitivity Analysis for Twitter Dataset The automatic sentiment-classification algorithm for tweets can err. Randomly re-classify all tweets using a small test dataset comparing human classification to automatic classification, and repeat the re-classification many times. The resulting distributions of re-estimated coefficients (positive-friends and positive-tweets statistics for the response lambda_i^- , negative-friends and negative-tweets statistics for lambda_i^+) show what fraction of the re-estimated confidence intervals are entirely positive or entirely negative.
36 Outline Estimation and the ERGM Framework Statistical Estimation for Large, Time-Varying Networks Model-Based Clustering of Large Networks
37 Epinions.com: Example of large network dataset "Unbiased Reviews by Real People." Members of Epinions.com can decide whether to trust each other. The Web of Trust is combined with review ratings to determine which reviews are shown to the user. Dataset of Massa and Avesani: n = 131,828 nodes; n(n - 1) is approximately 17.4 billion observations; 841,372 of these are nonzero (+ or -).
38 The Goal: Cluster 131,828 users Basis for clustering: patterns of trusts and distrusts in the network. If possible: understand the features of the clusters by examining parameter estimates. Notation: throughout, we let y_ij be the rating of j by i and y = (y_ij). We'd like to restrict attention to dyadic independence ERGMs in order to model the observed (y_ij) data.
39 To model dependence, add K-component mixture structure Suppose nodes have latent (unobserved) colors Z_1, ..., Z_n. Simplifying assumption: P(Y = y | Z) = prod_{i<j} P(D_ij = d_ij | Z_i, Z_j), where D_ij is the state in Y of the (i, j)th pair.
40 Consider two examples of conditional dyadic independence for the Epinions dataset. The full model of Nowicki and Snijders (2001): P_theta(D_ij = d | Z_i = k, Z_j = l) = theta_{d;kl}. A more parsimonious model: P_theta(D_ij = d_ij | Z_i = k, Z_j = l) proportional to exp{theta^- (y_ij^- + y_ji^-) + theta_k^- y_ji^- + theta_l^- y_ij^- + theta^-- y_ij^- y_ji^- + theta^++ y_ij^+ y_ji^+}, where y_ij^- = I{Y_ij = -} and y_ij^+ = I{Y_ij = +}. With K = 5 components, the full model has many more parameters than the parsimonious one.
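A sketch of the parsimonious dyad model (with my own illustrative parameter values, not the fitted Epinions estimates): enumerate the nine states of a signed directed dyad, form the unnormalized exponential-family weights, and normalize at the dyad level.

```python
import itertools
import math

# Hypothetical parameter values (not the fitted Epinions estimates).
th_neg, th_k, th_l, th_negneg, th_pospos = -4.0, -1.0, 0.5, 1.7, 1.2

def weight(y_ij, y_ji):
    """Unnormalized probability of one dyad state; each direction is
    -1 (distrust), 0 (no rating), or +1 (trust)."""
    neg_ij, neg_ji = int(y_ij == -1), int(y_ji == -1)
    pos_ij, pos_ji = int(y_ij == +1), int(y_ji == +1)
    return math.exp(th_neg * (neg_ij + neg_ji) + th_k * neg_ji + th_l * neg_ij
                    + th_negneg * neg_ij * neg_ji + th_pospos * pos_ij * pos_ji)

states = list(itertools.product([-1, 0, 1], repeat=2))
Z = sum(weight(a, b) for a, b in states)           # dyad-level normalizer
probs = {s: weight(*s) / Z for s in states}
print(abs(sum(probs.values()) - 1.0) < 1e-12)      # True
```

Because each dyad has only nine states, the normalizer is trivial here; the hard part, addressed in the next slides, is the sum over latent colorings Z.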
41 There is a problem with the simplifying assumption Our conditional independence model says Z_i iid Multinomial(1; gamma_1, ..., gamma_K) and P_theta(Y = y | Z) = prod_{i<j} P_theta(D_ij = d_ij | Z_i, Z_j). Not so simple when we do not observe Z: the full (unconditional) loglikelihood is rather complicated: l(gamma, theta) = log sum_z P_gamma(Z = z) P_theta(Y = y | Z = z).
42 Approximate maximum likelihood estimation uses a variational EM algorithm For MLE, the goal is to maximize the loglikelihood l(gamma, theta). Basic idea: establish a lower bound J(gamma, theta, alpha) <= l(gamma, theta) after augmenting the parameters by adding alpha, and create an EM-like algorithm guaranteed to increase J(gamma, theta, alpha) at each iteration. If we maximize the lower bound, then we are hoping that the inequality will be tight enough to put us close to a maximum of l(gamma, theta). We adapt the variational EM idea of Daudin, Picard, and Robin (2008).
43 We may derive a lower bound by simple algebra Clever variational idea: augment the parameter set, letting alpha_ik = P(Z_i = k) for all 1 <= i <= n and 1 <= k <= K. Let A_alpha(Z) = prod_i Mult(z_i; alpha_i) denote the joint distribution of Z. Direct calculation gives J(gamma, theta, alpha) := l(gamma, theta) - KL{A_alpha(Z), P_gamma,theta(Z | Y)} = ... = E_alpha[log P_gamma,theta(Y, Z)] + H[A_alpha(Z)]. Thus, an EM-like algorithm consists of alternately maximizing J(gamma, theta, alpha) with respect to alpha ("E-step") and maximizing E_alpha[log P_gamma,theta(Y, Z)] with respect to gamma, theta ("M-step").
44 The variational E-step may be modified using a (non-variational) MM algorithm Idea: use a generalized variational E-step in which J(gamma, theta, alpha) is increased but not necessarily maximized. To this end, we create a surrogate function Q(alpha; gamma^(t), theta^(t), alpha^(t)) of alpha, where t is the iteration counter. The surrogate function is a minorizer of J(gamma, theta, alpha): maximizing or merely increasing its value will guarantee an increase in the value of J(gamma, theta, alpha). (In the figure, the red curve minorizes the objective at the current iterate.)
45 Construction of the minorizer of J(gamma, theta, alpha) uses standard MM algorithm methods J(gamma, theta, alpha) = sum_{i<j} sum_{k=1}^K sum_{l=1}^K alpha_ik alpha_jl log pi_{d_ij;kl}(theta) + sum_{i=1}^n sum_{k=1}^K alpha_ik (log gamma_k - log alpha_ik) + C. We may define a minorizing function as follows: Q(alpha; gamma, theta, alpha^(t)) = sum_{i<j} sum_k sum_l [alpha_ik^2 alpha_jl^(t) / (2 alpha_ik^(t)) + alpha_jl^2 alpha_ik^(t) / (2 alpha_jl^(t))] log pi_{d_ij;kl}(theta) + sum_i sum_k alpha_ik (log gamma_k - log alpha_ik^(t) - alpha_ik / alpha_ik^(t) + 1). Can be maximized (in alpha) using quadratic programming.
46 The parsimonious model for the Epinions dataset P_theta(D_ij = d_ij | Z_i = k, Z_j = l) proportional to exp{theta^- (y_ij^- + y_ji^-) + theta_k^- y_ji^- + theta_l^- y_ij^- + theta^-- y_ij^- y_ji^- + theta^++ y_ij^+ y_ji^+}, where y_ij^- = I{Y_ij = -} and y_ij^+ = I{Y_ij = +}. NB: The term theta^+ (y_ij^+ + y_ji^+) is omitted to avoid perfect collinearity. theta^-: overall tendency toward distrust. theta_k^-: category-specific trustedness. theta^--: lex talionis tendency (an eye for an eye). theta^++: quid pro quo tendency (one good turn deserves another).
47 Parameter estimates themselves are of interest The fitted parameters, each with a 95% confidence interval: negative edges (theta^-), negative reciprocity (theta^--), positive reciprocity (theta^++), and the cluster trustworthiness parameters theta_1^- through theta_5^-; positive edges (theta^+) is omitted for identifiability. The trustworthiness estimates vary markedly across the five clusters. Confidence intervals are based on a parametric bootstrap using simulated networks. NB: There are some strange aspects of the bootstrap we cannot explain yet.
48 Multiple starting points converge to the same solution Trace plots from different randomly selected starting parameter values show the loglikelihood values and the trustedness parameters converging to the same solution within a few iterations. Results for the full model look nothing like this.
49 We may use average ratings of reviews by other users as a way to ground-truth the clustering solutions Articles are categorized by their author's highest-probability component, and average article rating is plotted against cluster size for both the parsimonious model and the full model.
50 ERGMs may not be a great way to model large networks with dependencies, but... The ERGM framework is useful because it forces researchers to think about which network statistics are important. Alternative models can exploit similar ways of thinking about networks, or even exploit ERGMs themselves.
51 Cited References: ERGMs Erdős, P., and Rényi, A. On Random Graphs I. Publicationes Mathematicae (Debrecen), 1959. Gilbert, E. N. Random Graphs. Annals of Mathematical Statistics, 1959. Hunter, D. R., Goodreau, S. M., and Handcock, M. S. Goodness of Fit of Social Network Models. Journal of the American Statistical Association, 2008.
52 Cited References: Counting Processes for Networks Brandes, U., Lerner, J., and Snijders, T. A. B. Networks evolving step by step: Statistical analysis of dyadic event data. In Advances in Social Network Analysis and Mining (ASONAM), IEEE, 2009. Butts, C. T. A relational event framework for social action. Sociological Methodology, 38(1):155-200, 2008. Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 1972. Perry, P. O., and Wolfe, P. J. Point process modeling for directed interaction networks. Journal of the Royal Statistical Society, Series B, to appear. Salathé, M., Vu, D. Q., Khandelwal, S., and Hunter, D. R. The Dynamics of Health Behavior Sentiments on a Large Social Network. EPJ Data Science, 2:4, 2013. Vu, D. Q., Asuncion, A. U., Hunter, D. R., and Smyth, P. Dynamic Egocentric Models for Citation Networks. Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011. Vu, D. Q., Asuncion, A. U., Hunter, D. R., and Smyth, P. Continuous-Time Regression Models for Longitudinal Networks. Advances in Neural Information Processing Systems 24 (NIPS 2011), to appear.
53 Cited References: Variational EM for Large Networks Daudin, J.-J., Picard, F., and Robin, S. A Mixture Model for Random Graphs. Statistics and Computing, 2008. Nowicki, K., and Snijders, T. A. B. Estimation and Prediction for Stochastic Blockstructures. Journal of the American Statistical Association, 2001. Vu, D. Q., Hunter, D. R., and Schweinberger, M. Model-Based Clustering of Large Networks. Annals of Applied Statistics, to appear.
54 A FEW EXTRA SLIDES
55 Maximum Pseudolikelihood: Intuition What if we assume that there is no dependence (or very weak dependence) among the Y_ij? In other words, what if we approximate the marginal P(Y_ij = 1) by the conditional P(Y_ij = 1 | Y_ij^c = y_ij^c)? Then the Y_ij are independent with log[P(Y_ij = 1) / P(Y_ij = 0)] = theta' delta(y_obs)_ij, so we obtain an estimate of theta using straightforward logistic regression. Result: the maximum pseudolikelihood estimate (MPLE). For independence models, MPLE = MLE!
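For the edges-only model the change statistic delta(y)_ij is identically 1, so the MPLE is just the logit of the observed edge density; this sketch (mine, on a hypothetical random network) checks that it coincides with the MLE obtained by directly maximizing the tractable loglikelihood:

```python
import itertools
import math
import random

random.seed(2)
n = 12
pairs = list(itertools.combinations(range(n), 2))
y = {pr: int(random.random() < 0.3) for pr in pairs}   # hypothetical observed network

# Edges-only model: the change statistic is identically 1, so logistic
# regression on an intercept alone gives theta-hat = logit(edge density).
density = sum(y.values()) / len(pairs)
theta_mple = math.log(density / (1 - density))

# The MLE, found by maximizing the (tractable) loglikelihood on a fine grid:
E = sum(y.values())
def loglik(theta):
    return theta * E - len(pairs) * math.log(1 + math.exp(theta))
theta_mle = max((t / 1000 for t in range(-3000, 3001)), key=loglik)

print(abs(theta_mple - theta_mle) < 1e-3)  # True: MPLE = MLE here
```

For models with dependence the two estimates generally differ, which is exactly the concern raised on the next slide.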
56 MLE vs. MPLE "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." John W. Tukey. MLE (maximum likelihood estimation): a well-established method, but very hard because the normalizing constant kappa(theta) is difficult to evaluate, so we approximate it instead. MPLE (maximum pseudolikelihood estimation): easy to do using logistic regression, but based on an independence assumption that is often not justified. Several authors, notably van Duijn et al. (2009), argue forcefully against the use of MPLE (except when MLE = MPLE!).
57 Model construction and Testing Dataset: arXiv-th, high-energy physics theory articles, Jan. 1993 to Apr. 2003. Timestamps are in continuous time; abstract text is included. (29,557 articles; 352,807 citations.) 1. Statistics-building phase: construct network history and build up network statistics. 2. Training phase: construct the partial likelihood and estimate model coefficients. 3. Test phase: evaluate the predictive capability of the learned model. Statistics-building is ongoing even through the training and test phases. The phases are split along citation event times.
58 Recall Performance Recall: proportion of true citations among the K pairs with largest estimated likelihoods, plotted against the cut point K. Models compared: PA (preferential attachment only, s_1); PPT (all statistics except the recency statistic s_3); PPTR (all statistics); LDA (LDA statistics only); LDA + PPTR.
59 Social networks may be modeled as relational Irvine: online social network of students at UC Irvine; 1,899 users; 20,296 directed contact edges. Links are non-recurrent; i.e., N_ij(t) is either 0 or 1. At-risk indicator R_ij(t) = I{max(t_i^arr, t_j^arr) < t < t_e_ij}. (The original slide shows sample rows of the contacter / contactee / date event log.)
60 Relational Example: Modeling a network of contacts Some of the statistics in the model: Sender out-degree: s(i, j, t) = sum_{h in V, h != i} N_ih(t-). Reciprocity: s_5(i, j, t) = N_ji(t-). Transitivity: s_6(i, j, t) = sum_{h in V, h != i, j} N_ih(t-) N_hj(t-). Shared contacters: s(i, j, t) = sum_{h in V, h != i, j} N_hi(t-) N_hj(t-).
61 Aalen model estimates for Irvine Data Set Aalen coefficient plots ((a) sender out-degree, (b) reciprocity, (c) transitivity, (d) shared contacters) suggest two distinct phases of network evolution, consistent with an independent analysis (Panzarasa et al., 2009). In prediction experiments (recall vs. cut point K), Aalen/Cox outperforms adaptive logistic regression.
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationAssessing the Goodness-of-Fit of Network Models
Assessing the Goodness-of-Fit of Network Models Mark S. Handcock Department of Statistics University of Washington Joint work with David Hunter Steve Goodreau Martina Morris and the U. Washington Network
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationStatistical NLP for the Web
Statistical NLP for the Web Neural Networks, Deep Belief Networks Sameer Maskey Week 8, October 24, 2012 *some slides from Andrew Rosenberg Announcements Please ask HW2 related questions in courseworks
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More informationi=1 h n (ˆθ n ) = 0. (2)
Stat 8112 Lecture Notes Unbiased Estimating Equations Charles J. Geyer April 29, 2012 1 Introduction In this handout we generalize the notion of maximum likelihood estimation to solution of unbiased estimating
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationStochastic blockmodeling of relational event dynamics
Christopher DuBois Carter T. Butts Padhraic Smyth Department of Statistics University of California, Irvine Department of Sociology Department of Statistics Institute for Mathematical and Behavioral Sciences
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationMining Triadic Closure Patterns in Social Networks
Mining Triadic Closure Patterns in Social Networks Hong Huang, University of Goettingen Jie Tang, Tsinghua University Sen Wu, Stanford University Lu Liu, Northwestern University Xiaoming Fu, University
More informationDynamic Approaches: The Hidden Markov Model
Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message
More informationDeep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationStatistical Model for Soical Network
Statistical Model for Soical Network Tom A.B. Snijders University of Washington May 29, 2014 Outline 1 Cross-sectional network 2 Dynamic s Outline Cross-sectional network 1 Cross-sectional network 2 Dynamic
More informationIV. Analyse de réseaux biologiques
IV. Analyse de réseaux biologiques Catherine Matias CNRS - Laboratoire de Probabilités et Modèles Aléatoires, Paris catherine.matias@math.cnrs.fr http://cmatias.perso.math.cnrs.fr/ ENSAE - 2014/2015 Sommaire
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:
More informationChaos, Complexity, and Inference (36-462)
Chaos, Complexity, and Inference (36-462) Lecture 21 Cosma Shalizi 3 April 2008 Models of Networks, with Origin Myths Erdős-Rényi Encore Erdős-Rényi with Node Types Watts-Strogatz Small World Graphs Exponential-Family
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationGenerative Models for Discrete Data
Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers
More informationBasic math for biology
Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationDynamic modeling of organizational coordination over the course of the Katrina disaster
Dynamic modeling of organizational coordination over the course of the Katrina disaster Zack Almquist 1 Ryan Acton 1, Carter Butts 1 2 Presented at MURI Project All Hands Meeting, UCI April 24, 2009 1
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationMulticlass Classification-1
CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass
More informationModeling heterogeneity in random graphs
Modeling heterogeneity in random graphs Catherine MATIAS CNRS, Laboratoire Statistique & Génome, Évry (Soon: Laboratoire de Probabilités et Modèles Aléatoires, Paris) http://stat.genopole.cnrs.fr/ cmatias
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationTopic Models and Applications to Short Documents
Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text
More informationApplying Latent Dirichlet Allocation to Group Discovery in Large Graphs
Lawrence Livermore National Laboratory Applying Latent Dirichlet Allocation to Group Discovery in Large Graphs Keith Henderson and Tina Eliassi-Rad keith@llnl.gov and eliassi@llnl.gov This work was performed
More informationChaos, Complexity, and Inference (36-462)
Chaos, Complexity, and Inference (36-462) Lecture 21: More Networks: Models and Origin Myths Cosma Shalizi 31 March 2009 New Assignment: Implement Butterfly Mode in R Real Agenda: Models of Networks, with
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationMaximum Smoothed Likelihood for Multivariate Nonparametric Mixtures
Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures David Hunter Pennsylvania State University, USA Joint work with: Tom Hettmansperger, Hoben Thomas, Didier Chauveau, Pierre Vandekerkhove,
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationCS Lecture 18. Topic Models and LDA
CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationDesign of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.
Design of Text Mining Experiments Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/research Active Learning: a flavor of design of experiments Optimal : consider
More informationOverview course module Stochastic Modelling
Overview course module Stochastic Modelling I. Introduction II. Actor-based models for network evolution III. Co-evolution models for networks and behaviour IV. Exponential Random Graph Models A. Definition
More informationDelayed Rejection Algorithm to Estimate Bayesian Social Networks
Dublin Institute of Technology ARROW@DIT Articles School of Mathematics 2014 Delayed Rejection Algorithm to Estimate Bayesian Social Networks Alberto Caimo Dublin Institute of Technology, alberto.caimo@dit.ie
More informationAlgorithmisches Lernen/Machine Learning
Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines
More informationHybrid Models for Text and Graphs. 10/23/2012 Analysis of Social Media
Hybrid Models for Text and Graphs 10/23/2012 Analysis of Social Media Newswire Text Formal Primary purpose: Inform typical reader about recent events Broad audience: Explicitly establish shared context
More informationDynamic Probabilistic Models for Latent Feature Propagation in Social Networks
Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Creighton Heaukulani and Zoubin Ghahramani University of Cambridge TU Denmark, June 2013 1 A Network Dynamic network data
More informationFast Maximum Likelihood estimation via Equilibrium Expectation for Large Network Data
Fast Maximum Likelihood estimation via Equilibrium Expectation for Large Network Data Maksym Byshkin 1, Alex Stivala 4,1, Antonietta Mira 1,3, Garry Robins 2, Alessandro Lomi 1,2 1 Università della Svizzera
More informationMassachusetts Institute of Technology
Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationCOMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017
COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University TOPIC MODELING MODELS FOR TEXT DATA
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationTopic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up
Much of this material is adapted from Blei 2003. Many of the images were taken from the Internet February 20, 2014 Suppose we have a large number of books. Each is about several unknown topics. How can
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts
ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationTopic Modelling and Latent Dirichlet Allocation
Topic Modelling and Latent Dirichlet Allocation Stephen Clark (with thanks to Mark Gales for some of the slides) Lent 2013 Machine Learning for Language Processing: Lecture 7 MPhil in Advanced Computer
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationCollaborative topic models: motivations cont
Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationLecture 8: Graphical models for Text
Lecture 8: Graphical models for Text 4F13: Machine Learning Joaquin Quiñonero-Candela and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/
More informationWeb Structure Mining Nodes, Links and Influence
Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More information