Latent Factor Models for Relational Data
Peter Hoff
Statistics, Biostatistics and Center for Statistics and the Social Sciences
University of Washington
Outline

Part 1: Multiplicative factor models for network data
- Relational data
- Eigenvalue decomposition model
- Example: Adolescent social network
- Singular value decomposition model
- Example: International conflicts

Part 2: Extension to multiway data
- Multiway data
- Multiway latent factor models
- Example: Cold war cooperation and conflict

Summary and future work
Relational data

Relational data consist of a set of units or nodes A, and a set of measurements Y = {y_{i,j}} specific to pairs of nodes (i,j) ∈ A × A.

Examples:
- International relations: A = countries, y_{i,j} = indicator of a dispute initiated by i with target j
- Needle-sharing network: A = IV drug users, y_{i,j} = needle-sharing activity between i and j
- Protein-protein interactions: A = proteins, y_{i,j} = the interaction between i and j
- Document analysis: A_1 = words, A_2 = documents, y_{i,j} = count of word i in document j
Adolescent health social network

Data on 82 12th graders from a single high school: 54 boys, 28 girls.

    estimated Pr(y_{i,j} = 1 | same sex) = 0.077
    estimated Pr(y_{i,j} = 1 | opposite sex) = 0.056
Inferential goals in the regression framework

y_{i,j} measures i → j; x_{i,j} is a vector of explanatory variables.

    Y = [ y_{1,1}  y_{1,2}  y_{1,3}  NA       y_{1,5}        X = [ x_{1,1}  x_{1,2}  x_{1,3}  x_{1,4}  x_{1,5}
          y_{2,1}  y_{2,2}  y_{2,3}  y_{2,4}  y_{2,5}              x_{2,1}  x_{2,2}  x_{2,3}  x_{2,4}  x_{2,5}
          y_{3,1}  NA       y_{3,3}  y_{3,4}  NA                   x_{3,1}  x_{3,2}  x_{3,3}  x_{3,4}  x_{3,5}
          y_{4,1}  y_{4,2}  y_{4,3}  y_{4,4}  y_{4,5}              x_{4,1}  x_{4,2}  x_{4,3}  x_{4,4}  x_{4,5}
          ...                                        ]             ...                                        ]

Consider a basic (generalized) linear model:

    y_{i,j} ≈ β'x_{i,j} + e_{i,j}

A model can provide:
- a measure of the association between X and Y: β̂, se(β̂)
- imputations of missing observations: p(y_{1,4} | Y, X)
- a probabilistic description of network features: g(Ỹ), Ỹ ~ p(Ỹ | Y, X)
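As a concrete illustration of this setup (not from the slides; the data are synthetic and the coefficient value is invented), the dyadic outcomes and covariates can be vectorized over ordered pairs and fit by ordinary least squares to obtain β̂ and se(β̂):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
# hypothetical dyadic covariate and continuous outcome (illustration only)
x = rng.normal(size=(n, n))                       # x_{i,j}
beta_true = 0.5
y = beta_true * x + rng.normal(size=(n, n))       # y_{i,j} = beta * x_{i,j} + e_{i,j}

mask = ~np.eye(n, dtype=bool)                     # drop the diagonal (no self-ties)
X = np.column_stack([np.ones(mask.sum()), x[mask]])  # intercept + covariate
yv = y[mask]

# OLS estimate of beta and its standard errors
betahat, *_ = np.linalg.lstsq(X, yv, rcond=None)
resid = yv - X @ betahat
s2 = resid @ resid / (len(yv) - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(betahat, se)
```

A binary y_{i,j} would be handled the same way, swapping the OLS step for a logistic GLM as in the glm() fit on the next slide.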
Model fit

glm(formula = y ~ x, family = binomial(link = "logit"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.8332     0.1123  -25.24   <2e-16 ***
x             0.3471     0.1428    2.43   0.0151 *

This result says that a model with preferential association is a better description of the data than an i.i.d. binary model.

[Figure: two panels, distributions of the log odds ratio]
Nodal heterogeneity and independence assumptions
Model lack of fit

Neither of these models does well in terms of representing other features of the data, for example, transitivity:

    t(y) = Σ_{i<j<k} y_{i,j} y_{j,k} y_{k,i}

[Figure: two panels, distributions of the number of transitive triples]
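The transitivity statistic above is easy to compute directly. A minimal numpy sketch (synthetic adjacency matrix, for illustration only) uses the fact that for a symmetric binary network the sum over ordered triples equals tr(Y³), so the number of unordered transitive triples is tr(Y³)/6:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
# hypothetical symmetric binary adjacency matrix (illustration only)
A = (rng.random((n, n)) < 0.2).astype(int)
A = np.triu(A, 1)
A = A + A.T                     # symmetric, zero diagonal

# t(y) = sum_{i<j<k} y_{i,j} y_{j,k} y_{k,i} = trace(A^3) / 6
t = np.trace(np.linalg.matrix_power(A, 3)) // 6
print(t)
```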
Latent variable models

Deviations from ordinary regression models can be represented as

    y_{i,j} ≈ β'x_{i,j} + e_{i,j}

A simple latent variable model might include additive node effects:

    e_{i,j} = u_i + u_j + ɛ_{i,j}
    y_{i,j} ≈ β'x_{i,j} + u_i + u_j + ɛ_{i,j}

{u_1, ..., u_n} represent across-node heterogeneity that is additive on the scale of the regressors. Inclusion of these effects in the model can dramatically improve
- within-sample model fit (measured by R², likelihood ratio, BIC, etc.);
- out-of-sample predictive performance (measured by cross-validation).

But this model only captures heterogeneity of outdegree/indegree, and can't represent more complicated structure, such as clustering, transitivity, etc.
Fit of additive effects model

[Figure: two panels, log odds ratio and transitive triples]
Latent variable models via exchangeability

X represents known information about the nodes; E represents deviations from the regression model. In this case we might be willing to use a model for E in which

    {e_{i,j}} =d {e_{g(i),g(j)}} for all permutations g.

This is a type of exchangeability for arrays, sometimes called weak exchangeability.

Theorem (Aldous, Hoover): Let {e_{i,j}} be a weakly exchangeable array. Then {e_{i,j}} =d {ẽ_{i,j}}, where

    ẽ_{i,j} = f(u_i, u_j, ɛ_{i,j})

and {u_i}, {ɛ_{i,j}} are all independent random variables.
The eigenvalue decomposition model

    E = M + Ẽ

M represents systematic patterns and Ẽ represents noise. Every symmetric M has a representation of the form M = UΛU', where
- U is an n × n matrix with orthonormal columns;
- Λ is an n × n diagonal matrix, with elements {λ_1, ..., λ_n}.

Many data analysis procedures for symmetric matrix-valued data Y are related to this decomposition. Given a model of the form Y = M + E where E is independent noise, the ED provides
- Interpretation: y_{i,j} = u_i'Λu_j + ɛ_{i,j}, where u_i and u_j are the ith and jth rows of U;
- Estimation: M̂_R = Û_{[,1:R]} Λ̂_{[1:R,1:R]} Û'_{[,1:R]} if M is assumed to be of rank R.
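A small numpy sketch of the estimation step (synthetic symmetric signal plus noise; the choice to retain the R eigenvalues largest in magnitude is an assumption of this illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, R = 50, 3
# hypothetical rank-R symmetric signal plus noise (illustration only)
F = rng.normal(size=(n, R))
M = F @ np.diag([5.0, -4.0, 3.0]) @ F.T           # symmetric, rank R
Y = M + 0.1 * rng.normal(size=(n, n))
Y = (Y + Y.T) / 2                                  # keep Y symmetric

# eigendecomposition of Y; keep the R eigenvalues largest in magnitude
lam, V = np.linalg.eigh(Y)
top = np.argsort(-np.abs(lam))[:R]
Mhat = V[:, top] @ np.diag(lam[top]) @ V[:, top].T
print(np.linalg.norm(Mhat - M) / np.linalg.norm(M))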
Generalized bilinear regression

    y_{i,j} ≈ β'x_{i,j} + u_i'Λu_j + ɛ_{i,j}

Interpretation: Think of {u_1, ..., u_n} as vectors of latent nodal attributes:

    u_i'Λu_j = Σ_{r=1}^R λ_r u_{i,r} u_{j,r}

In general, a latent variable model relating X to Y is

    g(y_{i,j}) = β'x_{i,j} + u_i'Λv_j + ɛ_{i,j}

and the parameters can be estimated using a rank likelihood or multinomial probit. Alternatively, parametric models include:
- If y_{i,j} is binary:     log odds(y_{i,j} = 1) = β'x_{i,j} + u_i'Λu_j + ɛ_{i,j}
- If y_{i,j} is count data: log E[y_{i,j}] = β'x_{i,j} + u_i'Λu_j + ɛ_{i,j}
- If y_{i,j} is continuous: E[y_{i,j}] = β'x_{i,j} + u_i'Λu_j + ɛ_{i,j}

Estimation: Given Λ, the predictor is linear in U. This bilinear structure can be exploited (EM, Gibbs sampling).
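To make the binary version concrete, here is a minimal simulation sketch (all dimensions and parameter values are invented for illustration); it also verifies that the bilinear term u_i'Λu_j equals Σ_r λ_r u_{i,r} u_{j,r}:

```python
import numpy as np

rng = np.random.default_rng(3)
n, R = 30, 2
beta = 0.4
lam = np.array([1.5, -1.0])          # diagonal of Lambda (hypothetical values)
x = rng.normal(size=(n, n))          # dyadic covariate x_{i,j}
U = rng.normal(size=(n, R))          # latent nodal attribute vectors u_i

# bilinear term: u_i' Lambda u_j = sum_r lam_r u_{i,r} u_{j,r}
bilin = np.einsum('ir,r,jr->ij', U, lam, U)

# binary model: log odds(y_{i,j} = 1) = beta * x_{i,j} + u_i' Lambda u_j
p = 1.0 / (1.0 + np.exp(-(beta * x + bilin)))
Y = (rng.random((n, n)) < p).astype(int)
```

Note that the bilinear term is symmetric by construction, which is why the same decomposition can capture transitivity: nodes with similar u vectors have similar patterns of ties.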
Eigenmodel fit

This model can be fit with the eigenmodel package in R: eigenmodel_mcmc(y, x, R=3)

[Figure: two panels, log odds ratio and transitive triples]

The latent factors are able to represent the network transitivity.
Underlying structure
Missing variables

The eigenmodel, without having explicit race information, captures a large degree of the racial homophily in friendship:

[Figure: log odds ratio under the additive model vs. under the eigenmodel]
Multiway data

Data on pairs is sometimes called two-way data. More generally, data on triples, quadruples, etc. is called multiway data.

A_1, A_2, ..., A_p represent classes of objects; y_{i_1,i_2,...,i_p} is the measurement specific to i_1 ∈ A_1, ..., i_p ∈ A_p.

Examples:
- International relations: A_1 = A_2 = countries, A_3 = time, y_{i,j,t} = indicator of a dispute between i and j in year t.
- Social networks: A_1 = A_2 = individuals, A_3 = information sources, y_{i,j,k} = source k's report of the relationship between i and j.
- Document analysis: A_1 = A_2 = words, A_3 = documents, y_{i,j,k} = co-occurrence of words i and j in document k.
Factor models for multiway data

Recall the decomposition of a two-way array of rank R:

    m_{i,j} = u_i'Λu_j = Σ_{r=1}^R u_{i,r} u_{j,r} λ_r

Now generalize to a three-way array:

    m_{i,j,k} = Σ_{r=1}^R u_{i,r} u_{j,r} w_{k,r} λ_r

{u_1, ..., u_n} represents variation among the nodes; {w_1, ..., w_m} represents variation across the networks.

Consider the kth slab of M, which is an n × n matrix:

    m_{i,j,k} = Σ_{r=1}^R u_{i,r} u_{j,r} w_{k,r} λ_r = Σ_{r=1}^R u_{i,r} u_{j,r} λ_{k,r} = u_i'Λ_k u_j

where λ_{k,r} = w_{k,r} λ_r.
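A numpy sketch of the slab identity above (synthetic factors, with dimensions chosen arbitrarily): the kth slab of the three-way array equals the bilinear form with rescaled diagonal λ_{k,r} = w_{k,r} λ_r:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, R = 15, 6, 3
U = rng.normal(size=(n, R))          # node factors u_i
W = rng.normal(size=(m, R))          # network/slab factors w_k
lam = np.array([2.0, 1.0, 0.5])      # hypothetical values of lambda_r

# m_{i,j,k} = sum_r u_{i,r} u_{j,r} w_{k,r} lam_r
M = np.einsum('ir,jr,kr,r->ijk', U, U, W, lam)

# the k-th slab is u_i' Lambda_k u_j with lam_{k,r} = w_{k,r} lam_r
k = 2
Lam_k = np.diag(W[k] * lam)
slab = U @ Lam_k @ U.T
```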
Cold war cooperation and conflict data

[Figure: time series (1950–1980) of estimates of β_trade, β_gdp, β_distance and of λ_1, λ_2, λ_3]
Cold war cooperation and conflict data

[Figure: estimated latent factors (u_1 vs. u_2, u_1 vs. u_3), labeled by country code, for 1950, 1960, 1970, and 1985]
Summary and future directions

Summary:
- Latent factor models are a natural way to represent patterns in relational or array-structured data.
- The latent factor structure can be incorporated into a variety of model forms.
- Model-based methods give parameter estimates; accommodate missing data; provide predictions; are easy to extend.

Future directions:
- Dynamic network inference
- Generalization to multi-level models
The SVD model for asymmetric data

    E = M + Ẽ

M represents systematic patterns and Ẽ represents noise. Recall that every M has a representation of the form M = UDV' where, in the case m ≥ n,
- U is an m × n matrix with orthonormal columns;
- V is an n × n matrix with orthonormal columns;
- D is an n × n diagonal matrix, with diagonal elements {d_1, ..., d_n} typically taken to be a decreasing sequence of non-negative numbers.

The squared elements of the diagonal of D are the eigenvalues of M'M, and the columns of V are the corresponding eigenvectors. The matrix U can be obtained from the first n eigenvectors of MM'. The number of non-zero elements of D is the rank of M. Writing the row vectors as {u_1, ..., u_m} and {v_1, ..., v_n},

    m_{i,j} = u_i'Dv_j.
Data analysis with the singular value decomposition

Many data analysis procedures for matrix-valued data Y are related to the SVD. Given a model of the form Y = M + E where E is independent noise, the SVD provides
- Interpretation: y_{i,j} = u_i'Dv_j + ɛ_{i,j}, where u_i and v_j are the ith and jth rows of U and V;
- Estimation: M̂_R = Û_{[,1:R]} D̂_{[1:R,1:R]} V̂'_{[,1:R]} if M is assumed to be of rank R.

Applications:
- biplots (Gabriel 1971, Gower and Hand 1996)
- reduced-rank interaction models (Gabriel 1978, 1998)
- analysis of relational data (Harshman et al. 1982)
- factor analysis, image processing, data reduction, ...

Notes:
- How to select R?
- Given R, is M̂_R a good estimator? (E[Y'Y] = M'M + mσ²I)
- Work on model-based factor analysis: Rajan and Rayner (1997), Minka (2000), Lopes and West (2004).
International conflict network, 1990-2000

11 years of international relations data (Mike Ward and Xun Cao):
- y_{i,j} = indicator of a militarized dispute initiated by i with target j;
- x_{i,j} = an 8-dimensional covariate vector containing an intercept and
  1. population, initiator
  2. population, target
  3. polity score, initiator
  4. polity score, target
  5. polity score initiator × polity score target
  6. log distance
  7. number of shared intergovernmental organizations

Model: y_{i,j} are independent binary random variables with log-odds

    log odds(y_{i,j} = 1) = β'x_{i,j} + u_i'Dv_j + ɛ_{i,j}
International conflict network: parameter estimates

[Figure: regression coefficient estimates for log distance, polity initiator, intergov org, polity target, polity interaction, log pop target, log pop initiator]
International conflict network: description of latent variation

[Figure: two panels of estimated latent factors, labeled by country code]
International conflict network: prediction experiment

[Figure: number of links found vs. number of pairs checked, for K = 0, 1, 2, 3]
Factor models for multiway data

Recall the decomposition of a two-way array of rank R:

    m_{i,j} = u_i'Dv_j = Σ_{r=1}^R u_{i,r} v_{j,r} d_r

Now generalize to a three-way array:

    m_{i,j,k} = Σ_{r=1}^R u_{i,r} v_{j,r} w_{k,r} d_r

{u_1, ..., u_{n_1}} represents variation along the 1st dimension; {v_1, ..., v_{n_2}} represents variation along the 2nd dimension; {w_1, ..., w_{n_3}} represents variation along the 3rd dimension.

Consider the kth slab of M, which is an n_1 × n_2 matrix:

    m_{i,j,k} = Σ_{r=1}^R u_{i,r} v_{j,r} w_{k,r} d_r = Σ_{r=1}^R u_{i,r} v_{j,r} d_{k,r} = u_i'D_k v_j

where d_{k,r} = w_{k,r} d_r.