Latent Factor Models for Relational Data Peter Hoff Statistics, Biostatistics and Center for Statistics and the Social Sciences University of Washington
Outline Part 1: Multiplicative factor models for network data Relational data Models via exchangeability Models via matrix representations Examples: conflict data, protein interaction data Part 2: Extension to multiway data Multiway data Multiway latent factor models Example: Cold war cooperation and conflict Summary and future work
Relational data Relational data: consist of a set of units or nodes A, and a set of measurements Y {y i,j } specific to pairs of nodes (i, j) A A. Examples: International relations A =countries, y i,j = indicator of a dispute initiated by i with target j. Needle-sharing network A=IV drug users, y i,j = needle-sharing activity between i and j. Protein-protein interactions A=proteins, y i,j = the interaction between i and j. Document analysis A 1 =words, A 2 =documents, y i,j = wordcount of i in document j.
Inferential goals in the regression framework 0 Y = B @ y i,j measures i j, y 1,1 y 1,2 y 1,3 NA y 1,5 y 2,1 y 2,2 y 2,3 y 2,4 y 2,5 y 3,1 NA y 3,3 y 3,4 NA y 4,1 y 4,2 y 4,3 y 4,4 y 4,5..... x i,j is a vector of explanatory variables. 1 C A 0 X = B @ Consider a basic (generalized) linear model A model can provide y i,j β x i,j + e i,j a measure of the association between X and Y: x 1,1 x 1,2 x 1,3 x 1,4 x 1,5 x 2,1 x 2,2 x 2,3 x 2,4 x 2,5 x 3,1 x 3,2 x 3,3 x 3,4 x 3,5 x 4,1 x 4,2 x 4,3 x 4,4 x 4,5... ˆβ, se( ˆβ) predictions of missing or future observations: p(y 1,4 Y, X).. 1 C A
How might node heterogeneity affect network structure?
Latent variable models Deviations from ordinary regression models can be represented as y i,j β x i,j + e i,j A simple latent variable model might include row and column effects: e i,j = u i + v j + ɛ i,j y i,j β x i,j + u i + v j + ɛ i,j u i and v j induce across-node heterogeneity that is additive on the scale of the regressors. Inclusion of these effects in the model can dramatically improve within-sample model fit (measured by R 2, likelihood ratio, BIC, etc.); out-of-sample predictive performance (measured by cross-validation). But this model only captures heterogeneity of outdegree/indegree, and can t represent more complicated structure, such as clustering, transitivity, etc.
Latent variable models via exchangeability X represents known information about the nodes E represents deviations from the regression model In this case we might be willing to use a model for E in which {e i,j } d = {e gi,hj } for all permutations g and h. This is a type of exchangeability for arrays, sometimes called weak exchangeability. Theorem (Aldous, Hoover): Let {e i,j } be a weakly exchangeable array. Then {e i,j } = d {ei,j }, where e i,j = f (µ, u i, v j, ɛ i,j ) and µ, {u i }, {v j }, {ɛ i,j } are all independent random variables.
The singular value decomposition model E = M + E M represents systematic patterns and E represents noise. Every M has a representation of the form M = UDV where, in the case m n, Recall, U is an m n matrix with orthonormal columns; V is an n n matrix with orthonormal columns; D is an n n diagonal matrix, with diagonal elements {d 1,..., d n } typically taken to be a decreasing sequence of non-negative numbers. The squared elements of the diagonal of D are the eigenvalues of M M and the columns of V are the corresponding eigenvectors. The matrix U can be obtained from the first n eigenvectors of MM. The number of non-zero elements of D is the rank of M. Writing the row vectors as {u 1,..., u m }, {v 1,..., v n }, m i,j = u i Dv j.
Data analysis with the singular value decomposition Many data analysis procedures for matrix-valued data Y are related to the SVD. Given a model of the form Y = M + E where E is independent noise, the SVD provides Interpretation:y i,j = u i Dv j + ɛ i,j, u i and v j are the ith, jth rows of U, Estimation: ˆMK = Û[,1:K] ˆD [1:K,1:K] ˆV [,1:K] if M is assumed to be of rank K. Applications: biplots (Gabriel 1971, Gower and Hand 1996) reduced-rank interaction models (Gabriel 1978, 1998) analysis of relational data (Harshman et al., 1982) Factor analysis, image processing, data reduction,... Notes: How to select K? Given K, is ˆM K a good estimator? (E[Y Y] = M M + mσ 2 I ) Work on model-based factor analysis: Rajan and Rayner (1997), Minka (2000), Lopes and West (2004).
Generalized bilinear regression y i,j β x i,j + u idv j + ɛ i,j Interpretation: Think of {u 1,..., u m }, {v 1,..., v n } as vectors of latent nodal attributes: K u idv j = d k u i,k v j,k k=1 In general, a latent variable model relating X to Y is g(y i,j ) = β x i,j + u idv j + ɛ i,j and the parameters can be estimated using a rank-likelihood or multinomial probit. Alternatively, parametric models include If y i,j is binary, If y i,j is count data, If y i,j is continuous, log odds (y i,j = 1) = β x i,j + u i Dv j + ɛ i,j log E[y i,j ] = β x i,j + u i Dv j + ɛ i,j E[y i,j ] = β x i,j + u i Dv j + ɛ i,j Estimation: Given D, V, the predictor is linear in U. This bilinear structure can be exploited (EM, Gibbs sampling, variational methods).
International conflict network, 1990-2000 11 years of international relations data (Mike Ward and Xun Cao) y i,j = indicator of a militarized disputes initiated by i with target j; x i,j an 8-dimensional covariate vector containing an intercept and 1. population initiator 2. population target 3. polity score initiator 4. polity score target 5. polity score initiator polity score target 6. log distance 7. number of shared intergovernmental organizations Model: y i,j are independent binary random variables with log-odds log odds(y i,j = 1) = β x i,j + u idv j + ɛ i,j
International conflict network: parameter estimates log distance polity initiator intergov org polity target polity interaction log pop target log pop initiator 3 2 1 0 1 regression coefficient
International conflict network: description of latent variation AFG ANG ARG AUL BAH BNG BOT BUI CAN CAO CHA CHN COS CUB CYP DOM DRC EGY FRN GHA GRC GUI HON IND INS IRN IRQ ISR ITA JOR JPN KEN LBR LES LIB MAL MYA NIC NIG NIR NTH OMA PAK PHI PRK QAT ROK RWA SAF SAL SAU SEN SIE SRI SUD SWA SYR TAW TAZ THI TOG TUR UAE UGA UKG USA VEN AFG ALB ANG ARG AUL BAH BEL BEN BNG BUI CAM CAN CDI CHA CHL CHN COL CON COS CUB CYP DRC EGY FRN GHA GNB GRC GUI GUY HAI HON IND INS IRN IRQ ISR ITA JOR JPN LBR LES LIB MAA MLI MON MOR MYA MZM NAM NIC NIG NIR NTH OMA PAK PHI PNG PRK QAT ROK RWA SAF SAL SAU SEN SIE SIN SPN SUD SYR TAW TAZ THI TOG TRI TUR UAE UGA UKG USA VEN YEM ZAM ZIM
International conflict network: prediction experiment # of links found 0 20 40 60 80 100 K=0 K=1 K=2 K=3 0 20 40 60 80 100 # of pairs checked
Analyzing undirected data In many applications y i,j = y j,i by design, and so (y i,j = y j,i ) β x i,j + e i,j and E = {e i,j } is a symmetric array. How should E be modeled? Modeling via exchangeability: Let E be a symmetric array (with an undefined diagonal), such that {e i,j } d = {e gi,gj }. Then {e i,j } d = {e i,j }, where e i,j = f (µ, u i, u j, ɛ i,j ), µ, {u i }, {ɛ i,j } are all independent and f (, u i, u j, ) = f (, u j, u i, ). Modeling via matrix decomposition: Write E = M + E, with all matrices symmetric. All such M have an eigenvalue decomposition This suggests a model of the form M = UΛU m i,j = u iλu j (y i,j = y j,i ) β x i,j + u iλu j + ɛ i,j
Protein-protein interaction network Interactions among 270 proteins in E. coli (Butland, 2005). 1 Data: ( 270 i<j y i,j 0.01 2 ) Model: log odds(y i,j = 1) = µ + u i Λu j + ɛ i,j
Protein-protein interaction network 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 Positive Negative acca accc accd acpp aidb alas b1731 b2097 b2341 b2342 b2496 bola cada cbpa clpa clpp clpx cobb cspa cspb cspc cspd cspe dead dnaa dnab dnae dnaj dnak dnaq eno faba fabi fabz fba fdhd flda ftse ftsz gada gadb gida grea greb grpe gyra gyrb hepa hfq hima hns holb holc hold hole hsca hsdm hsdr hslu hslv hupa hupb hypc hypd ibpa infc iscs iscu kdsa ldcc lig lpxd lyss lysu malt metk mopa mopb mreb mukb muke mura narg narh narj nary ndk nfi nusa nusg panc parc pept pflb phes phet pnp pola ppa prsa pssa pstb purc pyka recb recd rece recj recq rfad rfbc rhlb rne rnha rnpa rpld rpoa rpob rpoc rpod rpoh rpon rpos rpoz seca secb selb smpb spot srmb ssb suca sucb tdcd tgt thdf topa topb tpia tsf tufa tufb usg uvrc vacb yacg yacl ybcj ybdq ybez ybjx ycby ycea ycec ycff ycil yeaz yedo yfcb yfgb yfid yfif ygdp ygfa yggh ygjd yhbu yhby yhhp yien yjef yjfq ylea ynhc ynhd ynhe accb aas fabb plsb acps fabf fabg glmu ybgc yiid ybbn dnan ydhd rho rplb rpli rplv rpsc rpsb rpsg narz clpb cspg rpsm rplc rple rplm rpsa rpsd rpse rpla rplx rpsj rpst cspi ygif rpls rplu rpsf rpsh rpsi rpsn rpso rpsp rpsr dnac dnax hola reca entb sapa b1742 ffh rplf gcpe ubix fusa cafa hrpa lon tig hlpa himd b1410 b1808 pria malp hyfg hyce ibpb rpll gaty mukf nei infb usha sbcb rplo rplt b2372 qor manx ycbb rpmb recg rplq yhir rplr yibl rplw yhbv fucu yabb rpsl
Protein-protein interaction network rpoa greb b2372 usg infb grea rpoc manx rpos rpon pola b1731 hepa nusa yacl rpod nusg rpob rpoz RNA polymerase Negative 0.6 0.4 0.2 0.0 0.2 0.4 0.6 0.85 0.90 0.95 1.00 Positive
Protein-protein interaction network: prediction experiment # of links found 0 50 100 150 0 50 100 150 200 250 300 350 # of pairs checked
Multiway data Data on pairs is sometimes called two-way data. More generally, data on triples, quadruples, etc. is called multi-way data. A 1, A 2,..., A p represent classes of objects. y i1,i 2,...,i p is the measurement specific to i 1 A 1,..., i p A p. Examples: International relations A 1 = A 2 = countries, A 3 = time, y i,j,t = indicator of a dispute between i and j in year t. Social networks A 1 = A 2 = individuals, A 3 = information sources, y i,j,k = source k s report of the relationship between i and j. Document analysis A 1 = A 2 = words, A 3 = documents, y i,j,k = co-occurrence of words i and j in document k.
Factor models for multiway data Recall the decomposition of a two-way array of rank R: m i,j = u idv j = Now generalize to a three-way array: m i,j,k = R u i,r v j,r d r r=1 R u i,r v j,r w k,r d r {u 1,..., u n1 } represents variation along the 1st dimension {v 1,..., v n2 } represents variation along the 2nd dimension {w 1,..., w n3 } represents variation along the 3rd dimension Consider the kth slab of M, which is an n 1 n 2 matrix: R m i,j,k = u i,r v j,r w k,r d r = r=1 r=1 R u i,r v j,r d k,r = u id k v j r=1 where d k,r = w k,r d k
Cold war cooperation and conflict data βtrade 0.2 0.0 0.2 0.4 βgdp 0.2 0.0 0.2 0.4 βdistance 0.2 0.0 0.2 0.4 1950 1960 1970 1980 1950 1960 1970 1980 1950 1960 1970 1980 λ1 0.5 0.0 0.5 1.0 λ2 0.5 0.0 0.5 1.0 λ3 0.5 0.0 0.5 1.0 1950 1960 1970 1980 1950 1960 1970 1980 1950 1960 1970 1980
Cold war cooperation and conflict data u2 1950 1960 1970 USR USR USR SYR CUB IRQ IRQ SYR FRN EGY CAM FRN EGY POR IND AFG AFG TUN CAM ALG EGY ZAM ECU NEW PHI TAW PRK IND AUL ICE PRK IND GUA RVN GRC DOM HAI MYA ROK HUN RUM DRV NEW AUL RVN PHI PRK TUR BUL ARG PER ROK BEL ETH CHL AUS JPN HUN DRV TUR LAO SOM NEP ROK LAO VEN PAK CHN THI PAK GUY INS CHN THI PAK CHN YUG SAL MOR IRN IRN ZIM CAN SEN USA JOR USA GUI JOR UKG ITA USA ISR UKGISR NOR NTH JOR UKG SAF ISR 1985 USR ANG CUB IRQ SYR CAM FRN LEB EGY AFG TUN IND NEW PHI PRK KUW BOT GRCCYP DRV LAO HON MLT PAN ROK TUR LBR BEL COS DEN ETH IRE BFO MAL MLI NIC SAU SOM SRI CZE SPN GDR MYA THI PAK CHN SAL IRN USA GFR LIB UKG ITA ISR NTH SAF u3 u1 USA 1950 ROK AUL NEW PHI UKG TUR ISR PAK HUN CANYUG TAW DOM BUL FRN GRC HAI RUM ECU MYA EGY IND PER AFG JOR SYRPRK CHN USR u1 USA 1960 NOR JPN PAK UKG TUR ROK RVN THI ISR NTH LAO ITA BEL NEP HUN ETH MOR SOM CHL ARG FRNTUN EGY IND AFG JOR ICE AUS IRN IRQ PRK CHN DRV CUB INS USR CAM USA u1 1970 UKG ROK RVN AULTHI ISR NEW LAO ZAM PHI IND CAM PAK SAF VEN GUY SEN GUA ALG GUI PORSAL EGY JOR IRN IRQ ZIM SYRPRK CHN DRV USR USA u1 1985 PAK UKG TUR KUW SAU ROK THI ISR MAL GFR HON NTH BEL SAF ITA NEW LAO MLT PAN LBR COS DEN ETH IRE BFO FRN GRC SOM MLI SRI LIB LEB TUN SAL SPN CYP MYA EGY PHI IND AFGBOTCZE GDR ANG IRN NIC IRQ SYRPRK CHN DRV CUB USR CAM u3 u1 USA 1950 PAK UKG TUR ISR ROK AUL HUN NEW CANYUGDOM BUL HAI RUM TAW MYA GRC PER ECU EGY FRN JOR PHI IND AFG CHN PRK SYR USR u1 USA 1960 NOR PAK UKG TUR JPN ISR NTH ROK RVN ITA THI LAO BEL HUN MOR NEP SOM CHL ARG ETH TUN EGY FRN JOR ICE IND AFG IRN CHN AUS IRQ INS DRV PRK CUB USR CAM u1 USA 1970 UKG PAK ISR SAF ROK RVN THI LAO AUL NEW ZAM GUI SEN GUY VEN SAL GUA ALG EGY POR JORZIM PHI IND IRN CHN IRQ DRV PRK SYR CAM USR u1 USA 1985 PAK UKG TUR KUW SAU ISR NTH SAF GFR MAL HON ITA THI ROK LAO BEL DEN COSNEW BFO ETH IRE MLT LBR PAN MLI LIB SAL SOM SRI GRC LEB SPN MYA CYP TUN EGY FRN CZE IND BOT PHI AFG IRN CHN GDR NIC IRQ ANG DRV PRK SYR CUB USR CAM u2 u2 u2 u2
Summary and future directions Summary: Latent factor models are a natural way to represent patterns in relational or array-structured data. The latent factor structure can be incorporated into a variety of model forms. Model-based methods Future Directions: give parameter estimates; accommodate missing data; provide predictions; are easy to extend. Dynamic network inference Generalization to multi-level models