Latent Factor Models for Relational Data
1 Latent Factor Models for Relational Data
Peter Hoff
Statistics, Biostatistics and Center for Statistics and the Social Sciences, University of Washington
2 Outline
Part 1: Multiplicative factor models for network data
1. Relational data
2. Models via exchangeability
3. Models via matrix representations
4. Examples: (a) international conflict data; (b) protein-protein interaction data
Part 2: Rank selection with the SVD model
1. A Gibbs sampling scheme for rank estimation
2. A simulation study
3 Relational data
Relational data consist of a set of units or nodes A, and a set of measurements Y = {y_{i,j}} specific to pairs of nodes (i, j) ∈ A × A.
Examples:
- International relations: A = countries, y_{i,j} = indicator of a dispute initiated by i with target j.
- Needle-sharing network: A = IV drug users, y_{i,j} = needle-sharing activity between i and j.
- Protein-protein interactions: A = proteins, y_{i,j} = the interaction between i and j.
- Business locations: A_1 = banks, A_2 = cities, y_{i,j} = presence of an office of bank i in city j.
4 Inferential goals in the regression framework
y_{i,j} measures i → j, and x_{i,j} is a vector of explanatory variables:

Y = [ y_{1,1} y_{1,2} y_{1,3} NA      y_{1,5}
      y_{2,1} y_{2,2} y_{2,3} y_{2,4} y_{2,5}
      y_{3,1} NA      y_{3,3} y_{3,4} NA
      y_{4,1} y_{4,2} y_{4,3} y_{4,4} y_{4,5} ]

X = [ x_{1,1} x_{1,2} x_{1,3} x_{1,4} x_{1,5}
      x_{2,1} x_{2,2} x_{2,3} x_{2,4} x_{2,5}
      x_{3,1} x_{3,2} x_{3,3} x_{3,4} x_{3,5}
      x_{4,1} x_{4,2} x_{4,3} x_{4,4} x_{4,5} ]

Consider a basic (generalized) linear model:
y_{i,j} ≈ β′x_{i,j} + ε_{i,j}
A model can provide
- a measure of the association between X and Y: β̂, se(β̂);
- predictions of missing or future observations: p(y_{1,4} | Y, X).
5 How might node heterogeneity affect inference?
6 Latent variable models
Deviations from ordinary regression models can be represented as
y_{i,j} ≈ β′x_{i,j} + z_{i,j}
A simple latent variable model might include row and column effects:
z_{i,j} = u_i + v_j + ε_{i,j}, so that y_{i,j} ≈ β′x_{i,j} + u_i + v_j + ε_{i,j}
u_i and v_j induce across-node heterogeneity that is additive on the scale of the regressors. Inclusion of these effects in the model can dramatically improve
- within-sample model fit (measured by R², likelihood ratio, BIC, etc.);
- out-of-sample predictive performance (measured by cross-validation).
But this model only captures heterogeneity of outdegree/indegree, and can't represent more complicated structure, such as clustering, transitivity, etc. (A least-squares sketch of the row-and-column-effects fit follows.)
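To make the row-and-column-effects model concrete, here is a small simulation-and-fit sketch (mine, not from the talk), assuming Gaussian errors and no covariates:

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 20
u = rng.normal(0, 1, m)              # row (sender) effects
v = rng.normal(0, 1, n)              # column (receiver) effects
Y = u[:, None] + v[None, :] + rng.normal(0, 0.5, (m, n))

# least-squares estimates (identified only up to additive constants):
mu_hat = Y.mean()
u_hat = Y.mean(axis=1) - mu_hat      # centered row means
v_hat = Y.mean(axis=0) - mu_hat      # centered column means
print(np.corrcoef(u - u.mean(), u_hat)[0, 1])   # close to 1
```

The row and column means recover u and v up to additive constants, which is the usual identifiability caveat for this model.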
7 Latent variable models via exchangeability
X represents known information about the nodes; Z represents any additional patterns. In this case we might be willing to use a model in which
{z_{i,j}} =_d {z_{g(i),h(j)}} for all permutations g and h.
This is a type of exchangeability for arrays, sometimes called weak exchangeability.
Theorem (Aldous, Hoover): Let {z_{i,j}} be a weakly exchangeable array. Then {z_{i,j}} =_d {z*_{i,j}}, where
z*_{i,j} = f(μ, u_i, v_j, ε_{i,j})
and μ, {u_i}, {v_j}, {ε_{i,j}} are all independent random variables.
8 The singular value decomposition model
Z = M + E
M represents systematic patterns in the effects and E represents noise. Recall, every M has a representation of the form M = UDV′ where, in the case m ≥ n,
- U is an m × n matrix with orthonormal columns;
- V is an n × n matrix with orthonormal columns;
- D is an n × n diagonal matrix, with elements {d_1, ..., d_n} typically taken to be a decreasing sequence of non-negative numbers.
The squared elements of the diagonal of D are the eigenvalues of M′M, and the columns of V are the corresponding eigenvectors. The matrix U can be obtained from the first n eigenvectors of MM′. The number of non-zero elements of D is the rank of M. Writing the row vectors as {u_1, ..., u_m} and {v_1, ..., v_n},
m_{i,j} = u_i′ D v_j.
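As a quick numerical companion (not part of the slides), the decomposition and its rank-K truncation can be checked with numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K = 30, 10, 3
M = rng.normal(size=(m, K)) @ rng.normal(size=(K, n))   # a rank-K matrix

U, d, Vt = np.linalg.svd(M, full_matrices=False)         # M = U @ diag(d) @ Vt
print(np.allclose(M, U @ np.diag(d) @ Vt))               # True
print(np.round(d, 6))                                    # only K nonzero singular values

# keeping the first K terms gives the best rank-K approximation:
M_K = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]
print(np.allclose(M, M_K))                               # True, since rank(M) = K
```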
9 Multiplicative effects
y_{i,j} ≈ β′x_{i,j} + u_i′ D v_j + ε_{i,j}
Interpretation: think of {u_1, ..., u_m}, {v_1, ..., v_n} as vectors of latent nodal attributes:
u_i′ D v_j = Σ_{k=1}^K d_k u_{i,k} v_{j,k}
In general, a latent variable model relating X to Y is
g(E[y_{i,j}]) = β′x_{i,j} + u_i′ D v_j + ε_{i,j}
For example, some potential models are:
- if y_{i,j} is binary: log odds(y_{i,j} = 1) = β′x_{i,j} + u_i′ D v_j + ε_{i,j};
- if y_{i,j} is count data: log E[y_{i,j}] = β′x_{i,j} + u_i′ D v_j + ε_{i,j};
- if y_{i,j} is continuous: E[y_{i,j}] = β′x_{i,j} + u_i′ D v_j + ε_{i,j}.
Estimation: given D and V, the predictor is linear in U. This bilinear structure can be exploited (EM, Gibbs sampling, variational methods); a least-squares sketch of such an alternating scheme follows.
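The alternating scheme below is a hedged illustration of how the bilinear structure can be exploited; it is plain least squares for the Gaussian, no-covariate case, standing in for the EM/Gibbs/variational methods the talk actually uses:

```python
import numpy as np

def als_bilinear(Y, K, n_iter=100):
    """Fit Y ~ A @ B.T with A (m x K), B (n x K) by alternating least squares."""
    m, n = Y.shape
    rng = np.random.default_rng(2)
    A = rng.normal(size=(m, K))
    B = rng.normal(size=(n, K))
    for _ in range(n_iter):
        # given B, each row of A solves an ordinary least-squares problem
        A = Y @ B @ np.linalg.inv(B.T @ B)
        # given A, the same holds for each row of B
        B = Y.T @ A @ np.linalg.inv(A.T @ A)
    return A, B   # U, D, V are recoverable from the SVD of A @ B.T

Y = np.random.default_rng(3).normal(size=(40, 30))
A, B = als_bilinear(Y, K=2)
```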
10 International conflict network
Years of international relations data (thanks to Mike Ward and Xun Cao).
y_{i,j} = indicator of a militarized dispute initiated by i with target j;
x_{i,j} is an 8-dimensional covariate vector containing an intercept and
1. population of initiator
2. population of target
3. polity score of initiator
4. polity score of target
5. polity score of initiator × polity score of target
6. log distance
7. number of shared intergovernmental organizations
Model: the y_{i,j} are independent binary random variables with log-odds
log odds(y_{i,j} = 1) = β′x_{i,j} + u_i′ D v_j + ε_{i,j}
11 International conflict network: parameter estimates
[Figure: regression coefficient estimates for log distance, polity initiator, intergovernmental organizations, polity target, polity interaction, log population target, and log population initiator.]
12 International conflict network: description of latent variation
[Figure: two panels of estimated latent positions, with points labeled by three-letter country codes (AFG, ANG, ARG, ..., USA, VEN in the first panel; AFG, ALB, ..., ZAM, ZIM in the second).]
13 International conflict network: prediction experiment
[Figure: number of links found versus number of pairs checked, for models with K = 0, 1, 2, ...]
14 Model details: prior distribution for U
Parameters to estimate include the diagonal matrix D and
U ∈ V_{K,m} = {m × K matrices with orthonormal columns}
V ∈ V_{K,n} = {n × K matrices with orthonormal columns}
A constructive definition for p(U):
1. z_1 ~ uniform[m-sphere]; set U_{[,1]} = z_1;
2. z_2 ~ uniform[(m−1)-sphere]; set U_{[,2]} = Null(U_{[,1]}) z_2;
...
K. z_K ~ uniform[(m−K+1)-sphere]; set U_{[,K]} = Null(U_{[,1]}, ..., U_{[,K−1]}) z_K.
The resulting distribution for U is the uniform (invariant) distribution on the Stiefel manifold, and is exchangeable under row and column permutations.
Prior conditional distributions:
(U_{[,j]} | U_{[,−j]}) =_d Null(U_{[,1]}, ..., U_{[,j−1]}, U_{[,j+1]}, ..., U_{[,K]}) z_j, with z_j ~ uniform[(m−K+1)-sphere].
A numpy sketch of this constructive sampler follows.
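Here is that sketch (illustrative; the helper names are mine), using a complete QR factorization to obtain the null-space basis at each step:

```python
import numpy as np

def runif_sphere(p, rng):
    """Uniform draw from the unit sphere in R^p (normalized Gaussian)."""
    z = rng.standard_normal(p)
    return z / np.linalg.norm(z)

def runif_stiefel(m, K, rng):
    """Column-by-column uniform draw from the Stiefel manifold V_{K,m}."""
    U = np.zeros((m, K))
    for j in range(K):
        if j == 0:
            N = np.eye(m)   # the null space of nothing is all of R^m
        else:
            # orthonormal basis for the orthogonal complement of span(U[:, :j])
            Q, _ = np.linalg.qr(U[:, :j], mode="complete")
            N = Q[:, j:]
        U[:, j] = N @ runif_sphere(m - j, rng)
    return U

U = runif_stiefel(10, 3, np.random.default_rng(4))
print(np.allclose(U.T @ U, np.eye(3)))   # orthonormal columns
```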
15 Estimation details: posterior distribution for U
UDV′ = d_1 U_{[,1]} V_{[,1]}′ + ... + d_K U_{[,K]} V_{[,K]}′, so for any column j
Z − UDV′ = (Z − U_{[,−j]} D_{[−j,−j]} V_{[,−j]}′) − d_j U_{[,j]} V_{[,j]}′ ≡ R_j − d_j U_{[,j]} V_{[,j]}′
||Z − UDV′||² = ||R_j − d_j U_{[,j]} V_{[,j]}′||²
             = ||R_j||² − 2 d_j U_{[,j]}′ R_j V_{[,j]} + d_j² ||U_{[,j]} V_{[,j]}′||²
             = ||R_j||² − 2 d_j U_{[,j]}′ R_j V_{[,j]} + d_j²
Recall Z = UDV′ + E, where E is normally distributed noise with variance 1/φ:
p(Z | U, V, D, φ) = (2π/φ)^{−mn/2} exp{−½ φ ||R_j||² + φ d_j U_{[,j]}′ R_j V_{[,j]} − ½ φ d_j²}
p(U_{[,j]} | Z, U_{[,−j]}, V, D) ∝ p(U_{[,j]} | U_{[,−j]}) exp{φ d_j U_{[,j]}′ R_j V_{[,j]}}
(U_{[,j]} | U_{[,−j]}, Z, V, D) =_d Null(U_{[,1]}, ..., U_{[,j−1]}, U_{[,j+1]}, ..., U_{[,K]}) z_j ≡ N_j z_j, where
p(z_j) ∝ exp{z_j′ μ}, with μ = φ d_j N_j′ R_j V_{[,j]}
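The quantities in this full conditional can be assembled as below (a hypothetical helper; drawing z_j from the resulting von Mises-Fisher distribution is sketched after the next slide):

```python
import numpy as np

def vmf_fullcond_params(Z, U, V, d, phi, j):
    """Parameters (N_j, mu) of the full conditional for column j of U."""
    K = U.shape[1]
    others = [k for k in range(K) if k != j]
    # residual after removing every term except the j-th
    R_j = Z - U[:, others] @ np.diag(d[others]) @ V[:, others].T
    # orthonormal basis N_j for the null space of the other columns of U
    Q, _ = np.linalg.qr(U[:, others], mode="complete")
    N_j = Q[:, K - 1:]
    # z_j has density proportional to exp(z' mu) on the (m-K+1)-sphere
    mu = phi * d[j] * N_j.T @ R_j @ V[:, j]
    return N_j, mu
```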
16 The von Mises-Fisher distribution
p(z | μ) = c(κ) exp{z′μ}, where κ = ||μ||
[Figure: von Mises-Fisher densities for κ = 0, 2, 4.]
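For completeness, here is a sketch of Wood's (1994) rejection sampler for this distribution, a standard algorithm but not one shown in the talk:

```python
import numpy as np

def rvmf(mu, rng):
    """One draw from the von Mises-Fisher density c(kappa) exp(z' mu) on the sphere."""
    kappa = np.linalg.norm(mu)
    p = mu.size
    if kappa == 0:                        # kappa = 0 is uniform on the sphere
        z = rng.standard_normal(p)
        return z / np.linalg.norm(z)
    # Wood (1994): rejection sampler for the cosine w = z' mu / kappa
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + (p - 1) ** 2)) / (p - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + (p - 1) * np.log(1 - x0**2)
    while True:
        beta = rng.beta((p - 1) / 2, (p - 1) / 2)
        w = (1 - (1 + b) * beta) / (1 - (1 - b) * beta)
        if kappa * w + (p - 1) * np.log(1 - x0 * w) - c >= np.log(rng.uniform()):
            break
    # uniform direction orthogonal to mu, then combine with the sampled cosine
    mu_hat = mu / kappa
    v = rng.standard_normal(p)
    v -= (v @ mu_hat) * mu_hat
    v /= np.linalg.norm(v)
    return w * mu_hat + np.sqrt(1 - w**2) * v
```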
17 Analyzing undirected data
In many applications y_{i,j} = y_{j,i} by design, so
(y_{i,j} = y_{j,i}) ≈ β′x_{i,j} + z_{i,j},
and Z = {z_{i,j}} is a symmetric array. How should Z be modeled?
Modeling via exchangeability: let Z be a symmetric array (with an undefined diagonal), such that {z_{i,j}} =_d {z_{g(i),g(j)}}. Then {z_{i,j}} =_d {z*_{i,j}}, where
z*_{i,j} = f(μ, u_i, u_j, ε_{i,j}),
μ, {u_i}, {ε_{i,j}} are independent random variables, and f(·, u_i, u_j, ·) = f(·, u_j, u_i, ·).
Modeling via matrix decomposition: write Z = M + E, with all matrices symmetric. All such M have an eigenvalue decomposition
M = UΛU′, with m_{i,j} = u_i′ Λ u_j.
This suggests a model of the form
(y_{i,j} = y_{j,i}) ≈ β′x_{i,j} + u_i′ Λ u_j + ε_{i,j}
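A small illustration (again mine, not the talk's) of the symmetric analogue: truncating the eigendecomposition of a noisy symmetric matrix. Note that Λ, unlike D in the SVD, may have negative entries:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 50, 2
U = rng.normal(size=(n, K))
M = U @ np.diag([3.0, -2.0]) @ U.T       # symmetric, rank 2; Lambda can be negative
Z = M + rng.normal(size=(n, n))
Z = (Z + Z.T) / 2                         # symmetrize the noise

lam, V = np.linalg.eigh(Z)                # eigendecomposition of a symmetric matrix
top = np.argsort(np.abs(lam))[-K:]        # keep the K largest-magnitude eigenvalues
M_hat = V[:, top] @ np.diag(lam[top]) @ V[:, top].T
```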
18 Protein-protein interaction network
Interactions among 270 proteins in E. coli (Butland, 2005).
Data: (270 choose 2)^{-1} Σ_{i<j} y_{i,j} ≈ 0.01, i.e., about 1% of pairs interact.
Model: log odds(y_{i,j} = 1) = μ + u_i′ Λ u_j + ε_{i,j} (analysis by Ryan Bressler)
19 Protein-protein interaction network
[Figure: graph of estimated positive and negative interactions, with nodes labeled by gene names (acca, accc, accd, acpp, ..., rpsl).]
20 Protein-protein interaction network
[Figure: detail of the RNA polymerase cluster, showing positive and negative interactions among rpoa, rpob, rpoc, rpod, rpoz, rpos, rpon, nusa, nusg, grea, greb, and others.]
21 Protein-protein interaction network: prediction experiment
[Figure: number of links found versus number of pairs checked.]
22 Data analysis with the singular value decomposition
Many data analysis procedures for matrix-valued data Y are related to the SVD. Given a model of the form Y = M + E, where E is independent normal noise, the SVD provides:
Interpretation: y_{i,j} = u_i′ D v_j + ε_{i,j}, where u_i and v_j are the ith and jth rows of U and V.
Estimation: M̂_K = Û_{[,1:K]} D̂_{[1:K,1:K]} V̂_{[,1:K]}′, if M is assumed to be of rank K.
Applications:
- biplots (Gabriel 1971; Gower and Hand 1996)
- reduced-rank interaction models (Gabriel 1978, 1998)
- analysis of relational data (Harshman et al. 1982)
- factor analysis, image processing, data reduction, ...
But: how should K be selected? And given K, is M̂_K a good estimator? (Note that E[Y′Y] = M′M + mσ²I; a numerical check of this identity follows.)
Bayesian work on factor analysis models: Rajan and Rayner (1997), Minka (2000), Lopes and West (2004).
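The identity E[Y′Y] = M′M + mσ²I, which is one reason the naive truncated SVD can mislead, is easy to check by simulation (illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, sigma = 200, 5, 1.5
M = rng.normal(size=(m, n))

# average Y'Y over many noise draws and compare to M'M + m * sigma^2 * I
S = np.zeros((n, n))
reps = 2000
for _ in range(reps):
    Y = M + sigma * rng.normal(size=(m, n))
    S += Y.T @ Y
print(np.round(S / reps - (M.T @ M + m * sigma**2 * np.eye(n)), 1))  # near zero
```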
23 A toy example
Y = UDV′ + E = Σ_k d_k U_{[,k]} V_{[,k]}′ + E
[Figure: singular values of a simulated Y.]
24 A toy example, continued
[Figure: singular values of a simulated Y.]
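A hypothetical reconstruction of such a toy example: simulate Y = UDV′ + E with a rank-5 signal and inspect the singular values, which show the five signal values standing above a noise plateau near √m + √n:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, K = 100, 100, 5
U, _ = np.linalg.qr(rng.normal(size=(m, K)))   # orthonormal factors
V, _ = np.linalg.qr(rng.normal(size=(n, K)))
d = np.array([60, 50, 40, 35, 30.0])
Y = U @ np.diag(d) @ V.T + rng.normal(size=(m, n))

sv = np.linalg.svd(Y, compute_uv=False)
print(np.round(sv[:8], 1))   # 5 large values, then a plateau near sqrt(m) + sqrt(n)
```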
25 Evaluating the dimension
It can all be done by comparing
M_1: Y = d u v′ + E versus M_0: Y = E.
To decide whether or not to add a dimension, we need to calculate
p(M_1 | Y) / p(M_0 | Y) = [p(M_1) / p(M_0)] × [p(Y | M_1) / p(Y | M_0)]
The numerator of the Bayes factor is the hard part. We need to integrate the following with respect to the prior distribution:
p(Y | M_1, u, v, d) = (2π/φ)^{−nm/2} exp{−φ||Y||²/2 + φ d u′Yv − φ d²/2} = p(Y | M_0) exp{−φ d²/2} exp{u′[dφY]v}
Computing E[e^{u′Av}] over u ∈ S_m and v ∈ S_n is difficult but doable. It involves the integral of a Bessel function and a weird hypergeometric function. These results provide a description of a joint distribution representing bivariate dependence among points on spheres:
p(u, v | A) = c(A) exp{u′Av}
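As a crude alternative to the analytic integral, the Bayes factor term can be approximated by naive Monte Carlo over the priors; the sketch below assumes a particular uniform prior on d (an assumption, not from the talk) and will be high-variance in practice:

```python
import numpy as np

def mc_bayes_factor_term(Y, phi=1.0, n_mc=5000, rng=None):
    """Crude MC estimate of E[exp(-phi d^2/2) exp(u' (d phi Y) v)] over the prior."""
    rng = rng or np.random.default_rng(8)
    m, n = Y.shape
    total = 0.0
    for _ in range(n_mc):
        u = rng.standard_normal(m); u /= np.linalg.norm(u)   # uniform on S_m
        v = rng.standard_normal(n); v /= np.linalg.norm(v)   # uniform on S_n
        d = rng.uniform(0, 2 * (np.sqrt(m) + np.sqrt(n)))    # assumed prior on d
        total += np.exp(-phi * d**2 / 2 + phi * d * u @ Y @ v)
    return total / n_mc   # estimates p(Y | M1) / p(Y | M0)
```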
26 Simulation study
100 datasets were generated for each of (m, n) ∈ {(10, 10), (100, 10), (100, 100)}:
- U ~ uniform(V_{5,m}), V ~ uniform(V_{5,n});
- D = diag{d_1, ..., d_5}, with {d_1, ..., d_5} i.i.d. uniform(½ μ_mn, (3/2) μ_mn), where μ²_mn = m + n + 2√(mn);
- Y = UDV′ + E, where E is an m × n matrix of standard normal noise.
The choice of μ_mn is not arbitrary: the largest singular value of E is approximately (Edelman, 1988)
λ_max ≈ (m + n + 2√(mn))^{1/2}.
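This approximation is easy to verify numerically (illustrative); note that (m + n + 2√(mn))^{1/2} is exactly √m + √n:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 100, 10
E = rng.standard_normal((m, n))
lam_max = np.linalg.svd(E, compute_uv=False)[0]
print(lam_max, np.sqrt(m + n + 2 * np.sqrt(m * n)))   # both close to sqrt(m) + sqrt(n)
```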
27 Simulation study
[Figure: distribution of the posterior-mode rank estimate, for (m, n) = (10, 10), (100, 10), (100, 100).]
28 Simulation study
[Figure: as above, adding the maximum-eigengap estimator.]
29 Simulation study
[Figure: as above, adding the eigenvalue-of-correlation estimator.]
30 Simulation study
[Figure: E[p(K | Y)] and the density of K̂ for (m, n) = (10, 10), (100, 10), (100, 100); panel labels include K, K̂, ASE_B, and ASE_LS.]
31 Summary and future directions
Summary:
- Relevance disclaimer: if your goal is to represent large components of variation in the data, then maybe you don't need a model. If you believe Y ≈ M + E, and you want to represent large components of variation in M, then a model can be helpful.
- SVD models are a natural way to represent patterns in relational data.
- The SVD structure can be incorporated into a variety of model forms.
- Bayesian model averaging (BMA) on the model rank can be done.
Potentially straightforward extensions:
- Representation of higher-dimensional arrays: m_{i,j,k} = Σ_{l=1}^L d_l u_{i,l} v_{j,l} w_{k,l} (a construction is sketched below).
- Dynamic network inference (time series models for Y, U, D, V).
- Restriction of latent characteristics to be orthogonal to design vectors.
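For concreteness, the three-way (rank-L, PARAFAC-style) decomposition in the first extension can be constructed with numpy's einsum (a sketch, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, p, L = 8, 9, 10, 3
U = rng.normal(size=(m, L))
V = rng.normal(size=(n, L))
W = rng.normal(size=(p, L))
d = rng.uniform(1, 2, L)

# m_{i,j,k} = sum_l d_l * u_{i,l} * v_{j,l} * w_{k,l}
M = np.einsum("l,il,jl,kl->ijk", d, U, V, W)
print(M.shape)   # (8, 9, 10)
```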