Latent Factor Models for Relational Data


Latent Factor Models for Relational Data
Peter Hoff
Statistics, Biostatistics and Center for Statistics and the Social Sciences, University of Washington

Outline

Part 1: Multiplicative factor models for network data
- Relational data
- Models via exchangeability
- Models via matrix representations
- Examples: conflict data, protein interaction data

Part 2: Extension to multiway data
- Multiway data
- Multiway latent factor models
- Example: Cold war cooperation and conflict

Summary and future work

Relational data

Relational data consist of a set of units or nodes $A$, and a set of measurements $Y = \{y_{i,j}\}$ specific to pairs of nodes $(i,j) \in A \times A$.

Examples:
- International relations: $A$ = countries, $y_{i,j}$ = indicator of a dispute initiated by $i$ with target $j$.
- Needle-sharing network: $A$ = IV drug users, $y_{i,j}$ = needle-sharing activity between $i$ and $j$.
- Protein-protein interactions: $A$ = proteins, $y_{i,j}$ = the interaction between $i$ and $j$.
- Document analysis: $A_1$ = words, $A_2$ = documents, $y_{i,j}$ = count of word $i$ in document $j$.

Inferential goals in the regression framework

Here $y_{i,j}$ measures $i \to j$ and $x_{i,j}$ is a vector of explanatory variables:

$$Y = \begin{pmatrix} y_{1,1} & y_{1,2} & y_{1,3} & \text{NA} & y_{1,5} & \cdots \\ y_{2,1} & y_{2,2} & y_{2,3} & y_{2,4} & y_{2,5} & \cdots \\ y_{3,1} & \text{NA} & y_{3,3} & y_{3,4} & \text{NA} & \cdots \\ y_{4,1} & y_{4,2} & y_{4,3} & y_{4,4} & y_{4,5} & \cdots \\ \vdots & & & & & \ddots \end{pmatrix} \qquad X = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} & \cdots \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} & x_{2,5} & \cdots \\ x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} & x_{3,5} & \cdots \\ x_{4,1} & x_{4,2} & x_{4,3} & x_{4,4} & x_{4,5} & \cdots \\ \vdots & & & & & \ddots \end{pmatrix}$$

Consider a basic (generalized) linear model $y_{i,j} \approx \beta^\top x_{i,j} + e_{i,j}$. A model can provide:
- a measure of the association between X and Y: $\hat\beta$, $\mathrm{se}(\hat\beta)$;
- predictions of missing or future observations: $p(y_{1,4} \mid Y, X)$.
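To make the regression framing concrete, here is a minimal numpy sketch (not from the talk; data, dimensions, and coefficients are all illustrative): the observed dyads are stacked into a design matrix, $\beta$ is estimated by least squares, and the fit supplies a point prediction for the missing entry $y_{1,4}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 6, 5, 3
X = rng.normal(size=(m, n, p))          # dyad-level covariates x_ij
beta = np.array([0.8, -1.2, 0.4])       # illustrative "true" coefficients
Y = X @ beta + 0.3 * rng.normal(size=(m, n))
Y[0, 3] = np.nan                        # the missing dyad y_{1,4} on the slide

obs = ~np.isnan(Y)                      # stack only the observed dyads
beta_hat, *_ = np.linalg.lstsq(X[obs], Y[obs], rcond=None)

y_pred = X[0, 3] @ beta_hat             # point prediction for y_{1,4}
print(beta_hat, y_pred)
```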

How might node heterogeneity affect network structure?

Latent variable models

Deviations from ordinary regression models can be represented as $y_{i,j} \approx \beta^\top x_{i,j} + e_{i,j}$. A simple latent variable model might include row and column effects:
$$e_{i,j} = u_i + v_j + \epsilon_{i,j}, \qquad y_{i,j} \approx \beta^\top x_{i,j} + u_i + v_j + \epsilon_{i,j}.$$
The effects $u_i$ and $v_j$ induce across-node heterogeneity that is additive on the scale of the regressors. Inclusion of these effects in the model can dramatically improve
- within-sample model fit (measured by $R^2$, likelihood ratio, BIC, etc.);
- out-of-sample predictive performance (measured by cross-validation).

But this model only captures heterogeneity of outdegree/indegree, and can't represent more complicated structure, such as clustering, transitivity, etc.
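The additive-effects model is easy to simulate. A minimal numpy sketch, assuming Gaussian effects and noise (an illustration, not the talk's estimation procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p = 30, 30, 2

X = rng.normal(size=(m, n, p))            # dyad-level covariates
beta = np.array([1.0, -0.5])
u = rng.normal(size=m)                    # row (sender) effects
v = rng.normal(size=n)                    # column (receiver) effects
eps = rng.normal(scale=0.5, size=(m, n))

# u_i shifts every outcome in row i; v_j shifts every outcome in column j,
# producing exactly the outdegree/indegree heterogeneity described above
Y = X @ beta + u[:, None] + v[None, :] + eps
```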

Latent variable models via exchangeability

X represents known information about the nodes; E represents deviations from the regression model. In this case we might be willing to use a model for E in which
$$\{e_{i,j}\} \stackrel{d}{=} \{e_{g(i),h(j)}\} \quad \text{for all permutations } g \text{ and } h.$$
This is a type of exchangeability for arrays, sometimes called weak exchangeability.

Theorem (Aldous, Hoover): Let $\{e_{i,j}\}$ be a weakly exchangeable array. Then $\{e_{i,j}\} \stackrel{d}{=} \{\tilde e_{i,j}\}$, where
$$\tilde e_{i,j} = f(\mu, u_i, v_j, \epsilon_{i,j})$$
and $\mu$, $\{u_i\}$, $\{v_j\}$, $\{\epsilon_{i,j}\}$ are all independent random variables.
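The representation is constructive: pick any $f$ and independent $\mu$, $\{u_i\}$, $\{v_j\}$, $\{\epsilon_{i,j}\}$, and the resulting array is weakly exchangeable. A small sketch; the particular $f$ below is an arbitrary choice of mine, not part of the theorem:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 25

mu = rng.normal()
u = rng.uniform(size=m)            # row-specific latent variables
v = rng.uniform(size=n)            # column-specific latent variables
eps = rng.normal(size=(m, n))      # dyad-specific noise

def f(mu, ui, vj, e):
    # an arbitrary illustrative f; any measurable function works
    return mu + np.sin(2 * np.pi * (ui + vj)) + 0.3 * e

E = f(mu, u[:, None], v[None, :], eps)   # a weakly exchangeable array
```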

The singular value decomposition model

Write $Y = M + E$, where $M$ represents systematic patterns and $E$ represents noise. Recall that every $M$ has a representation of the form $M = UDV^\top$ where, in the case $m \geq n$,
- $U$ is an $m \times n$ matrix with orthonormal columns;
- $V$ is an $n \times n$ matrix with orthonormal columns;
- $D$ is an $n \times n$ diagonal matrix, with diagonal elements $\{d_1, \ldots, d_n\}$ typically taken to be a decreasing sequence of non-negative numbers.

The squared elements of the diagonal of $D$ are the eigenvalues of $M^\top M$, and the columns of $V$ are the corresponding eigenvectors. The matrix $U$ can be obtained from the first $n$ eigenvectors of $MM^\top$. The number of non-zero elements of $D$ is the rank of $M$. Writing the row vectors of $U$ and $V$ as $\{u_1, \ldots, u_m\}$ and $\{v_1, \ldots, v_n\}$,
$$m_{i,j} = u_i^\top D v_j.$$
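These facts can be verified numerically. A short numpy sketch (illustrative only; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 5                         # m >= n, as on the slide
M = rng.normal(size=(m, n))

U, d, Vt = np.linalg.svd(M, full_matrices=False)
D, V = np.diag(d), Vt.T             # U: m x n, D: n x n diagonal, V: n x n

assert np.allclose(U @ D @ V.T, M)           # M = U D V'
assert np.allclose(U.T @ U, np.eye(n))       # orthonormal columns
# squared singular values = eigenvalues of M'M (sorted to match)
assert np.allclose(d**2, np.sort(np.linalg.eigvalsh(M.T @ M))[::-1])
i, j = 3, 2
assert np.isclose(M[i, j], U[i] @ D @ V[j])  # m_ij = u_i' D v_j
```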

Data analysis with the singular value decomposition

Many data analysis procedures for matrix-valued data Y are related to the SVD. Given a model of the form $Y = M + E$ where $E$ is independent noise, the SVD provides:
- Interpretation: $y_{i,j} = u_i^\top D v_j + \epsilon_{i,j}$, where $u_i$ and $v_j$ are the $i$th and $j$th rows of $U$ and $V$;
- Estimation: $\hat M_K = \hat U_{[,1:K]} \hat D_{[1:K,1:K]} \hat V_{[,1:K]}^\top$ if $M$ is assumed to be of rank $K$.

Applications:
- biplots (Gabriel 1971, Gower and Hand 1996)
- reduced-rank interaction models (Gabriel 1978, 1998)
- analysis of relational data (Harshman et al., 1982)
- factor analysis, image processing, data reduction, ...

Notes:
- How to select K?
- Given K, is $\hat M_K$ a good estimator? ($E[Y^\top Y] = M^\top M + m\sigma^2 I$)
- Work on model-based factor analysis: Rajan and Rayner (1997), Minka (2000), Lopes and West (2004).
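A minimal sketch of the rank-K estimator $\hat M_K$ on simulated data $Y = M + E$ (the rank and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, K = 50, 40, 3

M = rng.normal(size=(m, K)) @ rng.normal(size=(K, n))   # rank-K signal
Y = M + 0.5 * rng.normal(size=(m, n))                   # Y = M + E

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
M_hat = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]           # keep K factors

print(np.linalg.norm(Y - M), np.linalg.norm(M_hat - M))
# the truncated SVD is typically much closer to M than the raw Y is
```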

Generalized bilinear regression

$$y_{i,j} \approx \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j}$$

Interpretation: think of $\{u_1, \ldots, u_m\}$ and $\{v_1, \ldots, v_n\}$ as vectors of latent nodal attributes:
$$u_i^\top D v_j = \sum_{k=1}^K d_k u_{i,k} v_{j,k}.$$

In general, a latent variable model relating X to Y is
$$g(y_{i,j}) = \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j},$$
and the parameters can be estimated using a rank likelihood or multinomial probit. Alternatively, parametric models include:
- if $y_{i,j}$ is binary, $\operatorname{log odds}(y_{i,j} = 1) = \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j}$;
- if $y_{i,j}$ is count data, $\log E[y_{i,j}] = \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j}$;
- if $y_{i,j}$ is continuous, $E[y_{i,j}] = \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j}$.

Estimation: given $D$ and $V$, the predictor is linear in $U$. This bilinear structure can be exploited (EM, Gibbs sampling, variational methods).
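One simple way to see the exploitable bilinear structure is alternating least squares: fixing $V$, each row of $U$ solves an ordinary regression, and vice versa. A sketch under Gaussian errors with no covariates, which is a simplification of the models above and not the talk's actual fitting algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, K = 40, 30, 2
M = rng.normal(size=(m, K)) @ rng.normal(size=(K, n))
Y = M + 0.1 * rng.normal(size=(m, n))

V = rng.normal(size=(n, K))        # random start; D is folded into U and V
for _ in range(50):
    U = Y @ V @ np.linalg.inv(V.T @ V)     # given V, each row of U is OLS
    V = Y.T @ U @ np.linalg.inv(U.T @ U)   # given U, each row of V is OLS

print(np.linalg.norm(Y - U @ V.T))         # small residual at rank K
```

EM, Gibbs, and variational schemes exploit exactly this conditional linearity, replacing the least-squares updates with conditional expectations or draws.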

International conflict network, 1990-2000

Eleven years of international relations data (Mike Ward and Xun Cao):
- $y_{i,j}$ = indicator of a militarized dispute initiated by $i$ with target $j$;
- $x_{i,j}$ = an 8-dimensional covariate vector containing an intercept and
1. population of initiator
2. population of target
3. polity score of initiator
4. polity score of target
5. polity score of initiator × polity score of target
6. log distance
7. number of shared intergovernmental organizations

Model: the $y_{i,j}$ are independent binary random variables with
$$\operatorname{log odds}(y_{i,j} = 1) = \beta^\top x_{i,j} + u_i^\top D v_j + \epsilon_{i,j}.$$

International conflict network: parameter estimates

[Figure: estimated regression coefficients for log distance, polity of initiator, shared intergovernmental organizations, polity of target, polity interaction, log population of target, and log population of initiator.]

International conflict network: description of latent variation

[Figure: estimated latent factor positions of countries, labeled by three-letter country codes, in two panels.]

International conflict network: prediction experiment

[Figure: number of links found versus number of pairs checked, for models with K = 0, 1, 2, 3 latent factors.]

Analyzing undirected data

In many applications $y_{i,j} = y_{j,i}$ by design, and so
$$(y_{i,j} = y_{j,i}) \approx \beta^\top x_{i,j} + e_{i,j}$$
and $E = \{e_{i,j}\}$ is a symmetric array. How should E be modeled?

Modeling via exchangeability: let E be a symmetric array (with an undefined diagonal), such that $\{e_{i,j}\} \stackrel{d}{=} \{e_{g(i),g(j)}\}$. Then $\{e_{i,j}\} \stackrel{d}{=} \{\tilde e_{i,j}\}$, where $\tilde e_{i,j} = f(\mu, u_i, u_j, \epsilon_{i,j})$; $\mu$, $\{u_i\}$, $\{\epsilon_{i,j}\}$ are all independent; and $f(\cdot, u_i, u_j, \cdot) = f(\cdot, u_j, u_i, \cdot)$.

Modeling via matrix decomposition: write $E = M + \tilde E$, with all matrices symmetric. Every such M has an eigenvalue decomposition
$$M = U \Lambda U^\top, \qquad m_{i,j} = u_i^\top \Lambda u_j.$$
This suggests a model of the form
$$(y_{i,j} = y_{j,i}) \approx \beta^\top x_{i,j} + u_i^\top \Lambda u_j + \epsilon_{i,j}.$$
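The symmetric analogue can likewise be checked numerically; note that $\Lambda$, unlike $D$ in the SVD, may have negative entries. A short sketch (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10
A = rng.normal(size=(n, n))
M = (A + A.T) / 2                    # a symmetric matrix

lam, U = np.linalg.eigh(M)           # real eigenvalues, orthonormal U
assert np.allclose(U @ np.diag(lam) @ U.T, M)    # M = U Lambda U'

i, j = 2, 7
assert np.isclose(M[i, j], U[i] @ np.diag(lam) @ U[j])  # u_i' Lambda u_j
# because Lambda can be negative, the model can express both positive
# (similar nodes relate more) and negative patterns of association
```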

Protein-protein interaction network

Interactions among 270 proteins in E. coli (Butland, 2005).

Data: $\binom{270}{2}^{-1} \sum_{i<j} y_{i,j} \approx 0.01$, so the network is sparse.

Model: $\operatorname{log odds}(y_{i,j} = 1) = \mu + u_i^\top \Lambda u_j + \epsilon_{i,j}$.

Protein-protein interaction network

[Figure: estimated latent factor positions of the 270 proteins, labeled by gene name, with positive and negative eigenvalue directions indicated.]

Protein-protein interaction network

[Figure: detail of the latent space showing a tight cluster of RNA polymerase-associated proteins (rpoa, rpob, rpoc, rpod, rpoz, nusa, nusg, grea, greb, and others).]

Protein-protein interaction network: prediction experiment

[Figure: number of links found versus number of pairs checked.]

Multiway data

Data on pairs is sometimes called two-way data. More generally, data on triples, quadruples, etc. is called multiway data. Let $A_1, A_2, \ldots, A_p$ represent classes of objects; $y_{i_1, i_2, \ldots, i_p}$ is the measurement specific to $i_1 \in A_1, \ldots, i_p \in A_p$.

Examples:
- International relations: $A_1 = A_2$ = countries, $A_3$ = time, $y_{i,j,t}$ = indicator of a dispute between $i$ and $j$ in year $t$.
- Social networks: $A_1 = A_2$ = individuals, $A_3$ = information sources, $y_{i,j,k}$ = source $k$'s report of the relationship between $i$ and $j$.
- Document analysis: $A_1 = A_2$ = words, $A_3$ = documents, $y_{i,j,k}$ = co-occurrence of words $i$ and $j$ in document $k$.

Factor models for multiway data

Recall the decomposition of a two-way array of rank R:
$$m_{i,j} = u_i^\top D v_j = \sum_{r=1}^R u_{i,r} v_{j,r} d_r.$$
Now generalize to a three-way array:
$$m_{i,j,k} = \sum_{r=1}^R u_{i,r} v_{j,r} w_{k,r} d_r,$$
where $\{u_1, \ldots, u_{n_1}\}$ represents variation along the 1st dimension, $\{v_1, \ldots, v_{n_2}\}$ along the 2nd, and $\{w_1, \ldots, w_{n_3}\}$ along the 3rd.

Consider the $k$th slab of M, which is an $n_1 \times n_2$ matrix:
$$m_{i,j,k} = \sum_{r=1}^R u_{i,r} v_{j,r} w_{k,r} d_r = \sum_{r=1}^R u_{i,r} v_{j,r} d_{k,r} = u_i^\top D_k v_j,$$
where $d_{k,r} = w_{k,r} d_r$. Each slab is thus an ordinary bilinear model with a slab-specific diagonal matrix $D_k$.
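A minimal numpy sketch of this three-way decomposition and its slab identity (dimensions and rank are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, n3, R = 10, 8, 6, 3

U = rng.normal(size=(n1, R))
V = rng.normal(size=(n2, R))
W = rng.normal(size=(n3, R))
d = rng.uniform(1, 2, size=R)

# m_ijk = sum_r u_ir v_jr w_kr d_r, built as a sum of rank-1 terms
M = np.einsum("ir,jr,kr,r->ijk", U, V, W, d)

k = 4
D_k = np.diag(W[k] * d)             # d_{k,r} = w_{k,r} d_r
assert np.allclose(M[:, :, k], U @ D_k @ V.T)   # the k-th slab is bilinear
```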

Cold war cooperation and conflict data

[Figure: trajectories of the regression coefficients $\beta_{\text{trade}}$, $\beta_{\text{gdp}}$, $\beta_{\text{distance}}$ (top row) and the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_3$ (bottom row) over 1950-1985.]

Cold war cooperation and conflict data

[Figure: estimated latent factor positions ($u_1$ versus $u_2$, and $u_1$ versus $u_3$) of countries in 1950, 1960, 1970, and 1985, labeled by country code.]

Summary and future directions

Summary:
- Latent factor models are a natural way to represent patterns in relational or array-structured data.
- The latent factor structure can be incorporated into a variety of model forms.
- Model-based methods give parameter estimates, accommodate missing data, provide predictions, and are easy to extend.

Future directions:
- Dynamic network inference
- Generalization to multi-level models