Latent SVD Models for Relational Data


Latent SVD Models for Relational Data
Peter Hoff
Statistics, Biostatistics and Center for Statistics and the Social Sciences, University of Washington

Outline
1. Relational data
2. Exchangeability for arrays
3. Latent variable models
4. Subtalk: Rank selection for the SVD model
   (a) Prior distributions on the Stiefel manifold
   (b) A Gibbs sampling scheme for rank estimation
   (c) A simulation study
5. Examples:
   (a) Modeling international conflict
   (b) Finding protein-protein interactions

Relational data
Relational data consist of a set of units or nodes A, and a set of variables Y = {y_{i,j}} specific to pairs of nodes (i, j) ∈ A × A.
Examples:
- Needle-sharing network: let A index a set of IV drug users, and let y_{i,j} indicate needle-sharing activity between i and j.
- International relations: let A index the set of countries, and let y_{i,j} be the number of military disputes initiated by i with target j.
- Protein-protein interactions: let A index a set of proteins, and let y_{i,j} measure the interaction between i and j.
- Gene-protein interactions: let A = {A_1, A_2}, with A_1 indexing a set of genes and A_2 indexing a set of proteins, and let y_{i,j} measure the relationship between i ∈ A_1 and j ∈ A_2.

Inferential goals in the regression framework
Let y_{i,j} be the measurement on the ordered pair (i, j), and x_{i,j} be a vector of explanatory variables.
If y_{i,j} is binary, then the data might look like
  Y = a 5 × 5 binary sociomatrix with an undefined diagonal and some entries missing (e.g. y_{1,4} = NA);
  X = the corresponding array of covariate vectors x_{i,j}, i ≠ j.
Consider a basic generalized linear model:
  log odds(y_{i,j} = 1) = β′x_{i,j}
A model can provide
- a measure of the association between X and Y: β̂, se(β̂);
- predictions of missing or future observations: Pr(y_{1,4} = 1 | Y, X).
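A small sketch of this regression setup (my own illustration with made-up data; statsmodels is one way to obtain β̂, se(β̂), and a predicted probability for a missing entry such as y_{1,4}):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
m, p = 5, 2                                   # 5 nodes, 2 dyadic covariates (hypothetical)
pairs = [(i, j) for i in range(m) for j in range(m) if i != j]
x = rng.normal(size=(len(pairs), p))          # x_{i,j} stacked over ordered pairs
beta_true = np.array([1.0, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-(x @ beta_true))))

# Treat y_{1,4} as missing: fit the logistic GLM on the observed pairs only.
observed = [k for k, (i, j) in enumerate(pairs) if (i, j) != (0, 3)]    # 0-based (1,4)
fit = sm.GLM(y[observed], sm.add_constant(x[observed]),
             family=sm.families.Binomial()).fit()
print(fit.params, fit.bse)                    # beta_hat and se(beta_hat)

# Predicted probability for the missing dyad (1,4).
k14 = pairs.index((0, 3))
print(fit.predict(sm.add_constant(x[[k14]], has_constant='add')))
```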

How might node heterogeneity affect network structure?

Latent variable models
Deviations from ordinary logistic regression can be represented with
  log odds(y_{i,j} = 1) = β′x_{i,j} + γ_{i,j}
Perhaps the first latent variable model was similar to a p1-type model:
  γ_{i,j} = a_i + b_j
  log odds(y_{i,j} = 1) = β′x_{i,j} + a_i + b_j
The a_i and b_j induce across-node heterogeneity that is additive on the log-odds scale. Inclusion of these effects in the model can dramatically improve
- within-sample model fit (measured by R², likelihood ratio, BIC, etc.);
- out-of-sample predictive performance (measured by cross-validation).
But this model only captures heterogeneity of outdegree/indegree, and can't represent more complicated structure, such as clustering, transitivity, etc.

Exchangeability for arrays
For relational data in general we might be willing to assume something like
  {γ_{i,j}} =_d {γ_{g(i),h(j)}}  for all permutations g and h.
This is a type of exchangeability for arrays, sometimes called weak exchangeability.
Theorem (Aldous, Hoover): Let {γ_{i,j}} be a weakly exchangeable array. Then {γ_{i,j}} =_d {γ̃_{i,j}}, where
  γ̃_{i,j} = f(µ, u_i, v_j, ɛ_{i,j})
and µ, {u_i}, {v_j}, {ɛ_{i,j}} are all independent uniform random variables on [0, 1].
There are variants of this theorem for symmetric sociomatrices and other exchangeability assumptions.
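The representation is constructive: any measurable f applied to independent uniforms yields a weakly exchangeable array. A minimal numpy sketch (my own illustration; the particular f below is an arbitrary, hypothetical choice):

```python
import numpy as np

def exchangeable_array(m, n, f, seed=0):
    """Generate gamma[i, j] = f(mu, u[i], v[j], eps[i, j]) from iid uniform variables."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform()
    u, v = rng.uniform(size=m), rng.uniform(size=n)
    eps = rng.uniform(size=(m, n))
    return f(mu, u[:, None], v[None, :], eps)

# A hypothetical f, chosen only to illustrate the representation.
gamma = exchangeable_array(
    5, 5, lambda mu, u, v, e: mu + np.cos(2 * np.pi * u) * np.cos(2 * np.pi * v) + 0.1 * e)
print(gamma.round(2))
```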

Matrix representation techniques
Exchangeability suggests random effects models of the form
  γ_{i,j} = f(µ, u_i, v_j, ɛ_{i,j})
What should f be?
Model:  log odds(y_{i,j} = 1) = µ + u_i′ D v_j + ɛ_{i,j}
Interpretation: think of {u_1, ..., u_m}, {v_1, ..., v_m} as vectors of latent nodal attributes, with
  u_i′ D v_j = Σ_{k=1}^K d_k u_{i,k} v_{j,k}

Model extension: Latent SVD model
General model for relational data: let Y be an m × n matrix of relational data and X an m × n × p array of explanatory variables. A latent variable model relating X to Y is
  g(E[y_{i,j}]) = β′x_{i,j} + u_i′ D v_j + ɛ_{i,j}
For example, some potential models are:
- if y_{i,j} is binary:      log odds(y_{i,j} = 1) = β′x_{i,j} + u_i′ D v_j + ɛ_{i,j}
- if y_{i,j} is count data:  log E[y_{i,j}] = β′x_{i,j} + u_i′ D v_j + ɛ_{i,j}
- if y_{i,j} is continuous:  E[y_{i,j}] = β′x_{i,j} + u_i′ D v_j + ɛ_{i,j}
(similar to the generalized bilinear models of Gabriel 1998)
Estimation and dimension selection for u_i′ D v_j relates to a more basic SVD model.

Beginning of subtalk: The singular value decomposition
Every m × n matrix M has a representation of the form
  M = U D V′
where, in the case m ≥ n,
- U is an m × n matrix with orthonormal columns;
- V is an n × n matrix with orthonormal columns;
- D is an n × n diagonal matrix, with diagonal elements {d_1, ..., d_n} typically taken to be a decreasing sequence of non-negative numbers.
Recall:
- the squared elements of the diagonal of D are the eigenvalues of M′M, and the columns of V are the corresponding eigenvectors;
- the matrix U can be obtained from the first n eigenvectors of MM′;
- the number of non-zero elements of D is the rank of M.
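These facts are easy to check numerically. A minimal numpy sketch (my own illustration, using a hypothetical 6 × 4 matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))                      # a hypothetical 6 x 4 matrix

# Compact SVD: U is 6 x 4, d holds the singular values, Vt = V'.
U, d, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, U @ np.diag(d) @ Vt)       # M = U D V'

# Squared singular values are the eigenvalues of M'M,
# and the rank of M is the number of non-zero singular values.
eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]      # sorted in decreasing order
assert np.allclose(d ** 2, eigvals)
print("rank(M) =", int(np.sum(d > 1e-10)))
```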

Data analysis with the singular value decomposition
Many data analysis procedures for matrix-valued data Y are related to the SVD. Given a model of the form
  Y = M + E,
the SVD provides
- Interpretation: y_{i,j} = u_i′ D v_j + ɛ_{i,j}, where u_i and v_j are the ith and jth rows of U and V;
- Estimation: M̂_K = Û_[,1:K] D̂_[1:K,1:K] V̂′_[,1:K] if M is assumed to be of rank K.
Applications: biplots (Gabriel 1971, Gower and Hand 1996); reduced-rank interaction models (Gabriel 1978, 1998); analysis of relational data (Harshman et al. 1982); factor analysis, image processing, data reduction.
But:
- How to select K?
- Given K, is M̂_K a good estimator? (E[Y′Y] = M′M + mσ²I)
Bayesian work on factor analysis models: Rajan and Rayner (1997), Minka (2000), Lopes and West (2004).
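The rank-K estimator above is just the SVD of Y truncated to its first K terms. A sketch with made-up data (assuming a rank-3 truth and standard normal noise, both hypothetical choices) shows how the error behaves when K is too small or too large:

```python
import numpy as np

def svd_truncate(Y, K):
    """Least-squares rank-K estimate: M_hat_K = U[:, :K] diag(d[:K]) V[:, :K]'."""
    U, d, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]

rng = np.random.default_rng(1)
m, n, K_true = 40, 30, 3
M = 2.0 * rng.normal(size=(m, K_true)) @ rng.normal(size=(K_true, n))   # low-rank signal
Y = M + rng.normal(size=(m, n))                                         # noisy observation

for K in (1, 3, 10):
    mse = np.mean((svd_truncate(Y, K) - M) ** 2)
    print(f"K = {K:2d}   mean squared error = {mse:.3f}")
```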

A toy example
  Y (40 × 30) = U D V′ + E = Σ_k d_k U_[,k] V′_[,k] + E
[Figures, shown over three slides: the singular values of Y.]

A toy example (continued)
[Figures: the posterior distribution p(K | Y) over K = 1, ..., 15, and the true versus estimated values.]
  MSE(Bayes) = 0.20,  MSE(LSE) = 0.42

SVD model
Let Y be an m × n data matrix (suppose m ≥ n). Consider a model of the form
  Y = U D V′ + E
where
- K ∈ {0, ..., n};
- U ∈ V_{K,m}, V ∈ V_{K,n} (Stiefel manifolds);
- D = diag{d_1, ..., d_K};
- E is a matrix of independent normal noise.
Given a prior distribution on {U, D, V} we would like to calculate
  p({u_1, ..., u_K}, {v_1, ..., v_K}, {d_1, ..., d_K} | Y, K)  and  p(K | Y).

Prior distribution for U
1. z_1 ~ uniform[m-sphere]; set U_[,1] = z_1.
2. z_2 ~ uniform[(m − 1)-sphere]; set U_[,2] = Null(U_[,1]) z_2.
   ...
K. z_K ~ uniform[(m − K + 1)-sphere]; set U_[,K] = Null(U_[,1], ..., U_[,K−1]) z_K.
The resulting distribution for U is the uniform (invariant) distribution on the Stiefel manifold, and is exchangeable under row and column permutations.
Prior conditional distributions:
  (U_[,j] | U_[,−j]) =_d Null(U_[,1], ..., U_[,j−1], U_[,j+1], ..., U_[,K]) z_j,   z_j ~ uniform[(m − K + 1)-sphere].
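A minimal numpy/scipy sketch of this column-by-column construction (my own illustration, not code from the talk; Null(·) is an orthonormal basis for the null space of the listed columns):

```python
import numpy as np
from scipy.linalg import null_space

def runif_sphere(rng, p):
    """Draw z uniformly from the unit sphere in R^p."""
    z = rng.normal(size=p)
    return z / np.linalg.norm(z)

def runif_stiefel(rng, m, K):
    """Sequential uniform draw of U from the Stiefel manifold V_{K,m}."""
    U = runif_sphere(rng, m).reshape(m, 1)
    for j in range(1, K):
        N = null_space(U.T)               # m x (m - j) orthonormal basis of Null(U[,1:j])
        z = runif_sphere(rng, m - j)
        U = np.hstack([U, (N @ z).reshape(m, 1)])
    return U

rng = np.random.default_rng(2)
U = runif_stiefel(rng, m=10, K=3)
print(np.round(U.T @ U, 8))               # identity: the columns are orthonormal
```

The same distribution can also be obtained by orthonormalizing an m × K matrix of iid standard normals, but the column-by-column form above is the one the conditional distributions in the Gibbs sampler exploit.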

Posterior distribution for U
Recall U D V′ = Σ_{k=1}^K d_k U_[,k] V′_[,k], so
  Y − U D V′ = (Y − U_[,−j] D_[−j,−j] V′_[,−j]) − d_j U_[,j] V′_[,j] ≡ E_j − d_j U_[,j] V′_[,j],
so
  ||Y − U D V′||² = ||E_j − d_j U_[,j] V′_[,j]||²
                  = ||E_j||² − 2 d_j U′_[,j] E_j V_[,j] + ||d_j U_[,j] V′_[,j]||²
                  = ||E_j||² − 2 d_j U′_[,j] E_j V_[,j] + d_j².
Therefore
  p(Y | U, V, D, φ) = (2π/φ)^{−mn/2} exp{ −φ ||E_j||²/2 + φ d_j U′_[,j] E_j V_[,j] − φ d_j²/2 }
  p(U_[,j] | Y, U_[,−j], V, D) ∝ p(U_[,j] | U_[,−j]) exp{ φ d_j U′_[,j] E_j V_[,j] }
  (U_[,j] | U_[,−j], Y, V, D) =_d Null(U_[,1], ..., U_[,j−1], U_[,j+1], ..., U_[,K]) z_j ≡ N_j z_j,
with p(z_j) ∝ exp{ z′_j µ } and µ = φ d_j N′_j E_j V_[,j].
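A sketch of how one such Gibbs update could be organized (my own illustration; the density p(z_j) ∝ exp{z′_j µ} is von Mises–Fisher on the sphere, and the sampler for it is left as a plug-in step):

```python
import numpy as np
from scipy.linalg import null_space

def conditional_vmf_parameter(Y, U, V, d, phi, j):
    """Compute N_j and mu = phi * d_j * N_j' E_j V[:, j] for the update of U[:, j]."""
    keep = np.arange(U.shape[1]) != j
    E_j = Y - U[:, keep] @ np.diag(d[keep]) @ V[:, keep].T   # residual without dimension j
    N_j = null_space(U[:, keep].T)                           # orthonormal basis, m x (m-K+1)
    mu = phi * d[j] * (N_j.T @ E_j @ V[:, j])
    return N_j, mu

# Given (N_j, mu), the update is U[:, j] = N_j @ z_j, with z_j drawn from the
# von Mises-Fisher density proportional to exp(z' mu) on the unit sphere
# (e.g., via Wood's 1994 rejection sampler).
```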

Evaluating the dimension
It can all be done by comparing
  M_1: Y = d u v′ + E   versus   M_0: Y = E.
To decide whether or not to add a dimension, we need to calculate
  p(M_1 | Y) / p(M_0 | Y) = [p(M_1) / p(M_0)] × [p(Y | M_1) / p(Y | M_0)].
The numerator of the Bayes factor is the hard part. We need to integrate the following with respect to the prior distribution:
  p(Y | M_1, u, v, d) = (2π/φ)^{−nm/2} exp{ −φ ||Y||²/2 + φ d u′ Y v − φ d²/2 }
                      = p(Y | M_0) exp{ −φ d²/2 } exp{ u′ [dφY] v }.
Computing E[e^{u′Av}] over u ∈ S_m and v ∈ S_n is difficult but doable.
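One brute-force way to see that this expectation is computable (a naive Monte Carlo sketch of my own, not the method used in the talk; it degrades when the exponent is large):

```python
import numpy as np

def mc_expect_exp_uAv(A, n_draws=200_000, seed=0):
    """Naive Monte Carlo estimate of E[exp(u'Av)] for u, v uniform on the unit spheres."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    u = rng.normal(size=(n_draws, m))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v = rng.normal(size=(n_draws, n))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.mean(np.exp(np.sum((u @ A) * v, axis=1)))

# Example: the contribution, for a fixed d, to p(Y | M_1) / p(Y | M_0).
rng = np.random.default_rng(3)
Y = rng.normal(size=(10, 8))
phi, d = 1.0, 2.0
ratio = np.exp(-phi * d ** 2 / 2) * mc_expect_exp_uAv(d * phi * Y)
print(ratio)
```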

Simulation study
For each of (m, n) ∈ {(10, 10), (100, 10), (100, 100)}, 100 datasets were generated as follows:
- U ~ uniform(V_{5,m}), V ~ uniform(V_{5,n});
- D = diag{d_1, ..., d_5}, with {d_1, ..., d_5} iid uniform((1/2)µ_mn, (3/2)µ_mn), where µ_mn = √(m + n + 2√(mn));
- Y = U D V′ + E, where E is an m × n matrix of standard normal noise.
This choice was not arbitrary: the largest eigenvalue of E′E is approximately λ_max ≈ m + n + 2√(mn) (Edelman, 1988), so the signal singular values range from well below to well above the largest singular value of the noise.
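A sketch of one replicate of this data-generating process (my own illustration; the QR-based draw below is distributionally equivalent to the sequential Stiefel construction given earlier):

```python
import numpy as np

def runif_stiefel(rng, m, K):
    """Uniform draw from V_{K,m} via QR of an m x K matrix of iid standard normals."""
    Q, R = np.linalg.qr(rng.normal(size=(m, K)))
    return Q * np.sign(np.diag(R))     # fix column signs so the draw is uniform

def simulate(rng, m, n, K=5):
    mu_mn = np.sqrt(m + n + 2 * np.sqrt(m * n))          # approx. largest singular value of E
    d = rng.uniform(0.5 * mu_mn, 1.5 * mu_mn, size=K)    # signal singular values
    U, V = runif_stiefel(rng, m, K), runif_stiefel(rng, n, K)
    E = rng.normal(size=(m, n))
    return U @ np.diag(d) @ V.T + E

rng = np.random.default_rng(4)
for m, n in [(10, 10), (100, 10), (100, 100)]:
    print((m, n), simulate(rng, m, n).shape)
```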

Simulation study: results
[Figures: for each of (m, n) = (10, 10), (100, 10), (100, 100), the average posterior E[p(K | Y)] and the distribution of K̂ over K = 0, ..., 10, together with densities comparing the average squared error of the Bayes estimate to that of the least-squares estimate.]

International Conflict Network, 1990-2000
11 years of international relations data (thanks to Mike Ward and Xun Cao):
- y_{i,j} = number of militarized disputes initiated by i with target j;
- x_{i,j} = an 11-dimensional covariate vector containing
  1. log distance
  2. log imports
  3. log exports
  4. number of shared intergovernmental organizations
  5. population of initiator
  6. population of target
  7. log gdp of initiator
  8. log gdp of target
  9. polity score of initiator
  10. polity score of target
  11. polity score of initiator × polity score of target
Model: the y_{i,j} are independent Poisson random variables with log-mean
  log E[y_{i,j}] = β′x_{i,j} + u_i′ D v_j + ɛ_{i,j}.

International conflict network, 1990-2000: Parameter estimates
[Figures: the posterior p(K | Y) over K = 0, ..., 8, and standardized coefficient estimates for log distance, polity of initiator, shared intergovernmental organizations, polity of target, the polity interaction, log imports, log gdp of target, log exports, log gdp of initiator, log population of target, and log population of initiator.]

[Figures: plots of the countries in the conflict analysis, labeled by three-letter country codes (AFG, ..., USA, ..., ZIM).]

Protein-protein interaction data
Experiments are run on pairs of proteins to see if they bind:
  y_{i,j} = 1 if proteins i and j bind, 0 otherwise.
There are many proteins, and therefore many-choose-2 possible pairwise experiments, only a subset of which are usually performed. Can we predict interactions from a subset of the experiments?
Missing data experiment:
- E. coli protein interaction network (Butland et al. 2005);
- all interactions checked on a 270 × 270 protein network;
- a very sparse network: ȳ = 0.01.
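A rough sketch of this missing-data exercise (my own illustration, using a crude stand-in for the model: a truncated SVD of the training matrix with held-out entries zero-filled, rather than the Bayesian latent SVD model of the talk). Held-out pairs are ranked by their fitted values, so the most promising additional experiments come first:

```python
import numpy as np

def rank_heldout_pairs(Y, train_mask, K=3):
    """Score held-out pairs by a rank-K SVD fit to the zero-filled training matrix."""
    Y_train = np.where(train_mask, Y, 0.0)
    U, d, Vt = np.linalg.svd(Y_train, full_matrices=False)
    fitted = U[:, :K] @ np.diag(d[:K]) @ Vt[:K, :]
    i, j = np.where(~train_mask)
    order = np.argsort(-fitted[i, j])             # highest fitted values first
    return list(zip(i[order], j[order]))

# Toy usage: a random sparse 270 x 270 binary matrix with 25% of entries held out.
rng = np.random.default_rng(5)
Y = (rng.random((270, 270)) < 0.01).astype(float)
train_mask = rng.random(Y.shape) < 0.75
print(rank_heldout_pairs(Y, train_mask)[:5])      # first few suggested experiments
```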

Finding missing links
[Figures: estimated latent vectors u and v for the full dataset and for the 75% training data, and, for the 25% test data, the number of newly observed interactions as a function of the number of additional experiments performed.]

Summary and future directions
Summary:
- SVD models are a natural way to represent patterns in relational data;
- the SVD structure can be incorporated into a variety of model forms;
- Bayesian model averaging over the model rank can be done.
Future directions:
- extensions to higher-dimensional arrays;
- dynamic network inference (time series models for Y, U, D, V);
- restricting latent characteristics to be orthogonal to design vectors.