Vertex Nomination via Attributed Random Dot Product Graphs

1 Vertex Nomination via Attributed Random Dot Product Graphs

Carey E. Priebe, Department of Applied Mathematics & Statistics, Johns Hopkins University

58th Session of the International Statistical Institute, August 21-26, 2011, Dublin, Ireland

With David J. Marchette and Glen A. Coppersmith

2 Vertex Nomination

[Figure: schematic of the $\kappa_{n,m,m',r,p,s}$ model: identified set $M'$, candidate set $V \setminus M'$, with red-red pairs connected according to $s$ and all other pairs according to $p$.]

3 Framework I

A graph is a pair $G = (V, E)$ of vertices $V = \{1, \dots, n\}$ and edges $E \subseteq V^{(2)}$, where $V^{(2)}$ denotes the set of unordered pairs of vertices. We define an attributed graph $G^a = (V, E^a, \varphi_V, \varphi_E)$ as follows. Let $\Phi_V = \{1, \dots, K_V\}$ and $\Phi_E = \{0, 1, \dots, K_E\}$, and consider vertex attribution function $\varphi_V : V \to \Phi_V$ and edge attribution function $\varphi_E : V^{(2)} \to \Phi_E$. These functions attribute each vertex (resp. edge) of $G^a$ with an element of $\Phi_V$ (resp. $\Phi_E$). The edge set $E^a$ for $G^a$ is given by $E^a = \{vw : \varphi_E(vw) > 0\}$; we write $v \sim w$ to mean that $vw \in E^a$. For simplicity, we will assume that $K_E = K_V = 2$, so that there are two distinguished attributes for each of the vertices and edges.

4 Framework II

For the vertex nomination problem, we write $M = \{v : \varphi_V(v) = 1\}$; this is the collection of vertices of particular interest. We observe not $G^a = (V, E^a, \varphi_V, \varphi_E)$ but $G^o = (V, E^a, \varphi'_V, \varphi_E)$, where $\varphi'_V : V \to \Phi_V \cup \{0\}$ with $\varphi'_V(v) = \varphi_V(v) = 1$ for $v \in M' \subseteq M$ and $\varphi'_V(v) = 0$ for $v \in V \setminus M'$. Thus we observe a subset $M'$, of cardinality $m' = |M'|$, of the vertices of particular interest, and we wish to infer others; that is, we wish to nominate vertices from $V \setminus M'$ which are, we hope, in $M \setminus M'$.

5 Vertex Nomination

Our model is a $\kappa$ graph with $n = |V|$ vertices, $m = |M|$ of which have attribute red and $n - m = |V \setminus M|$ of which have attribute green. We observe attributes for only the $m' = |M'|$ identified-set vertices (filled circles), with the remaining $n - m' = |V \setminus M'|$ candidate-set vertices (open circles) having occluded attributes. Edges with attribute green are of the not-of-interest topic (2), and red edges are of the interesting topic (1). Pairs of red vertices (regardless of occlusion) are connected according to probability distribution over topics $s$, while pairs of vertices where at least one is labeled green are connected according to $p$. The vertex nomination task is to select one vertex from the candidate set $V \setminus M'$ (the vertices with occluded attributes, shown here as open circles) that is in $M \setminus M'$ (truly red) with high probability.

6 Vertex Nomination

[Figure: the $\kappa_{n,m,m',r,p,s}$ schematic again, here annotated with the sets $M$, $M'$, $M \setminus M'$, $V \setminus M$, $V \setminus M'$, and $V$, and the connection distributions $s$ and $p$.]

7 Methods I

Random dot product graph (RDPG) models are a special case of latent position models. These latent position models posit a social space associated with the vertices, where the relative positions in this social space determine the relationship (edge) probabilities. We will first give the basic definitions, then discuss estimation in the model. The basic idea is that each vertex $v$ has associated with it a vector $x_v \in \mathbb{R}^d$, and these vectors determine the edge probabilities of the random graph via $P[v \sim w] = x_v^T x_w$. (The vectors must satisfy $0 \le x_v^T x_w \le 1$.) We denote by $X$ the $n \times d$ matrix whose $v$th row corresponds to $x_v$.
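
To fix ideas, here is a minimal numpy sketch (mine, not from the slides) of sampling such a graph from a given latent position matrix $X$; the function name `sample_rdpg` is hypothetical.

```python
import numpy as np

def sample_rdpg(X, rng=None):
    """Sample an undirected RDPG: P[v ~ w] = x_v^T x_w (no self-loops)."""
    rng = np.random.default_rng(rng)
    P = X @ X.T                                    # pairwise edge probabilities
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)   # one draw per unordered pair
    A = (upper | upper.T).astype(int)              # symmetrize; diagonal stays zero
    return A
```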

8 Methods II

Given a graph $G = (V, E)$ and a target dimensionality $d$ for the latent vectors, we can obtain an estimate $\hat{X}$ in a single step as follows:

1. Let $\tilde{A} = A + D$, where $D = \mathrm{diag}(\deg(v)/(n-1))$ is the diagonal matrix with the normalized vertex degrees on the diagonal.
2. Compute the eigenvectors $U$ and eigenvalues $\Lambda$ of $\tilde{A}$, and set all negative entries in $\Lambda$ to zero.
3. $\hat{X} = U_{1,\dots,d}\,\Lambda_{1,\dots,d}^{1/2}$.

This yields $\hat{X} = \arg\min_X \|\tilde{A} - XX^T\|_F$. Imputing the diagonal of $\tilde{A}$ allows for a one-step solution. The justification for the choice $\deg(v)/(n-1)$ is based on the observation that $E[\deg(v)] = \sum_{w \neq v} x_v^T x_w$; if all the vectors are the same ($x_w = x_v$ for all $w$) then $E[\deg(v)] = (n-1)\,x_v^T x_v$. Under the assumption that the latent vectors are random variables and are independent and identically distributed, $E[\deg(v)] \le (n-1)\,E[X_v^T X_v]$ by Cauchy-Schwarz.
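
The three steps above translate directly into a few lines of numpy. This sketch is my own, with the hypothetical name `rdpg_fit` standing in for the estimator $L(G, d)$ used later; it assumes a dense symmetric adjacency matrix.

```python
import numpy as np

def rdpg_fit(A, d):
    """One-step RDPG latent-position estimate (Steps 1-3 above)."""
    n = A.shape[0]
    A_tilde = A + np.diag(A.sum(axis=1) / (n - 1))   # Step 1: impute the diagonal
    evals, evecs = np.linalg.eigh(A_tilde)           # Step 2: eigendecompose
    evals = np.maximum(evals, 0.0)                   #         zero out negatives
    top = np.argsort(evals)[::-1][:d]                # keep the d largest eigenvalues
    return evecs[:, top] * np.sqrt(evals[top])       # Step 3: U_d Lambda_d^{1/2}
```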

9 Methods III

We extend the RDPG model to allow for edge attributes in a very natural way. The idea is to posit that edge existence and edge attributes are fundamentally tied. Intuitively, we are modeling the edges as if they were communications, with the attributes corresponding to topics, and the vectors in the RDPG model encoding the interest level of each individual in each of the topics. Let us consider the case where we allocate one dimension per edge attribute. Since $\Phi_E = \{0, 1, \dots, K\}$, we define $X$ to be an $n \times K$ matrix where each row $x_v$ satisfies $x_{vk} \ge 0$ for all $k$ and $\sum_k x_{vk} \le 1$. (This is sufficient, but not necessary, to guarantee that the vectors satisfy $0 \le x_v^T x_w \le 1$.) Then the attributed RDPG model ARDPG($X$) is given by

$$P[v \sim w] = x_v^T x_w \quad \text{and} \quad P[\varphi_E(vw) = k \mid v \sim w] = \frac{x_{vk}\, x_{wk}}{x_v^T x_w} \quad \text{for } k = 1, \dots, K.$$

10 Methods IV

Thus,

$$P[v \sim w,\ \varphi_E(vw) = k] = x_{vk}\, x_{wk}.$$

That is, the random variable $\varphi_E(vw)$ is distributed according to the discrete distribution

$$\varphi_E(vw) \sim \mathcal{D}\Big(\{0, 1, \dots, K\},\ \Big[1 - \sum_{k=1}^{K} x_{vk} x_{wk},\ x_{v1} x_{w1},\ \dots,\ x_{vK} x_{wK}\Big]\Big).$$
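
Putting slides 9 and 10 together, a hedged sketch (names my own) of sampling edges and attributes jointly from ARDPG($X$), drawing each pair's attribute from the discrete distribution $\mathcal{D}$ above:

```python
import numpy as np

def sample_ardpg(X, rng=None):
    """Sample edge attributes phi_E(vw) from ARDPG(X); 0 encodes 'no edge'."""
    rng = np.random.default_rng(rng)
    n, K = X.shape
    attr = np.zeros((n, n), dtype=int)
    for v in range(n):
        for w in range(v + 1, n):
            pk = X[v] * X[w]                               # P[v~w, topic k] = x_vk x_wk
            probs = np.concatenate(([1.0 - pk.sum()], pk)) # the D(...) distribution above
            attr[v, w] = attr[w, v] = rng.choice(K + 1, p=probs)
    A = (attr > 0).astype(int)                             # an edge iff attribute > 0
    return A, attr
```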

11 Methods V

If the matrix $X$ is itself random (for instance, if the individual vertex vectors, the rows of $X$, are obtained by first drawing $X_v \sim \mathrm{Dirichlet}(\alpha_v)$ with $\alpha_v \in \mathbb{R}_+^{K+1}$, so that the vectors $X_v$ satisfy the constraint $X_v^T X_w \in [0, 1]$), then the random variables $\varphi_E(vw)$ for ARDPG($X$) are not independent but are conditionally independent, given $X$. (In a slight abuse of standard notation, we will consider $X_v \sim \mathrm{Dirichlet}(\alpha_v)$ with $\alpha_v \in \mathbb{R}_+^{K+1}$ to be a length-$K$ random vector, with $X_{vk} \ge 0$ for all $k$ and $\sum_k X_{vk} \le 1$; that is, we drop the $(K+1)$st element of a standard Dirichlet, in which $X_{v(K+1)} = 1 - \sum_{k=1}^{K} X_{vk}$.) It is straightforward to extend this model to allow the vectors corresponding to each attribute to be multi-dimensional.

12 Methods VI

We modify the unattributed RDPG algorithm to produce an estimate of the vectors in the attributed case:

1. Let $L(G, d)$ denote the linear-algebra approach to fitting vectors of dimension $d$ to the unattributed graph $G$. Thus $L(G, d)$ is an $n \times d$ matrix $\hat{X}$ that is the estimate of the RDPG vectors (the parameters of the RDPG model) that produced $G$.
2. Given attributed graph $G^a$, for $k \in \{1, \dots, K\}$ let $G^a_k = (V, E^a_k)$, where $E^a_k = \{vw : \varphi_E(vw) = k\}$ is the subset of edges in $E^a$ with attribute $k$.
3. $\hat{x}_k = L(G^a_k, 1)$.
4. $\hat{X}^o = (\hat{x}_1, \dots, \hat{x}_K)$.

We can ensure that the resulting vectors are in the first orthant, although post-processing is necessary if we demand that the vectors be in the simplex (or the unit ball) so that $x_v^T x_w \in [0, 1]$. The simplicity of the algorithm is due to the decomposition in Step 2. NB: dependency is lost, as we process each $G^a_k$ independently.
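
A sketch of Steps 1-4, reusing the `rdpg_fit` sketch from Methods II as $L(G, d)$. The sign-resolution step is my choice rather than the slides': since each $G^a_k$ has a nonnegative adjacency matrix, its leading eigenvector can be taken entrywise nonnegative, so a sign flip suffices to land in the first orthant.

```python
import numpy as np

def ardpg_fit(attr, K):
    """Fit one latent dimension per edge attribute (Steps 1-4 above)."""
    cols = []
    for k in range(1, K + 1):
        A_k = (attr == k).astype(float)    # Step 2: the attribute-k subgraph G^a_k
        np.fill_diagonal(A_k, 0.0)
        x_k = rdpg_fit(A_k, 1).ravel()     # Step 3: L(G^a_k, 1)
        if x_k.sum() < 0:                  # resolve the arbitrary eigenvector sign
            x_k = -x_k                     # so the fitted vector is nonnegative
        cols.append(x_k)
    return np.column_stack(cols)           # Step 4: Xhat^o is n x K
```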

13 Methods VII

Given an observed attributed graph $G^o$ with only some of the vertex attributes observed, the vertex nomination procedure we propose is as follows. (1) Fit an attributed RDPG, obtaining $\hat{X}^o$; notice that this estimate does not use vertex attributes. (2) Using the subset $M'$ of vertices with observed vertex attributes, rank the candidate vertices $V \setminus M'$ according to their relationship to $M'$. Having recovered latent-space representations for the vertices in (1), the inferential step (2) may be performed in numerous ways; we present a few simple approaches in the experiments below.
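
One simple instance of step (2) is the nearest-neighbor count $N(v)$ used in the experiments below; this brute-force sketch (names my own) is illustrative only.

```python
import numpy as np

def nominate_by_knn(Xhat, observed, k):
    """Rank candidates V \\ M' by N(v): the count of observed vertices of
    interest (indices in `observed`) among v's k nearest latent-space neighbors."""
    n = Xhat.shape[0]
    candidates = np.setdiff1d(np.arange(n), observed)
    D = np.linalg.norm(Xhat[:, None, :] - Xhat[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                   # a vertex is not its own neighbor
    scores = np.array([np.isin(np.argsort(D[v])[:k], observed).sum()
                       for v in candidates])
    return candidates[np.argsort(-scores)]        # best-first nomination list
```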

14 Simulation Model I

We define an attributed RDPG model $\kappa(n, \pi_{V \setminus M}, m, \pi_M, r)$ for the vertex nomination task as follows. Let $\pi_{V \setminus M}$ and $\pi_M$ be in the standard (unit) $K$-simplex $\Delta_K \subset \mathbb{R}^{K+1}$; that is, they are $(K+1)$-dimensional probability vectors. Let $n > m > 0$ and $r > 0$ be given. Consider

$$X_{V \setminus M} \overset{iid}{\sim} \mathrm{Dirichlet}(r\pi_{V \setminus M} + 1) \quad \text{and} \quad X_M \overset{iid}{\sim} \mathrm{Dirichlet}(r\pi_M + 1)$$

to be $(n - m) \times K$ and $m \times K$ matrices, respectively, with iid rows. Write $X = (X_{V \setminus M}, X_M)$ as the $n \times K$ matrix of (random) vertex vectors. Then $\kappa(n, \pi_{V \setminus M}, m, \pi_M, r) = \mathrm{ARDPG}(X)$ is our attributed random dot product graph model, with each dimension corresponding to an edge attribute as described above. Again, the extension to multi-dimensional vertex vectors for each attribute is straightforward.
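
A sketch of drawing the random vertex-vector matrix $X$ for $\kappa(n, \pi_{V \setminus M}, m, \pi_M, r)$, assuming numpy's Dirichlet sampler and the drop-the-last-coordinate convention from Methods V; the helper name is hypothetical.

```python
import numpy as np

def draw_kappa_X(n, pi_vm, m, pi_m, r, rng=None):
    """Draw the n x K latent matrix for kappa(n, pi_{V\\M}, m, pi_M, r):
    rows iid Dirichlet(r*pi + 1), with the (K+1)st coordinate dropped."""
    rng = np.random.default_rng(rng)
    X_vm = rng.dirichlet(r * np.asarray(pi_vm) + 1, size=n - m)  # rows for V \ M
    X_m = rng.dirichlet(r * np.asarray(pi_m) + 1, size=m)        # rows for M
    return np.vstack([X_vm, X_m])[:, :-1]       # drop the redundant last coordinate
```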

15 Simulation Model II

Given $G^a \sim \kappa(n, \pi_{V \setminus M}, m, \pi_M, r)$, and for $0 < m' < m$, the observed attributed graph $G^o \sim \kappa(n, \pi_{V \setminus M}, m, \pi_M, r; m')$ involves selecting a random subset $M' \subset M$: all $n - m$ vertex attributes for $V \setminus M$ are missing, and $m - m'$ of the vertex attributes for $M$ are missing completely at random. For our simulation experiments, we consider $\pi_{V \setminus M} = [0.2, 0.2, 0.6]^T$ and $\pi_M = [q, 0.2, 0.8 - q]^T$ with $q$ ranging from 0.2 (no signal) to 0.8. The parameter $r$ controls the variability of the Dirichlet random vectors; $r = 0$ gives a uniform distribution on the simplex (no signal) and $r \to \infty$ yields point mass (no variability); we use $r = 100$. We consider $m = 10$ and $m' = 5$, and consider two cases for $n$ (100 and 250).

16 Simulation Performance Criteria

Tabular simulation results are presented, based on 1000 Monte Carlo replicates. We obtain the estimate $\hat{X}^o$ of the latent vertex vector matrix as described above, then rank the candidate vertices $V \setminus M'$ and evaluate based on (1) $p = P[v \in M \setminus M']$, where $v$ is the top-ranked candidate vertex, and (2) the Normalized Sum of Reciprocal Ranks, given by

$$\mathrm{NSRR} = \sum_{v \in M \setminus M'} \frac{1}{\mathrm{rank}(v)} \left( \sum_{i=1}^{m - m'} \frac{1}{i} \right)^{-1}.$$

Our candidate vertex rankings, based on the latent space estimate $\hat{X}^o$, are given by (1) the number $N(v)$ of observed signal vertices $M'$ amongst the $m$ nearest neighbors of $v$, and (2) the posteriors $\rho(v)$ from a linear discriminant analysis classifier.
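
For concreteness, the NSRR criterion as a plain-Python helper (my own, not from the slides) over a best-first nomination list:

```python
def nsrr(nomination_list, true_interest):
    """Normalized Sum of Reciprocal Ranks: nomination_list is best-first,
    true_interest is M \\ M', so the normalizer is sum_{i=1}^{m-m'} 1/i."""
    rank = {v: i + 1 for i, v in enumerate(nomination_list)}
    srr = sum(1.0 / rank[v] for v in true_interest)
    best = sum(1.0 / i for i in range(1, len(true_interest) + 1))
    return srr / best    # equals 1 when all of M \ M' top the list
```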

17 Simulation Results

[Table: $p$ and NSRR under both the $N(v)$ and $\rho(v)$ rankings, for $n = 100$ and $n = 250$, as a function of $q$; numerical entries omitted in transcription.]

Vertex nomination results as a function of $q$, based on 1000 Monte Carlo replicates. As expected, the performance improves as $q$ increases. Note that in this case the performance improves as $n$ increases. While the latent vertex vector estimation improves as $n$ increases, larger $n$ (for fixed $m, m'$) results in a harder vertex nomination problem. This trade-off deserves further study.

18 Enron Results

Vertex nomination results on two Enron graphs, $G^a_{T_0}$ (Sep-Feb, left) and $G^a_{T_1}$ (May-Sep, right). The symbols correspond to $M'$ (squares) and $V \setminus M'$ (circles). A two-sample Wilcoxon rank sum test on the ranks $r_v$ of the nearest element of $M'$ to $v$ yields $p < 0.01$ for $G^a_{T_1}$ and $p \approx 0.7$ for $G^a_{T_0}$: statistically significant signal ($M$ vs. $V \setminus M$) is identified for $T_1$ but not for $T_0$.

19 Conclusions & Discussion

[Figure: the $\kappa_{n,m,m',r,p,s}$ vertex nomination schematic, repeated from slide 2.]

20 Kronecker Quote

"The wealth of your practical experience with sane and interesting problems will give to mathematics a new direction and a new impetus."

Leopold Kronecker to Hermann von Helmholtz
