Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec


1 Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec Jiezhong Qiu Tsinghua University February 21, 2018 Joint work with Yuxiao Dong (MSR), Hao Ma (MSR), Jian Li (IIIS, Tsinghua), Kuansan Wang (MSR), Jie Tang (DCST, Tsinghua)

2 Motivation and Problem Formulation
Problem Formulation: Given a network $G = (V, E)$, we aim to learn a function $f : V \to \mathbb{R}^p$ that captures neighborhood similarity and community membership.
Applications: link prediction, community detection, label classification.
Figure 1: A toy example (figure from DeepWalk).

3 History of Network Embedding
2017 metapath2vec [Dong et al.]
2016 node2vec [Grover & Leskovec]
2015 LINE & PTE [Tang et al.]
2014 DeepWalk [Perozzi et al.]
word2vec (skip-gram) [Mikolov et al.]
2009 SocDim [Tang & Liu]
Spectral Clustering vs. Kernel k-means [Dhillon et al.]; Spectral Clustering [Ng et al.]; Image Segmentation [Shi & Malik]
A large body of literature: [Pothen et al.], [Simon], [Bolla], [Hagen & Kahng], [Hendrickson & Leland], [Van Driessche & Roose], [Barnard et al.], [Spielman & Teng], [Guattery & Miller]
Fiedler Vector [Fiedler]; Spectral Partitioning [Donath, Hoffman]

4 Contents
Preliminaries
  Notations
Main Theoretic Results
  DeepWalk (KDD 14)
  LINE (WWW 15)
  PTE (KDD 15)
  node2vec (KDD 16)
NetMF
  NetMF for a Small Window Size T
  NetMF for a Large Window Size T
Experiments

5 Notations
Consider an undirected weighted graph $G = (V, E)$, where $|V| = n$ and $|E| = m$.
Adjacency matrix $A \in \mathbb{R}^{n \times n}_+$: $A_{i,j} = a_{i,j} > 0$ if $(i, j) \in E$, and $A_{i,j} = 0$ if $(i, j) \notin E$.
Degree matrix $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i$ is the generalized degree of vertex $i$.
Volume of the graph $G$: $\mathrm{vol}(G) = \sum_i \sum_j A_{i,j}$.
Assumption: $G = (V, E)$ is connected, undirected, and not bipartite, which guarantees that the random walk on $G$ has a unique stationary distribution $P(w) = d_w / \mathrm{vol}(G)$.
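To make the notation concrete, here is a minimal NumPy sketch on an assumed toy graph (five vertices with a triangle, so it is connected and not bipartite, matching the assumption above); the names A, d, D, vol_G, P mirror the slide and are reused by later sketches.

    import numpy as np

    # Assumed toy graph: 5 vertices, connected, contains a triangle
    # (hence not bipartite).
    n = 5
    A = np.zeros((n, n))
    for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]:
        A[i, j] = A[j, i] = 1.0        # A_{i,j} = a_{i,j} > 0 iff (i,j) in E

    d = A.sum(axis=1)                  # generalized degrees d_i
    D = np.diag(d)                     # degree matrix
    vol_G = A.sum()                    # vol(G) = sum_{i,j} A_{i,j}

    P = A / d[:, None]                 # random-walk transition matrix D^{-1} A
    pi = d / vol_G                     # stationary distribution d_w / vol(G)
    assert np.allclose(pi @ P, pi)     # pi is indeed stationary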

6 DeepWalk Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding

7 DeepWalk: a Two-step Algorithm
Algorithm 1: DeepWalk
1 for n = 1, 2, ..., N do
2   Pick $w_1^n$ according to a probability distribution $P(w_1)$;
3   Generate a vertex sequence $(w_1^n, \dots, w_L^n)$ of length $L$ by a random walk on network $G$;
4   for j = 1, 2, ..., L − T do
5     for r = 1, ..., T do
6       Add vertex-context pair $(w_j^n, w_{j+r}^n)$ to multiset $\mathcal{D}$;
7       Add vertex-context pair $(w_{j+r}^n, w_j^n)$ to multiset $\mathcal{D}$;
8 Run SGNS on $\mathcal{D}$ with $b$ negative samples.
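A minimal Python sketch of the corpus-building step of Algorithm 1 (not the authors' implementation; choosing $w_1$ uniformly and sampling neighbors proportionally to edge weight are assumptions):

    import random

    def deepwalk_corpus(A, N, L, T, seed=0):
        """Build the vertex-context multiset D of Algorithm 1 (a sketch).
        N walks of length L, window size T, as in the slide."""
        rng = random.Random(seed)
        n = len(A)
        D = []
        for _ in range(N):
            walk = [rng.randrange(n)]          # one choice of P(w_1): uniform
            for _ in range(L - 1):
                v = walk[-1]
                nbrs = [u for u in range(n) if A[v][u] > 0]
                wts = [A[v][u] for u in nbrs]
                walk.append(rng.choices(nbrs, weights=wts)[0])
            for j in range(L - T):
                for r in range(1, T + 1):
                    D.append((walk[j], walk[j + r]))   # (w_j, w_{j+r})
                    D.append((walk[j + r], walk[j]))   # (w_{j+r}, w_j)
        return D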

8 DeepWalk Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Levy & Goldberg (NIPS 14) notation:
$\#(w,c)$: co-occurrence count of $w$ and $c$
$|\mathcal{D}|$: total number of word-context pairs
$b$: number of negative samples
$\#(w)$: occurrence count of word $w$
$\#(c)$: occurrence count of context $c$

9 Skip-gram with Negative Sampling (SGNS)
SGNS maintains a multiset $\mathcal{D}$ that counts the occurrences of each word-context pair $(w, c)$. Objective:
$\mathcal{L} = \sum_w \sum_c \left( \#(w,c) \log g(x_w^\top y_c) + \frac{b \#(w) \#(c)}{|\mathcal{D}|} \log g(-x_w^\top y_c) \right)$,
where $x_w, y_c \in \mathbb{R}^d$, $g$ is the sigmoid function, and $b$ is the number of negative samples for SGNS.
For sufficiently large dimensionality $d$, this is equivalent to factorizing the PMI matrix (Levy & Goldberg, NIPS 14):
$\log \left( \frac{\#(w,c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)} \right)$.
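The Levy & Goldberg matrix can be tabulated directly from the multiset D; a sketch (zero-count entries, which are $-\infty$ in theory, are left at 0 here purely for illustration, an assumption):

    import numpy as np
    from collections import Counter

    def shifted_pmi(D, n, b=1.0):
        """The matrix log(#(w,c)|D| / (b #(w) #(c))) that SGNS implicitly
        factorizes (Levy & Goldberg). A dense sketch for small n."""
        pairs = Counter(D)
        count_w = Counter(w for w, _ in D)     # #(w)
        count_c = Counter(c for _, c in D)     # #(c)
        total = len(D)                         # |D|
        M = np.zeros((n, n))
        for (w, c), k in pairs.items():
            M[w, c] = np.log(k * total / (b * count_w[w] * count_c[c]))
        return M

For example, shifted_pmi(deepwalk_corpus(A, N=10, L=80, T=2), n) on the toy graph gives the empirical matrix that the theorems on the following slides characterize in closed form.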

10 DeepWalk Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Levy & Goldberg (NIPS 14) notation:
$\#(w,c)$: co-occurrence count of $w$ and $c$
$|\mathcal{D}|$: total number of word-context pairs
$b$: number of negative samples
$\#(w)$: occurrence count of word $w$
$\#(c)$: occurrence count of context $c$

11 DeepWalk
[Figure: toy graph on vertices a, b, c, d, e, with vertex-context pairs (c, a), (c, d), (c, e)]
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log \frac{\#(w,c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}$ in graph-theoretic terms?

12 DeepWalk
[Figure: toy graph on vertices a, b, c, d, e, with vertex-context pairs (c, a), (c, d), (c, e)]
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log \frac{\#(w,c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}$ in graph-theoretic terms?
Challenge: We mix many things together, i.e., direction and distance.

13 DeepWalk
[Figure: toy graph on vertices a, b, c, d, e, with vertex-context pairs (c, a), (c, d), (c, e)]
Question: Suppose the multiset $\mathcal{D}$ is constructed from random walks on a graph. Can we interpret $\log \frac{\#(w,c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)}$ in graph-theoretic terms?
Challenge: We mix many things together, i.e., direction and distance.
Solution: Let's distinguish them!

14 DeepWalk
Partition the multiset $\mathcal{D}$ into several sub-multisets according to the way in which a vertex and its context appear in a random-walk sequence. More formally, for $r = 1, \dots, T$, define
$\mathcal{D}_{r\to} = \{ (w, c) : (w, c) \in \mathcal{D},\ w = w_j^n,\ c = w_{j+r}^n \}$,
$\mathcal{D}_{r\leftarrow} = \{ (w, c) : (w, c) \in \mathcal{D},\ w = w_{j+r}^n,\ c = w_j^n \}$.
[Figure: the toy example's pairs assigned to sub-multisets, e.g. (c, d) to $\mathcal{D}_{1\to}$ and (c, a), (c, e) to $\mathcal{D}_{2\to}$ / $\mathcal{D}_{2\leftarrow}$]

15 DeepWalk as Implicit Matrix Factorization
Some observations:
Observation 1: $\log \frac{\#(w,c)\,|\mathcal{D}|}{b\,\#(w)\,\#(c)} = \log \frac{\#(w,c)/|\mathcal{D}|}{b \left( \#(w)/|\mathcal{D}| \right) \left( \#(c)/|\mathcal{D}| \right)}$.
Observation 2: $\frac{\#(w,c)}{|\mathcal{D}|} = \frac{1}{2T} \sum_{r=1}^T \left( \frac{\#(w,c)_{r\to}}{|\mathcal{D}_{r\to}|} + \frac{\#(w,c)_{r\leftarrow}}{|\mathcal{D}_{r\leftarrow}|} \right)$.
So it is sufficient to characterize $\#(w,c)_{r\to} / |\mathcal{D}_{r\to}|$ and $\#(w,c)_{r\leftarrow} / |\mathcal{D}_{r\leftarrow}|$.

16 DeepWalk Theorems
Theorem. Denote $P = D^{-1} A$. When the length of the random walk $L \to \infty$,
$\frac{\#(w,c)_{r\to}}{|\mathcal{D}_{r\to}|} \xrightarrow{p} \frac{d_w}{\mathrm{vol}(G)} (P^r)_{w,c}$ and $\frac{\#(w,c)_{r\leftarrow}}{|\mathcal{D}_{r\leftarrow}|} \xrightarrow{p} \frac{d_c}{\mathrm{vol}(G)} (P^r)_{c,w}$.
Theorem. When the length of the random walk $L \to \infty$, we have
$\frac{\#(w,c)}{|\mathcal{D}|} \xrightarrow{p} \frac{1}{2T} \sum_{r=1}^T \left( \frac{d_w}{\mathrm{vol}(G)} (P^r)_{w,c} + \frac{d_c}{\mathrm{vol}(G)} (P^r)_{c,w} \right)$.
Theorem. For DeepWalk, when the length of the random walk $L \to \infty$,
$\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)} \xrightarrow{p} \frac{\mathrm{vol}(G)}{2T} \left( \frac{1}{d_c} \sum_{r=1}^T (P^r)_{w,c} + \frac{1}{d_w} \sum_{r=1}^T (P^r)_{c,w} \right)$.

17 DeepWalk Conclusion
Theorem. DeepWalk is asymptotically and implicitly factorizing
$\log \left( \frac{\mathrm{vol}(G)}{b} \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r \right) D^{-1} \right)$.

18 DeepWalk Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Levy & Goldberg (NIPS 14); $A$: adjacency matrix, $D$: degree matrix, $b$: number of negative samples.

19 LINE
Objective of LINE:
$\mathcal{L} = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} \left( A_{i,j} \log g(x_i^\top y_j) + \frac{b\, d_i d_j}{\mathrm{vol}(G)} \log g(-x_i^\top y_j) \right)$.
Align it with the objective of SGNS:
$\mathcal{L} = \sum_w \sum_c \left( \#(w,c) \log g(x_w^\top y_c) + \frac{b \#(w) \#(c)}{|\mathcal{D}|} \log g(-x_w^\top y_c) \right)$.
LINE is actually factorizing
$\log \left( \frac{\mathrm{vol}(G)}{b} D^{-1} A D^{-1} \right)$.
Recall DeepWalk's matrix form:
$\log \left( \frac{\mathrm{vol}(G)}{b} \left( \frac{1}{T} \sum_{r=1}^T (D^{-1} A)^r \right) D^{-1} \right)$.
Observation: LINE is a special case of DeepWalk ($T = 1$).
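The $T = 1$ claim is easy to verify numerically; a sketch of both closed-form matrices (before the elementwise log), with the commented assert showing they coincide on any graph such as the toy A above:

    import numpy as np

    def deepwalk_matrix(A, T, b=1.0):
        """vol(G)/(bT) * (sum_{r=1}^T (D^{-1}A)^r) D^{-1}, the matrix inside
        DeepWalk's log (dense NumPy sketch)."""
        d = A.sum(axis=1)
        P = A / d[:, None]
        S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1))
        return (A.sum() / (b * T)) * S @ np.diag(1.0 / d)

    def line_matrix(A, b=1.0):
        """vol(G)/b * D^{-1} A D^{-1}, the matrix inside LINE's log."""
        d = A.sum(axis=1)
        return (A.sum() / b) * np.diag(1.0 / d) @ A @ np.diag(1.0 / d)

    # LINE is the T = 1 special case of DeepWalk:
    # assert np.allclose(deepwalk_matrix(A, T=1), line_matrix(A))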

20 PTE
Figure 2: Heterogeneous Text Network.
word-word network $G_{ww}$, $A_{ww} \in \mathbb{R}^{\#\text{word} \times \#\text{word}}$.
document-word network $G_{dw}$, $A_{dw} \in \mathbb{R}^{\#\text{doc} \times \#\text{word}}$.
label-word network $G_{lw}$, $A_{lw} \in \mathbb{R}^{\#\text{label} \times \#\text{word}}$.

21 PTE as Implicit Matrix Factorization
$\log \begin{pmatrix} \alpha\, \mathrm{vol}(G_{ww}) (D_{\mathrm{row}}^{ww})^{-1} A_{ww} (D_{\mathrm{col}}^{ww})^{-1} \\ \beta\, \mathrm{vol}(G_{dw}) (D_{\mathrm{row}}^{dw})^{-1} A_{dw} (D_{\mathrm{col}}^{dw})^{-1} \\ \gamma\, \mathrm{vol}(G_{lw}) (D_{\mathrm{row}}^{lw})^{-1} A_{lw} (D_{\mathrm{col}}^{lw})^{-1} \end{pmatrix} - \log b$.
The matrix is of shape $(\#\text{word} + \#\text{doc} + \#\text{label}) \times \#\text{word}$.
$b$ is the number of negative samples in training.
$\{\alpha, \beta, \gamma\}$ are hyper-parameters balancing the weights of the three networks.
In PTE, $\{\alpha, \beta, \gamma\}$ satisfy $\alpha\, \mathrm{vol}(G_{ww}) = \beta\, \mathrm{vol}(G_{dw}) = \gamma\, \mathrm{vol}(G_{lw})$.

22 node2vec: 2nd-Order Random Walk
$T_{u,v,w} = \begin{cases} \frac{1}{p} & (u,v) \in E, (v,w) \in E, u = w; \\ 1 & (u,v) \in E, (v,w) \in E, u \neq w, (w,u) \in E; \\ \frac{1}{q} & (u,v) \in E, (v,w) \in E, u \neq w, (w,u) \notin E; \\ 0 & \text{otherwise}. \end{cases}$
$P_{u,v,w} = \mathrm{Prob}(w_{j+1} = u \mid w_j = v, w_{j-1} = w) = \frac{T_{u,v,w}}{\sum_u T_{u,v,w}}$.
Stationary distribution: $\sum_w P_{u,v,w} X_{v,w} = X_{u,v}$.
Existence is guaranteed by the Perron-Frobenius theorem, but it may not be unique.
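A brute-force sketch of this edge-level stationary distribution via power iteration; the dense $O(n^3)$ tensor restricts it to toy graphs, and everything beyond the slide's definitions (initialization, iteration count) is an assumption:

    import numpy as np

    def node2vec_stationary(A, p, q, iters=500):
        """Edge-level stationary distribution X of the 2nd-order walk by
        power iteration (a sketch). Existence follows from Perron-Frobenius;
        uniqueness is not guaranteed, per the slide."""
        n = len(A)
        E = A > 0
        # T3[u, v, w]: unnormalized weight of stepping v -> u, given the
        # previous vertex was w (the slide's T_{u,v,w}).
        T3 = np.zeros((n, n, n))
        for v in range(n):
            for w in range(n):
                if not E[w, v]:
                    continue
                for u in range(n):
                    if E[v, u]:
                        T3[u, v, w] = 1/p if u == w else (1.0 if E[w, u] else 1/q)
        Z = T3.sum(axis=0, keepdims=True)
        P2 = np.divide(T3, Z, out=np.zeros_like(T3), where=Z > 0)
        X = E.T / E.sum()                      # start uniform over edge states
        for _ in range(iters):
            X = np.einsum('uvw,vw->uv', P2, X) # X_{u,v} = sum_w P_{u,v,w} X_{v,w}
        return X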

23 node2vec as Implicit Matrix Factorization
Theorem. node2vec is asymptotically and implicitly factorizing a matrix whose entry at the $w$-th row and $c$-th column is
$\log \frac{ \frac{1}{2T} \sum_{r=1}^T \left( \sum_u X_{w,u} P^r_{c,w,u} + \sum_u X_{c,u} P^r_{w,c,u} \right) }{ b \left( \sum_u X_{w,u} \right) \left( \sum_u X_{c,u} \right) }$.

24 Contents
Preliminaries
  Notations
Main Theoretic Results
  DeepWalk (KDD 14)
  LINE (WWW 15)
  PTE (KDD 15)
  node2vec (KDD 16)
NetMF
  NetMF for a Small Window Size T
  NetMF for a Large Window Size T
Experiments

25 Roadmap
Input: G = (V, E) → Random Walk → Skip-gram → Output: Node Embedding
Levy & Goldberg (NIPS 14) → Matrix Factorization

26 NetMF
Factorize the DeepWalk matrix:
$\log \left( \frac{\mathrm{vol}(G)}{b} \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r \right) D^{-1} \right)$.
For numerical reasons, we use the truncated logarithm $\widetilde{\log}(x) = \log(\max(1, x))$.
Figure 3: Truncated Logarithm.

27 NetMF for a Small Window Size T
Algorithm 2: NetMF for a Small Window Size T
1 Compute $P^1, \dots, P^T$;
2 Compute $M = \frac{\mathrm{vol}(G)}{bT} \left( \sum_{r=1}^T P^r \right) D^{-1}$;
3 Compute $M' = \max(M, 1)$;
4 Rank-$d$ approximation by SVD: $\log M' = U_d \Sigma_d V_d^\top$;
5 return $U_d \sqrt{\Sigma_d}$ as network embedding.
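A dense NumPy sketch of Algorithm 2 (a re-implementation in spirit, not the authors' code; the defaults for T, b, dim are assumptions):

    import numpy as np

    def netmf_small_T(A, T=10, b=1.0, dim=128):
        """Sketch of Algorithm 2 (NetMF for a small window size T)."""
        n = A.shape[0]
        d = A.sum(axis=1)
        vol_G = A.sum()
        P = A / d[:, None]                      # P = D^{-1} A
        S, Pr = np.zeros((n, n)), np.eye(n)
        for _ in range(T):                      # steps 1-2: sum_{r=1}^T P^r
            Pr = Pr @ P
            S += Pr
        M = (vol_G / (b * T)) * S @ np.diag(1.0 / d)
        logM = np.log(np.maximum(M, 1.0))       # step 3: truncated logarithm
        U, sigma, _ = np.linalg.svd(logM)       # step 4: rank-d SVD
        dim = min(dim, n)
        return U[:, :dim] * np.sqrt(sigma[:dim])  # step 5: U_d sqrt(Sigma_d)

The released code at github.com/xptree/netmf implements the same steps with sparse matrices and more care.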

28 NetMF for a Large Window Size T: Observations
We want to factorize $\log \left( \frac{\mathrm{vol}(G)}{b} \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r \right) D^{-1} \right)$.
We know the spectral properties of the normalized adjacency matrix:
$D^{-1/2} A D^{-1/2} = U \Lambda U^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ and $\lambda_i \in [-1, 1]$.
Then
$\frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r D^{-1} = D^{-1/2} \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1/2} A D^{-1/2} \right)^r \right) D^{-1/2} = D^{-1/2} U \underbrace{\left( \frac{1}{T} \sum_{r=1}^T \Lambda^r \right)}_{\text{a polynomial}} U^\top D^{-1/2}$.
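The identity can be checked numerically on any small graph (such as the toy A above); a sketch:

    import numpy as np

    def check_filter_identity(A, T=5):
        """Numerically verify (1/T) sum_r (D^{-1}A)^r D^{-1}
        = D^{-1/2} U ((1/T) sum_r Lambda^r) U^T D^{-1/2}."""
        d = A.sum(axis=1)
        Dis = np.diag(1.0 / np.sqrt(d))          # D^{-1/2}
        lam, U = np.linalg.eigh(Dis @ A @ Dis)   # Lambda, U
        f = sum(lam ** r for r in range(1, T + 1)) / T   # the polynomial filter
        lhs = sum(np.linalg.matrix_power(np.diag(1.0 / d) @ A, r)
                  for r in range(1, T + 1)) / T @ np.diag(1.0 / d)
        rhs = Dis @ U @ np.diag(f) @ U.T @ Dis
        assert np.allclose(lhs, rhs)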

29 NetMF for a Large Window Size T: Observations
[Figure 4: the filter $f(\lambda) = \frac{1}{T} \sum_{r=1}^T \lambda^r$, plotting eigenvalues after filtering against eigenvalues before filtering for several window sizes (T = 1, 2, 5, ...)]
Idea: This polynomial implicitly filters out negative eigenvalues and small positive eigenvalues, so why not do it explicitly?

30 NetMF for a Large Window Size T: Algorithm
Algorithm 3: NetMF for a Large Window Size T
1 Eigen-decomposition $D^{-1/2} A D^{-1/2} \approx U_h \Lambda_h U_h^\top$;
2 Approximate $M$ with $\hat{M} = \frac{\mathrm{vol}(G)}{b} D^{-1/2} U_h \left( \frac{1}{T} \sum_{r=1}^T \Lambda_h^r \right) U_h^\top D^{-1/2}$;
3 Compute $\hat{M}' = \max(\hat{M}, 1)$;
4 Rank-$d$ approximation by SVD: $\log \hat{M}' = U_d \Sigma_d V_d^\top$;
5 return $U_d \sqrt{\Sigma_d}$ as network embedding.
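A sketch of Algorithm 3 using SciPy's Lanczos-based eigsh for the top-h eigenpairs (the defaults for T, b, h, dim and the 'LA' ordering are assumptions, not from the slide):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    def netmf_large_T(A, T=10, b=1.0, h=256, dim=128):
        """Sketch of Algorithm 3: rank-h eigen-approximation of the
        normalized adjacency, then the spectral filter. A is a scipy.sparse
        adjacency matrix."""
        n = A.shape[0]
        d = np.asarray(A.sum(axis=1)).ravel()
        vol_G = d.sum()
        Dis = sp.diags(1.0 / np.sqrt(d))                # D^{-1/2}
        lam, U = eigsh(Dis @ A @ Dis, k=min(h, n - 1), which='LA')  # step 1
        f = sum(lam ** r for r in range(1, T + 1)) / T  # (1/T) sum_r Lambda^r
        Y = Dis @ U                                     # D^{-1/2} U_h
        M_hat = (vol_G / b) * (Y * f) @ Y.T             # step 2
        logM = np.log(np.maximum(M_hat, 1.0))           # step 3
        Ud, sigma, _ = np.linalg.svd(logM)              # step 4
        dim = min(dim, n)
        return Ud[:, :dim] * np.sqrt(sigma[:dim])       # step 5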

31 Setup
Label classification on BlogCatalog, PPI, Wikipedia, and Flickr, with logistic regression.
NetMF (T = 1) vs. LINE; NetMF (T = 10) vs. DeepWalk.
Table 1: Statistics of Datasets.
Dataset    BlogCatalog   PPI      Wikipedia   Flickr
|V|        10,312        3,890    4,777       80,513
|E|        333,983       76,584   184,812     5,899,882
#Labels    39            50       40          195
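A scikit-learn sketch of the evaluation protocol (one-vs-rest logistic regression, Micro-/Macro-F1; using plain predict() here is a simplification of the usual top-k multi-label protocol, an assumption):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier

    def evaluate(emb, Y, train_ratio=0.5, seed=0):
        """One-vs-rest logistic regression on node embeddings, returning
        (Micro-F1, Macro-F1). Y is a binary label-indicator matrix."""
        Xtr, Xte, Ytr, Yte = train_test_split(
            emb, Y, train_size=train_ratio, random_state=seed)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
        clf.fit(Xtr, Ytr)
        pred = clf.predict(Xte)
        return (f1_score(Yte, pred, average='micro'),
                f1_score(Yte, pred, average='macro'))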

32 Experimental Results
[Figure 5: Predictive performance on varying the ratio of training data, with one column per dataset (BlogCatalog, PPI, Wikipedia, Flickr) and curves for NetMF (T=1), LINE, NetMF (T=10), and DeepWalk. The x-axis is the ratio of labeled data (%); the y-axes in the top and bottom rows are Micro-F1 (%) and Macro-F1 (%), respectively.]

33 Conclusion
Table 2: The matrices that are implicitly approximated and factorized by DeepWalk, LINE, PTE, and node2vec.
DeepWalk: $\log \left( \mathrm{vol}(G) \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r \right) D^{-1} \right) - \log b$
LINE: $\log \left( \mathrm{vol}(G)\, D^{-1} A D^{-1} \right) - \log b$
PTE: $\log \begin{pmatrix} \alpha\, \mathrm{vol}(G_{ww}) (D_{\mathrm{row}}^{ww})^{-1} A_{ww} (D_{\mathrm{col}}^{ww})^{-1} \\ \beta\, \mathrm{vol}(G_{dw}) (D_{\mathrm{row}}^{dw})^{-1} A_{dw} (D_{\mathrm{col}}^{dw})^{-1} \\ \gamma\, \mathrm{vol}(G_{lw}) (D_{\mathrm{row}}^{lw})^{-1} A_{lw} (D_{\mathrm{col}}^{lw})^{-1} \end{pmatrix} - \log b$
node2vec: $\log \frac{ \frac{1}{2T} \sum_{r=1}^T \left( \sum_u X_{w,u} P^r_{c,w,u} + \sum_u X_{c,u} P^r_{w,c,u} \right) }{ \left( \sum_u X_{w,u} \right) \left( \sum_u X_{c,u} \right) } - \log b$

34 Thanks
"Standing on the shoulders of giants." — Isaac Newton
Code available at github.com/xptree/netmf
Q&A

35 DeepWalk: Sketched Proof
Theorem. Denote $P = D^{-1} A$. When $L \to \infty$, we have
$\frac{\#(w,c)_{r\to}}{|\mathcal{D}_{r\to}|} \xrightarrow{p} \frac{d_w}{\mathrm{vol}(G)} (P^r)_{w,c}$ and $\frac{\#(w,c)_{r\leftarrow}}{|\mathcal{D}_{r\leftarrow}|} \xrightarrow{p} \frac{d_c}{\mathrm{vol}(G)} (P^r)_{c,w}$.
Proof. Consider the special case $N = 1$, so we have a single vertex sequence $w_1, \dots, w_L$ generated by a random walk. Let $Y_j$ ($j = 1, \dots, L - T$) be the indicator of the event that $w_j = w$ and $w_{j+r} = c$.

36 Proof (Cont'd)
Observation:
$\mathbb{E}[Y_j] = \mathrm{Prob}(w_j = w, w_{j+r} = c)$;
$\frac{\#(w,c)_{r\to}}{|\mathcal{D}_{r\to}|} = \frac{1}{L-T} \sum_{j=1}^{L-T} Y_j$;
$\mathrm{Cov}(Y_i, Y_j) \to 0$ as $|i - j| \to \infty$.
Lemma (S. N. Bernstein's Law of Large Numbers). Let $Y_1, Y_2, \dots$ be a sequence of random variables with finite expectations $\mathbb{E}[Y_j]$, variances $\mathrm{Var}(Y_j) < K$ for all $j \geq 1$, and covariances such that $\mathrm{Cov}(Y_i, Y_j) \to 0$ as $|i - j| \to \infty$. Then the law of large numbers (LLN) holds. Therefore
$\frac{\#(w,c)_{r\to}}{|\mathcal{D}_{r\to}|} = \frac{1}{L-T} \sum_{j=1}^{L-T} Y_j \xrightarrow{p} \frac{1}{L-T} \sum_{j=1}^{L-T} \mathbb{E}[Y_j] \to \frac{d_w}{\mathrm{vol}(G)} (P^r)_{w,c}$.
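The convergence can be spot-checked by simulation; a sketch that runs one long walk and compares the empirical co-occurrence frequency with its limit (the walk length L is an assumed setting):

    import numpy as np

    def cooccurrence_check(P, d, vol_G, w, c, r, L=200_000, seed=0):
        """Monte Carlo spot-check: along one long walk, the frequency of
        (w_j = w, w_{j+r} = c) approaches (d_w / vol(G)) (P^r)_{w,c}."""
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        walk = [rng.integers(n)]
        for _ in range(L - 1):
            walk.append(rng.choice(n, p=P[walk[-1]]))
        walk = np.asarray(walk)
        empirical = np.mean((walk[:-r] == w) & (walk[r:] == c))
        limit = d[w] / vol_G * np.linalg.matrix_power(P, r)[w, c]
        return empirical, limit

    # e.g. cooccurrence_check(P, d, vol_G, w=2, c=3, r=2) on the toy graph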

37 Time Complexity
Eigen-decomposition (implicitly restarted Lanczos method): $O(mhI + nh^2 I + h^3 I)$.
Reconstruction: $O(n^2 h)$.
Element-wise logarithm: $O(n^2)$.
SVD (a naive implementation via eigen-decomposition): $O(n^2 d I + n d^2 I + d^3 I)$.

38 Future Work
Comprehend high-order cases, e.g., node2vec:
$\log \frac{ \frac{1}{2T} \sum_{r=1}^T \left( \sum_u X_{w,u} P^r_{c,w,u} + \sum_u X_{c,u} P^r_{w,c,u} \right) }{ b \left( \sum_u X_{w,u} \right) \left( \sum_u X_{c,u} \right) }$
Design scalable algorithms (e.g., using spectral sparsification of random-walk polynomials) for
$\log \left( \frac{\mathrm{vol}(G)}{b} \left( \frac{1}{T} \sum_{r=1}^T \left( D^{-1} A \right)^r \right) D^{-1} \right)$.
Connection with graph convolutional networks (Kipf & Welling, ICLR 17).
