Memory-Efficient Low Rank Approximation of Massive Graphs


1 Fast and Memory-Efficient Low Rank Approximation of Massive Graphs
Inderjit S. Dhillon, University of Texas at Austin.
Purdue University, Jan 31, 2012.
Joint work with Berkant Savas, Donghyuk Shin, Si Si, Han Song, Yin Zhang, Lili Qiu, Xin Sui, and Keshav Pingali.

2 Networks. Social networks (Facebook, MySpace, LiveJournal, ...); communication networks (SKT); biological networks (linking genes to diseases); web graphs; collaboration and citation networks; ...
Challenges: massive scale (Facebook's social graph has over 750 million vertices); multiple sources of information. We need scalable algorithms and good predictions.

3 Social Networks. Nodes: individual actors (people or groups of people). Edges: relationships (social interactions) between nodes. Example: Karate Club Network.

4 Social Network Analysis. Structural properties, models, and dynamics of social networks; behavioral analysis; game theory.
Structural properties:
- Proximity measures: distances between pairs of vertices
- Community detection
- Strong ties vs. weak ties
- Link prediction: who will become whose friend in the future
- Identifying important people (centrality)
- Quantifying information flow through a link (communicability)
- Identifying bridging links (betweenness)

5 Link Prediction Problem. Consider a social network that evolves with time: A^{(t-2)}, A^{(t-1)}, A^{(t)}, and the unknown A^{(t+1)}.
[Figure: example graphs at successive time steps over the nodes Alice, Bob, Carol, Dave, and Eve.]
Q: Can we predict links that will form at time step t + 1?
A: Many models exist, e.g., common neighbors, the Katz measure, etc.
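
To make the common-neighbors baseline concrete: the scores can be read off the square of the adjacency matrix. The following is a minimal sketch (not from the slides), assuming a dense, symmetric 0/1 adjacency matrix A:

```python
import numpy as np

def common_neighbor_scores(A):
    """Common-neighbor link scores: (A @ A)[i, j] counts length-2 walks,
    i.e., the neighbors shared by nodes i and j."""
    CN = A @ A
    np.fill_diagonal(CN, 0)           # ignore self-pairs
    return np.where(A > 0, 0, CN)     # keep scores only for non-edges
```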

6 Matrix Functions. The adjacency matrix A = [a_{ij}] of a graph G = (V, E) is given by a_{ij} = w_{ij} if there is an edge between nodes i and j, and a_{ij} = 0 otherwise.
Powers of A: (A^n)_{ij} counts the number of length-n walks between nodes i and j.
Matrix functions serve as a useful conceptual and computational tool, e.g., the matrix exponential and the Katz measure:
f(A) = e^A = I + A + A^2/2! + A^3/3! + ... + A^k/k! + ...
f(A) = (I - \beta A)^{-1} = I + \beta A + \beta^2 A^2 + \beta^3 A^3 + ...
E. Estrada and D. J. Higham. Network Properties Revealed through Matrix Functions. SIAM Review, 2010.
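
For intuition, the Katz series can be evaluated by truncating the sum after a few terms; a small sketch, assuming a dense A and a damping parameter beta chosen small enough for convergence (beta < 1/||A||_2):

```python
import numpy as np

def katz_truncated(A, beta=0.005, terms=10):
    """Truncate f(A) = I + beta*A + beta^2*A^2 + ... after a fixed
    number of terms; beta^n * A^n down-weights longer walks."""
    n = A.shape[0]
    term, K = np.eye(n), np.eye(n)
    for _ in range(terms):
        term = beta * (term @ A)      # beta^n A^n
        K += term
    return K
```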

7 Matrix Functions. Centrality: f(A)_{ii} quantifies node i's well-connectedness. Communicability: f(A)_{ij} measures how easy it is for information to pass from node i to node j. Betweenness: quantifies the influence of a node as information flows around a network, i.e., how communicability changes when the node is removed:
betweenness(r) = \sum_{i \ne j, i \ne r, j \ne r} [f(A)_{ij} - f(A - E(r))_{ij}] / f(A)_{ij},
where A - E(r) is A with all edges of node r removed.

8 Low-Rank Approximations. Truncated SVD: A ≈ V Λ V^T allows approximations of matrix functions:
e^A ≈ V e^Λ V^T
(I - βA)^{-1} ≈ V (I - βΛ)^{-1} V^T
BUT the SVD has a major drawback. For example, the LiveJournal graph: 3.8 million nodes and 65 million edges; average degree = 17.1; it requires about 1 GB of memory to store the adjacency matrix, while a rank-100 approximation requires about 6 GB of memory.
PROBLEM: the SVD does not preserve SPARSITY.
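
These function approximations follow directly from a truncated eigendecomposition; here is a minimal sketch using scipy.sparse.linalg.eigsh for a symmetric adjacency matrix (the rank k and beta values are illustrative choices, not from the slides):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lowrank_matrix_functions(A, k=100, beta=0.005):
    """A ≈ V Λ V^T with k extremal eigenpairs, then apply f to Λ only:
    e^A ≈ V e^Λ V^T and (I - βA)^{-1} ≈ V (I - βΛ)^{-1} V^T."""
    lam, V = eigsh(A, k=k)                          # k largest-|λ| eigenpairs
    expA = (V * np.exp(lam)) @ V.T                  # column-scale V by e^{λ_i}
    katz = (V * (1.0 / (1.0 - beta * lam))) @ V.T
    return expA, katz
```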

9 SVD is NOT Optimal. Social network with 13,500 nodes and 365,000 edges (from LiveJournal).
[Figures: relative reconstruction error (%) of the SVD vs. our method with 10 clusters, plotted against the rank of approximation and against memory usage (number of floating point numbers).]

10 Proposed Method

11 Graph Partitioning/Clustering. In many applications, the goal is to partition/cluster the nodes of a graph. Example: Karate Club Network.

12 Graph clustering. Cluster the vertices V into c disjoint sets V_i: V = ∪_{i=1}^c V_i and V_i ∩ V_j = ∅ for i ≠ j. Objective functions for clustering (see the sketch below for evaluating them):
RCut = min_{V_1,...,V_c} \sum_{i=1}^c links(V_i, V \ V_i) / |V_i|   (Chan et al. 1994)
NCut = min_{V_1,...,V_c} \sum_{i=1}^c links(V_i, V \ V_i) / degree(V_i)   (Shi and Malik 2000)
KLObj = min_{V_1,...,V_c} \sum_{i=1}^c links(V_i, V \ V_i) s.t. |V_i| = |V_j|   (Kernighan and Lin 1970)
Spectral clustering can be expensive for massive graphs. Fast clustering algorithms without eigenvector computations: GRACLUS (Dhillon et al. 2007) and METIS (Karypis and Kumar 1999).
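
For concreteness, a small sketch that evaluates the RCut and NCut objectives for a given partition (assumptions: dense symmetric weighted A, and an integer cluster label per vertex):

```python
import numpy as np

def cut_objectives(A, labels):
    """Evaluate RCut and NCut for a given clustering.
    links(V_i, V \\ V_i) = total edge weight leaving cluster i."""
    rcut = ncut = 0.0
    deg = A.sum(axis=1)
    for c in np.unique(labels):
        inside = labels == c
        links_out = A[inside][:, ~inside].sum()
        rcut += links_out / inside.sum()          # normalize by cluster size
        ncut += links_out / deg[inside].sum()     # normalize by cluster degree
    return rcut, ncut
```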

13 Multilevel partitioning scheme Repeated coarsening and partitioning

14 Clustering example: arXiv data graph condmat (21,363 vertices, 182,628 edges).
  clustering   # of clusters   edges within clusters
  GRACLUS      10              79.8%
  METIS        10              77.9%
[Figure: block structure A = (A_ij) with diagonal blocks A_11, ..., A_cc after clustering; relative error.]

15 Low rank vs. clustered low rank.
Low rank: A ≈ U Σ V^T.
Clustered low rank: A ≈ [U_1 0; 0 U_2] [S_11 S_12; S_21 S_22] [V_1 0; 0 V_2]^T.
Observe that diag(U_1, U_2, ..., U_c) has the same memory usage as U. Experiments show that most of U is contained in diag(U_1, U_2, ..., U_c), and similarly most of V in diag(V_1, V_2, ..., V_c).

16 Algorithm: Clustered low rank approximation.
Input: an m × m adjacency matrix A of a graph; number of clusters c.
Output: clustered low rank approximation of A.
1: Cluster the graph into c clusters.
2: Compute a low rank approximation of each diagonal block: A_ii ≈ U_i S_i V_i^T.
3: Extend the cluster-wise approximations into an approximation of the entire matrix A: S_ij = U_i^T A_ij V_j.
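
A compact sketch of the algorithm for the symmetric case (so V_i = U_i), assuming cluster labels 0, ..., c-1 are already computed (e.g., by GRACLUS or METIS) and each cluster has more than k vertices; this is illustrative, not the authors' implementation:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

def clustered_low_rank(A, labels, k):
    """Clustered low rank approximation A ≈ U S U^T (sketch).
    A: sparse symmetric adjacency matrix; labels: cluster id per vertex."""
    order = np.argsort(labels)                # group vertices by cluster
    A = A[order][:, order].tocsr()
    sizes = np.bincount(labels)
    blocks, start = [], 0
    for sz in sizes:
        Aii = A[start:start + sz, start:start + sz]
        # Step 2: rank-k SVD of the diagonal block (assumes sz > k)
        Ui, _, _ = svds(Aii.asfptype(), k=k)
        blocks.append(Ui)
        start += sz
    U = sp.block_diag(blocks, format="csr")   # block-diagonal basis
    # Step 3: S = U^T A U, i.e., S_ij = U_i^T A_ij U_j for all blocks
    S = (U.T @ (A @ U)).toarray()
    return U, S, order                        # A[order][:, order] ≈ U S U^T
```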

17 Approximations within each cluster.
Deterministic methods: truncated SVD using ARPACK, PROPACK, or SVDPACK.
Stochastic methods: Y = AΩ or Y = (AA^T)^q AΩ, where q is small (1, 2, or 3) and Ω is a random matrix (standard Gaussian). An approximation is obtained as
A ≈ Â = P_Y A = Y (Y^T Y)^{-1} Y^T A,
where P_Y is the orthogonal projection onto the range space of Y.
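
A minimal sketch of the stochastic method (the parameter names k for target rank and p for oversampling are my own): sample the range with a Gaussian test matrix, optionally apply q power iterations, then project:

```python
import numpy as np

def randomized_range_approx(A, k, p=5, q=1, seed=0):
    """Stochastic approximation (sketch): Y = (A A^T)^q A Ω with Gaussian Ω,
    then A ≈ P_Y A = Q Q^T A, where Q is an orthonormal basis of range(Y)."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + p))  # p = oversampling
    Y = A @ Omega
    for _ in range(q):                                # q is small: 1, 2, or 3
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    return Q, (A.T @ Q).T                             # factors of P_Y A = Q (Q^T A)
```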

18 Deterministic clustered error bound. For any partitioning A = (A_ij), i, j = 1, ..., c, and any Y^{(i)}, form Y = diag(Y^{(1)}, ..., Y^{(c)}); then P_Y = diag(P_{Y^{(1)}}, ..., P_{Y^{(c)}}), and P_Y and each P_{Y^{(i)}} are orthogonal projectors.
Theorem:
||(I - P_Y) A||_F^2 = \sum_{i,j=1}^c ||(I - P_{Y^{(i)}}) A_{ij}||_F^2.

19 Stochastic error bounds in the clustered setting.
Theorem:
E ||(I - P_Y) A||_F^2 ≤ \sum_{i=1}^c (1 + k_i/(p_i - 1)) ||Σ_2^{(i)}||_F^2 + \sum_{i,j=1, i≠j}^c ||A_{ij}||_F^2,
E ||(I - P_Y) A||_2 ≤ \sum_{i=1}^c [ (1 + \sqrt{k_i/(p_i - 1)}) ||Σ_2^{(i)}||_2 + (e \sqrt{k_i + p_i} / p_i) ||Σ_2^{(i)}||_F ] + \sum_{i,j=1, i≠j}^c ||A_{ij}||_2,
where A_ii = U^{(i)} Σ^{(i)} (V^{(i)})^T = U^{(i)} [Σ_1^{(i)} 0; 0 Σ_2^{(i)}] [V_1^{(i)} V_2^{(i)}]^T.

20 LiveJournal graph. 3.8 million vertices; 65 million edges. Edges within clusters: 76.9%, 69.3%, 66.3%; the graph is not highly clusterable. Ranks are k = 20, 50, 100, 150, 200.
[Figures: relative error (%) vs. number of parameters (memory) for svds and for clustered approximations with 22, 61, and 117 clusters; and for svds, randalg, and randalgpow with and without clustering (117 clusters).]

21 Multi-scale Prediction

22 Hierarchical Clustering: arXiv data graph condmat (21,363 vertices, 182,628 edges). Networks exhibit a hierarchical structure.

23 Hierarchical Approach.
Level 0: A.
Level 1: A^{(1)}_{11}, A^{(1)}_{22}.
Level 2: A^{(2)}_{11}, A^{(2)}_{22}, A^{(2)}_{33}, A^{(2)}_{44}.

24 Hierarchical Approach.
Level 0: A, with basis U.
Level 1: A^{(1)}_{11} with U^{(1)}_1; A^{(1)}_{22} with U^{(1)}_2.
Level 2: A^{(2)}_{11} with U^{(2)}_1; A^{(2)}_{22} with U^{(2)}_2; A^{(2)}_{33} with U^{(2)}_3; A^{(2)}_{44} with U^{(2)}_4.

25 Hierarchical Approach (same cluster tree as above). A has 2 clustered low rank approximations.
Level 1: A ≈ A^{(1)} = [U^{(1)}_1 0; 0 U^{(1)}_2] [S^{(1)}_{11} S^{(1)}_{12}; S^{(1)}_{21} S^{(1)}_{22}] [U^{(1)}_1 0; 0 U^{(1)}_2]^T.

26 Hierarchical Approach (same cluster tree as above). A has 2 clustered low rank approximations.
Level 2: A ≈ A^{(2)} = diag(U^{(2)}_1, U^{(2)}_2, U^{(2)}_3, U^{(2)}_4) [S^{(2)}_{ij}]_{i,j=1}^{4} diag(U^{(2)}_1, U^{(2)}_2, U^{(2)}_3, U^{(2)}_4)^T.

27 SVD of parent node. Parent A^{(P)} with basis U^{(P)}; children A^{(C)}_{11} with U^{(C)}_1 and A^{(C)}_{22} with U^{(C)}_2.
Q: Given U^{(C)}_1 and U^{(C)}_2, each of rank k, how do we compute U^{(P)}?
Alternative 1: U^{(P)} = diag(U^{(C)}_1, U^{(C)}_2). But then U^{(P)} has rank 2k.
Alternative 2: U^{(P)} comes from a truncated rank-k SVD of A^{(P)}. But this is computationally slow if U^{(P)} is computed from scratch.

28 Closeness to SVD: Principal Angles. arXiv data graph AstroPh: 17,903 nodes, 394,003 edges.
[Figure: cosines of the principal angles between svd(A^{(P)}) and span(diag(U^{(C)}_1, U^{(C)}_2)), plotted for each basis vector.]

29 Fast approximation of SVD of parent node.
Input: A = A^{(P)}, Ω = diag(U^{(C)}_1, U^{(C)}_2), and integer q.
Output: U^{(P)}.
1: Compute Y = (AA^T)^q AΩ.
2: Compute Q as an orthonormal basis for the range of Y, e.g., via the QR factorization Y = QR.
3: Compute B = Q^T A Q. (So A ≈ Q(Q^T A Q)Q^T.)
4: Compute the eigendecomposition B = V Λ V^T.
5: Compute U^{(P)} = QV.
This is just q steps of power (subspace) iteration followed by orthogonal projection onto the resulting subspace.
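
A sketch of this procedure, assuming a symmetric A^{(P)} (so B is symmetric) and the children's bases stacked block-diagonally as the test matrix Ω:

```python
import numpy as np
from scipy.linalg import block_diag

def parent_basis(A, U_children, q=1):
    """Approximate U^{(P)} by reusing the children's bases:
    Omega = diag(U_1^{(C)}, U_2^{(C)}), Y = (A A^T)^q A Omega,
    then project A onto range(Y) and diagonalize."""
    Omega = block_diag(*U_children)        # m x (sum of child ranks)
    Y = A @ Omega
    for _ in range(q):                     # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ (A @ Q)                      # A ≈ Q B Q^T
    lam, V = np.linalg.eigh(B)             # B is symmetric since A is
    return Q @ V, lam                      # columns of Q V approximate U^{(P)}
```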

30 Closeness to SVD: Principal Angles. arXiv data graph AstroPh: 17,903 nodes, 394,003 edges.
[Figure: cosines of the principal angles between svd(A^{(P)}) and U^{(P)}, for basis reuse with q = 1 and for random Ω with q = 1, 2, 3, 4, 5, plotted for each basis vector.]

31 Multi-Scale prediction. Compute f(A^{(0)}), f(A^{(1)}), f(A^{(2)}), ..., f(A^{(l)}).
Single-scale prediction: g(f(A^{(i)})) for one level i.
Multi-scale prediction: g(w_0 f(A^{(0)}) + w_1 f(A^{(1)}) + ... + w_l f(A^{(l)})), or combine the individual predictions g(f(A^{(i)})), i = 0, 1, ..., l.
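
One way to realize this with the Katz measure: because the clustered basis U is block-orthonormal (U^T U = I), f can be applied to the small core S instead of A. A sketch with illustrative weights (in practice the w_i would be tuned, e.g., on held-out edges):

```python
import numpy as np

def katz_from_clra(U, S, beta=0.005):
    """With A ≈ U S U^T and U^T U = I, the Woodbury identity gives
    (I - beta*A)^{-1} - I ≈ U [(I - beta*S)^{-1} - I] U^T."""
    I = np.eye(S.shape[0])
    M = np.linalg.inv(I - beta * S) - I
    return U @ M @ U.T        # dense; in practice score only candidate pairs

def multiscale_katz(levels, weights):
    """Weighted multi-scale combination of per-level Katz scores.
    levels: list of (U, S) factor pairs, one per hierarchy level."""
    return sum(w * katz_from_clra(U, S) for (U, S), w in zip(levels, weights))
```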

32 Link Inference. Karate-Club: 34 nodes, 78 edges.
[Figure: number of hits among the top-k predictions for CN, Katz, CLRA Katz, MultiScale Katz, and RandClust Katz.]

33 Link Inference. soc-Epinions: 32,223 nodes, 684,026 edges.
[Figure: false negative rate vs. false positive rate for CN, Katz (Lanczos), CLRA Katz, and MultiScale Katz.]

34 Link Prediction. LiveJournal: 1,757,326 nodes; T1: 84,366,676 edges; T2: 85,666,494 edges.
[Figure: false negative rate vs. false positive rate for CN, Katz (Lanczos), CLRA Katz, and HCLRA Katz.]

35 Parallelization

36 Parallelized Clustered Low Rank Approximation. ParMETIS partitions the graph in parallel; each process computes the SVDs of the diagonal blocks independently; heuristics balance the approximation of the off-diagonal blocks.
[Pipeline: matrix A → ParMETIS → graph reorganization → SVDs of diagonal blocks → off-diagonal matrix approximations → approximated matrix Â.]

37 Parallel Implementation. Lonestar (TACC) machine: 1,888 nodes, each with two Xeon 6-core 3.3 GHz processors and 40 GB/s of point-to-point bandwidth. MPI, ParMETIS 3.1.1, ARPACK, and GotoBLAS 1.30; one process is allocated per node. LiveJournal data: 3,828,682 vertices; 39,870,459 edges.

38 Parallel Implementation: Relations data. 9,746,232 vertices; 38,955,401 edges.

39 ParMETIS becomes the bottleneck. Breakdown of runtime: (a) LiveJournal, (b) Relations.

40 Summary and Conclusions. Structural analysis of massive social networks.
New method: clustered low rank approximation. It combines fast clustering with low rank approximation; it is memory-efficient (respects the sparsity of the graph); it improves the quality of approximation; and it is efficiently parallelizable. Applied to the computation of proximity measures and link prediction on large social networks.
Future work: hierarchical/multilevel approximation AND prediction; develop better (parallel) multilevel clustering for scale-free graphs.

41 References.
B. Savas and I. S. Dhillon. Clustered low rank approximation of graphs in information science applications. SIAM Conference on Data Mining (SDM), 2011.
H. H. Song, B. Savas, T. W. Cho, V. Dave, Z. Lu, I. S. Dhillon, Y. Zhang, and L. Qiu. Clustered embedding of massive online social networks. Working manuscript.
I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
E. Estrada and D. J. Higham. Network Properties Revealed through Matrix Functions. SIAM Review, 2010.
D. Easley and J. Kleinberg. Networks, Crowds, and Markets. Cambridge University Press, 2010.
