CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Similar documents
CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Jure Leskovec Joint work with Jaewon Yang, Julian McAuley

Graph Clustering Algorithms

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Groups of vertices and Core-periphery structure. By: Ralucca Gera, Applied math department, Naval Postgraduate School Monterey, CA, USA

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach

On methods to assess the significance of community structure in networks of financial time series.

1 Matrix notation and preliminaries from spectral graph theory

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Karnaugh Maps Objectives

1 Matrix notation and preliminaries from spectral graph theory

Nonparametric Bayesian Matrix Factorization for Assortative Networks

Communities, Spectral Clustering, and Random Walks

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Mathematics 2260H Geometry I: Euclidean geometry Trent University, Winter 2012 Quiz Solutions

Mathematics 2260H Geometry I: Euclidean geometry Trent University, Fall 2016 Solutions to the Quizzes

1 Mechanistic and generative models of network structure

Asymptotic Modularity of some Graph Classes

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Overlapping Communities

Project in Computational Game Theory: Communities in Social Networks

A GENERALIZATION OF THE FARKAS AND KRA PARTITION THEOREM FOR MODULUS 7

D B M G Data Base and Data Mining Group of Politecnico di Torino

CSC 261/461 Database Systems Lecture 13. Spring 2018

Class Note #20. In today s class, the following four concepts were introduced: decision

Strange Combinatorial Connections. Tom Trotter

Association Rules. Fundamentals

Kristina Lerman USC Information Sciences Institute

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

Cohesive Subgroups. October 4, Based on slides by Steve Borgatti. Modifications (many sensible) by Rich DeJordy

Data Mining Techniques

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

Computational Lower Bounds for Community Detection on Random Graphs

2 M13/5/MATME/SP2/ENG/TZ1/XX 3 M13/5/MATME/SP2/ENG/TZ1/XX Full marks are not necessarily awarded for a correct answer with no working. Answers must be

Modularity and Graph Algorithms

NAME: Mathematics 133, Fall 2013, Examination 3

Efficient Reassembling of Graphs, Part 1: The Linear Case

Geometry Problem Solving Drill 08: Congruent Triangles

Modeling Dynamic Evolution of Online Friendship Network

Lecture 12: Introduction to Spectral Graph Theory, Cheeger s inequality

A LINE GRAPH as a model of a social network

CS246 Final Exam. March 16, :30AM - 11:30AM

Advanced Digital Design with the Verilog HDL, Second Edition Michael D. Ciletti Prentice Hall, Pearson Education, 2011

A Multiobjective GO based Approach to Protein Complex Detection

2 k, 2 k r and 2 k-p Factorial Designs

Machine Learning for Data Science (CS4786) Lecture 11

CS224W: Social and Information Network Analysis

Design theory for relational databases

Topics in Theoretical Computer Science April 08, Lecture 8

The Trouble with Community Detection

2, find c in terms of k. x

triangles in neutral geometry three theorems of measurement

9. Submodular function optimization

Team Solutions. November 19, qr = (m + 2)(m 2)

CS 484 Data Mining. Association Rule Mining 2

Chapter 11. Matrix Algorithms and Graph Partitioning. M. E. J. Newman. June 10, M. E. J. Newman Chapter 11 June 10, / 43

Networks: Lectures 9 & 10 Random graphs

Similarity Measures for Link Prediction Using Power Law Degree Distribution

LEXLIKE SEQUENCES. Jeffrey Mermin. Abstract: We study sets of monomials that have the same Hilbert function growth as initial lexicographic segments.

ORIE 4741: Learning with Big Messy Data. Spectral Graph Theory

Express g(x) in the form f(x) + ln a, where a (4)

On Defining and Computing Communities

ECS 253 / MAE 253, Lecture 15 May 17, I. Probability generating function recap

Individual Round CHMMC November 20, 2016

Chapter 5-2: Clustering

CS168: The Modern Algorithmic Toolbox Lectures #11 and #12: Spectral Graph Theory

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS 584 Data Mining. Association Rule Mining 2

Subgraph Detection Using Eigenvector L1 Norms

Express g(x) in the form f(x) + ln a, where a (4)

19. Blocking & confounding

Lecture 13: Spectral Graph Theory

CSE 140: Components and Design Techniques for Digital Systems

Suppose we needed four batches of formaldehyde, and coulddoonly4runsperbatch. Thisisthena2 4 factorial in 2 2 blocks.

Learning latent structure in complex networks

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Using Local Spectral Methods in Theory and in Practice

IE 361 Exam 3 (Form A)

Social and Technological Network Analysis. Lecture 4: Modularity and Overlapping Dr. Cecilia Mascolo

Scientific Computing with Case Studies SIAM Press, Lecture Notes for Unit VII Sparse Matrix

Lecture 4: Four Input K-Maps

Regression Clustering

Triangles. 3.In the following fig. AB = AC and BD = DC, then ADC = (A) 60 (B) 120 (C) 90 (D) none 4.In the Fig. given below, find Z.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS 6820 Fall 2014 Lectures, October 3-20, 2014

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Lecture 12 : Graph Laplacians and Cheeger s Inequality

Protein Complex Identification by Supervised Graph Clustering

nx + 1 = (n + 1)x 13(n + 1) and nx = (n + 1)x + 27(n + 1).

CS 5014: Research Methods in Computer Science. Experimental Design. Potential Pitfalls. One-Factor (Again) Clifford A. Shaffer.

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

UNCC 2001 Comprehensive, Solutions

1 Non-deterministic Turing Machine

Data Mining Concepts & Techniques

SMALL-WORLD NAVIGABILITY. Alexandru Seminar in Distributed Computing

Transcription:

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

Non-overlapping vs. overlapping communities 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

[Palla et al., 05] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3 A node belongs to many social circles

11/16/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 4

[Palla et al., 05] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5 Two nodes belong to the same community if they can be connected through adjacent k-cliques: k-clique: Fully connected graph on k nodes Adjacent k-cliques: overlap in k-1 nodes k-clique community Set of nodes that can be reached through a sequence of adjacent k-cliques 3-clique adjacent 3-cliques

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6 Two nodes belong to the same community if they can be connected through adjacent k- cliques: 4-clique

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7 Clique Percolation Method: Find maximal-cliques (not k-cliques!) Clique overlap graph: Each clique is a node Connect two cliques if they overlap in at least k-1 nodes Communities: Connected components of the clique overlap matrix How to set k? Cliques Set k so that we get the richest (most widely distributed cluster sizes) community structure B B A A C C D Communities D

Start with graph Find maximal cliques Create clique overlap matrix Threshold the matrix at value k-1 If a ij <k-1 set 0 Communities are the connected components of the thresholded matrix (1) Graph (2) Clique overlap matrix (3) Thresholded matrix at 3 (4) Communities (connected components) 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8

[Palla et al., 07] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9 Communities in a tiny part of a phone call network of 4 million users [Palla et al., 07]

[Farkas et. al. 07] 11/16/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10

No nice way, NP-hard combinatorial problem Maximal clique: clique that can t be extended {a,b,c} is a clique but not maximal clique {a,b,c,d} is maximal clique Algorithm: Sketch Start with a seed node Expand the clique around the seed Once the clique cannot be further expanded we found the maximal clique Note: This will generate the same clique multiple times 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11

Start with a seed vertex a Goal: Find the maximal clique Q a belongs to Observation: If some x belongs to Q then it is a member of a Why? If a,x Q but not a x, then Q is not a clique! Recursive algorithm: Q current clique R candidate vertices to expand the clique to Example: Start with a and expand around it Q= {a} {a,b} {a,b,c} bktrack {a,b,d} R= {b,c,d} {b,c,d} {d} Γ(c)={} {c} Γ(d)={} Γ(b)={c,d} Steps of the recursive algorithm Γ(u) neighbor set of u 11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12

Q current clique R candidate vertices Expand(R,Q) while R {} p = vertex in R Q p = Q {p} R p = R Γ(p) if R p {}: Expand(R p, Q p ) else: output Q p R = R {p} Start: Expand(V, {}) R={a, f}, Q={} p = {a} Q p = {a} R p = {b,d} Expand(R p, Q): R = {b,d}, Q={a} p = {b} Q p = {a,b} R p = {d} Expand(R p, Q): R = {d}, Q={a,b} p = {d} Q p = {a,b,d} R p = {} : output {a,b,d} p = {d} Q p = {a,d} R p = {b} Expand(R p, Q): R = {b}, Q={a,d} p = {b} Q p = {a,d} R p = {} : output {a,d,b} 11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13

Q current clique R candidate vertices Expand(R,Q) while R {} p = vertex in R Q p = Q {p} R p = R Γ(p) if R p {}: Expand(R p, Q p ) else: output Q p R = R {p} Start: Expand(V, {}) R={a, f}, Q={} p = {b} Q p = {b} R p = {a,c,d} Expand(R p, Q): R = {a,c,d}, Q={b} p = {a} Q p = {b,a} R p = {d} Expand(R p, Q): R = {d}, Q={b,a} p = {d} Q p = {b,a,d} R p = {} : output {b,a,d} p = {c} Q p = {b,c} R p = {d} Expand(R p, Q): R = {d}, Q={b,c} p = {d} Q p = {b,c,d} R p = {} : output {b,c,d} 11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 14

How to prevent maximal cliques to be generated multiple times? Only output cliques that are lexicographically minimum {a,b,c} < {b,a,c} Even better: Only expand to the nodes higher in the lexicographical order Start: Expand(V, {}) R={a, f}, Q={} p = {a} Q p = {a} R p = {b,d} Expand(R p, Q): R = {b,d}, Q={a} p = {b} Q p = {a,b} R p = {d} Expand(R p, Q): R = {d}, Q={a,b} p = {d} Q p = {a,b,d} R p = {} : output {a,b,d} p = {d} Q p = {a,d} R p = {b} Don t expand b < d 11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 15

11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 16

Let s rethink what we are doing Given a network Want to find communities! Need to: Formalize the notion of a community Need an algorithm that will find sets of nodes that are good communities More generally: How to think about clusters in large networks? 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 17

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18 What is a good cluster? Many edges internally Few pointing outside Formally, conductance: S S Where: A(S).volume Small Φ(S) corresponds to good clusters

19 How community like is a set of nodes? A good cluster S has Many edges internally Few edges pointing outside Simplest objective function: Conductance {( i, j) E; i S, j S} φ( S) = s S d s Small conductance corresponds to good clusters S S

[WWW 08] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20 Define: Network community profile (NCP) plot Plot the score of best community of size k k=5 k=7 k=10 log Φ(k) Community size, log k

Cluster score, log Φ(k) Run the favorite clustering method Each dot represents a cluster For each size find best cluster Spectral Graclus Metis Cluster size, log k 11/12/2009 Jure Leskovec, Stanford CS322: Network Analysis 21

11/10/2010 [WWW 08] Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 22 Meshes, grids, dense random graphs: d-dimensional meshes California road network

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 08] 23 Collaborations between scientists in networks [Newman, 2005] Conductance, log Φ(k) Community size, log k

[Internet Mathematics 09] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24 Natural hypothesis about NCP: NCP of real networks slopes downward Slope of the NCP corresponds to the dimensionality of the network What about large networks?

[Internet Mathematics 09] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25 Typical example: General Relativity collaborations (n=4,158, m=13,422)

[Internet Mathematics 09] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26

Better and better clusters Φ(k), (score) Clusters get worse and worse Best cluster has ~100 nodes k, (cluster size) 27

As clusters grow the number of edges inside grows slower that the number crossing Φ=1/7=0.14 Φ=2/10 = 0.2 Φ=8/20 = 0.4 Φ=64/92 = 0.69 Each node has twice as many children 28

29 Empirically we note that best clusters are barely connected to the network NCP plot Core-periphery structure

Nothing happens! Nestedness of the core-periphery structure 30

31 Denser and denser core of the network Core contains 60% node and 80% edges Whiskers are responsible for good communities Nested Core-Periphery (jellyfish, octopus)

11/17/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 33 Some issues with community detection: Many different formalizations of clustering objective functions Objectives are NP-hard to optimize exactly Methods can find clusters that are systematically biased Questions: How well do algorithms optimize objectives? What clusters do different methods find?

[WWW 09] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34 Single-criterion: Modularity: m-e(m) Edges cut: c Multi-criterion: Conductance: c/(2m+c) Expansion: c/n Density: 1-m/n 2 CutRatio: c/n(n-n) Normalized Cut: c/(2m+c) + c/2(m-m)+c n: nodes in S m: edges in S c: edges pointing outside S Flake-ODF: frac. of nodes with more than ½ edges pointing outside S S

[WWW 09] 11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 35 Many algorithms to that implicitly or explicitly optimize objectives and extract communities: Heuristics: Girvan-Newman, Modularity optimization: popular heuristics Metis: multi-resolution heuristic [Karypis-Kumar 98] Theoretical approximation algorithms: Spectral partitioning

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 09] 36 Spectral Metis LiveJournal

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 09] 37 500 node communities from Spectral: 500 node communities from Metis:

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 09] 38 Conductance of bounding cut Diameter of the cluster Spectral Connected Metis Disconnected Metis External/Internal conductance Metis (red) gives sets with better conductance Spectral (blue) gives tighter and more well-rounded sets Lower is good

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 09] 39 All qualitatively similar Observations: Conductance, Expansion, Normcut, Cut-ratio are similar Flake-ODF prefers larger clusters Density is bad Cut-ratio has high variance

11/10/2010 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu [WWW 09] 40 Observations: All measures are monotonic Modularity prefers large clusters Ignores small clusters